Changes in tables from Unicode 5.0 to 5.1
Kenneth Whistler
kenw at sybase.com
Wed Mar 19 19:50:00 CET 2008
> At 7:15 AM +0100 3/19/08, Patrik Fältström wrote:
> >I have checked what changes we will get when we go from Unicode 5.0
> >to 5.1 (given version of 5.1 that existed last Friday on the Unicode
> >Site):
> >
> >1. There is one codepoint that is DISALLOWED in 5.0 and PVALID in 5.1
> >
> >In 5.0:
> >02EC;MODIFIER LETTER VOICING;Sk;0;ON;;;;;N;;;;;
> >
> >In 5.1:
> >02EC;MODIFIER LETTER VOICING;Lm;0;ON;;;;;N;;;;;
> >
> >The reason is that its General Category changes from Sk to Lm.
> >This in turn moves the codepoint from not being in any of the
> >categories in IDNA200X to being in Category A.
>
> That seems of some concern. This seems to be a character that we
> would not want in IDNA200x.
Why not? Base the RFC on Unicode 5.1, as we have been suggesting,
and the issue goes away.
> Can people who understand this character
> comment on it? Like, why was the category changed?
Sure.
The modifier letters U+02C6..U+02CF have long been gc=Lm
(and hence included in identifiers), as a result of their
known use for tone marks in various orthographies of East
and Southeast Asia and of Africa.
In 2006, Lorna Priest of SIL submitted a proposal to encode
a MODIFIER LETTER LOW CIRCUMFLEX ACCENT (L2/06-244), demonstrating
its use in orthographies for Akha and Lahu (languages used
in Southeast Asia).
But in the context of that document, she also demonstrated
that the existing modifier letter U+02EC MODIFIER LETTER VOICING
was also part of these orthographies. And the change
to gc=Lm for that character was to make its use and
treatment in identifiers consistent with U+02C6..U+02CF.
The reason it was originally designated gc=Sk rather than gc=Lm
was that its primary source was as an IPA diacritic
for voicing. Such IPA diacritics aren't normally parts of
language orthographies, unlike the tone marks, and so they
get gc=Sk. The discovery of the use of the same character
as part of a significant language orthography pushed the
case to the other side, and the general category was changed
to gc=Lm for Unicode 5.1.
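For anyone who wants to confirm the scope of the change mechanically, here is a minimal Python sketch that diffs the two UnicodeData.txt records quoted earlier in this thread. The helper names (parse_record, changed_fields) are made up for illustration and are not part of any Unicode or IDNA tooling; the only inputs are the two lines Patrik posted.

```python
# The two UnicodeData.txt records for U+02EC, as quoted in this thread.
LINE_5_0 = "02EC;MODIFIER LETTER VOICING;Sk;0;ON;;;;;N;;;;;"
LINE_5_1 = "02EC;MODIFIER LETTER VOICING;Lm;0;ON;;;;;N;;;;;"


def parse_record(line):
    """Split a UnicodeData.txt record and pull out the first three fields.

    Field 0 is the code point (hex), field 1 the name, field 2 the
    General Category. Hypothetical helper, for illustration only.
    """
    fields = line.split(";")
    return {
        "codepoint": int(fields[0], 16),
        "name": fields[1],
        "gc": fields[2],
    }


def changed_fields(old_line, new_line):
    """Return (field_index, old_value, new_value) for every differing field."""
    old_fields = old_line.split(";")
    new_fields = new_line.split(";")
    return [
        (index, old, new)
        for index, (old, new) in enumerate(zip(old_fields, new_fields))
        if old != new
    ]


if __name__ == "__main__":
    record = parse_record(LINE_5_1)
    print("U+%04X %s" % (record["codepoint"], record["name"]))
    for index, old, new in changed_fields(LINE_5_0, LINE_5_1):
        print("field %d: %s -> %s" % (index, old, new))
```

The only field that differs between the two records is field 2, the General Category, so any change in derived IDNA status for this code point follows from that one property.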
--Ken
>
> >Because of this, the codepoint would have to be added to
> >category G IF this draft had been posted as an RFC:
> >
> > Category G - Backward compatibility
>
> Glad we have that, yes.