Changes in tables from Unicode 5.0 to 5.1

Kenneth Whistler kenw at sybase.com
Wed Mar 19 19:50:00 CET 2008


> At 7:15 AM +0100 3/19/08, Patrik Fältström wrote:
> >I have checked what changes we will get when we go from Unicode 5.0 
> >to 5.1 (given the version of 5.1 that existed last Friday on the 
> >Unicode site):
> >
> >1. There is one codepoint that is DISALLOWED in 5.0 and PVALID in 5.1
> >
> >In 5.0:
> >02EC;MODIFIER LETTER VOICING;Sk;0;ON;;;;;N;;;;;
> >
> >In 5.1:
> >02EC;MODIFIER LETTER VOICING;Lm;0;ON;;;;;N;;;;;
> >
> >The reason is that its General_Category changes from Sk to Lm. 
> >This in turn moves the codepoint from not being in any of the 
> >categories in IDNA200X to being in Category A.
> 
> That seems of some concern. This seems to be a character that we 
> would not want in IDNA200x.

Why not? Base the RFC on Unicode 5.1, as we have been suggesting,
and the issue goes away.

> Can people who understand this character 
> comment on it? Like, why was the category changed?

Sure.

The modifier letters U+02C6..U+02CF have long been gc=Lm
(and hence included in identifiers), as a result of their
known use for tone marks in various orthographies of East
and Southeast Asia and of Africa.

In 2006, Lorna Priest of SIL submitted a proposal to encode
a MODIFIER LETTER LOW CIRCUMFLEX ACCENT (L2/06-244), demonstrating
its use in orthographies for Akha and Lahu (languages used
in Southeast Asia). 

But in the context of that document, she also demonstrated
that the existing modifier letter U+02EC MODIFIER LETTER VOICING
is part of these orthographies. The change to gc=Lm for that
character was made so that its use and treatment in identifiers
is consistent with U+02C6..U+02CF.
The reason it was originally designated gc=Sk rather than gc=Lm
was that its primary source was as an IPA diacritic
for voicing. Such IPA diacritics aren't normally parts of
language orthographies, unlike the tone marks, and so they
get gc=Sk. The discovery of the use of the same character
as part of a significant language orthography pushed the
case to the other side, and the general category was changed
to gc=Lm for Unicode 5.1.
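
As a rough illustration of the effect described above, here is a
minimal Python sketch. It reads the General_Category field from the
two UnicodeData.txt records quoted in this thread and applies a
simplified status rule. The set LETTER_DIGIT_CATEGORIES and the
function names are assumptions made for illustration only; they
approximate "Category A" and are not the actual IDNA200X table
derivation. The last helper relies on whatever Unicode version the
Python interpreter bundles, assumed here to be 5.1 or later.

```python
import unicodedata

# The two UnicodeData.txt records for U+02EC quoted in this thread.
RECORD_5_0 = "02EC;MODIFIER LETTER VOICING;Sk;0;ON;;;;;N;;;;;"
RECORD_5_1 = "02EC;MODIFIER LETTER VOICING;Lm;0;ON;;;;;N;;;;;"

# Assumption: a simplified stand-in for the letter/digit categories
# ("Category A"); the real draft has further rules and exceptions.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Lm", "Mn", "Mc", "Nd"}


def general_category(record):
    """Return the General_Category (third field) of a UnicodeData.txt record."""
    return record.split(";")[2]


def simplified_status(record):
    """Hypothetical status rule: PVALID if the category is letter-like."""
    if general_category(record) in LETTER_DIGIT_CATEGORIES:
        return "PVALID"
    return "DISALLOWED"


def tone_mark_categories():
    """Collect the categories of U+02C6..U+02CF from the bundled Unicode data."""
    return {unicodedata.category(chr(cp)) for cp in range(0x02C6, 0x02D0)}


if __name__ == "__main__":
    print("5.0:", general_category(RECORD_5_0), simplified_status(RECORD_5_0))
    print("5.1:", general_category(RECORD_5_1), simplified_status(RECORD_5_1))
    print("U+02C6..U+02CF:", tone_mark_categories())
    print("U+02EC today:", unicodedata.category("\u02ec"))
```

Under these assumptions the 5.0 record comes out as DISALLOWED and
the 5.1 record as PVALID, which is the single difference Patrik
reported.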

--Ken

> 
> >Because of this, the codepoint would have to be added to 
> >Category G if this draft had been published as an RFC:
> >
> >    Category G - Backward compatibility
> 
> Glad we have that, yes.


