Unicode 5.2 -> 6.0

Thu Oct 14 23:17:34 CEST 2010

Patrik,

> We will shortly see Unicode 6.0 released. 

Actually, it already *has* been released, as of Monday,
October 11:

http://www.unicode.org/versions/Unicode6.0.0/

See below for my response to your suggestion about
what the action of the IETF should be for the
three property changes that lead to incompatibilities
in the derivation of the table for IDNA2008.

> There are incompatible changes in three codepoints:
> 
> 1. The following two that go to PVALID from DISALLOWED:
> 
> U+0CF1 KANNADA SIGN JIHVAMULIYA
> U+0CF2 KANNADA SIGN UPADHMANIYA
> 
> This because they go from General Category So to Lo.
> 
> 2. This moves from PVALID to DISALLOWED:
> 
> U+19DA NEW TAI LUE THAM DIGIT ONE
> 
> It has changed GeneralCategory from Nd to No.

> There are two alternatives for the IETF:
> 
> A) Accept the change and stay aligned with Unicode
> 
> The changes made are all "bugs" in the tables that are resolved.
> The most troublesome of the three codepoints would be U+19DA 
> as that goes from PVALID to DISALLOWED, as that potentially
>  would make domain names registered with that codepoint be invalid.
> 
> B) Add these three as exceptions for backward compatibility.
> 
> One can add the three (or subset thereof) to section 2.7 in 
> an updated version of RFC 5892:
> 
> > 2.7.  BackwardCompatible (G)
> > 
> >    G: cp is in {}
> 
> This set is in RFC 5892 empty, but characters can be added.
> Characters with explicit derived property value. This would
> require an IETF action.

> My personal suggestion is that if no one can show that domain
> names are in fact registered or used with U+19DA according to 
> IDNA2008, IETF should accept the incompatible changes, and 
> stay completely aligned with Unicode 6.0.

I would parse the situation somewhat differently than you
have, and advise a different response than that which
you have advocated (and which Markus Scherer has agreed with).

The options are:

A) Add nothing to Section 2.7 BackwardCompatible (G)

B) Add all 3 characters to Section 2.7 BackwardCompatible (G),
   to wit:

   G: cp is in {0CF1, 0CF2, 19DA}

   0CF1; DISALLOWED   # KANNADA SIGN JIHVAMULIYA
   0CF2; DISALLOWED   # KANNADA SIGN UPADHMANIYA
   19DA; PVALID       # NEW TAI LUE THAM DIGIT ONE

C) Add only the *bad* transition character to Section 2.7
   BackwardCompatible (G), to wit:

   G: cp is in {19DA}

   19DA; PVALID       # NEW TAI LUE THAM DIGIT ONE
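
To make the effect of such an exception concrete, here is a minimal
sketch of a BackwardCompatible (G) entry taking precedence over the
General_Category-based derivation. It is deliberately simplified:
only the BackwardCompatible and LetterDigits rules of RFC 5892 are
modeled, the function name and the dictionary layout are my own
illustration rather than anything in the RFC, and it assumes a
Python whose unicodedata module reflects Unicode 6.0 or later.

```python
import unicodedata

# General_Category values in the LetterDigits (A) rule of RFC 5892.
LETTER_DIGITS = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def derived_property(cp, backward_compatible=None):
    """Simplified derivation: BackwardCompatible (G), then LetterDigits (A).

    All other RFC 5892 rules (Exceptions, Unstable, and so on) are
    omitted, so this is only meaningful for the code points discussed here.
    """
    overrides = backward_compatible or {}
    if cp in overrides:
        # An explicit entry in G wins over anything derived from properties.
        return overrides[cp]
    if unicodedata.category(chr(cp)) in LETTER_DIGITS:
        return "PVALID"
    return "DISALLOWED"


# Option C: only the bad transition is pinned to its Unicode 5.2 value.
OPTION_C = {0x19DA: "PVALID"}

for cp in (0x0CF1, 0x0CF2, 0x19DA):
    # Columns: code point, value with G empty, value with Option C applied.
    print(f"{cp:04X}", derived_property(cp), derived_property(cp, OPTION_C))
```

With G left empty, 19DA falls through to DISALLOWED; with the single
entry from Option C it stays PVALID, and the two Kannada signs are
unaffected either way.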

Option A, which you and Markus have advocated, is the least
work for the IETF, obviously, and involves a tacit acceptance
of the advisability of the property changes. It could be
characterized, as you have, as "staying aligned with Unicode",
but terming it that way somewhat misrepresents the situation.

First, it puts IDNA2008 into precisely the kind of unstable
state that Section 2.7 was written to avoid. Even if there
are no IDN registrations anywhere involving U+19DA, so no
harm would seem to be done by just letting it quietly slide
into DISALLOWED status, the *implementers* of the library
code will notice. An implementation based on the 5.2 table
will run afoul of the rules when it attempts to upgrade
to the 6.0 table, for precisely this one character, U+19DA,
that changes in a way it isn't supposed to.

Second, I think the concept of "staying aligned with Unicode"
here is improperly fetishizing the General_Category
property, which we have been reminding everyone all along
does not have full stability guarantees, while ignoring the
*stable* Unicode Character Database properties associated
with identifiers, which *do* have the appropriate stability
guarantees. In particular, U+19DA for Unicode 6.0
was very carefully also given the Other_ID_Continue property,
so that the derived and *stable* identifier-related
properties ID_Continue and XID_Continue stayed unchanged
for this character, as designed. I think if the maintainers
of IDNA2008 wish for the table used in the protocol to
remain stable, they need to pay attention to the behavior
of the designated *stable* properties in the Unicode Character 
Database.
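
The difference between the two kinds of property can be observed
directly. The following is a small sketch, assuming a Python 3
interpreter whose Unicode data is at version 6.0 or later; it relies
on str.isidentifier(), which follows the XID_Start and XID_Continue
properties, as a stand-in for looking up XID_Continue in the UCD
files themselves.

```python
import unicodedata

ch = "\u19da"  # NEW TAI LUE THAM DIGIT ONE

# General_Category carries no stability guarantee: this code point was
# Nd in Unicode 5.2 and is No from Unicode 6.0 on.
print(unicodedata.category(ch))

# Identifier status is stable: a letter followed by U+19DA is still
# accepted as an identifier, because the character kept XID_Continue.
print(("a" + ch).isidentifier())
```

The first value changed between versions; the second did not, and by
design cannot.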

Option B is the option which would keep the table values
for all 3 characters in question absolutely unchanged as
the table listing is updated from Unicode 5.2 to Unicode 6.0.
However, I think that would be an over-reaction to the
changes. The transition of a Unicode code point from
DISALLOWED to PVALID in the table is a normal, allowable
transition expected as new characters are added to the
repertoire. It might also result from specific requests
based on local examination of the needs for IDNs in a
particular script, for example. And it is entirely consistent
with the intended impact of the property change for
0CF1 and 0CF2, which also made them part of the Unicode
identifiers, per the UCD property values. As noted above,
the *bad* transition is for 19DA PVALID --> DISALLOWED.
That is the one for which an exception clearly *should* be
made.
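
The asymmetry between the two directions is easy to check
mechanically when a table is regenerated. A rough sketch follows; the
two dictionaries hold only the three values discussed in this thread,
and the function name is hypothetical, not part of any existing tool.

```python
# Derived property values for the three code points, as listed in this
# thread. TABLE_6_0 is what the derivation yields with G left empty.
TABLE_5_2 = {0x0CF1: "DISALLOWED", 0x0CF2: "DISALLOWED", 0x19DA: "PVALID"}
TABLE_6_0 = {0x0CF1: "PVALID", 0x0CF2: "PVALID", 0x19DA: "DISALLOWED"}


def bad_transitions(old, new):
    """Return code points that were PVALID in `old` but are not in `new`.

    Code points missing from `new` are treated as unchanged.
    DISALLOWED -> PVALID is a normal transition and is not reported.
    """
    return sorted(
        cp for cp, value in old.items()
        if value == "PVALID" and new.get(cp, value) != "PVALID"
    )


for cp in bad_transitions(TABLE_5_2, TABLE_6_0):
    print(f"{cp:04X} needs a BackwardCompatible (G) entry")
```

Run against the two tables above, only 19DA is reported; 0CF1 and
0CF2 pass, since their transition is in the allowed direction.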

Option C is therefore the correct option, in my opinion.
Adding the exception for 19DA in Section 2.7 G to keep
the IDNA2008 table stable would correctly reflect the
reason for having Section 2.7 G in the RFC in the first
place. This is *also* the correct way for IDNA2008
to "stay aligned with Unicode", because it would be keeping
the *exception* list for IDNA label-related table derivation precisely
aligned with the Unicode *exception* list for identifier
derivation. *That* is the alignment you are actually in
need of, and now is the time to bite the bullet and
proceed to maintain IDNA2008 the way it should be maintained
for this kind of issue. Failing to establish the correct
precedent now will just drag IDNA2008 into a permanent
state of niggling onesy-twosy character incompatibilities
that will *never* go away.

--Ken