NEW TAI LUE THAM DIGIT ONE and a mapping layer (was: Re: I-D Action:draft-faltstrom-5892bis-02.txt)

John C Klensin klensin at jck.com
Tue Feb 22 15:14:53 CET 2011



--On Tuesday, February 22, 2011 09:33 +0100 Simon Josefsson
<simon at josefsson.org> wrote:

>...
> I don't understand what you mean here.  To me it will be a lot
> of work if we _don't_ add the exception -- applications need a
> mapping layer to deal with the newly introduced backwards
> incompatibility in IDNA2008.

Simon, I don't see how you come to this conclusion.  There is no
mapping or mapping layer involved and it is not clear to me how
a mapping layer would help anyway.  This is the one place where
the "edge case" argument is relevant because there is no
evidence of a single label having been registered that uses the
character that lies in the middle of this controversy (I'm sure
someone can rush out and register one to prove a point, but that
isn't the issue).  

Let's consider the cases:

-- Someone is trying to register the character.  If the registry
and registrar are using Unicode 5.2-based tables (i.e., RFC 5892
without changes and 5.2 properties), U+0CF1 and U+0CF2 (and
several hundred UNASSIGNED characters) cannot be registered;
U+19DA can be.  But, if the would-be registrant or
registry/registrar understands the principles behind the IDNA2008
rule structure and the New Tai Lue script, it won't be accepted
for registration because U+19DA isn't really a digit -- anything
else would be counterintuitive.

Remember that the expectation of both IDNA2008 and, fwiw, ICANN
is that registries will make tables of what they will actually
register that are a subset of what IDNA2008 itself permits and
will do so based on an understanding of the languages and
scripts involved.  If a registry needed U+0CF1 or U+0CF2, I
assume that both we and Unicode would have been hearing screams
of protest about the misclassification.  But, if it were
considering registrations in New Tai Lue, a non-digit would
presumably just not end up on its list of registerable
characters, whether Unicode (temporarily) believed it was a
digit or not.

Put differently, getting a label with U+19DA registered would
have required a registry/registrar failure _in addition to_ the
Unicode mistake.

If the registry and registrar are using Unicode 6.0-based rules
with 5892bis-03 (i.e., RFC 5892 without changes and 6.0
properties) they can now register U+0CF1 and U+0CF2 plus a bunch
of newly-assigned characters.  And they cannot register U+19DA
which they had no business registering in the first place.

-- Now someone has somehow gotten hold of a string that they
think is a label and that contains U+19DA.  Either they won't
look it up (because it is now DISALLOWED) or they will look it
up and not find it (because no one registered it).  Either way,
the user isn't going to find any data.
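
As a rough illustration of the cases above, here is a minimal
Python sketch.  It is not an IDNA implementation: the table
covers only the three code points discussed in this thread, with
the derived property values as described here, and the names
DERIVED, protocol_allows, registry_accepts and lookup are
invented for the example.

```python
# Derived IDNA2008 property of the three code points under discussion,
# keyed by the Unicode version the tables were built from.  Toy data:
# a real implementation derives these from the full RFC 5892 rules.
DERIVED = {
    "5.2": {0x0CF1: "DISALLOWED", 0x0CF2: "DISALLOWED", 0x19DA: "PVALID"},
    "6.0": {0x0CF1: "PVALID", 0x0CF2: "PVALID", 0x19DA: "DISALLOWED"},
}


def protocol_allows(label, version):
    """True if every code point in label is PVALID under the given tables.

    Code points missing from the toy table are treated as not allowed.
    """
    table = DERIVED[version]
    return all(table.get(ord(ch)) == "PVALID" for ch in label)


def registry_accepts(label, version, repertoire):
    """A registry registers only a subset of what the protocol permits.

    repertoire is the (hypothetical) set of code points the registry
    chose after studying the script in question.
    """
    return protocol_allows(label, version) and all(
        ord(ch) in repertoire for ch in label
    )


def lookup(label, version, zone):
    """Return the data for label, or None if it is not looked up or not found."""
    if not protocol_allows(label, version):
        return None  # rejected before any query is made
    return zone.get(label)  # None when nobody registered the label


tham_one = "\u19da"
informed_repertoire = set()   # a registry that knows U+19DA is not a digit
empty_zone = {}               # no label with U+19DA was ever registered

for version in ("5.2", "6.0"):
    print(version,
          protocol_allows(tham_one, version),
          registry_accepts(tham_one, version, informed_repertoire),
          lookup(tham_one, version, empty_zone))
```

Under either set of tables the lookup yields nothing; the only
difference is whether the label is rejected before the query or
simply not found.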

The _only_ case in which there is a problem is if a registry
--via the naive error of registering a string in a script that
they didn't understand or through some strange form of malice--
actually registered a string containing this character that was
associated with valid data.   I don't see it as useful to go out
of our way to support that case -- if Unicode really made an
error, I'd rather have IDNA reflect the correction and make
things as clear as possible to all concerned, thereby reducing
the odds of those naive registration errors in the future.
YMMV, but it seems to me that is the issue we have here, not
some theory about stability.  

In any event, mapping doesn't help: one would have to map U+19DA
to a character that was/remained PVALID to somehow preserve the
impression that it was PVALID.  And, because the target would be
a different character, that decision would almost certainly
introduce even more harm.
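
To make the mapping point concrete, here is a small Python
sketch.  The choice of U+19D1 (NEW TAI LUE DIGIT ONE) as the
mapping target is purely an assumption for illustration, as are
the names HYPOTHETICAL_MAP and apply_mapping and the contents of
the toy zone; nobody in this thread has proposed that mapping.

```python
# Hypothetical mapping layer: rewrite U+19DA to a code point that is
# still PVALID.  The target U+19D1 is an assumption for illustration.
HYPOTHETICAL_MAP = {0x19DA: 0x19D1}


def apply_mapping(label):
    """Apply the hypothetical mapping to every code point in label."""
    return label.translate(HYPOTHETICAL_MAP)


# Toy zone: someone else has legitimately registered the label U+19D1.
zone = {"\u19d1": "data belonging to a different registrant"}

original = "\u19da"
mapped = apply_mapping(original)

print(mapped == original)   # the mapped label is a different string
print(zone.get(original))   # the original label was never registered
print(zone.get(mapped))     # the lookup now lands on someone else's data
```

The mapped label is not the label the user started with, so the
lookup either fails anyway or returns data registered under a
different name.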

It would be different if Unicode moved "0" from "Nd" to "No".
If they did, I'd be in the front of the "use Section G" line,
but less because "0" is very frequently used than because, in
IDNA terms and in terms of predictability to user expectations,
zero is a digit regardless of what Unicode or anyone else
decides to call it.   But they aren't likely to do that.

    john



More information about the Idna-update mailing list