Unicode 5.2 -> 6.0

Mark Davis ☕ mark at macchiato.com
Fri Oct 15 18:37:34 CEST 2010


A few comments.

> they would have known perfectly well that U+19DA wasn't a numeral

That is simply untrue. The character is a numeral; it is just not part of
the standard decimal digits. IDNA2008 chose to disallow alternate digits,
but that doesn't mean that the average person will know the precise
formulation in IDNA2008.


> The day that LATIN SMALL LETTER I changes class, I'll be happy to put
> something in category G. This is not that day.

Unicode was hammered by IETF people for making changes in *phenomenally*
obscure sequences of characters; it is ironic that people on this list are
making comments that imply they really only care about ASCII stability.


> The downside is that, for ever more, every IDNA implementation has to deal
> with this exception.

There is already a table of *41* exceptions in
http://tools.ietf.org/html/rfc5892#section-2.6:
[\u302E\u302F \u0640 \u07FA \u30FB \u00B7 \u05F3 \u05F4 \u0F0B \u0375 \u303B
\u3031-\u3035 \u0660\u06F0 \u0661\u06F1 \u0662\u06F2 \u0663\u06F3
\u0664\u06F4 \u0665\u06F5 \u0666\u06F6 \u0667\u06F7 \u0668\u06F8
\u0669\u06F9 \u00DF \u03C2 \u06FD \u06FE \u3007]

For that matter there are *2* exceptions in
http://tools.ietf.org/html/rfc5892#section-2.4:
[\u200C\u200D]

And there are *357* exceptions in
http://tools.ietf.org/html/rfc5892#section-2.8:
[\u1100-\u115E \uA960-\uA97C \u115F-\u11A7 \uD7B0-\uD7C6 \u11A8-\u11FF
\uD7CB-\uD7FB]

Adding a few exceptions more to guarantee stability across versions is
hardly a real issue.
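
For anyone who wants to check those three counts, here is a minimal sketch that tallies the code points in the sets quoted above. The names SET_2_6, SET_2_4, SET_2_8, and count_code_points are only labels made up for this illustration (keyed to the section numbers cited above); they come from no specification or library.

```python
# Inclusive (first, last) code point ranges, transcribed from the three
# bracketed sets quoted above. The SET_* names are illustrative labels only.
SET_2_6 = [
    (0x302E, 0x302F), (0x0640, 0x0640), (0x07FA, 0x07FA), (0x30FB, 0x30FB),
    (0x00B7, 0x00B7), (0x05F3, 0x05F4), (0x0F0B, 0x0F0B), (0x0375, 0x0375),
    (0x303B, 0x303B), (0x3031, 0x3035), (0x0660, 0x0669), (0x06F0, 0x06F9),
    (0x00DF, 0x00DF), (0x03C2, 0x03C2), (0x06FD, 0x06FE), (0x3007, 0x3007),
]
SET_2_4 = [(0x200C, 0x200D)]
SET_2_8 = [
    (0x1100, 0x115E), (0xA960, 0xA97C), (0x115F, 0x11A7),
    (0xD7B0, 0xD7C6), (0x11A8, 0x11FF), (0xD7CB, 0xD7FB),
]


def count_code_points(ranges):
    """Total number of code points covered by inclusive (first, last) ranges."""
    return sum(last - first + 1 for first, last in ranges)


for label, ranges in (("2.6", SET_2_6), ("2.4", SET_2_4), ("2.8", SET_2_8)):
    print(label, count_code_points(ranges))
```

The totals come out to 41, 2, and 357, matching the figures given above.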

====

Bottom line is that the IETF can choose stability or instability. The tools
for stability are there, and they are not particularly onerous. I would
recommend stability in order to maintain compatibility across versions, and
thus give people confidence that adopting IDNA2008 will not cause problems
in the future. Or the committee could have protracted discussions of the
probabilities that particular characters will or will not occur -- with
most participants basing conclusions on limited or incorrect understanding
of the characters involved, and on *no* real data.

Compare what has happened in Unicode. We instituted a grandfathering
mechanism for Unicode Identifiers set up to automatically add characters
that would become invalid because of property changes. (The definition is
quite close to what IDNA2008 has). Since we started guaranteeing stability,
we have added exactly the following characters to be grandfathered in
Unicode Identifiers.

# ================================================
2118          ; Other_ID_Start # Sm       SCRIPT CAPITAL P
212E          ; Other_ID_Start # So       ESTIMATED SYMBOL
309B..309C    ; Other_ID_Start # Sk   [2] KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
# Total code points: 4
# ================================================
00B7          ; Other_ID_Continue # Po       MIDDLE DOT
0387          ; Other_ID_Continue # Po       GREEK ANO TELEIA
1369..1371    ; Other_ID_Continue # No   [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
19DA          ; Other_ID_Continue # No       NEW TAI LUE THAM DIGIT ONE
# Total code points: 12
# ================================================
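
Because the listing is in a regular "code point(s) ; property" layout, it is easy to read mechanically. The following is a rough sketch, not an official parser: LISTING is just the data lines above pasted in, and LINE_RE and count_by_property are names invented for this example.

```python
import re

# The data lines from the listing above, one entry per line.
LISTING = """\
2118          ; Other_ID_Start # Sm       SCRIPT CAPITAL P
212E          ; Other_ID_Start # So       ESTIMATED SYMBOL
309B..309C    ; Other_ID_Start # Sk   [2] KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
00B7          ; Other_ID_Continue # Po       MIDDLE DOT
0387          ; Other_ID_Continue # Po       GREEK ANO TELEIA
1369..1371    ; Other_ID_Continue # No   [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
19DA          ; Other_ID_Continue # No       NEW TAI LUE THAM DIGIT ONE
"""

# A single code point or a "first..last" range, then ";" and the property name.
LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def count_by_property(text):
    """Count code points per property; lines that are not data lines are skipped."""
    totals = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue
        first = int(match.group(1), 16)
        last = int(match.group(2), 16) if match.group(2) else first
        name = match.group(3)
        totals[name] = totals.get(name, 0) + (last - first + 1)
    return totals


print(count_by_property(LISTING))
```

This gives 4 for Other_ID_Start and 12 for Other_ID_Continue, agreeing with the "Total code points" lines in the listing.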

We introduced Unicode identifiers first in March 2002 (Unicode 3.2).
Including it and Unicode 6.0, there have been 8 versions since then: 3.2,
4.0, 4.0.1, 4.1.0, 5.0, 5.1, 5.2, and 6.0. We had additions to the
grandfathering sets in 4 of those versions, the last being ~2.5 years ago.
This mechanism is computable, and has been easy to maintain; we simply
compare each successive version of Unicode to derive it. And it is a formal
property in Unicode, so implementations simply pick it up; we've never had a
problem with that.
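
To make "compare each successive version" concrete, here is a minimal sketch of the idea, assuming each version can be reduced to a set of code points. It is not the actual Unicode tooling: derive_grandfathered and the three-character toy sets are invented for illustration, and real data would come from the Unicode Character Database.

```python
def derive_grandfathered(prev_valid, new_by_property, prev_grandfathered=frozenset()):
    """Return the grandfathered set for the new version.

    prev_valid         -- code points valid in the previous version
                          (including anything grandfathered earlier)
    new_by_property    -- code points that qualify under the new version's
                          properties alone
    prev_grandfathered -- the grandfathered set carried over so far
    """
    # Anything that used to be valid but no longer qualifies gets grandfathered.
    dropped = set(prev_valid) - set(new_by_property)
    return set(prev_grandfathered) | dropped


# Toy data only: U+0069 LATIN SMALL LETTER I, U+19D1 NEW TAI LUE DIGIT ONE,
# and U+19DA NEW TAI LUE THAM DIGIT ONE, which stops qualifying as a decimal
# digit in 6.0.
VALID_5_2 = {0x0069, 0x19D1, 0x19DA}
BY_PROPERTY_6_0 = {0x0069, 0x19D1}

grandfathered_6_0 = derive_grandfathered(VALID_5_2, BY_PROPERTY_6_0)
valid_6_0 = BY_PROPERTY_6_0 | grandfathered_6_0

print(sorted(f"U+{cp:04X}" for cp in grandfathered_6_0))
assert VALID_5_2 <= valid_6_0  # nothing valid in 5.2 becomes invalid in 6.0
```

The point of the final check is the stability guarantee itself: whatever was valid in the earlier version stays valid in the later one, with no manual judgment about which characters "matter".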

We can expect roughly the same frequency for IDNA2008. This expected
frequency was discussed during the development of IDNA2008. And as was said
many times during the development of IDNA2008, these are the kind of
expected small adjustments that were inevitable for character properties --
particularly inevitable once the choice not to disallow historic scripts for
domain names was taken. It isn't unusual for some of the edge cases for
lesser-known scripts of Asia to take a while to shake down in actual
implementations, for example.

That was the reason that a grandfathering clause was put in, to allow
IDNA2008 to preserve absolute stability. Because the WG opted to make this a
manual process, that means that periodically the table needs to be amended.

Mark

*— Il meglio è l’inimico del bene —*


On Fri, Oct 15, 2010 at 08:37, John C Klensin <klensin at jck.com> wrote:

>
> --On Friday, October 15, 2010 10:22 -0500 Pete Resnick
> <presnick at qualcomm.com> wrote:
>
> > On 10/14/10 4:01 PM, Mark Davis ☕ wrote:
> >> The stability of domain names is far more important -- that
> >> once a  domain name is valid, that it remain so.
> >
> > So I wish to disagree with the above statement and therefore
> > *disagree* with the suggestion that we adopt (c) adding U+19DA
> > to G. I am in favor of (a).
> >
> > We made a design decision in IDNA2008 that domain names that
> > contained other than some small set of letters, digits, and a
> > small set of punctuation were more trouble than they were
> > worth. We made it clear to folks that acceptable domain names
> > should only contain certain classes of characters. What I take
> >...
> > The day that LATIN SMALL LETTER I changes class, I'll be happy
> > to put something in category G. This is not that day.
>
> Pete,
>
> To reinforce this from a slightly different perspective, we
> really want people to create domain name labels because the
> strings have some mnemonic significance for them.  If, instead,
> someone puts a character in a string because they found it
> somewhere, thought it was cute, or wanted to play games with an
> edge case, I don't think we should warp our general rules to
> keep them amused.
>
> If someone was actually using a New Tai Lue character, the
> assumptions of the standard are such that it is rational for us
> to assume that they actually understood the script and the
> characters in it.  That implies that they would have known
> perfectly well that U+19DA wasn't a numeral even if it took the
> Unicode Standard until 6.0 to catch the problem and make the
> correction.
>
> We never said it explicitly, but perhaps one of our criteria
> should be that we are not obligated to keep labels that violate
> the principles of the standard stable when it is discovered that
> the violations are based on simple classification mistakes.
>
>    john