Unicode 5.2 -> 6.0

Fri Oct 15 16:31:42 CEST 2010

--On Friday, October 15, 2010 09:59 +0100 Gervase Markham
<gerv at mozilla.org> wrote:

> On 14/10/10 23:08, Kenneth Whistler wrote:
>> Furthermore, the main community which is concerned with
>> the New Tai Lue script is in a very remote prefecture
>> in Yunnan Province in China:
>> 
>> http://en.wikipedia.org/wiki/Xishuangbanna_Dai_Autonomous_Prefecture
>> 
>> It would be very difficult to find out what that community
>> may *already* be doing with the New Tai Lue encoding,
>> for IDNs or anything else. Certainly I wouldn't bet on
>> no one among the nearly a million residents being
>> able to render U+19DA.
> 
> Let me see if I understand this:
> 
> - There is a character added to Unicode in 5.0/5.2, which was
> not valid in IDNA2003.
> 
> - For a short time, it was valid in IDNA2008.
> 
> - It was never registerable in any of the top-level domains
> who use a character whitelist (as far as I know).
> 
> - It was never renderable in Firefox (as no registry on the
> whitelist included it). It may have been renderable in other
> browsers if they had updated their code to IDNA2008 to allow
> it (which I don't think anyone has done yet), if their
> language-related or registry-related policies allowed it, and
> if the person viewing the address had a font installed
> including it. But I think it's most likely that no browser
> could display it.

As a matter of principle and theory, I wonder about this and
would like it if you explained it a bit further.  Keep in mind
that IDNA2003 essentially treats unregistered code points as
valid on lookup.  And, as far as I know, most browsers today
permit me to install and use my own fonts, at least for the
purpose of rendering content.  So, if you encountered one of
these code points, encoded in an A-label, in a pre-IDNA2008
version of the browser, in a deep subdomain of a TLD you
considered careful enough, would you not decode the A-label to a
U-label and then pass it off to whatever does rendering... and
which might include a pass through a font set (or a
locally-modified system font) that you didn't actually know
anything about?
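
To make the decoding step concrete, here is a minimal sketch in
Python of the A-label to U-label conversion described above.  It
assumes only the Punycode codec from the standard library; the
function names are hypothetical, and it deliberately performs no
IDNA2003 or IDNA2008 validity checks, which is the situation the
question is about.

```python
ACE_PREFIX = "xn--"  # prefix that marks an A-label


def to_a_label(u_label: str) -> str:
    """Encode a Unicode label as an A-label (no validity checks)."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Decode an A-label back to Unicode; pass other labels through."""
    if not a_label.lower().startswith(ACE_PREFIX):
        return a_label
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


# Hypothetical label containing U+19DA NEW TAI LUE THAM DIGIT ONE.
label = "\u1981\u19da"
assert to_u_label(to_a_label(label)) == label
```

Nothing in this step knows or cares whether a font for the decoded
code points is installed; that is decided later, by whatever does
the rendering.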

>...

> And now, we are concerned that someone in the Xishuangbanna
> Dai Autonomous Prefecture, or someone with business there,
> might have, in the small time window available, created a DNS
> entry using this name, using some editor to edit their zone
> file containing a font which can display it, and using some
> client or service which is able to display it to their users,
> and now would be significantly inconvenienced by changing it?
> 
> This seems unlikely to me. But I guess I don't live there.
> 
> The downside is that, for ever more, every IDNA implementation
> has to deal with this exception. Perhaps it's no big deal
> because the size of the exception list will inevitably
> eventually grow beyond zero, but it seems a shame.

And that, rather than semi-philosophical discussions about what
is maximally compatible with Unicode, is what troubles me about
the question of whether we should now increase the size of the
Exception table from zero to one or three.  We adopted the
model of having that table with the understanding that we would
certainly have to use it sooner or later but that, while we
needed to review blocks of new code points and every code point
for which relevant properties changed, it would be rare that we
would actually need to use the table.  The advantage of using it
is that we can maintain strict compatibility, although for a
fairly obscure case (or three cases).  The disadvantage is that
every use of that table implies a character that might need to
be treated slightly differently in IDNA than it would be in
running text or other i18n applications.
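
As a rough illustration of that tradeoff, the sketch below shows
how an override table sits in front of a property-based rule.  It
is a heavily simplified stand-in for the real derivation: the
category set, the function name, and the table contents are all
assumptions made for the example, and it relies on Python's
unicodedata module reflecting Unicode 6.0 or later.

```python
import unicodedata

# Simplified stand-in for the categories treated as letters and
# digits; the real derivation has more rules than this.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}


def derive_property(cp, overrides=None):
    """Return "PVALID" or "DISALLOWED" for one code point."""
    # An entry in the override table wins over the generic rule.
    if overrides and cp in overrides:
        return overrides[cp]
    category = unicodedata.category(chr(cp))
    if category in LETTER_DIGIT_CATEGORIES:
        return "PVALID"
    return "DISALLOWED"


THAM_DIGIT_ONE = 0x19DA  # NEW TAI LUE THAM DIGIT ONE

# With an empty table the generic rule applies to every code point.
print(derive_property(THAM_DIGIT_ONE))
# With one entry, IDNA and the generic rule disagree about it.
print(derive_property(THAM_DIGIT_ONE, {THAM_DIGIT_ONE: "PVALID"}))
```

Every entry added to the table is a code point for which the
answer from IDNA and the answer from the character's own
properties no longer agree.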

One could, for example, examine collation algorithms that tried
to make generic distinctions among numerals, letters, and other
types of symbols and marks.  Ideally, such algorithms would
handle domain names in the same way that they handle other text,
especially given the difficulty of reliably identifying domain
names in running text (and, given the discussion in
draft-iab-i18n-encoding, distinguishing public domain names from
similar-looking identifiers that might use different rules).
Consistency with a rule that says, essentially, "this is a
numeral after all" would require special cases in such
algorithms.

I don't think that case is blocking.  It may not even be
significant in practice.  But, once again, we need to remember
that there are tradeoffs in any of these decisions and that
there are advantages, including the one Gerv discusses, to using
the exception list as little as possible and perhaps only for
really compelling cases.

    john