Comments on the Unicode Codepoints and IDNA Internet-Draft

John C Klensin klensin at jck.com
Wed Jul 30 12:25:45 CEST 2008



--On Tuesday, 29 July, 2008 16:18 +0200 Kent Karlsson
<kent.karlsson14 at comhem.se> wrote:

> Frank Ellermann wrote:
> 
>> >> There is a reason why many code points such as
>> >> "mathematical fraktur capital B" or "black-letter capital
>> >> C" are disallowed.
> ...
>> I'd guess that those two code points are not really letters in
>> an Unicode sense,
> 
> 
> They are letters, but they cannot be part of a string in NFKC
> form since they have compatibility mappings to the
> corresponding nominal uppercase letters. They also have
> to-lower mappings to the corresponding "faktur" lowercase
> letters (which also have compatibility mappings to the
> corresponding nominal lowercase letters).

Kent,

This is obviously correct, but it answers a different question,
one about, I think, "how Unicode works" rather than "what is
appropriate for domain names".  The question Stephane poses, as
I understand it, is closer to the latter, especially the "why
not just allow every codepoint in Unicode and let the user sort
it out?" part.  That second question is much closer to the more
general Unicode topic of a profile for identifiers in
programming languages and similar situations but, because of the
needs of the DNS, not quite the same.
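
Incidentally, what Kent describes is easy to verify with, e.g.,
Python's unicodedata module (a quick sketch, nothing normative):

    import unicodedata

    # U+1D505 MATHEMATICAL FRAKTUR CAPITAL B and U+212D BLACK-LETTER
    # CAPITAL C: both are letters (category Lu), neither survives NFKC.
    for ch in ("\U0001D505", "\u212D"):
        print(unicodedata.name(ch), unicodedata.category(ch))
        print(unicodedata.normalize("NFKC", ch))   # plain "B", then "C"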

The answer is ultimately that the purpose of the DNS is to
identify various types of Internet resources (including, but not
limited to, hosts and their identifiers) and to permit building
and using other types of identifiers.  It is not simply to be
able to put something into a database and retrieve it or to
build a marketplace in a somewhat artificial resource.  To
"work" for identifiers in a CCS in which some characters can be
represented in more than one way, there must be, at minimum,
some mechanism either for making sure those character forms
compare equal or are mapped together, or for permitting only one
of them.  Note that, in this discussion, I really mean "same
character" -- nothing about subjectively visually similar,
security issues, or anything related.

One simply cannot have, e.g., "a with a ring above it" and "a
with a combining ring above" treated as not equal and still have
an identifier system, especially since a variety of systems and
operating environments outside the DNS freely map them back and
forth without asking.   Even with that qualification, "same
character" is a little subjective.  Are upper and lower case "a"
the same character for matching purposes?  Windows generally
says "yes".   Unix generally says "no".   The DNS says "yes" and
enforces that "yes" (for ASCII labels) in the actual matching
algorithm on DNS servers -- if I decide they are different, I'm
going to lose because there is no way I can persuade the DNS (at
least with a standard, non-extended, query in Class=IN).  So
we've had "no, you can't treat those two code points as
different" rules since the [ASCII] beginnings of the DNS and the
hostname/host table environment.
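
To make both points concrete, e.g., in Python (the
dns_label_equal helper below is purely illustrative, not how any
real server is coded):

    import unicodedata

    precomposed = "\u00E5"   # "a with ring above" as one code point
    decomposed  = "a\u030A"  # "a" + U+030A COMBINING RING ABOVE
    print(precomposed == decomposed)   # False: different sequences
    print(unicodedata.normalize("NFC", decomposed) == precomposed)
                                       # True: same character after NFC

    # The DNS matching rule for ASCII labels, roughly:
    def dns_label_equal(a: str, b: str) -> bool:
        return a.lower() == b.lower()

    print(dns_label_equal("Example", "eXaMpLe"))   # True: DNS says "yes"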

Note that the difference between NFC and NFKC ultimately comes
down to two different rules about which characters are
considered "different" and which the "same".
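
A character like U+FB01 (the "fi" ligature) shows the difference
directly; e.g., in Python:

    import unicodedata

    s = "\uFB01"                             # LATIN SMALL LIGATURE FI
    print(unicodedata.normalize("NFC", s))   # unchanged: still the ligature
    print(unicodedata.normalize("NFKC", s))  # "fi": compatibility mapping applied

NFC treats the ligature and "fi" as different characters; NFKC
treats them as the same.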

Coming back to Hangul, it seems to me that the question is
exactly the one Edmon suggests: if the Jamo are really combining
objects, used to build up characters and with the possibility of
representing the same character in either precomposed form or as
a sequence of Jamo, then we need to ban (at the protocol level)
either the Jamo or the precomposed forms.    "Archaic
characters" and comments about Cuneiform have nothing to do with
this except insofar as some characters weren't coded into
Unicode in precomposed form because they are not in contemporary
use, so, if one wants those characters, one has to have the
Jamo.  One also doesn't want a character built out of some
combination of Jamo floating around when a precomposed form of
that character is added in Unicode 20.2.  The fact that most
combinations of most of the Jamo map out under NFC or NFKC, and
are therefore prohibited by IDNA2008 (and IDNA2003), actually
argues for disallowing them to make that prohibition complete.
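
That mapping out is easy to see, e.g., in Python: NFC composes a
sequence of conjoining Jamo into the precomposed syllable, so
the Jamo sequence cannot survive in a normalized label.

    import unicodedata

    jamo = "\u1100\u1161"   # CHOSEONG KIYEOK + JUNGSEONG A
    print(unicodedata.normalize("NFC", jamo) == "\uAC00")
                            # True: composes to the syllable GA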

Edmon is also correct, IMO, about Chinese radicals.  There
are no combining radicals in Unicode.  If there were, we would
presumably be having exactly the same discussion about allowing
some or all of them to permit constructing obscure or obsolete
characters or DISALLOWing the lot.  Conversely, the fact that
combining radicals didn't get codepoints may raise questions, at
least for IDNA, about why the combining Jamo were (I think I
know the reason, and it makes sense, but the question is still
interesting).

Relative to these prohibitions preventing things like SRV
if they had been applied to the base DNS, two comments:

* Virtually every time in the history of software design that
someone has tried to use trick naming conventions as a
substitute for serious data typing, it has been a mistake.  One
of the greater problems with MIME implementations involves
systems that think that file suffixes are a better indication of
data types than content-type headers, and that is just one
example among many.  So, going out of our way to enable SRV-like
tricks may not be a good idea.

* However, that isn't the issue.  IDNA is about host names.
Probably it should have been called "IHNA", although the
distinction would have been lost on almost anyone not following
these discussions.  But we should be designing an IDNA which
works well (and that means identifiers with predictable
behavior) for host names.  If someone needs to internationalize
other kinds of uses of the DNS, nothing prevents _them_ from
making up a new model, or adapting the ??-- prefix model from
IDNA with their own prefix and rules.
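
For reference, the prefix model they would be adapting looks
like this in, e.g., Python, whose built-in "idna" codec
implements the IDNA2003 version of the "xn--" ACE encoding:

    # The classic example: "bücher" encodes to its ACE form.
    print("bücher.example".encode("idna"))  # b'xn--bcher-kva.example'

A new model would presumably keep the same shape but substitute
its own prefix and rules.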

    john






