Minimal IDNAbis requirements

John C Klensin klensin at jck.com
Mon Dec 17 20:34:31 CET 2007


Erik, Mark, and others,

We've been working on the IDNA revision on the assumption that
there will be many more names, much more heavily used, in the
future than there are now and that, as a consequence, a
protocol that is as clean and as kludge- and special-case-free
as possible is an appropriate target, even if it breaks some
current uses of
IDNs in contexts that are not strictly DNS registration or
resolution.  We have also been trying to address the subset of
issues raised in RFC 4690 that are amenable to solution in the
protocol.

I observe from your data that possibly-problematic label strings
occur in small fractions of a percent of the DNS names you
searched.  The people who most fervently believe in IDNs as a
major enabler of Internet use are convinced that they are likely
to constitute a majority of domain names after they are fully
deployed, supported in all important applications, and working
well.  The ratio between the fractions of a percent you are
seeing today and those expectations makes, IMO, an extremely
strong case for getting things right now, even if the transition
process is a little bumpy (which I continue to believe will
occur in a very small number of cases).  YMMV, of course.

However, given the tone of several recent messages, I want to
try, as an intellectual exercise at least, to see if we can
identify the list of changes that would need to be made to
IDNA2003 to respond to "this doesn't work" and "this causes
interoperability problems" issues, rather than worrying about
the broader issues raised in 4690, which include providing a
better platform for making presentation to the user culturally
and linguistically correct.

I think the list, in no particular order, is:

(1) Unicode version independence.  While I hope all of us would
agree that making IDNs possible from additional scripts is
highly desirable, the absence of those scripts is not an
interoperability problem in IDNA2003.  If we adopt the
minimalist view, we need Unicode version independence because,
in practice, applications rely on libraries and do not have much
control over the version of Unicode they are using.  Version
independence drives us toward tables and operations defined in
terms of properties, rather than a version-specific mapping
table like Stringprep.   We might consider getting the extra
scripts to be a desirable side-effect of the change in
definitional method rather than a primary goal.
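
As a rough illustration of the difference, here is a toy
property-based rule in Python (my own sketch for this message,
not the derivation in the IDNA200X drafts):

    import unicodedata

    def toy_derived_ok(ch):
        # Toy property-based rule, NOT the IDNA200X derivation:
        # decide from the character's properties in whatever
        # Unicode version this library was built against
        # (see unicodedata.unidata_version).
        cat = unicodedata.category(ch)
        if cat == "Cn":                    # unassigned in this version
            return False
        if cat[0] not in ("L", "M", "N"):  # letters, marks, digits only
            return False
        # reject anything NFKC maps away from itself
        return unicodedata.normalize("NFKC", ch) == ch

Nothing in such a rule names a Unicode version, so it keeps
working as the local library's tables advance.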

The version-independence issue raises one more problem.  The
position that we can define some small list of characters as a
group, apply some special action to them, and assume that no
characters of that group will ever be introduced in the future
is untenable.  Taking the dot/stop mapping list as
an example (more about that specific issue below), I do not
believe there is any way to guarantee that some script will not
be added in the future whose users will insist that they cannot
type an ASCII full stop or any of the existing characters on the
list, but need to use their own.  Of course, one could try to
say to them that, by virtue of being added late, they are
unimportant and just lose, but I don't think that position is
tenable either.  One might be able to work around this by
inventing a special property that identified that group, and
only that group, but then we had better be _very_ sure that
IDNA-unaware applications don't need to know about the property
or the group.

(2) We have to unambiguously ban, for both registration and
lookup, any character that is unassigned in the version of
Unicode the application is using.  As you have pointed out, not
doing so requires either a rigid and enforceable freeze at some
particular version of Unicode or a firm and enforceable
guarantee that no compatibility characters will ever again be
added to TUS.  The latter requirement applies whether we
continue to map in the protocol or not: if we do not, then
compatibility characters are prohibited; if we do, then adding
such a character would change the NFKC mapping of the code point
from itself to a new target code point.
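
The test itself is cheap; a minimal sketch in Python, where the
relevant Unicode version is simply whatever the interpreter's
character database was built from:

    import unicodedata

    def contains_unassigned(label):
        # General category "Cn" marks code points unassigned in
        # this library's Unicode version (unicodedata.unidata_version).
        return any(unicodedata.category(ch) == "Cn" for ch in label)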

(3) We need to get rid of the dot-mapping (or full stops, or
whatever one chooses to call them).  If we are to take the "no
DNS changes" approach, then IDNA-unaware applications and
systems
must be able to parse domain names into labels and unambiguously
convert back and forth from dot-separated-label form to
length-value-list form.  Now, in principle, one could simply
write a much stronger rule than now appears in RFC 3490,
forbidding non-ASCII dots in a domain name unless at least one
U-label occurs ("at least one IDN" isn't sufficient because LDH
labels are still IDNs in 3490).  The problem with that approach,
especially in the context of 3490's definition of an IDN, is that
things leak.  They may leak even if dot-mapping is pushed out to
UIs, but having non-ASCII full stops banned as just members of
the set of punctuation characters makes a much cleaner detection
mechanism.
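
To see why, consider a hypothetical IDNA-unaware converter from
dot-separated-label form to length-value-list form (Python; the
helper name is mine):

    def name_to_wire(name):
        # Split only on U+002E, the one separator the DNS has ever
        # defined; a non-ASCII "dot" such as U+3002 would silently
        # end up inside a label instead of between two labels.
        wire = b""
        for label in name.rstrip(".").split("."):
            octets = label.encode("ascii")  # wire labels are LDH/A-labels
            if not 0 < len(octets) < 64:
                raise ValueError("label must be 1-63 octets")
            wire += bytes([len(octets)]) + octets
        return wire + b"\x00"               # zero-length root label

If U+3002 must be honored as a separator, every such application
has to change; if it is simply banned, the error is at least
detected.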

(4) We need to very carefully re-examine case-mapping and the
logic behind the IDNA2003 rules.  We know about the "i" and
Eszett problems.  We know that a major reason for mapping Eszett
to "ss" was the absence of an upper-case character, but  I
gather that pressure is continuing to mount to assign a code
point to precisely such a character.  While we are having this
argument about the handling of Eszett, I don't believe that any
of the other traditional "Fraktur" ligatures have been assigned
in Unicode.  Is it possible that pressure will occur to include
some of them, or a non-spacing joiner suitable for use with
Latin characters, and that the pressure will mount sufficiently
that the characters are added to 10646 (I am assuming that such
characters would be more likely to enter the code space via
ISO/IEC JTC1/SC2 than via the Unicode Consortium, but I could be
wrong)?  Is it likely?  No, but I believe it is possible
eventually (Unicode version 15 anyone?) and that further
reinforces my resistance to being told that one should just
special-case a few characters rather than trying to develop
general, and stable, rules.
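
For anyone who has not tripped over these directly, both
problems are easy to demonstrate (Python; its str methods apply
the default, locale-independent Unicode case mappings):

    # Eszett: the case mapping is lossy, so the original ß can
    # never be recovered -- "straße" and "strasse" collide.
    assert "straße".upper() == "STRASSE"
    assert "STRASSE".lower() == "strasse"

    # Dotless i: Turkish pairs i/İ and ı/I, but the default
    # mapping gives İ a lower-case form of "i" plus U+0307
    # COMBINING DOT ABOVE.
    assert "ı".upper() == "I"
    assert "İ".lower() == "i\u0307"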

  Similarly, can we guarantee, forever, that no other cases
similar to the dotless "i" one will arise?  We know that several
of the basic Latin characters are decorated versions of earlier
forms, sometimes with the decorations added to make them
easier to distinguish from other characters.  Suppose we
discover a language that uses a bar-less "f" and that has
organized keyboards so that the only rational way to type a
standard Latin "f" is with small letter bar-less f followed by a
combining mid-height horizontal bar. Are these possibilities
likely?  Nope.  But I think we have to design on the basis that
Unicode and IDNs will be around for a _very_ long time and, if
we handle things as special-case exceptions now, we had better
have a plan about expanding the exception list.  

We need to remember that we got case-mapping in ASCII partially
because it is completely unambiguous and mechanical (i.e., it
can be done by an OR operation and does not require a table with
per-character rules).  Some of the IDNA (and Unicode) case rules
appear to be based on the assumption that we know which scripts
have case distinctions and that there will not be any more.
But, at the risk of being facetious, after the Martians invade
and insist on a plane of Unicode for their ideographic script
which has three cases (called, perhaps, "sunlight", "night", and
"flooded"), I'd like to believe that IDNs won't self-destruct. 

The easy path out of this is to get case-mapping out of the
standard so the Turks and Martians can make UI rules that make
sense to them.  But other solutions are certainly possible.

(5) NFKC mapping in the protocol is _not_ a problem in itself.
But if we don't eliminate it, we need to solve two other
problems. I don't know the solutions, but they may be out there:

(5a) Compatibility mappings occasionally turn out to be wrong
and need correction, whether by changing the mapping target or
by declaring something that was classified as a compatibility
character to be an independent character.  If one bans
compatibility characters at the
protocol level, then the character is invalid (and must either
be mapped out in a UI or rejected) regardless of what it maps
to.  And, if it is reclassified out of the compatibility
category, then it becomes a valid character with no conflicts in
that version of Unicode.   The alternative is no changes, ever,
under any circumstances, and that has already proven unrealistic
(and hence isn't on the rather narrow list in Patrik's "what has
to be stable" note).
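
For concreteness, this is what a compatibility mapping does
today under NFKC (Python's unicodedata exposes the normalization
forms):

    import unicodedata

    # U+FB01 LATIN SMALL LIGATURE FI carries a compatibility
    # mapping: NFKC folds it to "fi", while NFC leaves it alone.
    # A protocol frozen around one set of such judgments cannot
    # follow a later correction to them.
    assert unicodedata.normalize("NFKC", "\ufb01") == "fi"
    assert unicodedata.normalize("NFC", "\ufb01") == "\ufb01"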

(5b) If the protocol permits and enforces compatibility
mappings, we need to be very specific about contexts in which
only reduced-form labels (those for which
ToUnicode(ToASCII(ToUnicode(label))) == label) are permitted.
For example, we already have security-related identifiers that
use domain names or email addresses.  The relevant protocols
specify that the domain names MUST be in lower-case.  If they
are going to permit IDNs, then they will need to be restricted
to either A-label use (very bad for identifiers used by people)
or to reduced-form labels.
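
A minimal sketch of that reduced-form test, with to_ascii and
to_unicode as placeholders for an implementation of RFC 3490's
ToASCII and ToUnicode operations (no particular library's API is
intended):

    def is_reduced_form(label, to_ascii, to_unicode):
        # Reduced form: one more pass through the IDNA operations
        # changes nothing.
        try:
            return to_unicode(to_ascii(to_unicode(label))) == label
        except UnicodeError:
            return False   # the label is not representable at all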

(6) I personally believe that "prohibit symbols and punctuation"
belongs on the list too, but there may still be controversy
about that.
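
If that view prevails, the check is once again a property test
rather than a list; a sketch (the real rule would need
exceptions, starting with U+002D HYPHEN-MINUS):

    import unicodedata

    def is_symbol_or_punctuation(ch):
        # Major categories P* (punctuation) and S* (symbols).
        # Note that even "-" is category Pd, so a real rule needs
        # carve-outs.
        return unicodedata.category(ch)[0] in ("P", "S")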


Now, in each of these cases, we can devise a clean and general
solution and figure out how to deal with incompatibilities and
transitions, or we can try to find some sort of "characters in
Unicode 3.2 are special and are treated differently than
subsequent additions" mechanism.   One could even retain the
punctuation and symbols that way, since one of the worst
concerns about them is that they do not form a permanently closed
set.  Of course, that wouldn't solve the nomenclature and
identification parts of the symbol problem, but maybe that is
"just" a tradeoff.

One thing that I find interesting about this exercise is that I
think that the best tradeoff selection for each of the
requirements ends up taking us very close to where the IDNA200X
proposal is today -- one does not need to add many more
requirements or criteria to get there.    But one could make
other decisions and get other results.

    john 


