Minimal IDNAbis requirements

Erik van der Poel erikv at
Sat Dec 22 03:35:50 CET 2007


I agree that this was a useful exercise, and that it takes us pretty
close to where we are today with IDNA200X. I think it might also be
useful, not only for ourselves but also for future readers, to do the
opposite exercise, i.e., starting from the rules that Patrik, Ken and
Mark are writing and trying to come up with a rationale for each rule.

I believe that the mathematical slashes (division/fraction signs) were
the ones that really got us to think about disallowing symbols and
punctuation, but that example alone is not sufficient to justify
disallowing the whole set. Instead, we might say that the initial set of allowed
characters is purposely being limited to a "small" set in the
interests of being conservative, which is important in networking,
both at the machine level and at the human level.

We might even want to think of NEVER as being "probably never",
without actually writing that down. Future generations may well find
ways to introduce some of the symbols or even punctuation into IDNs
without causing any real harm.

By the way, I'm guessing, from your lack of response to my suggestion
of a new term "V-label" for the variants supported in browsers today,
that you'd rather not legitimize these things by giving them an
official name. Fair enough. But it still would be good to get the
UTF-8 SMTP drafts to explicitly say whether U-labels are required, and
whether non-ASCII dots are allowed.


On Dec 17, 2007 11:34 AM, John C Klensin <klensin at> wrote:
> Erik, Mark, and others,
> We've been working on the IDNA revision on the assumption that
> there will be many more names, much more heavily used, in the
> future than there are now and that, as a consequence, a
> protocol that is as clean and as kludge- and special-case-free as
> possible is an appropriate target, even if it breaks some current uses of
> IDNs in contexts that are not strictly DNS registration or
> resolution.  We have also been trying to address the subset of
> issues raised in RFC 4690 that are amenable to solution in the
> protocol.
> I observe from your data that possibly-problematic label strings
> occur in small fractions of a percent of the DNS names you
> searched.  The people who most fervently believe in IDNs as a
> major enabler of Internet use are convinced that they are likely
> to constitute a majority of domain names after they are fully
> deployed, supported in all important applications, and working
> well.  The ratio between the fractions of a percent you are
> seeing today and those expectations makes, IMO, an extremely
> strong case for getting things right now, even if the transition
> process is a little bumpy (which I continue to believe will
> occur in a very small number of cases).  YMMV, of course.
> However, given the tone of several recent messages, I want to
> try, as an intellectual exercise at least, to see if we can
> identify the list of changes that would need to be made to
> IDNA2003 to respond to "this doesn't work" and "this causes
> interoperability problems" issues, rather than worrying about
> the broader issues raised in 4690, where those issues include
> providing a better platform for making presentation to the user
> culturally and linguistically correct.
> I think the list, in no particular order, is:
> (1) Unicode version independence.  While I hope all of us would
> agree that making IDNs possible from additional scripts is
> highly desirable, the absence of those scripts is not an
> interoperability problem in IDNA2003.  If we adopt the
> minimalist view, we need Unicode version independence because,
> in practice, applications rely on libraries and do not have much
> control over the version of Unicode they are using.  Version
> independence drives us toward tables and operations defined in
> terms of properties, rather than a version-specific mapping
> table like Stringprep.   We might consider getting the extra
> scripts to be a desirable side-effect of the change in
> definitional method rather than a primary goal.
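[The property-based approach described above can be sketched in code. This is an illustrative simplification, not any normative table: it derives a character's status from its Unicode general category (as reported by the Python runtime's `unicodedata` module, which tracks whatever Unicode version that runtime ships), so the rule keeps working as new scripts are assigned. The function name and the three status labels are invented for this sketch.]

```python
import unicodedata

def status(cp: str) -> str:
    """Illustrative property-based rule: classify a single code point
    from its Unicode general category instead of consulting a frozen,
    version-specific mapping table like Stringprep."""
    cat = unicodedata.category(cp)
    if cat == 'Cn':                  # unassigned in this Unicode version
        return 'UNASSIGNED'
    if cat[0] in ('L', 'N'):         # letters and digits
        return 'ALLOWED'
    return 'DISALLOWED'              # punctuation, symbols, controls, ...
```

[Because the decision is computed from properties, a script added in a later Unicode version is picked up automatically: its letters stop being `Cn` and start being `L*`, with no change to the rule itself.]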
> The version-independence issue raises one more problem.  The
> position is untenable that we can define some small list of
> characters as a group and then apply some special action to
> them, assuming that no characters of that group will ever be
> introduced in the future.   Taking the dot/ stop mapping list as
> an example (more about that specific issue below), I do not
> believe there is any way to guarantee that some script will not
> be added in the future whose users will insist that they cannot
> type an ASCII full stop or any of the existing characters on the
> list, but need to use their own.  Of course, one could try to
> say to them that, by virtue of being added late, they are
> unimportant and just lose, but I don't think that position is
> tenable either.  One might be able to work around this by
> inventing a special property that identified that group, and
> only that group, but then we had better be _very_ sure that
> IDNA-unaware applications don't need to know about the property
> or the group.
> (2) We have to unambiguously ban, for both registration and
> lookup, any character that is unassigned in the version of
> Unicode the application is using.  As you have pointed out, not
> doing so requires either a rigid and enforceable freeze at some
> particular version of Unicode or a firm and enforceable
> guarantee that no compatibility characters will ever again be
> added to TUS.  The latter condition is true whether we continue
> to map in the protocol or not: if we do not, then compatibility
> characters are prohibited; if we do, then adding such a
> character would change the NFKC mapping of the code point from
> itself to a new target code point.
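[A small sketch of why unassigned code points must be rejected outright. The function name is hypothetical; `unicodedata` reflects whatever Unicode version the Python runtime implements. Today an unassigned code point normalizes to itself, but if a later version assigns it as a compatibility character, its NFKC mapping silently changes, which is exactly the hazard described above.]

```python
import unicodedata

def nfkc_map(ch: str) -> str:
    # Unassigned code points (general category 'Cn') normalize to
    # themselves today; a later Unicode version could give them a
    # compatibility decomposition and change that mapping.  So they
    # are rejected for both registration and lookup.
    if unicodedata.category(ch) == 'Cn':
        raise ValueError(f'U+{ord(ch):04X} is unassigned: reject')
    return unicodedata.normalize('NFKC', ch)
```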
> (3) We need to get rid of the dot-mapping (or full stops, or
> whatever one chooses to call them).  If we are to take the "no
> changes" approach, then IDNA-unaware applications and systems
> must be able to parse domain names into labels and unambiguously
> convert back and forth from dot-separated-label form to
> length-value-list form.  Now, in principle, one could simply
> write a much stronger rule than now appears in RFC 3490,
> forbidding non-ASCII dots in a domain name unless at least one
> U-label occurs ("at least one IDN" isn't sufficient because LDH
> labels are still IDNs in 3490).  The problem with that approach,
> especially in context with 3490's definition of an IDN, is that
> things leak.  They may leak even if dot-mapping is pushed out to
> UIs, but having non-ASCII full stops banned as just members of
> the set of punctuation characters makes a much cleaner detection
> mechanism.
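[For reference, the dot-mapping in question (RFC 3490, section 3.1) treats three non-ASCII characters as label separators equivalent to U+002E. A minimal sketch of what an IDNA2003-style implementation does before splitting a name into labels; the helper name is invented here:]

```python
# The three non-ASCII "dots" that IDNA2003 maps to U+002E FULL STOP
# before a domain name is split into labels (RFC 3490, section 3.1).
IDNA2003_DOTS = str.maketrans({
    '\u3002': '.',   # IDEOGRAPHIC FULL STOP
    '\uFF0E': '.',   # FULLWIDTH FULL STOP
    '\uFF61': '.',   # HALFWIDTH IDEOGRAPHIC FULL STOP
})

def split_labels(name: str) -> list[str]:
    """Split a domain name into labels the IDNA2003 way."""
    return name.translate(IDNA2003_DOTS).split('.')
```

[An IDNA-unaware application, of course, only splits on U+002E, which is precisely why the two parsers can disagree about how many labels a name contains.]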
> (4) We need to very carefully re-examine case-mapping and the
> logic behind the IDNA2003 rules.  We know about the "i" and
> Eszett problems.  We know that a major reason for mapping Eszett
> to "ss" was the absence of an upper-case character, but I
> gather that pressure is continuing to mount to assign a code
> point to precisely such a character.  While we are having this
> argument about the handling of Eszett, I don't believe that any
> of the other traditional "Fraktur" ligatures have been assigned
> in Unicode.  Is it possible that pressure will occur to include
> some of them, or a non-spacing joiner suitable for use with
> Latin characters, and that the pressure will mount sufficiently
> that the characters are added to 10646 (I am assuming that such
> characters would be more likely to enter the code space via
> ISO/IEC JTC1/SC2 than via the Unicode Consortium, but I could be
> wrong)?  Is it likely?  No, but I believe it is possible
> eventually (Unicode version 15 anyone?) and that further
> reinforces my resistance to being told that one should just
> special-case a few characters rather than trying to develop
> general, and stable, rules.
>   Similarly, can we guarantee, forever, that no other cases
> similar to the dotless "i" one will arise?  We know that several
> of the basic Latin characters are decorated versions of earlier
> forms, sometimes with the decorations occurring to make it
> easier to distinguish from other characters.  Suppose we
> discover a language that uses a bar-less "f" and that has
> organized keyboards so that the only rational way to type a
> standard Latin "f" is with small letter bar-less f followed by a
> combining mid-height horizontal bar. Are these possibilities
> likely?  Nope.  But I think we have to design on the basis that
> Unicode and IDNs will be around for a _very_ long time and, if
> we handle things as special-case exceptions now, we had better
> have a plan about expanding the exception list.
> We need to remember that we got case-mapping in ASCII partially
> because it is completely unambiguous and mechanical (i.e., it
> can be done by an OR operation and does not require a table with
> per-character rules).  Some of the IDNA (and Unicode) case rules
> appear to be based on the assumption that we know which scripts
> have case distinctions and that there will not be any more.
> But, at the risk of being facetious, after the Martians invade
> and insist on a plane of Unicode for their ideographic script
> which has three cases (called, perhaps, "sunlight", "night", and
> "flooded"), I'd like to believe that IDNs won't self-destruct.
> The easy path out of this is to get case-mapping out of the
> standard so the Turks and Martians can make UI rules that make
> sense to them.  But other solutions are certainly possible.
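[The "OR operation" point above is concrete: ASCII was laid out so that each upper-case letter differs from its lower-case partner only in bit 0x20, so case mapping needs no per-character table. A minimal sketch (the function name is invented for illustration):]

```python
def ascii_lower(ch: str) -> str:
    # For A-Z, setting bit 0x20 yields the lower-case letter;
    # everything else passes through unchanged.  No table required.
    if 'A' <= ch <= 'Z':
        return chr(ord(ch) | 0x20)
    return ch
```

[Nothing comparable exists for Unicode as a whole: case pairs there are table-driven, locale-sensitive (the Turkish "i"), and sometimes not even one-to-one (Eszett to "ss"), which is the core of the argument for moving case-mapping out of the protocol.]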
> (5) NFKC mapping in the protocol is _not_ a problem in itself.
> But, if we don't eliminate it, we need to solve two other
> problems. I don't know the solutions, but they may be out there:
> (5a) Compatibility mappings occasionally turn out to be wrong and
> need correction, whether by changing the mapping or by
> reclassifying something that was treated as a compatibility
> character as an independent one.  If one bans compatibility
> characters at the
> protocol level, then the character is invalid (and must either
> be mapped out in a UI or rejected) regardless of what it maps
> to.  And, if it is reclassified out of the compatibility
> category, then it becomes a valid character with no conflicts in
> that version of Unicode.   The alternative is no changes, ever,
> under any circumstances and that has already proven unrealistic
> (and hence isn't on the rather narrow list in Patrik's "what has
> to be stable" note).
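[One rough way to express "ban compatibility characters at the protocol level" in code: a character whose NFKC form differs from itself carries a compatibility (or singleton canonical) mapping, so it would be invalid regardless of its target. This is an illustrative approximation only; the function name is invented, and it also catches a few canonical singletons such as U+212B:]

```python
import unicodedata

def is_compatibility_char(ch: str) -> bool:
    # Rough test: anything NFKC does not leave alone is mapped to
    # something else, so under a "ban, don't map" policy it is
    # invalid at the protocol level no matter what it maps to.
    return unicodedata.normalize('NFKC', ch) != ch
```

[Under this policy, reclassifying a character out of the compatibility category simply makes the test start returning False in the newer Unicode version, with no conflicting registrations to unwind.]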
> (5b) If the protocol permits and enforces compatibility
> mappings, we need to be very specific about contexts in which
> only reduced-form labels (those for which
> ToUnicode(ToASCII(ToUnicode(label))) == label) are permitted.
> For example, we already have security-related identifiers that
> use domain names or email addresses.  The relevant protocols
> specify that the domain names MUST be in lower-case.  If they
> are going to permit IDNs, then they will need to be restricted
> to either A-label use (very bad for identifiers used by people)
> or to reduced-form labels.
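[The reduced-form test above is a fixed-point check on the IDNA round trip. Python's built-in "idna" codec implements IDNA2003 ToASCII/ToUnicode, so the idea can be approximated as follows; the function name is invented, and this is a sketch of the concept rather than the exact test from the text:]

```python
def is_reduced_form(label: str) -> bool:
    # A label is in reduced form if the IDNA2003 round trip
    # (ToASCII followed by ToUnicode) maps it back to itself.
    # Labels that nameprep would case-fold or NFKC-map fail.
    try:
        return label.encode('idna').decode('idna') == label
    except UnicodeError:
        return False
```

[For example, "bücher" survives the round trip, but "BÜCHER" does not, because nameprep case-folds it on the way to the A-label; an identifier protocol restricted to reduced-form labels would therefore reject the upper-case spelling.]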
> (6) I personally believe that "prohibit symbols and punctuation"
> belongs on the list too, but there may still be controversy
> about that.
> Now, in each of these cases, we can devise a clean and general
> solution and figure out how to deal with incompatibilities and
> transitions or we can try to find some sort of "characters in
> Unicode 3.2 are special and are treated differently than
> subsequent additions" mechanism.   One could even retain the
> punctuation and symbols that way, since some of the worst
> concerns about them are that they are not a permanently closed
> set.  Of course, that wouldn't solve the nomenclature and
> identification parts of the symbol problem, but maybe that is
> "just" a tradeoff.
> One thing that I find interesting about this exercise is that I
> think that the best tradeoff selection for each of the
> requirements ends up taking us very close to where the IDNA200X
> proposal is today -- one does not need to add many more
> requirements or criteria to get there.  But one could make
> other decisions and get other results.
>     john
> _______________________________________________
> Idna-update mailing list
> Idna-update at
