Comments on IDNAbis issues-05
John C Klensin
klensin at jck.com
Sun Jan 13 19:48:27 CET 2008
--On Wednesday, 09 January, 2008 16:28 -0800 Mark Davis
<mark.davis at icu-project.org> wrote:
> I sent this almost a month ago, and got no reply. I'm assuming
I will not repeat the comments made in my response to your note
about "protocol", nor say anything further about those issues
In rereading both your note and my first draft of my response,
I realized that I have assumed that most of your comments were
substantive --i.e., suggesting that the model or underlying
design of the specification was incorrect-- rather than
requests for editorial clarifications. If they were the
latter, please give me that information and let's try to focus
on substantive matters now and editorial ones later (or at
On Dec 13, 2007 7:48 PM, Mark Davis
<mark.davis at icu-project.org> wrote:
> ues-05.txt Overview. Many nice improvements to the text.
Thanks. It is good to hear that we are making progress.
> Issues-1. IDNAbis has a major backwards compatibility issue
> with IDNA2003: thousands of characters are excluded that used
> to be valid. What reason might people have to believe that
> despite the terms NEVER and ALWAYS that some future version,
> IDNAbis-bis, might not also do the same?
Those "thousands of characters" fall into three groups, and
it is worth examining each of them. In no particular order,
they are:
(i) Characters that cannot actually be represented in a
domain name (i.e., in A-label or ToUnicode(ToASCII(string))
form) even though they can be mapped into it. These
characters include upper-case ones and ones mapped into
other things by NFKC plus, depending on how things are
defined, the "variant dots" that have been extensively
discussed on list, I think since your note was sent. The
issues with them have been extensively explored elsewhere,
most notably in the recent thread about dot-mapping and my
recent response to your note about the
(ii) Characters that can be represented in a domain name
but that have always (i.e., since IDNA2003 was published)
been discouraged or prohibited by various statements and
guidelines which were intended to be applicability
statements about the protocol. This group of characters
includes characters that are not used to write the words of
any language, such as the various symbols, line-drawing,
and punctuation objects. While we know that some of those
characters (fortunately a very small percentage) have been
used in domain names, it seems to us that a few bad
practices, some of them usage on the "because we can"
principle rather than out of perceived necessity, should
not prevent revising the protocol to make IDNs more robust
(iii) Characters that are excluded by IDNA200X but
permitted by IDNA2003 that do not fall into one of the
above groups. There are very few of these characters,
perhaps none, certainly not "thousands" or even "tens".
Any of them can be dealt with as special cases if they are
I repeat the above, not because the statements are new (we have
been over that list, in various forms, many times before), but
because the answer to your question really lies in it.
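
Group (i) can be illustrated with a minimal sketch. It assumes
Python's built-in "idna" codec, which follows IDNA2003 (RFC
3490), as a stand-in for ToASCII and ToUnicode; the helper name
round_trip is hypothetical. The point is only that an upper-case
input is accepted and mapped, yet the upper-case form itself
never comes back out of ToUnicode(ToASCII(string)).

```python
# Sketch only: the stdlib "idna" codec implements IDNA2003, so it
# applies Nameprep (case folding, NFKC) before Punycode encoding.

def round_trip(label: str) -> str:
    """Approximate ToUnicode(ToASCII(label)) with the stdlib codec."""
    a_label = label.encode("idna")   # ToASCII, including the mapping step
    return a_label.decode("idna")    # ToUnicode

for original in ["Bücher", "BÜCHER", "bücher"]:
    # All three inputs yield the same lower-case result.
    print(repr(original), "->", repr(round_trip(original)))
```

Under these assumptions the mapped-away forms are accepted as
input but cannot themselves be stored in, or recovered from, the
DNS.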
IDNA200X is excluding many characters that were previously
permitted (by one definition of "permitted" or another) because
it represents a change in design principles from IDNA2003.
Those design principles are far more consistent with normal
design principles for Internet protocols than the earlier ones.
For example they exclude things that are not demonstrably
necessary, rather than including everything because it appears
that one can, and move more of the prohibitions into the
protocol rather than "guidance" hand-waving about what
registration entities, at all levels of the domain tree, should
Could the principle change yet again? In principle, of course.
The community could, for example, decide that IDNs are
impossible without language information, that a user or system
needs to have language information available when strings are
looked up, and, as a consequence, that the prefix and coding
structure must be changed to include a language code in both
the ACE and native strings. I believe the likelihood of that
is on a par with your belief in the likelihood of Unicode ever
removing a character (or fundamentally changing the definition
of a character, which is the same thing in practice). But in
neither case is there a firm way to bind behavior in the
Just as you have suggested that we incorporate a parenthetical
note in various places indicating that something could not
happen without violating a Unicode policy, it would be
plausible to insert notes of that type in some of this text.
However, there is a difference which may be important.
Let me illustrate with an example from my history with
programming languages and standards. The users expected that
the language definitions and implementations would be much more
stable than the applications they wrote using them. If
something compiled one day and not the next or, worse, compiled
both days but produced different behavior, the programmers and
perhaps the end-users would typically be severely irritated
--even if the change was to fix a bug that they had worked
around with the effect that the change broke the work-around
and hence the program. We are, I think, in much the same
situation with Unicode. For applications that depend on the
standard to be stable, the requirements for Unicode stability
have to be much stronger than the requirements on the
applications themselves. IDNs are not, in that sense, quite
an application but just as stability in IDN behavior is
necessary to keep application behavior stable, an even
stronger standard of stability in Unicode is needed to keep
> Issues-2. IDNA provided for backwards compatibility, by
> disallowing Unassigned characters in registration, but
> allowing them in lookup. That let old clients work despite
> new software. While once we update to U5.1 that is not as
> much of a problem, it should be made clear why this change is
It has been explained several times and in several places. I
can try adding more text here, but the key reason is that there
is no way to know the normalization and character class
properties, or other key properties, of a code point that has
not been assigned yet. I thought we had agreement on the
importance of that principle back when stabilized string
normalization and what is now called NPSS were being discussed.
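
A small sketch of why unassigned code points are a problem,
assuming Python's standard unicodedata module. U+0378 is used on
the assumption that it is unassigned in the Unicode version
bundled with the running interpreter; a later Unicode release
could assign it, at which point every value below could change.

```python
import unicodedata

# Assumption: U+0378 is unassigned in this Python's Unicode data.
code_point = "\u0378"

category = unicodedata.category(code_point)        # "Cn" = unassigned
name = unicodedata.name(code_point, None)          # no name is known yet
nfkc = unicodedata.normalize("NFKC", code_point)   # unchanged, for now

print(unicodedata.unidata_version, category, name, nfkc == code_point)
```

Today the code point appears stable under normalization only
because nothing is known about it, not because its eventual
properties guarantee that.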
> Issues-3. In general, whenever a statement is made about some
> class of characters causing a problem, at least one clear
> example should be provided, as in
Noted. However, in some places, examples have been omitted
because we know that they would cause more controversies than
they are worth. See comments below about appropriation of
languages into scripts.
> Issues-4. I would strongly suggest separating all of the "why
> did we do this" and "how is it different from IDNA2003" into
> a separate document. It will be of only historical interest
> after this becomes final, and will then only clutter the
This document is primarily about explanation and rationale (and
should probably be retitled accordingly). In other words,
"issues" is intended to become that "separate document", with
all of the actual protocol materials moved into the "protocol"
specification. If you believe there is remaining material here
that should be there, your help in identifying it would be
appreciated. However, in circumstances in which client
implementers are likely to do whatever they think best (see
discussion in my response to your notes on "protocol") the type
of material you characterize as "why did we do this" may be
important for more than historical reasons. It, far more than
some "IETF stamp of approval", may be critical in persuading
people that this is the right thing to do and to do it.
> IDNA uses the Unicode character repertoire, which avoids
> the significant delays that would be inherent in waiting
> for a different and specific character set be defined for
> IDN purposes, presumably by some other standards
> developing organization.
> Seems odd. There are no other contenders in the wings. Would
> be better, if this has to be said, to just cite other IETF
> documents describing the reasons for using Unicode.
A frighteningly large (frightening, at least, to me) number of
the discussions of IDNA in the target communities start or end
with a statement of belief that Unicode is hopelessly broken or
very badly designed for their purposes, at least with their
languages or scripts. While some of those concerns ultimately
involve misunderstandings, others involve information that
appears to be needed but isn't available, presentation form
difficulties, etc. Some of the issues involved are very
specific to IDNs (e.g., when language information is needed,

Unicode strings in XML can be tagged with it, but domain names
cannot be). However odd it might be, that paragraph is an
attempt to say to the folks who want to propose different
character coding systems that, to use your words above, "there
are no other contenders in the wings" and it is time to adapt
as needed to Unicode's coding structure and move on. Proposals
as to better ways to say that would be welcome but, although
I'd welcome pointers to alternatives, I don't see a better
place to put it.
> To improve clarity, this document introduces three new
> terms. A string is "IDNA-valid" if it meets all of the
> requirements of this
> specification for an IDNA label. It may be either an
> "A-label" or a "U-label", and it is expected that specific
> reference will be made to the form appropriate to any
> context in which the distinction is important.
> A "U-label" is an IDNA-valid string of
> Unicode-coded characters that is a valid output of
> performing ToUnicode on an A-label, again regardless of
> how the label is actually produced.
> These definitions appear circular, so they need to be teased
> out a bit.
They are certainly not circular. I'll take another look at the
form of the definitions. Again, specific suggestions welcome.
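
One way to see that the two definitions are mutually inverse
rather than circular is the following sketch. It assumes Python's
standard "punycode" codec and the "xn--" ACE prefix; the helper
names to_a_label, to_u_label and is_consistent_pair are
hypothetical, and no IDNA validity checks (permitted characters,
length limits, contextual rules) are attempted.

```python
ACE_PREFIX = "xn--"

def to_a_label(u_label: str) -> str:
    """Encode a Unicode label in ACE form (no validity checks)."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")

def to_u_label(a_label: str) -> str:
    """Decode an ACE-form label back to Unicode (no validity checks)."""
    if not a_label.lower().startswith(ACE_PREFIX):
        raise ValueError("not an ACE-prefixed label")
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")

def is_consistent_pair(u_label: str, a_label: str) -> bool:
    """True when each form is exactly the conversion of the other."""
    return to_a_label(u_label) == a_label and to_u_label(a_label) == u_label

print(to_a_label("bücher"), to_u_label("xn--bcher-kva"))
```

In this reading, each term names one end of a single reversible
conversion, and the validity rules are what is left to be
defined independently.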
> Depending on the system involved, the major difficulty may
> not lie in the mapping but in accurately identifying the
> incoming character set and then applying the correct
> conversion routine. It may be especially difficult when
> the character coding system in local use is based on
> conceptually different assumptions than those used by
> Unicode about, e.g., how different presentation or
> combining forms
> are handled. Those differences may not easily yield
> unambiguous conversions or interpretations even if each
> coding system is internally consistent and adequate to
> represent the local language and script.
> I suggest the following rewrite:
> The main difficulty typically is that of accurately
> identifying the incoming character set so as to apply the
> correct conversion routine. Theoretically, conversion could
> be difficult if the non-Unicode character encoding system
> were based on conceptually different assumptions than those
> used by Unicode about, e.g., how different presentation or
> combining forms are handled. Some examples are the so-called
> "font-encodings" used on some Indian websites. However, in
> modern software, such character sets are rarely used except
> for specialized display.
While it is clearly easier to read, it doesn't say quite the
same thing. I will try to rework. More important, there seems
to be some disagreement about the last comment. Can you supply
a citation to something that I can use in the document? That
should ideally be to other than a Unicode Consortium study or
opinion, since the people who claim that these transcoding
functions are causing pain tend to believe their solutions are,
or should be, much more widely deployed and that Unicode
Consortium statements and data articulate the organization's
own views by definition, and therefore cannot be unilaterally
invoked in a fully objective discussion.
> That, in turn, indicates that the script community
> relevant to that character, reflecting appropriate
> authorities for all of the known languages that use that
> script, has agreed that the script and its components are
> sufficiently well understood. This subsection discusses
> characters, rather than scripts, because it is explicitly
> understood that a script community may decide to include
> some characters of the script and not others.
> Because of this condition, which requires evaluation by
> individual script communities of the characters suitable
> for use in IDNs (not just, e.g., the general stability of
> the scripts in which those characters are embedded) it is
> not feasible to define the boundary point between this
> category and the next one by general properties of the
> characters, such as the Unicode property lists.
> There is no justification given for this process. Moreover,
> it will be doomed to failure. Merely the identification of
> "script communities" is an impossible task. Who speaks for
> the Arabic script world? Saudi Arabia (Arabic)? Iran
> (Persian,...)? Pakistan (Urdu,...)?, China (Uighur,...)?
Actually, the description above, although it could be made much
longer and more detailed, is the motivation and justification
for some careful per-script process that involves
representatives of each script community. This is obviously a
key problem. The fact that non-specialists think in terms of
languages and language-specific writing systems, not scripts
(as Patrik has pointed out many times), makes it much more
difficult. We do know that some portions of (or at least
individuals in) the relevant communities are quite willing to
assert that their scripts belong to their language group (e.g.,
"Arabic script belongs to the Arabs and, if the Persians don't
like it, they should find their own script"). Some of those
people believe that the problem is entirely of Unicode's making
in that the script was artificially "unified" by uncritically
accepting a pre-existing convenient standard for the writing
system of a major language, after which the additional
languages were appropriated into the script by adding code
points for the characters they needed but taking the
presentation assumptions of that major language as a given.
That thinking obviously leads to a claim that every language
actually has its own script and deserves its own separate codes
and code block(s). I share what I assume is your conviction
that such a claim is ultimately untenable. But that doesn't
justify treating the position with disrespect. The people who
hold it are behaving rationally from a cultural and linguistic
preservation standpoint and we have to figure out how to give
them more of a voice in decisions about the script(s) that
affect them than they perceive they have gotten in the past.
With the understanding that some of these are the same folks
who are arguing for junking Unicode for IDN purposes and
starting over (see my comments on your "protocol" notes and
above), the one thing that they will agree on is that some
small group of people operating out of California (or even the
US) are not the appropriate decision-makers for the use of
their languages in IDNs.
I note that, while you say that identification of script
communities is an "impossible task", you don't offer a solution
or even a better suggestion.
> it is removed from Unicode.
> (multiple instances)
> This is not necessary; characters aren't removed from
> Unicode. If you really have to have it, then add "(however,
> the Unicode stability policies expressly forbid this)"
See comment on this subject in the "protocol" response and the
comments on your Issues-1 above. If it will never happen, then
the condition will never apply.
> Applications are expected to not treat "ALWAYS" and "MAYBE"
> differently with regard to name resolution ("lookup").
> They may choose to provide warnings to users when labels
> or fully-qualified names containing characters in the
> "MAYBE" categories are to be
> In practice, expecting applications to treat these
> differently is wishful thinking; especially if it seems
> Eurocentric to users (see other notes on MAYBE). In practice,
> registries always have the ability to filter characters out.
> See above on removing Maybe.
You didn't include your notes on removing MAYBE in this
message, but let me respond on the basis of previous remarks by
you and others.
We've had many requests for guidance about what is safe, what
is clearly bad, and where caution is needed. They have come from
browser vendors, from zone administrators (many of them
completely unaffected by whatever policies ICANN makes about
TLDs they control or advise), and from others who depend on
domain names and things that look like them (including parts of
the security and PKI Certificate community). Some of
those requests have come from users of scripts that you have
implied that we have put into MAYBE because of a Eurocentric
view -- they are encouraging us to go cautiously on their scripts
until they have time locally to get some issues straightened
out. Note that some of these requests are on the registration
or registration policy side and others are on the lookup or
other application side.
The MAYBE category, the distinction between MAYBE YES and MAYBE
NO, and the discussion associated with them, are intended to
provide exactly that guidance, especially in cases where we, or
the relevant script community (see above), have concluded that
there are still details to be sorted out. Now perhaps you are
right that, having asked for guidance, those groups will
ignore it and do whatever they like locally. I would predict
that they would make use of the advice. However, if they did
not, removing the distinctions implied by MAYBE somewhere down
the line would be fairly easy. By contrast, adding
distinctions or restrictions later always proves very
difficult, especially with regard to forward and backward
The observation that MAYBE NO and MAYBE YES are
indistinguishable from a lookup protocol standpoint does not
make them useless for all purposes.
Finally, while registries always have the ability to forbid
specific characters (I'm not quite sure what "filter characters
out" would mean on the registry side of things), some
registries make their own registration policies and some do
not. For those who make policies, categories and distinctions
may be helpful in discussing and writing possible rules even if
those categories don't directly affect the protocol. The text
suggests some ways in which such rules could be formulated.
Again, in the contexts in which extra distinctions are not
important, it is harmless to ignore them.
> 5.1.3. CONTEXTUAL RULE REQUIRED
> I know what the point is supposed to be (and don't disagree),
> but this section was very hard to make out.
I'll take another look at the text. Thanks. Again, specific
suggestions (from you or others) would be welcome.
> Characters that are placed in the "NEVER" category are
> never removed from it or reclassified. If a character is
> classified as "NEVER" in error and the error is
> sufficiently problematic, the only recourse is to
> introduce a new code point into Unicode and classify it as
> "MAYBE" or "ALWAYS" as appropriate.
> The odds of this happening are extremely low. Anything in
> Never has to be extremely certain.
We agree (and that is another reason for the somewhat gray area
described as "MAYBE"). Do you think that needs to be said
> Instead, we need to have a variety of approaches that,
> together, constitute multiple lines of defense.
> Defense against what? Without examples, it is hard to say
> what the problems are.
> Applications MAY
> allow the display and user input of A-labels, but are not
> encouraged to do so except as an interface for special
> purposes, possibly for debugging, or to cope with display
> limitations. A-labels are opaque and ugly, and, where
> possible, should thus only be exposed to users who
> absolutely need them. Because IDN labels can be rendered
> either as the A-labels or U-labels, the application may
> reasonably have an option for the user to select the
> preferred method of display; if it does, rendering the
> U-label should normally be the default.
> It is, however, now common practice to display a suspect
> U-Label (such as a mixture of Latin and Cyrillic) as an
I deliberately didn't say that although I certainly agree that
it is true. Let me explain why and then ask for your
suggestions and those of others as to how to handle this.
After a series of informal and very unscientific tests with end
users, I'm convinced that they don't see A-labels as much
different from a sequence of question marks or boxes. Both are
completely opaque with regard to what the original string was
and that is bad news if there is some possibility that there
are some labels that are legitimate but would fail the test and
be displayed in A-label form... the user would have essentially
no way to discriminate between legitimate mixed-script labels
and evil ones.
In retrospect, while "display punycode (A-labels) instead" was
a good fast patch once the problems with phishing and suspect
labels became clear, I think your original suggestion of
displaying the strings in U-label form but with some special
emphasis (colors, special fonts, flags, warning dialog boxes
when touched,...) is a much better idea for "suspect" labels
and that "display A-labels" should be reserved for labels that
are known to be bad, not just ones that are suspect because of
some combination that may actually be legitimate.
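
The "flag it, but keep the U-label visible" approach could be
sketched roughly as follows. This assumes Python's standard
unicodedata module and uses a deliberately crude heuristic: the
first word of each letter's Unicode character name stands in for
its script, since the standard library exposes no script
property. The names scripts_in and display_form, and the "[!]"
marker, are hypothetical.

```python
import unicodedata

def scripts_in(label: str) -> set:
    """Crude script guess: first word of each letter's Unicode name."""
    scripts = set()
    for ch in label:
        if unicodedata.category(ch).startswith("L"):
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def display_form(label: str) -> str:
    """Keep the U-label readable, but mark it when scripts are mixed."""
    scripts = scripts_in(label)
    if len(scripts) > 1:
        return f"[!] {label} (mixed: {', '.join(sorted(scripts))})"
    return label

print(display_form("paypal"))
print(display_form("p\u0430ypal"))  # second letter is CYRILLIC SMALL LETTER A
```

A real implementation would need proper script data and a policy
for which mixtures are legitimate; the sketch only shows that the
user can still read the label while being warned about it.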
I note, for example, that various Russians have claimed that
Cyrillic-Latin mixtures are absolutely essential. If one
accepts their position, then one either needs to do a lot of
localization about what the user finds reasonable or to treat
Cyrillic-Latin mixes differently from Latin-Cyrillic ones (and
I have no idea how one would write a rule that described the
Does this need to be discussed in "issues", regardless of what
is or is not said about common practice?
> 6.3. The Ligature and Digraph Problem
> There are a number of languages written with alphabetic
> scripts in which single phonemes are written using two
> characters, termed a
> "digraph", for example, the "ph" in "pharmacy" and
> The text has been improved considerably from earlier
> versions, but the whole issue is just a special case of the
> fact that words are spelled different ways in different
> languages or language variants. And it has really nothing to
> do with ligatures and diagraphs. The same issue is exhibited
> between theatre.com and theater.com as between a Norwegian URL
> with ae and a Swedish one with a-umlaut.
> So if you retain this section, it should be recast as
> something like
> 6.3 Linguistic Expectations
> Users often have certain expectations based on their
> language. A Norwegian user might expect a label with the
> ae-ligature to be treated as the same label using the Swedish
> spelling with a-umlaut. A user in German might expect a label
> with a u-umlaut and the same label with "ae" to resolve the
> same. For that matter, an English user might expect
> "theater.com" and "theatre.com" to resolve the same. [more in
> that vein].
I think this is an improvement, although part of that text was
motivated by the list discussion about the handling of Eszett
(Latin Small Letter Sharp S, U+00DF) and its friends. The
anomaly there is that it is treated as a "real" Latin character
by NFKC (and special-cased in IDNA2003), while its very close
relative U+FB05 is treated as a ligature, named as one ("Latin
Small Ligature Long S T") and defined as a compatibility
character that is mapped out by NFKC.
One can argue for the difference on the basis of frequency of
use and on the basis of what the relevant community decided to
include in ISO 8859-1 (I assume the latter was the ultimate
cause of the Unicode decision). But when one is treated as a
letter and the other as a compatibility ligature, some
communities find the distinction strange. Perhaps that is the only
case of this, perhaps not, but we have heard claims that, e.g.,
some of the Arabic ligatures raise similar issues.
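
The difference in treatment between U+00DF and U+FB05 can be
checked directly. This is a minimal sketch assuming Python's
standard unicodedata module; the helper name nfkc is
hypothetical. It looks only at NFKC and does not model the
IDNA2003 special-casing of U+00DF.

```python
import unicodedata

def nfkc(ch: str) -> str:
    """Apply Unicode normalization form KC to a string."""
    return unicodedata.normalize("NFKC", ch)

for ch in ("\u00df", "\ufb05"):
    mapped = nfkc(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"NFKC -> {mapped!r}, changed={mapped != ch}")
```

U+00DF survives NFKC unchanged, while U+FB05 is mapped away to
two ordinary letters.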
> there is no evidence that
> they are important enough to Internet operations or
> internationalization to justify large numbers of special cases
> and character-specific handling (additional discussion and
> I suggest the following wording instead:
> there is no evidence that
> they are important enough to Internet operations or
> internationalization to justify inclusion (additional
> discussion and
> It doesn't actually involve "large numbers of special cases",
> there are a rather small percentage of demonstrable problems
> in the symbol/punctuation area. What we could say is that
> there is general consensus that removing all but letters,
> digits, numbers, and marks (with some exceptions) causes
> little damage in terms of backwards compatibility, and does
> remove some problematic characters like fraction slash.
Much better. Thanks.
> For example, an essential
> element of the ASCII case-mapping functions is that
> uppercase(character) must be equal to
> Remove or rephrase. It is a characteristic, but not an
> essential one. In fact, case mappings of strings are lossy;
> once you lowercase "McGowan", you can't recover the original.
Hmm. The original statement doesn't say "not lossy" (or, more
accurately, "not lossy from an information standpoint"). It
just says that a particular function applies. It is
incontrovertible that the function applies. The fact that the
function applied (and that it could be applied very
mechanically) was one reason for the case-matching rule that
made it into the base DNS specs from the Hostname rules. And
it is easy to demonstrate that it does not apply in every case
once one starts examining the Unicode characters with a case
property rather than the version of Base Latin characters that
appears in ASCII/ ISO 646IRV.
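
As a rough illustration, the sketch below tests one plausible
round-trip property, lower(upper(c)) == lower(c). That exact
identity is an assumption chosen for the example (the property
quoted from the draft is cut off above), and the helper name
round_trips is hypothetical. Python's built-in str.upper and
str.lower stand in for the case-mapping functions.

```python
import string

def round_trips(ch: str) -> bool:
    """Assumed property: lower-casing the upper-case form loses nothing."""
    return ch.upper().lower() == ch.lower()

# Holds mechanically for every ASCII letter.
print(all(round_trips(c) for c in string.ascii_letters))

# Fails for some Unicode characters that have a case property.
for ch in ("\u00df", "\u0131"):   # sharp s, dotless i
    print(f"U+{ord(ch):04X}: upper={ch.upper()!r}, "
          f"back={ch.upper().lower()!r}, ok={round_trips(ch)}")
```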
> o Unicode names for letters are fairly intuitive,
> recognizable to uses of the relevant script, and
> unambiguous. Symbol names are more problematic because
> there may be no general agreement on whether a
> particular glyph matches a symbol, there are no uniform
> conventions for naming, variations such as outline,
> solid, and
> Actually, the formal Unicode names are often far from
> intuitive to users of the relevant script. That's because the
> constraints of using ASCII for the name, to line up with ISO
> standards for character encodings.
> This section is not really needed. The use of I<heart>NY.com
> is not really problematic; the main justification for
> removing it is that we don't think it is needed (and has not
> been used much since IDNA was introduced). Better to just
> stick with that.
I would welcome a better way to say this. But I<heart>NY.com
is problematic because the user hearing it described that way
has no way to know --absent careful study of the code tables--
whether it is ambiguous or not. Conversely, the user seeing
the "name" doesn't know whether to read it --try to transmit it
to a colleague by voice-- as I<heart>NY.com (or any of the
examples below) or as I<love>NY.com, where the latter is
undoubtedly what the owner of such a domain would intend.
Please consider, as an example, U+2605 and U+2606. While
<heart> may be unambiguous in U+2764 (actually it is not
either, but let me come back to that), one could not have
I<star>NY.com because one has an immediate ambiguity, at the
level of the Unicode character name, between I<black
star>NY.com and I<white star>NY.com. Worse, because these are
not what we have informally called language characters, the
user has no way to know, absent careful study of the Unicode
tables, whether <white star> is just a different font
stylization of <black star> or is a separate character.
Of course, it is even worse than this. There is no possible
way for a normal, casual, user to tell the difference between
the stars of U+2606 and U+2729 or the hearts of U+2665 and
U+2765 without somehow knowing to look for a distinction.
Since we have a white heart (U+2661) as well as a few black
hearts, the problem with hearts is ultimately the same as the
one with stars and I<heart>NY.com is hopelessly ambiguous. I
note that I live in an area in which someone might well believe
that harvard<square>.com would be an attractive domain and that
there are far more Squares of various flavors than there are
hearts or stars.
Unlike font and style variations in language (and
"mathematical") characters, identification of compatibility
encodings and application of NFKC is of no help here. All of
these symbols (and many other pairs and triples) are treated as
valid, independent, non-reducible, code points.
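
The code points mentioned above can be inspected with a short
sketch, assuming Python's standard unicodedata module; the list
SYMBOLS and the helper describe are hypothetical names. It prints
each character's formal name and whether NFKC leaves it
unchanged.

```python
import unicodedata

# Stars and hearts cited in the discussion above.
SYMBOLS = [0x2605, 0x2606, 0x2729, 0x2661, 0x2665, 0x2764, 0x2765]

def describe(cp: int):
    """Return (Unicode name, True if NFKC leaves the character as-is)."""
    ch = chr(cp)
    return unicodedata.name(ch), unicodedata.normalize("NFKC", ch) == ch

for cp in SYMBOLS:
    name, stable = describe(cp)
    print(f"U+{cp:04X} {name} (unchanged by NFKC: {stable})")
```

None of these is folded into another by normalization, so a
spoken description such as "star" or "heart" cannot identify a
single code point.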
We just do not need these symbols. They are serious threats to
interoperability and to clear descriptions of domain names.
By the way, while one could have a discussion about motivation,
there are several registered domain names, appearing in TLDs,
that use characters that the "no symbols" rule excludes. See
the comments under Issues-1 (ii) above.
> 11. IANA Considerations
> 11.1. IDNA Permitted Character Registry
> The distinction between "MAYBE" code points and those
> classified into "ALWAYS" and "NEVER" (see Section 5)
> requires a registry of characters and scripts and their
> categories. IANA is requested to
> Expecting an IANA registry to maintain this is setting it up
> for failure. If this were to be done, precise and lengthy
> guidance as to the criteria for removing characters (moving
> to NEVER) would have to be supplied, because of the
> irrevocable nature of this step. The odds of a registry being
> able to perform this correctly are very small.
We do not expect IANA to make the decisions, only to maintain
the registry (IETF rarely expects IANA to make decisions about
protocol registries any more, so this is not an exceptional
case). We are working on procedures but, of course, this
interacts with the discussion about "script communities"
> The best alternative would be to simply have all the
> non-historic scripts have the same status in
> nternet-drafts/draft-faltstrom-idnabis-tables-03.txt>, by
> moving the non-historic scripts to the same status as Latin,
> Greek, and Cyrillic.
Only if one believes that the IDN implications of using all of
those scripts, and the restrictions that need to be applied,
are as well understood as they are for Latin, Greek, and
Cyrillic. Some of the language communities (and I mean
"language" here and not "script") do not believe that.
> The second best would be to have the Unicode consortium make
> the determinations (and take the heat for objections).
I'd like to put that discussion off to another time, if only in
the interest of getting these notes out to you. But some of
those language communities have made it clear that while, for
example, they might trust determinations by ISO/IEC JTC1/SC2
and its formal approval processes linked to National Member
Bodies, they are strongly disinclined to ascribe similar
authority to the Unicode Consortium. However misguided that
attitude might be, I don't believe the IETF is able (or
willing) to take the heat that would result from handing this
over to the Unicode Consortium.
A different way to look at that situation is that there is some
feeling in some of the language communities that an independent
check and signoff is needed on whether the way that the Unicode
Consortium has handled a particular script is adequate for IDN
use. In that context, having the Unicode Consortium review its
own work would obviously not be considered appropriate.
> Some specific suggestion
> about identification and handling of confusable characters
> appear in a Unicode Consortium publication [???]
> Use: [UTR36]
> UTR #36: *Unicode Security Considerations*
Ack. Thanks. Actually, my apologies -- I knew the
reference and had intended to go back and insert it before
posting the draft, then slipped up. This has been fixed in the
working text for -06 since a _very_ short time after -05 was
Again, thanks for the comments. This is a difficult process,
but one that certainly results in a better specification.