IDNNever.txt

John C Klensin klensin at jck.com
Sat Feb 10 19:03:07 CET 2007


--On Friday, 09 February, 2007 23:28 +0100 Avri Doria
<avri at acm.org> wrote:

>...
> In the meantime, I am not trying to burn anything, just trying
> to understand the reason for such decisions.  Are they
> technical?  policy? something else altogether?  Is it the
> single script per label rule?  A homograph problem?
> 
> If this isn't the right place-time to ask such questions,
> where and when is?  If making the answer public is too risky,
> can someone give me the right clue privately (though I don't
> know why we should ever be afraid to talk of the reasons for
> things in public).  Is the reason perhaps documented in some
> i-d/rfc somewhere? I think it certainly should be if the
> question is so dangerous.
> 
> Just looking for a clue on why the strict
> prohibition/avoidance.  I got to the question in a quest to
> look for answers for non literate users - an extension of the
> ML problem and if there is a technical answer i would like to
> know.

Avri,

My apologies for not getting this out sooner; I ran out of steam
last night.

Like everything else related to IDNs, this is an issue that
involves balancing complex tradeoffs.  Almost nothing is
entirely technical, almost nothing is entirely policy, and that
list could go on.  Most of the technical and policy constraints
aren't really about IDNs at all, but about the fundamentals of
the things we have to work with -- including problems and
constraints that are thousands of years old.  There is nothing
"risky" about the answers (or the questions), but there is a
problem when people make simplistic assumptions or assertions
without a willingness to dig in and understand the issues.

That complexity accounts both for the length of this note and
for the fact that you haven't seen lots of explanations floating
around.  Speaking for myself, I've seen lots of evidence that
many of the people who are asking the questions --especially
those who ask them belligerently, assuming some conspiracy
against some interest of theirs-- are unwilling either to
understand the details of how IDNs work and why or to follow any
explanation that is more than a sentence or two long.  I know
you are not in that category, but I do see a general pattern.

Anyway, let me see if I can paint a picture...

* First, there is a tension between "things that people might
want to do" and a DNS-based referencing system that is stable,
predictable, and usable by all of the Internet's users.  As you
know, the DNS doesn't care what is put in it: ultimately strings
of bits go in, strings of bits come back out, and there is no
question about what matches and what doesn't.   Applications do
care, and are often very sensitive about characters and syntax.
That is another problem for punctuation and symbols but one I'll
skip to prevent this note from becoming even longer.  If one
narrows the questions to what can be placed in a DNS zone file
and the queries needed to retrieve those things, and assumes
that the "users" are computers that care about bit patterns
rather than characters, glyphs, scripts, and languages, then
things get very easy... but the DNS becomes
nearly useless for users.  If, by contrast, our concern is
integrity of references, there is a strong case to be made for a
very narrow set of rules: probably sticking with ASCII and maybe
eliminating a few ASCII characters that are now permitted.  As
far as I know, no one sane is suggesting either extreme
position, but that is where things start to get tricky.

* Second, Unicode is not perfectly optimized for IDNs.  It would
be a miracle if it were, given that it predates the current
generation of IDN efforts by well over a decade and, more
important, because it must meet the needs of a broad range of
applications.  I'm convinced that, in general, the Unicode
Consortium did about as good a job overall as could have been
done.  But the net result is many little issues that must
either be accepted or worked around, and either choice has an
impact on IDNs.

Perhaps an example will help with this -- as usual, I'm picking
Roman-character-based examples when I can because they are
easier to talk about in English: there are equally difficult (or
"interesting") problems with almost every script.  There are
nominally two ways of encoding a lower-case "o" with two dots
above it.  One is as a single character, which Unicode calls
Latin Small Letter O with Diaeresis (U+00F6).  The other is as
Latin Small Letter O (U+006F) followed immediately by Combining
Diaeresis (U+0308).  The typical user, on a typical system,
doesn't have a lot of control over which encoded form she gets
under normal circumstances: one will be easy to key in, the
other will be a more or less major annoyance.  The normal
Unicode solution to the little difficulty of having two ways to
code the same thing involves application of a process called
"normalization": using the normalization technique applied by
IDNA, the combining sequence is turned into the single-character
precomposed form and that form (in further-encoded form) is all
that is ever stored in the DNS.  The two-character sequence is
never stored, but queries including it will match.   
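
For those who would like to see this concretely, here is a
minimal sketch in Python (my choice of language and names;
nothing in IDNA requires it).  Nameprep, the IDNA2003
preparation step, uses Unicode normalization form KC, which for
this pair behaves exactly as described:

    import unicodedata

    precomposed = "\u00F6"   # LATIN SMALL LETTER O WITH DIAERESIS
    combining = "o\u0308"    # LATIN SMALL LETTER O + COMBINING DIAERESIS

    # The two encodings are different sequences of code points...
    print(precomposed == combining)   # False
    # ...but normalization composes the sequence into one character.
    print(unicodedata.normalize("NFKC", combining) == precomposed)   # True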

However, the important part of this example is that I can easily
imagine people who would "like to" treat the two forms
separately.  They might want to render them differently or use
the distinction for other purposes.  I imagine that my Unicode
Technical Committee colleagues (several of whom are on this list
and will correct me if I'm incorrect) would consider that wrong,
a serious violation of the spirit of the standard, or perhaps
just plain dumb.   But Unicode is, ultimately, just a coding
system and those who write and ratify standards typically don't
get to control how they are used.  So IDNA2003 forces
normalization in order to treat two ways of coding the same
character-abstraction (as Unicode defines it) as the same.  The
other side of the same decision is the consequence that no one
can distinguish between the two coding forms in IDN/DNS use,
even if they want to and even if they can do so in plain text on
the same system.    
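
Concretely again, and only as an illustration: Python's built-in
"idna" codec implements IDNA2003, and the time-worn "bücher"
example (mine, not anything from a real registry) shows the two
coding forms collapsing into one ACE label:

    precomposed = "b\u00FCcher"   # ü as a single code point, U+00FC
    combining = "bu\u0308cher"    # u followed by COMBINING DIAERESIS

    print(precomposed.encode("idna"))   # b'xn--bcher-kva'
    print(combining.encode("idna"))     # b'xn--bcher-kva' -- identical in the DNS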

Is the decision to normalize a technical decision or a policy
one?  It is easy to argue "technical" because, if one treats the
two coding forms as really being the same character, then making
them match is required technically and the normalization
mechanism is certainly a technical one.  But, somewhere down
there is the conclusion that treating the two coding forms
differently is not acceptable.  And that, in some ultimate
sense, is a policy decision.

I would also recommend that people read and study Chapter 2 of
The Unicode Standard (same Chapter number in versions/editions
3.0, 4.0 and 5.0, but the explanations get better so I would
recommend the most recent version if feasible) to better
understand the backdrop against which all of this is being
carried out.   Note that the "design principles" discussion goes
on for several pages (much longer than even this long note) and
that the ten principles listed are neither orthogonal nor
prioritized in a clear way.  The decisions about a particular
character or code point are ultimately made on what most of us
would consider a technical basis.   However, whether those
decisions are influenced or controlled by underlying policy
decisions could be the subject of a long and interesting
philosophical debate.

* Part of our difficulty --and ultimately another decision-- is
that one cannot, in practice, treat Unicode one character at a
time.  In one version of a perfect world, we would find a Grand
Authoritative Committee, learned in DNS and referencing issues
and in all of the languages and scripts of the world and have
them go through Unicode, examining each of the nearly 100K
characters and then all of the pairings of those characters (for
confusability detection) and most or all of the combinations
with adjacent characters (to sort out issues of combining,
ligatures, presentation forms and so on) and make decisions
about what should be permitted in IDNs.  In practice, we could
never find such a committee and, if we could, it would probably
take a generation for them to complete their task... with
Unicode changing just enough during that time that they would
need to start over before they were finished.   However, if one
is not going to do that, then the tools at hand involve making
decisions about groups of characters based on scripts,
properties, and other grouping mechanisms.  I don't know whether
the decision to avoid the Grand Committee path because we would
rather have decently-working IDNs in our lifetimes is a policy
issue or a technical one.  But the answer is clear enough to
those of us who care about IDNs that no one has launched a
formal policy or consensus process to define that committee in
the IETF, ICANN, or anywhere else.
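
To make "grouping mechanisms" slightly more concrete, here is a
small sketch (Python again, with my own choice of sample
characters) of the per-code-point properties on which such group
rules can be built; the general category is one of them:

    import unicodedata

    for ch in ["a", "\u00F6", "7", "-", "$", "\u263A"]:
        print("U+%04X %s: %s" % (ord(ch), unicodedata.name(ch),
                                 unicodedata.category(ch)))
    # Letters report L* categories, digits Nd, and symbols or
    # punctuation S* or P* -- raw material for rules about groups
    # of characters rather than about characters one at a time.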

Another aspect of what I think of as the same problem (others
would categorize it differently) is that there is a huge
advantage in all of this if we can end up with a fairly simple
set of rules.  Even if the description of the decisions about
what to permit and what to exclude is complex, almost everyone
benefits if users and registrants can ask a question about what
is actually permitted to be encoded in the DNS and get a clear
and simple answer.  With IDNA2003, the answer is often "no one
knows but the programs -- you can either run one of the programs
and trust it or simulate the program via the tables and stated
algorithms themselves".  Whether you consider that answer
"policy" or "technical", it has caused a good deal of confusion
as I think most registrars and registries who have taken IDNs
seriously can tell you.  Of course, as elsewhere with the DNS,
there are those who derive economic benefits from confusion: the
decision to not make their interests primary is ultimately a
policy one and, if made the other way, would change many things,
not just about IDNs.  There is also a purely technical reason
for simplicity: we have years of experience with the Internet
(and perhaps life in general) to tell us that things that can be
specified in simple and clear ways are typically implemented
well and interoperate smoothly while ones that involve complex
definitions and many special cases often lead to inconsistent
implementations and opportunities for incompatibilities and
security exploits (most of the latter are not specifically
predictable in advance).
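
The "run one of the programs and trust it" answer can be made
quite literal.  A hypothetical helper (mine, not part of any
specification) that uses Python's IDNA2003 codec as the oracle:

    def idna2003_accepts(label):
        # Ask the implementation whether the label survives ToASCII.
        try:
            label.encode("idna")
            return True
        except UnicodeError:
            return False

    print(idna2003_accepts("b\u00FCcher"))   # True
    print(idna2003_accepts("a\uFFFDb"))      # False: U+FFFD is prohibited output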

* Because of the properties of Unicode and, more important, the
characteristics and differences in the wide varieties of writing
systems in the world, it is impossible to end up with a rule as
simple as "letters, digits, and the hyphen, and the hyphen
cannot come first or last" (that, too, was a mixed policy and
technical decision).   Mixtures of right to left and left to
right scripts require special treatment and the details of the
rules are not simple: the ones in IDNA2003 are wrong in a small
detail that essentially excluded words from several languages
entirely and unreasonably restricted others.  There are some
markers in Unicode that would be invisible ("zero-width
characters") if they stood alone but that are critical for
getting correct spelling and rendering in certain scripts.  They
were excluded entirely in IDNA2003; it is clear that we are
going to need more complex rules for technical reasons if we
make the obvious policy decision that we don't want to write off
those languages.
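
The zero-width case is easy to demonstrate.  Nameprep maps ZERO
WIDTH NON-JOINER (U+200C) to nothing, so a label that needs it
for correct spelling silently loses it (a synthetic example; the
real casualties are words in Persian and other cursive-script
languages):

    with_zwnj = "foo\u200Cbar"
    print(with_zwnj.encode("idna"))   # b'foobar' -- the ZWNJ is simply gone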

But seeking to include cases that "someone might want for
something someday" at the price of significant additional
complexity in explanation and determinations of what is
permitted appears to be a very bad tradeoff for both technical
and policy reasons.  That statement is not an absolute; the
question is where to draw the line.

* Almost all of the discussion about including non-ASCII
characters in the DNS and using them on the Internet --
discussion that goes back, admittedly in superficial form, to
the "host name" debates of the 70s as well as to the more recent
discussions of "multilingual names" -- has focused on "words" in
assorted languages.  The details of that
focus are actually incorrect: the DNS has never been restricted
to "words" and many important domain names aren't words.  That
is one of several reasons why the IETF dropped "multilingual" in
favor of "internationalized" in talking about IDNs.  But, even
with that correction, we have focused on word-like strings for
labels.  That is mostly a policy decision, but, like the LDH
rule, there are strong technical considerations underlying it.
Because of other constraints -- both technical and policy -- one
will never, as Vint puts it, be able to write the Great <insert
country name here> Novel in a DNS label.   It appears fairly
clear that we should not try.  "Word-like strings", however,
gives us significant help with the simplicity of explanation
problem: if we restrict ourselves to what have been called
"language characters" in a long series of policies, guidelines,
and statements since 2003, we have a rule about what is
permitted that, while it requires additional qualification,
people can generally understand.

* We also know that, where it is possible to do accurately and
well, checking at or before retrieval time (query or lookup time
in DNS-speak) improves user interfaces and quality of
information for users.  It also impacts the quality of User
Interface designs in very significant ways.  Those are
well-understood to be technical facts (in different technology
areas).  Whether to apply them at the UI level is equally
well-understood to be a policy problem, a policy problem that
cannot be, and fortunately does not need to be, resolved
globally.

* That, at last, brings us to the symbols and punctuation.   By
definition, they are not used to write words in any of the
world's languages.  If a particular character is used inside
individual words, but is identified as a symbol or punctuation
(a description that slightly oversimplifies their categories,
but this note is already too long), then it is miscategorized
and we have a different problem: whether it is important enough
to push us further along the path of rules and exceptions about
individual characters (see above) and, if so, what additional
considerations apply to it.  In addition to the "language
character" rule, there are additional reasons to exclude many
symbols and punctuation marks.  For example, the names of those
characters often vary much more widely than names of letters.
There is no standard way to pronounce them or treat them as
phonemes (unlike letters in many writing systems) and that, in
turn, makes both description of them (e.g., in databases and
legal systems) and their use in, e.g., text-to-speech systems
for those who have trouble reading type on computer screens,
devilishly difficult.  Others can cause issues with
how the DNS is actually used.  For example, there are lots of
dot-like things around.  If they are permitted, which ones are
going to be treated as equivalent to the period that is used to
separate domain name labels in all contexts except DNS entries
and queries themselves?   How will they, and the ones that are
not treated as equivalent to them, interact with the fact that
--as far as the DNS itself is concerned-- period is a perfectly
valid character inside a label?    Now certainly some symbols
and punctuation are completely safe and someone would want to
use them for something.    But the determination is more subtle
than might appear.  For example, coming back to the descriptive
or reading issue above, one can't talk about the "smiley face"
symbol without knowing whether or not Unicode incorporates
different sorts of smiley faces: white on black as well as black
on white? the different types of smiles that we conventionally
portray as ":-)" versus ":-}"?  And so on.  One would need to
know what Unicode might add in the future as well, which is
either unknowable or requires reaching binding agreements with
them about what they will or will not add, no matter what other
pressures they are under.  In addition, permitting the
smiley-face symbols implies that UI designers will need to
figure out whether to treat 
     happiness:-).info
as an ASCII string (and hence a plain error) or as a convention
for
     happiness<smiley-character>.info
which would then be a valid IDN.   Each step in the direction of
such character-by-character analysis moves us closer to needing
the Grand Authoritative Committee to examine and decide about
those characters.  Each step also makes explanation of the rules
far less simple, which takes us back to many of the issues and
tradeoffs discussed above.
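
The dot question is not hypothetical, either: IDNA2003 already
designates three non-ASCII dots (U+3002, U+FF0E, and U+FF61) as
equivalent label separators, and Python's codec implements
exactly that:

    print("example\u3002com".encode("idna"))   # b'example.com'
    # U+3002 IDEOGRAPHIC FULL STOP acts as the label separator.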

Nothing in that discussion has anything to do with pictorial
languages.  If such languages are identified and the Unicode
consortium decides to code them, then the characters of which
they consist will be "language characters" in the sense in which
I use the term above.  They will presumably be identified in
Unicode with the properties appropriate to such characters.
They are not an issue here.

* That isn't all, but perhaps it is enough... and this note is
already far too long.

* ---

Many years ago, people concluded that, with the sole exception
of one intra-label separator ("-"), symbols and punctuation
should be banned and what became DNS "names" limited to the LDH
set.  The reasons were parallel to some of those identified
above although shifting from a 127-code point repertoire to one
that potentially contains 2**21 code points creates many new
issues and makes old ones more difficult.  In recent years,
observation of actual practice with the DNS has produced
evidence that even the hyphen was probably unnecessary and that
it has become an opportunity for confusion.   During all of
that time, there were people who would have liked to see "+", or
"$", or "%" or "?" in the DNS and would have found uses for them
that they would have considered interesting and valuable.  There
is no evidence that the functionality of the ASCII DNS has been
reduced by their exclusion, even if we don't know how a debate
about universal happiness would have worked out. 
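
For reference, the whole of that decades-old rule fits on one
line.  A sketch (the regular expression is mine; the 63-octet
label limit comes from the DNS specifications, the rest is the
LDH rule itself):

    import re

    LDH_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")

    print(bool(LDH_LABEL.match("example")))    # True
    print(bool(LDH_LABEL.match("-example")))   # False: hyphen may not come first
    print(bool(LDH_LABEL.match("ex+ample")))   # False: "+" was never admitted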

The ultimate answer to "why no symbols or punctuation" is the
entire explanation above (plus some things I've left out for
brevity), added together to produce a judgment call about the
tradeoff.   The judgment call says that, clever ideas that some
people might have, or want to have, notwithstanding, no case has
been made for the value of these things that overcomes the
descriptive problems, the character-by-character identification
problems, the added complexity in rules and risks of
incompatible programs, the loss of pre-query checking, the
additional opportunities for bad acts (spoofing and what are
improperly called homographs (see RFC 4290) are a tiny part of
that issue), the inevitable delays in getting the important,
word-like, part of IDNs on a really solid footing
internationally as these characters are examined case-by-case,
and so forth.    

Is that about policy or is it technical?  As I trust anyone who
has understood the above realizes, the two are not completely
separable, but the reasons are quite compelling either way.   If
it is policy, it is policy that is as technically constrained
--and constrained by the history of writing systems as well-- as
the value of Pi or the speed of light in a vacuum.  One could go
off and take a vote and discover that an overwhelming fraction
of the Earth's population would consider it much more convenient
if Pi were 3.0 (I know I would).  But that would not change the
underlying value; it would just create states that are silly and
sometimes dangerous.  Whether to talk about the circumference of
a circle in terms of radius or diameter and whether to measure
it in feet or meters, are policy questions.  But the value of Pi
cannot be a policy question and those who want to evaluate or
change it need to understand the mathematics first.  Roughly the
same issues apply to many of these IDN decisions: I believe that
policy decisions should be made in policy arenas, probably as
strongly as you do.  But, in an area as intertwined with
technical (and even cultural and perceptual) constraints as this
one, the belief that everyone who has an opinion, even people
with no knowledge of the constraints, should get to "vote",
leads to conclusions about voting on the value of Pi... and to
equally silly results.   And the more people we have who make
strong statements about what should happen with IDNs but who are
not willing to understand how IDNA and the DNS work, or to do
the work needed to understand issues, tradeoffs, and constraints
such as those outlined above, the more we will face a choice
between pushing the decisions toward the technologists and
accepting delays before we have an IDN system that is really usable
for all of the world's population... keeping in mind that every
single time an end-user needs to see Punycode is a step away
from that goal that we both consider vitally important.

regards,
    john

p.s. If you are looking for solutions for non-literate users, I
think there are some very good possibilities out there and would
be happy to point you to the ones I know about.  But I don't
believe that IDNs and the DNS are the right place to look for
answers unless your idea of how to deal with illiteracy is to
teach completely new languages based, possibly, on pictographic
or semi-pictographic scripts.  If you do the latter, then you
should be planning a conversation with UTC about coding your new
characters.  But that approach has rarely been very useful in
the past, primarily because what you would be teaching is not "a
system for use by non-literates" but literacy in a language that
no one else uses.... permanently disadvantaging and isolating
those populations.




