Browser IDN display policy: opinions sought

Sun Dec 18 17:45:47 CET 2011

--On Sunday, December 18, 2011 09:50 +0000 Raed Al-Fayez
<rfayez at citc.gov.sa> wrote:

> Dear All,
> 
> My name is Raed Al-Fayez I am from SaudiNIC (.sa &
> .السعودية    ccTLD Registry).
> 
> First of all I would like to thank Gervase Markham for opening
> such important issue; Also I thank everyone who have
> contributed in its discussion.
> 
> Please allow me to share with you our opinions and thoughts
> regarding "Browser IDN display policy":
>...

Raed,

Thanks for your careful and thoughtful note.  In the hope of
encouraging more discussion and understanding, I'd like to
clarify some things that often get in the way of clear thinking:

Equating the pre-IDN domain name policies, and the hostname
policies that preceded them, with "English" or "ASCII" is
convenient but misleading.  The decisions about allowed
characters and the structure of names, going back to the 1970s,
were made in order to have a character repertoire that was
easily and accurately mapped back and forth among coded
character sets.  Many characters that might have made sense were
excluded because they did not appear the same way in the
conventional glyphs associated with different character sets and
codings, and others because they were inherently confusable when
written in different ways (the controversial decision to exclude
spacing underscore ("_") is the example I remember most vividly,
although there were other issues with that character).  Non-Latin
scripts were excluded either because they had too many distinct
characters or glyph forms to represent easily or because properly
representing them in the 6 and 7 bit codes of the day required
multi-"byte" sequences.  Some of them could be represented
reasonably well in 8 bits, others could not, but 8 bit codes
were not universally available.

Even the decision to make upper case and lower case Latin
characters match --which many of us believe in retrospect to
have been a mistake-- was driven in part by the recognition that
there were systems attached to the network that were single-case
only (usually upper, and often because of histories in six-bit
coding systems).

There were also considerations about characters used in common
operating system command lines, but they didn't dominate.  For
example, Multics and later Unix used "-" as a command argument
introducer, but it is permitted in domain names (although not as
a leading character, partially for that reason).

So we ended up with undecorated Latin characters with one
widely available in-label separator character (hyphen-minus in
Unicode-speak) and one label separator (period or dot).  Yes,
the DNS uses ASCII, but that is ultimately an "on the wire"
convention to prevent total confusion (see RFCs 20 and 5198 for
discussions of that issue).  If people wanted or needed to do
something else in the local operating system, they did.
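
For readers who find code easier to follow, the repertoire just
described can be sketched roughly as below.  This is an
illustration only, not any specification's definition: the
function names are hypothetical, the inclusion of digits is an
assumption taken from the traditional hostname rules, and length
limits and other later constraints are deliberately left out.

```python
import re
import string

# Assumed repertoire: undecorated Latin letters, digits (an assumption,
# see the note above), and hyphen-minus, with no leading hyphen.
_LABEL_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9-]*")

# Fold only the 26 ASCII upper-case letters; nothing else is touched.
_ASCII_FOLD = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def is_plain_label(label: str) -> bool:
    """Return True if the label uses only the assumed repertoire."""
    return _LABEL_RE.fullmatch(label) is not None


def same_name(a: str, b: str) -> bool:
    """Compare two dot-separated names, matching upper and lower case."""
    return a.translate(_ASCII_FOLD).split(".") == b.translate(_ASCII_FOLD).split(".")
```

With these definitions, is_plain_label("foo-bar") is accepted
while "-foo" and "foo_bar" are rejected, and
same_name("Example.COM", "example.com") treats the two spellings
as the same name.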

Could folks have started with something other than Latin?  Yes,
but it wasn't practical given the state of computer developments
at the time.  Even ignoring that, the choices are more limited
than one would think and the precedents predate the Internet and
computing by many years.  Arabic would not have worked because
the connected characters and differentiation issues that have
occupied a good deal of time within Arabic script IDN
discussions make it unsuitable for this type of
mnemonic-identifier use by those who don't already read the
script.  Most other scripts have the same issues or others.
Japanese Kana (one set or the other, not both, and certainly not
Kanji) might have done the job.  Cyrillic might have worked if
it had been possible to pick an acceptable subset.  Greek would
not have worked due to some of the same matching issues that now
contribute to variant discussions.  And there were very few
scripts for which there were stable coding standards by the time
these specifications started to solidify.

Anyway, please don't assume that all of these decisions were
made out of ignorance or indifference to other scripts or
codings.  The issues associated with internationalization were
very much discussed and considered, long before the DNS.  It
wasn't practical to do anything much different from what was
done -- I note that the ITU made essentially the same decisions,
agreeing on nearly the same set of characters, in the design of
protocols that use letter-based keywords and parameters -- and a
great deal of the decision-making was about differentiability
and existing international practices, not about either ASCII or
English.

    best,
    john