Browser IDN display policy: opinions sought

Tue Dec 20 06:31:35 CET 2011

On 2011/12/19 1:45, John C Klensin wrote:

> Thanks for your careful and thoughtful note.  In the hope of
> encouraging more discussion and understanding, I'd like to
> clarify some things that often get in the way of clear thinking:
>
> Equating the pre-IDN domain name policies, and the hostname
> policies that preceded them, with "English" or "ASCII" is
> convenient but misleading.

For better or worse, that's often the easiest way to refer to them.
While the technical choices you describe below are not, strictly
speaking, tied to either ASCII or English, they are nevertheless
very strongly related to both, at least indirectly. Most computers
of that time were developed in and for English-speaking countries,
and the same applies to the various encodings (from 6 bits upwards)
that went with them.

Anyway, the technical choices you describe below may very well
have been appropriate 30 or 40 years ago; I don't think anybody is
challenging that. What matters are the technical choices today.
Neither 6-bit encodings nor multibyte character encodings (incl. UTF-8)
nor most of the other aspects you describe below are relevant today, nor
are any of them a reason for using Punycode rather than Unicode to
display IDNs.
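
To make the two display forms concrete, here is a minimal sketch of
converting a name between its Unicode form and its Punycode ("xn--")
form. It assumes Python's built-in "idna" codec, which implements the
older IDNA2003 rules rather than IDNA2008, so it is an illustration
only. The helper names to_a_label and to_u_label and the sample name
are made up for this example.

```python
# Illustrative only: Python's built-in "idna" codec follows IDNA2003
# (RFC 3490), not IDNA2008. The helper names are hypothetical.

def to_a_label(name: str) -> str:
    """Return the ASCII ("xn--") form of a domain name."""
    return name.encode("idna").decode("ascii")


def to_u_label(name: str) -> str:
    """Return the Unicode form of a domain name given in ASCII form."""
    return name.encode("ascii").decode("idna")


if __name__ == "__main__":
    unicode_name = "bücher.example"
    ascii_name = to_a_label(unicode_name)
    print(ascii_name)              # form used on the wire
    print(to_u_label(ascii_name))  # form a browser could display
```

The question in this thread is only which of the two forms the user
gets to see; the name on the wire is the same either way.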

Regards,   Martin.

> The decisions about allowed
> characters and the structure of names, going back to the 1970s,
> were made in order to have a character repertoire that was
> easily and accurately mapped back and forth among coded
> character sets.  Many characters that might have made sense were
> excluded because they did not appear the same way in the
> conventional glyphs associated with different character sets and
> codings, others because they were inherently confusable when
> written in different ways (the controversial decision to exclude
> spacing underscore ("_") is the example I remember most vividly,
> although there were other issues with that character).  Non-Latin
> scripts were excluded either because they had too many
> characters to represent easily or because properly representing
> them in the 6 and 7 bit codes of the day required multi-"byte"
> sequences, and sometimes because they just had too many distinct
> characters or glyph forms.  Some of them could be represented
> reasonably well in 8 bits, others could not, but 8 bit codes
> were not universally available.
>
> Even the decision to make upper case and lower case Latin
> characters match --which many of us believe in retrospect to
> have been a mistake-- was driven in part by the recognition that
> there were systems attached to the network that were single-case
> only (usually upper, and often because of histories in six-bit
> coding systems).
>
> There were also considerations about characters used in common
> operating system command lines, but they didn't dominate.  For
> example, Multics and later Unix used "-" as a command argument
> introducer, but it is permitted in domain names (although not as
> a leading character, partially for that reason).
>
> So we ended up with undecorated Latin characters with one
> widely available in-label separator character (hyphen-minus in
> Unicode-speak) and one label separator (period or dot).  Yes,
> the DNS uses ASCII, but that is ultimately an "on the wire"
> convention to prevent total confusion (see RFCs 20 and 5198 for
> discussions of that issue).  If people wanted or needed to do
> something else in the local operating system, they did.
>
> Could folks have started with something other than Latin?  Yes,
> but it wasn't practical given the state of computer developments
> at the time.  Even ignoring that, the choices are more limited
> than one would think and the precedents predate the Internet and
> computing by many years.  Arabic would not have worked because
> the connected characters and differentiation issues that have
> occupied a good deal of time within Arabic script IDN
> discussions make it unsuitable for this type of
> mnemonic-identifier use by those who don't already read the
> script.  Most other scripts have the same issues or others.
> Japanese Kana (one set or the other, not both, and certainly not
> Kanji) might have done the job.  Cyrillic might have worked if
> it had been possible to pick an acceptable subset.  Greek would
> not have worked due to some of the same matching issues that now
> contribute to variant discussions.  And there were very few
> scripts for which there were stable coding standards by the time
> these specifications started to solidify.
>
> Anyway, please don't assume that all of these decisions were
> made out of ignorance or indifference to other scripts or
> codings.  The issues associated with internationalization were
> very much discussed and considered, long before the DNS.  It
> wasn't practical to do anything much different from what was
> done -- I note that the ITU made essentially the same decisions,
> agreeing on nearly the same set of characters, in the design of
> protocols that use letter-based keywords and parameters -- and a
> great deal of the decision-making was about differentiability
> and existing international practices, not about either ASCII or
> English.
>
>      best,
>      john