Updating RFC 5890-5893 (IDNA 2008) to Full Standard

Thu Nov 15 19:08:00 CET 2012

--On Thursday, November 15, 2012 08:28 -0800 Anne van Kesteren
<annevk at annevk.nl> wrote:

> What is an "IUser"? Also, what other than "a" (U+0061) would
> "Ａ" (U+FF21) map to? Host names have been case-insensitive
> from the start, the Turkish I is not going to change that.

Anne,

Statements like "Host names have been case-insensitive from the
start" are precisely where i18n design decisions start wandering
into the swamp.  From the start, host names have been basic,
undecorated, "Latin" characters, coded in [seven bit] ASCII and
transmitted over the Internet with a leading zero on each octet.
Worse, while the DNS (server) case-matching rules provide and
support case-insensitive matching for ASCII, if one were to code
UTF-8 or ISO 8859-1 (or ISO 8859-anything else) into the DNS,
octets whose high bit is one are compared without any adjustment
for case, i.e., case-sensitively.
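
A minimal sketch of the matching behaviour described above, in Python.
The function name dns_octets_equal is hypothetical, and the code only
models the comparison rule (fold the ASCII letters A-Z, leave every
other octet alone); it is not taken from any DNS server.

```python
def dns_octets_equal(a: bytes, b: bytes) -> bool:
    """Compare two labels octet by octet, folding only ASCII letters."""
    def fold(octets: bytes) -> bytes:
        # Only A-Z (0x41-0x5A) are mapped to a-z; octets with the
        # high bit set (>= 0x80) pass through unchanged.
        return bytes(o + 0x20 if 0x41 <= o <= 0x5A else o for o in octets)
    return fold(a) == fold(b)

ascii_match = dns_octets_equal(b"EXAMPLE", b"example")
latin1_match = dns_octets_equal("É".encode("latin-1"), "é".encode("latin-1"))
utf8_match = dns_octets_equal("É".encode("utf-8"), "é".encode("utf-8"))
print(ascii_match, latin1_match, utf8_match)  # True False False
```

Under this rule "É" and "é" never match, whether coded in ISO 8859-1
or UTF-8, because none of their octets fall in the A-Z range.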

"A" is just an example, but the problem is that in some
languages and locales, not only does "a" (U+0061) upper-case to
"A" (U+0041), but so does "á" (U+00E1).  In other languages and
locales, á upper-cases to "Á" (U+00C1) -- it just depends.
And that makes it plausible that the lower case form of "A" can
be either "á" or "a" and possibly some other things.
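
The non-uniqueness can be sketched with two small upper-casing tables.
Both tables, and the helper lower_candidates, are hypothetical
illustrations of the two conventions mentioned above; they are not
drawn from any real locale database.

```python
# Convention 1 (assumed): accents are kept when upper-casing.
UPPER_KEEP_ACCENT = {"a": "A", "á": "Á"}
# Convention 2 (assumed): accents are dropped when upper-casing.
UPPER_DROP_ACCENT = {"a": "A", "á": "A"}

def lower_candidates(upper_map: dict, upper_char: str) -> list:
    """Return every lower-case character that upper-cases to upper_char."""
    return sorted(lc for lc, uc in upper_map.items() if uc == upper_char)

print(lower_candidates(UPPER_KEEP_ACCENT, "A"))  # ['a']
print(lower_candidates(UPPER_DROP_ACCENT, "A"))  # ['a', 'á']
```

Under the second convention, lower-casing "A" has two plausible
answers, so no single inverse mapping exists.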

If one must deal with "Ａ" (U+FF21) at all (and I think that is
a UI matter, not a DNS or IDNA one), then it is pretty clear
that it should be mapped to "A" (U+0041).  But that leaves the
non-unique lower case issue above.
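
For illustration, one way a UI layer might perform that mapping is
Unicode compatibility normalization. The use of NFKC here is an
assumption; the text above does not say which mechanism should do the
mapping. A short Python sketch:

```python
import unicodedata

fullwidth_a = "\uFF21"  # FULLWIDTH LATIN CAPITAL LETTER A
# NFKC applies the compatibility decomposition of U+FF21, which is U+0041.
mapped = unicodedata.normalize("NFKC", fullwidth_a)
print(f"U+{ord(fullwidth_a):04X} -> U+{ord(mapped):04X}")  # U+FF21 -> U+0041
```

This settles only the width question. What "A" should lower-case to
is a separate decision that normalization does not make.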

It was precisely the above types of issues, their relationship
to the mapping between A-labels and native character forms being
non-unique, and the problems the latter caused that resulted in
exclusion of mapping from the base IDNA2008 protocol.

Note that none of the above depends on either dotless-i or
Jefsey's specialized perspective -- the problems are fairly
general.

> Also, the focus on end users over stability of URLs found in
> markup in elsewhere feels like a distraction. Most users, for
> better or worse, use a search engine these days to get to a
> particular domain. They no longer enter addresses in the
> address bar.

But this is exactly the problem.  Different parties are looking
at the same symptoms and drawing different conclusions.  ICANN
and its many constituencies and dependents, for example, appear
not to believe the data.  If they did, the pursuit of "delegated
variants" and other efforts to make the DNS do what search
engines do far better (and with less trouble and risk) would be
insane.

But I think we disagree on the criteria for "stability of URLs".
I spent part of my life working with identifiers in an
information retrieval context.  That leads me to want to see
identifiers --URLs included-- that are expressed in unique
canonical form and to distrust aliases and alternate forms as
just something else that can go wrong.  That principle applies
to the identifiers themselves, not navigational aids to finding
the identifiers or objects.  The same principle especially
shows up in one piece of the URL puzzle: if the comparison rule
is "exact string match", then two different URLs for an object
are a problem.  If one wants really stable, useful, URLs and
assumes that the mapping from what the user types to those URLs
will be the province of search engines and other UI aids, then
the URLs (or other identifiers for other protocols) one wants
will have domain information expressed in canonical form -- for
IDNs, A-labels -- and tail information expressed with as few
variable forms as possible.
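
A small Python sketch of why the canonical form matters under an
"exact string match" rule. It uses the standard library's "idna"
codec, which implements the IDNA2003 conversion (including its
Nameprep mapping), not IDNA2008, so it stands in here only as an
assumption for "some step that produces A-labels". The domain
bücher.example is an illustrative name.

```python
spellings = [
    "bücher.example",         # native-character form
    "Bücher.example",         # same name, different case
    "xn--bcher-kva.example",  # A-label form
]

# As raw strings, all three differ, so exact-match comparison
# treats them as three distinct identifiers.
distinct_raw = len(set(spellings))

# Converting each to A-label form collapses them to one identifier.
a_label_forms = {s.encode("idna") for s in spellings}

print(distinct_raw)   # 3
print(a_label_forms)  # {b'xn--bcher-kva.example'}
```

If only the A-label form ever appears in stored URLs, the exact-match
rule needs no knowledge of case or mapping at all.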

From that perspective, I don't see the things you are asking
for/doing as providing/preserving URL stability at all.  It
seems to me that what you are trying to preserve is the sloppy
or "clever" behavior of some early HTML authors and their
ability to be similarly sloppy or clever in the future.  Again
from my perspective, there has never been a good reason to use a
non-target form (i.e., Nameprep result under IDNA2003 or U-label
form under IDNA2008) in a URL.  As proportionately more and more
web pages are created with HTML-specific authoring tools (and
fewer with the likes of vi or emacs), the argument for not
limiting URLs to A-label form seems to diminish at at least the
same rate.

regards,
   john