Updating RFC 5890-5893 (IDNA 2008) to Full Standard
Mark Davis ☕
mark at macchiato.com
Fri Nov 16 00:56:07 CET 2012
The development of IDNA2008 was a long, painful, and frustrating process,
with a split between:
- people who were concerned with backwards compatibility and what would
happen during a migration period (such as most representatives to the
Unicode Consortium), and
- people who did not feel that it was a concern (such as John, the other
authors, and most participants in the WG).
The rough consensus of the WG was judged to be that backwards compatibility
and migration were not concerns, and that's what went into IDNA2008.
Yet for companies like mine, compatibility is rather important; we need to
ensure that URLs like http://ÖBB.at <http://xn--bb-eka.at> continue to work
as people expect, and URLs like
to a single location, rather than (depending on the browsers),
- sometimes their original pages (http://www.amt-golssener-land.de)
- sometimes new pages (http://www.xn--amt-golener-land-mlb.de)
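The two outcomes listed above can be reproduced with a short sketch. It assumes Python's built-in "idna" codec, which applies the IDNA2003 (Nameprep) mapping, and it approximates the IDNA2008 result by Punycode-encoding the label without any mapping. A real IDNA2008 implementation also performs validity checks that are omitted here, and the helper names are illustrative only.

```python
def idna2003_host(host: str) -> str:
    """Map a host name the IDNA2003 way (case folding, NFKC, then Punycode)."""
    # Python's built-in "idna" codec implements RFC 3490 (IDNA2003).
    return host.encode("idna").decode("ascii")


def unmapped_a_label(label: str) -> str:
    """Punycode-encode a label as-is, with no mapping step.

    This only approximates IDNA2008: the validity checks of RFC 5891-5893
    are not performed.
    """
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def idna2008_like_host(host: str) -> str:
    """Apply unmapped_a_label to every dot-separated label of a host name."""
    return ".".join(unmapped_a_label(label) for label in host.split("."))


# IDNA2003 folds the sharp s to "ss", so the name stays plain ASCII.
print(idna2003_host("www.amt-golßener-land.de"))
# Without mapping, the sharp s survives and the label becomes an A-label.
print(idna2008_like_host("www.amt-golßener-land.de"))
# The upper-case name is case-folded before encoding under IDNA2003.
print(idna2003_host("ÖBB.at"))
```

The same typed string therefore resolves to two different names depending on which rule set the browser applies, which is the migration problem described above.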
Not wanting to repeat that long conversation (and wade through the
typically *very* long messages on the topic), I might suggest your poking
through the archives, such as searching:
*— Il meglio è l’inimico del bene —*
On Thu, Nov 15, 2012 at 10:08 AM, John C Klensin <klensin at jck.com> wrote:
> --On Thursday, November 15, 2012 08:28 -0800 Anne van Kesteren
> <annevk at annevk.nl> wrote:
> > What is an "IUser"? Also, what other than "a" (U+0061) would
> > "Ａ" (U+FF21) map to? Host names have been case-insensitive
> > from the start, the Turkish I is not going to change that.
> Statements like "Host names have been case-insensitive from the
> start" are precisely where i18n design decisions start wandering
> into the swamp. From the start, host names have been basic,
> undecorated, "Latin" characters, coded in [seven bit] ASCII and
> transmitted over the Internet with a leading zero on each octet.
> Worse, while the DNS (server) case-matching rules provide and
> support case-insensitive matching for ASCII, if one were to code
> UTF-8 or ISO 8859-1 (or ISO 8859-anything else) into the DNS,
> octets whose high bit is one are compared without any adjustment
> for case, i.e., case-sensitively.
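The matching rule described in the paragraph above can be sketched as follows. This is a minimal illustration rather than actual server code: it folds only the ASCII letters A-Z and leaves every other octet alone, and the function names are hypothetical.

```python
def ascii_fold(octets: bytes) -> bytes:
    """Lower-case only the octets 0x41-0x5A (ASCII 'A'-'Z')."""
    return bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in octets)


def dns_labels_match(a: bytes, b: bytes) -> bool:
    """Compare two labels case-insensitively for ASCII letters only."""
    return ascii_fold(a) == ascii_fold(b)


# ASCII labels match regardless of case.
print(dns_labels_match(b"EXAMPLE", b"example"))
# The same comparison on UTF-8 or ISO 8859-1 octets is case-sensitive
# for the non-ASCII letter, because its octets have the high bit set.
print(dns_labels_match("ÖBB".encode("utf-8"), "öbb".encode("utf-8")))
print(dns_labels_match("ÖBB".encode("latin-1"), "öbb".encode("latin-1")))
```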
> "A" is just an example, but the problem is that in some
> languages and locales, not only does "a" (U+0061) upper-case to
> "A" (U+0041), but so does "á" (U+00E1). In other languages and
> locales, á upper-cases to "Á" (U+00C1) -- it just depends.
> And that makes it plausible that the lower case form of "A" can
> be either "á" or "a" and possibly some other things.
> If one must deal with "Ａ" (U+FF21) at all (and I think that is
> a UI matter, not a DNS or IDNA one), then it is pretty clear
> that it should be mapped to "A" (U+0041). But that leaves the
> non-unique lower case issue above.
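Both halves of this point can be illustrated with a small sketch. The compatibility mapping of U+FF21 uses Python's standard unicodedata module. The accent-dropping upper-case table is a hypothetical stand-in for a locale-specific rule, since Python's own str.upper() is not locale-tailored.

```python
import unicodedata

# Compatibility normalization maps FULLWIDTH LATIN CAPITAL LETTER A (U+FF21)
# to the ordinary ASCII capital A (U+0041).
fullwidth_a = "\uFF21"
print(unicodedata.normalize("NFKC", fullwidth_a) == "A")

# Hypothetical upper-casing rules for a locale that drops accents on capitals.
# This table is an assumption made for illustration, not real locale data.
ACCENT_DROPPING_UPPER = {"a": "A", "á": "A"}


def lowercase_candidates(upper: str, upper_rules: dict) -> list:
    """Return every lower-case character that the rules upper-case to `upper`."""
    return sorted(lower for lower, up in upper_rules.items() if up == upper)


# Under such rules the lower-case form of "A" is not unique.
print(lowercase_candidates("A", ACCENT_DROPPING_UPPER))
```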
> It was precisely the above types of issues, their relationship
> to the mapping between A-labels and native character forms being
> non-unique, and the problems the latter caused that resulted in
> exclusion of mapping from the base IDNA2008 protocol.
> Note that none of the above depends on either dotless-i or
> Jefsey's specialized perspective -- the problems are fairly
> > Also, the focus on end users over stability of URLs found in
> > markup and elsewhere feels like a distraction. Most users, for
> > better or worse, use a search engine these days to get to a
> > particular domain. They no longer enter addresses in the
> > address bar.
> But this is exactly the problem. Different parties are looking
> at the same symptoms and drawing different conclusions. ICANN
> and its many constituencies and dependents, for example, appear
> to not believe the data. If they did, the pursuit of "delegated
> variants" and other efforts to make the DNS do what search
> engines do far better (and with less trouble and risk) would be
> But I think we disagree on the criteria for "stability of URLs".
> I spent part of my life working with identifiers in an
> information retrieval context. That leads me to want to see
> identifiers --URLs included-- that are expressed in unique
> canonical form and to distrust aliases and alternate forms as
> just something else that can go wrong. That principle applies
> to the identifiers themselves, not navigational aids to finding
> the identifiers or objects. The same principle especially
> shows up in one piece of the URL puzzle: if the comparison rule
> is "exact string match", then two different URLs for an object
> are a problem. If one wants really stable, useful, URLs and
> assumes that the mapping from what the user types to those URLs
> will be the province of search engines and other UI aids, then
> the URLs (or other identifiers for other protocols) one wants
> will have domain information expressed in canonical form -- for
> IDNs, A-labels -- and tail information expressed with as few
> variable forms as possible.
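The exact-string-match concern can be made concrete with a rough sketch. It assumes Python's urllib.parse and the built-in "idna" codec (which follows IDNA2003 rather than IDNA2008) to put the host into A-label form. The function name is hypothetical, and ports and user information are ignored for brevity.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_host_url(url: str) -> str:
    """Rewrite a URL so that its host is expressed in A-label form.

    Sketch only: any port or user information in the URL is dropped, and
    the mapping is the IDNA2003 one provided by Python's "idna" codec.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    a_label_host = host.encode("idna").decode("ascii")
    return urlunsplit(
        (parts.scheme, a_label_host, parts.path, parts.query, parts.fragment)
    )


native = "http://ÖBB.at/"
a_label = "http://xn--bb-eka.at/"

# Two spellings of the same location fail an exact string match ...
print(native == a_label)
# ... but compare equal once both are reduced to the canonical form.
print(canonical_host_url(native) == canonical_host_url(a_label))
```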
> From that perspective, I don't see the things you are asking
> for/doing as providing/preserving URL stability at all. It
> seems to me that what you are trying to preserve is the sloppy
> or "clever" behavior of some early HTML authors and their
> ability to be similarly sloppy or clever in the future. Again
> from my perspective, there has never been a good reason to use a
> non-target form (i.e., Nameprep result under IDNA2003 or U-label
> form under IDNA2008) in a URL. As proportionately more and more
> web pages are created with HTML-specific authoring tools (and
> fewer with the likes of vi or emacs), the argument for not
> limiting URLs to A-label form seems to diminish at at least the
> same rate.