Eszett and IDNAv2 vs IDNA2008

Shawn Steele Shawn.Steele at microsoft.com
Thu Mar 12 02:28:17 CET 2009


Mark quoted:

> "c) IDNA2003 is now well established and widespread. *With a new version of*
> * IDNA we would like and would expect the situation to be backwards*
> * compatible with IDNA2003. *That is, for all practical effects: eszett
> *works* for the users and is mapped to ss." [My bolding.]

> Nor has the discussion around mapping been a waste of time either. Frankly,
> unless IDNA2008 makes changes for interoperability in lookup with IDNA2003,
> at this point it may very well be better to go in the direction of the
> IDNAv2 proposal instead. It'd be a shame to not keep the many improvements
> developed in IDNA2008, but the level of breakage between the old version of
> IDNA and new would be pretty serious.

I think that this summarizes some of my thinking as well.  Any implementation of IDN (at least at the browser level) will need to be backward compatible with IDNA2003.  So if a name doesn't work under a new version, we'll end up doing an IDNA2003 lookup as well.  In practice that means that removing existing code points (e.g., symbols) has no effect, and that characters like eszett become problematic.
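To make the "IDNA2003 lookup as well" point concrete, here is a rough sketch of what a browser-level fallback might look like.  The function name lookup_candidates and the "strict" form are assumptions made up for illustration: the strict form simply Punycode-encodes the label as typed, which stands in for a newer protocol that does not map eszett, and it skips all the validation a real protocol would do.  The IDNA2003 form uses Python's standard encodings.idna module.

```python
import encodings.idna


def lookup_candidates(label: str) -> list[bytes]:
    """Return the ASCII forms a client might try, newer form first.

    Hypothetical helper, not part of any IDNA specification.
    """
    if label.isascii():
        # Plain ASCII labels need no encoding under either scheme.
        strict = label.encode("ascii")
    else:
        # Stand-in for a newer protocol: encode the label exactly as typed,
        # with no mapping (and, in this sketch, no validation either).
        strict = b"xn--" + label.encode("punycode")

    # IDNA2003: Nameprep runs first, so eszett is mapped to "ss".
    legacy = encodings.idna.ToASCII(label)

    # Keep the order, drop duplicates.
    candidates: list[bytes] = []
    for candidate in (strict, legacy):
        if candidate not in candidates:
            candidates.append(candidate)
    return candidates


print(lookup_candidates("groß"))
print(lookup_candidates("gross"))
```

For a label containing eszett this yields two different names to query, which is exactly the case that becomes problematic; for a label both schemes agree on, there is only one candidate.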

I think that some of the goals of IDNA2008 are good, but IDNAv2 may better serve the short-term need for additional post-Unicode 3.2 code points.  I don't see any problem with pursuing both an IDNAv2 and IDNA2008, although I think that IDNA2008 and any successor would have to recognize the requirement for de facto backwards compatibility with IDNA2003.  That includes bad decisions.

That doesn't rule out extending IDN to support scenarios that were illegal in IDNA2003 (new code points, ZWJ, etc.), or doing something about the RTL behavior.

I don't think that an IDNA2003 extension would rule out solutions to the eszett or similar problems where characters were previously ignored or mapped.

I think that eszett and similar characters belong to a special group of characters that need special display behavior.  Sometimes multiple character sequences can be used.  Not necessarily linguistically correct, but that may not matter to common usage.  I was trying to find a silly example, so I looked for http://daß.de and ended up at http://dass.de, which is an acronym, sort of proving that eszett doesn't always work.  Apparently dass is supposed to be dass though, so http://www.groß.de would be a better example.

If I were a German seeking a domain, I'd want both ß and ss to be bundled since I have no clue how users are going to type it.  Regardless of whether the alternate spellings were linguistically different, it would only be reasonable for me to want to claim both names.  The difference is in usage, not where they resolve.

So for DNS resolution of the eszett, in practice IDNA2003 is "fine".  The problem is when I want to display it.  Again "http://www.groß.de" works fine in a link (browsers turn it into www.gross.de), so browsers and links can use that form.
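The IDNA2003 behavior described here can be checked with Python's built-in "idna" codec, which implements IDNA2003 (RFC 3490 with Nameprep).  This is only an illustration of the mapping, not of what any particular browser does internally.

```python
import encodings.idna

# The built-in "idna" codec applies the IDNA2003 ToASCII operation per label.
host = "www.groß.de"
ascii_host = host.encode("idna")
print(ascii_host)  # b'www.gross.de'

# The mapping step on its own: Nameprep case-folds, and eszett becomes "ss".
mapped = encodings.idna.nameprep("Groß")
print(mapped)  # gross
```

Note that the mapping is one-way: nothing in the resulting name records that the original was written with an eszett, which is where the display problem comes from.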

In my view the real problem comes when I don't know what the preferred display form is supposed to be.  If we could resolve that problem, then IDNA2003 would work fine for eszett, and likely the other issues like Greek and ZWJ as well, although I'm more familiar with eszett.  I think this is a generic problem for casing and other mappings as well.  My naïve view is that the DNS system could be provided with hints as to the preferred display form of a name.  Another field or record type or something.

In this view, there would be 3 types of labels:  A-labels and U-labels that were unique, conformed to the mapping rules and were already mapped, and a "display" Unicode label that was not necessarily fully mapped.  The display label would resolve to a valid U-label when the mappings were applied, but multiple display labels could potentially map to a single U-label.  Presumably display labels would have some restrictions (disallow control codes), but not as restrictive as fully mapped U-labels.

This would allow variations like Eszett, and formatting characters like the ZWJ to be added to a display label to control correct presentation of the domain name, yet it wouldn't impact the ability of the system to resolve U-labels/A-labels for related sequences.
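A minimal sketch of the three-label idea, assuming the IDNA2003 Nameprep mapping is the mapping that turns a display label into a U-label.  The names to_u_label, to_a_label and bundle are hypothetical, and the only display-label restriction checked is the "no control codes" one mentioned above; how a preferred display form would actually be stored in DNS is left open.

```python
import encodings.idna
import unicodedata
from collections import defaultdict

ZWJ = "\u200d"  # ZERO WIDTH JOINER, a format character (category Cf)


def to_u_label(display_label: str) -> str:
    """Map a display label to its U-label (assumed here: IDNA2003 Nameprep)."""
    for ch in display_label:
        # Assumed restriction on display labels: no control codes (Cc).
        if unicodedata.category(ch) == "Cc":
            raise ValueError("control code in display label")
    return encodings.idna.nameprep(display_label)


def to_a_label(display_label: str) -> bytes:
    """Map a display label to the A-label that actually goes on the wire."""
    return encodings.idna.ToASCII(to_u_label(display_label))


def bundle(display_labels):
    """Group display labels by the single U-label they resolve to."""
    groups = defaultdict(list)
    for label in display_labels:
        groups[to_u_label(label)].append(label)
    return dict(groups)


print(bundle(["groß", "gross", "Gross", "GROSS"]))
print(to_u_label("a" + ZWJ + "b"))  # the ZWJ is mapped away
print(to_a_label("Bücher"))
```

All four spellings land on the same U-label, so resolution is unaffected, while each display label keeps the presentation the owner asked for.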

- Shawn



