Eszett and IDNAv2 vs IDNA2008
Erik van der Poel
erikv at google.com
Thu Mar 12 18:27:54 CET 2009
I think I understand why you would want browsers to remain backward
compatible with IDNA2003 (even though MSIE7 is incompatible with MSIE6
because MSIE7 allows non-ASCII domain names while MSIE6 doesn't), but
if we were to keep the mappings for Eszett, Final Sigma, ZWJ and ZWNJ,
presumably we wouldn't be able to /add/ mappings for Greek tonos
either (for the same compatibility reasons).
This would be kinda sad, since the .gr folks are not so happy with
DNAME, but perhaps they can come up with a different mechanism to
bundle names, such as generating all of the possible names from a
single source name, and then adding all of those to the zone.
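To make the "generate all of the possible names" idea concrete, here is a minimal sketch. The VARIANTS table and the bundle() helper are made up for illustration; a real registry would define its own equivalence table and policy.

```python
from itertools import product

# Hypothetical equivalence table: each character on the left may be
# registered as any of the spellings on the right. The entries are
# illustrative only, not taken from any registry policy.
VARIANTS = {
    "ß": ("ß", "ss"),   # Eszett
    "ς": ("ς", "σ"),    # Final Sigma
    "ά": ("ά", "α"),    # alpha with and without tonos
}

def bundle(source_label: str) -> list[str]:
    """Return every spelling of source_label implied by VARIANTS, sorted."""
    # For each character, list its alternatives (or just the character itself).
    choices = [VARIANTS.get(ch, (ch,)) for ch in source_label]
    # The Cartesian product of the per-character choices is the full bundle.
    return sorted({"".join(parts) for parts in product(*choices)})

if __name__ == "__main__":
    for name in bundle("groß"):
        print(name)
```

The bundle doubles in size for every variant character in the source name, so a registry using this approach would probably want to cap the number of generated names per registration.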
I can see the arguments on both sides (removing mappings from the base
IDNA spec vs keeping the mappings), and I don't know where this WG and
the implementations are going to go, but the display preference idea
is still quite intriguing.
One approach is to transmit the info in DNS responses. Another
approach is the http://<domain-name>/idndisp.txt idea. The idndisp.txt
mechanism could even serve as a way of transitioning to the DNS approach.
The display preference cannot be encoded in the question section,
because that has to match the request exactly, including upper/lower
case. We might not be able to encode it in the answer section since
some middleware expects the answer section to use a compressed name
(which is a kind of alias for the name in the question section). We
might not be able to encode it in the additional section if some
middleware removes names that do not match any of the names in the
question, answer and authority sections.
So perhaps another idea is to encode it in the authority section,
using a different prefix such as xd-- (where d means display). This
prefix would be interpreted by new clients as a display preference
(underneath the Punycode that follows the prefix), and old clients
would just use it as an ordinary authority, with its IP address in the
additional section. I don't know whether the authority section can be
(ab)used in this way, though.
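As a rough sketch of what a new client might do with such a label: the xd-- prefix is only the proposal above, not a registered ACE prefix, and the two helper names are invented for this example. The Punycode step uses the codec that ships with Python.

```python
from typing import Optional

# Hypothetical prefix from the proposal above ("d" for display).
DISPLAY_PREFIX = "xd--"

def to_display_alabel(display_label: str) -> str:
    """Encode a display label as DISPLAY_PREFIX followed by its Punycode form."""
    # No mapping is applied first: the point is to preserve the exact
    # characters (Eszett, ZWJ, case) that the owner wants displayed.
    return DISPLAY_PREFIX + display_label.encode("punycode").decode("ascii")

def from_display_alabel(label: str) -> Optional[str]:
    """Decode a display label, or return None if the prefix is absent."""
    if not label.lower().startswith(DISPLAY_PREFIX):
        return None  # an old-style name; treat it as an ordinary authority
    return label[len(DISPLAY_PREFIX):].encode("ascii").decode("punycode")

if __name__ == "__main__":
    wire = to_display_alabel("groß")
    print(wire, "->", from_display_alabel(wire))
```

An old client would never call from_display_alabel() and would simply see an unfamiliar ASCII name, which is the backward-compatibility property the idea relies on.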
I realize that John asked us all to focus on the drafts, but, frankly,
I don't see much consensus on the mapping issue yet, so I still think
it is better to explore all possibilities.
On Wed, Mar 11, 2009 at 6:28 PM, Shawn Steele
<Shawn.Steele at microsoft.com> wrote:
> Mark quoted:
>> "c) IDNA2003 is now well established and widespread. *With a new version of
>> IDNA we would like and would expect the situation to be backwards
>> compatible with IDNA2003.* That is, for all practical effects: eszett
>> *works* for the users and is mapped to ss." [My bolding.]
>> Nor has the discussion around mapping been a waste of time either. Frankly,
>> unless IDNA2008 makes changes for interoperability in lookup with IDNA2003,
>> at this point it may very well be better to go in the direction of the
>> IDNAv2 proposal instead. It'd be a shame to not keep the many improvements
>> developed in IDNA2008, but the level of breakage between the old version of
>> IDNA and new would be pretty serious.
> I think that this summarizes some of my thinking as well. Any implementation of IDN (at least at the browser level) will need to be backward compatible with IDNA2003. So if a name doesn't work in any new version, we'll end up doing an IDNA2003 lookup as well. In practice that means that removing existing code points has no effect (e.g., symbols), and that characters like eszett become problematic.
> I think that some of the goals of IDNA2008 are good, but IDNAv2 may better serve the short-term need for additional post-Unicode 3.2 code points. I don't see any problem with pursuing both an IDNAv2 and IDNA2008, although I think that IDNA2008 and any successor would have to recognize the requirement for de facto backwards compatibility with IDNA2003. That includes bad decisions.
> That doesn't rule out extending IDN to support scenarios that were illegal in IDNA2003 (new code points, ZWJ, etc.), and something about the RTL behavior.
> I don't think that an IDNA2003 extension would rule out solutions to the eszett or similar problems where characters were previously ignored or mapped.
> I think that eszett and similar characters belong to a special group of characters that need special display behavior. Sometimes multiple character sequences can be used. Not necessarily linguistically correct, but that may not matter to common usage. I was trying to find a silly example, so I looked for http://daß.de and ended up at http://dass.de, which is an acronym, sort of proving that eszett doesn't always work. Apparently daß is now supposed to be spelled dass, though, so http://www.groß.de would be a better example.
> If I were a German seeking a domain, I'd want both ß and ss to be bundled since I have no clue how users are going to type it. Regardless of whether the alternate spellings were linguistically different, it would only be reasonable for me to want to claim both names. The difference is in usage, not where they resolve.
> So for DNS resolution of the Eszett, in practice IDNA2003 is "fine". The problem is when I want to display it. Again "http://www.groß.de" works fine in a link (browsers turn it into www.gross.de), so browsers and links can use that form.
> In my view the real problem comes when I don't know what the preferred display form is supposed to be. If we could resolve that problem, then IDNA2003 would work fine for eszett, and likely the other issues like Greek and ZWJ as well, although I'm more familiar with eszett. I think this is a generic problem for casing and other mappings as well. My naïve view is that the DNS system could be provided with hints as to the preferred display form of a name. Another field or record type or something.
> In this view, there would be 3 types of labels: A-labels and U-labels that were unique, conformed to the mapping rules and were already mapped, and a "display" Unicode label that was not necessarily fully mapped. The display label would resolve to a valid U-label when the mappings were applied, but multiple display labels could potentially map to a single U-label. Presumably display labels would have some restrictions (disallow control codes), but not as restrictive as fully mapped U-labels.
> This would allow variations like Eszett, and formatting characters like the ZWJ to be added to a display label to control correct presentation of the domain name, yet it wouldn't impact the ability of the system to resolve U-labels/A-labels for related sequences.
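The rule that a display label must resolve to the U-label once the mappings are applied can be sketched with the IDNA2003 Nameprep implementation in Python's standard library (encodings.idna.nameprep). The function choose_display() and the idea of a "hint" argument are assumptions made for this example, not part of any spec.

```python
from encodings.idna import nameprep  # stdlib IDNA2003 Nameprep (map, NFKC, prohibit)
from typing import Optional

def choose_display(u_label: str, hint: Optional[str]) -> str:
    """Return the display hint if it maps to u_label, else fall back to u_label."""
    if hint is None:
        return u_label
    try:
        mapped = nameprep(hint)
    except UnicodeError:
        # The hint contains code points that Nameprep prohibits.
        return u_label
    # Accept the hint only if applying the mappings yields the U-label,
    # so a hint can change presentation but never the resolved name.
    return hint if mapped == u_label else u_label

if __name__ == "__main__":
    print(choose_display("gross", "groß"))   # hint maps to the U-label
    print(choose_display("gross", "grob"))   # hint does not map; falls back
```

Because Nameprep maps Eszett to ss and maps ZWJ and ZWNJ to nothing, several different hints can be accepted for the same U-label, which is the many-to-one relationship described above.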
> - Shawn