mappings-01

Shawn Steele Shawn.Steele at microsoft.com
Wed Jul 8 22:36:56 CEST 2009


Let's make a REASONABLE standard that those of us building apps can actually follow to the letter.

> I think we agree about this, but I note that one of the things
> that makes it hard to get discussions about this exactly right
> is that the DNS does case independent matching for octets that
> appear to represent ASCII characters.

A random aside, but that's one thing that really bugs me.  If there is discussion of mapping, then ASCII mapping should be in the same bucket.  If we need ASCII mapping, then we probably need Unicode mapping.  If we can live without Unicode mapping, then we can live without ASCII mapping.  (I'd exempt punycode labels in my peculiar thinking because they aren't themselves the string, they merely represent the string.)

> That isn't what draft-...-protocol is intended to say.  What it
> is intended to say is that, by the time a string arrives at the
> IDNA protocol layer, it is expected to be in a form that does
> not require further mapping _and_ to be in NFC form.

I guess I don't know what an "IDNA protocol layer" is.  If I write the Unicode name of a server on a napkin at a bar for a friend, I'd call that an "IDN Name", and presumably for it to work, it would need to comply with an "IDNA protocol".  It's obviously not in NFC, and likely will be "mapped" since some people write all caps or all lower case or cursive or whatever.

I'm not sure where the IDNA protocol starts.  Obviously the actual name resolution must be covered; however, "something" has to allow me to get the name into the canonical form.  If that something isn't part of the IDNA protocol or allowed by the protocol, then it's sort of a catch-22.

>        (i) Section 5.2 ("Conversion to Unicode") and Section
...
>        (ii) The combined section could pick up some of the text

I don't see why "conversion to Unicode" even needs to be addressed.  Certainly the inputs to IDN are clearly Unicode, so I'd just cut 5.2.  I thought I saw some words I liked, but I can't find them again :(

So long as we're talking about section 5, "it is probably better for users to understand IDNs strictly in lower-case, U-label, form" is pointless wishful thinking and should be removed.  End users can't even be trained to do simple anti-phishing steps, let alone be expected to understand what U-labels are, or what something like Unicode Normalization Form C is.  Even "I" can't figure out if a string is in Form C just by looking; I'd have to do the alt-x thing to see its code points or something.
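To illustrate why Form C can't be judged by eye, here is a minimal Python sketch using the standard unicodedata module.  The helper names is_nfc and code_points are made up for this example, and the sample strings are my own, not anything from the drafts.  The two strings render identically, yet only one is in NFC.

```python
import unicodedata

def is_nfc(label: str) -> bool:
    """Return True if the label is already in Normalization Form C."""
    # Normalizing an NFC string leaves it unchanged, so compare the two.
    return unicodedata.normalize("NFC", label) == label

def code_points(label: str) -> str:
    """List the code points, roughly what the alt-x trick reveals."""
    return " ".join(f"U+{ord(ch):04X}" for ch in label)

composed = "caf\u00e9"      # e-acute as one precomposed code point
decomposed = "cafe\u0301"   # plain 'e' followed by a combining acute accent

for sample in (composed, decomposed):
    print(is_nfc(sample), code_points(sample))
```

Both samples look the same on screen; only the code point dump tells them apart, which is the point about end users above.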

What is really vague right now is "UI level" and "many cases".  Specifically I would suggest that "best practice" would be for an href to be in the canonical form.  However, in practice, when I type a URL in my blog, there's near-zero chance that I'm going to make sure I type it in correct canonical form.  Sure, it'd be nice, but it's unreasonable to expect that all, or even most, IDN links will be canonical within a link.  I would also expect that if the name MUST be canonical at the href level, that most browsers will ignore that restriction.  Certainly several browsers I've played with are already very generous in their interpretation of href.  I can't imagine how it's helpful for the standard to specify something that applications will ignore because it isn't useful.

Certainly if a user can enter a domain name from a napkin, then mapping is a MUST, not a SHOULD.  The only case where a UI level need not map would be if the user is somehow already restricted to names in a valid form (eg: picking from a list).

I would much rather see words like:

* Names MUST be in the canonical form when a DNS server gets the request.  (Note that I don't even say when making a DNS request.  Certainly nslookup with a user-entered name is a normal problem.)
* Names SHOULD be in the canonical form when interchanged by files or other protocols.  As in the href case I don't think it's practical all of the time, even if it is best practice.
* Names MAY be mapped to the canonical form by applications reasonably expecting to encounter unmapped forms, such as when processing hrefs in an html file.
* Names SHOULD be mapped to the canonical form if there's a reasonable expectation that they were entered by a human, such as in an address bar.
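
As a rough sketch only, the four rules above might look like this in an application.  Everything here is an assumption for illustration: the Source categories, the prepare function, the app_maps_markup switch, and especially to_canonical, which uses "lowercase, then NFC" as a stand-in and is NOT the actual IDNA mapping.

```python
import unicodedata
from enum import Enum

class Source(Enum):
    DNS_REQUEST = "dns_request"   # name about to reach a DNS server
    INTERCHANGE = "interchange"   # exchanged by files or other protocols
    MARKUP = "markup"             # e.g. an href in an html file
    HUMAN_INPUT = "human_input"   # e.g. typed into an address bar

def to_canonical(name: str) -> str:
    # Stand-in mapping (assumption): lowercase, then NFC.
    return unicodedata.normalize("NFC", name.lower())

def prepare(name: str, source: Source, app_maps_markup: bool = True) -> str:
    if source is Source.DNS_REQUEST:
        # MUST already be canonical: refuse instead of silently fixing.
        if to_canonical(name) != name:
            raise ValueError(f"not in canonical form: {name!r}")
        return name
    if source is Source.HUMAN_INPUT:
        # SHOULD map: a human probably typed this.
        return to_canonical(name)
    if source is Source.MARKUP:
        # MAY map: left to the application's judgment.
        return to_canonical(name) if app_maps_markup else name
    # INTERCHANGE: SHOULD already be canonical, so pass it through untouched.
    return name
```

For example, prepare("AAA.com", Source.HUMAN_INPUT) yields "aaa.com", while the same string with Source.DNS_REQUEST raises ValueError.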

I'd also like to say that they MUST be mapped if you know a human typed it; however, I imagine that some users would probably expect the form they enter in their address book to remain in the same form when they read it again.  Eg: if I type in "AAA.com", I expect my address book to say "AAA.com" when I open the record again, not "aaa.com".  Early web page editors had horrid customer experiences when they reformatted user input.  I would expect that a person compiling a report in Excel would be quite irate if the "Home Page" column that they painstakingly formatted for a presentation suddenly became all lower case after saving since it was now a "data file."
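
One way an app could reconcile the two expectations is to store what the user typed and map only at lookup time.  A minimal sketch, where AddressBookEntry and lookup_form are hypothetical names and the "lowercase, then NFC" mapping is again just a stand-in for whatever mapping the standard ends up defining:

```python
import unicodedata
from dataclasses import dataclass

def lookup_form(name: str) -> str:
    # Stand-in mapping (assumption): lowercase, then NFC.
    return unicodedata.normalize("NFC", name.lower())

@dataclass
class AddressBookEntry:
    as_typed: str  # stored and displayed exactly as the user entered it

    @property
    def for_lookup(self) -> str:
        # Mapped on demand; the stored value is never rewritten.
        return lookup_form(self.as_typed)

entry = AddressBookEntry("AAA.com")
print(entry.as_typed)     # what the user sees when reopening the record
print(entry.for_lookup)   # what would be handed to name resolution
```

The record keeps "AAA.com" for display, and only the value passed on for resolution is mapped.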

These are fundamental usability issues:  Users expect what they type to work.  They expect data to be retrieved the way they saved it.  They don't have the tools or desire to type URLs in NFC + some esoteric IDN rules.

Applications have to design to those usability concerns.  If a user's link is perceived as "broken", this working group doesn't get the feedback, the app devs get the feedback.  Customers didn't "blame" the IDN authors for the PayPal homographs, they blamed Opera and Firefox.  If we rejected all the uppercase letters in the IE address bar, users would demand a fix.

So please let's make a REASONABLE standard that those of us building apps can actually follow to the letter.  1) MUST NOT map when we're sure there's no possibility of unmapped strings, 2) MAY map where apps think they may encounter unmapped strings, and 3) SHOULD map when there's a high probability of encountering unmapped strings.

- Shawn



