IAB Statement on Identifiers and Unicode 7.0.0
John C Klensin
klensin at jck.com
Wed Jan 28 20:44:27 CET 2015
--On Wednesday, January 28, 2015 18:27 +0000 Shawn Steele
<Shawn.Steele at microsoft.com> wrote:
> Another example that the WG agreed on: I-ı-İ-i round trips
> to i-ı-İ-i the way IDN is designed, please explain to me
> how that isn't confusing? If I write the domain in block caps
> (I) it goes one place, if I write it in lower case (ı) it
> goes to a completely different place.
I'm a little confused by your example. "I" is disallowed
entirely in labels that contain non-ASCII characters. "İ"
(assuming that is U+0130, LATIN CAPITAL LETTER I WITH DOT
ABOVE), a non-ASCII and hence non-LDH character, is disallowed
entirely. So "I-ı-İ-i" (with or without the hyphens) is
simply an invalid U-label -- it doesn't round-trip to anything.
There may (or may not) be an issue with CaseFolding in your
example, but IDNA2008 doesn't use CaseFolding.
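For readers who want to see what case mapping does to the four code points in that example, here is a minimal sketch. It assumes Python 3's locale-independent `str.lower()` and `str.casefold()` (default Unicode case mapping, not the Turkic tailoring); the helper name `describe` is hypothetical. It illustrates case folding only and says nothing about what IDNA2008 permits.

```python
# The four code points from the quoted example.
chars = ["I", "ı", "İ", "i"]  # U+0049, U+0131, U+0130, U+0069

def describe(ch: str) -> dict:
    """Return the lowercase and case-folded forms as lists of code points."""
    return {
        "lower": [f"U+{ord(c):04X}" for c in ch.lower()],
        "casefold": [f"U+{ord(c):04X}" for c in ch.casefold()],
    }

table = {ch: describe(ch) for ch in chars}
for ch, forms in table.items():
    print(f"U+{ord(ch):04X}", forms)
```

Under these default rules, dotless "ı" folds to itself and "İ" folds to "i" followed by U+0307 COMBINING DOT ABOVE, so the four characters do not collapse to a single "i".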
The answer might be different with IDNA2003, but, as Vint has
most recently pointed out, making U-labels and A-labels duals of
each other, eliminating cases in which ToUnicode(ToASCII(String))
was not equal to String, was a major motivation and primary design
goal for IDNA2008.
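The IDNA2003 behaviour can be sketched with Python's built-in `idna` codec, which is documented as implementing RFC 3490 (IDNA2003) and not IDNA2008. The function name `idna2003_round_trip` is hypothetical, and the codec is only a convenient stand-in for the ToASCII and ToUnicode operations.

```python
def idna2003_round_trip(label: str) -> str:
    """Approximate ToUnicode(ToASCII(label)) with the built-in IDNA2003 codec."""
    return label.encode("idna").decode("idna")

for label in ["ö", "İ"]:
    result = idna2003_round_trip(label)
    status = "unchanged" if result == label else "changed"
    print(repr(label), "->", repr(result), status)
```

With this codec "ö" comes back unchanged, while "İ" (U+0130) is mapped during ToASCII and comes back as a different string, which is the kind of non-dual behaviour described above.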
Could you explain your example a little bit better?
> (We should've made
> these all map to "i", but we didn't). This is far more
> confusing to real people than the code points under discussion
> (and I'm not sure how Unicode character properties could help).
Remembering a plea in, I think, PRECIS in Toronto, I believe our
colleagues who are concerned about Turkic scripts would disagree
with you and, if they were not polite, might claim that, if so,
there should be a round-trip relationship between "ö" and "o".
> I think that unique identifiers that aren't possible to be
> confused are a pretty good idea. I'm pretty sure that we
> can't do that with IDN. (Or even with legacy DNS if L3-G0 is
> an example.) Maybe if we had a canonical form that mapped
> confusing things to the same thing that'd be a start, but it'd
> be as bad as punycode when you round tripped some cases, and
> would be a layer that we don't have right now.
See my longer note. Also, in the interest of being clear about
what we are talking about, "as bad as punycode" is meaningless
-- the Punycode algorithm really doesn't care about this stuff
and the issues lie entirely with U-label to A-label and back
conversions in IDNA2008 (or, if you prefer an obsolete
specification, the ToASCII and ToUnicode operations in IDNA2003).
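A small sketch of that point, assuming Python's built-in `punycode` codec (an implementation of the RFC 3492 encoding on its own, with no mapping or validation step); `punycode_round_trip` is a hypothetical helper name.

```python
def punycode_round_trip(s: str) -> str:
    """Encode to Punycode and decode again, with no IDNA processing at all."""
    return s.encode("punycode").decode("punycode")

samples = ["ı", "İ", "ö", "Iıİi"]
results = {s: punycode_round_trip(s) for s in samples}

for original, decoded in results.items():
    print(repr(original), original.encode("punycode"), decoded == original)
```

Every sample decodes to exactly the string that was encoded, and "ı" and "İ" get distinct encodings. Any mapping or rejection of these characters happens in the IDNA layer around the encoding, not in the encoding itself.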
However, one more observation may be important to this
discussion. When we look at things from a security point of
view, we usually have a maxim that false negatives are ok as
long as there are no false positives. It is a useful principle,
but trying to explain to a user that her email was undeliverable
because someone used a keyboard or locale different from the
usual one or an account was locked because the form in which a
password was typed differed from yesterday's version could be,
well, challenging. Domain names, especially in web browsers, are
easy in that regard because a mismatch typically triggers a
search process with reasonable odds of turning up the intended
name. But, for many types of identifiers and uses of them,
false negatives are _not_ ok.
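One way such a false negative can arise is sketched below, assuming two input methods that produce canonically equivalent but differently encoded strings. The variable and function names are hypothetical, and NFC normalization is only one possible comparison policy, not something the text above prescribes.

```python
import unicodedata

def same_exact(a: str, b: str) -> bool:
    """Strict code-point comparison: no false positives, but false negatives."""
    return a == b

def same_nfc(a: str, b: str) -> bool:
    """Compare after NFC normalization, so canonically equivalent input matches."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

typed_yesterday = "caf\u00e9"   # precomposed U+00E9
typed_today = "cafe\u0301"      # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(same_exact(typed_yesterday, typed_today))
print(same_nfc(typed_yesterday, typed_today))
```

The strict comparison rejects input that the user would consider identical to what was typed before, which is the failure that is hard to explain to her.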