ajs at anvilwalrusden.com
Wed Aug 13 21:23:26 CEST 2014
On Wed, Aug 13, 2014 at 12:00:54PM -0700, Markus Scherer wrote:
> > ECOLE.CA and ecole.ca will match, but école.ca does not match ECOLE.CA
> > (which is unexpected for French-from-France readers) and ÉCOLE.CA
> > doesn't work at all.
> It does with UTS #46.
UTS #46 recommends case-mapping before feeding to the IDNA subsystem.
Of course, IDNA2008 _also_ recommends that, so you don't need UTS #46
to get this behaviour. It's just not part of the protocol as such.
(See RFC 5895.)
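To make the pre-processing step concrete, here is a minimal sketch in Python, assuming only the simplest parts of the RFC 5895 mapping (lowercase, then NFC). The names map_label and to_ace are made up for illustration, and the sketch does none of the IDNA2008 validity checks (no width mapping, no bidi or contextual rules).

```python
import unicodedata


def map_label(label: str) -> str:
    """Rough RFC 5895-style mapping: lowercase, then normalise to NFC."""
    return unicodedata.normalize("NFC", label.lower())


def to_ace(name: str) -> str:
    """Map each label, then punycode-encode the non-ASCII ones.

    Illustration only: no IDNA2008 validity checks are performed.
    """
    out = []
    for label in name.split("."):
        mapped = map_label(label)
        if mapped.isascii():
            # Plain ASCII labels need no ACE prefix.
            out.append(mapped)
        else:
            out.append("xn--" + mapped.encode("punycode").decode("ascii"))
    return ".".join(out)


print(to_ace("ÉCOLE.CA"))
print(to_ace("école.ca"))
```

With the mapping applied first, both spellings come out as the same A-label form, so the upper-case variant no longer fails.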
> So what? Case folding is well-defined by Unicode and implementations are
> easily available. It's also built into the UTS #46 mapping.
Sure. "So what?" indeed, since it's also built into RFC 5895. What
RFC 5895 does not solve is the ss/ß mapping, for exactly the reason
that operators told us after IDNA2003 that the mapping approach didn't
work for them.
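The ß case can be seen in a few lines of Python. The standard library's "idna" codec follows the IDNA2003 rules (nameprep), while plain lowercasing of the kind RFC 5895 describes leaves ß untouched. This is only a sketch: lowercase_then_punycode is a hypothetical helper and performs no IDNA2008 validation.

```python
def idna2003(label: str) -> str:
    # Python's standard "idna" codec implements IDNA2003, including nameprep.
    return label.encode("idna").decode("ascii")


def lowercase_then_punycode(label: str) -> str:
    # Hypothetical helper: lowercase only, then punycode, with no validation.
    mapped = label.lower()
    if mapped.isascii():
        return mapped
    return "xn--" + mapped.encode("punycode").decode("ascii")


name = "Straße"
print(idna2003(name))                 # ß is mapped away to "ss"
print(lowercase_then_punycode(name))  # ß survives into the encoded label
```

The two results are different names in the DNS, which is the kind of difference a lowercasing step alone cannot paper over.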
> From the user's perspective, and from how browsers behave, there is no
Unfortunately for protocol designers, the Internet is not "what
browsers do". Just as a for instance, IDNA needs to work predictably
and usably with EAI, which has a completely different way of handling
internationalization than IDNA (because it had to deal with the
local-part, and that wasn't really a suitable case for an ACE). The
assumption that the only thing that counts is what browsers expect is
not a safe one.
As for the user perspective, I'm afraid I disagree. Certainly there
are lots of users who expect domain names to be case-insensitive. But
every domain name administrator is of course also a user, and they
expect "case-preserving but case-insensitive for matching". One can
make a pretty good argument that it was a serious flaw in STD 13 that
it made ASCII special and offered the case-insensitive+case-preserving
approach. But that doesn't matter, because this is the system we have
actually deployed everywhere in the world, so we need to make
IDNA2008 made some trade-offs. UTS#46 makes different ones. The sad
fact is that, as a result of IDNA2003, IDNA2008, and UTS#46, there are
actually five ways to deal with IDNs:
1. Just put UTF-8 in the labels. This will work for IDNA-unaware
applications, but will probably break for everything else.
Increasingly rare in "public" zones.
2. Use IDNA2003. This is broken unless your machine somehow
magically has Unicode 3.2 on it.
3. Use IDNA2003, but ignore the Unicode-version restriction and use
whatever library you have. This mostly appears to work, but it has a
bunch of undefined behaviour. I've yet to see a systematic analysis
of this approach, though it is often trotted out as proof that
IDNA2008 is unnecessary (usually with the standard "this works fine
for me" level of testing).
4. Use IDNA2008. This works, but breaks backward compatibility with
some names that work under IDNA2003. Note that the case folding is
_not_ broken if you implement RFC 5895 (and everyone should).
5. Use UTS #46. This works, but it has a lot of options which may or
may not break compatibility with IDNA2003 depending on what you want
to do. In addition, depending on what you do with it, you _might_
have a tricky transition in future.
And of course, all of this is just domain names. The reason I
participated in this discussion was not because of domain names, but
because of them and everything else we are planning to
internationalise with similar techniques. I'm now wondering whether
that's such a hot idea.