Proposed new Firefox IDN display algorithm

Thu Feb 2 23:12:42 CET 2012

Hi,

First, apologies for taking so long to get back to this, and thanks
for the invitation. 

On Fri, Jan 20, 2012 at 06:38:30PM +0000, Gervase Markham wrote:
> 
> https://wiki.mozilla.org/IDN_Display_Algorithm

I read this on 2012-02-02.  I have some comments.

Under "Other Browsers", you have this: "[…]this does not give site
owners any confidence that their IDN domain name will be correctly
displayed for all their visitors (and no way of telling if it's not)."
I hope it's clear that, in fact, no matter what you do you have no
hope of fixing this problem.  Without enforcing a single font (which
is completely Unicode-comprehensive) and UTF-8-using locales on
everyone (at the very minimum), there will be cases where IDNs don't
appear correctly.  In any case, there's no way for a site operator to
know it because you don't get display faults back through the
protocols.  (Also, just a nit: "IDN" stands for Internationalized
Domain Name, so "IDN domain name" is redundant.)

Under "Proposal", you have this: "The hope is that any intra-script
near-homographs will be recognisable to people who understand that
script."  The problem with this is that, with very few exceptions,
_nobody_ understands a whole script.  English and French, both of which I
speak, are nominally written in the same script, for instance, but
they use different parts of it; and they're about as close to one
another as you can get.  The character U+00D8, LATIN CAPITAL LETTER O
WITH STROKE (Ø) is certainly part of Latin, but unless I see it in the
right context, I'll read it as DIGIT ZERO.
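
To make that concrete, here is a small sketch (Python, standard
library only; the helper name describe() is just something I made up
for illustration).  It shows that, as far as the Unicode data is
concerned, the two characters have nothing to do with each other,
however similar they may look in a given font:

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# LATIN CAPITAL LETTER O WITH STROKE, DIGIT ZERO, LATIN CAPITAL LETTER O
for ch in ("\u00D8", "0", "O"):
    print(describe(ch))
```

Nothing in that output says "these look alike"; that judgement depends
on the font and on what the reader expects to see.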

Under "Algorithm", you have this: "If a TLD is in the whitelist, we
will unconditionally display Unicode."  Why do you believe that the
TLD policies help?  None of the gTLDs, as far as I am aware, has a
policy that old-fashioned LDH names can't have U-labels beneath them.
Might it be enough for an attacker to set up
arabic-label.arabic-label.arabic-label.badguy.com, and expect the
ASCII part to be ignored?  (Maybe this is supposed to be solved by
greying out everything not near the top of the tree?)
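
A rough sketch of the kind of name I have in mind, in Python.  This is
only an illustration: the function names and the example name are
hypothetical, the classification is deliberately crude (it does not
validate anything against IDNA2003 or IDNA2008), and it is not a
description of what Firefox does.

```python
import re

# Letters, digits, hyphen; no leading or trailing hyphen.
LDH_RE = re.compile(r"^[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?$")

def classify_label(label):
    """Crudely classify one label; no IDNA validity checks are done."""
    if label.lower().startswith("xn--"):
        return "looks like A-label"
    if LDH_RE.match(label):
        return "LDH"
    if any(ord(c) > 127 for c in label):
        return "non-ASCII (candidate U-label)"
    return "other"

def classify_name(name):
    """Classify each label of a dotted name, leftmost label first."""
    return [(label, classify_label(label)) for label in name.split(".")]

# Hypothetical example: two Arabic-script labels under an LDH name in .com.
name = "\u0645\u062B\u0627\u0644.\u0645\u062B\u0627\u0644.badguy.com"
for label, kind in classify_name(name):
    print(ascii(label), kind)
```

The point is that a decision keyed only on the TLD sees "com" and
nothing else, while the labels a user actually reads are further to
the left.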

Also in that section is discussion of using the data from Unicode
6.1.  While I think this could be a good idea and I think it's worth
considering carefully, I'm slightly worried about two things.  First,
this is a new feature of Unicode, and it's hard to predict how well it
will work in practice.  Second, are you planning just to code this
into the browser, or are you planning on using the local Unicode
facilities on the machine?  The latter seems preferable to me, but it
means that you don't get this facility until Unicode 6.1 is on the
machine (and of course, you can never get it for IDNA2003, since
that's pinned to a pre-6.1 version of Unicode).
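
To illustrate the dependency on local Unicode facilities, here is a
sketch in Python, where the runtime's own Unicode tables stand in for
"the machine".  The helper names are mine, and this says nothing about
how Firefox gets at its Unicode data.  Python happens to ship a second
database frozen at Unicode 3.2 for its IDNA2003 support, which shows
the pinning rather nicely:

```python
import unicodedata

def unicode_version_tuple(version_string):
    """Turn a version string such as '6.1.0' into a comparable tuple."""
    return tuple(int(part) for part in version_string.split("."))

def has_unicode_6_1():
    """True if this runtime's Unicode tables are at version 6.1 or later."""
    return unicode_version_tuple(unicodedata.unidata_version) >= (6, 1)

print("runtime Unicode version:", unicodedata.unidata_version)
print("6.1 data available:", has_unicode_6_1())
# The frozen database Python keeps for IDNA2003 (stringprep) processing.
print("IDNA2003 database version:", unicodedata.ucd_3_2_0.unidata_version)
```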

Finally, under that section, you have a plan to "display Punycode" in
some cases.  As others already suggested in this thread, that seems as
bad as anything else: A-labels are confusing to _everybody_.  (You
have this as an open question, sort of, but assume you're going to
display A-labels no matter what.  I think that's a mistake.)
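
For anyone who has not looked at an A-label lately, a quick sketch
using Python's built-in "idna" codec.  Note the assumptions: that
codec implements IDNA2003 rather than IDNA2008, the helper names are
mine, and the label is just a familiar example, not one taken from the
wiki page.

```python
def to_a_label(u_label):
    """Convert one U-label to its A-label form (Python's IDNA2003 codec)."""
    return u_label.encode("idna").decode("ascii")

def to_u_label(a_label):
    """Convert an A-label back to Unicode with the same codec."""
    return a_label.encode("ascii").decode("idna")

print(to_a_label("b\u00FCcher"))    # prints xn--bcher-kva
print(to_u_label("xn--bcher-kva"))  # prints the original Unicode label
```

Even for a label that is one character away from plain ASCII, the
A-label form is not something most people can read back.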

In the open questions, you ask, "Should we document our character
hard-blacklist as part of this exercise? It's already visible in the
prefs. Are any characters in it legal in IDNA2008 anyway?"  I'd say
document them only if there are any that aren't IDNA2008 legal.  But I
seem to recall you'll have problems here anyway -- ZWNJ and ZWJ are
sometimes legal under IDNA2008, and for some reason I believe you
block those (I didn't check, though).
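
As a sketch of what "sometimes legal" means: RFC 5892, Appendix A.1
and A.2, allows ZWNJ (U+200C) and ZWJ (U+200D) only in certain
contexts, one of which is immediately after a character whose
canonical combining class is Virama.  The Python below checks only
that one condition.  The function name is mine, and for ZWNJ there is
a second rule based on joining types that this does not implement, so
treat it as partial.

```python
import unicodedata

ZWNJ = "\u200C"
ZWJ = "\u200D"
VIRAMA_CCC = 9  # canonical combining class value named Virama

def joiner_follows_virama(label, index):
    """True if the joiner at label[index] directly follows a virama.

    Implements only the virama condition from RFC 5892 Appendix A;
    the joining-type rule for ZWNJ is not covered.
    """
    if label[index] not in (ZWNJ, ZWJ):
        raise ValueError("character at index is not ZWNJ or ZWJ")
    if index == 0:
        return False
    return unicodedata.combining(label[index - 1]) == VIRAMA_CCC

# DEVANAGARI KA, VIRAMA, ZWJ, SSA: joiner is in an allowed context.
print(joiner_follows_virama("\u0915\u094D\u200D\u0937", 2))
# ZWJ between two ASCII letters: not allowed by this rule.
print(joiner_follows_virama("a\u200Db", 1))
```

A blanket block on both characters would reject the first label as
well as the second.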

I think your discussion of "Downsides" is quite correct.

Thanks for posting this, and for the invitation to comment.  I hope
these comments are useful.

Best regards,

A

-- 
Andrew Sullivan
ajs at anvilwalrusden.com