Proposed new Firefox IDN display algorithm

Mon Feb 6 21:24:39 CET 2012

--On Friday, January 20, 2012 18:38 +0000 Gervase Markham
<gerv at mozilla.org> wrote:

> Thanks to all on this list who provided input; I have taken
> several of your suggestions into this proposal for a change to
> the way Firefox chooses how to display IDNs:
> 
> https://wiki.mozilla.org/IDN_Display_Algorithm
> 
> Comments, particularly on the "Possible Issues and Open
> Questions", would be very welcome.

Gerv,

Again, my apologies for the delay in my getting to this.  I've
read several, but not all, of the comments made by others in the
interim.  If some of this is duplicative, I apologize.

Some comments (all checked against the 4 February version of the
proposal):

First, two very general observations.  

* I think your background/problem statement is misleading.
That may distort some of the rest of the document.  Your choice
is not "to display or not to display".  A-labels are a display
option, not non-display.  U-labels are another display form.  So
are "????" and little boxes.  That would be just a pedantic
distinction except for two things.  One is that you have a
family of other options: display in lurid colors, pop-up
warnings if someone tries to click on a link, or just outright
refusal to use the URL... perhaps more.   The other is that,
while you can safely assume that A-labels can be displayed, you
cannot guarantee correct display and rendering of U-labels
unless Firefox starts carrying around its own reference fonts
and rendering routines.  There are actually some things to
recommend the latter but keeping the footprint small is not one
of them.   

I refer to the set of cases that should be treated exceptionally
-- not displayed in U-label form without comment -- as needing
"special treatment" or "alerts", whether that treatment is
A-label display or something else.   In other words, at least in
the proposal, I'm trying to get you to separate "identification
of a label that deserves worrying about" (or "identification of
a label which is safe" with all others defaulting into "worry
about") from what you do about that label or the FQDN of which
it is part.
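
To make that separation concrete, here is a minimal Python sketch;
it is an illustration only, not anything taken from the proposal.
The names Verdict, Treatment, classify, and render are hypothetical,
the classification rule is a deliberately trivial placeholder (a
caller-supplied set of labels considered safe, with everything else
defaulting to "worry about"), and the A-label conversion uses
Python's built-in "idna" codec, which implements the older IDNA2003
rules rather than IDNA2008.

```python
from enum import Enum
from typing import Dict, Optional, Set


class Verdict(Enum):
    SAFE = "safe"
    WORRY = "worry"


class Treatment(Enum):
    U_LABEL = "u-label"   # native-character display
    A_LABEL = "a-label"   # "xn--" display
    REFUSE = "refuse"     # decline to show or use the name


def classify(label: str, known_safe: Set[str]) -> Verdict:
    # Step 1: identification only. Placeholder rule (an assumption):
    # anything not known to be safe defaults to WORRY.
    return Verdict.SAFE if label in known_safe else Verdict.WORRY


def render(label: str, verdict: Verdict,
           policy: Dict[Verdict, Treatment]) -> Optional[str]:
    # Step 2: what to do about it is a separate, swappable policy.
    treatment = policy[verdict]
    if treatment is Treatment.U_LABEL:
        return label
    if treatment is Treatment.A_LABEL:
        return label.encode("idna").decode("ascii")
    return None  # REFUSE


policy = {Verdict.SAFE: Treatment.U_LABEL, Verdict.WORRY: Treatment.A_LABEL}
```

Changing what happens to a worrying label (colors, pop-ups, refusal)
then means changing only the policy table, not the classifier.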

* I think you and I have a long-standing disagreement about the
tradeoff between giving the user greater control over her
environment and the costs of doing so (different experiences for
different users, more complex option sets, and so on).  As I've
understood it from other conversations, I prefer to recognize
that people come with different levels of skill and understanding
and, to some extent, to be flexible for those who are
capable of learning and understanding.   The opposite position
--much more extreme than any I'd heard you take-- is that,
because there are some hypothetical "grandmothers" out there who
may be able to use the Internet as long as they can follow
scripts without even a modicum of understanding about what is
going on, all users should be treated as if they are those
grandmothers.  While I assume we would agree that either extreme
is wrong, I think I'm a little more willing to assume that even
those hypothetical "grandmothers" can learn and, indeed, that
many of them are more likely to read and understand well-written
documentation than their teenage counterparts.

Against that backdrop, some specific comments, some about issues
a little broader than this specific proposal but, IMO, relevant
to it:

(1) The rule about ZWJ/ZWNJ (and maybe some other things) is
that labels containing them should not be displayed without
special treatment unless they are effectively visible on
display.  For the purpose of that classification, the CONTEXTJ
requirements of IDNA are a starting point: you should probably
do at least that and might want to go further.
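
As a rough illustration of the "starting point", the sketch below
flags a label for special treatment when a ZWJ or ZWNJ does not
immediately follow a virama (canonical combining class 9), which is
the virama branch of the CONTEXTJ rules in RFC 5892.  It is an
assumption-laden simplification: the second, joining-type rule that
IDNA also allows for ZWNJ is omitted, so this check is stricter than
IDNA for ZWNJ, and the function name is hypothetical.

```python
import unicodedata

ZWNJ = "\u200c"
ZWJ = "\u200d"
VIRAMA_CCC = 9  # canonical combining class shared by virama characters


def joiners_need_special_treatment(label: str) -> bool:
    """Return True if any ZWJ/ZWNJ in the label fails the virama check."""
    for i, ch in enumerate(label):
        if ch not in (ZWNJ, ZWJ):
            continue
        # A joiner at the start of the label has nothing before it.
        if i == 0:
            return True
        # Simplification: only the "preceded by a virama" branch is
        # implemented; the ZWNJ joining-type rule is not.
        if unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
            return True
    return False
```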

(2) It seems to me that any strategy that tries to decide between
"safe" names and ones that need special treatment on any basis
other than experience with, or the reputation of, the particular
domain or site is doomed to mistreat some sites and registrants
who are perfectly ok in order to protect against sloppy behavior
or bad deeds of someone else.  That is true with your old policy,
your proposed new
policy, and the policies of the other browsers.  The differences
are just about who gets hurt and why.    One of the things we
now know that we didn't when you first deployed your policy is
that the policy seems to have about zero effect on registrant
choices about the TLDs in which to register.  If the policy
had caused a significant change in registrant decisions, we could
assume that some obvious, popular TLDs would be beating a path to
your door to sign up.    While it appears initially to be a
separate set of issues, I think the decision in recent versions
of Firefox to hide protocol identifiers in the displayed form of
the address bar may be a liability for the IDN case.  Suppose
you could either be careful about the certificate authorities
you recognized by default or, as you do with TLDs under these
policies, classify them as "trusted" and "less trusted".  In
that case, perhaps a site that was
accessed via HTTPS and that presented a certificate from a
trusted CA could be exempted from special treatment regardless
of the TLD in which their domain appeared.   In other words,
normal ("Unicode") display would be possible if the TLD appeared
on the whitelist, or you got a cert that you actually trusted,
or the string passed your heuristic.
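
The three-way "or" in that last sentence can be sketched as follows.
This is only an illustration under stated assumptions: SiteContext
and unicode_display_allowed are hypothetical names, and the two
boolean fields stand in for whatever certificate-trust decision and
label heuristic the browser would actually apply.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class SiteContext:
    tld: str                    # e.g. "de"
    scheme: str                 # e.g. "https"
    cert_from_trusted_ca: bool  # assumed result of a CA-trust decision
    passes_heuristic: bool      # assumed result of the label heuristic


def unicode_display_allowed(site: SiteContext,
                            tld_whitelist: Set[str]) -> bool:
    # Any one of the three conditions is enough for normal display.
    if site.tld in tld_whitelist:
        return True
    if site.scheme == "https" and site.cert_from_trusted_ca:
        return True
    return site.passes_heuristic
```

Note that the certificate branch only applies to HTTPS, which is one
reason hiding the protocol identifier matters for the IDN case.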

(3) Personally, I'd add a user-specific FQDN (not TLD) whitelist
to that list of hard exceptions.  In order to avoid even more
databases/tables, suppose you checked a string to be displayed
against the user's bookmark list _and_ created some extra
warning and explanation if the user decided to bookmark a string
that you considered suspicious.  If the user decided to bookmark
the site despite those warnings/explanations, maybe you should
believe that he or she knows what they are doing and get out of
the way.  Again, for familiar domains, the likelihood of being
confused by a native character display is probably actually
lower than the odds of confusing one un-memorable A-label string
with another (see below).  Whether to also let users whitelist
TLDs that you haven't whitelisted is another question.  A few
years ago, I would have argued (and did argue) for that option.  I'm
getting less sure about that over time.
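
A minimal sketch of the bookmark check, assuming bookmarks are
available as a list of URL strings.  The function names are
hypothetical, and hosts are compared in A-label form via Python's
built-in "idna" codec (IDNA2003 rules), which is an implementation
choice for the example rather than anything the proposal specifies.

```python
from typing import Iterable, Set
from urllib.parse import urlsplit


def _normalize(host: str) -> str:
    # Compare hosts in lower-case A-label form, without a trailing dot.
    return host.rstrip(".").lower().encode("idna").decode("ascii")


def bookmark_hosts(bookmark_urls: Iterable[str]) -> Set[str]:
    """Collect the normalized FQDNs found in the user's bookmarks."""
    hosts = set()
    for url in bookmark_urls:
        host = urlsplit(url).hostname
        if host:  # skip entries without a parseable host
            hosts.add(_normalize(host))
    return hosts


def is_user_whitelisted(fqdn: str, hosts: Set[str]) -> bool:
    # Whole-FQDN match only: bookmarking one name says nothing
    # about other names in the same TLD.
    return _normalize(fqdn) in hosts
```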

(4) While I think the new classification data for "Common" and
"Inherited" scripts that Mark identifies will be useful,
especially to registries who are trying to make intelligent
decisions about what to accept, I want to caution against doing
anything at lookup time that is dependent on an inferred
language.  Partially because a very large fraction of DNS labels
are abbreviations, acronyms, codes, constructed terms, fanciful
names, or other sorts of mnemonics (even "com" and "org" aren't
really words, and few dictionaries will tell me what a "mozilla"
is), assuming that one can infer or rely on languages or the
rules of those languages takes one into a rat hole of truly
awesome dimensions.

(5) One of the disadvantages of going to A-label display is
that, while A-labels provide a good clue that something unusual
is going on, users may vary widely as to whether that is taken
as a warning sign or just one of many incomprehensible things
that happens on the Internet.  If the user, grandmother or not,
is using the network by rote and with the assumption that a
great deal of it is just magic, then it is possible that any
A-label is confusable with any other A-label.  As long as
A-label display is rare and the user never has the experience of
having an intended and safe FQDN displayed in A-labels, that is
probably ok.  But, if A-label display becomes common and some of
the labels thus displayed are actually associated with
reasonable domains and safe sites, the warning value of that
technique will deteriorate significantly unless users can
remember which A-labels have been visited before and are ok.  I
don't think we can count on that.

(6) I think your heuristic itself is about as good as you are
going to get.  I might be able to suggest small variations, but
I think they would mostly just shift the false negatives around
a bit rather than resulting in substantial improvement.   But I
have to wonder about your threat model.  If the goal is to say
"something needs to be done and we are doing as much as we
reasonably can, even if it is not likely to be very effective",
I have no problems with that.  But, if you want to move well
beyond that, especially in a world in which I have to expect
ICANN's methods to prevent confusing name pairs at the top level
will fail in at least some significant cases (explanation on
request), I have to wonder.  Ultimately, there are two sorts of
confusing conflicts, those that result from accidents and those
that are actually the consequences of intentional attacks.
These examples are deliberately silly but, if someone tries to
reach Honda's site and accidentally finds a teddy bear
manufacturer instead, the user will be momentarily discomfited
and confused about what happened, but no real harm will be done.
If they accidentally get a Toyota site instead and disclose some
information before they realize their mistake, they might get
some extra marketing literature from Toyota, but Toyota is
presumably not in the identity theft or some other evil business
so, again, little harm.  

On the other hand, suppose the second site was put up by an evil
enterprise with seriously bad intent.  I suggest that, because of
the combination of "whole script" problems and the issues that
Andrew identified, your heuristics mean that such an enterprise
will need to be somewhat smarter than they would need to be
without those heuristics.  But there seems to be no shortage of
smart, evil people and enterprises around the Internet so, once
they figure out how to get around the heuristics, it is kind of
open season and about the best that can be done is to keep
repeating one of the things the report suggests: that the real
ability to do something about deliberate confusion-based
phishing and similar problems lies with the registries and those
who can pressure them or constrain their behavior.  That isn't
just ICANN: it is a variety of governments, law-enforcement
agencies, and even individuals who, if hurt by delegation of
strings that could have no purpose other than to confuse, could
presumably sue the registries and claim that their "we have no
responsibility, it is all on the registrants" contracts
shouldn't protect them if they know bad (evil, criminal,
illegal) behavior is going on and they are consciously abetting
it.   I suggest, as noted above, that we could learn the real
lesson from the "paypal" example that started most of us down
this path, which is that the real problem maybe isn't the name
but the ease with which a bogus organization with a fake identity
can obtain a certificate simply by "owning" a domain name and
mailbox for a short time.

    best wishes,
    john