Proposed new Firefox IDN display algorithm

Mon Feb 20 02:26:13 CET 2012

Gerv,

Consolidating responses and doing some trimming....

--On Friday, February 17, 2012 15:52 +0000 Gervase Markham
<gerv at mozilla.org> wrote:

>...
>> One is to provide a switch that permits a user to say "I think
>> I'm smarter than you are and am willing to take responsibility
>> for that belief and its consequences".
> 
> At the moment, such a switch exists - sort of. You can just
> add every TLD to the whitelist. We could have a global switch
> if it were hidden away in about:config; the issue here is that
> if such a switch exists, then people might switch it on. ;-)

We might disagree a bit, but you have no way to make users sign
in blood that they really, really, are taking responsibility
(rather than, e.g., just blowing the problem off so they can
blame you later).  So I guess that, if ICANN overwhelms your
system with new TLDs no one can keep track of, it is rational
for you to assume "no one" includes the hypothetical responsible
user too, and forcing that user to manually add everything she
has thought about to the whitelist is a rational compromise.

>> The second, and even more important, is that I believe the
>> browser should provide a very accessible, very easy-to-use,
>> transcoder for these labels.
> 
> I can see why you want this, although I suspect that the
> target audience would be so small that the Firefox team would
> balk at adding a new feature for it. Every feature has initial
> and ongoing costs, after all. I see your analogy to View
> Source, but that's been a great tool in proving that the web
> is readable, hackable and remixable - the usage of this
> feature would be a lot more obscure and specialized.

I think your "target audience" assumption above represents a
fundamental difference in assumptions about users and user
interfaces.  Let me explain mine (not original -- most of what
I'm about to say goes straight back to Doug Engelbart although
he clearly bears no responsibility for my interpretation).  If
you and your associates still disagree, we can both move on.

From my point of view, the typical user of your software is
neither "new/first time" because people rarely stay "first
time" for very long -- they do become familiar with their
environments and the tools they use -- nor are most of them
stupid and devoid of the ability to learn.  They may be lazy in
the "don't want to learn anything I don't have to" sense, but
not stupid in the "can't learn it even after I get motivated".
There are limits of course, but very little of what we are
talking about is, e.g., quantum physics with all it implies
about mathematical background, understanding of stochastic
processes, etc., as well as the physics themselves (much of
rocket science doesn't have the prerequisites).

In this particular case, we agree that the audience that will
use (or even understand) that sort of tool on a proactive basis
will be very small.  But there are two other audiences with whom
you ought to be concerned.  First, I assume most of us have
family, friends, or neighbors who are inclined to call up and
ask questions if they encounter something odd that doesn't come
with a really good explanation.  If there is a tool that I can
get them to use to walk through the problem, then the actual
user base for that tool goes up even if I'm using it on behalf
of others and even more if I can teach them how to use it.
Second, it doesn't take being burned very many times (usually just
hearing horror stories is sufficient) before people decide
either that learning things is useful or that they want to avoid
the risky situations entirely.  

For IDNs, the latter could either be "no IDNs" (with pressure on
you for a tool, or demand for a plugin, that would prevent IDN use
entirely) or a far narrower "if I'm not sure I understand the
script, I want the URI blocked" restriction than even the
Microsoft strategy implies.  We presumably don't want to drive
folks in either of those directions; if we think IDNs are a good
idea, we should be doing things that make people feel as safe as
possible using them, even if the price of being safe is that
they have to understand a bit more about what is going on and
have tools they can use to easily investigate further if they
see something suspicious-looking.

>> As a trivial, ASCII-only, example, if I see "rn" on a small
>> screen and in poor light, it would be a huge advantage to be
>> able to be able to get the browser to show me code points that
>> would tell me if I'm looking at U+006D or at U+0072 U+006E.
> 
> If this is your problem, then we have a greater UI problem
> than providing a way for people to see the hex codes behind
> each character!

Well, yes and no and, again, let me use Firefox as an example.
Up until around version 6.x, when I hovered the pointer over a
link or when something was downloading, I saw the full URL at a
predictable place on the screen and saw it in fairly large
letters.  Now, at least with my plug-in family, it is sometimes
in the lower left and sometimes in the lower right and always in
about the smallest type you could have any reasonable
expectation that anyone could read.  Now, regardless of what
sort of visual confusion one is worried about, tiny characters
inevitably make things less distinguishable and increase the
risk.  So, yes, there are greater UI problems out there, but,
even if you and your colleagues have good reason to preserve
that particular bit of UI-induced risk, my being able to click
on something and get a disambiguation display in
reasonable-sized and easily distinguished characters would be an
improvement.
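
A minimal sketch of the kind of disambiguation display I have in
mind, assuming Python and its standard unicodedata module.  The
helper name code_point_list is invented for illustration and is not
anything Firefox provides:

```python
import unicodedata

def code_point_list(text):
    """Return one 'U+XXXX NAME' entry per character of text."""
    return [
        "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# Compare the two strings that look alike in small type.
for label in ("rn", "m"):
    print(label, "->", "; ".join(code_point_list(label)))
```

Shown in reasonable-sized type, a list like that makes it hard to
mistake U+0072 U+006E for U+006D, whatever font the link itself is
rendered in.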

>> * I think your background/ problem statement is misleading.
>> That may distort some of the rest of the document.  Your
>> choice is not "to display or not to display".  A-labels are a
>> display option, not non-display.  U-labels are another
>> display form.  So are "????" and little boxes.  That would be
>> just a pedantic distinction except for two things.  One is
>> that you have a family of other options: display in lurid
>> colors, pop-up warnings if someone tries to click on a link,
>> or just outright refusal to use the URL... perhaps more.

> This is true. Although I've outlined the advantages of A-label
> display in error conditions in other messages in this thread.
> However, if it turns out that this change allows us to display
> the vast majority of used IDNs, there may be a case for some
> more severe error condition when we hit one we think we
> shouldn't. Is that what you are suggesting?

Approximately.  Again, I'd like to see the user be able to
override any warning/threatening choices you make, but only if
you can also give her comprehensive information if she is smart
enough to ask.  In the context of the remarks above and Vint's
comments of some days ago, I'd much rather see:

    Probably really bad domain
       URI with A-labels:  xxxxxxxx
       IRI with U-labels or placeholders: xxxxxx
       Unicode code point list for IDN Labels: xxxxxxx
    -------------          -----------------
    | Forget It |          | Open it       |
    | Button    |          | Anyway Button |
    | (large)   |          | (tiny)        |
    -------------          -----------------

than some cryptic "Evil!" message with an "Ok" or "Dismiss" box
whose status wrt the next step(s) is unclear.
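
A rough sketch of how the three fields in such a dialog could be
derived from a single label.  It assumes Python and its built-in
"idna" codec, which implements the older IDNA2003 rules rather than
IDNA2008; the function name label_views and the dictionary keys are
made up for this example:

```python
def label_views(label):
    """Return the A-label, U-label, and code point list for one label.

    Accepts either form of the label.  Assumes the standard "idna"
    codec (IDNA2003) is an acceptable stand-in for this sketch.
    """
    if label.lower().startswith("xn--"):
        # Input is an A-label: decode it to get the U-label.
        u_label = label.lower().encode("ascii").decode("idna")
    else:
        u_label = label
    a_label = u_label.encode("idna").decode("ascii")
    code_points = " ".join("U+%04X" % ord(ch) for ch in u_label)
    return {"a_label": a_label, "u_label": u_label, "code_points": code_points}

views = label_views("xn--bcher-kva")
print("URI with A-labels:      ", views["a_label"])
print("IRI with U-labels:      ", views["u_label"])
print("Unicode code point list:", views["code_points"])
```

A real implementation would work per label across the whole FQDN and
would have to decide what to show when a label fails validation; the
sketch ignores both.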

>...
> We do actually carry around our own rendering routines:
> http://en.wikipedia.org/wiki/HarfBuzz#HarfBuzz
> 
> We don't carry around our own fonts. But what point are you
> making here? If we find (somehow) that we are unable to
> display a U-label correctly, we should do something else other
> than what happens by default now, which I suspect is little
> boxes with numbers in? (Although I haven't checked - can you
> point me at a test domain name which uses highly obscure
> Unicode characters?)

I'll look or try to make one.  Others may be able to help here.
If the numbers in the little boxes are 4-6 digit Unicode code
points, I'd be happiest seeing those along with the A-labels.
If they are, e.g., escaped UTF-8, I'd want to see, or easily get
at, the Unicode code point list and the A-labels.

>> A-label display or something else.   In other words, at least
>> in the proposal, I'm trying to get you to separate
>> "identification of a label that deserves worrying about" (or
>> "identification of a label which is safe" with all others
>> defaulting into "worry about") from what you do about that
>> label or the FQDN of which it is part.
> 
> OK. If currently we have:
> 
>                     Can Display          Can't Display
> 
> Worrisome Label    A-label              A-label
> 
> Good Label         U-label              Little boxes
> 
> How would you have us modify that matrix? (This assumes for
> the sake of argument that it's possible to distinguish between
> the bottom two boxes.)

For the top two cells, I'd like to be able to access (with
right-click or some other mechanism) the U-label and code point
list.  For the bottom pair, I'd like to be able to access (ditto)
the A-label and code point list.  It would also suit me just
fine to be able to get at the A-label, the U-label, and the code
point list for all four cells, even though one of those fields
will be redundant for each cell.  For the user who doesn't know
that he should right-click the cells, or who knows and doesn't
care, the display set above is just fine.

>...
>> (2) It seems to me that one of the problems with any strategy
>> that tries to decide between "safe" names and ones that need
>> special treatment on any basis other than experience with, or
>> the reputation of, the particular domain or site is doomed to
>...
> To clarify: you are suggesting that sites with a cert from a
> trusted CA should get U-labels regardless?
> 
> CA certificates are about identity, not about "good-ness" or
> honesty. I have, coincidentally, just been arguing in another,
> CA-related, forum that they should by no means put
> restrictions on which domain names people can get certificates
> for, because deciding which domain names should and shouldn't
> exist is the job of registries, and once they've done that, a
> CA should go along with it. It should not be possible to
> register a domain and yet not be able to get a cert for it
> because the registry is fine with the name you pick but the CA
> isn't. I can imagine domain owners being quite put out if that
> could/did happen.

Let me try a different point of view out on you.  Even accepting
your "identity" definition, a CA can "identify" anything from
something associated with the domain (remember that, independent
of ability to obtain a domain, registries and registrars differ
widely in how much information needs to be provided, verified,
and/or exposed to do so).  At least in this context, I don't have
a problem with someone obtaining a domain under an alias, with
faked or hidden contact information, if the registry permits
that... and getting a certificate to match.  But, in that case,
the "identity" that is being certified is really very thin and
uncertain.  And, fwiw, your users are probably much more at risk
from an "almost no one knows who is operating this domain and
those who know aren't telling" situation than they are from a
suspect IDN (realistically, the two cases go together) and it
would be at least as reasonable for you to warn about domains
with anonymous/ proxy registrations (or registries that permit
them) as it is to worry about sloppy registry IDN policies.

At the other extreme, a CA can choose to issue an "identity"
certificate only if it can actually verify the identity of the
registrant (not just, e.g., reachability). Note that, in
permitting descriptions of certificate applicability, X.509
recognizes the difference as do a number of CAs who issue
certificates with different levels of quality and assurance.

I am suggesting only that folks who have certificates that you
can recognize as providing a high level of actual identity
assurance and authentication may be entitled to a little more
positive treatment than domains whose certificates indicate only
that the entity involved was able to obtain a domain.  I'm not
suggesting treating the latter badly, only that, as your policy
moves down a path in which someone can get normal, U-label
display, for their IDNs by any of several mechanisms, it may be
worth considering high-quality / high-assurance identity
certificates as another one of those mechanisms.

>> (3) Personally, I'd add a user-specific FQDN (not TLD)
>> whitelist to that list of hard exceptions.
> 
> Users can technically add TLDs to the list. (I don't think
> they can add FQDNs.)

The thing I'm circling around is that we already see
www.KnownGoodGuy.SuspectDomain and see it frequently.  We may
see it even more with ICANN's new TLD program.  There ought to
be ways for those <known good guys> who find themselves in TLDs
that haven't made you happy, to get normal display.  The
certificate idea above is one way, user-supplied FQDN
whitelisting (by bookmarks or otherwise) might be another.  To
make the problem here more clear, assume that "mozilla" required
a non-ASCII character to write.  www.mozilla.org would still
work because you have whitelisted ORG and PIR.  But, if some
user types in "www.mozilla.com", you'd really like them to see
native characters even though COM and VGRS are not whitelisted. 
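
To show how those mechanisms could sit side by side, here is a
sketch in Python.  Everything in it is hypothetical: the class
DisplayPolicy, its field names, and the high_assurance_cert flag
(which stands in for whatever test you would apply to a certificate)
are my inventions, not a description of Firefox:

```python
from dataclasses import dataclass, field

@dataclass
class DisplayPolicy:
    # TLDs whose registry policies the browser vendor has accepted.
    tld_whitelist: set = field(default_factory=set)
    # Full names the individual user has chosen to trust.
    fqdn_whitelist: set = field(default_factory=set)

    def show_native(self, fqdn, high_assurance_cert=False):
        """True if the name should get normal U-label display."""
        name = fqdn.lower().rstrip(".")
        tld = name.rsplit(".", 1)[-1]
        return (
            tld in self.tld_whitelist
            or name in self.fqdn_whitelist
            or high_assurance_cert
        )

policy = DisplayPolicy(tld_whitelist={"org"},
                       fqdn_whitelist={"www.mozilla.com"})
print(policy.show_native("www.mozilla.org"))   # TLD is whitelisted
print(policy.show_native("www.mozilla.com"))   # user whitelisted the FQDN
print(policy.show_native("www.example.com"))   # neither applies
```

Any one of the three conditions is sufficient, so a known good guy
in a TLD that has not made you happy still has two ways to get
normal display.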

>> In order to avoid even more
>> databases/ tables, suppose you checked a string to be
>> displayed against the user's bookmark list _and_ created some
>> extra warning and explanation if the user decided to bookmark
>> a string that you considered suspicious.  If the user decided
>> to bookmark the site despite those warnings/ explanations,
>> maybe you should believe that he or she knows what they are
>> doing and get out of the way.
> 
> I'm not sure enough users use bookmarks in the "traditional"
> way like that (or at all) for it to be a good idea to bake
> them into the security strategy.

Don't know.  FWIW, I feel the same way about using email address
lists as a part of a security strategy -- but the approach seems
to be hugely popular.

>> (4) While I think the new classification data for "Common" and
>> "Inherited" scripts that Mark identifies will be useful,
>> especially to registries who are trying to make intelligent
>> decisions about what to accept, I want to caution against
>> doing anything at lookup time that is dependent on an inferred
>> language.
> 
> I tried to write the document in terms of "script", not
> "language" - I understand the perils of trying to infer the
> latter. Are you suggesting I haven't succeeded? If so, point
> me at the broken bits.

I think your text is fine.  The problem is that one can't get
very far with labels in some scripts without including some
"Common" or "Inherited" characters.  But, for a given label,
which characters from those groups should be treated as part of
the same script as the other characters in the label is going to
depend at least on script (e.g., for Latin Script, some Common
or Inherited characters are going to be appropriate and should
not result in the label being treated as mixed-script, but
others are not).  However, if you wanted a high-quality test,
the answer to "which Common and/or Inherited characters should
be permitted with the other characters in this label (without
getting the label flagged as mixed-script)" requires language
knowledge.  I don't think you should go there but, if you don't,
everyone needs to understand that the test being made is fairly
weak.
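
To illustrate what I mean by "fairly weak", here is a sketch of a
script-only test in Python.  The standard library does not expose
the Unicode Script property, so rough_script guesses the script from
the first word of the character name; that shortcut, and both
function names, are assumptions of this example and not how a real
implementation should obtain script data:

```python
import unicodedata

def rough_script(ch):
    """Approximate a character's script (heuristic, not the real
    Unicode Script property)."""
    if unicodedata.combining(ch):
        return "Inherited"       # combining marks take the base's script
    if not ch.isalpha():
        return "Common"          # digits, hyphen, other shared characters
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]

def is_mixed_script(label):
    """True if the label uses letters from more than one script.

    Every Common or Inherited character is accepted with every
    script, which is exactly the weakness described above.
    """
    scripts = {rough_script(ch) for ch in label} - {"Common", "Inherited"}
    return len(scripts) > 1

print(is_mixed_script("paypal"))          # all Latin
print(is_mixed_script("p\u0430ypal"))     # U+0430 is CYRILLIC SMALL LETTER A
print(is_mixed_script("e\u0301cole-1"))   # Latin plus Inherited and Common
```

Note that the test cannot say whether a particular Common or
Inherited character actually belongs with the rest of the label;
answering that is where the language knowledge would come in.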

>> (5) One of the disadvantages of going to A-label display is
>> that, while A-labels provide a good clue that something
>> unusual is going on, users may vary widely as to whether that
>> is taken as a warning sign or just one of many
>> incomprehensible things that happens on the Internet.  If the
>> user, grandmother or not, is using the network by rote and
>> with the assumption that a great deal of it is just magic,
>> then it is possible that any A-label is confusable with any
>> other A-label.  As long as A-label display is rare and the
>> user never has the experience of having an intended and safe
>> FQDN displayed in A-labels, that is probably ok.  But,
>> A-label display becomes common and some of the labels thus
>> displayed are actually associated with reasonable domains and
>> safe sites, the warning value of that technique will
>> deteriorate significantly unless users can remember which
>> A-labels have been visited before and are ok.  I don't think
>> we can count on that.
> 
> I agree that seeing A-labels should be a rare thing, and
> perhaps one of the downsides of the current implementation is
> that users might see them more often than one would like. I
> hope the new proposal will reduce the incidence of this. Given
> that, I can't quite see your point?

Depending on where people browse and whether serious
confusable-IDN-based attacks actually become popular, I don't
think your new proposal is going to reduce the frequency with
which people see A-labels enough to make a big difference.

>...

best,
   john