Proposed new Firefox IDN display algorithm

Mon Feb 20 15:01:57 CET 2012

John, I assume you meant "does NOT take being burned..."

Second, it does take being burned very many times (usually just
hearing horror stories is sufficient) before people decide
either that learning things is useful or that they want to avoid
the risky situations entirely.

On Sun, Feb 19, 2012 at 8:26 PM, John C Klensin <klensin at jck.com> wrote:
> Gerv,
>
> Consolidating responses and doing some trimming....
>
> --On Friday, February 17, 2012 15:52 +0000 Gervase Markham
> <gerv at mozilla.org> wrote:
>
>>...
>>> One is to provide a switch that permits a user to say "I think
>>> I'm smarter than you are and am willing to take responsibility
>>> for that belief and its consequences".
>>
>> At the moment, such a switch exists - sort of. You can just
>> add every TLD to the whitelist. We could have a global switch
>> if it were hidden away in about:config; the issue here is that
>> if such a switch exists, then people might switch it on. ;-)
>
> We might disagree a bit, but you have no way to make users sign
> in blood that they really, really, are taking responsibility
> (rather than, e.g., just blowing the problem off so they can
> blame you later).  So I guess that, if ICANN overwhelms your
> system with new TLDs no one can keep track of, it is rational
> for you to assume "no one" includes the hypothetical responsible
> user too and forcing that user to manually add everything she
> has thought about to the whitelist is a rational compromise.
>
>>> The second, and even more important, is that I believe the
>>> browser should provide a very accessible, very easy-to-use,
>>> transcoder for these labels.
>>
>> I can see why you want this, although I suspect that the
>> target audience would be so small that the Firefox team would
>> balk at adding a new feature for it. Every feature has initial
>> and ongoing costs, after all. I see your analogy to View
>> Source, but that's been a great tool in proving that the web
>> is readable, hackable and remixable - the usage of this
>> feature would be a lot more obscure and specialized.
>
> I think your "target audience" assumption above represents a
> fundamental difference in assumptions about users and user
> interfaces.  Let me explain mine (not original -- most of what
> I'm about to say goes straight back to Doug Engelbart although
> he clearly bears no responsibility for my interpretation).   If
> you and your associates still disagree, we can both move on.
>
> From my point of view, the typical user of your software is
> neither "new/first time", because people rarely stay "first
> time" for very long -- they do become familiar with their
> environments and the tools they use -- nor are most of them
> stupid and devoid of the ability to learn.  They may be lazy in
> the "don't want to learn anything I don't have to" sense, but
> not stupid in the "can't learn it even after I get motivated".
> There are limits of course, but very little of what we are
> talking about is, e.g., quantum physics with all it implies
> about mathematical background, understanding of stochastic
> processes, etc., as well as the physics itself (much of
> rocket science doesn't have those prerequisites).
>
> In this particular case, we agree that the audience that will
> use (or even understand) that sort of tool on a proactive basis
> will be very small.  But there are two other audiences with whom
> you ought to be concerned.  First, I assume most of us have
> family, friends, or neighbors who are inclined to call up and
> ask questions if they encounter something odd that doesn't come
> with a really good explanation.  If there is a tool that I can
> get them to use to walk through the problem, then the actual
> user base for that tool goes up even if I'm using it on behalf
> of others and even more if I can teach them how to use it.
> Second, it does take being burned very many times (usually just
> hearing horror stories is sufficient) before people decide
> either that learning things is useful or that they want to avoid
> the risky situations entirely.
>
> For IDNs, the latter could either be "no IDNs" (and pressure on
> you, or for a plugin, for a tool that would prevent IDN use
> entirely) or a far narrower "if I'm not sure I understand the
> script, I want the URI blocked" restriction than even the
> Microsoft strategy implies.  We presumably don't want to drive
> folks in either of those directions; if we think IDNs are a good
> idea, we should be doing things that make people feel as safe as
> possible using them, even if the price of being safe is that
> they have to understand a bit more about what is going on and
> have tools they can use to easily investigate further if they
> see something suspicious-looking.
>
>>> As a trivial, ASCII-only, example, if I see "rn" on a small
>>> screen and in poor light, it would be a huge advantage to be
>>> able to get the browser to show me code points that would
>>> tell me if I'm looking at U+006D or at U+0072 U+006E.
>>
>> If this is your problem, then we have a greater UI problem
>> than providing a way for people to see the hex codes behind
>> each character!
>
> Well, yes and no and, again, let me use Firefox as an example.
> Up until around version 6.x, when I hovered the pointer over a
> link or when something was downloading, I saw the full URL at a
> predictable place on the screen and saw it in fairly large
> letters.  Now, at least with my plug-in family, it is sometimes
> in the lower left and sometimes in the lower right and always in
> about the smallest type you could have any reasonable
> expectation that anyone could read.  Now, regardless of what
> sort of visual confusion one is worried about, tiny characters
> inevitably make things less distinguishable and increase the
> risk.  So, yes, there are greater UI problems out there, but,
> even if you and your colleagues have good reason to preserve
> that particular bit of UI-induced risk, my being able to click
> on something and get a disambiguation display in
> reasonable-sized and easily distinguished characters would be an
> improvement.
>
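A minimal Python sketch of the kind of disambiguation display described
above, assuming nothing about Firefox itself: the function name
code_point_list is purely illustrative, and only the standard
unicodedata module is used.

```python
import unicodedata


def code_point_list(label: str) -> str:
    """Return 'U+XXXX NAME' for each character of the label, comma-separated."""
    return ", ".join(
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in label
    )


# The two strings that can look alike in small type:
print(code_point_list("m"))   # U+006D LATIN SMALL LETTER M
print(code_point_list("rn"))  # U+0072 LATIN SMALL LETTER R, U+006E LATIN SMALL LETTER N
```

Listing the character names next to the code points is a choice made
here for readability; the thread itself only asks for the code points.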
>
>>> * I think your background/ problem statement is misleading.
>>> That may distort some of the rest of the document.  Your
>>> choice is not "to display or not to display".  A-labels are a
>>> display option, not non-display.  U-labels are another
>>> display form.  So are "????" and little boxes.  That would be
>>> just a pedantic distinction except for two things.  One is
>>> that you have a family of other options: display in lurid
>>> colors, pop-up warnings if someone tries to click on a link,
>>> or just outright refusal to use the URL... perhaps more.
>
>> This is true. Although I've outlined the advantages of A-label
>> display in error conditions in other messages in this thread.
>> However, if it turns out that this change allows us to display
>> the vast majority of used IDNs, there may be a case for some
>> more severe error condition when we hit one we think we
>> shouldn't. Is that what you are suggesting?
>
> Approximately.  Again, I'd like to see the user be able to
> override any warning/threatening choices you make, but only if
> you can also give her comprehensive information if she is smart
> enough to ask.  In the context of the remarks above and Vint's
> comments of some days ago, I'd much rather see:
>
>    Probably really bad domain
>       URI with A-labels:  xxxxxxxx
>       IRI with U-labels or placeholders: xxxxxx
>       Unicode code point list for IDN Labels: xxxxxxx
>    -------------          -----------------
>    | Forget It |          | Open it       |
>    | Button    |          | Anyway Button |
>    | (large)   |          | (tiny)        |
>    -------------          -----------------
>
> than some cryptic "Evil!" message with an "Ok" or "Dismiss" box
> whose status wrt the next step(s) is unclear.
>
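The three lines of information in the dialog sketched above (A-labels,
U-labels, code point list) could be produced roughly as follows.  This
is a hedged sketch, not a conforming IDNA implementation: it uses
Python's built-in "punycode" codec directly and skips all IDNA
validation and mapping, and the helper names (to_a_label, to_u_label,
describe) are assumptions made for illustration.

```python
def to_a_label(label: str) -> str:
    """Encode one label as an A-label; all-ASCII labels pass through unchanged."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def to_u_label(label: str) -> str:
    """Decode one A-label to its Unicode form; other labels pass through."""
    if label.lower().startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label


def describe(fqdn: str) -> dict:
    """Collect the three views of a domain name discussed in the thread."""
    labels = fqdn.split(".")
    return {
        "a_labels": ".".join(to_a_label(l) for l in labels),
        "u_labels": ".".join(to_u_label(l) for l in labels),
        "code_points": [
            " ".join(f"U+{ord(ch):04X}" for ch in to_u_label(l)) for l in labels
        ],
    }


info = describe("b\u00fccher.example")
print(info["a_labels"])        # xn--bcher-kva.example
print(info["code_points"][0])  # U+0062 U+00FC U+0063 U+0068 U+0065 U+0072
```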
>
>>...
>> We do actually carry around our own rendering routines:
>> http://en.wikipedia.org/wiki/HarfBuzz#HarfBuzz
>>
>> We don't carry around our own fonts. But what point are you
>> making here? If we find (somehow) that we are unable to
>> display a U-label correctly, we should do something else other
>> than what happens by default now, which I suspect is little
>> boxes with numbers in? (Although I haven't checked - can you
>> point me at a test domain name which uses highly obscure
>> Unicode characters?)
>
> I'll look or try to make one.  Others may be able to help here.
> If the numbers in the little boxes are 4-6 digit Unicode code
> points, I'd be happiest seeing those along with the A-labels.
> If they are, e.g., escaped UTF-8, I'd want to see, or easily get
> at, the Unicode code point list and the A-labels.
>
>>> A-label display or something else.   In other words, at least
>>> in the proposal, I'm trying to get you to separate
>>> "identification of a label that deserves worrying about" (or
>>> "identification of a label which is safe" with all others
>>> defaulting into "worry about") from what you do about that
>>> label or the FQDN of which it is part.
>>
>> OK. If currently we have:
>>
>>                     Can Display          Can't Display
>>
>> Worrisome Label    A-label              A-label
>>
>> Good Label         U-label              Little boxes
>>
>> How would you have us modify that matrix? (This assumes for
>> the sake of argument that it's possible to distinguish between
>> the bottom two boxes.)
>
> For the top two cells, I'd like to be able to access (with
> right-click or some other mechanism) the U-label and code point
> list.  For the bottom pair, I'd like to be able to access (ditto)
> the A-label and code point list.   It would also suit me just
> fine to be able to get at the A-label, the U-label, and the code
> point list for all four cells, even though one of those fields
> will be redundant for each cell.   For the user who doesn't know
> that he should right-click the cells, or who knows and doesn't
> care, the display set above is just fine.
>
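The matrix and the suggested modification can be written down as a
small decision table.  The sketch below is only a restatement of the
two messages above in Python; the function names are hypothetical and
the "worrisome" and "can_render" inputs are assumed to come from
checks that are not shown.

```python
def default_display(worrisome: bool, can_render: bool) -> str:
    """The current matrix: what is shown without any user action."""
    if worrisome:
        return "A-label"
    return "U-label" if can_render else "little boxes"


def available_on_request(worrisome: bool) -> list[str]:
    """The suggested addition: what a right-click (or similar) would reveal."""
    other_form = "U-label" if worrisome else "A-label"
    return [other_form, "code point list"]


for worrisome in (True, False):
    for can_render in (True, False):
        print(
            f"worrisome={worrisome!s:5} can_render={can_render!s:5} "
            f"default={default_display(worrisome, can_render):12} "
            f"on request={available_on_request(worrisome)}"
        )
```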
>>...
>>> (2) It seems to me that one of the problems with any strategy
>>> that tries to decide between "safe" names and ones that need
>>> special treatment on any basis other than experience with, or
>>> the reputation of, the particular domain or site is doomed to
>>...
>> To clarify: you are suggesting that sites with a cert from a
>> trusted CA should get U-labels regardless?
>>
>> CA certificates are about identity, not about "good-ness" or
>> honesty. I have, coincidentally, just been arguing in another,
>> CA-related, forum that they should by no means put
>> restrictions on which domain names people can get certificates
>> for, because deciding which domain names should and shouldn't
>> exist is the job of registries, and once they've done that, a
>> CA should go along with it. It should not be possible to
>> register a domain and yet not be able to get a cert for it
>> because the registry is fine with the name you pick but the CA
>> isn't. I can imagine domain owners being quite put out if that
>> could/did happen.
>
> Let me try a different point of view out on you.  Even accepting
> your "identity" definition, a CA can "identify" anything from
> something associated with the domain (remember that, independent
> of ability to obtain a domain, registries and registrars differ
> widely in how much information needs to be provided, verified,
> and/or exposed to do so).  At least in this context, I don't have
> a problem with someone obtaining a domain under an alias, with
> faked or hidden contact information, if the registry permits
> that... and getting a certificate to match.  But, in that case,
> the "identity" that is being certified is really very thin and
> uncertain.  And, fwiw, your users are probably much more at risk
> from an "almost no one knows who is operating this domain and
> those who know aren't telling" situation than they are from a
> suspect IDN (realistically, the two cases go together) and it
> would be at least as reasonable for you to warn about domains
> with anonymous/ proxy registrations (or registries that permit
> them) as it is to worry about sloppy registry IDN policies.
>
> At the other extreme, a CA can choose to issue an "identity"
> certificate only if it can actually verify the identity of the
> registrant (not just, e.g., reachability). Note that, in
> permitting descriptions of certificate applicability, X.509
> recognizes the difference as do a number of CAs who issue
> certificates with different levels of quality and assurance.
>
> I am suggesting only that folks who have certificates that you
> can recognize as providing a high level of actual identity
> assurance and authentication may be entitled to a little more
> positive treatment than domains whose certificates indicate only
> that the entity involved was able to obtain a domain.  I'm not
> suggesting treating the latter badly, only that, as your policy
> moves down a path in which someone can get normal, U-label
> display, for their IDNs by any of several mechanisms, it may be
> worth considering high-quality / high-assurance identity
> certificates as another one of those mechanisms.
>
>
>>> (3) Personally, I'd add a user-specific FQDN (not TLD)
>>> whitelist to that list of hard exceptions.
>>
>> Users can technically add TLDs to the list. (I don't think
>> they can add FQDNs.)
>
> The thing I'm circling around is that we already see
> www.KnownGoodGuy.SuspectDomain and see it frequently.  We may
> see it even more with ICANN's new TLD program.  There ought to
> be ways for those <known good guys> who find themselves in TLDs
> that haven't made you happy, to get normal display.  The
> certificate idea above is one way, user-supplied FQDN
> whitelisting (by bookmarks or otherwise)  might be another.  To
> make the problem here more clear, assume that "mozilla" required
> a non-ASCII character to write.  www.mozilla.org would still
> work because you have whitelisted ORG and PIR.  But, if some
> user types in "www.mozilla.com", you'd really like them to see
> native characters even though COM and VGRS are not whitelisted.
>
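A sketch of how a user-supplied FQDN whitelist might sit alongside the
TLD whitelist, following the mozilla.org / mozilla.com example above.
Everything here is hypothetical: the set contents, the function name
shows_u_labels, and the idea that the FQDN list is consulted first are
assumptions for illustration, not a description of Firefox behaviour.

```python
# Hypothetical whitelist contents, mirroring the example in the text.
TLD_WHITELIST = {"org"}
USER_FQDN_WHITELIST = {"www.mozilla.com"}


def shows_u_labels(
    fqdn: str,
    tld_whitelist: set = TLD_WHITELIST,
    fqdn_whitelist: set = USER_FQDN_WHITELIST,
) -> bool:
    """Return True if the name would get native (U-label) display."""
    name = fqdn.lower().rstrip(".")
    # A name the user has explicitly approved wins regardless of its TLD.
    if name in fqdn_whitelist:
        return True
    # Otherwise fall back to the per-TLD decision.
    tld = name.rsplit(".", 1)[-1]
    return tld in tld_whitelist


print(shows_u_labels("www.mozilla.org"))  # True: TLD is whitelisted
print(shows_u_labels("www.mozilla.com"))  # True: user whitelisted this FQDN
print(shows_u_labels("www.example.com"))  # False: neither list matches
```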
>>> In order to avoid even more
>>> databases/ tables, suppose you checked a string to be
>>> displayed against the user's bookmark list _and_ created some
>>> extra warning and explanation if the user decided to bookmark
>>> a string that you considered suspicious.  If the user decided
>>> to bookmark the site despite those warnings/ explanations,
>>> maybe you should believe that he or she knows what they are
>>> doing and get out of the way.
>>
>> I'm not sure enough users use bookmarks in the "traditional"
>> way like that (or at all) for it to be a good idea to bake
>> them into the security strategy.
>
> Don't know.  FWIW, I feel the same way about using email address
> lists as a part of a security strategy -- but the approach seems
> to be hugely popular.
>
>>> (4) While I think the new classification data for "Common" and
>>> "Inherited" scripts that Mark identifies will be useful,
>>> especially to registries who are trying to make intelligent
>>> decisions about what to accept, I want to caution against
>>> doing anything at lookup time that is dependent on an inferred
>>> language.
>>
>> I tried to write the document in terms of "script", not
>> "language" - I understand the perils of trying to infer the
>> latter. Are you suggesting I haven't succeeded? If so, point
>> me at the broken bits.
>
> I think your text is fine.  The problem is that one can't get
> very far with labels in some scripts without including some
> "Common" or "Inherited" characters.  But, for a given label,
> which characters from those groups should be treated as part of
> the same script as the other characters in the label is going to
> depend at least on script (e.g., for Latin Script, some Common
> or Inherited characters are going to be appropriate and should
> not result in the label being treated as mixed-script, but
> others are not).  However, if you wanted a high-quality test,
> the answer to "which Common and/or Inherited characters should
> be permitted with the other characters in this label (without
> getting the label flagged as mixed-script)" requires language
> knowledge.  I don't think you should go there but, if you don't,
> everyone needs to understand that the test being made is fairly
> weak.
>
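To make the weakness of a script-only test concrete, here is a
deliberately crude sketch.  Python's standard library does not expose
the Unicode Script property, so this guesses the script from the first
word of each character's name and treats everything else (digits,
hyphen, combining marks) as if it were Common or Inherited.  That
heuristic, the KNOWN_SCRIPTS set, and the function names are all
assumptions for illustration; a real implementation would use the
Unicode script data.

```python
import unicodedata

# Small illustrative set; not a complete list of scripts.
KNOWN_SCRIPTS = {"LATIN", "CYRILLIC", "GREEK", "ARMENIAN", "HEBREW", "ARABIC"}


def scripts_in(label: str) -> set:
    """Guess the scripts used in a label from character names (heuristic)."""
    found = set()
    for ch in label:
        first_word = unicodedata.name(ch, "").split(" ")[0]
        if first_word in KNOWN_SCRIPTS:
            found.add(first_word)
        # Anything else is silently accepted, as Common/Inherited would be.
    return found


def looks_mixed_script(label: str) -> bool:
    """Flag a label only when more than one known script appears in it."""
    return len(scripts_in(label)) > 1


print(looks_mixed_script("abc-123"))       # False: Latin plus accepted extras
print(looks_mixed_script("p\u0430ypal"))   # True: Cyrillic U+0430 among Latin
print(looks_mixed_script("e\u0301"))       # False: combining accent is accepted
```

Note that every character outside the known scripts is let through
without question, which is exactly the gap described above: deciding
which of those characters belong with a given label would need
knowledge this test does not have.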
>>> (5) One of the disadvantages of going to A-label display is
>>> that, while A-labels provide a good clue that something
>>> unusual is going on, users may vary widely as to whether that
>>> is taken as a warning sign or just one of many
>>> incomprehensible things that happens on the Internet.  If the
>>> user, grandmother or not, is using the network by rote and
>>> with the assumption that a great deal of it is just magic,
>>> then it is possible that any A-label is confusable with any
>>> other A-label.  As long as A-label display is rare and the
>>> user never has the experience of having an intended and safe
>>> FQDN displayed in A-labels, that is probably ok.  But, if
>>> A-label display becomes common and some of the labels thus
>>> displayed are actually associated with reasonable domains and
>>> safe sites, the warning value of that technique will
>>> deteriorate significantly unless users can remember which
>>> A-labels have been visited before and are ok.  I don't think
>>> we can count on that.
>>
>> I agree that seeing A-labels should be a rare thing, and
>> perhaps one of the downsides of the current implementation is
>> that users might see them more often than one would like. I
>> hope the new proposal will reduce the incidence of this. Given
>> that, I can't quite see your point?
>
> Depending on where people browse and whether serious
> confusable-IDN-based attacks actually become popular, I don't
> think your new proposal is going to reduce the frequency with
> which people see A-labels enough to make a big difference.
>
>
>>...
>
> best,
>   john