IDNA 2008 security

Mon Dec 22 17:03:15 CET 2008

--On Monday, 22 December, 2008 14:36 +0100 Patrik Fältström
<patrik at frobbit.se> wrote:

> On 2 dec 2008, at 01.21, Dick Sites wrote:
>...
>> There is nothing in this draft that addresses known real-world
>> phishing exploits and disallows them. That seems like a truly
>> unfortunate oversight. Specifically, "paypal" spelled with
>> one or more Cyrillic lookalike-a characters is allowed. Yet
>> all the mechanism is in place to require U+0430, etc. only to
>> be used in a Cyrillic script label.
> 
> See the other documents.

While I'm really sympathetic about this, there are several issues
with trying to make global rules about mixed-script labels (or
even, more generally, with trying to make rules to prevent
phishing when there are no other reasons for the same rules).
For example:

(i) What is, and is not, look-alike, is a very subjective
business.    Especially given centuries of use of artistic and
fanciful fonts in many writing systems, the human eye will often
see what it expects to see, even when two characters don't look
very much alike when set side by side.  As a handy example
(although IDNA2008 disallows both for other reasons), consider
"Λ" (Greek Upper Case Lambda, U+039B) and "A" (Latin Upper Case
A, U+0041).  Clearly, they are different when set side by side.
But, to a reader who is not familiar with Greek and who is
expecting Latin, maybe the first is just a clever font.  And,
more important, to someone who is not familiar with either
script, it may not be clear whether the horizontal bar is
important or just decoration.

(ii) While Unicode is clear about the script into which each
letter falls, there is a question about "script in practice"
that may depend on local culture or conventions, rather than on
historical definitions of scripts or origins of characters.  For
example, many people have insisted that Romaji is as much a
part of the modern Japanese writing system as Kana and that
banning mixing them in a label would just make no sense.  We
have also been told that, for computer use, Cyrillic and Latin
mixtures are common and conventional practice, are needed for
identifiers, and need to be treated as a single script. I don't
particularly like it, but I think we are in dangerous territory
if we start banning character combinations, combinations that
registries claim they need, based on perceptions of things
looking alike.  (That local perception about the group of
characters that make up a "script in practice" is the reason for
the comment about perceived scripts that Mark objected to in a
recent note.)  On the other hand, one could make a case that,
since many characters (percentage-wise) in Greek, Latin, and
Cyrillic scripts are visually similar and represent very similar
phonemes, the three scripts should have been "unified" in
Unicode in the same way that CJK was.  Had that been done
based on character glyph shapes alone, the particular
Cyrillic-Latin phishing issue you use as an example would not
exist (although there would be some other problems).

>> Even better would be an inclusion-based  approach that only
>> allows change of script at a hyphen. Legitimate domain owners
>> could then prevent an entire class of phishing by not using
>> hyphen in their actual labels, while domain owners who want
>> foo-бар or Фу-bar   can do
>> so. Hyphen would be enough of a clue for some users in that
>> something unusual might be going on, and would allow only
>>  p-а-yp-а-l
>> for use of two Cyrillic letters intermixed with Latin
>> letters. This enforced simple rule could perhaps replace
>> several of the current more-specialized context rules.

This is an interesting idea and a variant on some that have been
suggested in the past.  There are a couple of problems with it.
One is that, as one moves away from European scripts, the odds
of having a hyphen on a keyboard go down, even though some
mixed-script combinations may still be convenient.   Hyphen
itself raises some issues when used with right-to-left scripts.
Of course, if one merely needs to visually inspect a link and
then click on it, that may not be a problem (or help, depending
on how you look at it).  One could solve that problem by
defining a collection of other punctuation characters that would
be considered as hyphens, but that creates other problems.  
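
For concreteness, the quoted "change of script only at a hyphen"
rule might be sketched as below.  This is only an illustration,
not anything specified in the IDNA2008 drafts.  The function names
are hypothetical, and the script of a character is approximated
from the first word of its Unicode character name, which is a
rough stand-in for the real Unicode Script property.

```python
import unicodedata


def rough_script(ch):
    """Approximate a script from the first word of the Unicode name.

    Assumption: this is a heuristic, not the Unicode Script property.
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def scripts_change_only_at_hyphen(label):
    """Return True if each hyphen-separated segment uses one script."""
    for segment in label.split("-"):
        # Only letters are compared; digits are ignored in this sketch.
        scripts = {rough_script(ch) for ch in segment if ch.isalpha()}
        if len(scripts) > 1:
            return False
    return True


# "paypal" written with U+0430 (Cyrillic a) in place of Latin "a".
spoof = "p\u0430yp\u0430l"
# The only form of that mixture the proposed rule would permit.
hyphenated = "p-\u0430-yp-\u0430-l"

print(scripts_change_only_at_hyphen("paypal"))
print(scripts_change_only_at_hyphen(spoof))
print(scripts_change_only_at_hyphen(hyphenated))
```

Note that the sketch treats only U+002D as a hyphen, so it does
nothing about the keyboard and right-to-left concerns above.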

>> The oversight suggests that this draft is just a collection of
>> rules and not a serious effort to improve security on the web.

The bottom line is that we've concluded that character
combinations that are specifically phishing issues should be
dealt with by registries, who presumably know what they are
doing with scripts they choose to support, and by application
implementers who can warn people against hazardous combinations
(and potentially against registries who persistently permit
registration of strings that have no real value other than to
create phishing opportunities).  The decision to eliminate
mappings from the protocol very significantly reduces phishing
opportunities and permitting some characters only with
contextual rules eliminates many others (e.g., permitting ZWNJ,
but only in specific contexts, is actually safer than permitting
it and then discarding it).   But both of those decisions are
supported by reasoning other than visual confusion alone. 
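
As a rough illustration of what a registry-side check could look
like, the sketch below tests a label against the set of scripts
that a registry treats as a single "script in practice".  The
policy names and their script sets are invented for the example
and do not describe any real registry.  Script detection is again
approximated from Unicode character names, which is an assumption
made for brevity rather than a correct use of the Script property.

```python
import unicodedata

# Hypothetical policies: scripts each registry accepts in one label.
REGISTRY_POLICIES = {
    "example-jp": {"LATIN", "HIRAGANA", "KATAKANA", "CJK"},
    "example-latin-only": {"LATIN"},
}


def rough_script(ch):
    """Approximate a script from the first word of the Unicode name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def label_allowed(label, policy):
    """Return True if every letter in the label is in an allowed script."""
    allowed = REGISTRY_POLICIES[policy]
    return all(rough_script(ch) in allowed for ch in label if ch.isalpha())


# Latin mixed with Kanji passes under the hypothetical Japanese policy.
print(label_allowed("tokyo\u6771\u4eac", "example-jp"))
# "paypal" with U+0430 (Cyrillic a) fails under a Latin-only policy.
print(label_allowed("p\u0430yp\u0430l", "example-latin-only"))
```

The point of keeping the script sets in per-registry data is that
the same label can be reasonable under one policy and hazardous
under another, which a single global rule cannot express.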

These decisions were the result of explicit (and quite lengthy)
discussion, not an "oversight".

     john