IDNA and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))

Tue Jan 27 22:16:48 CET 2015

Hi,

The cc: list here has gotten rather long, so I've set a Reply-To to
the idna-update list.  I hope it comes through.

On Tue, Jan 27, 2015 at 12:22:59PM +0100, Mark Davis wrote:

> this strict a definition is productive. Suppose that character X and
> character Y normally look the same in the same fonts, but in 10% of the
> fonts they look similar—similar enough to be confusing—but not identical.
> Does that mean they are not "homographs"? Does it mean what these—what you
> might call "near homographs"—do not represent essentially the same problem
> for confusability as your "strict homographs", that are "(practically
> always)" identical in all fonts?
> 
> That is, I don't think a distinction between "accident" and "intent" is
> useful when it comes to confusability. Based on spoofing and spamming data
> I've seen at Google, the characters don't even have to be identical 90% of
> the time. At body text sizes, the human eye sees what it expects: many
> dissimilarities are glossed over, like between r + n and m, or even much
> more different ones.

I think this observation (at least for IDNA, but see more below) may
be a useful guide for what to do, but there is still an important
difference in what we're talking about.

In any case where there's "more than one way" to enter "the same"
character, and that more than one way is language-sensitive, and there
is not a canonical equivalence, then that is importantly different
from "awfully similar almost all the time".  For example, in the case
that got us started on this, _beh_ with _hamza_ above (U+08A1 versus
U+0628 U+0654), you really do have to know whether Fula is implicated
to know which of the forms is the one that is important.
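
To make the "no canonical equivalence" point concrete, here is a
minimal Python sketch.  It is only an illustration: the helper name
nfc_equal is made up, and it assumes a Python whose unicodedata tables
include Unicode 7.0 or later.  It compares the _alef_ case, where the
combining sequence and the precomposed letter are canonically
equivalent, with the _beh_ case, where they are not.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Return True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# ALEF (U+0627) + HAMZA ABOVE (U+0654) is canonically equivalent to U+0623,
# so normalization folds the two spellings together.
alef_pair = nfc_equal("\u0627\u0654", "\u0623")

# BEH (U+0628) + HAMZA ABOVE (U+0654) has no canonical equivalence with
# U+08A1, so normalization leaves them as two distinct strings.
beh_pair = nfc_equal("\u0628\u0654", "\u08A1")

print(alef_pair, beh_pair)  # True False
```

So a comparison that relies on normalization alone would treat the two
_beh_ forms as different identifiers, even though they look the same.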

For most cases, maybe this doesn't matter, because you have enough
linguistic evidence to make a guess.  But for context-free cases,
particularly in the global environment, it certainly does matter.
It's not the _only_ thing that matters, and I'm not trying to suggest
that the cases Mark is drawing out are unimportant.  But it does
appear that, for identifiers, we've run into a feature that is
somewhat surprising to some of us, and that has some rather nasty side
effects.

I should point out, as well, that this is not only domain names at
issue, or things in the root zone or even very near the root zone.
This is a general problem for identifiers, one important case of which
happens to be domain names.  Email addresses have this problem too.
User names. And so on.  _Worse_, in some protocols, there is no
registration authority, so the tricks that are sort of working for the
root zone and maybe for one layer down won't automatically work
elsewhere.

IMO, it's that much wider context that has led the IAB to put out the
statement it just put out
(https://www.iab.org/documents/correspondence-reports-documents/2015-2/iab-statement-on-identifiers-and-unicode-7-0-0/),
because this has rather wide implications for IETF identifiers
(including the stuff that's been worked on in PRECIS).  It's plain
that most of this information was perfectly obvious to UTC members and
people in a very close relationship to the Standard.  I will tell you,
however, that I've read the entire standard at least once and several
sections quite closely several times, and it still took me quite a bit
of time to fully digest what we just understood.  I am by no means an
expert like many of you, but I'm not a total newbie.  

> People who think that this problem is simple, and can be completely handled
> at the protocol level, are simply just not familiar enough with the problem
> space.

I, at least, do not think the problem is simple.  The question is not,
"Can this be completely handled?" but instead, "What can we do, given
that we need identifiers."  I am not relishing the prospect of
evaluating every code point, partly because I don't think we can do it
reliably anyway.

> I wish that the people who get all fired up about U+08A1
>  would talk to security experts to find out what sorts of characters—in
> practice—*do* represent confusability issues:

That's a false alternative, and I don't think it's in any way
reasonable.  We don't argue, "Car accidents cause lots of deaths, so
deaths from influenza aren't important."

-- 
Andrew Sullivan
ajs at anvilwalrusden.com