IDNA and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))

Mark Davis ☕️ mark at
Tue Jan 27 12:22:59 CET 2015

On Mon, Jan 26, 2015 at 7:30 AM, Asmus Freytag <asmusf at> wrote:

> Occasionally, because of legacy, occasionally for other reasons, Unicode
> has encoded identical shapes using multiple code points (homographs). A
> homograph pair can be understood as something that if both partners were
> rendered in the same font, they would (practically always) look identical. *Not
> similar, identical.*

​> Not similar, identical.​

​I'm in general agreement with Asmus's points, except that I don't think
this strict a definition is productive. Suppose that character X and
character Y normally look the same in the same fonts, but in 10% of the
fonts they look similar (similar enough to be confusing) but not identical.
Does that mean they are not "homographs"? Does it mean that these "near
homographs", as you might call them, do not represent essentially the same
problem for confusability as your "strict homographs", which are
"(practically always)" identical in all fonts?

That is, I don't think a distinction between "accident" and "intent" is
useful when it comes to confusability. Based on spoofing and spamming data
I've seen at Google, the characters don't even have to be identical 90% of
the time. At body text sizes, the human eye sees what it expects: many
dissimilarities are glossed over, like between r + n and m, or even much
more different ones.
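
As a rough illustration of the "r + n versus m" point, here is a toy
sketch of comparing strings by a "skeleton" in which look-alike characters
and sequences are folded together. The mapping tables and the function
names (skeleton, looks_alike) are hypothetical and hand-picked for this
example; real confusable data is far larger, and this is not meant to
reproduce any standard algorithm.

```python
# Toy confusability check. The tables below are illustrative assumptions,
# not real confusable data.
CONFUSABLE_CHARS = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A looks like Latin "a"
    "\u043e": "o",  # CYRILLIC SMALL LETTER O looks like Latin "o"
    "1": "l",       # digit one vs. lowercase L
    "0": "o",       # digit zero vs. lowercase O
}
CONFUSABLE_SEQUENCES = {
    "rn": "m",      # "r" + "n" reads as "m" at body text sizes
    "vv": "w",
}


def skeleton(label: str) -> str:
    """Fold a label to a crude visual skeleton (hypothetical helper)."""
    s = label.lower()
    # First fold single characters, then multi-character sequences.
    s = "".join(CONFUSABLE_CHARS.get(ch, ch) for ch in s)
    for seq, repl in CONFUSABLE_SEQUENCES.items():
        s = s.replace(seq, repl)
    return s


def looks_alike(a: str, b: str) -> bool:
    """True if two labels collapse to the same skeleton."""
    return skeleton(a) == skeleton(b)
```

With these assumed tables, looks_alike("corner", "comer") is true even
though the two strings share no unusual characters at all, which is the
kind of case that no code-point-level rule about Arabic letters would
catch.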

(And as to Pete's question #2, there are a number of similar cases to
U+08A1 in Arabic, because the encoding model was somewhat different than
for most scripts, for historic reasons. But fundamentally the naming
similarities are not really relevant to users: for them the visible
appearance is the key, not whether a formal Unicode name that they will
never see has "WITH" in it.)
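
For readers who want to see the U+08A1 case concretely, the following
sketch uses Python's standard unicodedata module to compare the single
code point with the sequence BEH + HAMZA ABOVE under NFC. The helper name
canonically_equivalent is made up for this example, and the result depends
on the Unicode version shipped with the Python in use (it needs one that
includes U+08A1).

```python
import unicodedata

BEH_WITH_HAMZA = "\u08A1"        # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_PLUS_HAMZA = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

ALEF_WITH_HAMZA = "\u0623"       # ARABIC LETTER ALEF WITH HAMZA ABOVE
ALEF_PLUS_HAMZA = "\u0627\u0654" # ARABIC LETTER ALEF + ARABIC HAMZA ABOVE


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The ALEF pair is unified by normalization; the BEH pair is not,
# because U+08A1 has no canonical decomposition.
alef_pair_equal = canonically_equivalent(ALEF_WITH_HAMZA, ALEF_PLUS_HAMZA)
beh_pair_equal = canonically_equivalent(BEH_WITH_HAMZA, BEH_PLUS_HAMZA)
```

Normalization treats the two pairs differently even though a user sees
the same kind of shape in both, so the formal properties of the code
points say little about what is visually confusable.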

But the strictness of Asmus's definition is a small point, compared to the
main point of his message.

People who think that this problem is simple, and can be completely handled
at the protocol level, are simply not familiar enough with the problem
space. The whole discussion of U+08A1 is a very small corner case: a
minuscule fraction of the issues involved. I wish that the people who get
all fired up about it would talk to security experts to find out what sorts
of characters, in practice, *do* represent confusability issues: U+08A1 and
related characters would not even be on the radar screen.

As you say, those kinds of issues are best solved by higher-level
protocols; it is simply infeasible to do more than nibble at the edges with
the low-level protocol. Trying to do so just gives people a false sense
that they are solving the problem.

Mark

*— Il meglio è l’inimico del bene —*

More information about the Idna-update mailing list