Visually confusable characters (6)

Asmus Freytag asmusf at
Sun Aug 10 21:49:34 CEST 2014

On 8/9/2014 10:48 AM, John C Klensin wrote:

I thought it best to reply to your points individually, as some
branches of the discussion are probably not going to be as deep.

As you wrote them "in no particular order", I'm going to respond
to them in the same way.

This message responds to point (6).

> (6) This discussion has anything to do with visual confusion
> among characters from separate scripts that have similar
> appearances.
> That is an important issue.  It just isn't this problem and,
> with the exception of identifying a few characters that involve
> very high risks of perceptual conflicts with common syntax
> characters (especially in domain names and URIs) and trying to
> figure out how to handle them (some have been DISALLOWED, others
> treated as CONTEXTO rules), IDNA does not address that issue.


I'm going to take this statement:
> The reality is that it is always a tradeoff among the importance
> of the characters involved and usability of labels if they are
> excluded, the risk of either accidental or malicious confusion,
> and the likely costs associated with those risks.  Those
> tradeoffs can be sensibly be assessed only on a label by label
> and zone by zone basis.  But, again, different problem.

and hold you to it. Because, I believe, it applies within a zone and
within a script.

The tradeoff you mention is between usability (or its obverse, called
"linguistic harm or damage" in this discussion) and the risk (and costs)
of confusion.

If you took your conclusion, that these tradeoffs can only be addressed
on a case-by-case basis, to the bank, then you would appear to have
conclusively argued against the attempt to use an IDNA2008 repertoire
review to solve the current issue.

In the case of the particular code point, for example, as with
structurally identical or similar instances of singletons that are
not decomposable into what appear to be homograph sequences,
the likelihood of accidental confusion is often low. Where a sequence
exists, it is often one that does not have status in an orthography,
whereas the singleton does.

The cost of DISALLOWING a singleton and elevating a sequence
that has no status is, on the other hand, rather high.

There are some instances where the reverse is true, for example
with digraphs. Unicode encodes many digraphs for special technical
use, even though they are homographs for sequences of ordinary
letters. As part of the work for the Root Zone project we have
identified a number of them (in the Latin script, see U+02A3-U+02AB).
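
To make the digraph case concrete, here is a minimal Python sketch (an
illustration only, not part of the Root Zone work). It checks that the
code points in the range cited above carry no decomposition mapping and
are left unchanged by NFC and NFKC, so normalization alone never folds
them together with the look-alike letter sequences. The helper names
survives_normalization and report are hypothetical, and the result
depends on the Unicode data shipped with the Python build in use.

```python
import unicodedata

# Range cited in the message: Latin digraphs U+02A3..U+02AB.
DIGRAPHS = [chr(cp) for cp in range(0x02A3, 0x02AB + 1)]


def survives_normalization(ch: str) -> bool:
    """True if ch has no decomposition mapping and is unchanged by NFC and NFKC."""
    return (
        unicodedata.decomposition(ch) == ""
        and unicodedata.normalize("NFC", ch) == ch
        and unicodedata.normalize("NFKC", ch) == ch
    )


def report():
    """Return (code point, character name, survives normalization) per digraph."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), survives_normalization(ch))
        for ch in DIGRAPHS
    ]


if __name__ == "__main__":
    for code_point, name, unchanged in report():
        print(code_point, name, "unchanged" if unchanged else "folded")
```

If every row reports "unchanged", the homograph relationship between
each digraph and its letter sequence has to be caught by some other
mechanism, such as a confusables or variant table maintained per zone.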

All these instances (and the ones cited by others in this discussion
involving overlay diacritics) are unaddressed in IDNA2008, resulting
in a devolution of the issue to other forms of confusables handling.

Given that, one would expect that elevating this current instance
into a proposed update to IDNA itself would result in a marked improvement
of the situation, if not for all IDNs, then at least for Arabic.

That does not seem to be the case, and, in fact, the resulting tradeoff
is less than optimal from the linguistic cost side (given that the
sequence appears to be rather specialized and inaccessible to
ordinary users, while the singleton is not, within its language
context - see reply to item (7)).

