Visually confusable characters (5)

Sun Aug 10 22:15:22 CEST 2014

On 8/9/2014 10:48 AM, John C Klensin wrote:

John,

I thought it best to reply to your points individually, as some
branches of the discussion are probably not going to be as deep.

As you wrote them "in no particular order", I'm going to respond
to them in the same way.

This message responds to point (5).

A./

> (5) Normalization is sufficient to produce equality comparisons
> in characters that are identical in form within the same script,
> type style, etc.
>
> When the IDNA WG read what seemed to be the relevant sections of
> the Unicode Standard and corresponding UAXs and UTRs, and was
> advised about them by various people very close to Unicode
> standardization, we made some inferences about both the use and
> effects of normalization and future plans about new code points
> within a script that were related to older ones.  Obviously a
> wrong conclusion or several of them.
>

Using text as an identifier (as opposed to IP addresses) drags with
it all the messy, conflicting, and sometimes self-inconsistent ways
in which communities use text.

Sometimes that is expressed as being a problem of "Unicode",
when in fact it is not.

Normalization Form C was designed primarily to fold the "dual
encoding" (precomposed and decomposed spellings of the same
character) that was forced upon Unicode as a criterion for
acceptance by the wider technical and user community.

It is emphatically not, and has never been, a "confusables" folding,
or even a folding of visually equivalent characters.
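
As a minimal sketch of that distinction, assuming Python's standard
unicodedata module: the sample characters and the helper name
nfc_equal are chosen here purely for illustration and do not come
from the IDNA or Unicode specifications. NFC makes the two encodings
of "e with acute" compare equal, but it leaves Latin "a" and
Cyrillic "a" distinct even though many fonts draw them identically.

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after applying Normalization Form C."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Dual encoding: one character, two ways to spell it in code points.
precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# Cross-script look-alikes: different characters, similar glyphs.
latin_a = "\u0061"       # LATIN SMALL LETTER A
cyrillic_a = "\u0430"    # CYRILLIC SMALL LETTER A

print(nfc_equal(precomposed, decomposed))  # True: dual encoding is folded
print(nfc_equal(latin_a, cyrillic_a))      # False: confusables are not
```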

Normalization Form KC had as its design point the folding of
additional instances, mainly caused by Asian standards where short
sequences of Latin or Kana letters are often represented in a single
"cell" and therefore have a singleton encoding. These are weaker
instances of dual encoding: applying NFKC does throw away information
that cannot be regenerated by context or font selection, but it would
be, in principle, possible to carry it out of band (in rich text). In
practice, it is not clear that implementations actually allow this.

In many instances, NFKC foldings are more aggressive than needed
for mere visual folding - for example, clustered kana do not look
like strung out kana.
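
A small sketch of this, again assuming Python's unicodedata module
and with the two sample code points picked only as illustrations:
U+3314 (SQUARE KIRO) packs two kana into one cell and U+3392
(SQUARE MHZ) packs three Latin letters into one cell. NFC leaves
both alone, while NFKC expands them into ordinary letter sequences
that look quite different from the clustered originals.

```python
import unicodedata

square_kiro = "\u3314"  # SQUARE KIRO: katakana KI + RO in one cell
square_mhz = "\u3392"   # SQUARE MHZ: Latin "MHz" in one cell

for ch in (square_kiro, square_mhz):
    nfc = unicodedata.normalize("NFC", ch)
    nfkc = unicodedata.normalize("NFKC", ch)
    # NFC keeps the single-cell character; NFKC replaces it with the
    # spelled-out sequence, discarding the "clustered" presentation.
    print(f"U+{ord(ch):04X}: NFC has {len(nfc)} code point(s), "
          f"NFKC has {len(nfkc)} code point(s): {nfkc}")
```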

So, in a sense, if a rigorous confusables folding were desired, then
neither of the normalization forms is complete or appropriate.

On the other hand, I observe that, for example for the Arabic
script, the debate is not finished on what a suitably robust
confusables folding would be. Arabic has letters that share forms
in some positions in the word and differ when placed in other
positions. There's still debate whether these letters should be
treated as confusables only in those positions, or always.

The latter is more robust, especially as adherence to the
nominal differences in shape is not universal across
typefaces (and the practice among users can involve typing the
"wrong" code point as long as it looks OK).
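
The two policies can be sketched in Python as follows. Everything
here is a hypothetical illustration, not a proposed rule set: the
one-entry fold (Farsi yeh, U+06CC, onto Arabic yeh, U+064A), the
function names, and above all the position test, which simply
treats the last character of a label as "final" and ignores the
real Arabic joining rules. The pair is chosen because the two
letters share their initial and medial forms but nominally differ
in final and isolated form.

```python
ARABIC_YEH = "\u064A"  # ARABIC LETTER YEH
FARSI_YEH = "\u06CC"   # ARABIC LETTER FARSI YEH

def fold_always(label: str) -> str:
    """Policy 'always': fold Farsi yeh onto Arabic yeh everywhere."""
    return label.replace(FARSI_YEH, ARABIC_YEH)

def fold_positional(label: str) -> str:
    """Policy 'positional': fold only where the shapes coincide.

    Simplifying assumption: any occurrence that is not the last
    character of the label is in initial or medial form.
    """
    last = len(label) - 1
    return "".join(
        ARABIC_YEH if ch == FARSI_YEH and i != last else ch
        for i, ch in enumerate(label)
    )

# Yeh in medial position: both policies treat the labels as equal.
medial_a, medial_b = "\u0628\u064A\u062A", "\u0628\u06CC\u062A"
# Yeh in final position: only the 'always' policy folds them together.
final_a, final_b = "\u0639\u0644\u064A", "\u0639\u0644\u06CC"

for a, b in ((medial_a, medial_b), (final_a, final_b)):
    print(fold_positional(a) == fold_positional(b),
          fold_always(a) == fold_always(b))
```

Under these assumptions the first pair prints "True True" and the
second "False True": the positional policy leaves the final-position
pair distinct, which is exactly the case where a typeface or a
user's typing habit may erase the nominal difference.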

In other words, I don't see that it would have been possible
at the time IDNA2008 was created to come up with a
well-defined confusables folding that would have been free of
any perceived problems 5 years later.

I simply think it's not possible to render text absolutely
"safe" by means of a fundamental protocol. I'm pessimistic
in the sense that additional layers of review will be needed
no matter what the protocol looks like, and optimistic that
some of the work that is being pursued at the moment in
the more limited arena of TLDs will ultimately generate
a host of useful information as well as techniques that could
be used to strengthen these additional layers.

A./