Confusability (Re: New version, draft-faltstrom-idnabis-tables-02.txt, available)

Mon Jun 18 11:41:02 CEST 2007

--On Monday, 18 June, 2007 15:57 +0900 Martin Duerst
<duerst at it.aoyama.ac.jp> wrote:

> Hello Harald,
> 
> I only started to talk about confusability because several of
> the preceding messages re. rule H were mentioning this, or
> things very closely related.
> 
> I think it's wise not trying to address confusability in our
> work, but if we indeed are doing so, I have no clue why
> anybody ever brought up RFC 3743, which is about a method to
> address confusability issues in CJKV ideograms, nothing else.
> 
> In general, I think we have to be careful to just outlaw
> confusability as an issue. As an example, the effort to
> address issues surrounding ZWNBSP are very much related to
> confusability: How can we make sure that things that are
> perceived differently by script users can indeed be
> represented without introducing additional confusability
> through a back door.

Martin,

Therein lies part of the problem, both with this particular
issue and with the idea of a crisp definition of the problem
that will clearly separate characters into "in" and "out".
Permit me to use your comments as the starting point to address
a somewhat broader set of issues.

We'd rather not address confusability because it leads us into a
difficult, and somewhat subjective, mess and because
confusability cannot be "solved", only considered as a tradeoff.
But, while one can't "solve" confusability, one cannot ignore it
because it makes labels less useful.

The goal is to make IDNs as useful and usable as possible,
balancing clarity and integrity of references against as much
mnemonic value in labels as possible.  That was, I believe, the
goal the first time around as well.   What we have discovered is
that the IDNA2003 rules and mappings created some problems, some
with confusability, some with an inability to usefully build
mnemonics based on some languages, some with general confusion.
We hope that those issues are fairly well covered in RFC 4690
and draft-klensin-idnabis-issues, but would welcome suggestions
about how they can be made more clear or how topics that have
been left out can be covered (we, and some members of the UTC,
have clearly been looking at these issues too much and may have
lost perspective on what issues might not be obvious to the more
casual reader).

There is clearly an assumption built into the comments above.  I
suppose one can classify it as "policy", but it is policy that
was incorporated into the hostname rules and that has been
reaffirmed many times since.  That policy --if one wants to
think of it as policy-- is that DNS "names" are primarily about
references, referential integrity, and usability of those
references by humans.   In principle, one could make other
decisions.  If one wants to optimize for profitability in the
"names market", then confusion is the friend of those profits
because one can then "sell" protection in the form of
encouraging people to register all possible confusing variations
of names.  If one wants to facilitate various criminal
enterprises, then strings --and presentation forms of strings--
that are easily confused visually with others are a phisher's
paradise.   If one wants to optimize for local choices and
interpretation rather than unique references, then there are no
problems with multiple roots, as RFC 2826 essentially points
out. 

But any of those approaches makes the DNS less useful as a tool
to support references, locators, and identifiers so the
optimization assumptions we are using seem both appropriate and
necessary.  So we come back to optimizing for integrity and
usability of references and all of the complex tradeoffs that go
with it.

After that, I fear one inevitably descends into details, rather
than broad principles.  That is inconvenient, but I don't see
any way to avoid it.  Calling it bad engineering doesn't change
that (and one could argue that it is the very essence of
engineering when constrained by conflicting objectives and an
inconsistent base).   Most of this is nothing new, but goes back
to the dawn of rules about names and identifiers on the ARPANET
--the question is about how to extend those rules into IDNs (and
whether such extensions are appropriate).  For example:

* One key reason underscore was excluded from the
letter-digit-hyphen hostname rule was that, when written with
a pen, it was too easily confused with hyphen.  As Mark Andrews
suggests, symbols were excluded at the same time because they
didn't reliably and consistently appear on keyboards and because
there is no consistent, predictable, worldwide terminology for
most of them, despite the terms chosen in Unicode.  

* At least partially on the advice of UTC members, IDNA2003
excluded invisible characters, such as zero-width ones, and
other characters that were generally ignored, because they would
be an opportunity for confusion... confusion of the kind that I
refer to above as a "phisher's paradise".   But, because of fairly
fundamental decisions about presentation made in the compilation
of Unicode, one cannot sensibly construct a wide range of
mnemonics based on a number of languages --notably those written
in Indic and Arabic-based scripts-- without some of those
characters, so it became important to deal with ZWJ and ZWNJ in
some way.  That, in turn, requires a contextual
rule, which the design of IDNA2003 prohibited.

* We've got a similar problem with IPA.  The first version of
the tables document excluded the IPA block entirely.  As Harald
mentioned, that resulted in two strong criticisms.  One was that
many of the characters had been adopted into African languages
and (presumably because there were no extant national or
international standards that were specific to those languages)
the IPA characters had to be used if reasonable mnemonics were
to be constructed based on words of those languages.  The other
was that we should refrain from writing rules based on character
blocks, rather than on property lists.   So now we have IPA back
in, which causes problems with IPA characters that are basically
font variations on basic Latin ones.  It is clear to me (at
least) that we can't have any font variations in the
IDN-permitted set and, indeed, that such variations must be
forever excluded if we are not to have major problems.  But that
means either that we need to exclude the IPA block and then permit
some specific characters, or that we need to include the IPA
block and then prohibit many specific characters, or that we are
not going to be able to use existing Unicode properties or
something derived from them to define the IDN sets.
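
As a rough illustration only, the three examples above can be
sketched in Python.  Nothing here comes from the tables draft: the
function names are hypothetical, the LDH check is a simplified
reading of the hostname rule (no length limits), general category
Cf is used as a crude stand-in for "invisible" characters, and the
sample code points in the IPA Extensions block (U+0254 and U+025B
permitted, U+0261 not) are arbitrary choices for the sketch, not a
proposed list.

```python
import re
import unicodedata

# Simplified letter-digit-hyphen (LDH) label: ASCII letters, digits and
# hyphen only, no leading or trailing hyphen.  Underscore is not allowed.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_ldh_label(label):
    """Return True if the label matches the simplified LDH pattern."""
    return LDH_LABEL.fullmatch(label) is not None


def invisible_characters(label):
    """List "U+XXXX" for every character whose general category is Cf.

    Assumption: Cf (format) is only a rough proxy for "invisible";
    the zero-width joiner and non-joiner both fall into it.
    """
    return ["U+%04X" % ord(ch) for ch in label
            if unicodedata.category(ch) == "Cf"]


# Unicode "IPA Extensions" block, U+0250..U+02AF.
IPA_BLOCK = range(0x0250, 0x02B0)

# Hypothetical exception list: open o and open e, chosen as samples.
ALLOWED_IN_IPA_BLOCK = {0x0254, 0x025B}


def passes_ipa_block_rule(ch):
    """Exclude the IPA block, then permit specific characters.

    Characters outside the block are not judged by this rule at all.
    """
    cp = ord(ch)
    if cp in IPA_BLOCK:
        return cp in ALLOWED_IN_IPA_BLOCK
    return True
```

With these definitions, is_ldh_label("my_host") is false and
invisible_characters("a\u200cb") reports U+200C.  The third function
shows the cost of the block-based approach: it only works with a
hand-maintained exception list, which is the awkwardness described
above.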

That is not a happy situation in terms of cleanliness of design
or definitions, but I don't see how to avoid it.

regards,
    john