Visually confusable characters

John C Klensin klensin at jck.com
Sat Aug 9 19:48:57 CEST 2014



--On Friday, August 08, 2014 19:20 -0700 Asmus Freytag
<asmusf at ix.netcom.com> wrote:

>...
> "Additional protocol" sounds like it's headed in the right
> direction.
> 
> There are already several levels to this
> 
>   * Unicode (repertoire and basic normalization)
>   * IDNA (including repertoire and context rules)
>   * Label Generation Rulesets (including repertoire, context
> rules and
>     blocked variants)
>   * String Review (case by case)
> 
> Of these, the formulation of Label Generation Rulesets allows a
> solution to issues like these that can be used to address
> issues like the current one without the need to pick an
> arbitrary preferred encoding. They provide ways to specify a
> first-come, first-serve, but mutually exclusive selection
> among alternatives, which is much less "linguistically
> damaging" than blunt restrictions on repertoire alone.
> 
> What is missing, but what keeps surfacing in the discussions
> around creating the LGR for the Root Zone is the need for
> enforceable "best practices" on LGRs.

Asmus,

I think that this discussion is being complicated by several
fundamental misunderstandings or misconceptions.   I think I
could trace the origins of most of them and that it might lead
to fewer repeated mistakes, but it would probably also lead to
unproductive finger-pointing, so I'll skip that (at least for
now).  Some of them may include, in no particular order:

(1) There is a way to establish "language context" in the DNS.  

It just doesn't work.  The DNS is designed to be an
administratively-distributed hierarchy, with the administration
of one node having control of the names it registered and
delegates, but little else.  In a few cases, one can deduce an
intended language from a top-level domain, but few domains (even
if all of the TLD applications of the last few years are
considered as approved) have primary ties to language rather
than products, concepts, or topographical or political
geography.  Even when a language can be inferred from the top
level, there is no way to "enforce" it on subsidiary nodes
because, once delegated, domain administrators ("registries")
are on their own and there is nothing to prevent, e.g., a
Chinese name from being registered in a subtree of an Arabic
domain.  More important, there is nothing to prevent
registration of an Urdu, Farsi, or Fula domain in a subtree of
an Arabic domain or vice versa.   

The only exception might involve contracts that restricted the
labels that could be used in a subtree, required that those
contractual provisions be passed down, and then were enforced
via some draconian procedures such as requiring large bonds when
domains were registered with forfeit of both the bonds and
domains if violations were detected.  That has been tried; in
general, it hasn't worked well.

For technical reasons associated with a hierarchy with weak
aliases and no "came from" function, even if such rules could be
enforced in principle, they would apply to registration and not
use.  If a.b.c.example were actually an alias (of either flavor)
for Fula-name1.Fula-name2.Fula-name3, there would be no
discernable language information about the first form.  And, if
the situation were reversed, there would be no way to obtain the
form that was thought of as containing language information from
the form that the user (or other system) presented.
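As a rough sketch of why aliasing destroys any putative language signal, consider a toy resolver over a dict-based zone (all names and records here are invented for illustration, not real DNS data):

```python
# Toy model of DNS alias chasing: a resolver follows CNAME-style
# aliases but records nothing about the name the query started from.
ZONE = {
    # hypothetical alias: an ASCII convenience name for a Fula-script tree
    "a.b.c.example": ("ALIAS", "fula1.fula2.fula3.example"),
    "fula1.fula2.fula3.example": ("A", "192.0.2.1"),
}

def resolve(name):
    """Follow aliases until an address record is found."""
    while True:
        rtype, value = ZONE[name]
        if rtype == "ALIAS":
            name = value   # the original query name is discarded here
        else:
            return value   # no "came from" information survives

print(resolve("a.b.c.example"))
print(resolve("fula1.fula2.fula3.example"))
```

Both queries yield the same answer, and nothing in the result records which form was asked for, so no language information about the original label can be recovered.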

(2) ICANN has real authority in this space.

ICANN has real authority in only two areas: decisions about what
top-level domains to allocate and delegate and obligations they
can impose on "contracted parties".   Even those authorities are
limited: for example, in the IDN space, attempts some years ago
to impose second-level registration guidelines on the nearly 200
ccTLDs were strongly and effectively resisted to the point that
those domains and their subtrees can effectively do whatever
they want.  A small number of them even ignored IDNA for a while
and registered ISO/IEC 8859 second-level domains.  When they
stopped (if, indeed, all of them have stopped) it wasn't due to
any ICANN authority.

ICANN has also tried to impose requirements on registrations
below the second level on contracted parties, but encountered a
lot of resistance and discovered that about all it could do was
recommend guidelines that could be ignored.  Part, but only
part, of the problem was an extension of the issues identified
in (1) above.   IIRC, the registry for the still-important COM
TLD told ICANN that it would consider ICANN's proposed
requirements only as general guidelines and got away with it.


(3) The LGR rules and process have something to do with this or
can be applied to help with it. 

The process that led to the LGR developments, and the LGR
process itself, were agreed to by the ICANN Board and community
based on some very restrictive conditions.  First, it applies
only to TLDs to be proposed to ICANN in some future (and as yet
undefined) application round, not even to IDN TLDs that are now
approved or in process.  Second, some a priori conditions were
applied to it that would not apply to lower-level registrations,
particularly the prohibition on archaic scripts (and,
presumably, characters) and the avoidance of IDNA CONTEXTJ and
CONTEXTO characters.   So any applicability to registrations at
the second level or below or to non-contracted parties would be
only as guidelines.  Based on experience, those guidelines would
be strongly resisted if ICANN tried to turn them into policies
(and simply ignored (at best) by any ccTLD that concluded its
interests lay elsewhere).

Incidentally, if a gTLD label was applied for containing a
character that was exclusive to Fula-in-Arabic-script and that
fell into the scope of the LGR rules, I'd expect it to be
rejected unless the Arabic script panel had first recommended
including that character and proven that it had the expertise to
make the relevant judgments, including judgments about possible
conflicts and requirements for variants.  If such a label was
accepted anyway, I'd expect the sort of conflicts and appeals
that would invoke the "start over" provisions of the VIP/IDN/LGR
Process.  If that reasoning is correct, U+08A1 would be excluded
from gTLDs as surely as if it were DISALLOWED by IDNA.  The
_only_ issue with it in that regard is whether it should be
allowed in places where it might otherwise co-occur with the BEH
with HAMZA ABOVE combining sequence (of course, as Vint, Andrew,
and others have pointed out, the real problem here isn't U+08A1
except insofar as it is a symptom of a problem that we thought
normalization solved and didn't/doesn't).
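That normalization gap can be observed directly with Python's standard unicodedata module (on any Python whose Unicode tables include U+08A1, i.e., Unicode 7.0 or later):

```python
import unicodedata

# U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE was added in Unicode 7.0
# with no canonical decomposition, so NFC does not unify it with the
# visually identical sequence BEH (U+0628) + HAMZA ABOVE (U+0654).
sequence = "\u0628\u0654"
precomposed = "\u08A1"

assert unicodedata.normalize("NFC", sequence) != precomposed
assert unicodedata.decomposition(precomposed) == ""  # no decomposition mapping
```

Because the new character carries no canonical decomposition, no amount of NFC or NFD processing will ever make the two visually identical spellings compare equal.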

Suggestions about "blocked variant mechanisms" and the like are
instances of either the LGR problem and binding to TLDs or
general issues about ICANN scope and authority.  There has never
been a _DNS_ "variant" mechanism and, for reasons that have been
discussed extensively elsewhere, really cannot be one without a
complete and fairly fundamental DNS redesign.  DNSng is even
harder to contemplate than IDNA202x; ideas that depend on it
are, at best, irrelevant to any contemporary discussion.
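For what it is worth, the registry-side "blocked variant" behavior Asmus describes (first-come, first-served, but mutually exclusive) is easy to sketch at a single zone's registration interface; the variant grouping below is invented for illustration:

```python
# Sketch of first-come, first-served registration with blocked variants:
# once any member of a variant set is registered, every other member of
# that set is blocked.  The variant grouping below is purely illustrative.
VARIANT_SETS = [
    {"\u08A1", "\u0628\u0654"},  # precomposed letter vs. combining sequence
]

registered = set()
blocked = set()

def try_register(label):
    """Register a label unless it, or a blocking variant of it, is taken."""
    if label in registered or label in blocked:
        return False
    registered.add(label)
    for vset in VARIANT_SETS:
        if label in vset:
            blocked.update(vset - {label})  # block the other variants
    return True
```

Note that this operates purely at registration time within one zone; it says nothing about names below a delegation point, which is exactly the limitation discussed in (1) above.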

(4) Discussions about applying and enforcing global restrictions
to naming that are not justified in terms about which there is
broad consensus in the domain and Internet domain-user
communities (and the "strings that look alike compare equal"
principle, even though it is a little vague and involves edge
cases, is one of those) can be separated from global political
discussions.

This one is possible, but unlikely, especially when issues or
disagreements can be turned into questions about whether ICANN
can be trusted with self-management and self-oversight or
whether the tendency of various ICANN-associated bodies to want
to overreach (whether for laudable or venal reasons) requires
external supervision and/or appeals mechanisms with significant
authority.


(5) Normalization is sufficient to produce equality comparisons
in characters that are identical in form within the same script,
type style, etc.

When the IDNA WG read what seemed to be the relevant sections of
the Unicode Standard and corresponding UAXs and UTRs, and was
advised about them by various people very close to Unicode
standardization, we made some inferences about both the use and
effects of normalization and future plans about new code points
within a script that were related to older ones.  Obviously we
reached a wrong conclusion, or several of them.
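The inference was not unreasonable: for precomposed characters encoded earlier, normalization does make the precomposed form and the equivalent combining sequence compare equal, as a standard-library check shows:

```python
import unicodedata

# LATIN SMALL LETTER E WITH ACUTE (U+00E9) has a canonical decomposition,
# so NFC folds "e" + COMBINING ACUTE ACCENT (U+0301) into it, and NFD
# decomposes it back to the sequence.
assert unicodedata.normalize("NFC", "e\u0301") == "\u00E9"
assert unicodedata.normalize("NFD", "\u00E9") == "e\u0301"
```

U+08A1 breaks that expectation because it was encoded without a canonical decomposition to the corresponding combining sequence.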


(6) This discussion has anything to do with visual confusion
among characters from separate scripts that have similar
appearances.

That is an important issue.  It just isn't this problem and,
with the exception of identifying a few characters that involve
very high risks of perceptual conflicts with common syntax
characters (especially in domain names and URIs) and trying to
figure out how to handle them (some have been DISALLOWED, others
treated as CONTEXTO rules), IDNA does not address that issue.
The reality is that there is always a tradeoff among the
importance of the characters involved (and the usability of
labels if they are excluded), the risk of either accidental or
malicious confusion, and the likely costs associated with those
risks.  Those tradeoffs can sensibly be assessed only on a
label-by-label and zone-by-zone basis.  But, again, different
problem.
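Where cross-script confusability is addressed at all, it is usually by something like the "skeleton" comparison of Unicode TS #39: map each character to a prototype and compare the results. A minimal sketch, with a deliberately tiny invented mapping table rather than the published confusables data:

```python
# Minimal sketch of a confusable "skeleton" comparison in the spirit of
# UTS #39.  The mapping here is a tiny, invented sample; a real
# implementation would use the published confusables.txt data.
CONFUSABLE_MAP = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
    "\u043e": "o",  # CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O
}

def skeleton(label):
    """Replace each character by its prototype, if one is listed."""
    return "".join(CONFUSABLE_MAP.get(ch, ch) for ch in label)

def look_alike(a, b):
    return skeleton(a) == skeleton(b)

print(look_alike("p\u0430yp\u0430l", "paypal"))  # mixed-script look-alike
```

Even this toy illustrates the tradeoff: every entry added to the table excludes otherwise legitimate labels, so the mapping itself encodes a policy judgment.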


(7) This matching mess is so horrible that the only options are
to accept it and live with it or to replace IDNA2008 with yet
another version.

Sorry, but no.  Selectively DISALLOWing particular code points,
especially newly-added ones, was anticipated in the IDNA design.
However horrible the idea may seem, the WG and broader community
understood the issue and accepted the risks and possible costs
(or thought it did).  Had we not done so, there would be no
provision in IDNA for a review process and a mechanism for
excluding newly-added characters.  What is missing is a way to
exclude particular combining sequences that turn out to be
problematic, especially ones that become problematic as a result
of additions to Unicode.   But, Andrew's comments about IDNA201x
or IDNA202x notwithstanding, it appears to me that provisions
for such exclusions would rather easily be added to the existing
model and that it could be done with little or no disruption as
long as one was very careful about cases that would appear to be
retroactive.
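Mechanically, such a sequence-level exclusion could sit alongside the existing per-code-point checks as a simple scan of the normalized label; the banned-sequence list below is purely hypothetical, not anyone's actual policy:

```python
import unicodedata

# Hypothetical per-label check: in addition to per-code-point rules,
# reject labels containing listed problematic combining sequences.
# The list below is illustrative only, not any registry's real policy.
BANNED_SEQUENCES = ("\u0628\u0654",)  # BEH + HAMZA ABOVE, as discussed

def label_ok(label):
    """Reject a label if any banned sequence occurs after NFC normalization."""
    nfc = unicodedata.normalize("NFC", label)
    return not any(seq in nfc for seq in BANNED_SEQUENCES)
```

Normalizing first matters: it ensures the check sees a canonical form of the label, so the rule cannot be dodged by an equivalent spelling.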

(8) Despite all of the above, it should be possible to invent
and enforce some wide-ranging "additional protocol" that would
understand human perceptions about homographs, know what was in
the DNS at all levels and all trees (or at least all relevant
ones, where "relevant" certainly extends below the second level)
and that could be used to enforce some sort of "variant" or
"similarity prohibition" rules on registrations and delegations
at potentially deep levels of the tree, enforcing those rules on
both types of aliases and maybe web redirects as well as
delegated subdomains and the host-type records in them.

Nice fantasy.  DNS doesn't work that way, the Internet doesn't
either, and that is probably A Good Thing.  For example, except
when required by contract or regulation, we no longer allow
people to obtain a list of the labels allocated in or delegated
from a particular domain.  Once one gets away from the root and
TLD contents, the contractual requirements are very rare.   Some
of us consider the general inability to ask such questions to be
an important privacy mechanism as well as having some
performance and operational advantages.  One can find out if a
name is already associated with DNS records, but that test is
not completely reliable due to various race conditions, hidden
domains and subdomains, etc.

Moreover, the reasons for an administratively-distributed
hierarchy aren't just to spread the workload around.  It is to
allow different parts of the domain tree to have different
policies.  A "one size fits all" naming model doesn't help.   In
addition, while I have some concerns about the desirability and
workability of some of Jefsey's ideas, it is clear to me that
even the most reasonable of them depend heavily on being able to
have different naming conventions and experiences to match
different user (or user group or nationality) preferences or
requirements.  A global "additional protocol" with its own
naming rules would probably make that impossible even if it were
otherwise feasible.

IMO, until we can get past that list, a constructive and focused
conversation about what to do is very nearly impossible.

best,
   john



