Visually confusable characters (1)

Sun Aug 10 23:03:43 CEST 2014

--On Sunday, August 10, 2014 12:21 -0700 Asmus Freytag
<asmusf at ix.netcom.com> wrote:

> This message responds to point (1)
> 
> A./
> 
> On 8/9/2014 10:48 AM, John C Klensin wrote:
>> 
>> 
>> (1) There is a way to establish "language context" in the DNS.
>> 
>> It just doesn't work.  The DNS is designed to be an
>> administratively-distributed hierarchy, with the
>> administration of one node having control of the names it
>> registered and delegates, but little else.  In a few cases,
>> one can deduce an intended language from a top-level domain,
>> but few domains (even if all of the TLD applications of the
>> last few years are considered as approved) have primary ties
>> to language rather than products, concepts, or topographical
>> or political geography.  Even when a language can be inferred
>> from the top level, there is no way to "enforce" it on
>> subsidiary nodes because, once delegated, domain
>> administrators ("registries") are on their own and there is
>> nothing to prevent, e.g., a Chinese name from being
>> registered in a subtree of an Arabic domain.  More important,
>> there is nothing to prevent registration of an Urdu, Farsi,
>> or Fula domain in a subtree of an Arabic domain or vice versa.
>> 
>> The only exception might involve contracts that restricted the
>> labels that could be used in a subtree, required that those
>> contractual provisions be passed down, and then were enforced
>> via some draconian procedures such as requiring large bonds
>> when domains were registered with forfeit of both the bonds
>> and domains if violations are detected.  That has been tried;
>> in general, it hasn't worked well.
>> 
>> For technical reasons associated with a hierarchy with weak
>> aliases and no "came from" function, even if such rules could
>> be enforced in principle, they would apply to registration
>> and not use.  If a.b.c.example were actually an alias (of
>> either flavor) for Fula-name1.Fula-name2.Fula-name3, there
>> would be no discernable language information about the first
>> form.  And, if the situation were reversed, there would be no
>> way to obtain the form that was thought of as containing
>> language information from the form that the user (or other
>> system) presented.

> John,
> 
> if I understand the implication of your argument, if you were
> running a domain for the Fula University and had to do
> sub-domains you would advocate then that at that point they
> not use IDNA 2008?

Wow.  I have no idea how you got that implication from anything
I said, so the miscommunication and/or misunderstandings
apparently run even deeper than I had assumed.  In any event:

(i) Extrapolating from several things that you and others have
said, if I were running a domain for Fula University, it seems
likely that I'd be doing my domain names in Latin script, except
possibly for/in the department of Ajami studies.  That may or
may not be relevant to this discussion but is almost certainly
relevant to attempts to ban "archaic" scripts from the DNS (for
different, and changing, definitions of "archaic").   For
context, I'd make the same comment about a Norwegian university
that was considering registering domain names in Runic.   One
implication of "administrative hierarchy" is that, within the
university's domain(s), they could do pretty much whatever they
wanted.  If, however, they asked my advice, I'd recommend that
they stick to IDNA and scripts that were likely to be usable
elsewhere.

(ii) If I were running such a domain and decided to use Arabic/
Ajami characters, I would hope that IDNA2008 would allow (indeed
force) the U+08A1 form of BEH WITH HAMZA ABOVE and the composing
sequence to be treated identically (and as equivalent).  I might
well prefer that U+08A1 be used because, as you point out, it is
much more natural to the language.  In addition, like many
others (and the decisions made for both IDNA2003 and IDNA2008),
I prefer, all other things being equal, precombined forms and
shorter codepoint sequences to longer combining sequences. But
I'd want the combining form to work and be equivalent to
accommodate the presumably non-zero number of people who have
been using that sequence for the years before Unicode 7.0.0
started coming into use, to have something that displayed
correctly in fonts that don't yet have representations for 7.0.0
code points, etc.   Note, in that regard, that, if U+08A1 had a
decomposition back to the combining sequence, fonts and
rendering machinery that have not been upgraded to reflect the
new characters in 7.0.0 would be home free if 7.0.0
normalization (or even just intelligent use of UnicodeData) were
supported.

(iii) My problem would then be how to make the two sequences
equivalent.  My expectation would be that IDNA2008 (or even
IDNA2003+UTR46+ the handwaving associated with "IDNA2003
upgraded to new versions of Unicode") would help me out, but
they don't because the only tool either one has is NFC and the
relevant normalization just isn't there.  If the problem were
limited to the web, I could try doing redirects but, for other
protocols (including email addresses in the relevant
departments), those inherent limitations of the DNS would very
quickly get to me -- either by preventing me from doing what I
wanted or by imposing significant additional maintenance and
overhead on me as I tried to keep all subdomains synchronized.
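
A minimal sketch of the comparison problem in (iii), assuming Python 3
with a Unicode 7.0 or later database behind its standard unicodedata
module.  The helper names are illustrative only, and a_label() is a
simplification: it applies Punycode and the "xn--" prefix but none of
the validity checks that real IDNA2008 processing performs.

```python
import unicodedata

PRECOMPOSED = "\u08A1"      # ARABIC LETTER BEH WITH HAMZA ABOVE (new in Unicode 7.0)
SEQUENCE = "\u0628\u0654"   # ARABIC LETTER BEH + combining ARABIC HAMZA ABOVE


def nfc(s):
    """Apply Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)


def a_label(u_label):
    """Simplified A-label: Punycode plus prefix, with no IDNA validation."""
    return "xn--" + u_label.encode("punycode").decode("ascii")


# U+08A1 has no canonical decomposition, so NFC cannot relate the two forms.
no_decomposition = unicodedata.decomposition(PRECOMPOSED) == ""
same_after_nfc = nfc(PRECOMPOSED) == nfc(SEQUENCE)

# Contrast: ALEF + HAMZA ABOVE does compose under NFC, to U+0623.
alef_composes = nfc("\u0627\u0654") == "\u0623"

# Because NFC leaves them distinct, the two forms become distinct labels.
distinct_labels = a_label(PRECOMPOSED) != a_label(SEQUENCE)

print(no_decomposition, same_after_nfc, alef_composes, distinct_labels)
```

The point of the contrast is that the ALEF case is handled by
normalization alone, while the BEH case leaves two strings that look
the same and compare unequal at every layer below the registry.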

(iv) Given the above, I would be faced with a difficult choice,
one that is substantially equivalent to the one faced by this
list.  Creating alternative normalization rules that would cause
the equality relationship I want is a huge step and one for
which IDNA2008 has no "hooks" and my university has no way to
persuade anyone else to use.  Modifying DNS servers to treat the
two as equivalent is a non-starter for all of the reasons we
went down the IDNA path in the first place (not coincidentally,
the same reason why language-specific normalization or case
folding can't be made to work in the DNS or IDNA).   Without
either of those choices being available, and with trying to
establish equivalence relationships on my own servers probably
not being a practical option, I'd be down to four choices:

(a) Try to ban the use of the combining sequence, possibly
invalidating names and other uses that have been using it
already (and forcing those who want to use labels containing BEH
WITH HAMZA ABOVE to wait until font sets and perhaps keyboards,
etc., are upgraded).
(b) Ban the use of the new U+08A1 in domain name labels, forcing
use of the combining sequence.
(c) Allow the use of either (and presumably register both in my
domain), hoping that whatever measures I can take to have them
treated as equivalent work well enough and that no one takes
advantage of the attack vectors when they don't.
(d) Disallow Ajami labels entirely on the grounds that this is
just too much of a mess for what are, given university domain
administrators in much of the world, the resources I have
available and the ways in which I need to prioritize them.
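
The four choices can be sketched as a registration-time check.  This
is a hypothetical illustration, not anything IDNA2008 or a real
registry defines: the function name, the policy letters as arguments,
and the three-block approximation of "Arabic script" are all
assumptions made for the example.

```python
import unicodedata

PRECOMPOSED = "\u08A1"      # BEH WITH HAMZA ABOVE, single code point
SEQUENCE = "\u0628\u0654"   # BEH followed by combining HAMZA ABOVE

# Rough approximation of "Arabic script": three Unicode blocks only.
ARABIC_BLOCKS = ((0x0600, 0x06FF), (0x0750, 0x077F), (0x08A0, 0x08FF))


def is_arabic(ch):
    """True if ch falls in one of the blocks listed above."""
    return any(lo <= ord(ch) <= hi for lo, hi in ARABIC_BLOCKS)


def labels_to_register(label, policy):
    """Return the labels to register under policy "a".."d", or raise ValueError."""
    label = unicodedata.normalize("NFC", label)
    if policy == "a" and SEQUENCE in label:
        raise ValueError("combining sequence not allowed")
    if policy == "b" and PRECOMPOSED in label:
        raise ValueError("U+08A1 not allowed")
    if policy == "d" and any(is_arabic(ch) for ch in label):
        raise ValueError("Arabic-script labels not allowed")
    if policy == "c":
        # Register the label as given plus its all-precomposed and
        # all-sequence forms; mixed forms are ignored in this sketch.
        all_pre = label.replace(SEQUENCE, PRECOMPOSED)
        all_seq = label.replace(PRECOMPOSED, SEQUENCE)
        return sorted({label, all_pre, all_seq})
    return [label]
```

Note that the (c) branch only registers both spellings; it does nothing
to make them behave as one name once they are in the zone, which is
exactly the weakness described for that choice.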

Now, to be clear, from my point of view as your hypothetical
university domain administrator, all four of the choices above
are lousy.  I might well pick (a), and retroactively apply the
same rule to the other Arabic-script single-codepoint characters
that could be built up from combining sequences but that don't
have decompositions.  In making that decision, I would also do
something that IDNA2008 strongly encourages, which is to ban all
Arabic-script characters from use in labels unless they were
actually used in writing Fula in Ajami.

But, as an IDNA expert worried about backward compatibility and
not invalidating possibly-existing labels unless there was no
other option, I wouldn't consider (a) to be an available option.
And, unless I could make a normalization method that would cause
these two ways to form the same character image to compare
equal, choice (b) would be by far the least bad of the remaining
options... unless I considered it so horrible that the right
answer was to note the problem and go with (c), noting that the
"make these match" tools won't work consistently in any more
global context.

Virtually none of that has anything to do with IDNA2008.  Other
than the observation that U+08A1 isn't defined in Unicode 3.2,
the same considerations would apply to IDNA2003, to IDNA2008,
and, if one wanted to accept that many problems (including lack
of client support and additional comparison ones), to simply
inserting UTF-8 into the zone for labels at that level.

There are a lot of situations in which I would suggest "don't
use that character", "don't use that sequence", or "be really,
really careful if you decide to use those particular characters"
in a domain name label, but none that I've encountered so far
that would lead me to say "don't use IDNA2008".

best regards,
   john