Visually confusable characters (1)

Mon Aug 11 11:02:03 CEST 2014

On 8/10/2014 2:03 PM, John C Klensin wrote:

John,

responses below.

A./
>
> --On Sunday, August 10, 2014 12:21 -0700 Asmus Freytag
> <asmusf at ix.netcom.com> wrote:
>
>> This message responds to point (1)
>>
>> A./
>>
>> On 8/9/2014 10:48 AM, John C Klensin wrote:
>>>
>>> (1) There is a way to establish "language context" in the DNS.
>>>
>>> It just doesn't work.  The DNS is designed to be an
>>> administratively-distributed hierarchy, with the
>>> administration of one node having control of the names it
>>> registered and delegates, but little else.  In a few cases,
>>> one can deduce an intended language from a top-level domain,
>>> but few domains (even if all of the TLD applications of the
>>> last few years are considered as approved) have primary ties
>>> to language rather than products, concepts, or topographical
>>> or political geography.  Even when a language can be inferred
>>> from the top level, there is no way to "enforce" it on
>>> subsidiary nodes because, once delegated, domain
>>> administrators ("registries") are on their own and there is
>>> nothing to prevent, e.g., a Chinese name from being
>>> registered in a subtree of an Arabic domain.  More important,
>>> there is nothing to prevent registration of an Urdu, Farsi,
>>> or Fula domain in a subtree of an Arabic domain or vice versa.
>>>
>>> The only exception might involve contracts that restricted the
>>> labels that could be used in a subtree, required that those
>>> contractual provisions be passed down, and then were enforced
>>> via some draconian procedures such as requiring large bonds
>>> when domains were registered with forfeit of both the bonds
>>> and domains if violations are detected.  That has been tried;
>>> in general, it hasn't worked well.
>>>
>>> For technical reasons associated with a hierarchy with weak
>>> aliases and no "came from" function, even if such rules could
>>> be enforced in principle, they would apply to registration
>>> and not use.  If a.b.c.example were actually an alias (of
>>> either flavor) for Fula-name1.Fula-name2.Fula-name3, there
>>> would be no discernable language information about the first
>>> form.  And, if the situation were reversed, there would be no
>>> way to obtain the form that was thought of as containing
>>> language information from the form that the user (or other
>>> system) presented.
>> John,
>>
>> if I understand the implication of your argument, if you were
>> running a domain for the Fula University and had to do sub-domains
>> you would advocate then that at that point they not use IDNA 2008?
> Wow.  I have no idea how you got that implication from anything
> I said, so the miscommunication and/or misunderstandings
> apparently run even deeper than I had assumed.  In any event:
>
> (i) Extrapolating from several things that you and others have
> said, if I were running a domain for Fula University, it seems
> likely that I'd be doing my domain names in Latin script, except
> possibly for/in the department of Ajami studies.  That may or
> may not be relevant to this discussion but is almost certainly
> relevant to attempts to ban "archaic" scripts from the DNS (for
> different, and changing, definitions of "archaic").   For
> context, I'd make the same comment about a Norwegian university
> that was considering registering domain names in Runic.   One
> implication of "administrative hierarchy" is that, within the
> university's domain(s), they could do pretty much whatever they
> wanted.  If, however, they asked my advice, I'd recommend that
> they stick to IDNA and scripts that were likely to be usable
> elsewhere.

Scripts are funny. But here we are not talking about scripts, but a
repertoire extension (inside a well supported script) for a given language.

It's fine to be skeptical of IDNs altogether. For TLDs, I find it telling
that there were no (serious) applications of Latin IDNs. (The exception in
this case proves the rule). There clearly is a pressure towards "lowest
common denominator" as a means of securing more universal access.

I take it that is why you were planning on giving the advice you sketched.

There is a countervailing pressure when it comes to writing systems, and
that is the use of a writing system (whether script or language-specific
orthography) to mark identity.

Over time, this can be expected to manifest itself.

The threshold for desiring to use a repertoire extension for a given
script is much lower than for a full script. Eventually, whether it's
domains based on personal names, or the Ajami studies department,
there will be pressure to use ordinary words as mnemonics. At
least, there will be pressure not to be limited to a random subset of
words that happens to work without a certain character.

By baking a restriction into IDNA via the exception mechanism, you
assert that this is an issue where some consideration trumps the
distributed control.
>
> (ii) If I were running such a domain and decided to use Arabic/
> Ajami characters, I would hope that IDNA2008 would allow (indeed
> force) the U+08A1 form of BEH WITH HAMZA ABOVE and the composing
> sequence to be treated identically (and as equivalent).  I might
> well prefer that U+08A1 be used because, as you point out, it is
> much more natural to the language.  In addition, like many
> others (and the decisions made for both IDNA2003 and IDNA2008),
> I prefer, all other things being equal, precombined forms and
> shorter codepoint sequences to longer combining sequences. But
> I'd want the combining form to work and be equivalent to
> accommodate the presumably non-zero number of people who have
> been using that sequence for the years before Unicode 7.0.0
> started coming into use, to have something that displayed
> correctly in fonts that don't yet have representations for 7.0.0
> code points, etc.   Note, in that regard, that, if U+08A1 had a
> decomposition back to the combining sequence, fonts and
> rendering machinery that have not been upgraded to reflect the
> new characters in 7.0.0 would be home free if 7.0.0
> normalization (or even just intelligent use of UnicodeData) were
> supported.

In this particular case I see the pre-existing data issues as more of a
theoretical concern than a practical one. (The code points, I am told,
are not accessible without going to some effort).

(I'm leaving aside entirely the argument that the two are non-identical
on purpose, so that the convenience would actually buy you the
wrong results).
>
> (iii) My problem would then be how to make the two sequences
> equivalent.  My expectation would be that IDNA2008 (or even
> IDNA2003+UTR46+ the handwaving associated with "IDNA2003
> upgraded to new versions of Unicode") would help me out, but
> they don't because the only tool either one has is NFC and the
> relevant normalization just isn't there.  If the problem were
> limited to the web, I could try doing redirects but, for other
> protocols (including email addresses in the relevant
> departments), those inherent limitations of the DNS would very
> quickly get to me -- either in the form of not being able to do
> what I wanted or to impose significant additional maintenance
> and overhead versions on me as I tried to keep all subdomains
> synchronized.

I would phrase the issue differently. If the two sequences aren't
semantically the same, then I'm not interested in making them equivalent
by having a single, preferred form substituted for one of them.

However, because appearance matters in identifiers, I want a guarantee
that there aren't two different labels possible that look the same.

That guarantee gets admittedly stronger if it's implemented via
some kind of normalization or repertoire restriction baked into the
protocol.

But I would simultaneously realize that I already have to demand
this kind of guarantee for many, many other strings where neither
normalization nor repertoire restrictions can be applied (or were not
applied and now it's too late).

Because of that, I'd have to evaluate whether this particular case is so
egregious that it must come with a stronger guarantee than all the
other cases. I would conclude that this is not the case, given the
obscurity of either sequence or singleton.

As a result, I would look towards the registry and not the protocol
to address this.
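
For readers who want to check the premise of (iii) themselves, here is a
minimal sketch in Python. It assumes an interpreter whose unicodedata
module is built on Unicode 7.0 or later (Python 3.5 and up should
qualify); the helper names are mine and purely illustrative. It shows
that no standard normalization form maps the singleton and the sequence
to the same string, while the long-established ALEF WITH HAMZA ABOVE
does have a canonical decomposition.

```python
import unicodedata

# Assumes the interpreter's Unicode database is 7.0 or later,
# because U+08A1 was only added in Unicode 7.0.
SINGLETON = "\u08A1"       # ARABIC LETTER BEH WITH HAMZA ABOVE
SEQUENCE = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalized_forms(text):
    """Return the text under each of the four standard normalization forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


def equivalent_under_any_form(a, b):
    """True if some standard normalization form maps a and b to the same string."""
    forms_a, forms_b = normalized_forms(a), normalized_forms(b)
    return any(forms_a[form] == forms_b[form] for form in FORMS)


if __name__ == "__main__":
    # U+08A1 carries no decomposition mapping, so normalization leaves it alone.
    print(repr(unicodedata.decomposition(SINGLETON)))
    print(equivalent_under_any_form(SINGLETON, SEQUENCE))
    # Contrast: U+0623 ALEF WITH HAMZA ABOVE decomposes to U+0627 U+0654.
    print(equivalent_under_any_form("\u0623", "\u0627\u0654"))
```

Since normalization cannot relate the two forms, any guarantee has to
come from somewhere else, which is why I point at the registry.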
>
> (iv) Given the above, I would be faced with a difficult choice,
> one that is substantially equivalent to the one faced by this
> list.  Creating alternative normalization rules that would cause
> the equality relationship I want is a huge step and one for
> which IDNA2008 has no "hooks" and my university has no way to
> persuade anyone else to use.
That makes normalization a sub-par choice.
>   Modifying DNS servers to treat the
> two as equivalent is a non-starter for all of the reasons we
> went down the IDNA path in the first place (not coincidentally,
> the same reason why language-specific normalization or case
> folding can't be made to work in the DNS or IDNA).   Without
> either of those choices being available, and with trying to
> establish equivalence relationships on my own servers probably
> not being a practical option, I'd be down to three choices:

Which means that a goal of treating two things on a perfectly equivalent
footing is unrealistic. But protection from phishing doesn't require
the full equivalence.
>
> (a) Try to ban the use of the combining sequence, possibly
> invalidating names and other uses that have been using it
> already (and forcing those who want to use labels containing BEH
> WITH HAMZA ABOVE to wait until font sets and perhaps keyboards,
> etc., are upgraded).
> (b) Ban the use of the new U+08A1 in domain name labels, forcing
> use of the combining sequence.
> (c) Allow the use of either (and presumably register both in my
> domain), hoping that whatever measures I can take to have them
> treated as equivalent work well enough and that no one takes
> advantage of the attack vectors when they don't.
> (d) Disallow Ajami labels entirely on  the grounds that this is
> just too much of a mess for what are, given university domain
> administrators in much of the world, the resources I have
> available and the ways in which I need to prioritize them.

(e) have a robust mechanism at your registry that allows one, but not
the other, to be registered. Unlike (a) and (b), there's no need to decide
up-front which one it will be. Once one is registered, it blocks the other.

In practice, I would expect in this case that the vast majority of
applications would be for the singleton. But if, for example, the
department of Koranic studies were to apply for a term that used the
sequence, it could be accommodated.
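
To make (e) concrete, here is a minimal sketch in Python of a
first-come, first-served registry check. Everything in it is hypothetical
(the Registry class, the variant_key function, the status strings), and
it covers only this one pair of variants; a real registry would work
from a full variant table. The idea is simply that both spellings fold
to one lookup key, so whichever is registered first blocks the other,
and nothing already registered is ever invalidated.

```python
SINGLETON = "\u08A1"       # ARABIC LETTER BEH WITH HAMZA ABOVE
SEQUENCE = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE


def variant_key(label):
    """Fold a label to a comparison key (hypothetical, single-pair table).

    The key is used only for collision checks; the label itself is stored
    and delegated exactly as the applicant spelled it.
    """
    return label.replace(SINGLETON, SEQUENCE)


class Registry:
    """Toy registry implementing blocked variants for one variant pair."""

    def __init__(self):
        self._holder_by_key = {}  # variant key -> label as registered

    def register(self, label):
        key = variant_key(label)
        holder = self._holder_by_key.get(key)
        if holder is None:
            self._holder_by_key[key] = label
            return "registered"
        if holder == label:
            return "already registered"
        return "blocked"  # a look-alike spelling holds this key


if __name__ == "__main__":
    registry = Registry()
    # Placeholder labels, not real words: each form followed by ALEF.
    print(registry.register(SINGLETON + "\u0627"))  # first applicant
    print(registry.register(SEQUENCE + "\u0627"))   # look-alike, refused
```

Note that the registry never rewrites either spelling into the other; it
only refuses the second one, which is what separates blocking from
normalization.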

>
> Now, to be clear, from my point of view as your hypothetical
> university domain administrator, all four of the choices above
> are lousy.

That's why I'd pick (e) - it allows for competing allocations from the same
script using different orthographies, even if the orthographies overlap
by using "their" flavor of a homograph.

That approach is the one I've labeled "blocked variants".
>   I might well pick (a), and retroactively apply the
> same rule to the other Arabic-script single-codepoint characters
> that could be built up from combining sequences but that don't
> have decompositions.  In making that decision, I would also do
> something that IDNA2008 strongly encourages, which is to ban all
> Arabic-script characters from use in labels unless they were
> actually used in writing Fula in Ajami.

Of course. This works orthogonally.
>
> But, as an IDNA expert worried about backward compatibility and
> not invalidating possibly-existing labels unless there was no
> other option, I wouldn't consider (a) to be an available option.
> And, unless I could make a normalization method that would cause
> these two ways to form the same character image to compare
> equal, choice (b) would be by far the least bad of the remaining
> options... unless I considered it so horrible that the right
> answer was to note the problem and go with (c), noting that the
> "make these match" tools won't work consistently in any more
> global context.

Option (e) doesn't have the issue of backward compatibility - all
existing labels are automatically retained.
>
> Virtually none of that has anything to do with IDNA2008.  Other
> than the observation that U+08A1 isn't defined in Unicode 3.2,
> the same considerations would apply to it, to IDNA2008, and, if
> one wanted to accept that many problems to causes (including
> lack of client support and additional comparison ones) simply
> inserting UTF-8 into the zone for labels at that level.
>
> There are a lot of situations in which I would suggest "don't
> use that character", "don't use that sequence", or "be really,
> really careful if you decide to use those particular characters"
> in a domain name label, but none that I've encountered so far
> that would lead me to say "don't use IDNA2008".

As I mentioned in a different message, there are some orthographies that
cannot ever be supported (like the one that uses @ as a letter).

There are some orthographies that can only be supported with restriction.
The lack of apostrophe is definitely restrictive in languages that use
it in names. The use of text as identifiers does not mean "full text".

But when the discussion gets to the point of disallowing a letter,
it is worth being really sure that this is the only available option.

A./