Visually confusable characters (1)

John C Klensin klensin at jck.com
Mon Aug 11 16:10:30 CEST 2014


(distribution trimmed to sender and the list)

--On Monday, August 11, 2014 02:02 -0700 Asmus Freytag
<asmusf at ix.netcom.com> wrote:

>...
>>> This message responds to point (1)
>...
>>> On 8/9/2014 10:48 AM, John C Klensin wrote:
>>>> 
>>>> (1) There is no way to establish "language context" in the DNS.
>>>> 
>>>> It just doesn't work.  The DNS is designed to be an
>>>> administratively-distributed hierarchy, with the
>>>> administration of one node having control of the names it
>>>> registered and delegates, but little else.  In a few cases,
>>>> one can deduce an intended language from a top-level domain,
>...
>>> if I understand the implication of your argument, if you were
>>> running a
>>> domain for the Fula University and had to do sub-domains you
>>> would
>>> advocate then that at that point they not use IDNA 2008?
>> Wow.  I have no idea how you got that implication from
>> anything I said, so the miscommunication and/or
>> misunderstandings apparently run even deeper than I had
>> assumed.  In any event:
>> 
>> (i) Extrapolating from several things that you and others have
said, if I were running a domain for Fula University, it seems
>> likely that I'd be doing my domain names in Latin script,
>> except possibly for/in the department of Ajami studies.  That
>> may or may not be relevant to this discussion but is almost
>> certainly relevant to attempts to ban "archaic" scripts from
>> the DNS (for different, and changing, definitions of
>> "archaic").   For context, I'd make the same comment about a
>> Norwegian university that was considering registering domain
>> names in Runic.   One implication of "administrative
>> hierarchy" is that, within the university's domain(s), they
>> could do pretty much whatever they wanted.  If, however, they
>> asked my advice, I'd recommend that they stick to IDNA and
>> scripts that were likely to be usable elsewhere.
 
> Scripts are funny. But here we are not talking about scripts,
> but a
> repertoire extension (inside a well supported script) for
> a given language.

Scripts are, indeed, funny.   From an IDNA (and confusability)
standpoint, we almost certainly would have been better off if
Greek, Latin, and Cyrillic were treated as a single script with
a lot of shared characters and various characters that are less
shared.  Or, given that they are to be treated as separate
scripts, we might be better off if Arabic and Perso-Arabic (at
least) were treated as separate scripts.  It didn't work out
that way, partially because of Unicode's early reliance on
catenations of existing standards and their repertoires.
Unicode's "Unification" criterion for new code points is heavily
dependent on script boundaries, at least as it is described in
Section 2.2.   As Patrik and Andrew have said in other contexts,
what happens going forward is much more important than what
happened earlier (certainly before Unicode 3.2 and probably
before 5.0).

But scripts are the tools we have and we (for a very broad "we")
have been trying to use them.  The generation panels of the LGR
process are organized around scripts as are the team efforts on
which that process was built.  We have different guidelines and
recommendations for domain names on a script basis, including
suggestions about mixed-script labels, a refusal to consider
variants (whatever those are) that cross script boundaries, and
special rules in the ccTLD Fast Track that apply only if labels
are drawn from different scripts.

Getting closer to the crux of the current issue, when you say
"repertoire extension (inside a well supported script) for a
given language", I go reread the Unification criterion of
Section 2.2, including the statements about language not being
an issue for assigning separate code points, and the various
provisions about decomposition when such code points are added.
I then conclude that what I thought the IDNA WG was told and
what I understood about the predictability of future additions
to Unicode and their relationship to those stated principles was
incorrect.   And that, as Andrew recently pointed out, calls
into doubt both the current IDNA rule structure and the PRECIS
work that affects a lot of other protocols (some of which are
not subject to _any_ name management regime, even one equivalent
to the DNS's administratively distributed hierarchy).

> It's fine to be skeptical on IDNs altogether. For TLDs, I find
> it telling that
> there were no (serious) applications of Latin IDNs. (The
> exception in this case proves the rule). 

I'm not sure what it proves or what the rule is.  It may be that
applicants for TLDs that use Latin script considered the
tradeoffs with clarity and global ease of entry and decided that
including a few non-ASCII characters in their proposed labels
wasn't worth the trouble, especially with the prohibition on
Latin-script "variants" [1] 

That option was overtaken long ago by events and other
decisions, but I argued at the time that the right way to treat
IDN TLDs was even more different from what is going on today,
namely by not allowing them at all and encouraging local
mappings or translations instead.   It would have solved a lot
of the problems we are facing today although, as we have
discovered with everything else where IDNs are concerned, it
also would have introduced other problems.  But that idea was
not "skepticism on IDNs", it was an attempt to consider a
broader range of possibilities and tradeoffs and find a creative
and non-pessimal solution.  See RFC 4185 if you are curious.

> There clearly is a pressure towards "lowest common
> denominator" as a means of securing more universal access.

Another one of those tradeoffs, although getting from my
position on this (or what I said) to "lowest common denominator"
doesn't reflect the intent.  The lowest common denominator is to
stick with ASCII, transliterating to Basic Latin characters if
needed.  I believe in IDNs.  My believing that the allocation
and "naming" rules should be more restrictive for TLDs and
increasingly less restrictive as one goes deeper into the tree
doesn't indicate skepticism about IDNs, it represents a view
based on relationships to requirements for global scope.  

> I take it that is why you were planning on giving the advice
> you sketched.

"LCD", no.  Increasing the odds of usability within their own
context and, to a lesser degree, for people outside that context
(see the comments about legitimate labels in archaic scripts in
a prior note... and note that application of the top-down LGR
rules to that university would prohibit Runic while my
recommendation would merely suggest not using it.

There is clearly a spectrum between the lowest common
denominator of ASCII-only "LDH" labels and the highest
expectations or fantasies of those who want the extremes of
sensitivity to the characteristics and use of individual
languages.  The former would prohibit IDNs (IDNA notwithstanding
because it introduces its own, non-LCD, issues).  The latter
cannot be supported by Unicode in the DNS environment, at least
without a separate presentation layer with its own metadata.
There is a lot of range in between, with the right balance among
the tradeoffs depending a lot on expectations about the uses of
the label and the capabilities of the users.  To illustrate,
when I used the Runic example, part of my
consideration was the near-certainty that almost everyone in a
Norwegian university, even scholars in the Viking-era Literature
department, was familiar with Latin script.  If there were a big
community of scholars and users there who had read and written
only in Old Futhark, the balance would quite sensibly be
different.

> There is a contravening pressure when it comes to writing
> systems, and
> that is the use of writing system (whether script or
> language-specific orthography) to mark identity.
> 
> Over time, this can be expected to manifest itself.

Sure.  But, again, the appropriateness of the DNS as the place
to manifest that identity distinction is, at best, another
tradeoff.  The balance for the DNS may weigh differently than
the balance for Unicode.  Coming back to the LGR process as an
analogy and example, whether one considers the prohibition of
archaic scripts or the bias against scripts that lack
well-funded and diverse IDN advocacy communities, the TLD
process contains all sorts of biases against such marking.  

More generally, and more importantly, I want to stress that, if I
were making these decisions for Unicode, I would probably make
the same decisions you have made.  If I were in a position to do
so, I'd probably make the same decisions for IDNs and sort the
equality issues out on the servers while keeping the code points
separate (including performing normalization only on the
servers).   But the latter is not an option with the DNS, and
hence not with IDNA; the DNS has imposed other constraints for a
long time.

As an example that is probably familiar to everyone here, the
"marking" associated with the use of upper-case characters at
the beginning of some words is important to German.   But that
marking has never been permitted to be effective in the DNS,
which discards the information at various points.  The ability
to use Eszett (Sharp-S) as a distinct character is another
example, one that IDNA2008 permits but that many people closely
affiliated with Unicode have vigorously opposed.  
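
To make both points concrete, here is a small sketch in Python,
whose str.casefold() implements Unicode case folding (the same
folding the IDNA2003 mapping step was built on); "Weg"/"weg" is
just an illustrative German pair:

    # DNS matching is case-insensitive, so the German convention of
    # capitalizing nouns cannot survive as a distinction between labels:
    # "Weg" (path) and "weg" (away) become the same label.
    print("Weg".lower() == "weg".lower())   # True

    # Unicode case folding maps Eszett to "ss"; IDNA2003's mapping did
    # the same, while IDNA2008 keeps U+00DF as a distinct code point.
    print("straße".casefold())              # 'strasse'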

The point is not whether any of those decisions are right or
wrong but whether there are underlying principles that predict
future behavior and predict it consistently and well enough that
a system based on category generation from properties and rules
-- rather than one in which such rules are only guidelines to
whatever appears in normative tables -- is feasible.   If the
answer is globally "no", then we need to figure out a way to get
around it.  Since we are unlikely to discard the IDNA2008 model,
that may translate to disallowing some things we would rather
allow or by putting in another set of rules that somehow smooth
over the issues.
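
For concreteness, here is a deliberately toy sketch (Python's
unicodedata module) of that "category generation from properties
and rules" model; the real RFC 5892 algorithm adds contextual
rules, stability tests, and an exceptions table -- and that
exceptions table is exactly where a per-code-point ban of the
sort discussed here would have to live:

    import unicodedata

    def toy_idna_property(ch: str) -> str:
        # Grossly simplified RFC 5892: derive the property from the
        # Unicode general category instead of listing code points.
        if unicodedata.category(ch) in ("Ll", "Lo", "Lm", "Mn", "Mc", "Nd"):
            return "PVALID"
        return "DISALLOWED"

    print(toy_idna_property("a"))       # PVALID
    print(toy_idna_property("A"))       # DISALLOWED (the real rules get
                                        # there via the Unstable rule)
    print(toy_idna_property("\u08A1"))  # PVALID by the default rules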

> The threshold of desiring to use a repertoire extension for a
> given
> script is much lower than for a full script. Eventually,
> whether it's
> domains based on personal names, or the Ajami studies
> department,
> there will be pressure to use ordinary words as mnemonics. At
> least, pressure, to not use a random subset of words that
> happens to work without a certain character.

Sure.  But that position is, at least from where I sit, entirely
consistent with vigorous application of intra-script
unification.  If the repertoire addition were of a character
with an entirely new shape and form, or a character "borrowed"
from another script but heavily used by some language in this
one, we wouldn't be having this discussion, just as we haven't
had the discussion when new characters (shapes/forms) were added
to Cyrillic to accommodate languages that weren't on the radar
earlier, or to Han to accommodate older (or, earlier, newer)
forms and/or proper names.

As Andrew has pointed out, the problem isn't adding this
character to the repertoire.  It is the combination of:

 -- Adding a code point whose form previously could have
	been synthesized by a base character + combining
	character sequence
 -- Doing so in a way that appears to violate the
	principles in Chapter 2 of The Unicode Standard,
	particularly the principle that language distinctions
	within a script are not considered (even if there are
	other principles that make that ok).
 -- Doing so in a way that causes normalization to fail
	to produce an equality comparison among
	identically-shaped character-objects in the same script
	(see the sketch after this list).
 -- The uncertainty the above generates about whether we
	can depend sufficiently on Unicode stability and
	normalization rules to support the IDNA2008 model
	without substantive changes.
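
A minimal illustration of the first and third points, using
Python's unicodedata (and assuming a Unicode 7.0 or later
character database so that U+08A1 exists): NFC composes the
Latin combining sequence into its precomposed form, but leaves
the Arabic pair and U+08A1 as distinct, unequal strings even
though they render identically.

    import unicodedata

    # Latin: the combining sequence composes to the precomposed form.
    print(unicodedata.normalize("NFC", "o\u0308") == "\u00F6")   # True

    # Arabic: U+08A1 has no canonical decomposition, so normalization
    # never makes the two identically-rendered spellings compare equal.
    seq = "\u0628\u0654"  # BEH + combining ARABIC HAMZA ABOVE
    one = "\u08A1"        # ARABIC LETTER BEH WITH HAMZA ABOVE
    print(unicodedata.normalize("NFC", seq) == one)              # False
    print(unicodedata.normalize("NFD", one) == seq)              # False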
 
> By baking a restriction into IDNA via the exception mechanism,
> you
> assert that this is an issue where some consideration trumps
> the distributed control.

No, we support the IDNA principle that things that are in the
same script and look identical either compare equal (after
normalization) or not be there.  In a more perfect world (and
I'm still not excluding the possibility) we would be preserving
the character and fixing the "comparing equal" part.  But, as
far as I can tell, that would require developing and adopting an
IDNA-specific (or IETF-specific) normalization form.  So far, it
appears that DISALLOWing U+08A1 and any future code point that
is added to Unicode that could previously be synthesized by a
combining sequence within the same script but that does not have
a decomposition back to that combining sequence is less painful
than a new normalization form... especially so because we are
_not_, from an IDNA standpoint, prohibiting the use of that
abstract character in Fula (or any of the cases to come), we are
just prohibiting using that particular code point to write it.
In particular, an application or user interface tuned to Fula
that encountered U+08A1 (or the combining sequence) could easily
map them appropriately -- to the
combining sequence on the way into IDNA and to U+08A1 on the way
to Fula text.  No problem there and no prohibition on doing that
-- _we_ just have no way to require it and no way to preserve
the visual identity criterion without either forcing it (by
DISALLOWing the code point) or having a normalization that
creates the relevant equivalence.
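
A hypothetical sketch of that mapping layer follows; the
function names, and the choice to make the combining sequence
the IDNA-facing form, are illustrative assumptions rather than
anything IDNA specifies:

    # Hypothetical Fula-aware mapping layer: fold the singleton to the
    # combining sequence before IDNA processing, and reverse the mapping
    # when presenting stored labels as Fula text.
    SINGLETON = "\u08A1"        # BEH WITH HAMZA ABOVE (assumed DISALLOWED)
    SEQUENCE  = "\u0628\u0654"  # BEH + combining HAMZA ABOVE (PVALID)

    def to_idna_input(label: str) -> str:
        return label.replace(SINGLETON, SEQUENCE)

    def to_fula_display(label: str) -> str:
        return label.replace(SEQUENCE, SINGLETON)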

>> (ii) If I were running such a domain and decided to use
>> Arabic/ Ajami characters, I would hope that IDNA2008 would
>> allow (indeed force) the U+08A1 form of BEH WITH HAMZA ABOVE
>> and the composing sequence to be treated identically (and as
>> equivalent).  I might well prefer that U+08A1 be used
>...
> In this particular case I see the pre-existing data issues as
> more of a
> theoretical concern, than a practical one. (The code points, I
> am told,
> are not accessible without going through some efforts).

And so?  Maybe (but only maybe) that is an argument for just
letting this particular code point go.   Maybe it is an argument
for figuring out how to ban the combining sequence (which was
available no matter how hard it would be to use through some
particular set of UIs) in IDNA instead of banning U+08A1.   But
neither of those choices (or others) affects the underlying issue
that Andrew, Patrik, Vint, myself, and others have tried to
point out.

>...
>> (iii) My problem would then be how to make the two sequences
>> equivalent.  My expectation would be that IDNA2008 (or even
>> IDNA2003+UTR46+ the handwaving associated with "IDNA2003
>> upgraded to new versions of Unicode") would help me out, but
>> they don't because the only tool either one has is NFC and the
>> relevant normalization just isn't there.  If the problem were
>> limited to the web, I could try doing redirects but, for other
>> protocols (including email addresses in the relevant
>> departments), those inherent limitations of the DNS would very
>> quickly get to me -- either in the form of not being able to
>> do what I wanted or to impose significant additional
>> maintenance and overhead on me as I tried to keep
>> all subdomains synchronized.
> 
> I would phrase the issue differently. If the two sequences
> aren't semantically
> the same, then I'm not interested in making them equivalent by
> having a single, preferred form substituted for one of them.

Ok.  I think that is an entirely sensible position for you (and
Unicode) to take.  It is not an acceptable position for IDNA, at
least without some very fundamental changes in the way we think
about things.

> However, because appearance matters in identifiers, I want a
> guarantee
> that there aren't two different labels possible that look the
> same.
 
> That guarantee gets admittedly stronger if it's implemented via
> some kind of normalization or repertoire restriction baked
> into the protocol.
> 
> But I would simultaneously realize that I already have to
> demand
> this kind of guarantee for many, many other strings where
> neither
> normalization nor repertoire restrictions can be applied (or
> were not applied and now it's too late).

As Andrew has pointed out, the important issue is predictability
going forward (from around 5.0 or earlier) so that the rule set
works as predicted/promised.  If we are going to "demand" going
back to prior cases, the arguments you are making for separation
of U+08A1 would need to be applied to the difference between
"ö" as a character in Swedish and "ö" as a pronunciation
indicator in German.   It might also apply to the situation with
CaseFold(ß) and some rather complicated issues about "æ" and
"œ" and how they are used, even within what we normally
consider to be the same language.   Of course, all such
"demands" would do serious violence to Unicode's various
stability rules.

> Because of that, I'd have to evaluate whether this particular
> case is so
> egregious that it must come with a stronger guarantee than all
> the
> other cases. I would conclude that this is not the case,
> given the obscurity of either sequence or singleton.

> As a result, I would look towards the registry and not the
> protocol to address this.

Ok.  See above.  Really not the issue, just a handy and very
specific example that triggers concern about what appears to be
a much more general problem.

FWIW, a change to Section 2.2 of the Unicode standard (possibly
even in the still-not-fixed 7.0.0 text) that explained these
sorts of variations or exceptions to the criteria that currently
appear there, and that addressed the tradeoffs involved in a
clear way, would, IMO, go a long way toward moving this
discussion in a more focused and useful direction.

>...

> Which means that a goal of treating two things on a perfectly
> equivalent
> footing is unrealistic. But protection from phishing doesn't
> require the full equivalence.

Again, returning to Andrew's note and observing that "full
equivalence" would require crossing script boundaries, which we
have never tried to do in IDNA: if it is unrealistic, then there
are fundamental questions about the IDNA2008 model that we need
to address.

>> (a) Try to ban the use of the combining sequence, possibly
>> invalidating names and other uses that have been using it
>> already (and forcing those who want to use labels containing
>> BEH WITH HAMZA ABOVE to wait until font sets and perhaps
>> keyboards, etc., are upgraded.)
>> (b) Ban the use of the new U+08A1 in domain name labels,
>> forcing use of the combining sequence.
>> (c) Allow the use of either (and presumably register both in
>> my domain), hoping that whatever measures I can take to have
>> them treated as equivalent work well enough and that no one
>> takes advantage of the attack vectors when they don't.
>> (d) Disallow Ajami labels entirely on the grounds that this
>> is just too much of a mess for what are, given university
>> domain administrators in much of the world, the resources I
>> have available and the ways in which I need to prioritize
>> them.
> 
> (e) have a robust mechanism at your registry that allows one,
> and not
> the other to be registered. Unlike (a) and (b) there's no need
> to decide
> up-front which one it will be. Once one is registered, it
> blocks the other.

Doesn't help as long as the DNS allows aliases.  Also doesn't
help because of the user expectation that they can type things
(within a script) the way they like and have predictable
behavior across zones.  This is _not_ all about phishing, even
though that is an important consideration.

> In practice, I would expect in this case that the vast
> majority of applications
> would be for the singleton. But if, for example the department
> of Koranic
> studies were to apply for a term that used the sequence, it
> could be accommodated.


>> Now, to be clear, from my point of view as your hypothetical
>> university domain administrator, all four of the choices above
>> are lousy.
> 
> That's why I'd pick (e) - it allows for competing allocations
> from the same
> script using different orthographies, even if the
> orthographies overlap
> by using "their" flavor of a homograph.
> 
> That approach is the one I've labeled "blocked variants".

See above.  And see the whole series of comments from others
about variants.

>...
>> But, as an IDNA expert worried about backward compatibility
>> and not invalidating possibly-existing labels unless there
>> was no other option, I wouldn't consider (a) to be an
>> available option. And, unless I could make a normalization
>> method that would cause these two ways to form the same
>> character image to compare equal, choice (b) would be by far
>> the least bad of the remaining options... unless I considered
>> it so horrible that the right answer was to note the problem
>> and go with (c), noting that the "make these match" tools
>> won't work consistently in any more global context.
> 
> Option (e) doesn't have the issue of backward compatibility -
> all existing labels are automatically retained.

But a user who types two URLs containing what seems to her to be
the exact same labels discovers that one works and the other
doesn't.  That is not a good (or, some would say, even
acceptable) result.
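
To see why, consider a sketch using Python's built-in punycode
codec (real IDNA processing layers more checks on top of this,
and an A-label prefixes "xn--" to the result): the two visually
identical spellings yield different encoded labels, so a lookup
can only match whichever one was actually registered.

    # The same visible glyph, spelled two ways, encodes differently.
    print("\u08A1".encode("punycode"))        # one ACE form...
    print("\u0628\u0654".encode("punycode"))  # ...and a different one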

>...
>> There are a lot of situations in which I would suggest "don't
>> use that character", "don't use that sequence", or "be really,
>> really careful if you decide to use those particular
>> characters" in a domain name label, but none that I've
>> encountered so far that would lead me to say "don't use
>> IDNA2008".
> 
> As I mentioned in a different message, there are some
> orthographies that cannot ever be supported (like the 
> one that uses @ as a letter).

Actually not true.  And not an IDNA problem either.

> There are some orthographies that can only be supported with
> restriction.
> The lack of apostrophe is definitely restrictive in languages
> that use
> it in names. The use of text as identifiers does not mean
> "full text".
> 
> But when the discussion gets to the point of disallowing a
> letter,
> it is worth being really sure that this is the only available
> option.

If we were disallowing a letter that could not be written
(typed, expressed, coded, mapped, ...) in any other way, I would
completely agree.  But that situation doesn't appear to apply
here.  Put differently, no one has proposed to disallow the
letter, only a particular way of coding that letter.

best,
   john


[1] If Patrik asked my advice about applying for a TLD in his
name, I'd probably tell him he could find lots better ways to
spend USD 180000 (plus or minus).  He would be unlikely to ask
because he has certainly figured that out.   But, in a world in
which IDN TLDs are allowed, such a TLD would be a lot more
interesting (or less uninteresting) if he could get both
"Fältström" and "Faltstrom" than for either one alone.


