IDNA and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))

John C Klensin john-ietf at jck.com
Mon Jan 26 23:05:49 CET 2015


Asmus,

Some reflections on your note with the understanding that I
largely agree with Pete's concerns and will try to avoid
repeating them (more on that below).  I do wish we could either
give a precise and consistent definition to "homograph" (or
"homoglyph" and get agreement on using them) or stop using the
terms entirely because they have, IMO, introduced a great deal
of confusion into our discussions.  FWIW, I like your
definitions, I just despair of getting very many people to use
them consistently.

Nick, most of my response to your observation about "handle as
confusables" appears toward the end of this note.  Since I'm
complaining about terminology, I've also discovered that phrase
is usually followed by a great deal of handwaving about what
that sort of "handling" means in practice.  Again, please see
below but try to explain _exactly_ what you mean/intend by such
handling for both the entire DNS hierarchy and the various
identifiers that might be covered by PRECIS or other systems of
protocols.

More (lots more, unfortunately) below...


--On Sunday, January 25, 2015 22:30 -0800 Asmus Freytag
<asmusf at ix.netcom.com> wrote:
 
>  From the perspective of a robust identifier system, I would
> want an underlying encoding that is constructed so that it
> serves as a catalog of unique, non-overlapping shapes (that is
> shapes that are positively distinguishable). With that, I
> could ensure that a unique sequence of code values results in
> a unique, distinguishable graphical form (rendered label) that
> users can select with confidence.

As you certainly know, unless one could perhaps keep all
identifiers as unambiguously non-words, one would also want
something that is a good match to people's perceptions of how
the sounds they make are represented in printed form and vice
versa.  Those criteria conflict -- no perfect solution is
possible unless, perhaps, one allows only numeric identifiers
and restricts them to a single set of digits.

So, while I believe your comment above is correct, I also
believe it is somewhat of a strawman: realistic systems can, at
best, treat those goals as important decision-making criteria.
In the real world, one ends up
making tradeoff decisions.  One can prefer that one's own
criteria are considered most important but, if the encoding
system is going to serve multiple objectives, probably the most
it is plausible to expect is consistency of application of
whatever conventions and priorities are developed.

FWIW, I note that Harald Alvestrand and I explored some related
ideas in RFC 5242, largely to demonstrate that they were
unworkable.   Whatever one might say about the reasons why that
effort failed, I'd suggest that the original design for what
became ISO 10646 came a bit closer to the objectives above
simply by being intended as a single unified character set with
identical characters (even from multiple scripts) assigned a
single code point and not combining characters.  The Unicode
design of more or less concatenating script-related blocks
(originally based on concatenating existing language or
script-specific national and ISO standards, see below) and of
allowing combining sequences is very different and probably much
more practical (I assume you would drop "probably" from that
statement).

> It's a mistake to assume that this describes in any way the
> primary mission of the Unicode Standard.
> 
> Instead, Unicode is concerned with allowing authors to create
> strings for code values that will be (more or less
> predictably) rendered so that human readers can discern an
> intended textual meaning. "If I want to write "foo" in
> language X, which codes should (and shouldn't) I use?" is the
> question that goes to the heart of this problem.

I think that is an excellent summary and I appreciate your
providing it.  It also means that there is an inherent conflict
between identifier applications and that primary mission.   That
conflict is likely to be less when some language context can be
inferred or assumed than when it cannot, but not completely
absent in either case.  Unless one resolves that conflict by
moving away from Unicode for identifiers or adopting truly
draconian restrictions such as the "all numeric" one above, it
also suggests that, whatever we do, there are going to be some
rough edges in identifier use of Unicode.  Vint's observation
about "argument[s] of the form 'you allowed a case of confusion
therefore you should tolerate all confusion'" and a similar
observation about "we did that before, therefore it is ok to do
it again" are, IMO, both applicable here.   

I believe the Unicode Consortium tried to develop one set of
rules that would work moderately well for general cases with UAX
#31.  The IETF studied that specification when IDNA was first
being developed and concluded that it wasn't a sufficient match
for DNS needs.  While I believe that both UAX #31 and IDNA have
influenced each other over the years, they haven't converged
sufficiently to be interoperable.   I believe that at least some
of the differences are real differences in design criteria, not
just matters of taste.  As with the more general identifier
versus text issue, that is probably because the primary (and
priority) use cases are different and not merely because of
differences in taste.  In any event, I think it was inevitable
that neither is completely adequate to cover all identifier
cases.  

> It's not quite as simple as that, because there's also the
> need to make certain automated processes (from spell-checker
> to sorting) come up with the correct interpretation of the
> text - the latter, for example, is behind the need to
> separately encode a Greek omicron from a Latin o.

Because I think understanding how we got to where we are today is
helpful to informing future decisions, I suggest that, while
that example is helpful, it is important to note that the
_reason_ those two code points are encoded separately is a
direct consequence of the way early versions of Unicode were
created, i.e., that the original Greek letter block of Unicode
was identical in repertoire and order to the Greek letters of
ISO/IEC 8859-7 and ELOT 928.  I recall (perhaps incorrectly)
that part of the rationale for doing things that way in the
Unicode work of the late 1980s was to preserve ISO script
relationships and orderings (that is more or less confirmed in
the chronology at
http://www.unicode.org/history/versionone.html); in any event,
it was not some abstract reasoning about, e.g., sorting or spell correction.
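
To make the practical consequence concrete, here is a small
Python sketch (standard library only, purely illustrative)
showing that, however identical they may look, no normalization
form relates the two code points:

    import unicodedata

    latin_o = "\u006F"   # LATIN SMALL LETTER O
    omicron = "\u03BF"   # GREEK SMALL LETTER OMICRON

    for ch in (latin_o, omicron):
        print("U+%04X  %s" % (ord(ch), unicodedata.name(ch)))

    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        same = (unicodedata.normalize(form, latin_o)
                == unicodedata.normalize(form, omicron))
        print(form, "treats them as equal:", same)   # False every time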

> Another complication is that human readers can be very
> tolerant when associating shapes to letters in well
> established contexts, but not tolerant at all outside of these
> contexts. If you consider all, including decorative type
> faces, the letter 'a' can have a bewildering array of actual
> shapes, without losing its essential "a-ness" --- when used in
> context. Some of the shapes for capital A can look like U
> (check your Fraktur fonts), and out of context of running text
> in Fraktur would be misidentified by many users.

And that, as I understand it, is the reason why Unicode has a
supposedly-firm rule against assigning separate code points to
font (or type style) variations.    At best, that is where more
tradeoffs intrude because the relationship between, e.g., "a"
(U+0061) and "MATHEMATICAL BOLD SMALL A" (U+1D41A) appears to
the casual observer to be entirely about font variations.  Even
if the difference between textual and mathematical usage is accepted
as significant enough to justify a separate set of codes, the
relationship between U+1D41A and, e.g., U+1D44E (italic rather
than bold) appears to be nothing more than a font variation
within the same semantic context.
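
For what it is worth, that relationship is visible in the
character properties themselves.  A small Python sketch
(standard library only, illustrative) shows that the
mathematical variants carry compatibility, not canonical,
decompositions and fold back to plain "a" only under NFKC:

    import unicodedata

    plain  = "\u0061"       # LATIN SMALL LETTER A
    bold   = "\U0001D41A"   # MATHEMATICAL BOLD SMALL A
    italic = "\U0001D44E"   # MATHEMATICAL ITALIC SMALL A

    for ch in (plain, bold, italic):
        print("U+%05X %-28s decomposition=%r"
              % (ord(ch), unicodedata.name(ch),
                 unicodedata.decomposition(ch)))
    # The math variants carry "<font> 0061" (compatibility only), so:
    print(unicodedata.normalize("NFKC", bold + italic))         # "aa"
    print(unicodedata.normalize("NFC", bold + italic)
          == bold + italic)                                     # True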

From my perspective, we can design identifier standards that
adjust to almost any reasonable, consistent, and well-described
set of conventions of a repertoire and coding system optimized
for different goals, including the goals you describe.  The
fundamental problems we seem to be facing here are less those
differences in design goals than failures in "consistent and
well-described", even if we consider only decisions made from
Unicode 3.2 or 4.1 forward.

> Finally, Unicode is intentionally designed to be the *only*
> such system, so that code conversion (other than trivial
> re-packaging) is in principle needed only for accessing legacy
> data. However, at the start, all data was legacy, and Unicode
> had to be designed to allow migration of both data and systems.

Presumably hence the compatibility decisions with existing
standards, as discussed above.

> Canonical decomposition entered the picture because the legacy
> was at odds with how the underlying writing system was
> analyzed. In looking at the way orthographies were developed
> based on the Latin and Cyrillic alphabets, it's obvious that
> plain letterforms are re-used over and over again, but with
> the addition of a mark or modifier. These modifiers are named,
> have their own identity, and can, in principle, be applied to
> any letter -- often causing a predictable variation of value
> of the base letter.
> 
> Legacy, instead, cataloged the combinations.

And that is exactly what the standard says happened and why, at
least the way we read it and as it was explained to us.  With
the help of what we understood normalization to be for, with the
exclusion of all compatibility characters (perhaps too harshly
restrictive for some Chinese cases), and with a few
character-by-character adjustments, we thought we were coping
fairly well.  However, the assumptions we made -- again, based on
both what the Standard appeared to say in plain language and
advice from your colleagues -- included a belief that no new
character would be added within a given script if it could be
composed from existing characters (i.e., if both the base
character and any required combining characters were already
coded).  What took us by surprise this time around --
surprises that have caused some of us to question a whole series
of basic assumptions -- is that, in addition to the conditions
stated in the standard for adding new code points, there is an
additional set of rules and cases about phonetic, semantic, and
use-case distinctions that justify new code points.
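
A concrete contrast may help here.  The following Python sketch
(standard library; it assumes a Unicode 7.0 or later character
database so that U+08A1 is present) shows the behavior we had
assumed to be general, next to the case that surprised us:

    import unicodedata

    # Latin: the precomposed form and the combining sequence are
    # canonically equivalent; NFC/NFD map between them.
    e_acute_seq = "e\u0301"          # e + COMBINING ACUTE ACCENT
    print(unicodedata.normalize("NFC", e_acute_seq) == "\u00E9")   # True
    print(unicodedata.normalize("NFD", "\u00E9") == e_acute_seq)   # True

    # Arabic: BEH + HAMZA ABOVE does not compose to U+08A1, and
    # U+08A1 has no decomposition, so the two spellings stay
    # distinct under every normalization form.
    beh_hamza_seq = "\u0628\u0654"   # BEH + ARABIC HAMZA ABOVE
    print(unicodedata.normalize("NFC", beh_hamza_seq) == "\u08A1") # False
    print(repr(unicodedata.decomposition("\u08A1")))               # ''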

I'm convinced that, with your help, we can develop new rules or
derived properties, perhaps even leading to a new normalization
form that does what we thought NFC and NFD did and that is
better at supporting context-free identifiers.  But we need to
know what the cases and rules that create them are.  I now infer
that, in addition to the "Mathematical" characters as exceptions
to the "no separately-coded font variations" rule (treated as
almost a separate script with compatibility transformations
back to Latin), there are phonetic description characters that
are treated as part of the Latin script, look just like base
Latin characters with various combining markings or decorations,
but that have ordinary "Lu" or "Ll" properties and no
decompositions, and these Arabic cases that do not have
decompositions because they are phonetically different from the
composing sequence.

> For Latin and Cyrillic primarily, and many other scripts, but
> for some historical reason not for Arabic, Unicode supports
> the system of "applying marks" to base letters, by encoding
> the marks directly. To support legacy, common combinations had
> to be encoded as well. Canonical decomposition is in one sense
> an assertion of which sequences of base + mark a given
> combination is the equivalent. (In another sense,
> decomposition asserts that the ordering of marks in a
> combining sequence does not matter for some of these marks,
> but matters for others).

But then the question becomes how far back "legacy" goes,
because some of these combining sequences and precomposed
characters seem to have been added fairly recently.  Be that as
it may...

> Arabic was excluded from this scheme for (largely) historical
> reasons; combinations and precomposed forms are explicitly not
> considered equal or equivalent, and one is not intended to be
> substituted for another. 

Except, of course, where they are -- see U+0623.   I gather from
the above that this is now considered a legacy issue.
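
Indeed, U+0623 behaves exactly the way we assumed the whole
script would behave; a short check (Python, standard library,
illustrative) confirms it:

    import unicodedata

    nfd = unicodedata.normalize("NFD", "\u0623")
    print(["U+%04X" % ord(c) for c in nfd])
    # ['U+0627', 'U+0654'] -- ALEF + ARABIC HAMZA ABOVE
    print(unicodedata.normalize("NFC", "\u0627\u0654") == "\u0623")  # True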

> So as to not break with the existing
> system, additional composite forms will be encoded - always
> without a decomposition.

This sounds as if we have reached a point at which the rules for
whether new characters are added, whether they are added in
precomposed form, and whether those forms decompose, are
actually different on a per-script (and, given the phonetic
descriptors, perhaps a per-block or per-character (see below))
basis.  If that is true, it would make the IDNA rule sets
horribly complex, but it might be possible _if_ there is
guidance as to whether the next new script to be coded will be
handled more like Latin or more like Arabic.  The standard seems
to me to rather clearly say "more like Latin", but maybe that is
not true.

> (As an aside: Arabic is full of other, non-composite code
> points that will look identical to other code points in some
> context, but are not supposed to be substituted - yet it's
> trivial to find instances where they have been).

Understood and that is complicated by a lot of font and writing
system variations, especially between Arabic language use of the
script and uses by languages whose writing systems were more
strongly influenced by Persian.  But, so far, that set of issues
has been handled as confusables, not different ways to code
what, by appearance and name, appear to be the same character.

> Latin, for example, also contains cases where, what looks like
> a base letter with a mark (stroke, bar or slash) applied to
> it, is not decomposed canonically. The rationale is that if I
> apply a "stroke" to a letter form, the placement of the stroke
> is not predictable. It may overstrike the whole letter, or
> only a stem, or one side of a bowl. Like the aforementioned
> case, new stroked, barred or slashed forms will be encoded in
> the future, and none of these are (or will be) canonically
> equivalent to sequences including the respective combining
> marks. (This principle also holds for certain other "attached"
> marks, like "hook", cf U+1D92, but not cedilla).

So my "need to have different rules per script" hypothesis above
is insufficient and the rules really have to be "per combining
character"?

> On the other hand, no new composite forms will be encoded of
> those that would have been decomposed in the past.

I don't understand what that means in practice.  It sounds like
the rule in the standard, but it also appears that exceptions
can be made at any point simply by saying "that isn't really the
same abstract character" because it is phonetically,
semantically, or historically different.

> To come to a short summary of a long story: Unicode is
> misunderstood if its combining marks are seen as a set of lego
> bricks for the assemblage of glyphs. Superficially, that's
> what they indeed appear to be. However, they are always marks
> with their own identity that happen to be applied, in writing,
> to certain base letter forms, with the conventional appearance
> being indistinguishable from a "decoration".
 
> Occasionally, because of legacy, occasionally for other
> reasons, Unicode has encoded identical shapes using multiple
> code points (homographs). A homograph pair can be understood
> as something that if both partners were rendered in the same
> font, they would (practically always) look identical. Not
> similar, identical.

Ok.  I can work with that definition but note, as I said at the
top of this note, that the same term has been used to describe
sets of code points that merely might appear similar to someone,
in some font, on some days.  

> The most common cases occur across scripts - as result of
> borrowing. Scripts are a bit of an artificial distinction,
> when operating on the level of shapes (whether in hot metal or
> in a digital font) there's no need to distinguish whether 'e',
> 's', 'p', 'x', 'q', 'w' and a number of other shapes are
> "Latin" or "Cyrillic". They are the same shape. Whether they
> are used to form English, French or Russian words happens to
> be determined on another level.

I understood that and have been trying to distinguish between
same-script and inter-script differences, similarities, or
identity all through these discussions.  I've even suggested
that our IDNA-style identifier problems would be easier if
Unicode had chosen to treat Latin, Cyrillic, and Greek as a
single script.  The reasons that wasn't done are clear
(including the legacy standard issue discussed above) and it is
clearly too late to undo it (or to treat Arabic and Perso-Arabic
as separate scripts), even if that were desirable more generally.

> Without the script distinction, these are no longer
> homographs, because they would occur in the catalog only once.
> 
> Because we do have script distinction in Unicode, they are
> homographs, and they are usually handled by limiting script
> mixing in an identifier - a rough type of "exclusion
> mechanism".
>...

Fortunately or unfortunately, we learned very early in IDNA
deployment that a "no mixed-script labels" rule was impractical
and would not be accepted by the user community due to various
use cases in which identifiers are formed from multiple
languages.   We can still advise that it is generally a bad idea
(and have done so), but we cannot prohibit it.  Equally
important for the domain name case, most users see an FQDN as a
single identifier, not a set of isolated labels, but the
"distributed administrative hierarchy" character of the DNS
design (and how DNS aliases work) makes it impossible to
impose an "all of the labels in this domain name must be entirely
in the same script" rule.
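
Just to illustrate what even the advisory, per-label version of
such a rule involves, here is a rough sketch (Python, standard
library only; the character-name prefix is a crude stand-in for
the real Unicode Script property, which would come from UAX #24
data or a proper library):

    import unicodedata

    def rough_scripts(label):
        # Crude heuristic: first word of the character name.  NOT the
        # Unicode Script property; real code would use UAX #24 data.
        scripts = set()
        for ch in label:
            if ch in "-0123456789":        # LDH digits/hyphen: ignore
                continue
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
        return scripts

    fqdn = "пример.example.com"   # a Cyrillic label under Latin labels
    for label in fqdn.split("."):
        print(label, rough_scripts(label))
    # Each label is single-script, but the FQDN as a whole mixes
    # scripts -- and the labels are administered by different zones,
    # so no one registry could enforce "same script throughout".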

> Because the reasons why these homographs were encoded are
> still as valid as ever, any new instances that satisfy the
> same justification, will be encoded as well. In all these
> cases, the homographs cannot be substituted without (formally)
> changing the meaning of the text (when interpreted by reading
> code values, of course, not when looking at marks). Therefore,
> they cannot have a canonical decomposition.
> 
> Canonical decomposition, by necessity, thus cannot "solve" the
> issue of turning Unicode into a perfect encoding for the sole
> purpose of constructing a robust identifier syntax - like the
> hypothetical encoding I opened this message with. If there
> was, at any time, a misunderstanding of that, it can't be
> helped -- we need to look for solutions elsewhere.

That is becoming clear (or may already be clear to most people
here).

> The fundamental design limitation of IDNA 2008 is that,
> largely, the rules that it describes pertain to a single label
> in isolation.
> 
> You can look at the string of code points in a putative label,
> and compute whether it is conforming or not.
> 
> What that kind of system handles poorly is the case where two
> labels look identical (or are semantically identical with
> different appearance -- where they look "identical" to the
> mind, not the eyes, of the user).
> 
> In these cases, it's not necessarily possible, a-priori, to
> come to a solid preference of one over the other label (by
> ruling out certain code points). In fact, both may be equally
> usable - if one could guarantee that the name space did not
> contain  a doppelganger.
> 
> That calls for a different mechanism, what I have called
> "exclusion mechanism".
> 
> Having a robust, machine readable specification of which
> labels are equivalent variants of which other labels, so that
> from such a variant set, only one of them gets to be an actual
> identifier. (Presumably the first to be applied for).
> 
> This will immediately open all the labels that do not form a
> plausible 'minimal pair' with one of their variants. For
> example, a word in a language that uses code point X, where
> the homograph variant Y is not on that locale's keyboard would
> not be in contention with an entirely different word, where Y
> appears, but in different context, and which is part of a
> language not using X on their keyboard. Only the occasional
> collision (like "chat" and "chat" in French and English) would
> test the formal exclusion mechanism.
> 
> This less draconian system is not something that is easy to
> retrofit on the protocol level.

It is also associated with a requirement that should be made
explicit.  Unless, for example, one language (or other
criterion) is to be universally preferred to others in some sort
of ranked hierarchy, there must be a registry, even if first
come first served, in which the strings can be cataloged.  That
works for IANA-maintained protocol parameter registries
including the DNS root zone and for identifiers associated with
some other type of registration authority.  It does not work for
most more distributed environments, nor in situations where
users pick their own identifiers for their own purposes.  In
that latter case, sometimes we care about absolute uniqueness,
sometimes statistical uniqueness is good enough, and sometimes
we don't care at all.
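
As a toy illustration of that registry requirement, consider the
sketch below (Python; the two-entry folding table is a made-up
stand-in for whatever machine-readable variant specification
would actually be used):

    # The FOLD table is illustrative only; a real system would derive
    # variant sets from a published, machine-readable specification.
    FOLD = str.maketrans({
        "\u0430": "a",   # CYRILLIC SMALL LETTER A -> variant of Latin a
        "\u03BF": "o",   # GREEK SMALL LETTER OMICRON -> variant of Latin o
    })

    registered = {}      # variant-set key -> label that got there first

    def try_register(label):
        key = label.translate(FOLD)
        holder = registered.setdefault(key, label)
        return holder == label   # False: a variant is already taken

    print(try_register("chat"))        # True  (first come, first served)
    print(try_register("ch\u0430t"))   # False (Cyrillic-a variant excluded)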

In addition, exclusion rules are good at preventing false
positives but not so helpful for preventing false negatives.  We
(and the users) often, but not always, care that the user can
enter the same string twice, even from different systems on
different days, and get results that are "the same" or will
compare equal.  If that isn't reliable because different systems
have different defaults, assumptions, or input methods, then we
end up in an unhappy situation.
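
That problem is easy to demonstrate.  The sketch below (Python,
standard library, illustrative) shows two keyboard-plausible
encodings of the same visible word that compare unequal unless
both sides normalize, and normalize to the same form:

    import unicodedata

    typed_precomposed = "caf\u00E9"    # e-acute as a single code point
    typed_decomposed  = "cafe\u0301"   # e + COMBINING ACUTE ACCENT

    print(typed_precomposed == typed_decomposed)              # False
    print(unicodedata.normalize("NFC", typed_precomposed)
          == unicodedata.normalize("NFC", typed_decomposed))  # True
    # If one system normalizes and the other does not (or they pick
    # different forms), the user gets a false negative: the "same"
    # name does not compare equal.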


> But, already, outside the protocol level, issues of near (and
> not so near) similarities have to be dealt with. Homographs in
> particular (and "variants" in general) have the nice property
> that they are treatable by mechanistic rules, because the
> "similarities" whether graphical or semantic are "absolute".
> They can be precomputed and do not require case-by-case
> analysis.

Actually, there are enough exceptions that I don't think a
statement that general is appropriate.

> So, seen from the perspective of the entire eco-system around
> the registration of labels, the perceived shortcomings of
> Unicode are not as egregious and as devastating as they would
> appear if one looks only at the protocol level.
 
> There is a whole spectrum of issues, and a whole set of layers
> in the eco system to potentially deal with them. Just as
> string similarity is not handled in the protocol, these types
> of homographs should not have to be.

Unfortunately, the view that these are just confusable
characters that ought to be handled by subjective, per-registry,
and highly-distributed judgments about what is acceptable
-- which I think is where this leads -- may discard a major goal
of IDNA 2008, which was to be able to deal with the overwhelming
majority of cases by algorithmic rules (or tables derived from
them) that could be enforced by lookup-time checking.  That
principle creates a kind of stability and predictability that
are very important for both security and for a user sense that
what is allowed (or disallowed) in one part of the DNS (or one
protocol) will either be allowed elsewhere or the reasons will
be clear.  By contrast, assuming that every zone administrator
("registry" -- remembering that there are hundreds of thousands
of them, maybe more) will be cautious and conservative and will
understand the issues and act in the best interests of the users
and the Internet --or that they will all subscribe to the same
judgment/ evaluation group and its conclusions-- is unrealistic.
We tried that and found, even at the second level of the DNS,
that several important registry operators were quite indignant
that ICANN would presume to tell them what to do and that
browser vendors invented different "who or what do you trust"
rules about individual zones (see Gerv's note about that and
note that the strategy is useless below the second level unless
the top-level registry imposes contractual requirements on every
delegation that apply recursively down the tree and that are
rigorously enforced -- ideas that have proven unpopular and that
may not even be plausible).

I hope we can do better, even if that requires special properties
to identify code points that would have had decompositions under
a more identifier-oriented set of criteria (although the current
Unicode criteria do not define decompositions for them), or even
an IETF-defined normalization form that has better properties
for identifiers (again, within a script and taking the script
boundaries/categories as given) than NFC is turning out to have.
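
As a very rough sketch of what such a form might look like
(Python; the one-entry supplementary table is a hypothetical
illustration, not a proposal for actual contents):

    import unicodedata

    # Hypothetical supplementary table: code points that would have
    # had decompositions under identifier-oriented criteria.  One
    # entry is shown purely for illustration.
    EXTRA_DECOMP = {
        "\u08A1": "\u0628\u0654",   # BEH WITH HAMZA ABOVE -> BEH + HAMZA
    }

    def idn_normalize(s):
        s = unicodedata.normalize("NFD", s)
        s = "".join(EXTRA_DECOMP.get(ch, ch) for ch in s)
        return unicodedata.normalize("NFC", s)

    print(idn_normalize("\u08A1") == idn_normalize("\u0628\u0654"))  # True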

> Let's recommend handling them in more appropriate ways.

I hope you have better suggestions than the above.  At this
point, I don't and I find the idea of going down either of those
paths fairly daunting.

best,
    john



