IDNA and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))
asmusf at ix.netcom.com
Mon Jan 26 07:30:54 CET 2015
John, Nico and all,
From the perspective of a robust identifier system, I would want an
underlying encoding that is constructed so that it serves as a catalog
of unique, non-overlapping shapes (that is shapes that are positively
distinguishable). With that, I could ensure that a unique sequence of
code values results in a unique, distinguishable graphical form
(rendered label) that users can select with confidence.
It's a mistake to assume that this describes in any way the primary
mission of the Unicode Standard.
Instead, Unicode is concerned with allowing authors to create strings
for code values that will be (more or less predictably) rendered so that
human readers can discern an intended textual meaning. "If I want to
write "foo" in language X, which codes should (and shouldn't) I use?" is
the question that goes to the heart of this problem.
It's not quite as simple as that, because there's also the need to make
certain automated processes (from spell-checker to sorting) come up with
the correct interpretation of the text - the latter, for example, is
behind the need to separately encode a Greek omicron from a Latin o.
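The omicron/o case can be made concrete with Python's standard `unicodedata` module: the two characters are distinct code points with distinct names, and no normalization form (not even the compatibility forms) unifies them.

```python
import unicodedata

# Latin small letter o and Greek small letter omicron render identically
# in most fonts, yet they are separate code points with separate names.
latin_o = "\u006F"
omicron = "\u03BF"

assert unicodedata.name(latin_o) == "LATIN SMALL LETTER O"
assert unicodedata.name(omicron) == "GREEK SMALL LETTER OMICRON"

# Normalization never maps one to the other, because they carry
# different textual meaning despite the identical shape.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, omicron) != latin_o
```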
Another complication is that human readers can be very tolerant when
associating shapes with letters in well-established contexts, but not
tolerant at all outside of those contexts. If you consider all contexts,
including decorative typefaces, the letter 'a' can have a bewildering
array of actual shapes without losing its essential "a-ness" --- when
used in context. Some of the shapes for capital A can look like U (check
your Fraktur fonts) and, out of the context of running text in Fraktur,
would be misidentified by many users.
Finally, Unicode is intentionally designed to be the *only* such system,
so that code conversion (other than trivial re-packaging) is in
principle needed only for accessing legacy data. However, at the start,
all data was legacy, and Unicode had to be designed to allow migration
of both data and systems.
Canonical decomposition entered the picture because the legacy was at
odds with how the underlying writing system was analyzed. In looking at
the way orthographies were developed based on the Latin and Cyrillic
alphabets, it's obvious that plain letterforms are re-used over and over
again, but with the addition of a mark or modifier. These modifiers are
named, have their own identity, and can, in principle, be applied to any
letter -- often causing a predictable variation of the value of the base letter.
Legacy, instead, cataloged the combinations.
For Latin and Cyrillic primarily, and many other scripts, but for some
historical reason not for Arabic, Unicode supports the system of
"applying marks" to base letters, by encoding the marks directly. To
support legacy, common combinations had to be encoded as well. Canonical
decomposition is in one sense an assertion of which sequence of base +
mark a given combination is equivalent to. (In another sense,
decomposition asserts that the ordering of marks in a combining sequence
does not matter for some of these marks, but matters for others).
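Both senses can be observed directly with Python's `unicodedata` module: NFD expands a precomposed letter into base + mark, and normalization reorders marks with different combining classes into a canonical order while leaving same-class marks in their original order.

```python
import unicodedata

# Sense 1: a precomposed letter is equivalent to base + combining mark.
assert unicodedata.normalize("NFD", "\u00E9") == "e\u0301"   # é -> e + COMBINING ACUTE
assert unicodedata.normalize("NFC", "e\u0301") == "\u00E9"   # and back again

# Sense 2: marks with different combining classes are put into canonical
# order (dot below, class 220, sorts before dot above, class 230), so
# these two input orders are canonically equivalent.
a = unicodedata.normalize("NFD", "q\u0307\u0323")  # dot above, then dot below
b = unicodedata.normalize("NFD", "q\u0323\u0307")  # dot below, then dot above
assert a == b == "q\u0323\u0307"
```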
Arabic was excluded from this scheme for (largely) historical reasons;
combinations and precomposed forms are explicitly not considered equal
or equivalent, and one is not intended to be substituted for another. So
as to not break with the existing system, additional composite forms
will be encoded - always without a decomposition.
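U+08A1 (ARABIC LETTER BEH WITH HAMZA ABOVE), the character in this thread's subject line, illustrates exactly this: unlike Latin é, it carries no canonical decomposition, so normalization never folds it together with the sequence BEH + HAMZA ABOVE. This can be checked with Python's `unicodedata` (it needs a Python whose Unicode data is version 7.0 or later, i.e. Python 3.5+):

```python
import unicodedata

beh_hamza = "\u08A1"            # ARABIC LETTER BEH WITH HAMZA ABOVE
beh_plus_mark = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

# No canonical decomposition: the precomposed form stays precomposed...
assert unicodedata.decomposition(beh_hamza) == ""
assert unicodedata.normalize("NFD", beh_hamza) == beh_hamza

# ...and the sequence is never composed into it, so the two spellings
# remain distinct under every normalization form.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, beh_plus_mark) != beh_hamza
```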
(As an aside: Arabic is full of other, non-composite code points that
will look identical to other code points in some context, but are not
supposed to be substituted - yet it's trivial to find instances where
they have been).
Latin, for example, also contains cases where what looks like a base
letter with a mark (stroke, bar or slash) applied to it is not
decomposed canonically. The rationale is that if I apply a "stroke" to a
letter form, the placement of the stroke is not predictable: it may
overstrike the whole letter, or only a stem, or one side of a bowl. Like
the aforementioned case, new stroked, barred or slashed forms will be
encoded in the future, and none of these are (or will be) canonically
equivalent to sequences including the respective combining marks. (This
principle also holds for certain other "attached" marks, like "hook", cf
U+1D92, but not cedilla).
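The same contrast is easy to verify for Latin with `unicodedata`: é decomposes, while the stroked and hooked forms mentioned above do not.

```python
import unicodedata

# Canonically decomposable: the acute accent is a detachable mark.
assert unicodedata.decomposition("\u00E9") == "0065 0301"   # é -> e + U+0301

# Not decomposable: stroke/bar/hook placement is not predictable, so
# these were encoded as atomic letters with no canonical decomposition.
for ch in ("\u0142",   # LATIN SMALL LETTER L WITH STROKE (ł)
           "\u00F8",   # LATIN SMALL LETTER O WITH STROKE (ø)
           "\u1D92"):  # LATIN SMALL LETTER E WITH RETROFLEX HOOK
    assert unicodedata.decomposition(ch) == ""
```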
On the other hand, no new composite forms will be encoded of those that
would have been decomposed in the past.
To come to a short summary of a long story: Unicode is misunderstood if
its combining marks are seen as a set of lego bricks for the assemblage
of glyphs. Superficially, that's what they indeed appear to be. However,
they are always marks with their own identity that happen to be applied,
in writing, to certain base letter forms, with the conventional
appearance being indistinguishable from a "decoration".
Occasionally because of legacy, and occasionally for other reasons,
Unicode has encoded identical shapes using multiple code points
(homographs). A homograph pair can be understood this way: if both
partners were rendered in the same font, they would (practically always)
look identical. Not similar, identical.
The most common cases occur across scripts - as a result of borrowing.
Scripts are a bit of an artificial distinction: when operating on the
level of shapes (whether in hot metal or in a digital font), there's no
need to distinguish whether 'e', 's', 'p', 'x', 'q', 'w' and a number of
other shapes are "Latin" or "Cyrillic". They are the same shape. Whether
they are used to form English, French or Russian words happens to be
determined on another level.
Without the script distinction, these are no longer homographs, because
they would occur in the catalog only once.
Because we do have script distinction in Unicode, they are homographs,
and they are usually handled by limiting script mixing in an identifier -
a rough type of "exclusion mechanism". (The takeaway here is buried in
the reason for the script distinction - it's not needed for human
readers - they go by the shape. It's needed for automatic processing,
like sorting, which goes by the code values).
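A crude sketch of such script-mixing detection is shown below. Real implementations use the Unicode `Script` property (Scripts.txt, or the third-party `regex` module); here the first word of each character name stands in for the script, which is only an approximation for illustration.

```python
import unicodedata

def scripts_rough(label: str) -> set[str]:
    """Very rough per-character 'script' guess from the character name.

    Illustration only: production code should consult the real Unicode
    Script property rather than parse character names.
    """
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

# 'paypal' spelled with Cyrillic a (U+0430) mixes two scripts, even
# though it is a perfect homograph of the all-Latin label.
assert scripts_rough("paypal") == {"LATIN"}
assert scripts_rough("p\u0430yp\u0430l") == {"LATIN", "CYRILLIC"}
```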
Beyond borrowings, other homographs exist, in Latin and other scripts.
In Tamil, one of the letters and one of the digits have precisely the
same shape. Yet, it's not possible for them to be substituted. The
Arabic script is full of instances. (Some of these are homographs
unless they occur in particular locations in a word).
Because the reasons why these homographs were encoded are still as valid
as ever, any new instances that satisfy the same justification will be
encoded as well. In all these cases, the homographs cannot be
substituted without (formally) changing the meaning of the text (when
interpreted by reading code values, of course, not when looking at
marks). Therefore, they cannot have a canonical decomposition.
Canonical decomposition, by necessity, thus cannot "solve" the issue of
turning Unicode into a perfect encoding for the sole purpose of
constructing a robust identifier syntax - like the hypothetical encoding
I opened this message with. If there was, at any time, a
misunderstanding of that, it can't be helped -- we need to look for
solutions elsewhere.
The fundamental design limitation of IDNA 2008 is that, largely, the
rules that it describes pertain to a single label in isolation.
You can look at the string of code points in a putative label, and
compute whether it is conforming or not.
What that kind of system handles poorly is the case where two labels
look identical (or are semantically identical with different appearance
-- where they look "identical" to the mind, not the eyes, of the user).
In these cases, it's not necessarily possible, a priori, to come to a
solid preference of one over the other label (by ruling out certain code
points). In fact, both may be equally usable - if one could guarantee
that the name space did not contain a doppelganger.
That calls for a different mechanism, what I have called an "exclusion
mechanism": having a robust, machine-readable specification of which labels are
equivalent variants of which other labels, so that from such a variant
set, only one of them gets to be an actual identifier. (Presumably the
first to be applied for).
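A minimal sketch of such an exclusion mechanism, with a hypothetical and deliberately tiny homograph-folding table (a real deployment would derive it from curated variant tables; the names `variant_key` and `Registry` are illustrative, not from any specification):

```python
# Hypothetical folding table: each known homograph pair folds to one key,
# so all members of a variant set share the same key.
FOLD = {"\u0430": "a", "\u043E": "o", "\u0435": "e"}  # Cyrillic а, о, е

def variant_key(label: str) -> str:
    """Map a label to the key shared by its whole variant set."""
    return "".join(FOLD.get(ch, ch) for ch in label)

class Registry:
    """Admit only one label per variant set (first come, first served)."""

    def __init__(self):
        self._taken = {}

    def register(self, label: str) -> bool:
        key = variant_key(label)
        holder = self._taken.setdefault(key, label)
        return holder == label  # False: a doppelganger already registered

r = Registry()
assert r.register("paypal") is True
assert r.register("p\u0430yp\u0430l") is False  # Cyrillic-а homograph rejected
assert r.register("paypal") is True             # same label again is fine
```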
This would immediately free up all the labels that do not form a
plausible 'minimal pair' with one of their variants. For example, a word
in a language that uses code point X, where the homograph variant Y is
not on that locale's keyboard, would not be in contention with an
entirely different word where Y appears, but in a different context, as
part of a language not using X on its keyboard. Only the occasional
collision (like "chat" and "chat" in French and English) would test the
formal exclusion mechanism.
This less draconian system is not something that is easy to retrofit on
the protocol level.
But, already, outside the protocol level, issues of near (and not so
near) similarities have to be dealt with. Homographs in particular (and
"variants" in general) have the nice property that they are treatable by
mechanistic rules, because the "similarities" whether graphical or
semantic are "absolute". They can be precomputed and do not require
case-by-case human judgment.
So, seen from the perspective of the entire ecosystem around the
registration of labels, the perceived shortcomings of Unicode are not as
egregious and as devastating as they would appear if one looks only at
the protocol level.
There is a whole spectrum of issues, and a whole set of layers in the
ecosystem to potentially deal with them. Just as string similarity is
not handled in the protocol, these types of homographs should not have
to be handled there either. Let's recommend handling them in more
appropriate ways.