IDNA and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))
asmusf at ix.netcom.com
Mon Jan 26 07:30:54 CET 2015
John, Nico and all,
From the perspective of a robust identifier system, I would want an
underlying encoding that is constructed so that it serves as a catalog
of unique, non-overlapping shapes (that is shapes that are positively
distinguishable). With that, I could ensure that a unique sequence of
code values results in a unique, distinguishable graphical form
(rendered label) that users can select with confidence.
It's a mistake to assume that this describes in any way the primary
mission of the Unicode Standard.
Instead, Unicode is concerned with allowing authors to create strings
for code values that will be (more or less predictably) rendered so that
human readers can discern an intended textual meaning. "If I want to
write "foo" in language X, which codes should (and shouldn't) I use?" is
the question that goes to the heart of this problem.
It's not quite as simple as that, because there's also the need to make
certain automated processes (from spell-checker to sorting) come up with
the correct interpretation of the text - the latter, for example, is
behind the need to separately encode a Greek omicron from a Latin o.
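The omicron/o case can be made concrete with Python's standard `unicodedata` module: the two characters are distinct code points with distinct names, and no normalization form (not even the compatibility forms) unifies them.

```python
import unicodedata

# Latin small letter o and Greek small letter omicron render identically
# in most fonts, yet they are separate code points with separate names.
latin_o = "\u006F"
omicron = "\u03BF"

assert unicodedata.name(latin_o) == "LATIN SMALL LETTER O"
assert unicodedata.name(omicron) == "GREEK SMALL LETTER OMICRON"

# Normalization never maps one to the other, because they carry
# different textual meaning despite the identical shape.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, omicron) != latin_o
```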
Another complication is that human readers can be very tolerant when
associating shapes with letters in well-established contexts, but not
tolerant at all outside of those contexts. If you consider all contexts,
including decorative typefaces, the letter 'a' can have a bewildering
array of actual shapes without losing its essential "a-ness" --- when
used in context. Some of the shapes for capital A can look like U (check
your Fraktur fonts) and, out of the context of running text in Fraktur,
would be misidentified by many users.
Finally, Unicode is intentionally designed to be the *only* such system,
so that code conversion (other than trivial re-packaging) is in
principle needed only for accessing legacy data. However, at the start,
all data was legacy, and Unicode had to be designed to allow migration
of both data and systems.
Canonical decomposition entered the picture because the legacy was at
odds with how the underlying writing system was analyzed. In looking at
the way orthographies were developed based on the Latin and Cyrillic
alphabets, it's obvious that plain letterforms are re-used over and over
again, but with the addition of a mark or modifier. These modifiers are
named, have their own identity, and can, in principle, be applied to any
letter -- often causing a predictable variation of the value of the base letter.
Legacy, instead, cataloged the combinations.
For Latin and Cyrillic primarily, and many other scripts, but for some
historical reason not for Arabic, Unicode supports the system of
"applying marks" to base letters, by encoding the marks directly. To
support legacy, common combinations had to be encoded as well. Canonical
decomposition is in one sense an assertion of which sequence of base +
mark a given combination is equivalent to. (In another sense,
decomposition asserts that the ordering of marks in a combining sequence
does not matter for some of these marks, but matters for others).
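Both senses can be observed directly with Python's `unicodedata` module: NFD expands a precomposed letter into base + mark, and normalization reorders marks with different combining classes into a canonical order while leaving same-class marks in their original order.

```python
import unicodedata

# Sense 1: a precomposed letter is equivalent to base + combining mark.
assert unicodedata.normalize("NFD", "\u00E9") == "e\u0301"   # é -> e + COMBINING ACUTE
assert unicodedata.normalize("NFC", "e\u0301") == "\u00E9"   # and back again

# Sense 2: marks with different combining classes are put into canonical
# order (dot below, class 220, sorts before dot above, class 230), so
# these two input orders are canonically equivalent.
a = unicodedata.normalize("NFD", "q\u0307\u0323")  # dot above, then dot below
b = unicodedata.normalize("NFD", "q\u0323\u0307")  # dot below, then dot above
assert a == b == "q\u0323\u0307"
```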
Arabic was excluded from this scheme for (largely) historical reasons;
combinations and precomposed forms are explicitly not considered equal
or equivalent, and one is not intended to be substituted for another. So
as to not break with the existing system, additional composite forms
will be encoded - always without a decomposition.
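U+08A1 (ARABIC LETTER BEH WITH HAMZA ABOVE), the character in this thread's subject line, illustrates exactly this: unlike Latin é, it carries no canonical decomposition, so normalization never folds it together with the sequence BEH + HAMZA ABOVE. This can be checked with Python's `unicodedata` (it needs a Python whose Unicode data is version 7.0 or later, i.e. Python 3.5+):

```python
import unicodedata

beh_hamza = "\u08A1"            # ARABIC LETTER BEH WITH HAMZA ABOVE
beh_plus_mark = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

# No canonical decomposition: the precomposed form stays precomposed...
assert unicodedata.decomposition(beh_hamza) == ""
assert unicodedata.normalize("NFD", beh_hamza) == beh_hamza

# ...and the sequence is never composed into it, so the two spellings
# remain distinct under every normalization form.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, beh_plus_mark) != beh_hamza
```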
(As an aside: Arabic is full of other, non-composite code points that
will look identical to other code points in some context, but are not
supposed to be substituted - yet it's trivial to find instances where
they have been).
Latin, for example, also contains cases where what looks like a base
letter with a mark (stroke, bar or slash) applied to it is not
decomposed canonically. The rationale is that if I apply a "stroke" to a
letter form, the placement of the stroke is not predictable: it may
overstrike the whole letter, or only a stem, or one side of a bowl. Like
the aforementioned case, new stroked, barred or slashed forms will be
encoded in the future, and none of these are (or will be) canonically
equivalent to sequences including the respective combining marks. (This
principle also holds for certain other "attached" marks, like "hook", cf
U+1D92, but not cedilla).
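The same contrast is easy to verify for Latin with `unicodedata`: é decomposes, while the stroked and hooked forms mentioned above do not.

```python
import unicodedata

# Canonically decomposable: the acute accent is a detachable mark.
assert unicodedata.decomposition("\u00E9") == "0065 0301"   # é -> e + U+0301

# Not decomposable: stroke/bar/hook placement is not predictable, so
# these were encoded as atomic letters with no canonical decomposition.
for ch in ("\u0142",   # LATIN SMALL LETTER L WITH STROKE (ł)
           "\u00F8",   # LATIN SMALL LETTER O WITH STROKE (ø)
           "\u1D92"):  # LATIN SMALL LETTER E WITH RETROFLEX HOOK
    assert unicodedata.decomposition(ch) == ""
```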
On the other hand, no new composite forms will be encoded of those that
would have been decomposed in the past.
To come to a short summary of a long story: Unicode is misunderstood if
its combining marks are seen as a set of lego bricks for the assemblage
of glyphs. Superficially, that's what they indeed appear to be. However,
they are always marks with their own identity that happen to be applied,
in writing, to certain base letter forms, with the conventional
appearance being indistinguishable from a "decoration".
Occasionally because of legacy, and occasionally for other reasons,
Unicode has encoded identical shapes using multiple code points
(homographs). A homograph pair can be understood this way: if both
partners were rendered in the same font, they would (practically always)
look identical. Not similar, identical.
The most common cases occur across scripts - as a result of borrowing.
Scripts are a bit of an artificial distinction: when operating on the
level of shapes (whether in hot metal or in a digital font), there's no
need to distinguish whether 'e', 's', 'p', 'x', 'q', 'w' and a number of
other shapes are "Latin" or "Cyrillic". They are the same shape. Whether
they are used to form English, French or Russian words happens to be
determined on another level.
Without the script distinction, these are no longer homographs, because
they would occur in the catalog only once.
Because we do have script distinction in Unicode, they are homographs,
and they are usually handled by limiting script mixing in an identifier -
a rough type of "exclusion mechanism". (The takeaway here is buried in
the reason for the script distinction - it's not needed for human
readers - they go by the shape. It's needed for automatic processing,
like sorting, which goes by the code values).
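A crude sketch of such script-mixing detection is shown below. Real implementations use the Unicode `Script` property (Scripts.txt, or the third-party `regex` module); here the first word of each character name stands in for the script, which is only an approximation for illustration.

```python
import unicodedata

def scripts_rough(label: str) -> set[str]:
    """Very rough per-character 'script' guess from the character name.

    Illustration only: production code should consult the real Unicode
    Script property rather than parse character names.
    """
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

# 'paypal' spelled with Cyrillic a (U+0430) mixes two scripts, even
# though it is a perfect homograph of the all-Latin label.
assert scripts_rough("paypal") == {"LATIN"}
assert scripts_rough("p\u0430yp\u0430l") == {"LATIN", "CYRILLIC"}
```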
Beyond borrowings, other homographs exist, in Latin and other scripts.
In Tamil, one of the letters and one of the digits have precisely the
same shape. Yet, it's not possible for them to be substituted. The
Arabic script is full of instances. (Some of these are homographs
unless they occur in particular locations in a word).
Because the reasons why these homographs were encoded are still as valid
as ever, any new instances that satisfy the same justification will be
encoded as well. In all these cases, the homographs cannot be
substituted without (formally) changing the meaning of the text (when
interpreted by reading code values, of course, not when looking at
marks). Therefore, they cannot have a canonical decomposition.
Canonical decomposition, by necessity, thus cannot "solve" the issue of
turning Unicode into a perfect encoding for the sole purpose of
constructing a robust identifier syntax - like the hypothetical encoding
I opened this message with. If there was, at any time, a
misunderstanding of that, it can't be helped -- we need to look for
solutions elsewhere.
The fundamental design limitation of IDNA 2008 is that, largely, the
rules that it describes pertain to a single label in isolation.
You can look at the string of code points in a putative label, and
compute whether it is conforming or not.
What that kind of system handles poorly is the case where two labels
look identical (or are semantically identical with different appearance
-- where they look "identical" to the mind, not the eyes, of the user).
In these cases, it's not necessarily possible, a priori, to come to a
solid preference of one over the other label (by ruling out certain code
points). In fact, both may be equally usable - if one could guarantee
that the name space did not contain a doppelganger.
That calls for a different mechanism, what I have called an "exclusion
mechanism": having a robust, machine-readable specification of which labels are
equivalent variants of which other labels, so that from such a variant
set, only one of them gets to be an actual identifier. (Presumably the
first to be applied for).
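A minimal sketch of such an exclusion mechanism, with a hypothetical and deliberately tiny homograph-folding table (a real deployment would derive it from curated variant tables; the names `variant_key` and `Registry` are illustrative, not from any specification):

```python
# Hypothetical folding table: each known homograph pair folds to one key,
# so all members of a variant set share the same key.
FOLD = {"\u0430": "a", "\u043E": "o", "\u0435": "e"}  # Cyrillic а, о, е

def variant_key(label: str) -> str:
    """Map a label to the key shared by its whole variant set."""
    return "".join(FOLD.get(ch, ch) for ch in label)

class Registry:
    """Admit only one label per variant set (first come, first served)."""

    def __init__(self):
        self._taken = {}

    def register(self, label: str) -> bool:
        key = variant_key(label)
        holder = self._taken.setdefault(key, label)
        return holder == label  # False: a doppelganger already registered

r = Registry()
assert r.register("paypal") is True
assert r.register("p\u0430yp\u0430l") is False  # Cyrillic-а homograph rejected
assert r.register("paypal") is True             # same label again is fine
```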
This would immediately free up all the labels that do not form a
plausible 'minimal pair' with one of their variants. For example, a word
in a language that uses code point X, where the homograph variant Y is
not on that locale's keyboard, would not be in contention with an
entirely different word where Y appears, but in a different context, as
part of a language not using X on its keyboard. Only the occasional
collision (like "chat" and "chat" in French and English) would test the
formal exclusion mechanism.
This less draconian system is not something that is easy to retrofit on
the protocol level.
But, already, outside the protocol level, issues of near (and not so
near) similarities have to be dealt with. Homographs in particular (and
"variants" in general) have the nice property that they are treatable by
mechanistic rules, because the "similarities" whether graphical or
semantic are "absolute". They can be precomputed and do not require
case-by-case human judgment.
So, seen from the perspective of the entire ecosystem around the
registration of labels, the perceived shortcomings of Unicode are not as
egregious and as devastating as they would appear if one looks only at
the protocol level.
There is a whole spectrum of issues, and a whole set of layers in the
ecosystem to potentially deal with them. Just as string similarity is
not handled in the protocol, these types of homographs should not have
to be handled there either. Let's recommend handling them in more
appropriate ways.