IDNA and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))
asmusf at ix.netcom.com
Mon Jan 26 09:29:06 CET 2015
On 1/25/2015 11:17 PM, Pete Resnick wrote:
> Thanks for the explication. It is helpful. However, there's one
> section that doesn't answer the question I've had about this entire
> episode, and I hope you can elucidate:
> On 1/25/15 10:30 PM, Asmus Freytag wrote:
>> Occasionally, because of legacy, occasionally for other reasons,
>> Unicode has encoded identical shapes using multiple code points
>> (homographs). A homograph pair can be understood as something that if
>> both partners were rendered in the same font, they would (practically
>> always) look identical. Not similar, identical.
> I think "homographs" is a bit of a red herring here. I think many of
> us understand that there are, and will always be, homographs in
> Unicode, and that some of the homographs will be interestingly related
> (e.g., ones whose code points canonically decompose into other code
> points, like LATIN SMALL LETTER U WITH DIAERESIS U+00FC and LATIN SMALL
> LETTER U U+0075 followed by COMBINING DIAERESIS U+0308) and some that
> are not interestingly related at all (e.g., DIGIT ZERO U+0030 and
> LATIN CAPITAL LETTER O U+004F).
See, that's where we are at odds.
The latter set are not homographs - they are similar, confusable, or
whatever name you'd like to give two (or more) characters, for which the
set of shapes is merely *overlapping*. That is, some fonts, in some
magnification may (!) render these pixel for pixel identically, but most
fonts in most magnifications will (!) render them as distinguishable.
That's not a homograph.
A homograph is Tamil letter KA and Tamil digit 1. Those look exactly
alike (they are the same mark on paper) but because we need to strictly
distinguish letters and digits in computing (for parsing and number
processing), we now have two code points for this pair of homographs. (I
and my colleagues sometimes use homoglyph, but in this discussion I have
used homograph exclusively).
> The confusion in the present case is not that ARABIC LETTER BEH WITH
> HAMZA ABOVE U+08A1 is a homograph of ARABIC LETTER BEH U+0628 followed
> by ARABIC HAMZA ABOVE U+0654; that the two are homographs seems
> perfectly reasonable. It's that they don't appear to be "interestingly
> related" in the way one would expect given their names, and given the
> apparent semantics of each.
As the discussion above makes clear, homographs often have an
"interesting" relation, but it's not always a formal one in terms of a
normalization between the two. In fact, in the example I gave, you
wouldn't want a normalization, because that would effectively destroy
the ability to designate one a letter (KA) and one a digit (1).
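A minimal sketch of that point, using Python's standard unicodedata module (the helper name unified_by_normalization is made up for illustration): the two Tamil code points carry different general categories, and no normalization form maps one to the other.

```python
import unicodedata

TAMIL_LETTER_KA = "\u0b95"   # letter
TAMIL_DIGIT_ONE = "\u0be7"   # digit; same mark on paper as KA


def describe(ch):
    """Return (character name, general category) for one code point."""
    return (unicodedata.name(ch), unicodedata.category(ch))


def unified_by_normalization(a, b):
    """True if any of the four normalization forms maps a and b to the same string."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return any(
        unicodedata.normalize(f, a) == unicodedata.normalize(f, b) for f in forms
    )


letter = describe(TAMIL_LETTER_KA)
digit = describe(TAMIL_DIGIT_ONE)
print(letter, digit)
print("unified by normalization:",
      unified_by_normalization(TAMIL_LETTER_KA, TAMIL_DIGIT_ONE))
```

The letter/digit distinction lives entirely in the character properties, which is what a parser or number processor would consult.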
What you call an "interesting relation" is really canonical equivalence.
Canonical equivalence asserts that there is only the weakest reason to
have two coded representations, the reason often being that some contexts
(legacy) require a single code point, while other contexts can equally
well handle a sequence.
So, by having a relation between sequence <base, mark> and single
element sequence <composite> Unicode asserts that in those cases there
isn't a reason to make a distinction and using either encoding should
lead to the same effect.
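As a rough illustration (Python's standard unicodedata module; the function name canonically_equivalent is an assumption of this sketch, not a library call), canonical equivalence can be tested by comparing the NFD forms of two strings:

```python
import unicodedata

COMPOSITE = "\u00fc"    # LATIN SMALL LETTER U WITH DIAERESIS
SEQUENCE = "u\u0308"    # LATIN SMALL LETTER U + COMBINING DIAERESIS


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The raw code point sequences differ, but normalization erases the difference.
print(COMPOSITE == SEQUENCE)
print(canonically_equivalent(COMPOSITE, SEQUENCE))
print(unicodedata.normalize("NFC", SEQUENCE) == COMPOSITE)
```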
This assertion is generally made for acute accent, grave accent, dot above,
cedilla, tilde above and a whole number of other accent marks. The
assertion is NOT made for stroke, bar, slash, reverse slash overlay,
hook, reverse hook and a whole number of other -attached- marks; even if
composites of the latter are named using the same "WITH xxxx mark"
construction as the former.
So the naming convention is not a guide!
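A small sketch of the contrast, again with Python's unicodedata (the sample list and the helper has_canonical_decomposition are chosen here for illustration): every sample is named "... WITH ...", yet only the accent composites carry a canonical decomposition.

```python
import unicodedata

# All of these are named "... WITH <mark>".
SAMPLES = [
    "\u00e9",  # LATIN SMALL LETTER E WITH ACUTE
    "\u00e8",  # LATIN SMALL LETTER E WITH GRAVE
    "\u00e7",  # LATIN SMALL LETTER C WITH CEDILLA
    "\u00f1",  # LATIN SMALL LETTER N WITH TILDE
    "\u00f8",  # LATIN SMALL LETTER O WITH STROKE
    "\u0142",  # LATIN SMALL LETTER L WITH STROKE
    "\u0257",  # LATIN SMALL LETTER D WITH HOOK
]


def has_canonical_decomposition(ch):
    """True if the character has a canonical (untagged) decomposition mapping."""
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings start with a tag such as "<compat>"; skip those.
    return bool(mapping) and not mapping.startswith("<")


for ch in SAMPLES:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"{has_canonical_decomposition(ch)}")
```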
(It's also important not to be lured into thinking of combining marks as
a lego set for glyph building - on the whole, they really don't function
that way, even if in some areas, like math and certain separable marks
in Latin, they look like they do).
> So here are the questions I'd really like to understand the answers to:
> 1. Is there some semantic relationship between LATIN SMALL LETTER U
> WITH DIAERESIS U+00FC and LATIN SMALL LETTER U U+0075 followed by
> COMBINING DIAERESIS U+0308 that does not exist between ARABIC LETTER
> BEH WITH HAMZA ABOVE U+08A1 and ARABIC LETTER BEH U+0628 followed by
> ARABIC HAMZA ABOVE U+0654? If so, is there some documented way to know
> this beyond examination of their names (which obviously would give one
> the wrong impression in this case)?
Unicode, as a general principle, treats the Arabic script as
not-decomposable. I'm a bit at a loss as to where that is documented, other
than in the relevant chapter of the book - I know it from having lived
through the relevant discussions on the UTC over the decades.
You can argue whether that choice for Arabic was correct, or ideal; at
this point, that's beside the point. UTC is stuck with having to do for
Arabic what it tried not to do for Latin, which is to individually
encode certain productive variations of letter shapes.
The diaeresis is an interesting example. Because, semantically, there are
two concepts here. German "umlaut" and marking double vowels as
pronounced in separation, as you have in coöperation. This distinction
is a moot point, because legacy used the same Latin-1 composite for
both. So which convention applies is left to human-interpretable context
and is not carried in the encoding.
Therefore, the combining diaeresis is likewise usable for both.
Having accepted that the semantics can be multivalued in this case, the
two encodings are equivalent: <0075, 0308> and <00FC> both refer to the
same entity, which is ambiguous as to whether it's u-umlaut or
u-diaeresis; which one it is in the end depends on the context of the
text in which it is used.
ARABIC HAMZA ABOVE U+0654 is used to mark a glottal stop. The experts
that worked out the encoding for U+08A1 assert that the result is not
the same as applying a glottal stop, and further, that it is not useful
to analyze it as anything other than a "variant letter form" that
"happens to look like a hamzah was drawn over a beh".
As far as I understand this (at some remove from the actual decision)
the issue is that creating a canonical equivalency would assert that
0654 is not always a Hamza (glottal stop), that it only "looks like
one". Not useful - and related to the reason for not treating Arabic as
decomposable in the first place.
(In Arabic, many of the "combining" characters, are actually letters in
their own right, such as vowel marks, which is a distinction from Latin
accent marks and diacritics).
> 2. Are there other homographs in Unicode that appear within the same
> script, and use a similar naming convention to the examples above
> (where the name of one is a combination of the other two with the word
> "WITH" between them), yet they are not related in such a way that one
> canonically decomposes to the others? And again, if so, is there some
> documented way to know why some do and some don't?
The naming convention appears to be the real red herring here. I
personally would be ready to concede that the naming of the new code
point was not fortunate. It (perhaps a bit mindlessly) followed a
Unicode pattern of describing the most common glyphic representation,
without taking into account that in this case, it's not a Hamza (glottal
stop) but a mark that looks like a hamza - that is, there's a big
disconnect between semantics and appearance.
> If the answer in this case is really that U+08A1 is not interestingly
> related to [U+0628 U+0654], or at least not in the way that would
> result in a canonical decomposition, their names notwithstanding, I
> think we'll all be OK with "these are more like U+0030 and U+004F" and
> move on. But I would like to understand what, if any, relationship
> there is and then we can make a judgment about whether U+08A1 should
> be treated as a special case or not in IDNA and elsewhere.
Unicode is clear in documenting that substituting the sequence is
incorrect. Given that, you really have no equivalence on which you could
base a decomposition - it would destroy the text. If you write Fula,
only the combination is permissible. If you write Arabic texts with
glottal stops marked (not the common case) then you use the separate hamza.
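To make this concrete, here is a minimal check with Python's unicodedata, assuming a Python build whose Unicode data is version 7.0 or later (so that U+08A1 is known); the constant names are illustrative only. The composite has no decomposition mapping, and no normalization form turns the sequence into it or vice versa.

```python
import unicodedata

BEH_WITH_HAMZA_ABOVE = "\u08a1"   # single code point, used for Fula
BEH_PLUS_HAMZA = "\u0628\u0654"   # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

print(unicodedata.name(BEH_WITH_HAMZA_ABOVE))
# An empty string means: no canonical or compatibility decomposition.
print(repr(unicodedata.decomposition(BEH_WITH_HAMZA_ABOVE)))

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    same = (unicodedata.normalize(form, BEH_WITH_HAMZA_ABOVE)
            == unicodedata.normalize(form, BEH_PLUS_HAMZA))
    print(form, "maps both to the same string:", same)
```

So a comparison that relies on normalization alone will treat the two as distinct labels, even though they render alike.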
As has been mentioned, the Arabic panel that looked at the Root Zone
repertoire decided that Hamza wasn't needed (none of the combining marks
for Arabic). Likewise, the panels would have excluded the composite,
because Arabic is just an alternate way of writing Fula (Latin being the
common way) and therefore not apropos for the root (as it happens, this
decision has actually not been made, because of the delay of the 7.0
IDNA tables, but that's the outcome I would tend to predict if I had to).
I believe you can, in this instance, confidently make the judgment to
ignore the code point as not important enough to matter.
There are tens of thousands of archaic code points that are not well
understood (nor usable by anyone I know) that happen to be PVALID.
Presumably, they are PVALID because this allows anyone who really needs
to create IDNs using hieroglyphics, or cuneiform for example, to do so.
Nobody knows what "interesting" or "uninteresting" issues lurk in that
This case is equally rarefied, except it happens to occupy the margins
of a more common script. The way to treat these issues is to construct
robust IDN tables that
a) exclude rarefied stuff
b) use "exclusion mechanisms" of the kind that are beyond the protocol
level (not repeating that discussion here).
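Point (a) can be sketched as a plain allowlist check. Everything below is hypothetical: the repertoire (basic Arabic letters U+0621..U+064A minus tatweel) and the function names are invented for illustration and do not reproduce any actual registry table. Point (b) is not covered by this sketch.

```python
def build_repertoire(ranges, excluded=()):
    """Collect the code points in the given inclusive ranges, minus exclusions."""
    allowed = set()
    for first, last in ranges:
        allowed.update(range(first, last + 1))
    return allowed - set(excluded)


def label_permitted(label, repertoire):
    """A label passes only if every one of its code points is in the repertoire."""
    return all(ord(ch) in repertoire for ch in label)


# Hypothetical table: basic Arabic letters only, without tatweel (U+0640).
# Combining marks such as U+0654 and rare letters such as U+08A1 fall outside.
REPERTOIRE = build_repertoire([(0x0621, 0x064A)], excluded=[0x0640])

print(label_permitted("\u0628\u0627\u0628", REPERTOIRE))  # beh, alef, beh
print(label_permitted("\u08a1", REPERTOIRE))              # U+08A1
print(label_permitted("\u0628\u0654", REPERTOIRE))        # beh + hamza above
```

With a table like this, neither member of the homograph pair can be registered, so the question of how they relate never arises at the registry.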
Hope this helps.