IDNA and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))

Mon Jan 26 09:29:06 CET 2015

On 1/25/2015 11:17 PM, Pete Resnick wrote:
> Asmus,
>
> Thanks for the explication. It is helpful. However, there's one 
> section that doesn't answer the question I've had about this entire 
> episode, and I hope you can elucidate:
>
> On 1/25/15 10:30 PM, Asmus Freytag wrote:
>> Occasionally, because of legacy, occasionally for other reasons, 
>> Unicode has encoded identical shapes using multiple code points 
>> (homographs). A homograph pair can be understood as something that if 
>> both partners were rendered in the same font, they would (practically 
>> always) look identical. Not similar, identical.
>
> I think "homographs" is a bit of a red herring here. I think many of 
> us understand that there are, and will always be, homographs in 
> Unicode, and that some of the homographs will be interestingly related 
> (e.g., ones whose code points canonically decompose into other code 
> points, like LATIN SMALL LETTER U WITH DIAERESIS U+00FC and LATIN SMALL
> LETTER U U+0075 followed by COMBINING DIAERESIS U+0308) and some that
> are not interestingly related at all (e.g., DIGIT ZERO U+0030 and 
> LATIN CAPITAL LETTER O U+004F). 

See, that's where we are at odds.

The latter set are not homographs - they are similar, confusable, or 
whatever name you'd like to give two (or more) characters, for which the 
set of shapes is merely *overlapping*. That is, some fonts, in some 
magnification may (!) render these pixel for pixel identically, but most 
fonts in most magnifications will (!) render them as distinguishable. 
That's not a homograph.

A homograph is Tamil letter KA and Tamil digit 1. Those look exactly 
alike (they are the same mark on paper) but because we need to strictly 
distinguish letters and digits in computing (for parsing and number 
processing), we now have two code points for this pair of homographs. (I 
and my colleagues sometimes use homoglyph, but in this discussion I have 
used homograph exclusively).

> The confusion in the present case is not that ARABIC LETTER BEH WITH
> HAMZA ABOVE U+08A1 is a homograph of ARABIC LETTER BEH U+0628 followed 
> by ARABIC HAMZA ABOVE U+0654; that the two are homographs seems 
> perfectly reasonable. It's that they don't appear to be "interestingly 
> related" in the way one would expect given their names, and given the 
> apparent semantics of each. 

As the discussion above makes clear, homographs often have an 
"interesting" relation, but it's not always a formal one in terms of a 
normalization between the two. In fact, in the example I gave, you 
wouldn't want a normalization, because that would effectively destroy 
the ability to designate one a letter (KA) and one a digit (1).
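The letter/digit distinction can be checked against the character database. The following is a minimal sketch using Python's standard unicodedata module; the helper name survives_all_forms is made up for this illustration, and the properties reported depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

ka = "\u0b95"   # TAMIL LETTER KA
one = "\u0be7"  # TAMIL DIGIT ONE

# Same mark on paper, different properties: one is a letter, one a digit.
for ch in (ka, one):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


def survives_all_forms(ch):
    """True if no normalization form changes ch (hypothetical helper)."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(unicodedata.normalize(form, ch) == ch for form in forms)


# Neither code point is folded into the other by any normalization form.
print(survives_all_forms(ka), survives_all_forms(one))
```

Because no normalization form maps one onto the other, a parser can still tell the letter from the digit after the text has been normalized.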

What you call an "interesting relation" is really canonical equivalence.
Canonical equivalence asserts that there is only the weakest reason to
have two coded representations, the reason often being that some contexts
(legacy) require a single code point, while other contexts can equally
well handle a sequence.

So, by having a relation between sequence <base, mark> and single 
element sequence <composite> Unicode asserts that in those cases there 
isn't a reason to make a distinction and using either encoding should 
lead to the same effect.

This assertion is generally made for acute accent, grave accent, dot
above, cedilla, tilde above and a whole range of other accent marks. The
assertion is NOT made for stroke, bar, slash, reverse slash overlay,
hook, reverse hook and a whole range of other -attached- marks, even if
composites of the latter are named using the same "WITH xxxx mark"
construction as the former.

So the naming convention is not a guide!
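A small check makes the point concrete. This is only a sketch using Python's standard unicodedata module; the helper canonical_decomposition is a name invented for the illustration, and the four sample characters are my own choice of examples, not ones taken from the discussion.

```python
import unicodedata


def canonical_decomposition(ch):
    """Return the canonical decomposition of ch as code points, or [] if none.

    Compatibility mappings (tagged like "<compat> ...") are ignored.
    """
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        return []
    return [int(cp, 16) for cp in mapping.split()]


samples = [
    "\u00e9",  # LATIN SMALL LETTER E WITH ACUTE
    "\u00e7",  # LATIN SMALL LETTER C WITH CEDILLA
    "\u00f8",  # LATIN SMALL LETTER O WITH STROKE
    "\u0142",  # LATIN SMALL LETTER L WITH STROKE
]

for ch in samples:
    parts = [f"U+{cp:04X}" for cp in canonical_decomposition(ch)]
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {parts or 'no decomposition'}")
```

All four names use the "WITH xxxx" pattern, but only the acute and cedilla composites carry a canonical decomposition; the two stroke letters do not.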

(It's also important not to be lured into thinking of combining marks as
a lego set for glyph building - on the whole, they really don't function
that way, even if some areas, like math and certain separable marks in
Latin, look like they do).

> So here are the questions I'd really like to understand the answers to:
>
> 1. Is there some semantic relationship between LATIN SMALL LETTER U
> WITH DIAERESIS U+00FC and LATIN SMALL LETTER U U+0075 followed by
> COMBINING DIAERESIS U+0308 that does not exist between ARABIC LETTER
> BEH WITH HAMZA ABOVE U+08A1 and ARABIC LETTER BEH U+0628 followed by 
> ARABIC HAMZA ABOVE U+0654? If so, is there some documented way to know 
> this beyond examination of their names (which obviously would give one 
> the wrong impression in this case)?

Unicode, as a general principle, treats the Arabic script as 
not-decomposable. I'm a bit at a loss to where that is documented, other 
than in the relevant chapter of the book - I know it from having lived 
through the relevant discussions on the UTC over the decades.

You can argue whether that choice for Arabic was correct, or ideal; at 
this point, that's beside the point. UTC is stuck with having to do for 
Arabic what it tried not to do for Latin, which is to individually 
encode certain productive variations of letter shapes.

The diaeresis is an interesting example, because semantically there are
two concepts here: the German "umlaut", and the mark showing that two
adjacent vowels are pronounced separately, as in coöperation. This
distinction is a moot point, because legacy used the same Latin-1
composite for both. So which convention applies is left to
human-interpretable context and is not carried in the encoding.

Therefore, the combining diaeresis is likewise usable for both.

Having accepted that the semantics can be multivalued in this case, the
two encodings are equivalent: <0075, 0308> and <00FC> both refer to the
same entity, which is ambiguous as to whether it's u-umlaut or
u-diaeresis; which one it is in the end depends on the context of the
text in which it is used.
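In practice, this equivalence is what the normalization forms implement. A minimal sketch with Python's standard unicodedata module (the variable names are only for illustration):

```python
import unicodedata

composed = "\u00fc"   # LATIN SMALL LETTER U WITH DIAERESIS
sequence = "u\u0308"  # LATIN SMALL LETTER U + COMBINING DIAERESIS

# The raw code point sequences differ ...
print(composed == sequence)

# ... but each normalizes to the other, so the two are canonically equivalent.
print(unicodedata.normalize("NFC", sequence) == composed)
print(unicodedata.normalize("NFD", composed) == sequence)
```

Nothing in either form records whether an umlaut or a diaeresis was meant; that stays with the surrounding text.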

ARABIC HAMZA ABOVE U+0654 is used to mark a glottal stop. The experts
that worked out the encoding for U+08A1 assert that the result is not
the same as applying a glottal stop, and further, that it is not useful
to analyze it as anything other than a "variant letter form" that
"happens to look like a hamzah was drawn over a beh".

As far as I understand this (at some remove from the actual decision)
the issue is that creating a canonical equivalency would assert that
0654 is not always a Hamza (glottal stop), that it only "looks like
one". Not useful - and related to the reason for not treating Arabic as
decomposable in the first place.

(In Arabic, many of the "combining" characters are actually letters in
their own right, such as vowel marks, which distinguishes them from
Latin accent marks and diacritics).
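
The absence of an equivalence for U+08A1 is visible to any normalizer. Here is a minimal sketch using Python's standard unicodedata module; the helper canonically_equivalent is a name invented for this illustration.

```python
import unicodedata

beh_hamza = "\u08a1"      # ARABIC LETTER BEH WITH HAMZA ABOVE
sequence = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE


def canonically_equivalent(a, b):
    """Compare two strings after canonical decomposition (hypothetical helper)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# U+08A1 carries no decomposition mapping, so this prints an empty string.
print(repr(unicodedata.decomposition(beh_hamza)))

# Normalization therefore never turns one spelling into the other.
print(canonically_equivalent(beh_hamza, sequence))
print(unicodedata.normalize("NFC", sequence) == sequence)
```

A comparison that relies on NFC or NFD alone will treat the two spellings as different strings.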
>
> 2. Are there other homographs in Unicode that appear within the same 
> script, and use a similar naming convention to the examples above 
> (where the name of one is a combination of the other two with the word 
> "WITH" between them), yet they are not related in such a way that one 
> canonically decomposes to the others? And again, if so, is there some 
> documented way to know why some do and some don't?

The naming convention appears to be the real red herring here. I
personally would be ready to concede that the naming of the new code
point was unfortunate. It (perhaps a bit mindlessly) followed a
Unicode pattern of describing the most common glyphic representation,
without taking into account that in this case, it's not a Hamza (glottal
stop) but a mark that looks like a hamza - that is, there's a big
disconnect between semantics and appearance.

>
>
> If the answer in this case is really that U+08A1 is not interestingly 
> related to [U+0628 U+0654], or at least not in the way that would 
> result in a canonical decomposition, their names notwithstanding, I 
> think we'll all be OK with "these are more like U+0030 and U+004F" and 
> move on. But I would like to understand what, if any, relationship 
> there is and then we can make a judgment about whether U+08A1 should 
> be treated as a special case or not in IDNA and elsewhere.

Unicode is clear in documenting that substituting the sequence is 
incorrect. Given that, you really have no equivalence on which you could 
base a decomposition - it would destroy the text. If you write Fula, 
only the combination is permissible. If you write Arabic texts with 
glottal stops marked (not the common case) then you use the separate hamza.

As has been mentioned, the Arabic panel that looked at the Root Zone
repertoire decided that Hamza wasn't needed (nor were any of the
combining marks for Arabic). Likewise, the panels would have excluded
the composite, because Arabic is just an alternate way of writing Fula
(Latin being the common way) and therefore not apropos for the root (as
it happens, this decision has actually not been made, because of the
delay of the 7.0 IDNA tables, but that's the outcome I would tend to
predict if I had to).

I believe you can, in this instance, confidently make the judgment to 
ignore the code point as not important enough to matter.

There are tens of thousands of archaic code points that are not well
understood (nor usable by anyone I know) that happen to be PVALID.
Presumably, they are PVALID because this allows anyone who really needs
to create IDNs using hieroglyphics or cuneiform, for example, to do so.
Nobody knows what "interesting" or "uninteresting" issues lurk in that
space.

This case is equally rarefied, except it happens to occupy the margins
of a more common script. The way to treat these issues is to construct
robust IDN tables that
a) exclude rarefied stuff
b) use "exclusion mechanisms" of the kind that are beyond the protocol
level (not repeating that discussion here).
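
As a rough illustration of point (a) only, a table can be applied as a plain allow-list over code points. The sketch below is not any registry's actual table: EXAMPLE_REPERTOIRE is a made-up, deliberately tiny set of basic Arabic letters, and disallowed_code_points is a hypothetical helper name.

```python
# Hypothetical repertoire: ALEF, BEH, TEH, LAM, MEEM, NOON. No combining
# marks and no rare composites are listed, so they are excluded by default.
EXAMPLE_REPERTOIRE = {0x0627, 0x0628, 0x062A, 0x0644, 0x0645, 0x0646}


def disallowed_code_points(label, repertoire):
    """Return the code points in label that the repertoire does not list."""
    return [f"U+{ord(ch):04X}" for ch in label if ord(ch) not in repertoire]


# A label using only listed letters passes.
print(disallowed_code_points("\u0628\u0627\u0628", EXAMPLE_REPERTOIRE))

# Neither the composite nor the <beh, hamza above> sequence gets through.
print(disallowed_code_points("\u08a1\u0627", EXAMPLE_REPERTOIRE))
print(disallowed_code_points("\u0628\u0654\u0627", EXAMPLE_REPERTOIRE))
```

With such a table neither spelling can be registered, so the question of how the two relate never arises at that level.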

Hope this helps.

A./