IDNA and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))

Mon Jan 26 08:17:13 CET 2015

Asmus,

Thanks for the explication. It is helpful. However, there's one section 
that doesn't answer the question I've had about this entire episode, and 
I hope you can elucidate:

On 1/25/15 10:30 PM, Asmus Freytag wrote:
> Occasionally, because of legacy, occasionally for other reasons, 
> Unicode has encoded identical shapes using multiple code points 
> (homographs). A homograph pair can be understood as something that if 
> both partners were rendered in the same font, they would (practically 
> always) look identical. Not similar, identical.

I think "homographs" is a bit of a red herring here. I think many of us 
understand that there are, and will always be, homographs in Unicode, 
and that some of the homographs will be interestingly related (e.g., 
ones whose code points canonically decompose into other code points, 
like LATIN SMALL LETTER U WITH DIAERESIS U+00FC and LATIN SMALL LETTER U 
U+0075 followed by COMBINING DIAERESIS U+0308) and some that are not 
interestingly related at all (e.g., DIGIT ZERO U+0030 and LATIN CAPITAL 
LETTER O U+004F). The confusion in the present case is not that ARABIC 
LETTER BEH WITH HAMZA ABOVE U+08A1 is a homograph of ARABIC LETTER BEH 
U+0628 followed by ARABIC HAMZA ABOVE U+0654; that the two are 
homographs seems perfectly reasonable. It's that they don't appear to be 
"interestingly related" in the way one would expect given their names, 
and given the apparent semantics of each. So here are the questions I'd 
really like to understand the answers to:

1. Is there some semantic relationship between LATIN SMALL LETTER U WITH 
DIAERESIS U+00FC and LATIN SMALL LETTER U U+0075 followed by COMBINING 
DIAERESIS U+0308 that does not exist between ARABIC LETTER BEH WITH HAMZA 
ABOVE U+08A1 and ARABIC LETTER BEH U+0628 followed by ARABIC HAMZA ABOVE 
U+0654? If so, is there some documented way to know this beyond 
examination of their names (which obviously would give one the wrong 
impression in this case)?

2. Are there other homographs in Unicode that appear within the same 
script, and use a similar naming convention to the examples above (where 
the name of one is a combination of the other two with the word "WITH" 
between them), yet they are not related in such a way that one 
canonically decomposes to the others? And again, if so, is there some 
documented way to know why some do and some don't?
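
For anyone who wants to look at the decomposition data behind both 
questions, here is a minimal sketch using Python's standard unicodedata 
module. The helper names (canonical_decomposition, 
with_names_lacking_decomposition) are hypothetical, the code point range 
and name suffix are just one illustrative choice, and the results depend 
on the Unicode version bundled with the Python build. It only reports 
what the character database says; it does not explain why.

```python
import unicodedata


def canonical_decomposition(ch):
    """Return the canonical decomposition of ch as a list of code points.

    Returns [] when the character has no canonical decomposition.
    """
    raw = unicodedata.decomposition(ch)
    # Compatibility mappings carry a tag such as "<compat>"; those are
    # not canonical decompositions, so treat them as "none" here.
    if not raw or raw.startswith("<"):
        return []
    return [int(cp, 16) for cp in raw.split()]


def with_names_lacking_decomposition(first, last, suffix):
    """Yield (code point, name) for characters in [first, last] whose
    name ends with suffix but which have no canonical decomposition."""
    for cp in range(first, last + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, "")
        if name.endswith(suffix) and not canonical_decomposition(ch):
            yield cp, name


# Question 1: compare the two cases named in the message.
u_diaeresis = canonical_decomposition("\u00fc")
beh_hamza = canonical_decomposition("\u08a1")
# If U+08A1 had a canonical decomposition, NFC would compose this pair.
nfc_unchanged = unicodedata.normalize("NFC", "\u0628\u0654") == "\u0628\u0654"

print([f"U+{cp:04X}" for cp in u_diaeresis])
print(beh_hamza)
print(nfc_unchanged)

# Question 2: list "... WITH HAMZA ABOVE" characters in an assumed range
# that do not canonically decompose.
candidates = list(
    with_names_lacking_decomposition(0x0600, 0x08FF, " WITH HAMZA ABOVE")
)
for cp, name in candidates:
    print(f"U+{cp:04X} {name}")
```

The same scan can be pointed at other ranges or suffixes (for example 
" WITH STROKE" in the Latin blocks) to look for similar cases.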

If the answer in this case is really that U+08A1 is not interestingly 
related to [U+0628 U+0654], or at least not in the way that would result 
in a canonical decomposition, their names notwithstanding, I think we'll 
all be OK with "these are more like U+0030 and U+004F" and move on. But 
I would like to understand what, if any, relationship there is and then 
we can make a judgment about whether U+08A1 should be treated as a 
special case or not in IDNA and elsewhere.

pr

-- 
Pete Resnick <http://www.qualcomm.com/~presnick/>
Qualcomm Technologies, Inc. - +1 (858)651-4478