IDNA and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))
Pete Resnick
presnick at qti.qualcomm.com
Mon Jan 26 08:17:13 CET 2015
Asmus,
Thanks for the explication. It is helpful. However, there's one section
that doesn't answer the question I've had about this entire episode, and
I hope you can elucidate:
On 1/25/15 10:30 PM, Asmus Freytag wrote:
> Occasionally, because of legacy, occasionally for other reasons,
> Unicode has encoded identical shapes using multiple code points
> (homographs). A homograph pair can be understood as something that if
> both partners were rendered in the same font, they would (practically
> always) look identical. Not similar, identical.
I think "homographs" is a bit of a red herring here. I think many of us
understand that there are, and will always be, homographs in Unicode,
and that some of the homographs will be interestingly related (e.g.,
ones whose code points canonically decompose into other code points,
like LATIN SMALL LETTER U WITH DIAERESIS U+00FC and LATIN SMALL LETTER U
U+0075 followed by COMBINING DIAERESIS U+0308) and some that are not
interestingly related at all (e.g., DIGIT ZERO U+0030 and LATIN CAPITAL
LETTER O U+004F). The confusion in the present case is not that ARABIC
LETTER BEH WITH HAMZA ABOVE U+08A1 is a homograph of ARABIC LETTER BEH
U+0628 followed by ARABIC HAMZA ABOVE U+0654; that the two are
homographs seems perfectly reasonable. It's that they don't appear to be
"interestingly related" in the way one would expect given their names,
and given the apparent semantics of each. So here are the questions I'd
really like to understand the answers to:
1. Is there some semantic relationship between LATIN SMALL LETTER U WITH
DIAERESIS U+00FC and LATIN SMALL LETTER U U+0075 followed by COMBINING
DIAERESIS U+0308 that does not exist between ARABIC LETTER BEH WITH HAMZA
DIARESIS U+0308 that does not exist between ARABIC LETTER BEH WITH HAMZA
ABOVE U+08A1 and ARABIC LETTER BEH U+0628 followed by ARABIC HAMZA ABOVE
U+0654? If so, is there some documented way to know this beyond
examination of their names (which obviously would give one the wrong
impression in this case)?
2. Are there other homographs in Unicode that appear within the same
script, and use a similar naming convention to the examples above (where
the name of one is a combination of the other two with the word "WITH"
between them), yet they are not related in such a way that one
canonically decomposes to the others? And again, if so, is there some
documented way to know why some do and some don't?
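For concreteness, the asymmetry behind both questions can be checked
mechanically. The following is only a minimal sketch, assuming a Python
interpreter whose unicodedata module is built against a Unicode version
that includes U+08A1; the names PAIRS, canonically_equivalent and report
are made up for illustration and do not come from any specification.

```python
import unicodedata

# (precomposed code point, base letter + combining mark) for the two
# cases discussed above. The pairing itself is the assumption under test.
PAIRS = [
    ("\u00fc", "\u0075\u0308"),  # u with diaeresis vs. u + combining diaeresis
    ("\u08a1", "\u0628\u0654"),  # beh with hamza above vs. beh + hamza above
]


def canonically_equivalent(a, b):
    """True if the two strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def report(precomposed, sequence):
    """Collect the decomposition mapping and the equivalence result."""
    return {
        # Empty string means the character has no decomposition mapping.
        "decomposition": unicodedata.decomposition(precomposed),
        "equivalent": canonically_equivalent(precomposed, sequence),
    }


results = [report(p, s) for p, s in PAIRS]

for (p, s), r in zip(PAIRS, results):
    print("U+%04X" % ord(p), repr(r["decomposition"]), r["equivalent"])
```

This only shows what the character database says; it does not explain
why the two pairs are treated differently, which is the question I am
actually asking.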
If the answer in this case is really that U+08A1 is not interestingly
related to [U+0628 U+0654], or at least not in the way that would result
in a canonical decomposition, their names notwithstanding, I think we'll
all be OK with "these are more like U+0030 and U+004F" and move on. But
I would like to understand what, if any, relationship there is and then
we can make a judgment about whether U+08A1 should be treated as a
special case or not in IDNA and elsewhere.
pr
--
Pete Resnick <http://www.qualcomm.com/~presnick/>
Qualcomm Technologies, Inc. - +1 (858)651-4478