[Json] Json and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))

Wed Jan 21 21:04:22 CET 2015

(JSON list removed, per Paul Hoffman's request)

--On Wednesday, January 21, 2015 13:42 -0600 Nico Williams
<nico at cryptonector.com> wrote:

>...
>> That is the understanding that several of us -- I dare to say
>> most or all of the IDNAbis WG participants -- had.  What has
>> actually occurred either violates that assumption or
>> introduces an extra case, depending on how one looks at the
>> problem.   [...]
> 
> Because...
> 
>> But, while U+08A1 is abstract-character-identical and even
>> plausible-name-identical to U+0628 U+0654, it does _not_
>> decompose into the latter.  Instead, NFD(U+08A1) =
>> NFC(U+08A1) =
> 
> ...this is a desirable property of that particular character,
> or because the UC screwed up?  See below.

Please read draft-klensin-idna-5892upd-unicode70, as Eliot
suggested.  This is explored there in some detail.  The answer
to your particular question depends on who you ask and what
their perspective is.

>> U+08A1.  NFC (U+0628 U+0654) is U+0628 U+0654 as one would
>> expect from the stability rules; from that perspective, it is
>> the failure of U+08A1 to have a (non-identity) decomposition
>> that is the issue.
> 
> Is it identical, as rendered as well as semantically, to
> U+0628 U+0654?

As Martin points out, one can always design fonts that make
things look different.  With a pair of appropriate fonts, "a"
(U+0061) might not even render the same as "a" (U+0061).
Indeed, nothing would prevent a type designer from having
different rendering forms of U+0061 depending on what it was
next to or whether it was in standalone, initial, medial, or
final position.  For normative or abstract purposes, we usually
assume those sorts of properties are important only for a
handful of scripts, none of which is Latin, but type design,
detailed rendering design, and page layout are, to a
considerable extent, still art forms.

However, as an abstraction, you can probably get significant
information from the observation that the formal Unicode name
for U+08A1 is "ARABIC LETTER BEH WITH HAMZA ABOVE" and that the
most likely and plausible name for U+0628 U+0654 is "ARABIC
LETTER BEH with HAMZA ABOVE" (where the lower-case "with" is
purely my notational convention).
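
The normalization behavior described in the quoted text above can be
checked directly.  What follows is a minimal sketch in Python, assuming
an interpreter whose unicodedata tables are at Unicode 7.0 or later
(the version that added U+08A1); the variable names are mine and purely
illustrative.

```python
import unicodedata

precomposed = "\u08A1"      # the single code point added in Unicode 7.0
sequence = "\u0628\u0654"   # BEH followed by combining HAMZA ABOVE

# U+08A1 has no canonical decomposition, so NFD and NFC both leave it alone.
nfd_pre = unicodedata.normalize("NFD", precomposed)
nfc_pre = unicodedata.normalize("NFC", precomposed)

# The two-code-point sequence does not compose to U+08A1 under NFC.
nfc_seq = unicodedata.normalize("NFC", sequence)

# Consequently the two spellings never compare equal after normalization.
equal_after_nfc = nfc_pre == nfc_seq

print(nfd_pre == precomposed, nfc_pre == precomposed)
print(nfc_seq == sequence, equal_after_nfc)
# Falls back to a placeholder if the local tables predate Unicode 7.0.
print(unicodedata.name(precomposed, "<not in this Unicode version>"))
```

The point of the sketch is only that a comparison procedure relying on
NFC alone will treat the two spellings as different strings.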

"Semantically" is a more complex question, in part because I
would contend that, in general, the only letters that have
actual semantics are those that belong to "ideographic" scripts
like Han (CJK).  There are some exceptions (or debates about
whether they are "letters") like certain symbols for honorifics
and currencies, but...

On the other hand, the Unicode Standard justifies all of this
on the basis of phonetic differences between U+08A1 and the
U+0628 U+0654 sequence.  See the I-D and the sections of the
Unicode Standard that it cites for more information, but note
that most of the reasons for which the IETF is interested in
characters in identifiers are not associated with enough
information to determine either language or phonetics.

> If U+08A1 is identical to U+0628 U+0654 in every way then I think
> the UC erred.  If it is not, then U+08A1 strikes me as a new
> case that IDNA should treat as though NFC(U+08A1) == U+0628
> U+0654 (because what else could IDNA reasonably do??).  In
> what ways is U+08A1 not identical to U+0628 U+0654? (besides,
> of course, being a different codepoint sequence)

See above, and note that treating U+08A1 as identical to U+0628
U+0654 for IDNA (and perhaps other IETF identifier purposes) is
one of the options the I-D identifies.  Even if that choice is
made, there are several ways to think about it conceptually.
The choice is complicated by the observation that there are a
few other precomposed characters that, like U+08A1, do not have
decompositions and that some of them have been around for a very
long time.
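
One way to picture the "treat them as identical" option is an extra
mapping step applied alongside NFC when comparing identifiers.  The
sketch below is only an illustration under that assumption and is not
taken from the I-D: the table EXTRA_DECOMPOSITIONS and the functions
fold_for_comparison and same_identifier are hypothetical names, and the
table lists only the one case from this thread.

```python
import unicodedata

# Hypothetical exception table: precomposed code points with no canonical
# decomposition, mapped to the sequence they would be treated as equal to.
EXTRA_DECOMPOSITIONS = {"\u08A1": "\u0628\u0654"}


def fold_for_comparison(label: str) -> str:
    """NFC-normalize, apply the hypothetical extra mappings, re-normalize."""
    normalized = unicodedata.normalize("NFC", label)
    folded = "".join(EXTRA_DECOMPOSITIONS.get(ch, ch) for ch in normalized)
    # Substitution can introduce new combining sequences, so normalize again.
    return unicodedata.normalize("NFC", folded)


def same_identifier(a: str, b: str) -> bool:
    """Compare two labels after the folding step above."""
    return fold_for_comparison(a) == fold_for_comparison(b)


print(same_identifier("\u08A1", "\u0628\u0654"))
print(same_identifier("\u08A1", "\u0628"))
```

Whether such a mapping belongs in the comparison step, in the rules for
which code points are permitted, or somewhere else is exactly the sort
of conceptual choice the I-D discusses; the sketch does not settle it.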

    john