Json and U+08A1 and related cases (was: Re: [Json] Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))

Wed Jan 21 20:29:22 CET 2015

(Resending from my IETF address, which did not appear in the
first version I received but that doesn't work on some of the
multiple mailing lists copied)
(Changing subject line per Barry's request and addressing that
one issue only.)

--On Wednesday, January 21, 2015 13:04 +0900 "\"Martin J.
Dürst\"" <duerst at it.aoyama.ac.jp> wrote:

>...
>> outside the context of a locale.
> 
> There's no guarantee that all text in a Fula locale (e.g. on a
> computer with a Fula OS, if there ever is such a thing)

Note that this isn't "a Fula locale", but "a Fula locale that uses
the Arabic script to write the language".  There are no known
such locales of even country-scope; Fula is normally written in
Latin characters and has been for at least a few centuries.
There are lots of problems with the analogy, but thinking about
Fula-in-Arabic in ordinary locale terms is a little bit like
thinking about Japanese-exclusively-in-Romaji (and Romaji with
a few characters that don't appear anywhere else in Latin
script) as a locale.  Probably possible to make it work, but not
the way most of us usually think about things.

> will
> have Behs with Hamzas above represented with U+08A1 (composed)
> rather than decomposed.
> بٔ U+0628 U+0654 (decomposed) may get in there from Arabic
> text easily.

I presume that is one of the reasons TUS 7.0 says that one
should "preferentially" use the precombined (composed) form
rather than the decomposed one, instead of using language we
would recognize as "MUST".

>> John Klensin
>> has explained the problem well in Section 2 of
>> draft-klensin-idna-5892upd-unicode70-03.txt.  You may wish to
>> review that work, and ask i18n program for a copy of the draft
>> statement, because it may have security implications on your
>> work, in as much as i-json is used to pass identifiers.
> 
> Given that JSON implementations, in contrast to IDNA, compare
> object member names codepoint-by-codepoint, the much more
> obvious addition to the security section would be to point out
> the potential consequences for precomposed/decomposed
> confusions in general. The chance that this becomes an actual
> issue is much much higher (although still rather low) in e.g.
> French or German than in Fula, where Arabic isn't even the
> main script used for writing the language.

Agreed.  And, fwiw, that would be my preferred solution.   I
think Fula and U+08A1 are important only as illustrative
examples of what appears to me to be a very nasty situation when
the nature of the identifier(s) or their uses does not provide a
reliable language context.
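
To make the codepoint-by-codepoint point concrete, here is a
minimal sketch, assuming Python 3 and only its standard json and
unicodedata modules.  It uses the French/German-style case Martin
mentions (a precomposed versus a decomposed "é"); the variable
names are purely illustrative and nothing here is taken from the
I-JSON draft itself.

```python
import json
import unicodedata

# Two spellings of what a reader sees as the same member name.
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Round-trip an object that uses both spellings as member names.
doc = json.dumps({composed: 1, decomposed: 2})
obj = json.loads(doc)

# The parser compares names code point by code point, so both survive.
distinct_keys = len(obj)

# Normalizing both names to NFC would have made them collide.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
```

With this input the object keeps two members even though the two
names normalize to the same string, which is the confusion a
security consideration would need to describe.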

> P.S.: Please note that the comments above don't mean that I'm
> happy with the inclusion of U+08A1 in Unicode 7.0.0, and that
> I sincerely hope the Unicode Consortium will weight the
> problems of identifier confusability higher in their future
> decisions.

Agreed, although the way in which the U+08A1 decision has been
defended, and the antecedents for it, do not predict that
result.  See below.

--On Wednesday, January 21, 2015 10:22 -0600 Nico Williams
<nico at cryptonector.com> wrote:

> I thought that NFC was closed to new precompositions though new
> precompositions might be added to Unicode.  That is, the NFC
> form of U+08A1 must be the same as the NFD form of U+08A1,
> which is to say: U+0628 U+0654.
> 
> Is my memory wrong about that?

That is the understanding that several of us -- I dare say most
or all of the IDNAbis WG participants -- had.  What has
actually occurred either violates that assumption or introduces
an extra case, depending on how one looks at the problem.   We
understood that, if precompositions (your term) are added, they
would have decompositions and NFC of the character would yield
the decomposed form.  That understanding derived both from what
the WG was told and from text in the Unicode Standard and UAX
#15 (see draft-klensin-idna-5892upd-unicode70-02, especially
Section 2.1, for details and references).  

But, while U+08A1 is abstract-character-identical and even
plausible-name-identical to U+0628 U+0654, it does _not_
decompose into the latter.  Instead, NFD(U+08A1) = NFC(U+08A1) =
U+08A1.  NFC(U+0628 U+0654) is U+0628 U+0654, as one would
expect from the stability rules; from that perspective, it is
the failure of U+08A1 to have a (non-identity) decomposition
that is the issue.
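
The behavior described above can be checked with a short sketch,
assuming Python 3 and its standard unicodedata module (whose
results depend on the Unicode version bundled with the
interpreter).  The helper name "forms" is hypothetical.  For
contrast, the sketch also includes U+0623 (Alef with Hamza
Above), which does have a canonical decomposition, to
U+0627 U+0654.

```python
import unicodedata

single = "\u08a1"          # ARABIC LETTER BEH WITH HAMZA ABOVE
sequence = "\u0628\u0654"  # BEH followed by combining HAMZA ABOVE

def forms(s):
    """Return the NFC and NFD forms of s, keyed by form name."""
    return {f: unicodedata.normalize(f, s) for f in ("NFC", "NFD")}

single_forms = forms(single)
sequence_forms = forms(sequence)

# Normalization never brings the two spellings together.
unifiable = single_forms["NFC"] == sequence_forms["NFC"]

# Contrast: Alef + Hamza Above composes to U+0623 under NFC.
alef_nfc = unicodedata.normalize("NFC", "\u0627\u0654")
```

Neither form changes either string, so a comparison that relies
on NFC or NFD alone will treat the two spellings as different.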

     john