JSON and U+08A1 and related cases

Wed Jan 21 21:43:19 CET 2015

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Replying in the (hopefully not vain) attempt to strip the recipient list
to something our poor system can handle.

Continue the conversation on this thread.

Thanks,

- -- your JSON WG co-chair

On 1/21/15 12:26 PM, John C Klensin wrote:
> (Changing subject line per Barry's request and addressing that
> one issue only.)
>
> --On Wednesday, January 21, 2015 13:04 +0900 "Martin J.
> Dürst" <duerst at it.aoyama.ac.jp> wrote:
>
>> ...
>>> outside the context of a locale.
>>
>> There's no guarantee that all text in a Fula locale (e.g. on a
>> computer with a Fula OS, if there ever is such a thing)
>
> Note that isn't "a Fula locale", but "a Fula locale that uses
> the Arabic script to write the language".  There are no known
> such locales of even country-scope; Fula is normally written in
> Latin characters and has been for at least a few centuries.
> There are lots of problems with the analogy, but thinking about
> Fula-in-Arabic in ordinary locale terms is a little bit like
> thinking about Japanese-exclusively-in-Romaji (and Romaji with
> a few characters that don't appear anywhere else in Latin
> script) as a locale.  Probably possible to make it work, but not
> the way most of us usually think about things.
>
>> will
>> have Behs with Hamzas above represented with U+08A1 (composed)
>> rather than decomposed.
>> بٔ U+0628 U+0654 (decomposed) may get in there from Arabic
>> text easily.
>
> I presume that is one of the reasons TUS 7.0 says that one
> should "preferentially" use the precombined (composed) form
> rather than the decomposed one, instead of saying something we
> would recognize as "MUST".
>
>>> John Klensin
>>> has explained the problem well in Section 2 of
>>> draft-klensin-idna-5892upd-unicode70-03.txt.  You may wish to
>>> review that work, and ask the i18n program for a copy of the draft
>>> statement, because it may have security implications for your
>>> work, inasmuch as i-json is used to pass identifiers.
>>
>> Given that JSON implementations, in contrast to IDNA, compare
>> object member names codepoint-by-codepoint, the much more
>> obvious addition to the security section would be to point out
>> the potential consequences for precomposed/decomposed
>> confusions in general. The chance that this becomes an actual
>> issue is much much higher (although still rather low) in e.g.
>> French or German than in Fula, where Arabic isn't even the
>> main script used for writing the language.
>
> Agreed.  And, fwiw, that would be my preferred solution.   I
> think Fula and U+08A1 are important only as illustrative
> examples of what appears to me to be a very nasty situation when
> the nature of the identifier(s) or their uses does not provide a
> reliable language context.
>
>> P.S.: Please note that the comments above don't mean that I'm
>> happy with the inclusion of U+08A1 in Unicode 7.0.0, and that
>> I sincerely hope the Unicode Consortium will weigh the
>> problems of identifier confusability higher in their future
>> decisions.
>
> Agreed, although the way in which the U+08A1 decision has been
> defended, and the antecedents for it, do not predict that
> result.  See below.
>
>
> --On Wednesday, January 21, 2015 10:22 -0600 Nico Williams
> <nico at cryptonector.com> wrote:
>
>> I thought that NFC was closed to new precompositions though new
>> precompositions might be added to Unicode.  That is, the NFC
>> form of U+08A1 must be the same as the NFD form of U+08A1,
>> which is to say: U+0628 U+0654.
>>
>> Is my memory wrong about that?
>
> That is the understanding that several of us -- I dare to say most
> or all of the IDNAbis WG participants -- had.  What has
> actually occurred either violates that assumption or introduces
> an extra case, depending on how one looks at the problem.   We
> understood that, if precompositions (your term) are added, they
> would have decompositions and NFC of the character would yield
> the decomposed form.  That understanding derived both from what
> the WG was told and from text in the Unicode Standard and UAX
> #15 (see draft-klensin-idna-5892upd-unicode70-02, especially
> Section 2.1, for details and references). 
>
> But, while U+08A1 is abstract-character-identical and even
> plausible-name-identical to U+0628 U+0654, it does _not_
> decompose into the latter.  Instead, NFD(U+08A1) = NFC(U+08A1) =
> U+08A1.  NFC(U+0628 U+0654) is U+0628 U+0654 as one would
> expect from the stability rules; from that perspective, it is
> the failure of U+08A1 to have a (non-identity) decomposition
> that is the issue.
>
>      john
>
>
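
The normalization behaviour described in the last quoted paragraph can
be checked directly.  What follows is only a minimal sketch, assuming
Python 3 and its standard unicodedata module with a Unicode database of
version 7.0 or later.  The names COMPOSED, SEQUENCE, normal_forms and
same_after_normalization are made up for this illustration and do not
come from any draft or library.

```python
import unicodedata

# The two spellings discussed in this thread.
COMPOSED = "\u08a1"        # ARABIC LETTER BEH WITH HAMZA ABOVE
SEQUENCE = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE


def normal_forms(s):
    """Return the pair (NFC(s), NFD(s))."""
    return (unicodedata.normalize("NFC", s),
            unicodedata.normalize("NFD", s))


def same_after_normalization(a, b, form="NFC"):
    """True if a and b are codepoint-identical after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    # U+08A1 has no decomposition, so both forms leave it unchanged.
    print(normal_forms(COMPOSED) == (COMPOSED, COMPOSED))
    # The two-codepoint sequence has no composite, so it is also unchanged.
    print(normal_forms(SEQUENCE) == (SEQUENCE, SEQUENCE))
    # Hence normalization does not unify the two spellings ...
    print(same_after_normalization(COMPOSED, SEQUENCE))
    # ... unlike the familiar Latin case of e + combining acute accent.
    print(same_after_normalization("\u00e9", "e\u0301"))
```

The contrast with the Latin case is the point: for U+00E9 the two
spellings meet under NFC or NFD, while for U+08A1 neither form brings
them together.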

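Martin's point above about member names being compared
codepoint-by-codepoint can be illustrated in the same way.  Again this
is only a sketch under assumptions: it uses Python's standard json
module as one example of a parser that does not normalize member names,
and the helper find_normalization_collisions is a hypothetical name
invented here, not part of any JSON library or of the i-json draft.

```python
import json
import unicodedata

# Two member names that render alike: precomposed U+00E9 versus
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
doc = r'{"caf\u00e9": 1, "cafe\u0301": 2}'
obj = json.loads(doc)  # the parser keeps both names as distinct keys


def find_normalization_collisions(names, form="NFC"):
    """Group names that become identical after normalization.

    Returns only the groups with more than one member, i.e. names that
    differ codepoint-by-codepoint but are canonically equivalent.
    """
    groups = {}
    for name in names:
        key = unicodedata.normalize(form, name)
        groups.setdefault(key, []).append(name)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    print(len(obj))                            # number of distinct members
    print(find_normalization_collisions(obj))  # the confusable pair
    # U+08A1 versus U+0628 U+0654 is not caught by such a check,
    # because normalization does not relate the two spellings.
    print(find_normalization_collisions(["\u08a1", "\u0628\u0654"]))
```

A check like this would catch the French or German cases Martin
mentions, but not the U+08A1 case, which is why the latter is awkward
for any identifier comparison that relies on normalization alone.
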
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
Comment: GPGTools - https://gpgtools.org

iQEcBAEBCgAGBQJUwA9mAAoJEOz0ck4QngW7rNwH/jaHvHhiz7UOUSr2ZcXLPdbg
DvcA68c11dnPIAQ4Rx9QzRZB1RpPnukYWqanvh7wc+9nhd6oAhW2524b2PYDu20h
ZPFcrz6ZBynrpXBCGoQIzCHX/1MLlc5Kk6miqaqJDSMwk9RbAqhpzqqpXBCQ/x2N
+9M7gJ30V5/v+7f08nlqPsnobDGaOZGjlVn9V9GYkuFqGvP6eoHujYjoaYYKPxbZ
ceRrrFkcGWTVQQZzO9Ft26grrB3e0sM7pklQ7hj6O/sqoi1UVEYCdB6PgqeptMnr
IdVNNxhiYhruVaRGF9Y5UMd6YZh3yiAjB4XBwqg447r9mFy0q5s5m1wnBzp2jgg=
=iIu4
-----END PGP SIGNATURE-----