[Json] Json and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))

Wed Jan 21 20:35:30 CET 2015

I see nothing here about JSON. As pointed out, JSON implementations compare codepoint by codepoint. Please clarify, or please let's drop this from the JSON WG list.

--Paul Hoffman

> On Jan 21, 2015, at 11:29 AM, John C Klensin <john-ietf at jck.com> wrote:
> 
> (Resending from my IETF address, which did not appear in the
> first version I received but that doesn't work on some of the
> multiple mailing lists copied)
> (Changing subject line per Barry's request and addressing that
> one issue only.)
> 
> --On Wednesday, January 21, 2015 13:04 +0900 "\"Martin J.
> Dürst\"" <duerst at it.aoyama.ac.jp> wrote:
> 
>> ...
>>> outside the context of a locale.
>> 
>> There's no guarantee that all text in a Fula locale (e.g. on a
>> computer with a Fula OS, if there ever is such a thing)
> 
> Note that isn't "a Fula locale", but "a Fula locale that uses
> the Arabic script to write the language".  There are no known
> such locales of even country-scope; Fula is normally written in
> Latin characters and has been for at least a few centuries.
> There are lots of problems with the analogy, but thinking about
> Fula-in-Arabic in ordinary locale terms is a little bit like
> thinking about Japanese-exclusively-in-Romaji (and Romaji with
> a few characters that don't appear anywhere else in Latin
> script) as a locale.  Probably possible to make it work, but not
> the way most of us usually think about things.
> 
>> will
>> have Behs with Hamzas above represented with U+08A1 (composed)
>> rather than decomposed.
>> بٔ U+0628 U+0654 (decomposed) may get in there from Arabic
>> text easily.
> 
> I presume that is one of the reasons TUS 7.0 says that one
> should "preferentially" use the precombined (composed) form
> rather than the decomposed one, instead of using language we
> would recognize as "MUST".
> 
>>> John Klensin
>>> has explained the problem well in Section 2 of
>>> draft-klensin-idna-5892upd-unicode70-03.txt.  You may wish to
>>> review that work, and ask the i18n program for a copy of the
>>> draft statement, because it may have security implications on
>>> your work, inasmuch as i-json is used to pass identifiers.
>> 
>> Given that JSON implementations, in contrast to IDNA, compare
>> object member names codepoint-by-codepoint, the much more
>> obvious addition to the security section would be to point out
>> the potential consequences for precomposed/decomposed
>> confusions in general. The chance that this becomes an actual
>> issue is much much higher (although still rather low) in e.g.
>> French or German than in Fula, where Arabic isn't even the
>> main script used for writing the language.
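
The precomposed/decomposed confusion described above can be sketched
in a few lines. This is only an illustration under stated assumptions:
it uses Python's standard json and unicodedata modules, and the member
name "café" is a made-up example, not taken from any draft.

```python
import json
import unicodedata

# Two spellings of the same visible name (hypothetical example data).
composed = "caf\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# Round-trip an object that uses both spellings as member names.
doc = json.loads(json.dumps({composed: 1, decomposed: 2}))

# Names are compared codepoint by codepoint, so both members survive.
distinct_names = len(doc)

# Only after explicit normalization do the two names compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(distinct_names, same_after_nfc)
```

A receiver that normalizes names before lookup and a receiver that
does not would disagree about how many members this object has.
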
> 
> Agreed.  And, fwiw, that would be my preferred solution.   I
> think Fula and U+08A1 are important only as illustrative
> examples of what appears to me to be a very nasty situation when
> the nature of the identifier(s) or their uses does not provide a
> reliable language context.
> 
>> P.S.: Please note that the comments above don't mean that I'm
>> happy with the inclusion of U+08A1 in Unicode 7.0.0, and that
>> I sincerely hope the Unicode Consortium will weight the
>> problems of identifier confusability higher in their future
>> decisions.
> 
> Agreed, although the way in which the U+08A1 decision has been
> defended, and the antecedents for it, do not point toward that
> result.  See below.
> 
> 
> --On Wednesday, January 21, 2015 10:22 -0600 Nico Williams
> <nico at cryptonector.com> wrote:
> 
>> I thought that NFC was closed to new precompositions though new
>> precompositions might be added to Unicode.  That is, the NFC
>> form of U+08A1 must be the same as the NFD form of U+08A1,
>> which is to say: U+0628 U+0654.
>> 
>> Is my memory wrong about that?
> 
> That is the understanding that several of us -- I dare say most
> or all of the IDNAbis WG participants -- had.  What has
> actually occurred either violates that assumption or introduces
> an extra case, depending on how one looks at the problem.   We
> understood that, if precompositions (your term) are added, they
> would have decompositions and NFC of the character would yield
> the decomposed form.  That understanding derived both from what
> the WG was told and from text in the Unicode Standard and UAX
> #15 (see draft-klensin-idna-5892upd-unicode70-02, especially
> Section 2.1, for details and references).  
> 
> But, while U+08A1 is abstract-character-identical and even
> plausible-name-identical to U+0628 U+0654, it does _not_
> decompose into the latter.  Instead, NFD(U+08A1) = NFC(U+08A1) =
> U+08A1.  NFC (U+0628 U+0654) is U+0628 U+0654 as one would
> expect from the stability rules; from that perspective, it is
> the failure of U+08A1 to have a (non-identity) decomposition
> that is the issue.
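
The normalization behavior described above can be checked with a short
sketch. It assumes Python's standard unicodedata module and a Unicode
database of version 7.0 or later (the version is exposed as
unicodedata.unidata_version); the variable and helper names are
illustrative only.

```python
import unicodedata

composed = "\u08a1"         # ARABIC LETTER BEH WITH HAMZA ABOVE
sequence = "\u0628\u0654"   # ARABIC LETTER BEH + ARABIC HAMZA ABOVE


def codepoints(text):
    """Render a string as a space-separated list of U+XXXX codepoints."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# U+08A1 has no canonical decomposition, so NFD and NFC leave it alone.
nfd_of_composed = unicodedata.normalize("NFD", composed)
nfc_of_composed = unicodedata.normalize("NFC", composed)

# The two-codepoint sequence has nothing to compose into under NFC.
nfc_of_sequence = unicodedata.normalize("NFC", sequence)

print(codepoints(nfd_of_composed))
print(codepoints(nfc_of_composed))
print(codepoints(nfc_of_sequence))
print(nfc_of_composed == nfc_of_sequence)
```

Under these assumptions neither normalization form maps one spelling
onto the other, so a comparison after NFC or NFD still treats them as
different strings.
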
> 
>     john
> 
> 