Json and U+08A1 and related cases
Matthew A. Miller
linuxwolf at outer-planes.net
Wed Jan 21 21:43:19 CET 2015
Replying in the (hopefully not vain) attempt to strip the recipient list
to something our poor system can handle.
Continue the conversation on this thread.
-- your JSON WG co-chair
On 1/21/15 12:26 PM, John C Klensin wrote:
> (Changing subject line per Barry's request and addressing that
> one issue only.)
> --On Wednesday, January 21, 2015 13:04 +0900 "Martin J.
> Dürst" <duerst at it.aoyama.ac.jp> wrote:
>>> outside the context of a locale.
>> There's no guarantee that all text in a Fula locale (e.g. on a
>> computer with a Fula OS, if there ever is such a thing)
> Note that isn't "a Fula locale", but "a Fula locale that uses
> the Arabic script to write the language". There are no known
> such locales of even country-scope; Fula is normally written in
> Latin characters and has been for at least a few centuries.
> There are lots of problems with the analogy, but thinking about
> Fula-in-Arabic in ordinary locale terms is a little bit like
> thinking about Japanese-exclusively-in-Romaji (and Romaji with
> a few characters that don't appear anywhere else in Latin
> script) as a locale. Probably possible to make it work, but not
> the way most of us usually think about things.
>> have Behs with Hamzas above represented with U+08A1 (composed)
>> rather than decomposed.
>> بٔ U+0628 U+0654 (decomposed) may get in there from Arabic
>> text easily.
> I presume that is one of the reasons TUS 7.0 says that one
> "preferentially" should use the precombined (composed) form
> rather than the decomposed one, instead of stating something we
> would recognize as a "MUST".
>>> John Klensin
>>> has explained the problem well in Section 2 of
>>> draft-klensin-idna-5892upd-unicode70-03.txt. You may wish to
>>> review that work, and ask the i18n program for a copy of the draft
>>> statement, because it may have security implications on your
>>> work, in as much as i-json is used to pass identifiers.
>> Given that JSON implementations, in contrast to IDNA, compare
>> object member names codepoint-by-codepoint, the much more
>> obvious addition to the security section would be to point out
>> the potential consequences for precomposed/decomposed
>> confusions in general. The chance that this becomes an actual
>> issue is much much higher (although still rather low) in e.g.
>> French or German than in Fula, where Arabic isn't even the
>> main script used for writing the language.
> Agreed. And, fwiw, that would be my preferred solution. I
> think Fula and U+08A1 are important only as illustrative
> examples of what appears to me to be a very nasty situation when
> the nature of the identifier(s) or their uses does not provide a
> reliable language context.
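
To make the precomposed/decomposed point concrete, here is a minimal
sketch using only the Python standard library. It assumes a French-style
example (the two spellings of "é") rather than U+08A1, and the
normalizing "consumer" loop is hypothetical, not something any JSON
specification requires. It only illustrates that a parser comparing
member names code point by code point sees two names where a
normalizing layer sees one.

```python
import json
import unicodedata

# Two spellings of "é": precomposed U+00E9, and "e" followed by
# U+0301 COMBINING ACUTE ACCENT. They typically render identically.
raw = '{"\\u00e9": 1, "e\\u0301": 2}'
obj = json.loads(raw)

# The parser compares member names code point by code point,
# so these are two distinct keys.
distinct_keys = len(obj)

# Hypothetical downstream consumer that NFC-normalizes member names:
# the two names now collapse into one and the later value wins.
normalized = {}
collisions = []
for name, value in obj.items():
    key = unicodedata.normalize("NFC", name)
    if key in normalized:
        collisions.append(key)
    normalized[key] = value

print(distinct_keys, len(normalized), collisions)
```

Whether a collision like this matters depends on what the member names
are used for; it is most relevant when they carry identifiers.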
>> P.S.: Please note that the comments above don't mean that I'm
>> happy with the inclusion of U+08A1 in Unicode 7.0.0, and that
>> I sincerely hope the Unicode Consortium will weight the
>> problems of identifier confusability higher in their future
> Agreed, although the way in which the U+08A1 decision has been
> defended, and the antecedents for it, do not point to that
> result. See below.
> --On Wednesday, January 21, 2015 10:22 -0600 Nico Williams
> <nico at cryptonector.com> wrote:
>> I thought that NFC was closed to new precompositions though new
>> precompositions might be added to Unicode. That is, the NFC
>> form of U+08A1 must be the same as the NFD form of U+08A1,
>> which is to say: U+0628 U+0654.
>> Is my memory wrong about that?
> That is the understanding that several of us -- I dare to say
> most or all of the IDNAbis WG participants -- had. What has
> actually occurred either violates that assumption or introduces
> an extra case, depending on how one looks at the problem. We
> understood that, if precompositions (your term) are added, they
> would have decompositions and NFC of the character would yield
> the decomposed form. That understanding derived both from what
> the WG was told and from text in the Unicode Standard and UAX
> #15 (see draft-klensin-idna-5892upd-unicode70-02, especially
> Section 2.1, for details and references).
> But, while U+08A1 is abstract-character-identical and even
> plausible-name-identical to U+0628 U+0654, it does _not_
> decompose into the latter. Instead, NFD(U+08A1) = NFC(U+08A1) =
> U+08A1. NFC (U+0628 U+0654) is U+0628 U+0654 as one would
> expect from the stability rules; from that perspective, it is
> the failure of U+08A1 to have a (non-identity) decomposition
> that is the issue.
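
The normalization behaviour described above can be checked with a short
sketch. It assumes a Python whose unicodedata module is built against
Unicode 7.0 or later (Python 3.5 and newer); the helper name forms() is
made up for this illustration.

```python
import unicodedata

composed = "\u08A1"        # ARABIC LETTER BEH WITH HAMZA ABOVE
sequence = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE


def forms(s):
    """Return the NFC and NFD forms of s, keyed by form name."""
    return {f: unicodedata.normalize(f, s) for f in ("NFC", "NFD")}


composed_forms = forms(composed)
sequence_forms = forms(sequence)

# U+08A1 has no decomposition, so both forms leave it unchanged.
# The two-code-point sequence has no composition, so it is also unchanged.
same_after_nfc = composed_forms["NFC"] == sequence_forms["NFC"]

for label, text in (("composed", composed), ("sequence", sequence)):
    for form, result in forms(text).items():
        print(label, form, " ".join("U+%04X" % ord(c) for c in result))
print("equal after NFC:", same_after_nfc)
```

In other words, neither NFC nor NFD brings the two spellings together,
so normalization alone does not make them compare equal.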