[Json] Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT)

Wed Jan 21 05:04:22 CET 2015

Hello Eliot,

Many thanks for the information. Some comments below. Note that I'm not 
assuming that the IAB statement will contain as many 'near misses' as 
your explanation below, but I think it's better to address potential 
errors before it's too late. And yes, in this sense it would be good to 
get a draft of that statement.

I have also taken the liberty of adding the IDNA mailing list, because 
most of the Unicode side experts are on that list but not on the json list.

On 2015/01/20 16:24, Eliot Lear wrote:
> Hi,
>
> Sorry for the late comment, but...
>
> The IAB is preparing a cautionary statement regarding Unicode 7.0.0 and

I very much hope that the statement will be specific to the actual 
concerns, and not, as this sentence seems to suggest, about Unicode 
7.0.0 in general. The latter would be throwing out the baby with the 
bathwater.

> specifically the introduction of pre-composed characters involving HAMZA
> where it is impossible to typographically distinguish the precomposed
> and decomposed characters

Please note that first, there are no cases where it's *impossible* to 
typographically distinguish precomposed and decomposed characters, or 
some other usually not distinguished pairs (see below for examples). 
A font designer can always choose somewhat different glyphs, and a 
rendering engine can always add some distinction (such as showing the 
precomposed variant with a separated combining character).

Second, there are other cases where in general, no typographic 
distinction is used. க and ௧ (Tamil 'ka' and '1') would be a good example.

> outside the context of a locale.

There's no guarantee that all text in a Fula locale (e.g. on a computer 
with a Fula OS, if there ever is such a thing) will have Behs with 
Hamzas above represented with U+08A1 (composed) rather than decomposed.
بٔ U+0628 U+0654 (decomposed) may get in there from Arabic text easily.
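
To make the point about the two representations concrete, here is a 
small sketch in Python (my choice of language for illustration only). 
It assumes nothing beyond the standard unicodedata module; the helper 
name same_after is made up for this example. As far as I know, U+08A1 
has no canonical decomposition, so none of the four normalization forms 
should map one representation onto the other:

```python
import unicodedata

composed = "\u08A1"          # ARABIC LETTER BEH WITH HAMZA ABOVE
decomposed = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

def same_after(form, a, b):
    """Hypothetical helper: True if a and b are identical after
    applying the given Unicode normalization form to both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Check every standard normalization form.
results = {form: same_after(form, composed, decomposed)
           for form in ("NFC", "NFD", "NFKC", "NFKD")}

for form, equal in results.items():
    print(form, "unifies the two sequences:", equal)
```

In other words, normalizing input does not help here, which is what 
sets this case apart from ordinary precomposed/decomposed pairs.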

> John Klensin
> has explained the problem well in Section 2 of
> draft-klensin-idna-5892upd-unicode70-03.txt.  You may wish to review
> that work, and ask i8n program for a copy of the draft statement,
> because it may have security implications on your work, in as much as
> i-json is used to pass identifiers.

Given that JSON implementations, in contrast to IDNA, compare object 
member names codepoint-by-codepoint, the much more obvious addition to 
the security section would be to point out the potential consequences 
for precomposed/decomposed confusions in general. The chance that this 
becomes an actual issue is much much higher (although still rather low) 
in e.g. French or German than in Fula, where Arabic isn't even the main 
script used for writing the language.
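
As a rough illustration of the general precomposed/decomposed issue 
(again a Python sketch under my own assumptions, not anything taken 
from the draft): the standard json module keeps two member names apart 
when they differ only in composition, and a receiver that cares could 
group names by their NFC form to detect such pairs. The function name 
find_collisions and the sample document are invented for this example.

```python
import json
import unicodedata

# Two member names that render alike: "caf" + U+00E9 (precomposed)
# versus "cafe" + U+0301 (decomposed). Raw string so that the JSON
# parser, not Python, interprets the \u escapes.
doc = r'{"caf\u00e9": 1, "cafe\u0301": 2}'

parsed = json.loads(doc)

def find_collisions(obj):
    """Hypothetical check: group member names by their NFC form and
    return only those groups that contain more than one name."""
    groups = {}
    for name in obj:
        groups.setdefault(unicodedata.normalize("NFC", name), []).append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}

collisions = find_collisions(parsed)

print("distinct member names:", len(parsed))
for key, names in collisions.items():
    print("NFC form", ascii(key), "is shared by", [ascii(n) for n in names])
```

Whether a receiver should reject, merge, or merely log such pairs is a 
separate question; the sketch only shows that they are detectable.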

Regards,   Martin.

P.S.: Please note that the comments above don't mean that I'm happy with 
the inclusion of U+08A1 in Unicode 7.0.0, and that I sincerely hope the 
Unicode Consortium will weight the problems of identifier confusability 
higher in their future decisions.

> Eliot
>
> On 1/20/15 4:25 AM, Tim Bray wrote:
>>
>> Yup, those are Unicode notions, Unicode is the right reference for them.
>>
>> On Jan 19, 2015 7:22 PM, "Pete Resnick" <presnick at qti.qualcomm.com
>> <mailto:presnick at qti.qualcomm.com>> wrote:
>>
>>      On 1/19/15 4:43 PM, Barry Leiba wrote:
>>
>>          ----------------------------------------------------------------------
>>          DISCUSS:
>>          ----------------------------------------------------------------------
>>
>>          This should be quite simple to sort out:
>>
>>          -- Section 2.1 --
>>
>>              Object member names, and string values in arrays and
>>          object members,
>>              MUST NOT include code points which identify Surrogates or
>>              Noncharacters.
>>
>>          Where are the definitions of "Surrogates" and
>>          "Noncharacters"?  Because
>>          you say they MUST NOT be included, I think they need to be
>>          defined in
>>          normative reference(s) and cited here (they're not defined in
>>          3629, nor
>>          does 3620 cite a definition).
>>
>>
>>
>>      The codepoints used for UTF-16 surrogate pairs (U+D800 -> U+DFFF)
>>      are mentioned in 3629 in the first paragraph at the top of page 5
>>      <https://tools.ietf.org/html/rfc3629#page-5>, though I'm surprised
>>      not to see a reference to 2781, which also talks about surrogates.
>>      There is discussion of non-characters (as well as surrogates) in
>>      RFC 3454 (which is referenced by 6885).
>>
>>      None of those are great citations. We could simply cite Unicode
>>      <http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf>.
>>
>>      pr
>>
>>      --
>>      Pete Resnick<http://www.qualcomm.com/~presnick/
>>      <http://www.qualcomm.com/%7Epresnick/>>
>>      Qualcomm Technologies, Inc. - +1 (858)651-4478
>>      <tel:%2B1%20%28858%29651-4478>
>>
>>      _______________________________________________
>>      json mailing list
>>      json at ietf.org <mailto:json at ietf.org>
>>      https://www.ietf.org/mailman/listinfo/json