[Json] Json and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))

Asmus Freytag asmusf at ix.netcom.com
Thu Jan 22 03:58:09 CET 2015


On 1/21/2015 1:31 PM, Nico Williams wrote:
> On Wed, Jan 21, 2015 at 03:33:12PM -0500, cowan at ccil.org wrote:
>> John C Klensin scripsit:
>>> But, while U+08A1 is abstract-character-identical and even
>>> plausible-name-identical to U+0628 U+0654, it does _not_
>>> decompose into the latter.  Instead, NFD(U+08A1) = NFC(U+08A1) =
>>> U+08A1.  NFC (U+0628 U+0654) is U+0628 U+0654 as one would
>>> expect from the stability rules; from that perspective, it is
>>> the failure of U+08A1 to have a (non-identity) decomposition
>>> that is the issue.
>> If U+08A1 had such a decomposition, it would violate Unicode's
>> no-new-NFC rule.  What it violates is the (false) assumption that
>> base1 + combining is never confusable with a canonically
>> non-equivalent base2.  Even outside Arabic there are already
>> such cases:

I would go further, and claim that the notion that "*all homographs are the
same abstract character*" is *misplaced, if not incorrect*. The notion of
canonical normalization was created to identify cases where homographs
(characters or sequences of normally identical appearance) really are the
same thing encoded twice. Where that is not the case, the homographs are
either not equivalent under normalization at all, or (sometimes, especially
for near homographs) related only under a "compatibility" normalization
(e.g. NF*K*C).

U+08A1 is not the only character that has a non-decomposable homograph, and
because its encoding wasn't an accident, but follows a principle applied by
the Unicode Technical Committee, it won't, and can't, be the last instance of
a non-decomposable homograph.

The "failure of U+08A1 to have a (non-identity) decomposition", while it 
perhaps
complicates the design of a system of robust mnemonic identifiers (such 
as IDNs)
it appears not be be due to a "breakdown" of the encoding process and 
also does
not constitute a break of any encoding stability promises  by the Unicode
Consortium.

Rather, it represents a reasoned and principled judgment of what is or isn't
the "same abstract character". That judgment has to be made somewhere in the
process, and the bodies responsible for character encoding get to make the
determination.

Asserting, to the contrary, that there should be a principle requiring that
all homographs are the same abstract character would mean basing encoding
decisions entirely on the shape, or appearance, of characters and code point
sequences. Under that logic, TAMIL LETTER KA and TAMIL DIGIT ONE would be the
same abstract character, and a (non-identity) decomposition would be required.

That's just not how it works.
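
Indeed, a quick check with Python's unicodedata confirms the actual behavior
for that pair: neither TAMIL LETTER KA (U+0B95) nor TAMIL DIGIT ONE (U+0BE7)
has a decomposition, so no normalization form folds one onto the other:

    import unicodedata

    ka = "\u0B95"     # TAMIL LETTER KA
    one = "\u0BE7"    # TAMIL DIGIT ONE

    # Near-identical in shape, yet canonically distinct: every normalization
    # form is the identity on both characters.
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        assert unicodedata.normalize(form, ka) == ka
        assert unicodedata.normalize(form, one) == one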

That said, Unicode is generally (and correctly) reluctant to encode
homographs. One of the earliest and most ardently requested changes was the
proposed separation of "period" and "decimal point". It was rejected, and it
was not the only such request. Where homographs are encoded, they generally
follow certain principles. And while these principles will, over time, lead
to the encoding of a few more homographs, they, in turn, keep things
predictable.

From my understanding, the case in question fully follows these principles
as they are applicable to the encoding of characters for the Arabic script.

>>
>> [...]
> Should we treat all of these as confusables?
Yes, that's the obvious way to handle them. If you have zones that support
the concept of (blocked) variants, you can go further and declare them as
variants, which has the effect of making them confusables that are declared
up front in the policy, not "discovered" in later steps of string review
and analysis.
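
As a rough illustration of "declared up front" (a hypothetical sketch only:
real zones express this in their IDN tables and variant rules, not in code,
and the one-entry variant table below is invented for the example):

    # Hypothetical blocked-variant table: each entry folds a character to the
    # spelling it is considered a variant of.
    BLOCKED_VARIANTS = {
        "\u08A1": "\u0628\u0654",   # BEH WITH HAMZA ABOVE ~ BEH + HAMZA ABOVE
    }

    def variant_key(label):
        # Fold declared variants to a common key; once one spelling is
        # registered, the homographic spelling maps to the same key.
        return "".join(BLOCKED_VARIANTS.get(ch, ch) for ch in label)

    def is_blocked(candidate, registered_labels):
        # A candidate label is blocked if its key collides with the key of
        # any already-registered label.
        return variant_key(candidate) in {variant_key(r) for r in registered_labels}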

A./
> Nico
