[Json] Json and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))
asmusf at ix.netcom.com
Thu Jan 22 03:58:09 CET 2015
On 1/21/2015 1:31 PM, Nico Williams wrote:
> On Wed, Jan 21, 2015 at 03:33:12PM -0500, cowan at ccil.org wrote:
>> John C Klensin scripsit:
>>> But, while U+08A1 is abstract-character-identical and even
>>> plausible-name-identical to U+0628 U+0654, it does _not_
>>> decompose into the latter. Instead, NFD(U+08A1) = NFC(U+08A1) =
>>> U+08A1. NFC (U+0628 U+0654) is U+0628 U+0654 as one would
>>> expect from the stability rules; from that perspective, it is
>>> the failure of U+08A1 to have a (non-identity) decomposition
>>> that is the issue.
>> If U+08A1 had such a decomposition, it would violate Unicode's
>> no-new-NFC rule. What it violates is the (false) assumption that
>> base1 + combining is never confusable with a canonically
>> non-equivalent base2. Even outside Arabic there are already
>> such cases:
I would go further, and claim that the notion that "*all homographs are
the *same abstract character*" is *misplaced, if not incorrect*. The notion
of normalization was created to identify cases where homographs, characters or
sequences of normally identical appearance, were really cases of the same
character being encoded twice. Where that was not the case, the homographs are
not equivalent under normalization (or sometimes, esp. in cases of near
homographs, there is a "compatibility" normalization relation, e.g. NF*K*C).
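John Klensin's observation quoted above can be checked directly with the
stdlib unicodedata module (a minimal demonstration; any Python 3.3+ has
Unicode tables recent enough to include U+08A1):

```python
import unicodedata

# U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE has no canonical
# decomposition: NFD and NFC both leave it untouched.
assert unicodedata.normalize("NFD", "\u08A1") == "\u08A1"
assert unicodedata.normalize("NFC", "\u08A1") == "\u08A1"

# The visually identical sequence BEH + HAMZA ABOVE does not
# compose to U+08A1 either, per the stability rules.
assert unicodedata.normalize("NFC", "\u0628\u0654") == "\u0628\u0654"

# So the two homographs are not canonically equivalent...
assert (unicodedata.normalize("NFD", "\u08A1")
        != unicodedata.normalize("NFD", "\u0628\u0654"))

# ...and compatibility normalization does not relate them either.
assert (unicodedata.normalize("NFKC", "\u08A1")
        != unicodedata.normalize("NFKC", "\u0628\u0654"))
```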
U+08A1 is not the only character that has a non-decomposable homograph, and
because the encoding of it wasn't an accident, but follows a principle applied
by the Unicode Technical Committee, it won't, and can't, be the last one with
a non-decomposable homograph.
The "failure of U+08A1 to have a (non-identity) decomposition", while it
complicates the design of a system of robust mnemonic identifiers (such
as those under discussion here), appears not to be due to a "breakdown" of
the encoding process, and does not constitute a break of any encoding
stability promises by the Unicode Consortium. Rather, it represents a
reasoned and principled judgment of what is or is not the "same abstract
character". That judgment has to be made somewhere in the encoding
process, and the bodies responsible for character encoding get to make it.
Asserting, to the contrary, that there should be a principle that requires
that all homographs are the same abstract character would mean basing
encoding decisions entirely on the shape, or appearance, of characters and
code point sequences. Under that logic, TAMIL LETTER KA and TAMIL DIGIT ONE
would be the same abstract character, and a (non-identity) decomposition
would be expected. That's just not how it works.
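The Tamil example is easy to verify: the two code points are homographs,
yet neither carries any decomposition at all, and no normalization form
relates them (again a small stdlib unicodedata check):

```python
import unicodedata

ka = "\u0B95"   # TAMIL LETTER KA
one = "\u0BE7"  # TAMIL DIGIT ONE

assert unicodedata.name(ka) == "TAMIL LETTER KA"
assert unicodedata.name(one) == "TAMIL DIGIT ONE"

# An empty string from unicodedata.decomposition() means
# "no decomposition mapping": neither character decomposes.
assert unicodedata.decomposition(ka) == ""
assert unicodedata.decomposition(one) == ""

# They remain distinct under every normalization form.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, ka) != unicodedata.normalize(form, one)
```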
That said, Unicode is generally (and correctly) reluctant to encode
homographs. One of the earliest and most ardently requested changes was the
proposed separation of "period" and "decimal point". It got rejected, and it
was not the only one. Where homographs are encoded, they generally follow
certain principles. And while these principles will, over time, lead to the
encoding of a few more homographs, they, in turn, keep things predictable.
From my understanding, the case in question fully follows these principles
as they are applicable to the encoding of characters for the Arabic script.
> Should we treat all of these as confusables?
Yes, that's the obvious way to handle them. If you have zones that support
the concept of (blocked) variants, you can go further and make them that,
which has the effect of making them confusables that are declared up front
as such in the policy, not "discovered" in later steps of string review.
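A minimal sketch of that "declared up front" approach: assuming a
hand-maintained variant table (the table below lists only the pair from
this thread and is purely illustrative, as are the function names), a
registry folds each label to a variant key and blocks any registration
whose key collides with an existing one:

```python
import unicodedata

# Hypothetical, hand-maintained table of declared variants: each key
# folds to a representative shared by all its homographs. Only the
# pair discussed in this thread is listed, for illustration.
BLOCKED_VARIANTS = {
    "\u08A1": "\u0628\u0654",  # BEH WITH HAMZA ABOVE -> BEH + HAMZA ABOVE
}

def variant_key(label: str) -> str:
    """Fold a label so that declared homographs collide."""
    nfc = unicodedata.normalize("NFC", label)
    return "".join(BLOCKED_VARIANTS.get(ch, ch) for ch in nfc)

def register(zone: dict, label: str) -> bool:
    """Register a label unless a declared variant is already taken."""
    key = variant_key(label)
    if key in zone:
        return False  # blocked: collides with an existing registration
    zone[key] = label
    return True

zone = {}
assert register(zone, "\u0628\u0654")  # first registration succeeds
assert not register(zone, "\u08A1")    # its homograph is blocked up front
```

The point of folding at registration time, rather than scanning later, is
exactly what the paragraph above describes: the collision is a stated
policy outcome, not something discovered during string review.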