[Json] Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT)

Wed Jan 21 21:36:04 CET 2015

On 1/20/2015 8:04 PM, "Martin J. Dürst" wrote:
> P.S.: Please note that the comments above don't mean that I'm happy 
> with the inclusion of U+08A1 in Unicode 7.0.0, and that I sincerely 
> hope the Unicode Consortium will weight the problems of identifier 
> confusability higher in their future decisions. 

Given that an almost exactly parallel issue existed for a code point 
used in several Western European languages from Unicode 1.0, I think it 
would be wrong for Unicode to suddenly change the way non-Western 
scripts are encoded. Treating this particular case in isolation obscures 
the real issue and will possibly prevent or delay a proper design of 
identifier repertoires.

To put the issue in general terms, it concerns the fact that there are 
homographs in Unicode. These are two distinct code point sequences with 
normally identical appearance. In particular, the concern is with 
homographs that are present after normalization.

I'm very much on board with highlighting that issue, which is 
that *applying NFC does not eliminate homographs*. And, having just 
finished the exercise of reviewing the full repertoire of modern scripts 
for suitability for the DNS root zone, I can attest that there are more 
homographs than people might initially suspect, and quite a few of them 
in Latin.
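
A minimal sketch of that point, using only Python's standard unicodedata module: it checks whether the pairs mentioned in this thread collapse to the same string under NFC. The pair list and the helper name are my own illustration, not part of any specification, and the outcome depends on the Unicode data shipped with the Python build in use.

```python
import unicodedata

# Homograph pairs mentioned in this thread (illustrative list, not exhaustive).
PAIRS = [
    ("\u08A1", "\u0628\u0654"),  # BEH WITH HAMZA ABOVE vs. BEH + combining HAMZA ABOVE
    ("\u00F8", "o\u0338"),       # o-slash vs. o + COMBINING LONG SOLIDUS OVERLAY
    ("\u0B95", "\u0BE7"),        # TAMIL LETTER KA vs. TAMIL DIGIT ONE
]

def still_distinct_after_nfc(a, b):
    """Return True if the two strings remain different after NFC."""
    return unicodedata.normalize("NFC", a) != unicodedata.normalize("NFC", b)

# One boolean per pair: does the homograph survive normalization?
results = [still_distinct_after_nfc(a, b) for a, b in PAIRS]

# For contrast: o + COMBINING DIAERESIS does compose to o-diaeresis under NFC.
composes = unicodedata.normalize("NFC", "o\u0308") == "\u00F6"
```

The contrast case is there to show that NFC does unify sequences that have a canonical decomposition; the pairs above simply do not have one.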

In most, but not all cases, one of the two is a code point or code point 
sequence that exists for a very specialized purpose. Often only one of 
the forms is actually used in general orthographies and only one of the 
forms would therefore be expected to occur in a set of mnemonic identifiers.

Unfortunately, it is not always the composite one that should be 
supported. For example, Unicode has several non-normalizable Latin 
digraphs that are encoded for special usage scenarios; in these cases 
the individual code points must be supported.  In some cases, a letter 
and digit may be homographs (the example of க and ௧ (Tamil 'ka' and '1') 
as mentioned in the preceding post). Both would be supported for 
different purposes. Finally, in some cases, there's a combining mark 
that (given its name and general appearance) might be expected, when 
applied to some base letters, to yield the same appearance as certain 
precomposed forms. In Latin, this applies to combining overlays, 
because, on principle, the Unicode standard does not decompose 
orthographic characters for which the shape is derived by striking 
through part or all of the letter form.

As in the case of the Arabic script, any such characters needed for an 
as yet unencoded Latin orthography would be encoded with a composite 
glyph shape, but without a decomposition.

*The proper response for IDNA2008* would be to inventory these cases 
and *strongly warn* that they not be incorporated unexamined into 
general repertoires; or, if they have to be supported, that Label 
Generation Rulesets (aka IDN tables) support context or variant rules 
that prevent these from co-occurring in any minimal pair of labels.
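
To illustrate what such a variant rule amounts to, here is a rough sketch in Python. It is not the actual Label Generation Ruleset format; the VARIANTS mapping, the variant_key helper, and the Registry class are all hypothetical names of my own. The idea is only that two labels forming a minimal pair map to the same key, so the second one is blocked.

```python
# Hypothetical variant mappings: each look-alike sequence is mapped to
# one preferred form. A real ruleset would be far more extensive.
VARIANTS = {
    "\u0628\u0654": "\u08A1",  # BEH + combining HAMZA ABOVE -> BEH WITH HAMZA ABOVE
    "o\u0338": "\u00F8",       # o + combining overlay -> o-slash
}

def variant_key(label):
    """Replace every listed look-alike sequence with its preferred form."""
    for sequence, preferred in VARIANTS.items():
        label = label.replace(sequence, preferred)
    return label

class Registry:
    """Toy registry that refuses a label whose variant key is already taken."""

    def __init__(self):
        self._by_key = {}

    def register(self, label):
        key = variant_key(label)
        if key in self._by_key:
            return False  # would form a minimal pair with an existing label
        self._by_key[key] = label
        return True
```

With this in place, registering "sø" and then "so" followed by U+0338 would succeed for the first and fail for the second, while neither code point is made invalid as such.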

For language-specific IDN tables, it's often possible to eliminate one 
or the other alternative.

For example, a Danish IDN table would rule out U+0338 (combining slash), 
so that <o, U+0338> cannot exist alongside o-slash. For a Fula-specific 
IDN table, one would rule out the combining Hamza - it has no place in 
that orthography.
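
As a sketch only: the two examples above could be expressed as per-language exclusions like this. The EXCLUDED table, the language tags "da" and "ff", and the label_allowed function are assumptions made for illustration; real IDN tables are normally written as lists of permitted code points rather than exclusions.

```python
# Hypothetical per-language exclusions, keyed by language tag.
EXCLUDED = {
    "da": {"\u0338"},  # Danish: no COMBINING LONG SOLIDUS OVERLAY
    "ff": {"\u0654"},  # Fula: no combining ARABIC HAMZA ABOVE
}

def label_allowed(label, lang):
    """Return True if the label contains no code point excluded for lang."""
    excluded = EXCLUDED.get(lang, set())
    return not any(ch in excluded for ch in label)
```

Under these assumptions, a Danish label can contain o-slash but not the sequence <o, U+0338>, and a Fula label can contain U+08A1 but not Beh followed by the combining Hamza.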

Eliminating any particular homographs on an ad-hoc basis in IDNA2008 by 
making one of the code points INVALID does not solve the general 
problem, but unnecessarily prevents language-specific solutions in a way 
that is at best inconsistent and at worst discriminatory.

The Fula character is a good example of a pseudo-decomposable character 
that is needed for consistent encoding of a hitherto not fully supported 
orthography, while the code point sequence serves a specialized purpose 
elsewhere.

It is very important that, whatever solution is decided on for 
IDNA2008, the IETF not haphazardly single out a particular instance of 
a general pattern.

A./