[Json] Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT)
Asmus Freytag
asmusf at ix.netcom.com
Wed Jan 21 21:36:04 CET 2015
On 1/20/2015 8:04 PM, "Martin J. Dürst" wrote:
> P.S.: Please note that the comments above don't mean that I'm happy
> with the inclusion of U+08A1 in Unicode 7.0.0, and that I sincerely
> hope the Unicode Consortium will weight the problems of identifier
> confusability higher in their future decisions.
Given that an almost exactly parallel issue existed for a code point
used in several Western European languages from Unicode 1.0, I think it
would be wrong for Unicode to suddenly change the way non-Western
scripts are encoded. Treating this particular case in isolation obscures
the real issue and will possibly prevent or delay a proper design of
identifier repertoires.
To put the issue in general terms, it concerns the fact that there are
homographs in Unicode. These are two distinct code point sequences with
normally identical appearance. In particular, the concern is with
homographs that are present after normalization.
I'm very much on board with highlighting that issue, which is
that *applying NFC does not eliminate homographs*. And, having just
finished the exercise of reviewing the full repertoire of modern scripts
for suitability for the DNS root zone, I can attest that there are more
homographs than people might initially suspect, and quite a few of them
in Latin.
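As a minimal sketch of that point, using Python's standard unicodedata module: the pairs chosen below (o-slash vs. <o, 0338>, and U+08A1 vs. beh + combining Hamza above) are the ones discussed in this thread, and the helper name survives_nfc is made up for illustration.

```python
import unicodedata

def nfc(s):
    """Return the NFC form of s."""
    return unicodedata.normalize("NFC", s)

def survives_nfc(a, b):
    """True if a and b are still distinct strings after NFC."""
    return nfc(a) != nfc(b)

# Canonically equivalent pair: NFC maps both to U+00E9, so it is not a
# homograph problem once labels are normalized.
assert not survives_nfc("\u00e9", "e\u0301")

# Pairs with (normally) identical appearance that have no canonical
# decomposition relating them, so NFC leaves them distinct.
candidates = {
    "Latin o-slash": ("\u00f8", "o\u0338"),
    "Arabic beh with hamza above": ("\u08a1", "\u0628\u0654"),
}

for name, (precomposed, sequence) in candidates.items():
    print(name, "distinct after NFC:", survives_nfc(precomposed, sequence))
```

Any check that relies on NFC alone would treat both forms in each of the two candidate pairs as different identifiers.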
In most, but not all, cases, one of the two is a code point or code point
sequence that exists for a very specialized purpose. Often only one of
the forms is actually used in general orthographies and only one of the
forms would therefore be expected to occur in a set of mnemonic identifiers.
Unfortunately, it is not always the composite one that should be
supported. For example, Unicode has several non-normalizable Latin
digraphs that are encoded for special usage scenarios; in these cases
the individual code points must be supported. In some cases, a letter
and digit may be homographs (the example of க and ௧ (Tamil 'ka' and '1')
as mentioned in the preceding post). Both would be supported for
different purposes. Finally, in some cases, there's a combining mark
that (given its name and general appearance) might be expected to yield
the same appearance when applied to some base letters, as certain
precomposed forms. In Latin, this applies to combining overlays,
because, on principle, the Unicode standard does not decompose
orthographic characters for which the shape is derived by striking
through part or all of the letter form.
As in the case of the Arabic script, any such characters needed for an as
yet unencoded Latin orthography would be encoded with a composite glyph
shape, but without decomposition.
*The proper response for IDNA2008* would be to inventory these cases
and *strongly warn* that they not be incorporated unexamined into
general repertoires; or, if they have to be supported, that Label
Generation Rulesets (aka IDN tables) support context or variant rules
that prevent these from co-occurring in any minimal pair of labels.
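A rough sketch of what such a variant rule amounts to, under stated assumptions: the VARIANTS mapping and the function names are hypothetical, and this is not the format or processing model of any actual Label Generation Ruleset. It only illustrates blocking a label when a registered label differs from it solely by a homograph.

```python
import unicodedata

# Hypothetical variant mapping: each sequence on the left is treated as
# the same as the code point on the right when comparing labels.
VARIANTS = {
    "o\u0338": "\u00f8",        # <o, 0338> vs. o-slash
    "\u0628\u0654": "\u08a1",   # beh + combining Hamza above vs. U+08A1
}

def comparison_key(label):
    """Normalize to NFC, then fold each variant sequence to one form."""
    key = unicodedata.normalize("NFC", label)
    for sequence, single in VARIANTS.items():
        key = key.replace(sequence, single)
    return key

def can_register(label, registered):
    """False if label would form a minimal pair with a registered label."""
    key = comparison_key(label)
    return all(comparison_key(existing) != key for existing in registered)

registered = ["so\u0338n"]
print(can_register("s\u00f8n", registered))  # homograph of a registered label
print(can_register("son", registered))       # visibly different label
```

Neither code point is made invalid here; the rule only keeps the two forms from co-occurring in otherwise identical labels.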
For language-specific IDN tables, it's often possible to eliminate one
or the other alternative.
For example, a Danish IDN table would rule out 0338 (combining slash),
so that <o, 0338> cannot exist alongside o-slash. For a Fula-specific
IDN table, one would rule out the combining Hamza - it has no place in
that orthography.
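The two examples can be sketched as follows. This is an illustration only: the EXCLUDED dictionary, the language tags used as keys, and the function name are assumptions, and a real IDN table would normally list the permitted repertoire rather than a set of exclusions.

```python
# Hypothetical per-language exclusions, keyed by language tag.
EXCLUDED = {
    "da": {"\u0338"},  # Danish: rule out the combining slash
    "ff": {"\u0654"},  # Fula: rule out the combining Hamza above
}

def allowed(label, lang):
    """True if label contains no code point excluded for lang."""
    return not any(ch in EXCLUDED[lang] for ch in label)

print(allowed("s\u00f8n", "da"))      # o-slash itself remains usable
print(allowed("so\u0338n", "da"))     # <o, 0338> is ruled out
print(allowed("\u08a1", "ff"))        # U+08A1 remains usable
print(allowed("\u0628\u0654", "ff"))  # beh + combining Hamza is ruled out
```

With one alternative removed per table, the homograph pair cannot arise within that language's labels.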
Eliminating any particular homographs on an ad-hoc basis in IDNA2008 by
making one of the code points INVALID does not solve the general
problem, but unnecessarily prevents language-specific solutions in a way
that is at best inconsistent and at worst discriminatory.
The Fula character is a good example of a pseudo-decomposable character
that is needed for consistent encoding of a hitherto not fully supported
orthography, while the code point sequence serves a specialized purpose
elsewhere.
It is very important that, whatever solution is decided on for
IDNA2008, the IETF not haphazardly single out a particular instance of
a general pattern.
A./