<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 1/20/2015 8:04 PM, "Martin J. Dürst"
wrote:<br>
</div>
<blockquote cite="mid:54BF2546.9030006@it.aoyama.ac.jp" type="cite">P.S.:
Please note that the comments above don't mean that I'm happy with
the inclusion of U+08A1 in Unicode 7.0.0, and that I sincerely
hope the Unicode Consortium will weight the problems of identifier
confusability higher in their future decisions.
</blockquote>
<br>
<font face="Candara">Given that an almost exactly parallel issue
has existed, for a code point used in several Western European
languages, since Unicode </font>1.0, I think it would be wrong for
Unicode to suddenly change the way non-Western scripts are encoded.
Treating this particular case in isolation obscures the real issue
and will possibly prevent or delay a proper design of identifier
repertoires.<br>
<br>
To put the issue in general terms, it concerns the fact that there
are homographs in Unicode: pairs of distinct code point sequences
that normally have identical appearance. In particular, the
concern is with homographs that are still present after normalization. <br>
<br>
I'm very much on board with highlighting that issue, which is that
<b>applying NFC does not eliminate homographs</b>. And,
having just finished the exercise of reviewing the full repertoire
of modern scripts for suitability for the DNS root zone, I can
attest that there are more homographs than people might initially
suspect, and quite a few of them in Latin.<br>
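<br>
As a rough illustration of that point, here is a minimal Python sketch using only the standard <code>unicodedata</code> module. The two pairs are the ones discussed in this thread; the helper name <code>nfc_distinct</code> is purely illustrative, and the sketch assumes a Python build whose Unicode data is version 7.0 or later:<br>

```python
import unicodedata

def nfc_distinct(a, b):
    """Return True if a and b are still different strings after NFC."""
    return unicodedata.normalize("NFC", a) != unicodedata.normalize("NFC", b)

# Homograph pairs taken from the discussion: (single code point, sequence).
PAIRS = {
    "beh with hamza above": ("\u08A1", "\u0628\u0654"),
    "o with stroke": ("\u00F8", "o\u0338"),
}

for name, (single, sequence) in PAIRS.items():
    # Each pair normally renders identically, yet NFC keeps both forms.
    print(name, nfc_distinct(single, sequence))
```

Both pairs report True. By contrast, an ordinary canonically equivalent pair such as U+00E9 versus e followed by U+0301 would report False, because NFC composes the sequence.<br>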
<br>
In most, but not all, cases, one of the two is a code point or code
point sequence that exists for a very specialized purpose. Often
only one of the forms is actually used in general orthographies and
only one of the forms would therefore be expected to occur in a set
of mnemonic identifiers.<br>
<br>
Unfortunately, it is not always the composite one that should be
supported. For example, Unicode has several non-normalizable Latin
digraphs that are encoded for special usage scenarios; in these
cases the individual code points must be supported. In some cases,
a letter and a digit may be homographs (for example க and ௧, Tamil
'ka' and '1', as mentioned in the preceding post); both would be
supported for different purposes. Finally, in some cases, there's a
combining mark that (given its name and general appearance) might be
expected, when applied to some base letters, to yield the same
appearance as certain precomposed forms. In Latin, this applies to
combining overlays, because, as a matter of principle, the Unicode
Standard does not decompose orthographic characters for which the
shape is derived by striking through part or all of the letter form.<br>
<br>
As in the case of the Arabic script, any such characters needed for
an as yet unencoded Latin orthography would be encoded with a
composite glyph shape, but without decomposition.<br>
<br>
<b>The proper response for IDNA2008</b> would be to inventory
these cases and <b>strongly warn</b> that they not be incorporated
unexamined into general repertoires; or, if they have to be
supported, that Label Generation Rulesets (aka IDN tables) support
context or variant rules that prevent these from co-occurring in any
minimal pair of labels.<br>
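<br>
To make the idea of a variant rule concrete, here is a minimal Python sketch. The table <code>VARIANTS</code> and the functions <code>variant_key</code> and <code>can_register</code> are hypothetical names invented for this illustration; actual Label Generation Rulesets express variants in a much richer format. The sketch only shows the principle that two labels differing solely by a homograph substitution cannot both be registered:<br>

```python
# Hypothetical variant table: each sequence maps to the form treated as
# its representative. Real LGRs define variants in a richer format.
VARIANTS = {
    "\u0628\u0654": "\u08A1",  # beh + hamza above -> beh with hamza above
    "o\u0338": "\u00F8",       # o + long solidus overlay -> o with stroke
}

def variant_key(label):
    """Collapse every listed sequence to its representative form."""
    for sequence, representative in VARIANTS.items():
        label = label.replace(sequence, representative)
    return label

def can_register(label, registered):
    """Reject a label whose key collides with an already registered one."""
    taken = {variant_key(existing) for existing in registered}
    return variant_key(label) not in taken

registered = ["s\u00F8n"]
print(can_register("so\u0338n", registered))  # blocked: homograph of the registered label
print(can_register("sun", registered))        # allowed: no collision
```

Note that neither code point is made invalid here; the rule only prevents the two forms from co-occurring in a minimal pair.<br>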
<br>
For language-specific IDN tables, it's often possible to eliminate
one or the other alternative.<br>
<br>
For example, a Danish IDN table would rule out U+0338 (combining
long solidus overlay), so that &lt;o, 0338&gt; cannot exist alongside
o-slash. For a Fula-specific IDN table, one would rule out the
combining Hamza; it has no place in that orthography.<br>
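<br>
A language-specific table of this kind amounts to an allow-list. The following Python sketch assumes a deliberately abbreviated Danish repertoire (the set <code>DANISH</code> and the function <code>in_repertoire</code> are illustrative names, not taken from any registry's actual table):<br>

```python
# Assumed, abbreviated Danish repertoire: a-z plus ae, o-slash, a-ring.
# A real IDN table would list its full set of permitted code points.
DANISH = set("abcdefghijklmnopqrstuvwxyz") | {"\u00E6", "\u00F8", "\u00E5"}

def in_repertoire(label, repertoire):
    """True if every code point of the label is in the repertoire."""
    return all(ch in repertoire for ch in label)

print(in_repertoire("k\u00F8ge", DANISH))   # precomposed o-slash: accepted
print(in_repertoire("ko\u0338ge", DANISH))  # o + U+0338: rejected
```

Because U+0338 is simply absent from the repertoire, the homograph sequence can never appear in a label, while U+0338 remains available to other tables that need it.<br>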
<br>
Eliminating any particular homographs on an ad-hoc basis in IDNA2008
by making one of the code points INVALID does not solve the general
problem, but unnecessarily prevents language-specific solutions in a
way that is at best inconsistent and at worst discriminatory. <br>
<br>
The Fula character is a good example of a pseudo-decomposable
character that is needed for consistent encoding of a hitherto not
fully supported orthography, while the code point sequence serves a
specialized purpose elsewhere.<br>
<br>
It is very important that, whatever solution is decided on for
IDNA2008, the IETF not haphazardly single out a particular instance
of a general pattern.<br>
<br>
A./<br>
</body>
</html>