<html>

  <head>

    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">On 1/20/2015 8:04 PM, "Martin J. Dürst"

      wrote:<br>

    </div>

    <blockquote cite="mid:54BF2546.9030006@it.aoyama.ac.jp" type="cite">P.S.:

      Please note that the comments above don't mean that I'm happy with

      the inclusion of U+08A1 in Unicode 7.0.0, and that I sincerely

      hope the Unicode Consortium will weigh the problems of identifier

      confusability higher in their future decisions.

    </blockquote>

    <br>

    <font face="Candara">Given that an almost exactly parallel issue

      existed for a code point used in several Western European

      languages from Unicode </font>1.0, I think it would be wrong for

    Unicode to suddenly change the way non-Western scripts are encoded.

    Treating this particular case in isolation obscures the real issue

    and will possibly prevent or delay a proper design of identifier

    repertoires.<br>

    <br>

    To put the issue in general terms, it concerns the fact that there

    are homographs in Unicode: pairs of distinct code point

    sequences whose appearance is normally identical. In particular, the

    concern is with homographs that are present after normalization. <br>

    <br>

    I'm very much on board with highlighting that issue, which is that

      <b>applying NFC does not eliminate homographs</b>. And,

    having just finished the exercise of reviewing the full repertoire

    of modern scripts for suitability for the DNS root zone, I can

    attest that there are more homographs than people might initially

    suspect, and quite a few of them in Latin.<br>

    <br>
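    As a rough illustration of that point, here is a minimal Python sketch
    using the standard <code>unicodedata</code> module. The list of pairs and
    the helper name <code>survives_nfc</code> are my own choices for the
    example, not part of any specification; the pairs are the ones discussed
    in this thread.<br>

```python
import unicodedata

# Pairs of code point sequences that normally render alike.
# The selection is illustrative only, not an inventory.
HOMOGRAPH_PAIRS = [
    ("\u00F8", "o\u0338"),        # o-slash vs. o + combining long solidus overlay
    ("\u08A1", "\u0628\u0654"),   # beh with hamza above vs. beh + combining hamza above
    ("\u0B95", "\u0BE7"),         # Tamil letter ka vs. Tamil digit one
]


def survives_nfc(a, b):
    """Return True if a and b are still distinct strings after NFC."""
    return unicodedata.normalize("NFC", a) != unicodedata.normalize("NFC", b)


for a, b in HOMOGRAPH_PAIRS:
    left = " ".join(f"U+{ord(c):04X}" for c in a)
    right = " ".join(f"U+{ord(c):04X}" for c in b)
    print(f"{left}  vs.  {right}  distinct after NFC: {survives_nfc(a, b)}")
```

    By contrast, a pair such as e-acute and e + combining acute is unified by
    NFC, so the same helper reports it as no longer distinct.<br>
    <br>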

    In most, but not all cases, one of the two is a code point or code

    point sequence that exists for a very specialized purpose. Often

    only one of the forms is actually used in general orthographies and

    only one of the forms would therefore be expected to occur in a set

    of mnemonic identifiers.<br>

    <br>

    Unfortunately, it is not always the composite one that should be

    supported. For example, Unicode has several non-normalizable Latin

    digraphs that are encoded for special usage scenarios; in these

    cases the individual code points must be supported.  In some cases,

    a letter and digit may be homographs (the example of க and ௧ (Tamil

    'ka' and '1') as mentioned in the preceding post). Both would be

    supported for different purposes. Finally, in some cases, there's a

    combining mark that (given its name and general appearance) might be

    expected, when applied to some base letters, to yield the same

    appearance as certain precomposed forms. In Latin, this applies to

    combining overlays, because, on principle, the Unicode standard does

    not decompose orthographic characters for which the shape is derived

    by striking through part or all of the letter form.<br>

    <br>
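    The absence of a decomposition for stroked letters can be checked
    directly against the Unicode Character Database. The following is only a
    sketch; the helper name <code>decomposition_of</code> and the choice of
    sample letters are mine.<br>

```python
import unicodedata


def decomposition_of(ch):
    """Return the UCD decomposition mapping of ch, or '' if it has none."""
    return unicodedata.decomposition(ch)


# e-acute decomposes; the stroked letters do not.
for ch in ("\u00E9", "\u00F8", "\u0142", "\u0111"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), repr(decomposition_of(ch)))
```

    An empty mapping means that normalization can never relate the
    precomposed letter to a base letter plus overlay sequence.<br>
    <br>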

    As in the case of the Arabic script, any such characters needed for

    an as yet unencoded Latin orthography would be encoded with a

    composite glyph shape, but without a decomposition.<br>

    <br>

    <b>The proper response for IDNA2008</b> would be to inventory

    these cases and <b>strongly warn</b> that they not be incorporated

    unexamined into general repertoires; or, if they have to be

    supported, that Label Generation Rulesets (aka IDN tables) support

    context or variant rules that prevent these from co-occurring in any

    minimal pair of labels.<br>

    <br>
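    To make the idea of a variant rule concrete, here is a hedged Python
    sketch of a blocking check. Everything in it is hypothetical: the
    <code>VARIANTS</code> table, the <code>blocking_key</code> helper and the
    <code>Registry</code> class are invented for illustration and do not
    reflect the actual LGR format or any registry's implementation.<br>

```python
# Hypothetical variant table: each homograph sequence maps to the one
# representative chosen for it. Entries are illustrative only.
VARIANTS = {
    "o\u0338": "\u00F8",        # o + combining long solidus overlay -> o-slash
    "\u0628\u0654": "\u08A1",   # beh + combining hamza above -> beh with hamza above
}


def blocking_key(label):
    """Replace every listed variant sequence with its representative."""
    for sequence, representative in VARIANTS.items():
        label = label.replace(sequence, representative)
    return label


class Registry:
    """Toy registry that refuses a label if a variant of it already exists."""

    def __init__(self):
        self._by_key = {}

    def register(self, label):
        """Return True if the label was registered, False if it is blocked."""
        key = blocking_key(label)
        if key in self._by_key:
            return False
        self._by_key[key] = label
        return True


registry = Registry()
print(registry.register("s\u00F8en"))    # first label of the pair
print(registry.register("so\u0338en"))   # its homograph
```

    With such a rule both forms remain valid in principle, but the two
    members of a minimal pair can never be delegated side by side.<br>
    <br>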

    For language-specific IDN tables, it's often possible to eliminate

    one or the other alternative.<br>

    <br>

    For example, a Danish IDN table would rule out U+0338 (combining

    slash), so that &lt;o, U+0338&gt; cannot exist alongside o-slash. For

    a Fula-specific IDN table, one would rule out the combining Hamza;

    it has no place in that orthography.<br>

    <br>
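    A language-specific table of this kind amounts to a repertoire check. The
    Python sketch below assumes a simplified Danish repertoire (a to z, the
    digits, hyphen, plus the three additional Danish letters); it is not an
    actual registry table, and the name <code>disallowed_code_points</code>
    is made up for the example.<br>

```python
# Assumed, simplified Danish repertoire; not an actual registry table.
DANISH_REPERTOIRE = set("abcdefghijklmnopqrstuvwxyz0123456789-") | {
    "\u00E6",  # ae ligature
    "\u00F8",  # o-slash
    "\u00E5",  # a with ring above
}


def disallowed_code_points(label, repertoire=DANISH_REPERTOIRE):
    """List the code points in label that fall outside the repertoire."""
    return [f"U+{ord(c):04X}" for c in label if c not in repertoire]


print(disallowed_code_points("k\u00F8benhavn"))    # precomposed o-slash
print(disallowed_code_points("ko\u0338benhavn"))   # o + combining overlay
```

    Because the combining overlay is simply absent from the repertoire, the
    homograph sequence cannot occur in any label under this table.<br>
    <br>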

    Eliminating any particular homographs on an ad-hoc basis in IDNA2008

    by making one of the code points INVALID does not solve the general

    problem, but unnecessarily prevents language-specific solutions in a

    way that is at best inconsistent and at worst discriminatory. <br>

    <br>

    The Fula character is a good example of a pseudo-decomposable

    character that is needed for consistent encoding of a hitherto not

    fully supported orthography, while the code point sequence serves a

    specialized purpose elsewhere.<br>

    <br>

    It is very important that, whatever solution is decided on for

    IDNA2008, the IETF not haphazardly single out a particular instance

    of a general pattern.<br>

    <br>

    A./<br>

  </body>

</html>