<html>

  <head>

    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">On 1/23/2015 1:14 AM, "Martin J. Dürst"

      wrote:<br>

    </div>

    <blockquote cite="mid:54C210E8.9000608@it.aoyama.ac.jp" type="cite">Hello

      Asmus,

      <br>

      <br>

      On 2015/01/22 11:58, Asmus Freytag wrote:

      <br>

      <br>

      <blockquote type="cite">I would go further, and claim that the

        notion that "*all homographs are

        <br>

        the**

        <br>

        **same abstract character*" is *misplaced, if not incorrect*.

        <br>

      </blockquote>

      <br>

      That's fine. Nobody would claim that 8 (U+0038) and ৪ (Bengali 4,

      U+09EA) are the same abstract character. (How 'homographic' they

      look will depend on what fonts your mail user agent uses :-)

      <br>

    </blockquote>

    <br>

    When I use the term homograph, it is with reference to shapes that

    are <b>the same by design</b>, not some degree of similarity, and

    certainly not any degree of "<i><b>accidental</b></i>"

    similarity. For example, the ideograph for 'one' is not a homograph

    of the dash or hyphen, even if both are based on the idea of a

    single horizontal line - those instances are merely of potential

    confusable similarity. For a true homograph situation, you really

    have to have a case where two code points were assigned to the "same

    thing", or, since the term homograph refers to the appearance, to

    two functions of the "same mark on paper".<br>

    <br>

    Had Unicode encoded a base-line decimal point in distinction from

    the period, that would be a case of a homograph relation.<br>

    <br>

    For non-hypothetical homographs look no further than Greek omicron

    and Latin o, or the TAMIL KA and TAMIL digit 1.<br>

    <br>

    In all of these examples, while the shape is utterly the same, there

    is a need for a separate, non-normalizable coded representation. And

    (as the hypothetical example of the decimal point shows) it is not

    usually enough for some mark to have different conventions around

    its use for it to be encoded "by function". Many conventions are

    handled by software as a matter of context, just as they are by the

    human reader -- but not all.<br>
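    <br>
    A minimal sketch of the point above, assuming Python's standard
    unicodedata module: none of the four normalization forms folds these
    homograph pairs together. The helper name still_distinct and the pair
    list are illustrative choices, not part of any Unicode or IDNA table;
    the third pair is the U+08A1 case this thread is about.<br>
    <pre>
```python
import unicodedata

# Homograph pairs: same shape by design, separately encoded.
HOMOGRAPH_PAIRS = [
    ("\u03bf", "o"),             # GREEK SMALL LETTER OMICRON vs LATIN SMALL LETTER O
    ("\u0b95", "\u0be7"),        # TAMIL LETTER KA vs TAMIL DIGIT ONE
    ("\u08a1", "\u0628\u0654"),  # BEH WITH HAMZA ABOVE vs BEH + combining HAMZA ABOVE
]

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def still_distinct(a, b):
    """Return True if a and b differ under every normalization form."""
    return all(
        unicodedata.normalize(form, a) != unicodedata.normalize(form, b)
        for form in FORMS
    )


for a, b in HOMOGRAPH_PAIRS:
    left = [f"U+{ord(c):04X}" for c in a]
    right = [f"U+{ord(c):04X}" for c in b]
    print(left, right, "distinct after normalization:", still_distinct(a, b))
```
    </pre>
    By contrast, a canonically equivalent pair such as U+00E9 and
    U+0065 U+0301 does collapse under normalization, which is exactly
    the relation the homograph pairs lack.<br>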

    <br>

    <blockquote cite="mid:54C210E8.9000608@it.aoyama.ac.jp" type="cite">

      <br>

      <br>

      <blockquote type="cite">U+08A1 is not the only character that has

        a non-decomposable homograph, and

        <br>

        because the encoding of it wasn't an accident, but follows a

        principle

        <br>

        applied

        <br>

        by the Unicode Technical Committee, it won't, and can't be the

        last

        <br>

        instance of

        <br>

        a non-decomposable homograph.

        <br>

        <br>

        The "failure of U+08A1 to have a (non-identity) decomposition",

        while it

        <br>

        perhaps

        <br>

        complicates the design of a system of robust mnemonic

        identifiers (such

        <br>

        as IDNs)

        <br>

        it appears not be be due to a "breakdown" of the encoding

        process and

        <br>

        also does

        <br>

        not constitute a break of any encoding stability promises  by

        the Unicode

        <br>

        Consortium.

        <br>

        <br>

        Rather, it represents reasoned, and principled judgment of what

        is or

        <br>

        isn't the

        <br>

        "same abstract character". That judgment has to be made

        somewhere in the

        <br>

        process, and the bodies responsible for character encoding get

        to make the

        <br>

        determination.

        <br>

      </blockquote>

      <br>

      While I can agree with this characterization, many judgements on

      character encoding are by their very nature borderline, and U+08A1

      definitely in many aspects is borderline. </blockquote>

    <br>

    Totally agreed. I would phrase it differently. In character encoding

    few questions are black and white. Most are more akin to dark-gray

    vs. light gray. Some can look like neutral gray all around. <br>

    <br>

    A few issues don't have a good solution, because with every context

    you choose to view the question under, the trade-offs are different.

    Whichever solution you pick for one of those issues, some

    implementation will be burdened with costs. Luckily, these

    situations are not that frequent.<br>

    <br>

    But they do exist, and it is well understood that there is no single

    set of principles that will help you arrive at a "correct" solution

    in these cases. They follow from the universal nature of the

    universal character set. Therefore, it is well understood that some

    of the principles cannot be satisfied simultaneously in such cases.<br>

    <br>

    If that is what you mean by "borderline", I might agree that this

    could be one of those cases.<br>

    <br>

    <br>

    <blockquote cite="mid:54C210E8.9000608@it.aoyama.ac.jp" type="cite">What

      I hope is that the Unicode Technical Committee, when making

      future, similar decisions, hopefully puts the borderline a bit

      more in support of applications such as identifiers, and a bit

      less in favor of splitting. Also, that it realize that when

      principles lead to more and more homograph encodings, it may very

      well pay off to reexamine some of these principles before going

      down a slippery slope.

      <br>

    </blockquote>

    <br>

    In this particular case, it looks like the orthography supported is

    (at this point) not even mainstream, because Latin appears to be the

    main script to write the language(s) in question.<br>

    <br>

    In designing repertoire tables for identifiers, one of the easiest

    ways to make them more robust is to <b>remove irrelevant code points</b>.

    I've been engaged in a process designed to create a repertoire for

    the DNS Root Zone. In that process, we rigorously remove code points

    that are not in widespread modern use (where that is measured

    relative to the community affected).<br>
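    <br>
    The idea of a restricted repertoire can be sketched as a simple
    allow-list check. Everything here is an assumption made for
    illustration: the set HYPOTHETICAL_ARABIC_REPERTOIRE is a made-up
    slice of basic Arabic letters, not the actual draft Root Zone
    repertoire, and the function names are invented for this example.<br>
    <pre>
```python
# Hypothetical repertoire: basic Arabic letters U+0621..U+064A, minus
# TATWEEL (U+0640). This is NOT a real zone table; it merely leaves out
# combining marks such as U+0654 and specialized letters such as U+08A1.
HYPOTHETICAL_ARABIC_REPERTOIRE = frozenset(range(0x0621, 0x064B)) - {0x0640}


def rejected_code_points(label, repertoire=HYPOTHETICAL_ARABIC_REPERTOIRE):
    """List the code points of label that fall outside the repertoire."""
    return [f"U+{ord(ch):04X}" for ch in label if ord(ch) not in repertoire]


def is_admissible(label, repertoire=HYPOTHETICAL_ARABIC_REPERTOIRE):
    """A label is admissible only if every code point is in the repertoire."""
    return not rejected_code_points(label, repertoire)


for label in ("\u0628\u0627\u0628", "\u0628\u0654", "\u08a1"):
    print(ascii(label), is_admissible(label), rejected_code_points(label))
```
    </pre>
    With neither code point in the repertoire, the homograph question
    never arises for that zone, whatever the protocol itself permits.<br>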

    <br>

    Some other poster likened the case of U+08A1 to cuneiform. Because

    IDNA2008 allows all the historic scripts (like cuneiform), there

    seems to be no principled position to single out this one code

    point. Instead, if it's troublesome, treat it like cuneiform and

    don't admit it in your zone repertoire.<br>

    <br>

    Speaking of cuneiform (and any number of other ancient writing

    systems, especially those with more than a few hundred elements), I

    don't believe anybody knows how to create a robust system of

    identifiers that includes these writing systems because very few

    people really understand them well enough to understand whether they

    harbor issues of homographs or confusables by similarity and to what

    degree.<br>

    <br>

    Compared to those systems then, the way to treat U+08A1 is to not

    allow it in any zone that doesn't explicitly need to cater to those

    writing Fula in Arabic. If that zone should need to support more

    than the Fula language, it still may not need to support the

    combining hamza at 0654. Certainly, the current draft Arabic

    repertoire for the Root Zone does not include that code point (and

    the group of community members and local experts have good reasons

    for that exclusion).<br>

    <br>

    The whole issue arose because people were staring at it from what

    can be / should be done at the protocol level, where it would not be

    possible to declare 0654 INVALID. While it would have been nice to

    remove this issue in the protocol, it's the wrong place to do so,

    because it's not possible to tell at the protocol level which of

    these code points are in fact irrelevant. (That's totally similar to

    other kinds of homographs).<br>

    <br>

    However, in the final analysis, both 0654 and 08A1 are equally

    specialized and the most natural solution is to support neither,

    or only 08A1, in a given zone repertoire, because the necessity to

    support 0654 for a system of robust identifiers is to be questioned.<br>

    <br>

    ----<br>

    <br>

    What should be the outcome of this storm in a tea-cup?<br>

    <br>

    <ul>

      <li>The full Unicode 7.0 IDNA tables should be released (with

        08A1).</li>

      <li>The mistaken notion that normalization eliminates all

        homographs should be highlighted.</li>

      <li>A list of known homographs (confusables by design, not

        accident) should be maintained.</li>

      <li>A recommendation should be made for zone operators to robustly

        handle them by:</li>

    </ul>

    <blockquote>- supporting only one<br>

      - supporting both, but implementing the equivalent of a Pauli

      exclusion principle<br>

        (in other words, make them "blocked variants")<br>

    </blockquote>
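    <br>
    A rough sketch of what "blocked variants" could look like in a
    registry, under stated assumptions: the table HOMOGRAPH_TABLE, the
    function variant_key and the class Zone are all hypothetical names
    made up for this example, and the table holds only the single pair
    discussed in this thread. A real zone would draw on a maintained
    list of known homographs.<br>
    <pre>
```python
# Hypothetical homograph table: maps a sequence to the representative
# used when comparing labels. Only the pair from this thread is listed.
HOMOGRAPH_TABLE = {
    "\u0628\u0654": "\u08a1",  # BEH + HAMZA ABOVE is a variant of U+08A1
}


def variant_key(label):
    """Fold every known homograph sequence to its representative."""
    key = label
    for sequence, representative in HOMOGRAPH_TABLE.items():
        key = key.replace(sequence, representative)
    return key


class Zone:
    """Toy registry in which homograph variants exclude each other."""

    def __init__(self):
        self._by_key = {}  # variant key -> label actually registered

    def register(self, label):
        """Register label unless it, or a variant of it, is already taken."""
        key = variant_key(label)
        if key in self._by_key:
            return False  # blocked: this slot is already occupied
        self._by_key[key] = label
        return True


zone = Zone()
print(zone.register("\u08a1\u0627"))        # first spelling is accepted
print(zone.register("\u0628\u0654\u0627"))  # its homograph variant is blocked
```
    </pre>
    Both spellings stay valid at the protocol level; the zone merely
    ensures that at most one of them can ever be delegated.<br>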

    -----<br>

    <br>

    A./<br>

    <br>

    <br>

  </body>

</html>