<html>

  <head>

    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">On 1/21/2015 8:39 PM, Nico Williams

      wrote:<br>

    </div>

    <blockquote cite="mid:20150122043909.GX2350@localhost" type="cite">

      <pre wrap="">

We should treat U+08A1 as confusable with U+0628 U+0654, advise

registrars to disallow it, and otherwise let IDNA treat the two as

distinct because Unicode does.</pre>

    </blockquote>

    <br>

    Agreed.<br>
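    <br>
    To make the "distinct because Unicode does" part concrete, here is a minimal sketch in Python. It assumes the standard <code>unicodedata</code> module ships data for Unicode 7.0 or later; the variable and function names are mine, purely for illustration. It checks that U+08A1 has no decomposition mapping, so none of the four normalization forms maps it to U+0628 U+0654 or the other way round.<br>
    <pre wrap="">
```python
import unicodedata

precomposed = "\u08a1"      # ARABIC LETTER BEH WITH HAMZA ABOVE
sequence = "\u0628\u0654"   # ARABIC LETTER BEH followed by ARABIC HAMZA ABOVE


def is_distinct_under(form):
    """Return True if the two spellings still differ after normalizing to `form`."""
    left = unicodedata.normalize(form, precomposed)
    right = unicodedata.normalize(form, sequence)
    return left != right


# An empty decomposition string means the character has no decomposition mapping.
has_decomposition = unicodedata.decomposition(precomposed) != ""

# Check every normalization form, not just the NFC used by IDNA.
results = {form: is_distinct_under(form) for form in ("NFC", "NFD", "NFKC", "NFKD")}

print(has_decomposition, results)
```
    </pre>
    Since normalization will not fold the two together, any equivalence between them has to come from registry policy instead.<br>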

    <blockquote cite="mid:20150122043909.GX2350@localhost" type="cite">

      <pre wrap="">

On Wed, Jan 21, 2015 at 06:58:09PM -0800, Asmus Freytag wrote:

</pre>

      <blockquote type="cite">

        <pre wrap="">On 1/21/2015 1:31 PM, Nico Williams wrote:

</pre>

        <blockquote type="cite">

          <pre wrap="">On Wed, Jan 21, 2015 at 03:33:12PM -0500, <a class="moz-txt-link-abbreviated" href="mailto:cowan@ccil.org">cowan@ccil.org</a> wrote:

</pre>

          <blockquote type="cite">

            <pre wrap="">John C Klensin scripsit:

</pre>

          </blockquote>

          <pre wrap="">[...]

</pre>

        </blockquote>

        <pre wrap="">

Asserting, to the contrary, that there should be a principle that

requires that all

homographs are the same abstract character, would mean to base encoding

[...]

</pre>

      </blockquote>

      <pre wrap="">

No one made that assertion.  (I trimmed the quotes, but they're in the

archive; readers can go look for themselves.)</pre>

    </blockquote>

    <br>

    I overstated that a bit to make a point.<br>

    <blockquote cite="mid:20150122043909.GX2350@localhost" type="cite">

      <pre wrap="">

But I am curious as to how people writing in Arabic make this

distinction when writing with pen and paper.  And if they don't, why

that distinction should be made in Unicode (I can think of good

reasons). </pre>

    </blockquote>

    <br>

    How can people distinguish TAMIL LETTER KA and TAMIL DIGIT ONE in
    hot-metal typography?<br>

    <br>

    They don't. But to get both number processing and sorting to happen
    in a sane fashion, the encoding has to respect that each belongs to
    its own sequence (the letters and the digits), and that those
    sequences do not normally overlap.<br>

    <br>

    In other words (and that is very much part of what I am driving at)

    display - or identifier uniqueness - is not the only constraint

    Unicode faces in its design.<br>
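    <br>
    The Tamil case can be checked directly. This is a small sketch in Python using only the standard <code>unicodedata</code> module; the variable names are illustrative. It shows that the letter and the digit carry different properties, which is what number parsing relies on, however alike the glyphs look.<br>
    <pre wrap="">
```python
import unicodedata

ka = "\u0b95"         # TAMIL LETTER KA
digit_one = "\u0be7"  # TAMIL DIGIT ONE

# General categories: a letter versus a decimal digit.
ka_category = unicodedata.category(ka)
one_category = unicodedata.category(digit_one)

# Only the digit has a digit value; the letter yields the supplied default.
one_value = unicodedata.digit(digit_one)
ka_value = unicodedata.digit(ka, None)

# The Tamil digits zero through nine form one contiguous run of code points.
tamil_digits = [unicodedata.digit(chr(cp)) for cp in range(0x0BE6, 0x0BF0)]

# int() accepts Unicode decimal digits, so two TAMIL DIGIT ONEs parse as eleven.
parsed = int(digit_one * 2)

print(ka_category, one_category, one_value, ka_value, tamil_digits, parsed)
```
    </pre>
    Had the two been unified on shape alone, either the letter would parse as a number or the digit would sort with the letters.<br>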

    <br>

    <blockquote cite="mid:20150122043909.GX2350@localhost" type="cite">

      <pre wrap=""> (I'm NOT saying that there shouldn't be such a distinction,

just curious as to why there is one.) Unicode 7.0 doesn't answer this

question.  I doubt many here might know, and it will be just fine if I

never get an answer to that question.</pre>

    </blockquote>

    <br>

    I was not involved in the actual decisions on Unicode 7.0.0, so I'll
    sidestep the question about the Arabic code point itself. Others
    who were involved have written very cogent summaries of the issue
    from the encoding perspective, though perhaps on a different mailing
    list.<br>

    <blockquote cite="mid:20150122043909.GX2350@localhost" type="cite">

      <pre wrap="">

</pre>

      <blockquote type="cite">

        <pre wrap="">decisions entirely on the shape, or appearance of characters and code point

sequences. Under that logic, Tamil LETTER KA and TAMIL DIGIT 1 would be the

same abstract character, and a (non-identity) decomposition would be

required.

That's just not how it works.

</pre>

      </blockquote>

      <pre wrap="">

Clearly similar letters from different scripts should get different

codepoints, confusables be damned.  I think no one _today_ will argue

otherwise.</pre>

    </blockquote>

    <br>

    The strict separation between related scripts (as in the case of
    Latin, Greek and Cyrillic) works well for some tasks, but it
    necessarily leads to a significant number of cross-script
    homographs, especially as people continue to "borrow" from one
    script into another ('q' and 'w' were borrowed from Latin to write
    Kurdish in Cyrillic, to give one historically recent example; they
    are now separately encoded as Cyrillic letters).<br>
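    <br>
    The Kurdish example can be illustrated with a short Python sketch. Python's standard library has no script property, so the helper below (a hypothetical name) takes the first word of the character name as a rough stand-in for the script; that is an assumption of this sketch, not a real script lookup.<br>
    <pre wrap="">
```python
import unicodedata

# Each pair: the Latin letter and its separately encoded Cyrillic counterpart.
pairs = [("q", "\u051b"), ("w", "\u051d")]


def script_of(ch):
    """Rough heuristic for this sketch only: the first word of the character name."""
    return unicodedata.name(ch).split()[0]


# For each pair, record both scripts and whether NFKC folds the Cyrillic
# letter into the Latin one.
report = [
    (script_of(latin), script_of(cyrillic),
     unicodedata.normalize("NFKC", cyrillic) == latin)
    for latin, cyrillic in pairs
]

print(report)
```
    </pre>
    The pairs stay distinct under normalization, which is exactly what makes them cross-script homographs.<br>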

    <br>

    I do think it's the correct decision for a universal encoding
    standard (and that's why it has become the accepted solution).
    However, from a pure display or unique-identifier perspective, an
    encoding where identical (not merely similar) shapes are encoded
    only once would be appealing in many ways.<br>

    <br>

    <blockquote cite="mid:20150122043909.GX2350@localhost" type="cite">

      <pre wrap="">

</pre>

      <blockquote type="cite">

        <pre wrap="">That said, Unicode is generally (and correctly) reluctant to encode

homographs.

One of the earliest and most ardently requested changes was the proposed

separation of "period" and "decimal point". It got rejected, and it

was not the

only one. Where homographs are encoded, they generally follow certain

</pre>

      </blockquote>

      <pre wrap="">

We have enough periods (and spaces, and...).  It's nice to know we have

one fewer than we could have ended up with.</pre>

    </blockquote>

    <br>

    Way more than one fewer. I gave you as an example only one of the
    <u>earliest</u> (and oft-repeated) requests for such disunification.<br>

    <blockquote cite="mid:20150122043909.GX2350@localhost" type="cite">

      <pre wrap="">

</pre>

      <blockquote type="cite">

        <pre wrap="">principles. And while these principles will, over time, lead to the

encoding of

a few more homographs, they, in turn, keep things predictable.

From my understanding, the case in question fully follows these principles

as they are applicable to the encoding of characters for the Arabic script.

</pre>

        <blockquote type="cite">

          <blockquote type="cite">

            <pre wrap="">

[...]

</pre>

          </blockquote>

          <pre wrap="">Should we treat all of these as confusables?

</pre>

        </blockquote>

        <pre wrap="">Yes, that's the obvious way to handle them. If you have zones that support

the concept of (blocked) variants, you can go further and make them that,

which has the effect of making them confusables that are declared up front

as such in the policy, not "discovered" in later steps of string

review and analysis.

</pre>

      </blockquote>

      <pre wrap="">

Agreed.

Nico

</pre>

    </blockquote>
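    <br>
    For what it's worth, the blocked-variant idea quoted above can be sketched as follows. Everything here is hypothetical: the <code>VARIANTS</code> table, <code>variant_key</code> and the <code>Zone</code> class are names invented for illustration, the table holds a single entry, and a real zone policy would be far more elaborate. The point is only that the variant relation is declared up front, so a clash is caught at registration time rather than in later string review.<br>
    <pre wrap="">
```python
# Hypothetical variant table: each listed code point or sequence is mapped
# to one representative spelling.
VARIANTS = {
    "\u08a1": "\u0628\u0654",  # BEH WITH HAMZA ABOVE maps to BEH + HAMZA ABOVE
}


def variant_key(label):
    """Rewrite every listed variant to its representative to get a comparison key."""
    key = label
    for source, representative in VARIANTS.items():
        key = key.replace(source, representative)
    return key


class Zone:
    """Toy registry that blocks a label when one of its variants is already taken."""

    def __init__(self):
        self._taken = {}  # comparison key mapped to the label that claimed it

    def register(self, label):
        key = variant_key(label)
        if key in self._taken:
            return False  # blocked: the label or one of its variants exists
        self._taken[key] = label
        return True


zone = Zone()
first = zone.register("\u0628\u0654\u0627")  # spelled with the sequence
second = zone.register("\u08a1\u0627")       # same label spelled with U+08A1
third = zone.register("\u0627")              # unrelated label
print(first, second, third)
```
    </pre>
    Only the first spelling to arrive is delegated; the other is blocked without anyone having to spot the confusable by eye.<br>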

    A./<br>

  </body>

</html>