<html>

  <head>

    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">On 1/21/2015 1:31 PM, Nico Williams

      wrote:<br>

    </div>

    <blockquote cite="mid:20150121213124.GV2350@localhost" type="cite">

      <pre wrap="">On Wed, Jan 21, 2015 at 03:33:12PM -0500, <a class="moz-txt-link-abbreviated" href="mailto:cowan@ccil.org">cowan@ccil.org</a> wrote:

</pre>

      <blockquote type="cite">

        <pre wrap="">John C Klensin scripsit:

</pre>

        <blockquote type="cite">

          <pre wrap="">But, while U+08A1 is abstract-character-identical and even

plausible-name-identical to U+0628 U+0654, it does _not_

decompose into the latter.  Instead, NFD(U+08A1) = NFC(U+08A1) =

U+08A1.  NFC (U+0628 U+0654) is U+0628 U+0654 as one would

expect from the stability rules; from that perspective, it is

the failure of U+08A1 to have a (non-identity) decomposition

that is the issue.

</pre>

        </blockquote>

        <pre wrap="">

If U+08A1 had such a decomposition, it would violate Unicode's

no-new-NFC rule.  What it violates is the (false) assumption that

base1 + combining is never confusable with a canonically

non-equivalent base2.  Even outside Arabic there are already

such cases:</pre>

      </blockquote>

    </blockquote>

    <br>

    I would go further and claim that the notion that "<b>all
      homographs are the same abstract character</b>" is <b>misplaced, if not
      incorrect</b>. Canonical<br>
    normalization was created to identify cases where homographs, that is,
    characters or<br>
    sequences of normally identical appearance, were really cases of the
    same thing<br>
    being encoded twice. Where that is not the case, the homographs are
    either<br>
    not equivalent under any normalization or, sometimes (especially for
    near homographs),<br>
    related only by a "compatibility" normalization (e.g. NF<b>K</b>C).<br>

    <br>
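    The normalization behaviour described above can be checked directly.
    The following is a minimal sketch, assuming a Python 3<br>
    interpreter whose <code>unicodedata</code> module is built against
    Unicode 7.0 or later; the helper name <code>forms</code> and the
    constants<br>
    are mine, chosen for illustration only.<br>
    <pre wrap="">
```python
import unicodedata

BEH_HAMZA = "\u08a1"             # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_PLUS_HAMZA = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE
ALEF_HAMZA = "\u0623"            # ARABIC LETTER ALEF WITH HAMZA ABOVE
FI_LIGATURE = "\ufb01"           # LATIN SMALL LIGATURE FI

def forms(text):
    """Return the NFC, NFD and NFKC forms of text as lists of code point labels."""
    return {
        form: ["U+%04X" % ord(ch) for ch in unicodedata.normalize(form, text)]
        for form in ("NFC", "NFD", "NFKC")
    }

# U+08A1 has no decomposition, so every form leaves it unchanged.
print(forms(BEH_HAMZA))
# The two-code-point sequence does not compose into U+08A1 either.
print(forms(BEH_PLUS_HAMZA))
# Contrast: U+0623 is canonically equivalent to U+0627 U+0654.
print(forms(ALEF_HAMZA))
# Contrast: a compatibility relation only shows up under NFKC.
print(forms(FI_LIGATURE))
```
</pre>
    The first two results differ under every normalization form, which
    is exactly the situation under discussion.<br>
    <br>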

    U+08A1 is not the only character that has a non-decomposable
    homograph.<br>
    Because its encoding was not an accident, but follows a
    principle applied<br>
    by the Unicode Technical Committee, it will not, and cannot, be the last
    instance of<br>
    a non-decomposable homograph.<br>

    <br>

    The "failure of U+08A1 to have a (non-identity) decomposition"
    may well<br>
    complicate the design of a system of robust mnemonic identifiers
    (such as IDNs),<br>
    but it does not appear to be due to a "breakdown" of the encoding process,
    nor does<br>
    it break any of the encoding stability promises made by the
    Unicode<br>
    Consortium.<br>

    <br>

    Rather, it represents a reasoned and principled judgment of what is

    or isn't the<br>

    "same abstract character". That judgment has to be made somewhere in

    the<br>

    process, and the bodies responsible for character encoding get to

    make the<br>

    determination.<br>

    <br>

    Asserting, to the contrary, a principle that
    requires all<br>
    homographs to be the same abstract character would mean basing
    encoding<br>
    decisions entirely on the shape or appearance of characters and
    code point<br>
    sequences. Under that logic, TAMIL LETTER KA and TAMIL DIGIT ONE would
    be the<br>
    same abstract character, and a (non-identity) decomposition would be
    required.<br>

    <br>

    That's just not how it works.<br>

    <br>
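    The Tamil example can be made concrete by looking at the character
    properties. This is only a sketch, assuming Python's<br>
    <code>unicodedata</code> module; the helper <code>describe</code>
    is a hypothetical name introduced for illustration.<br>
    <pre wrap="">
```python
import unicodedata

KA = "\u0b95"         # TAMIL LETTER KA
DIGIT_ONE = "\u0be7"  # TAMIL DIGIT ONE

def describe(ch):
    """Collect the character properties that tell look-alike characters apart."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),            # Lo = letter, Nd = decimal digit
        "decomposition": unicodedata.decomposition(ch),  # empty string if there is none
        "numeric": unicodedata.numeric(ch, None),        # None if there is no numeric value
    }

for ch in (KA, DIGIT_ONE):
    print("U+%04X" % ord(ch), describe(ch))
```
</pre>
    The two characters share a shape, but one is a letter and the other
    a decimal digit, and neither has a decomposition.<br>
    <br>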

    That said, Unicode is generally (and correctly) reluctant to encode

    homographs.<br>

    One of the earliest and most ardently requested changes was the

    proposed<br>

    separation of "period" and "decimal point". It was rejected, and it
    was not the<br>
    only such request. Where homographs are encoded, they generally follow
    certain<br>
    principles. And while these principles will, over time, lead to the
    encoding of<br>
    a few more homographs, they also keep things predictable. <br>

    <br>

    From my understanding, the case in question fully follows these

    principles<br>

    as they are applicable to the encoding of characters for the Arabic

    script.<br>

    <br>

    <blockquote cite="mid:20150121213124.GV2350@localhost" type="cite">

      <blockquote type="cite">

        <pre wrap="">

[...]

</pre>

      </blockquote>

      <pre wrap="">

Should we treat all of these as confusables?

</pre>

    </blockquote>

    Yes, that's the obvious way to handle them. If you have zones that
    support<br>
    the concept of (blocked) variants, you can go further and define them
    as such.<br>
    That makes them confusables that are declared up
    front<br>
    in the policy, rather than "discovered" in later steps of string
    review and analysis.<br>

    <br>
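    To illustrate what "declared up front" could look like, here is a toy
    sketch of a zone with blocked variants. Everything in it<br>
    (the <code>VARIANTS</code> table, <code>variant_key</code>, the
    <code>Zone</code> class) is hypothetical and not taken from any real
    registry<br>
    policy; a real policy would cover far more sequences and would
    normalize labels first.<br>
    <pre wrap="">
```python
# Hypothetical policy table: each declared variant maps to one representative.
VARIANTS = {
    "\u08a1": "\u0628\u0654",  # BEH WITH HAMZA ABOVE ~ BEH + HAMZA ABOVE
}

def variant_key(label):
    """Replace every declared variant by its representative sequence."""
    for variant, representative in VARIANTS.items():
        label = label.replace(variant, representative)
    return label

class Zone:
    """Toy registry: one label per variant set, all other variants blocked."""

    def __init__(self):
        self._by_key = {}  # variant key mapped to the label that holds it

    def register(self, label):
        key = variant_key(label)
        holder = self._by_key.get(key)
        if holder is None:
            self._by_key[key] = label
            return "registered"
        # Same variant set: either the identical label or a blocked variant.
        return "already registered" if holder == label else "blocked"

zone = Zone()
print(zone.register("\u0628\u0654\u0627"))  # registered
print(zone.register("\u08a1\u0627"))        # blocked
```
</pre>
    The second label is refused by policy at registration time, without
    anyone having to spot the visual collision later.<br>
    <br>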

    A./<br>

    <blockquote cite="mid:20150121213124.GV2350@localhost" type="cite">

      <pre wrap="">

Nico

</pre>

    </blockquote>

    <br>

  </body>

</html>