<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 1/21/2015 1:31 PM, Nico Williams
wrote:<br>
</div>
<blockquote cite="mid:20150121213124.GV2350@localhost" type="cite">
<pre wrap="">On Wed, Jan 21, 2015 at 03:33:12PM -0500, <a class="moz-txt-link-abbreviated" href="mailto:cowan@ccil.org">cowan@ccil.org</a> wrote:
</pre>
<blockquote type="cite">
<pre wrap="">John C Klensin scripsit:
</pre>
<blockquote type="cite">
<pre wrap="">But, while U+08A1 is abstract-character-identical and even
plausible-name-identical to U+0628 U+0654, it does _not_
decompose into the latter. Instead, NFD(U+08A1) = NFC(U+08A1) =
U+08A1. NFC (U+0628 U+0654) is U+0628 U+0654 as one would
expect from the stability rules; from that perspective, it is
the failure of U+08A1 to have a (non-identity) decomposition
that is the issue.
</pre>
</blockquote>
<pre wrap="">
If U+08A1 had such a decomposition, it would violate Unicode's
no-new-NFC rule. What it violates is the (false) assumption that
base1 + combining is never confusable with a canonically
non-equivalent base2. Even outside Arabic there are already
such cases:</pre>
</blockquote>
</blockquote>
<br>
I would go further, and claim that the notion that "<b>all
homographs are the<br>
same abstract character</b>" is <b>misplaced, if not
incorrect</b>. The notion of canonical<br>
normalization was created to identify cases where homographs
(characters or<br>
sequences of normally identical appearance) were really cases of
the same thing<br>
being encoded twice. Where that is not the case, the homographs
are either<br>
not equivalent under canonical normalization, or (sometimes,
esp. in cases of<br>
near homographs) they are related only by a "compatibility"
normalization (e.g. NF<b>K</b>C).<br>
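The distinction can be checked directly with Python's <code>unicodedata</code>
module; a minimal sketch using the code points discussed in this thread:<br>

```python
import unicodedata

precomposed = "\u08A1"      # ARABIC LETTER BEH WITH HAMZA ABOVE
sequence = "\u0628\u0654"   # ARABIC LETTER BEH + combining HAMZA ABOVE

# U+08A1 has no (non-identity) canonical decomposition: NFD leaves it alone.
assert unicodedata.normalize("NFD", precomposed) == precomposed

# Per the no-new-NFC stability rule, the sequence does not compose either,
# so the two homographs stay canonically inequivalent.
assert unicodedata.normalize("NFC", sequence) == sequence
assert unicodedata.normalize("NFC", sequence) != precomposed

# Compatibility normalization (NFKC) relates other near-homographs,
# e.g. CIRCLED DIGIT ONE folds to a plain "1":
assert unicodedata.normalize("NFKC", "\u2460") == "1"
```
<br>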
<br>
U+08A1 is not the only character that has a non-decomposable
homograph, and<br>
because its encoding wasn't an accident, but follows a principle
applied<br>
by the Unicode Technical Committee, it won't, and can't, be the
last instance of<br>
a non-decomposable homograph.<br>
<br>
The "failure of U+08A1 to have a (non-identity) decomposition",
while it perhaps<br>
complicates the design of a system of robust mnemonic identifiers
(such as IDNs),<br>
appears not to be due to a "breakdown" of the encoding process,
and it does<br>
not constitute a break of any encoding stability promises by the
Unicode<br>
Consortium.<br>
<br>
Rather, it represents a reasoned and principled judgment of what
is or isn't the<br>
"same abstract character". That judgment has to be made somewhere
in the<br>
process, and the bodies responsible for character encoding get to
make the<br>
determination.<br>
<br>
Asserting, to the contrary, that there should be a principle
requiring that all<br>
homographs be the same abstract character would mean basing
encoding<br>
decisions entirely on the shape, or appearance, of characters and
code point<br>
sequences. Under that logic, TAMIL LETTER KA and TAMIL DIGIT ONE
would be the<br>
same abstract character, and a (non-identity) decomposition would
be required.<br>
<br>
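The Tamil pair makes the point concrete: the two code points are distinct
named characters, and canonical normalization does not relate them. A small
check with Python's <code>unicodedata</code> (U+0B95 and U+0BE7 are the code
points for the character names above):<br>

```python
import unicodedata

ka = "\u0B95"    # TAMIL LETTER KA
one = "\u0BE7"   # TAMIL DIGIT ONE

# Distinct abstract characters despite the similar shape...
assert unicodedata.name(ka) == "TAMIL LETTER KA"
assert unicodedata.name(one) == "TAMIL DIGIT ONE"

# ...and neither has a (non-identity) canonical decomposition.
assert unicodedata.normalize("NFD", ka) == ka
assert unicodedata.normalize("NFD", one) == one
assert unicodedata.normalize("NFC", ka) != unicodedata.normalize("NFC", one)
```
<br>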
That's just not how it works.<br>
<br>
That said, Unicode is generally (and correctly) reluctant to encode
homographs.<br>
One of the earliest and most ardently requested changes was the
proposed<br>
separation of "period" and "decimal point". It got rejected, and it
was not the<br>
only such request. Where homographs are encoded, they generally
follow certain<br>
principles. And while these principles will, over time, lead to the
encoding of<br>
a few more homographs, they, in turn, keep things predictable.<br>
<br>
From my understanding, the case in question fully follows these
principles<br>
as they apply to the encoding of characters for the Arabic
script.<br>
<br>
<blockquote cite="mid:20150121213124.GV2350@localhost" type="cite">
<blockquote type="cite">
<pre wrap="">
[...]
</pre>
</blockquote>
<pre wrap="">
Should we treat all of these as confusables?
</pre>
</blockquote>
Yes, that's the obvious way to handle them. If you have zones that
support<br>
the concept of (blocked) variants, you can go further and declare
them as such,<br>
which has the effect of making them confusables that are declared
up front<br>
in the policy, not "discovered" in later steps of string review
and analysis.<br>
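One way to realize such up-front declarations is a simple variant-folding
table in the zone policy. The sketch below is hypothetical: the table contents
and the function name are illustrative, not taken from any actual registry
policy.<br>

```python
# Hypothetical blocked-variant table: fold each declared homograph to one
# representative form, so variant labels collide at registration time.
BLOCKED_VARIANTS = {
    "\u08A1": "\u0628\u0654",   # BEH WITH HAMZA ABOVE -> BEH + HAMZA ABOVE
}

def variant_key(label: str) -> str:
    """Return a key that is identical for all declared variants of a label."""
    return "".join(BLOCKED_VARIANTS.get(ch, ch) for ch in label)

# The precomposed and sequence spellings of a label map to the same key, so
# a zone can block the second registration without later "discovery" of the
# confusable pair during string review.
assert variant_key("\u08A1\u0628") == variant_key("\u0628\u0654\u0628")
```
<br>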
<br>
A./<br>
<blockquote cite="mid:20150121213124.GV2350@localhost" type="cite">
<pre wrap="">
Nico
</pre>
</blockquote>
<br>
</body>
</html>