<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

<meta name="Generator" content="Microsoft Word 14 (filtered medium)">

<style><!--

/* Font Definitions */

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

@font-face

        {font-family:Tahoma;

        panose-1:2 11 6 4 3 5 4 4 2 4;}

@font-face

        {font-family:Consolas;

        panose-1:2 11 6 9 2 2 4 3 2 4;}

@font-face

        {font-family:Candara;

        panose-1:2 14 5 2 3 3 3 2 2 4;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0cm;

        margin-bottom:.0001pt;

        font-size:12.0pt;

        font-family:"Times New Roman","serif";

        color:black;}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {mso-style-priority:99;

        color:purple;

        text-decoration:underline;}

pre

        {mso-style-priority:99;

        mso-style-link:"HTML Preformatted Char";

        margin:0cm;

        margin-bottom:.0001pt;

        font-size:10.0pt;

        font-family:"Courier New";

        color:black;}

span.HTMLPreformattedChar

        {mso-style-name:"HTML Preformatted Char";

        mso-style-priority:99;

        mso-style-link:"HTML Preformatted";

        font-family:Consolas;

        color:black;}

span.EmailStyle19

        {mso-style-type:personal-reply;

        font-family:"Calibri","sans-serif";

        color:#1F497D;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-size:10.0pt;}

@page WordSection1

        {size:612.0pt 792.0pt;

        margin:72.0pt 90.0pt 72.0pt 90.0pt;}

div.WordSection1

        {page:WordSection1;}

--></style><!--[if gte mso 9]><xml>

<o:shapedefaults v:ext="edit" spidmax="1026" />

</xml><![endif]--><!--[if gte mso 9]><xml>

<o:shapelayout v:ext="edit">

<o:idmap v:ext="edit" data="1" />

</o:shapelayout></xml><![endif]-->

</head>

<body bgcolor="white" lang="EN-US" link="blue" vlink="purple">

<div class="WordSection1">

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">Dear All,<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"><o:p> </o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">I haven’t kept up with the IDNA mailing list lately because of my work with TF-AIDN (the group assigned by ICANN to work on the LGR for the Arabic script). I read

 the IAB statement only once ICANN staff forwarded it to us. We were aware of the character U+08A1 and the confusability caused by the absence of NFC normalization for it. Their concern is valid and reasonable, but ONLY for this character (U+08A1).

</span><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">The problem is at the end of their statement, where the IAB gives inaccurate information: that the characters U+0623, U+0624, U+0626, U+0677, U+06C2 and U+06D3 aren’t canonically

 equivalent to &lt;character&gt; followed by U+0654, ARABIC HAMZA ABOVE. This has two problems: 1) the information is inaccurate, and 2) these characters are safe and are very important to the languages that use the Arabic script. It is like dropping

 vowels from English! Their statement should be restricted to ONLY the character U+08A1, ARABIC LETTER BEH WITH HAMZA ABOVE. If this statement is adopted as written, it is going to murder the language, and it’ll be very hard for normal users to form a lot of words!<o:p></o:p></span></p>
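
<p class="MsoNormal">For anyone who wants to check which of the code points above decompose canonically into a base letter followed by U+0654, here is a minimal sketch using Python's standard unicodedata module. The helper names (canonical_hamza_base, report) are mine and purely illustrative, and the result depends on the Unicode version shipped with the Python build in use.</p>

<pre>
```python
import unicodedata

HAMZA_ABOVE = "\u0654"  # ARABIC HAMZA ABOVE (combining mark)

# Code points named in this thread, plus U+08A1 for comparison.
CODE_POINTS = [0x0623, 0x0624, 0x0626, 0x0677, 0x06C2, 0x06D3, 0x08A1]


def canonical_hamza_base(cp):
    """Return the base letter if cp is canonically base + U+0654, else None."""
    # NFD applies canonical decompositions only; compatibility mappings are ignored.
    decomposed = unicodedata.normalize("NFD", chr(cp))
    if len(decomposed) == 2 and decomposed[1] == HAMZA_ABOVE:
        return decomposed[0]
    return None


def report():
    """Map each code point to its base letter (as an int), or None."""
    result = {}
    for cp in CODE_POINTS:
        base = canonical_hamza_base(cp)
        result[cp] = ord(base) if base is not None else None
    return result


if __name__ == "__main__":
    for cp, base in report().items():
        label = "U+%04X" % cp
        if base is None:
            print(label, "has no canonical decomposition with U+0654")
        else:
            print(label, "= U+%04X + U+0654 under NFC/NFD" % base)
```
</pre>

<p class="MsoNormal">Because the check goes through NFD, a code point that has only a compatibility mapping is reported the same way as one with no decomposition at all.</p>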

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"><o:p> </o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">I hope my concern is clear. We should reconsider their statement in light of the concerns above.<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"><o:p> </o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">AbdulRahman,</span><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"><o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"><o:p> </o:p></span></p>

<div>

<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm">

<p class="MsoNormal"><b><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif";color:windowtext">From:</span></b><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif";color:windowtext"> Idna-update [mailto:idna-update-bounces@alvestrand.no]

<b>On Behalf Of </b>Asmus Freytag<br>

<b>Sent:</b> Thursday, January 29, 2015 7:35 AM<br>

<b>To:</b> John C Klensin; Shawn Steele; Vint Cerf<br>

<b>Cc:</b> IDNA update work<br>

<b>Subject:</b> Re: IAB Statement on Identifiers and Unicode 7.0.0<o:p></o:p></span></p>

</div>

</div>

<p class="MsoNormal"><o:p> </o:p></p>

<div>

<p class="MsoNormal">On 1/28/2015 7:02 PM, John C Klensin wrote:<o:p></o:p></p>

</div>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<pre>This is getting tedious.  Vint has explained, Andrew has<o:p></o:p></pre>

<pre>explained, Pete has explained, Patrik has explained, and I have<o:p></o:p></pre>

<pre>explained, each in different ways (and my apologies to anyone<o:p></o:p></pre>

<pre>I've left out), that the examples in the IAB statement are not<o:p></o:p></pre>

<pre>the problem.  They are symptoms of what may be a fundamental<o:p></o:p></pre>

<pre>misunderstanding in the IDNA design (specifically that we may<o:p></o:p></pre>

<pre>not have used the right set of properties) and perhaps an even<o:p></o:p></pre>

<pre>more fundamental one (specifically that the necessary set of<o:p></o:p></pre>

<pre>properties may not exist or be complete).<o:p></o:p></pre>

</blockquote>

<p class="MsoNormal" style="margin-bottom:12.0pt"><br>

<span style="font-family:"Candara","sans-serif"">What we've heard, mostly, was the assertion that there are cases<br>

for which there needs to be "normalization"-plus.<br>

</span><br>

<span style="font-family:"Candara","sans-serif"">(In this contribution, I will attempt to triage *all* known, and not<br>

yet known cases, based on differences in their structural typology<br>

and usage scenarios. Therefore, please compare the discussion below<br>

not just to a few, but any cases that you are aware of, and let me<br>

know if you have examples that you think do not fit.)<br>

<br>

Let me start with the basics:<br>

<br>

Normalization asserts that two sequences are equivalent; canonical<br>

normalization asserts that they are fully equivalent based on their<br>

underlying identity.<br>

<br>

For the cases where Unicode does not provide normalization,<br>

the argument by the UTC is that the equivalence, if there is one at all,<br>

is in appearance only and not manifest in the underlying identity.<br>

<br>

In most of these cases, attempting to assert such an equivalence<br>

after the fact with some additional normalization step, based on<br>

whatever additional properties, is *<u>not</u>* the correct strategy.<br>

<br>

In the vast majority of cases this is the wrong strategy for the <br>

simple reason that nobody (other than malicious users) would <br>

ever use what looks like the decomposed form. Only the<br>

composite form is actually used - the Danish o-slash is a typical<br>

example. Given that, the simple solution on the protocol level <br>

for such cases is adding a context rule that <u>prevents </u>"fake" <br>

compositions - instead of making up false equivalences.<br>

<br>

It would be even easier to disallow certain combining marks<br>

altogether, but, if that's seen as too drastic, then by all means<br>

disallow them where they appear to form a composite that<br>

a) is visually equivalent to an encoded character<br>

b) is never expressed as a sequence in ordinary use<br>

<br>

Again, disallowing the mark altogether would be easiest; but<br>

because requirement (a) by definition is met by only a<br>

limited and enumerable set of code points for composites, it's<br>

possible to use context rules.  <br>

<br>

With clever design of some properties for the purpose, <br>

creating a general context rule appears possible.<br>

<br>

With this approach, I estimate that you catch 90%+ of <br>

all cases "missed" by NFC, including 90%+ of those that <br>

have been identified as concerns in this discussion.<br>

<br>

There are a small number of remaining cases, some of them <br>

digraphs (things that look exactly like two letters) and perhaps<br>

a few other ones.<br>

<br>

For the digraphs, asserting an equivalence via some algorithm<br>

that works like extended normalization <u>may</u> make sense, if<br>

the conclusion is that they must remain allowed (some are<br>

really special like the case of Latin digraphs for writing poetry<br>

in some African language).<br>

<br>

Because of the limited way these are used, the sequence of <br>

ordinary letters must be a not uncommon fall-back in many<br>

non-DNS situations anyway, so taking that fallback as the <br>

preferred form would make some sense. (Still, it's probably<br>

better to disallow all or most of the digraphs altogether - they<br>

really don't need to be supported.)<br>

<br>

Finally, we come to the 1% of 1% of cases where there may be<br>

actual use of both composites and look-alike sequences.<br>

This is the Arabic case that started all this. In such cases, if both<br>

forms are really used (by separate constituents) it's not <br>

possible to come to a sensible "preferred" form that doesn't<br>

play favorites in an arbitrary way. So, it's not possible to know <br>

what to normalize to!<br>

<br>

From a pragmatic point of view, and given that "look-alike" fades<br>

gradually into "look-nearly-alike" and then into "look-confusingly-<br>

similar" and so on, whenever such remaining edge cases are<br>

exhibited by *<u>rarely</u>* or *<u>very rarely</u>* used code points, the<br>

benefit of addressing them in the protocol, vs. relying on upper<br>

layers (like string similarity) becomes <u>vanishingly small</u>.<br>

<br>

This is Mark's and Shawn's point (and shared by many).<br>

<br>

If we can get some recognition and acknowledgement that<br>

solving arbitrarily minuscule problems on the protocol level<br>

(when bigger problems can only be addressed outside), is not<br>

productive, then we have a basis on which we can come<br>

together - such that we can look at the wider, and perhaps<br>

more relevant subset of cases and discuss solutions for them.<br>

<br>

Now, to come back out of that rat-hole, I want to reiterate<br>

that I see a number of classes of cases for which it is possible<br>

to construct a robust solution on the protocol level that <br>

does not necessarily have to be arbitrary -- or force users<br>

into creating strings that contain code point sequences that<br>

are explicitly discouraged for their language.<br>

<br>

The vast majority of these cases, to repeat from above, are<br>

those where only one form is in practical or recommended<br>

use. In these cases, finding a way to disallow the competing<br>

representation (usually a sequence) would be the answer<br>

that impacts non-malicious users the least.<br>

<br>

It would also be implementable using a combination of <br>

properties and context rules not too dissimilar from existing<br>

rules, and not require a new or modified normalization <br>

algorithm.<br>

<br>

In careful review, it might even be possible to establish that<br>

the set of code points that could be successfully normalized<br>

to a preferred form is empty, or contains only code points of<br>

such rarity that, speaking pragmatically, adding a whole <br>

algorithm for their sake is not appropriate.<br>

<br>

(For the root, the draft designs for Arabic side-step the issue<br>

by disallowing the combining hamza, along with a number <br>

of other combining marks that are felt unnecessary for the<br>

purpose of creating identifiers).<br>

<br>

This radical step may not be possible on the protocol level,<br>

particularly as some combining marks may be needed for<br>

novel combinations, which may be difficult to enumerate<br>

in advance for the entire DNS.<br>

<br>

(For the root, limiting combining marks to very specific<br>

contexts, which would then be explicitly enumerated,<br>

is one of the strategies we are looking at).<br>

<br>

A./<br>

<br>

PS: in the meantime, I continue to consider the IAB <br>

statement in its totality, and in particular in its immediate<br>

recommendations regarding Arabic not merely as not<br>

really helpful, but outright harmful.<br>

<br>

</span><o:p></o:p></p>
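
<p class="MsoNormal">The context-rule idea above (disallow a combining mark where it would form a look-alike of an encoded character that is never written as a sequence) can be sketched roughly as follows. This is only an illustration under stated assumptions: the table FAKE_COMPOSITIONS and the function violates_context_rule are hypothetical, the two entries are examples rather than a vetted list, and a real rule would be driven by properties rather than a hand-written table.</p>

<pre>
```python
import unicodedata

# Hypothetical table: (base, combining mark) pairs that render like an
# encoded character which NFC does not produce from the sequence.
FAKE_COMPOSITIONS = {
    ("o", "\u0338"): "\u00F8",  # o + COMBINING LONG SOLIDUS OVERLAY vs. o-slash
    ("l", "\u0335"): "\u0142",  # l + COMBINING SHORT STROKE OVERLAY vs. l-stroke
}


def violates_context_rule(label):
    """Return the offending (base, mark) pair, or None if the label is clean."""
    # Apply NFC first so that genuine canonical sequences are already composed.
    text = unicodedata.normalize("NFC", label)
    # Only adjacent pairs are checked; intervening marks are not handled here.
    for base, mark in zip(text, text[1:]):
        if (base, mark) in FAKE_COMPOSITIONS:
            return (base, mark)
    return None


if __name__ == "__main__":
    for candidate in ["s\u00F8ren", "so\u0338ren", "cafe\u0301"]:
        verdict = "rejected" if violates_context_rule(candidate) else "accepted"
        print(ascii(candidate), verdict)
```
</pre>

<p class="MsoNormal">The point of the design is that the label with the precomposed o-slash passes untouched, and only the sequence that imitates it is refused, so no new equivalence has to be invented.</p>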

</div>

</body>

</html>