Forgot to &quot;Reply to all&quot;<br><br>---------- Forwarded message ----------<br><span class="gmail_quote">From: <b class="gmail_sendername">Mark Davis</b> &lt;<a href="mailto:mark.davis@icu-project.org">mark.davis@icu-project.org

</a>&gt;<br>Date: Dec 16, 2006 6:49 PM<br>Subject: Re: Criteria for exceptional characters<br>To: Michael Everson &lt;<a href="mailto:everson@evertype.com">everson@evertype.com</a>&gt;<br><br>Thanks, comments below.

<br><br><div><span class="q"><span class="gmail_quote">On 12/16/06, <b class="gmail_sendername">Michael Everson</b> &lt;<a href="mailto:everson@evertype.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">

everson@evertype.com</a>&gt; wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

At 15:58 -0800 2006-12-16, Mark Davis wrote:<br><br>&gt;The major problems I see with the current system* are:<br>&gt;<br>&gt;1. It does not allow Unicode 5.0 characters.<br><br>To be honest, we MUST refer to Unicode 5.1. Of course, all characters in Unicode 5.0 are important, but if Unicode 5.1 is not taken as the benchmark, the Myanmar (Burmese) script will be left out, and that is simply not something that can be countenanced.

</blockquote></span><div><br>According to John, we can't wait that long. What I should have added is <br><br>1a. some kind of process that makes it easy to update to successive versions of Unicode. [Having the kind of property-based approach that we are developing is a solid step in that direction.]

</div><span class="q"> <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">&gt;2. It restricts some combinations that are required for certain languages.

<br>&gt;&nbsp;&nbsp; a) Mn at the end of BIDI fields<br><br>This prevents Thaana from being used, as well as<br>Yiddish, and probably a number of languages which<br>use the Arabic script.</blockquote></span><div><br>right <br></div>

<span class="q"><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

&gt;&nbsp;&nbsp; b) ZWJ/NJ in limited contexts<br><br>A problem for some Brahmic scripts, at least some<br>of the major scripts of India.</blockquote></span><div><br>right <br></div><span class="q"><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

&gt;3. There are concerns about the stability of normalization<br><br>Are they valid? What are they, specifically?</blockquote></span><div><br>See <a href="http://www.unicode.org/reports/tr15/#Versioning" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">

http://www.unicode.org/reports/tr15/#Versioning

</a> <br></div><span class="q"><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">&gt;4. There are opportunities for spoofing. This<br>&gt;breaks down into a number of sub-problems, of

<br>&gt;which the major ones are:<br>&gt;&nbsp;&nbsp; a) non-letter confusables, like fraction slash<br>&gt;in<br>&gt;&lt;<a href="http://amazon.com/badguy.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">

http://amazon.com/badguy.com</a>&gt;<a href="http://amazon.com/badguy.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">

amazon.com/badguy.com</a><br>&gt;&nbsp;&nbsp; b) confusable letters/numbers within mixtures<br>&gt;of scripts, like cyrillic 'a' in paypal.<br><br>I thought we agreed to ban this kind of mixing long ago.</blockquote></span><div><br>

That is not in any of the proposals on the table (eg in internet drafts or our rule development), as far as I know. It is among the Unicode recommendations, but nobody has proposed it for the protocol. One has to be a bit careful to get the right level, given that certain orthographies (eg Japanese) use multiple scripts. For that reason, it is unclear whether this should be baked into the protocol, or left to more flexible mechanisms, like the user-agents.
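<br><br>To make the trade-off concrete, here is a minimal sketch of a mixed-script check with an allowance for Japanese. It is only an illustration, not any proposed rule: it assumes that the first word of a character's Unicode name approximates its script (the Python standard library does not expose the real Script property), and the names JAPANESE, scripts_in and is_suspicious_mix are hypothetical.

```python
import unicodedata

# Assumption: the first word of the character name ("LATIN", "CYRILLIC",
# "HIRAGANA", ...) stands in for the Unicode Script property.
# Hypothetical allowance for orthographies that legitimately mix scripts.
JAPANESE = {"HIRAGANA", "KATAKANA", "CJK"}

def scripts_in(label):
    """Return the approximate set of scripts used by the letters in label."""
    found = set()
    for ch in label:
        if unicodedata.category(ch)[0] != "L":
            continue  # digits, marks and punctuation are ignored in this sketch
        found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found

def is_suspicious_mix(label):
    """True if the label mixes scripts outside the Japanese allowance."""
    scripts = scripts_in(label)
    if len(scripts) <= 1:
        return False
    return not scripts.issubset(JAPANESE)

print(is_suspicious_mix("p\u0430ypal"))  # Cyrillic 'a' inside a Latin label
print(is_suspicious_mix("日本語のテスト"))  # Han + Hiragana + Katakana
```

A check like this could sit in the protocol, in a registry, or in a user agent; the sketch says nothing about which level is the right one.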

<br><br>See also:<br><br><a href="http://www.unicode.org/reports/tr36/#Security_Levels_and_Alerts" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">http://www.unicode.org/reports/tr36/#Security_Levels_and_Alerts

</a><br></div><div><span class="e" id="q_10f8e500c2cae2ae_11"><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

One ramification of this would be that it would no longer be possible to say that Kurdish Ww was LATIN W instead of CYRILLIC WE. We would be obliged, for security's sake, to encode CYRILLIC WE. There would be no disadvantage here. In fact,

<br>it would be better for Kurdish. Consider a<br>glossary of Kurdish words in its three<br>orthographies, Arabic, Latin, and Cyrillic. If<br>LATIN W and CYRILLIC WE are encoded separately,<br>it is possible to correctly sort (or search, à la

<br>Google) the multi-script list. If they are<br>unified with LATIN W (as at present), there is no<br>solution.<br><br>&gt;c) confusable letters in same script, like &lt;<a href="http://inte1.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">

http://inte1.com</a>&gt;

<a href="http://inte1.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">inte1.com</a><br>&gt;[There is a finer breakdown in<br>&gt;&lt;<a href="http://unicode.org/reports/tr36/" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">

http://unicode.org/reports/tr36/</a>&gt;<a href="http://unicode.org/reports/tr36/" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">http://unicode.org/reports/tr36/

</a>]<br><br>Well worth reading.<br><br>&gt;The reason I say &quot;system*&quot; is that the options<br>&gt;for solutions can be at different points:<br>&gt;<br>&gt;A. What should the protocol allow?<br>&gt;B. What should a registry allow?

<br>&gt;C. What should the user agent block (or flag, eg with raw punycode)?<br>&gt;<br>&gt;For example, nobody has yet proposed that the<br>&gt;protocol disallow mixtures of scripts, even<br>&gt;though that represents by far the largest

<br>&gt;opportunity for spoofing.<br><br>This is **NOT** correct. I have advocated this<br>more or less loudly since September 2005, when I<br>discussed the question at length with Cary Karp<br>when I was at the Sophia Antipolis meeting of WG2

<br>and advised him on the draft recommendations he<br>was writing. I continue to favour this<br>anti-spoofing solution, and if, as has been<br>suggested, Unicode script properties of<br>characters can be used to ensure that scripts are

<br>not mixed or mixable (modulo Jpan for instance)<br>then there should be no problem with this.<br><br>(The only problem I could see is that UTC would<br>have to accept CYRILLIC WE, and possibly LATIN<br>SOFT SIGN, LATIN THETA as characters used for

<br>specialist purposes. We are talking less than two<br>dozen characters here, and I'm being pretty<br>generous in my estimate. A small price to pay for<br>security.)</blockquote></span></div><div><br>see above <br></div>

<span class="q"><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

&gt;Instead, it appears that the solutions taken by<br>&gt;the user agents are sufficient there: while the<br>&gt;&quot; &lt;<a href="http://paypal.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">

http://paypal.com</a>&gt;<a href="http://paypal.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">paypal.com</a>&quot; case got a lot

<br>&gt;of attention, when you look at the actual,<br>&gt;practical impact in terms of real, reported<br>&gt;security problems, it is not in practice<br>&gt;significant. I have no doubt that the<br>&gt;user-agents will continue to refine and improve

<br>&gt;their approaches.<br><br>I don't understand how you can say that the <a href="http://paypal.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">paypal.com</a> case is &quot;insignificant&quot;.

</blockquote><div> What I said was:  &quot;when you look at the actual, practical impact in terms of real, reported security problems, it is not in practice significant&quot;. And this is, I believe, because of steps taken in the browsers to alert users to this. Such cases are quite easy to detect in the user-agent.

<br><br>Listings of actual reported fraud using this technique, and their impact, would be useful.<br></div><span class="q"><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

&gt;So, what progress are we making?<br>&gt;<br>&gt;1. Looks like we have a solution<br><br>*If* Unicode 5.1, and *if* IETF bites the bullet<br>and realizes that there will be a Unicode 6, and<br>7, and 8, which may have needed characters. (No,

<br>Vint, I'm not talking about non-essential<br>characters.)</blockquote></span><div><br>see above <br></div><span class="q"><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

&gt;2a. Also looks like we have a solution<br><br>If we change the rule.</blockquote><div> Everything I have to say is conditional on successful completion. What I mean by &quot;we have a solution&quot; is that it looks like we have consensus on an approach.

<br></div><span class="q"><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">&gt;2b. Not yet consensus on this<br><br>Going to have to bite this bullet for some

<br>scripts, but if script properties are accessed,<br>and the use of the joiners is restricted to<br>certain script (or even certain character)<br>environments, this may not be a problem.</blockquote></span><div><br>See also  

<a href="http://www.unicode.org/review/pr-96.html" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">

http://www.unicode.org/review/pr-96.html</a><br></div><span class="q"><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">&gt;3. Looks like we have a solution (restrict the

<br>&gt;sequences that could change between 3.2 and 5.0;<br>&gt;the Unicode consortium is tightening stability<br>&gt;to disallow further changes)<br><br>Please create a separate thread to discuss this particular issue.</blockquote>

</span><div><br>If and when it requires further discussion.<br></div><span class="q"><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">&gt;4a. Our proposed rules fix this. By tossing out

<br>&gt;all non-LMN, we remove the bad cases. Although<br>&gt;the problematic characters are a small fraction<br>&gt;of the few thousand characters in question, most<br>&gt;of which are not problematic, there is general<br>

&gt;agreement that as a class these are not needed,<br>&gt;and we are not worried about tossing out any<br>&gt;babies with the bathwater.<br><br>The list should be reviewed. I'm not saying you<br>haven't done a good job, but I haven't reviewed

<br>it, and I don't know if anyone else has either.</blockquote></span><div><br>The lists are there, and Patrik, Ken, myself and others have been working on them. If you are going to review them, you can start anytime ;-)
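<br><br>For anyone who wants to reproduce the broad shape of rule 4a while reviewing, here is a rough sketch. It assumes that &quot;non-LMN&quot; means characters whose Unicode General Category is not Letter (L), Mark (M) or Number (N); the function names are hypothetical, and the real lists contain exceptions that this does not model (for example, the ASCII hyphen is category Pd and would be rejected here).

```python
import unicodedata

def allowed_by_lmn(ch):
    """True if the General Category of ch starts with L, M or N."""
    return unicodedata.category(ch)[0] in "LMN"

def rejected_chars(label):
    """List the characters of label that the LMN filter would toss out."""
    return [ch for ch in label if not allowed_by_lmn(ch)]

# U+2044 FRACTION SLASH is a symbol (Sm), so it is filtered out.
print(rejected_chars("amazon\u2044badguy"))
# A combining acute accent (Mn) and digits (Nd) pass the filter.
print(rejected_chars("e\u0301"), rejected_chars("inte1"))
```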

<br></div><span class="q">

<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">&gt;4b. We're not tackling this in the protocol,<br>&gt;leaving it up to user agents (and to some degree

<br>&gt;registries).<br><br>I think it is NOT A GOOD IDEA not to tackle this<br>in the protocol. I think IT WOULD BE A VERY GOOD<br>IDEA for this to be dealt with in the protocol.<br>It would be far safer for the end user, because

there would be no danger of error (intentional or unintentional) on the part of agents or registries. We *should* police this because we can.</blockquote><div> I have no strong feeling either way. It is not difficult to do this in the user-agent.

<br></div><span class="q"><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">&gt;4c. Here I also suspect that the principal<br>&gt;solution is in the user agents, but what we can

<br>&gt;do at the protocol level is to make some<br>&gt;exclusions where there are clear cases that we<br>&gt;can handle via well-established properties, or<br>&gt;particular exception cases where we add or<br>&gt;remove particular character(s). What we have

<br>&gt;done so far is to toss out certain classes of<br>&gt;characters that are clearly not needed for<br>&gt;modern languages (historic scripts). [Here<br>&gt;again, frankly, their removal doesn't<br>&gt;fundamentally reduce spoofability, but it does

<br>&gt;little harm to remove them. But because there is<br>&gt;not much benefit to their removal, we don't<br>&gt;really need to argue whether there is a real<br>&gt;need for ones like Runic, because there aren't<br>&gt;really demonstrable problems with allowing it,

<br>&gt;given solutions in (4b).]<br><br>For this I suspect that the best we can do is<br>make recommendations. STRONG recommendations<br>based on real linguistic knowledge and data.<br>Recommendations so strong that a given registry

should have to give reasons for deviating from them.</blockquote><div> We're only discussing the protocol here. The question of who can force registries to &quot;give reasons&quot; is not one I want to get into here.

<br></div><span class="q"><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">&gt;(4c) is where your current question falls. These<br>&gt;are characters that are not covered by the rules

<br>&gt;we have developed so far. My suggestions for<br>&gt;criteria are:<br>&gt;<br>&gt;A. If there is a clearly defined class of<br>&gt;characters that are clearly never needed in<br>&gt;modern languages (in this case Hebrew/Yiddish),

<br>&gt;we can exclude them.<br><br>&quot;In this case&quot;? But I agree, linguistic expertise can help weed out characters which are really not needed.</blockquote><div> &quot;in this case&quot;: Cary was raising this issue with regard to certain Hebrew characters.

<br></div><div><span class="e" id="q_10f8e500c2cae2ae_31"><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">&gt;B. If there are particular characters that may

<br>&gt;be used as a normal part of the language that we

<br>&gt;want to consider including or excluding, then we<br>&gt;consider two factors in weighing the question:<br>&gt;<br>&gt;&nbsp;&nbsp; B1. Can this character cause a spoofing<br>&gt;problem in a monoscript string, and if so, how

<br>&gt;severe is the problem?<br>&gt;<br>&gt;&nbsp;&nbsp; B2. Is this character used in the regular<br>&gt;orthography of a modern language, and if so, how<br>&gt;essential is it?<br><br>Good questions, requiring linguistic expertise.

<br>This would be a &quot;white list&quot; sort of thing, not<br>something that could be done algorithmically.<br><br>&gt;We want to keep the exceptional characters<br>&gt;(included or excluded) that are not covered by<br>

&gt;the normal rules we've developed so far to a<br>&gt;minimum, so only those with a large negative<br>&gt;weight should get exceptionally excluded, and<br>&gt;only those with a high positive weight get<br>&gt;exceptionally included.

<br><br>Agreed.<br><br>&gt;For example, a character that looks like a<br>&gt;period or a slash (important syntax characters<br>&gt;in URLs), and is optional in the language (eg<br>&gt;used in abbreviations, but not regular words)

<br>&gt;gets a large negative weight. A character that<br>&gt;doesn't look like a syntax character or another<br>&gt;Hebrew character, and is required by common<br>&gt;Hebrew or Yiddish words would get a high<br>&gt;positive weight.

<br><br>Well, the Ethiopic wordspace looks like a colon<br>to readers of Latin script, and from a distance,<br>though its dots are square and not round.<br>However, it can ONLY occur between two ethiopic<br>SYLLABLEs, and (obviously) if it were entered

<br>accidentally inside &quot;http://&quot; it would cause no<br>difficulty, because that would be no different<br>from entering &quot;http$//&quot; -- it would have no<br>effect because it is not a protocol element.</blockquote>

</span></div><div><br>This is where one has to have a more thorough knowledge of the syntax, which the DNS honchos here can obviously supply. For example, a URL can contain a colon in several other positions,</div><span>

 such as:<br><br>

http://&lt;user&gt;:&lt;password&gt;@&lt;host&gt;:&lt;port&gt;/&lt;url-path&gt;<br><br>From your end, any characters that you can identify that could cause problems would be useful. That is, they are typically letters that resemble either other letters, or the ASCII syntax characters (dot, colon, slash, ...)
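<br><br>As a small illustration of how many roles a colon can play in that template, the sketch below splits an example URL with Python's standard urllib.parse.urlsplit. The URL and the helper name colon_roles are made up for the example.

```python
from urllib.parse import urlsplit

def colon_roles(url):
    """Return the URL components that a colon delimits, plus the path."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,      # before the first ':'
        "user": parts.username,      # before the ':' in the userinfo
        "password": parts.password,  # after the ':' in the userinfo
        "host": parts.hostname,
        "port": parts.port,          # after the ':' that follows the host
        "path": parts.path,          # a ':' may also appear here as data
    }

example = "http://user:secret@example.com:8080/path/to:page"
print(example.count(":"), colon_roles(example))
```

A look-alike character substituted at any of these positions could mislead a reader about where the host name starts and ends, even though the parser itself would not treat it as syntax.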

<br><br>A good place to start is the data table in <a href="http://www.unicode.org/reports/tr39/#Confusable_Detection" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">http://www.unicode.org/reports/tr39/#Confusable_Detection

</a><br><br>We have some mappings there, but we can definitely add more. Eg for colon we have currently:

<br><br></span><pre>FF1A ;        003A ;        SA        #* ( ： → : ) FULLWIDTH COLON → COLON        # {nfkc:65307}<br>0589 ;        003A ;        SA        #* ( ։ → : ) ARMENIAN FULL STOP → COLON        # {source:12}<br>FE30 ;        003A ;        SA        #* ( ︰ → : ) PRESENTATION FORM FOR VERTICAL TWO DOT LEADER → COLON        # {source:3328}

<br><br>05C3 ;        003A ;        SA        #* ( ׃ → : ) HEBREW PUNCTUATION SOF PASUQ → COLON        # {source:13}</pre><span class="q"><span><br>&gt; I think we are making progress, and I hope my comments are helpful.<br></span></span><div><br>Yes, thanks.

<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

--<span class="q"><br>Michael Everson * <a href="http://www.evertype.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">http://www.evertype.com</a><br>_______________________________________________

<br>Idna-update mailing list<br><a href="mailto:Idna-update@alvestrand.no" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">Idna-update@alvestrand.no

</a><br><a href="http://www.alvestrand.no/mailman/listinfo/idna-update" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">http://www.alvestrand.no/mailman/listinfo/idna-update</a><br></span></blockquote>

</div><br>
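<br>The colon mappings quoted above use a simple &quot;source ; target ; type # comment&quot; layout. A minimal sketch of reading such lines and folding look-alikes back to ASCII might look like the following. It assumes only the field layout visible in the four quoted entries, handles single-character sources only, and the names SAMPLE, parse_confusables and fold are hypothetical; it is not the UTS #39 skeleton algorithm.

```python
# The four colon entries quoted in the message, with trailing notes shortened.
SAMPLE = """\
FF1A ; 003A ; SA  #* ( ： → : ) FULLWIDTH COLON → COLON
0589 ; 003A ; SA  #* ( ։ → : ) ARMENIAN FULL STOP → COLON
FE30 ; 003A ; SA  #* ( ︰ → : ) PRESENTATION FORM FOR VERTICAL TWO DOT LEADER → COLON
05C3 ; 003A ; SA  #* ( ׃ → : ) HEBREW PUNCTUATION SOF PASUQ → COLON
"""

def to_str(field):
    """Convert space-separated hex code points to a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())

def parse_confusables(text):
    """Build a {source: target} table from confusables-style lines."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        source, target, _kind = [f.strip() for f in data.split(";")]
        table[to_str(source)] = to_str(target)
    return table

def fold(label, table):
    """Replace each mapped character by its target (single characters only)."""
    return "".join(table.get(ch, ch) for ch in label)

table = parse_confusables(SAMPLE)
print(fold("http\u05c3//example", table))
```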