Sorry, I misunderstood.<br><br>Tatweel is in the same boat as many other characters. It is irrelevant to mappings, since the only relevant mappings are those where the results are all valid U-Label characters. So those characters are off the table anyway.<br>

<br clear="all">Mark<br>

<br><br><div class="gmail_quote">On Thu, Apr 2, 2009 at 13:08, Erik van der Poel <span dir="ltr">&lt;<a href="mailto:erikv@google.com">erikv@google.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

My email says &quot;disallow Tatweel&quot;, not map Tatweel.<br>

<font color="#888888"><br>

Erik<br>

</font><div><div></div><div class="h5"><br>

On Thu, Apr 2, 2009 at 11:44 AM, Mark Davis &lt;<a href="mailto:mark@macchiato.com">mark@macchiato.com</a>&gt; wrote:<br>

&gt; I agree that the construction, like Tables, can be Property based. And large<br>

&gt; swathes of characters can be excluded pretty much out of the box. However, it<br>

&gt; does mean going through the exercise of looking at the excluded characters<br>

&gt; to see if any of them should be exceptions. While I don&#39;t think it is worth<br>

&gt; the effort, worth the further incompatibility with IDNA2003, or worth not<br>

&gt; being able to use off-the-shelf NFKC and case-folding code, it is not an<br>

&gt; unreasonable compromise.<br>

&gt;<br>

&gt; As far as Tatweel goes, we really don&#39;t want to add any mappings that were<br>

&gt; not in IDNA2003; for security and interoperability we need all Unicode 3.2<br>

&gt; characters to either (a) not map, or (b) map to exactly the same as<br>

&gt; IDNA2003 has. (You pointed out the problem earlier.)<br>
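Editorial sketch of the compatibility rule stated above, in Python. It assumes the IDNA2003 mapping can be approximated per character with the standard-library stringprep tables (B.1 and B.2) followed by Unicode 3.2 NFKC, and it uses NFKC(CaseFold(NFKC(x))) as a stand-in for a candidate new mapping. The names idna2003_map, candidate_map and incompatible are hypothetical, and the result depends on the Unicode version shipped with the interpreter.<br>

<pre>
```python
import stringprep
import unicodedata
from unicodedata import ucd_3_2_0


def idna2003_map(ch: str) -> str:
    """Approximate the Nameprep mapping step for one character."""
    if stringprep.in_table_b1(ch):  # table B.1: mapped to nothing
        return ""
    # Table B.2 (case folding), then NFKC as defined for Unicode 3.2.
    return ucd_3_2_0.normalize("NFKC", stringprep.map_table_b2(ch))


def candidate_map(ch: str) -> str:
    """Assumed candidate mapping: NFKC(CaseFold(NFKC(ch))), current Unicode."""
    folded = unicodedata.normalize("NFKC", ch).casefold()
    return unicodedata.normalize("NFKC", folded)


def incompatible(code_points=range(0x110000)):
    """List Unicode 3.2 characters that break the (a)-or-(b) rule."""
    found = []
    for cp in code_points:
        ch = chr(cp)
        if ucd_3_2_0.category(ch) in ("Cn", "Cs", "Co"):
            continue  # unassigned in 3.2, surrogate, or private use
        new, old = candidate_map(ch), idna2003_map(ch)
        # (a) holds when the candidate leaves ch alone,
        # (b) holds when it agrees with IDNA2003.
        if new != ch and new != old:
            found.append(ch)
    return found
```
</pre>

Calling incompatible() with no arguments scans the whole code space; any character it returns would be a candidate for an explicit exception or for removal from the mapping.<br>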

&gt;<br>

&gt; Mark<br>

&gt;<br>

&gt;<br>

&gt; On Thu, Apr 2, 2009 at 10:50, Erik van der Poel &lt;<a href="mailto:erikv@google.com">erikv@google.com</a>&gt; wrote:<br>

&gt;&gt;<br>

&gt;&gt; It may not be necessary to do character-by-character analysis of NFKC.<br>

&gt;&gt; We may be able to select a small number of the NFKC tags:<br>

&gt;&gt;<br>

&gt;&gt; &lt;font&gt;          A font variant (e.g. a blackletter form).<br>

&gt;&gt; &lt;noBreak&gt;       A no-break version of a space or hyphen.<br>

&gt;&gt; &lt;initial&gt;       An initial presentation form (Arabic).<br>

&gt;&gt; &lt;medial&gt;        A medial presentation form (Arabic).<br>

&gt;&gt; &lt;final&gt;         A final presentation form (Arabic).<br>

&gt;&gt; &lt;isolated&gt;      An isolated presentation form (Arabic).<br>

&gt;&gt; &lt;circle&gt;        An encircled form.<br>

&gt;&gt; &lt;super&gt;         A superscript form.<br>

&gt;&gt; &lt;sub&gt;   A subscript form.<br>

&gt;&gt; &lt;vertical&gt;      A vertical layout presentation form.<br>

&gt;&gt; &lt;wide&gt;          A wide (or zenkaku) compatibility character.<br>

&gt;&gt; &lt;narrow&gt;        A narrow (or hankaku) compatibility character.<br>

&gt;&gt; &lt;small&gt;         A small variant form (CNS compatibility).<br>

&gt;&gt; &lt;square&gt;        A CJK squared font variant.<br>

&gt;&gt; &lt;fraction&gt;      A vulgar fraction form.<br>

&gt;&gt; &lt;compat&gt;        Otherwise unspecified compatibility character.<br>

&gt;&gt;<br>

&gt;&gt; Of these, I would suggest that &lt;wide&gt; and &lt;narrow&gt; are needed for East<br>

&gt;&gt; Asian input methods.<br>
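Editorial sketch of the tag-based selection suggested above, in Python. It reads the compatibility tag from the decomposition field that the standard unicodedata module exposes and maps only characters whose tag is in an allowed set (here &lt;wide&gt; and &lt;narrow&gt;, following the suggestion). The names compat_tag, ALLOWED_TAGS and map_selected are hypothetical, and mapping character by character followed by NFC is an assumption, not something the thread specifies.<br>

<pre>
```python
import string
import unicodedata
from typing import Optional

# Tags whose characters get mapped; everything else is left untouched.
ALLOWED_TAGS = {"wide", "narrow"}


def compat_tag(ch: str) -> Optional[str]:
    """Return the compatibility tag name (e.g. 'wide') of ch, or None."""
    fields = unicodedata.decomposition(ch).split()
    # A canonical decomposition starts with a hex code point; a
    # compatibility decomposition starts with a bracketed tag.
    if not fields or fields[0][0] in string.hexdigits:
        return None
    return fields[0][1:-1]  # drop the surrounding brackets


def map_selected(text: str) -> str:
    """Apply NFKC only to characters carrying an allowed tag."""
    mapped = "".join(
        unicodedata.normalize("NFKC", ch) if compat_tag(ch) in ALLOWED_TAGS else ch
        for ch in text
    )
    # Recompose, so that a mapped base letter and a mapped combining
    # mark (e.g. halfwidth katakana plus voiced sound mark) join up.
    return unicodedata.normalize("NFC", mapped)
```
</pre>

With these assumptions a fullwidth Latin letter is mapped to its ASCII counterpart, while a superscript digit (tag &lt;super&gt;) passes through unchanged and would have to be handled by whatever rule applies to unmapped characters.<br>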

&gt;&gt;<br>

&gt;&gt; We should also remember that a number of WG participants would have to<br>

&gt;&gt; compromise to some extent, in order to accept mapping as a requirement<br>

&gt;&gt; on the lookup side. Those that pushed for lookup mappings should also<br>

&gt;&gt; be willing to make some compromises.<br>

&gt;&gt;<br>

&gt;&gt; One example where we seem to have consensus for getting &quot;stricter&quot; is<br>

&gt;&gt; the Tatweel. The consensus seems to be to disallow Tatweel.<br>

&gt;&gt;<br>

&gt;&gt; So my suggestion is that those who are pushing for lookup mapping, be<br>

&gt;&gt; willing to get &quot;stricter&quot; about the input to the mapping function.<br>

&gt;&gt; Otherwise, I fear that this WG will not reach a final consensus,<br>

&gt;&gt; possibly leading to a &quot;fork&quot; between the Web protocol stack and<br>

&gt;&gt; others.<br>

&gt;&gt;<br>

&gt;&gt; Erik<br>

&gt;&gt;<br>

&gt;&gt; On Thu, Apr 2, 2009 at 10:07 AM, Mark Davis &lt;<a href="mailto:mark@macchiato.com">mark@macchiato.com</a>&gt; wrote:<br>

&gt;&gt; &gt; It would be possible to do a Tables section for mappings, that went<br>

&gt;&gt; &gt; through<br>

&gt;&gt; &gt; the same kind of process that we did for Tables, of fine tuning the<br>

&gt;&gt; &gt; mapping.<br>

&gt;&gt; &gt; That is, we could go through all of the mappings and figure out which<br>

&gt;&gt; &gt; ones<br>

&gt;&gt; &gt; we need, and which ones we don&#39;t.<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; Frankly, I don&#39;t think we need to go through the effort. The only<br>

&gt;&gt; &gt; problem I<br>

&gt;&gt; &gt; see is where a disallowed character X looks most like one PVALID<br>

&gt;&gt; &gt; character<br>

&gt;&gt; &gt; P1, but maps to a different PVALID character P2, and P1 is not<br>

&gt;&gt; &gt; confusable<br>

&gt;&gt; &gt; with P2 already. I don&#39;t know of any cases like that.<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; BTW, my earlier figures were including the &quot;Remove Default Ignorables&quot;<br>

&gt;&gt; &gt; from<br>

&gt;&gt; &gt; my earlier mail. Here are the figures with that broken out:<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; NFKC-CF-RDI,    Remapped:    5290<br>

&gt;&gt; &gt; NFKC-LC-RDI,    Remapped:    5224<br>

&gt;&gt; &gt; NFKC-CF,    Remapped:    4896<br>

&gt;&gt; &gt; NFKC-LC,    Remapped:    4830<br>

&gt;&gt; &gt; NFC-CF-RDI,    Remapped:    2485<br>

&gt;&gt; &gt; NFC-LC-RDI,    Remapped:    2394<br>

&gt;&gt; &gt; NFC-CF,    Remapped:    2091<br>

&gt;&gt; &gt; NFC-LC,    Remapped:    2000<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; CF = Unicode toCaseFold<br>

&gt;&gt; &gt; LC = Unicode toLowercase<br>

&gt;&gt; &gt; RDI = Remove default ignorables<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; And of course, the mappings would be restricted to only mapping<br>

&gt;&gt; &gt; characters<br>

&gt;&gt; &gt; that were not PVALID in any event, so the above figures would vary<br>

&gt;&gt; &gt; depending<br>

&gt;&gt; &gt; on what we end up with there.<br>
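Editorial sketch of how counts like the ones above could be reproduced, in Python. It assumes each mapping has the shape form(case(form(x))) with case being toCaseFold or toLowercase, and it counts assigned code points that the mapping changes. The RDI variants are left out because the standard unicodedata module does not expose the Default_Ignorable_Code_Point property. The totals depend on the interpreter's Unicode version and on this assumed definition, so they should not be expected to match the figures quoted in the thread.<br>

<pre>
```python
import unicodedata


def make_mapping(form: str, fold: bool):
    """Build form(case(form(s))), case = toCaseFold (CF) or toLowercase (LC)."""
    def mapping(s: str) -> str:
        t = unicodedata.normalize(form, s)
        t = t.casefold() if fold else t.lower()
        return unicodedata.normalize(form, t)
    return mapping


MAPPINGS = {
    "NFKC-CF": make_mapping("NFKC", True),
    "NFKC-LC": make_mapping("NFKC", False),
    "NFC-CF": make_mapping("NFC", True),
    "NFC-LC": make_mapping("NFC", False),
}


def count_remapped(code_points=range(0x110000)):
    """Count code points that each mapping changes."""
    counts = dict.fromkeys(MAPPINGS, 0)
    for cp in code_points:
        ch = chr(cp)
        if unicodedata.category(ch) in ("Cn", "Cs", "Co"):
            continue  # unassigned, surrogate, or private use
        for name, mapping in MAPPINGS.items():
            if mapping(ch) != ch:
                counts[name] += 1
    return counts
```
</pre>

Restricting the scan to characters that are not PVALID, as the message describes, would need the Tables data as an extra filter inside the loop.<br>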

&gt;&gt; &gt;<br>

&gt;&gt; &gt; Mark<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; On Thu, Apr 2, 2009 at 09:53, Erik van der Poel &lt;<a href="mailto:erikv@google.com">erikv@google.com</a>&gt;<br>

&gt;&gt; &gt; wrote:<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; Please, let&#39;s not let size and complexity issues derail this IDNAbis<br>

&gt;&gt; &gt;&gt; effort. Haste makes waste. IDNA2003 was a good first cut, that took<br>

&gt;&gt; &gt;&gt; advantage of several Unicode tables, adopting them wholesale. IDNA2008<br>

&gt;&gt; &gt;&gt; is a much more careful effort, with detailed dissection, as you can<br>

&gt;&gt; &gt;&gt; see in the Table draft. We should apply similar care to the &quot;mapping&quot;<br>

&gt;&gt; &gt;&gt; table.<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; I suggest that we come up with principles, that we then apply to the<br>

&gt;&gt; &gt;&gt; question of mapping. For example, the reason for lower-casing<br>

&gt;&gt; &gt;&gt; non-ASCII letters is to compensate for the lack of matching on the<br>

&gt;&gt; &gt;&gt; server side. The reason for mapping full-width Latin to normal is<br>

&gt;&gt; &gt;&gt; because it is easy to type those characters in East Asian input<br>

&gt;&gt; &gt;&gt; methods. (Of course, we need to see if there is consensus, for each of<br>

&gt;&gt; &gt;&gt; these &quot;reasons&quot;.)<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; I also suggest that we automate the process of finding problematic<br>

&gt;&gt; &gt;&gt; characters. For example, we have already seen that 3-way relationships<br>

&gt;&gt; &gt;&gt; are problematic. One example of this is Final/Normal/Capital Sigma. We<br>

&gt;&gt; &gt;&gt; can automatically find these in Unicode&#39;s CaseFold tables. We can also<br>

&gt;&gt; &gt;&gt; look for cases where one character becomes two when upper- or<br>

&gt;&gt; &gt;&gt; lower-cased (e.g. Eszett -&gt; SS).<br>
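Editorial sketch of the automated search proposed above, in Python. It uses the built-in str.casefold(), str.upper() and str.lower() as stand-ins for the Unicode case tables, groups characters by their case-folded form to find relationships with three or more members, and lists characters whose upper- or lower-case form is longer than one character. The function names are hypothetical and the results follow the interpreter's Unicode version.<br>

<pre>
```python
import unicodedata
from collections import defaultdict


def fold_groups(code_points=range(0x110000)):
    """Group assigned characters by their case-folded form."""
    groups = defaultdict(set)
    for cp in code_points:
        ch = chr(cp)
        if unicodedata.category(ch) in ("Cn", "Cs", "Co"):
            continue  # unassigned, surrogate, or private use
        groups[ch.casefold()].add(ch)
    return groups


def multiway(groups, size=3):
    """Keep only fold targets shared by at least `size` characters."""
    return {target: chars for target, chars in groups.items() if len(chars) >= size}


def expanding_case_maps(code_points=range(0x110000)):
    """Characters that become two or more characters when upper- or lower-cased."""
    found = {}
    for cp in code_points:
        ch = chr(cp)
        upper, lower = ch.upper(), ch.lower()
        if len(upper) >= 2 or len(lower) >= 2:
            found[ch] = (upper, lower)
    return found
```
</pre>

Under these assumptions the group for small sigma contains capital, small and final sigma, and Eszett shows up in expanding_case_maps() with the upper-case form SS.<br>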

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; We should definitely not let the current size of Unicode-related<br>

&gt;&gt; &gt;&gt; libraries like ICU affect the decision-making process in IETF. Thin<br>

&gt;&gt; &gt;&gt; clients can always let big servers do the heavy lifting.<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; Erik<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; On Thu, Apr 2, 2009 at 8:22 AM, Mark Davis &lt;<a href="mailto:mark@macchiato.com">mark@macchiato.com</a>&gt; wrote:<br>

&gt;&gt; &gt;&gt; &gt; Mark<br>

&gt;&gt; &gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt; &gt; On Wed, Apr 1, 2009 at 12:51, John C Klensin &lt;<a href="mailto:klensin@jck.com">klensin@jck.com</a>&gt; wrote:<br>

&gt;&gt; &gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt; &gt;&gt; We are, of course, there already.  While<br>

&gt;&gt; &gt;&gt; &gt;&gt; NFKC(CaseFold(NFKC(string))) is a good predictor of the<br>

&gt;&gt; &gt;&gt; &gt;&gt; Stringprep mappings, it is not an exact one, and IDNA2003<br>

&gt;&gt; &gt;&gt; &gt;&gt; implementations already need separate tables for NFKC and IDNA.<br>

&gt;&gt; &gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt; &gt; True that they are not exact; but the differences are few, and<br>

&gt;&gt; &gt;&gt; &gt; extremely rare (not even measurable in practice, since their<br>

&gt;&gt; &gt;&gt; &gt; frequency<br>

&gt;&gt; &gt;&gt; &gt; is on a par with random data). Moreover, some implementations<br>

&gt;&gt; &gt;&gt; &gt; already<br>

&gt;&gt; &gt;&gt; &gt; use the latest version of NFKC instead of having special old<br>

&gt;&gt; &gt;&gt; &gt; versions,<br>

&gt;&gt; &gt;&gt; &gt; because the differences are so small. So given the choice of a major<br>

&gt;&gt; &gt;&gt; &gt; breakage or an insignificant breakage, I&#39;d go for the insignificant<br>

&gt;&gt; &gt;&gt; &gt; one.<br>

&gt;&gt; &gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; &gt;&gt; That is where arguments about complexity get complicated.<br>

&gt;&gt; &gt;&gt; &gt;&gt; IDNA2008, even with contextual rules, is arguably less complex<br>

&gt;&gt; &gt;&gt; &gt;&gt; than IDNA2003 precisely because, other than that handful of<br>

&gt;&gt; &gt;&gt; &gt;&gt; characters, the tables are smaller and the interpretation of an<br>

&gt;&gt; &gt;&gt; &gt;&gt; entry in those tables is &quot;valid&quot; or &quot;not&quot;.  By contrast,<br>

&gt;&gt; &gt;&gt; &gt;&gt; IDNA2003 requires a table that is nearly the size of Unicode<br>

&gt;&gt; &gt;&gt; &gt;&gt; with mapping actions for many characters.<br>

&gt;&gt; &gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt; &gt; I have just no idea whatever where you are getting your figures, but<br>

&gt;&gt; &gt;&gt; &gt; they are very misleading. I&#39;ll assume that was not the intent.<br>

&gt;&gt; &gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt; &gt; Here are the figures I get.<br>

&gt;&gt; &gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt; &gt; PValid or Context: 90262<br>

&gt;&gt; &gt;&gt; &gt; NFKC-Folded,    Remapped:       5290<br>

&gt;&gt; &gt;&gt; &gt; NFKC-Lower,     Remapped:       5224<br>

&gt;&gt; &gt;&gt; &gt; NFC-Folded,     Remapped:       2485<br>

&gt;&gt; &gt;&gt; &gt; NFC-Lower,      Remapped:       2394<br>

&gt;&gt; &gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt; &gt; A &quot;table that is nearly the size of Unicode&quot;. If you mean possible<br>

&gt;&gt; &gt;&gt; &gt; Unicode characters, that&#39;s over a million. Even if you mean graphic<br>

&gt;&gt; &gt;&gt; &gt; characters, that&#39;s somewhat over 100,000<br>

&gt;&gt; &gt;&gt; &gt; (<a href="http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5E%5B:c:%5D%5D" target="_blank">http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[^[:c:]]</a>).<br>

&gt;&gt; &gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt; &gt; NFKC-Folded affects 5,290 characters in Unicode 5.2. That&#39;s about 5%<br>

&gt;&gt; &gt;&gt; &gt; of graphic characters: in my book at least, 5% doesn&#39;t mean &quot;nearly<br>

&gt;&gt; &gt;&gt; &gt; all&quot;. Or maybe you meant something odd like &quot;the size of the table in<br>

&gt;&gt; &gt;&gt; &gt; bytes is nearly as large as the number of Unicode assigned graphic<br>

&gt;&gt; &gt;&gt; &gt; characters&quot;.<br>

&gt;&gt; &gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt; &gt; Let&#39;s step back a bit. We need to remember that IDNA2008 already<br>

&gt;&gt; &gt;&gt; &gt; requires the data in Tables and NFC (for sizing on that, see<br>

&gt;&gt; &gt;&gt; &gt; <a href="http://www.macchiato.com/unicode/nfc-faq" target="_blank">http://www.macchiato.com/unicode/nfc-faq</a>). The additional table size<br>

&gt;&gt; &gt;&gt; &gt; for NFKC and Folding is not that big an increase. As a matter of<br>

&gt;&gt; &gt;&gt; &gt; fact,<br>

&gt;&gt; &gt;&gt; &gt; if an implementation is tight on space, then having them available<br>

&gt;&gt; &gt;&gt; &gt; allows it to substantially cut down on the Table size by<br>

&gt;&gt; &gt;&gt; &gt; algorithmically computing<br>

&gt;&gt; &gt;&gt; &gt; <a href="http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.2" target="_blank">http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.2</a>.<br>

&gt;&gt; &gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt; &gt; If you have different figures, it would be useful to put them out.<br>

&gt;&gt; &gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt; &gt;&gt; And, of course, a<br>

&gt;&gt; &gt;&gt; &gt;&gt; transition strategy that preserves full functionality for all<br>

&gt;&gt; &gt;&gt; &gt;&gt; labels that were valid under IDNA2003 means that one has to<br>

&gt;&gt; &gt;&gt; &gt;&gt; support both, which is the most complex option possible.<br>

&gt;&gt; &gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt; &gt; I agree that it has the most overhead, since you have to keep a copy<br>

&gt;&gt; &gt;&gt; &gt; of IDNA2003 around. That&#39;s why I favor a cleaner approach.<br>

&gt;&gt; &gt;&gt; &gt;<br>

&gt;&gt; &gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; &gt;&gt;    john<br>

&gt;&gt; &gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; &gt;&gt; _______________________________________________<br>

&gt;&gt; &gt;&gt; &gt;&gt; Idna-update mailing list<br>

&gt;&gt; &gt;&gt; &gt;&gt; <a href="mailto:Idna-update@alvestrand.no">Idna-update@alvestrand.no</a><br>

&gt;&gt; &gt;&gt; &gt;&gt; <a href="http://www.alvestrand.no/mailman/listinfo/idna-update" target="_blank">http://www.alvestrand.no/mailman/listinfo/idna-update</a><br>

&gt;&gt; &gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; &gt;<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt;<br>

&gt;<br>

&gt;<br>


</div></div></blockquote></div><br>