Sorry, I misunderstood.<br><br>Tatweel is in the same boat as many other characters. It is irrelevant to mappings, since the only relevant mappings are those where the results are all valid U-Label characters. So those characters are off the table anyway.<br>
<br clear="all">Mark<br>
<br><br><div class="gmail_quote">On Thu, Apr 2, 2009 at 13:08, Erik van der Poel <span dir="ltr"><<a href="mailto:erikv@google.com">erikv@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
My email says "disallow Tatweel", not map Tatweel.<br>
<font color="#888888"><br>
Erik<br>
</font><div><div></div><div class="h5"><br>
On Thu, Apr 2, 2009 at 11:44 AM, Mark Davis <<a href="mailto:mark@macchiato.com">mark@macchiato.com</a>> wrote:<br>
> I agree that the construction, like Tables, can be Property based. And large<br>
> swathes of characters can be excluded pretty much out of the box. However, it<br>
> does mean going through the exercise of looking at the excluded characters<br>
> to see if any of them should be exceptions. While I don't think it is worth<br>
> the effort, worth the further incompatibility with IDNA2003, or worth not<br>
> being able to use off-the-shelf NFKC and case-folding code, it is not an<br>
> unreasonable compromise.<br>
><br>
> As far as Tatweel goes, we really don't want to add any mappings that were<br>
> not in IDNA2003; for security and interoperability we need all Unicode 3.2<br>
> characters to either (a) not map, or (b) map to exactly the same as<br>
> IDNA2003 has. (You pointed out the problem earlier.)<br>
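The compatibility criterion above (every Unicode 3.2 character either does not map, or maps exactly as in IDNA2003) can be checked mechanically. What follows is only a minimal sketch, assuming that Python's standard `stringprep` tables B.1 and B.2 followed by Unicode 3.2 NFKC are an adequate stand-in for the IDNA2003 mapping step. The names `idna2003_map`, `incompatible` and `drop_tatweel` are hypothetical and do not come from any draft.

```python
import stringprep
import sys
from unicodedata import ucd_3_2_0  # Unicode 3.2 data, the version IDNA2003 is tied to


def idna2003_map(ch):
    """Approximate the IDNA2003 (Nameprep) mapping of a single character."""
    if stringprep.in_table_b1(ch):  # B.1: characters that are mapped to nothing
        return ""
    folded = stringprep.map_table_b2(ch)  # B.2: case folding for use with NFKC
    return ucd_3_2_0.normalize("NFKC", folded)


def incompatible(new_map):
    """List Unicode 3.2 characters that new_map changes differently from IDNA2003."""
    bad = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # Skip code points that are unassigned in Unicode 3.2, surrogates, private use.
        if ucd_3_2_0.category(ch) in ("Cn", "Cs", "Co"):
            continue
        mapped = new_map(ch)
        if mapped != ch and mapped != idna2003_map(ch):
            bad.append(ch)
    return bad


def drop_tatweel(ch):
    # Hypothetical candidate mapping that deletes Tatweel (U+0640).
    return "" if ch == "\u0640" else ch
```

Under these assumptions, `incompatible(drop_tatweel)` reports Tatweel, because IDNA2003 leaves it unchanged while the candidate mapping removes it. A candidate that leaves every character alone reports nothing.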
><br>
> Mark<br>
><br>
><br>
> On Thu, Apr 2, 2009 at 10:50, Erik van der Poel <<a href="mailto:erikv@google.com">erikv@google.com</a>> wrote:<br>
>><br>
>> It may not be necessary to do character-by-character analysis of NFKC.<br>
>> We may be able to select a small number of the NFKC tags:<br>
>><br>
>> &lt;font&gt; A font variant (e.g. a blackletter form).<br>
>> &lt;noBreak&gt; A no-break version of a space or hyphen.<br>
>> &lt;initial&gt; An initial presentation form (Arabic).<br>
>> &lt;medial&gt; A medial presentation form (Arabic).<br>
>> &lt;final&gt; A final presentation form (Arabic).<br>
>> &lt;isolated&gt; An isolated presentation form (Arabic).<br>
>> &lt;circle&gt; An encircled form.<br>
>> &lt;super&gt; A superscript form.<br>
>> &lt;sub&gt; A subscript form.<br>
>> &lt;vertical&gt; A vertical layout presentation form.<br>
>> &lt;wide&gt; A wide (or zenkaku) compatibility character.<br>
>> &lt;narrow&gt; A narrow (or hankaku) compatibility character.<br>
>> &lt;small&gt; A small variant form (CNS compatibility).<br>
>> &lt;square&gt; A CJK squared font variant.<br>
>> &lt;fraction&gt; A vulgar fraction form.<br>
>> &lt;compat&gt; Otherwise unspecified compatibility character.<br>
>><br>
>> Of these, I would suggest that &lt;wide&gt; and &lt;narrow&gt; are needed for East<br>
>> Asian input methods.<br>
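Selecting mappings by compatibility tag, as suggested above, could look roughly like the following. This is a sketch only: it assumes the tag reported by Python's `unicodedata.decomposition` is the one meant here, and that a final NFC pass is acceptable (it recomposes sequences such as a halfwidth katakana followed by a halfwidth voiced sound mark, but it also normalizes the rest of the input). `ALLOWED_TAGS`, `compat_tag` and `map_selected` are hypothetical names.

```python
import unicodedata
from typing import Optional

# Hypothetical policy: only characters carrying these compatibility tags are mapped.
ALLOWED_TAGS = {"<wide>", "<narrow>"}


def compat_tag(ch: str) -> Optional[str]:
    """Return the compatibility tag of ch's decomposition mapping, or None."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp.split(" ", 1)[0]
    return None  # no decomposition, or a canonical (untagged) one


def map_selected(text: str) -> str:
    """Apply NFKC only to characters whose tag is allowed, then recompose with NFC."""
    out = []
    for ch in text:
        if compat_tag(ch) in ALLOWED_TAGS:
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))
```

With this policy, fullwidth Latin letters and halfwidth katakana are mapped, while a character such as SUPERSCRIPT TWO (tagged `<super>`) passes through unchanged and would have to be rejected or accepted by a separate validity check.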
>><br>
>> We should also remember that a number of WG participants would have to<br>
>> compromise to some extent, in order to accept mapping as a requirement<br>
>> on the lookup side. Those that pushed for lookup mappings should also<br>
>> be willing to make some compromises.<br>
>><br>
>> One example where we seem to have consensus for getting "stricter" is<br>
>> the Tatweel. The consensus seems to be to disallow Tatweel.<br>
>><br>
>> So my suggestion is that those who are pushing for lookup mapping, be<br>
>> willing to get "stricter" about the input to the mapping function.<br>
>> Otherwise, I fear that this WG will not reach a final consensus,<br>
>> possibly leading to a "fork" between the Web protocol stack and<br>
>> others.<br>
>><br>
>> Erik<br>
>><br>
>> On Thu, Apr 2, 2009 at 10:07 AM, Mark Davis <<a href="mailto:mark@macchiato.com">mark@macchiato.com</a>> wrote:<br>
>> > It would be possible to do a Tables section for mappings, that went<br>
>> > through<br>
>> > the same kind of process that we did for Tables, of fine tuning the<br>
>> > mapping.<br>
>> > That is, we could go through all of the mappings and figure out which<br>
>> > ones<br>
>> > we need, and which ones we don't.<br>
>> ><br>
>> > Frankly, I don't think we need to go through the effort. The only<br>
>> > problem I<br>
>> > see is where a disallowed character X looks most like one PVALID<br>
>> > character<br>
>> > P1, but maps to a different PVALID character P2, and P1 is not<br>
>> > confusable<br>
>> > with P2 already. I don't know of any cases like that.<br>
>> ><br>
>> > BTW, my earlier figures were including the "Remove Default Ignorables"<br>
>> > from<br>
>> > my earlier mail. Here are the figures with that broken out:<br>
>> ><br>
>> > NFKC-CF-RDI, Remapped: 5290<br>
>> > NFKC-LC-RDI, Remapped: 5224<br>
>> > NFKC-CF, Remapped: 4896<br>
>> > NFKC-LC, Remapped: 4830<br>
>> > NFC-CF-RDI, Remapped: 2485<br>
>> > NFC-LC-RDI, Remapped: 2394<br>
>> > NFC-CF, Remapped: 2091<br>
>> > NFC-LC, Remapped: 2000<br>
>> ><br>
>> > CF = Unicode toCaseFold<br>
>> > LC = Unicode toLowercase<br>
>> > RDI = Remove default ignorables<br>
>> ><br>
>> > And of course, the mappings would be restricted to only mapping<br>
>> > characters<br>
>> > that were not PVALID in any event, so the above figures would vary<br>
>> > depending<br>
>> > on what we end up with there.<br>
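Figures of this kind can be reproduced approximately with a short script. The sketch below makes several assumptions: each variant is taken to be form(case(form(c))) applied to a single code point, Python's `str.casefold` and `str.lower` stand in for toCaseFold and toLowercase, and the RDI variants are left out because the standard `unicodedata` module does not expose the Default_Ignorable_Code_Point property. It also does not restrict the count to characters that are not PVALID. The totals depend on the Unicode version bundled with the interpreter, so they will not match the numbers quoted above exactly.

```python
import sys
import unicodedata


def _compose(form, case_op):
    """Build the mapping form(case_op(form(s))) for one normalization form."""
    def mapping(s):
        return unicodedata.normalize(form, case_op(unicodedata.normalize(form, s)))
    return mapping


MAPPINGS = {
    "NFKC-CF": _compose("NFKC", str.casefold),
    "NFKC-LC": _compose("NFKC", str.lower),
    "NFC-CF": _compose("NFC", str.casefold),
    "NFC-LC": _compose("NFC", str.lower),
}


def count_remapped():
    """Count assigned code points that each mapping changes."""
    counts = dict.fromkeys(MAPPINGS, 0)
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # Skip unassigned code points, surrogates and private-use characters.
        if unicodedata.category(ch) in ("Cn", "Cs", "Co"):
            continue
        for name, mapping in MAPPINGS.items():
            if mapping(ch) != ch:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    for name, n in count_remapped().items():
        print(f"{name}, Remapped: {n}")
```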
>> ><br>
>> > Mark<br>
>> ><br>
>> ><br>
>> > On Thu, Apr 2, 2009 at 09:53, Erik van der Poel <<a href="mailto:erikv@google.com">erikv@google.com</a>><br>
>> > wrote:<br>
>> >><br>
>> >> Please, let's not let size and complexity issues derail this IDNAbis<br>
>> >> effort. Haste makes waste. IDNA2003 was a good first cut, that took<br>
>> >> advantage of several Unicode tables, adopting them wholesale. IDNA2008<br>
>> >> is a much more careful effort, with detailed dissection, as you can<br>
>> >> see in the Table draft. We should apply similar care to the "mapping"<br>
>> >> table.<br>
>> >><br>
>> >> I suggest that we come up with principles, that we then apply to the<br>
>> >> question of mapping. For example, the reason for lower-casing<br>
>> >> non-ASCII letters is to compensate for the lack of matching on the<br>
>> >> server side. The reason for mapping full-width Latin to normal is<br>
>> >> because it is easy to type those characters in East Asian input<br>
>> >> methods. (Of course, we need to see if there is consensus, for each of<br>
>> >> these "reasons".)<br>
>> >><br>
>> >> I also suggest that we automate the process of finding problematic<br>
>> >> characters. For example, we have already seen that 3-way relationships<br>
>> >> are problematic. One example of this is Final/Normal/Capital Sigma. We<br>
>> >> can automatically find these in Unicode's CaseFold tables. We can also<br>
>> >> look for cases where one character becomes two when upper- or<br>
>> >> lower-cased (e.g. Eszett -> SS).<br>
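The automated search described above can be sketched as follows. It assumes Python's built-in `str.casefold`, `str.upper` and `str.lower` are a close enough proxy for the Unicode CaseFolding and SpecialCasing data. The function names are hypothetical.

```python
import sys
import unicodedata
from collections import defaultdict


def assigned_chars():
    """Yield every assigned code point except surrogates and private use."""
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) not in ("Cn", "Cs", "Co"):
            yield ch


def fold_classes(min_size=3):
    """Group characters by case-folded form; keep groups with at least min_size members."""
    classes = defaultdict(set)
    for ch in assigned_chars():
        classes[ch.casefold()].add(ch)
    return {key: members for key, members in classes.items() if len(members) >= min_size}


def expanding_chars():
    """Characters that become more than one code point when upper- or lower-cased."""
    return {ch for ch in assigned_chars() if len(ch.upper()) > 1 or len(ch.lower()) > 1}
```

The class keyed by small sigma contains capital, small and final sigma, which is the 3-way relationship mentioned above, and `expanding_chars()` includes Eszett because it upper-cases to "SS".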
>> >><br>
>> >> We should definitely not let the current size of Unicode-related<br>
>> >> libraries like ICU affect the decision-making process in IETF. Thin<br>
>> >> clients can always let big servers do the heavy lifting.<br>
>> >><br>
>> >> Erik<br>
>> >><br>
>> >> On Thu, Apr 2, 2009 at 8:22 AM, Mark Davis <<a href="mailto:mark@macchiato.com">mark@macchiato.com</a>> wrote:<br>
>> >> > Mark<br>
>> >> ><br>
>> >> > On Wed, Apr 1, 2009 at 12:51, John C Klensin <<a href="mailto:klensin@jck.com">klensin@jck.com</a>> wrote:<br>
>> >> ><br>
>> >> >> We are, of course, there already. While<br>
>> >> >> NFKC(CaseFold(NFKC(string))) is a good predictor of the<br>
>> >> >> Stringprep mappings, it is not an exact one, and IDNA2003<br>
>> >> >> implementations already need separate tables for NFKC and IDNA.<br>
>> >> ><br>
>> >> > True that they are not exact; but the differences are few, and<br>
>> >> > extremely rare (not even measurable in practice, since their<br>
>> >> > frequency<br>
>> >> > is on a par with random data). Moreover, some implementations<br>
>> >> > already<br>
>> >> > use the latest version of NFKC instead of having special old<br>
>> >> > versions,<br>
>> >> > because the differences are so small. So given the choice of a major<br>
>> >> > breakage or an insignificant breakage, I'd go for the insignificant<br>
>> >> > one.<br>
>> >> ><br>
>> >> >><br>
>> >> >> That is where arguments about complexity get complicated.<br>
>> >> >> IDNA2008, even with contextual rules, is arguably less complex<br>
>> >> >> than IDNA2003 precisely because, other than that handful of<br>
>> >> >> characters, the tables are smaller and the interpretation of an<br>
>> >> >> entry in those tables is "valid" or "not". By contrast,<br>
>> >> >> IDNA2003 requires a table that is nearly the size of Unicode<br>
>> >> >> with mapping actions for many characters.<br>
>> >> ><br>
>> >> > I have no idea where you are getting your figures, but<br>
>> >> > they are very misleading. I'll assume that was not the intent.<br>
>> >> ><br>
>> >> > Here are the figures I get.<br>
>> >> ><br>
>> >> > PValid or Context: 90262<br>
>> >> > NFKC-Folded, Remapped: 5290<br>
>> >> > NFKC-Lower, Remapped: 5224<br>
>> >> > NFC-Folded, Remapped: 2485<br>
>> >> > NFC-Lower, Remapped: 2394<br>
>> >> ><br>
>> >> > A "table that is nearly the size of Unicode". If you mean possible<br>
>> >> > Unicode characters, that's over a million. Even if you mean graphic<br>
>> >> > characters, that's somewhat over 100,000<br>
>> >> > (<a href="http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B" target="_blank">http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[</a>^[:c:]]).<br>
>> >> ><br>
>> >> > NFKC-Folded affects 5,290 characters in Unicode 5.2. That's about 5%<br>
>> >> > of graphic characters: in my book at least, 5% doesn't mean "nearly<br>
>> >> > all". Or maybe you meant something odd like "the size of the table in<br>
>> >> > bytes is nearly as large as the number of Unicode assigned graphic<br>
>> >> > characters".<br>
>> >> ><br>
>> >> > Let's step back a bit. We need to remember that IDNA2008 already<br>
>> >> > requires the data in Tables and NFC (for sizing on that, see<br>
>> >> > <a href="http://www.macchiato.com/unicode/nfc-faq" target="_blank">http://www.macchiato.com/unicode/nfc-faq</a>). The additional table size<br>
>> >> > for NFKC and Folding is not that big an increase. As a matter of<br>
>> >> > fact,<br>
>> >> > if an implementation is tight on space, then having them available<br>
>> >> > allows it to substantially cut down on the Table size by<br>
>> >> > algorithmically computing<br>
>> >> > <a href="http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.2" target="_blank">http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.2</a>.<br>
>> >> ><br>
>> >> > If you have different figures, it would be useful to put them out.<br>
>> >> ><br>
>> >> >> And, of course, a<br>
>> >> >> transition strategy that preserves full functionality for all<br>
>> >> >> labels that were valid under IDNA2003 means that one has to<br>
>> >> >> support both, which is the most complex option possible.<br>
>> >> ><br>
>> >> > I agree that it has the most overhead, since you have to keep a copy<br>
>> >> > of IDNA2003 around. That's why I favor a cleaner approach.<br>
>> >> ><br>
>> >> >><br>
>> >> >> john<br>
>> >> >><br>
>> >> >> _______________________________________________<br>
>> >> >> Idna-update mailing list<br>
>> >> >> <a href="mailto:Idna-update@alvestrand.no">Idna-update@alvestrand.no</a><br>
>> >> >> <a href="http://www.alvestrand.no/mailman/listinfo/idna-update" target="_blank">http://www.alvestrand.no/mailman/listinfo/idna-update</a><br>
>> >> >><br>
>> >> ><br>
>> ><br>
>> ><br>
><br>
><br>
</div></div></blockquote></div><br>