Q2: What mapping function should be used in a revised IDNA2008 specification?

Fri Apr 3 08:45:05 CEST 2009

Hello Shawn,

In general, I agree with you that IDNA2008 should not introduce mappings 
that were not present in IDNA2003. However, let's say
a cased pair of characters is added. I wouldn't assume that any
application creator would want to exclude that pair from case
mapping; it might lead to rather annoying user complaints.

Regards,   Martin.

On 2009/04/03 4:49, Shawn Steele wrote:
> I'm concerned about additional mappings.  If we have to fall back from 2008 to 2003, it's annoying, but what about 2012 to 2008 to 2003?
>
> Also, such fallback would depend on how a number of different clients approached the problem.  That would probably cause lots of variation.
>
> -shawn
>
> Sent from my AT&T Samsung i907 Windows Mobile® Smartphone.
>
> -----Original Message-----
> From: Erik van der Poel<erikv at google.com>
> Sent: Thursday, April 02, 2009 11:54 AM
> To: Mark Davis<mark at macchiato.com>
> Cc: Vint Cerf<vint at google.com>; idna-update at alvestrand.no<idna-update at alvestrand.no>; Martin J. Dürst<duerst at it.aoyama.ac.jp>; John C Klensin<klensin at jck.com>
> Subject: Re: Q2: What mapping function should be used in a revised IDNA2008 specification?
>
>
> Please, let's not let size and complexity issues derail this IDNAbis
> effort. Haste makes waste. IDNA2003 was a good first cut, that took
> advantage of several Unicode tables, adopting them wholesale. IDNA2008
> is a much more careful effort, with detailed dissection, as you can
> see in the Table draft. We should apply similar care to the "mapping"
> table.
>
> I suggest that we come up with principles, that we then apply to the
> question of mapping. For example, the reason for lower-casing
> non-ASCII letters is to compensate for the lack of matching on the
> server side. The reason for mapping full-width Latin to normal is
> because it is easy to type those characters in East Asian input
> methods. (Of course, we need to see if there is consensus, for each of
> these "reasons".)
>
> I also suggest that we automate the process of finding problematic
> characters. For example, we have already seen that 3-way relationships
> are problematic. One example of this is Final/Normal/Capital Sigma. We
> can automatically find these in Unicode's CaseFold tables. We can also
> look for cases where one character becomes two when upper- or
> lower-cased (e.g. Eszett ->  SS).
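
The automated search suggested above could be sketched roughly as follows. This is a minimal illustration and not a tool used on this list: the function names are hypothetical, and it relies on the case data built into the Python interpreter (whatever Unicode version that ships with) rather than reading Unicode's CaseFolding.txt directly.

```python
import sys
from collections import defaultdict


def all_chars():
    """Yield every code point as a one-character string, skipping surrogates."""
    for cp in range(sys.maxunicode + 1):
        if not 0xD800 <= cp <= 0xDFFF:
            yield chr(cp)


def expanding_chars():
    """Characters that become more than one character when upper- or lower-cased."""
    return [ch for ch in all_chars()
            if len(ch.upper()) > 1 or len(ch.lower()) > 1]


def fold_groups(min_size=3):
    """Group characters by their case-folded form (hypothetical helper).

    A group with three or more members is a candidate "3-way relationship",
    such as Final/Normal/Capital Sigma.
    """
    groups = defaultdict(set)
    for ch in all_chars():
        folded = ch.casefold()
        if folded != ch:
            groups[folded].add(ch)
            # The fold target belongs to its own group when it is one character.
            if len(folded) == 1:
                groups[folded].add(folded)
    return {key: sorted(members)
            for key, members in groups.items()
            if len(members) >= min_size}
```

Calling `fold_groups()` returns a dictionary keyed by folded form, and `expanding_chars()` returns the list of one-to-many cases such as Eszett. The exact contents depend on the Unicode version of the interpreter, so the output would need to be checked against the version targeted by the Tables draft.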
>
> We should definitely not let the current size of Unicode-related
> libraries like ICU affect the decision-making process in IETF. Thin
> clients can always let big servers do the heavy lifting.
>
> Erik
>
> On Thu, Apr 2, 2009 at 8:22 AM, Mark Davis<mark at macchiato.com>  wrote:
>> Mark
>>
>> On Wed, Apr 1, 2009 at 12:51, John C Klensin<klensin at jck.com>  wrote:
>>
>>> We are, of course, there already.  While
>>> NFKC(CaseFold(NFKC(string))) is a good predictor of the
>>> Stringprep mappings, it is not an exact one, and IDNA2003
>>> implementations already need separate tables for NFKC and IDNA.
>> True that they are not exact; but the differences are few, and
>> extremely rare (not even measurable in practice, since their frequency
>> is on a par with random data). Moreover, some implementations already
>> use the latest version of NFKC instead of having special old versions,
>> because the differences are so small. So given the choice of a major
>> breakage or an insignificant breakage, I'd go for the insignificant
>> one.
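
For readers who want to experiment with the NFKC(CaseFold(NFKC(string))) approximation discussed above, here is a minimal sketch. As noted in the thread, it is only a predictor of the Stringprep mappings and not an exact reproduction. It also assumes the Unicode version built into the Python interpreter, not the version IDNA2003 was defined against, and the function name is hypothetical.

```python
import unicodedata


def nfkc_casefold_nfkc(label):
    """Approximate the IDNA2003 mapping step as NFKC(CaseFold(NFKC(label))).

    Assumption: this is only the approximation named in the discussion.
    It does not apply the Stringprep tables themselves, so a small number
    of characters will map differently.
    """
    normalized = unicodedata.normalize("NFKC", label)
    folded = normalized.casefold()  # full Unicode case folding
    return unicodedata.normalize("NFKC", folded)
```

For example, full-width Latin letters are mapped to their ordinary lower-case forms, and Eszett is folded to "ss".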
>>
>>> That is where arguments about complexity get complicated.
>>> IDNA2008, even with contextual rules, is arguably less complex
>>> than IDNA2003 precisely because, other than that handful of
>>> characters, the tables are smaller and the interpretation of an
>>> entry in those tables is "valid" or "not".  By contrast,
>>> IDNA2003 requires a table that is nearly the size of Unicode
>>> with mapping actions for many characters.
>> I have no idea where you are getting your figures, but
>> they are very misleading. I'll assume that was not the intent.
>>
>> Here are the figures I get.
>>
>> PValid or Context:         90262
>> NFKC-Folded, Remapped:      5290
>> NFKC-Lower,  Remapped:      5224
>> NFC-Folded,  Remapped:      2485
>> NFC-Lower,   Remapped:      2394
>>
>> You refer to a "table that is nearly the size of Unicode". If you mean
>> possible Unicode characters, that's over a million. Even if you mean
>> graphic characters, that's somewhat over 100,000
>> (http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[^[:c:]]).
>>
>> NFKC-Folded affects 5,290 characters in Unicode 5.2. That's about 5%
>> of graphic characters: in my book at least, 5% doesn't mean "nearly
>> all". Or maybe you meant something odd like "the size of the table in
>> bytes is nearly as large as the number of Unicode assigned graphic
>> characters".
>>
>> Let's step back a bit. We need to remember that IDNA2008 already
>> requires the data in Tables and NFC (for sizing on that, see
>> http://www.macchiato.com/unicode/nfc-faq). The additional table size
>> for NFKC and Folding is not that big an increase. As a matter of fact,
>> if an implementation is tight on space, then having them available
>> allows it to substantially cut down on the Table size by
>> algorithmically computing
>> http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.2.
>>
>> If you have different figures, it would be useful to put them out.
>>
>>> And, of course, a
>>> transition strategy that preserves full functionality for all
>>> labels that were valid under IDNA2003 means that one has to
>>> support both, which is the most complex option possible.
>> I agree that it has the most overhead, since you have to keep a copy
>> of IDNA2003 around. That's why I favor a cleaner approach.
>>
>>>     john
>>>
>>> _______________________________________________
>>> Idna-update mailing list
>>> Idna-update at alvestrand.no
>>> http://www.alvestrand.no/mailman/listinfo/idna-update
>>>

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst at it.aoyama.ac.jp