Q2: What mapping function should be used in a revised IDNA2008 specification?

Thu Apr 2 18:53:41 CEST 2009

Please, let's not let size and complexity issues derail this IDNAbis
effort. Haste makes waste. IDNA2003 was a good first cut, which took
advantage of several Unicode tables by adopting them wholesale. IDNA2008
is a much more careful effort, with detailed dissection, as you can
see in the Tables draft. We should apply similar care to the "mapping"
table.

I suggest that we come up with principles that we then apply to the
question of mapping. For example, the reason for lower-casing
non-ASCII letters is to compensate for the lack of case-insensitive
matching on the server side. The reason for mapping full-width Latin
to normal is that those characters are easy to type in East Asian
input methods. (Of course, we need to see if there is consensus for
each of these "reasons".)
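
To make the two example reasons concrete, here is a minimal Python
sketch. It is only an illustration, not a proposed mapping: the function
names are made up, and NFKC is used as a stand-in for width folding even
though it maps far more than full-width forms.

```python
import unicodedata


def lower_case(label: str) -> str:
    """Principle 1 (assumed): lower-case letters, ASCII or not."""
    return label.lower()


def fold_width(label: str) -> str:
    """Principle 2 (assumed): map full-width Latin to ordinary Latin.

    NFKC is a stand-in here; it also applies many other compatibility
    mappings that a principled table might not want.
    """
    return unicodedata.normalize("NFKC", label)


def sketch_map(label: str) -> str:
    """Hypothetical combined mapping: lower-case, then fold width."""
    return fold_width(lower_case(label))


if __name__ == "__main__":
    # Full-width "EXAMPLE" and a label with a non-ASCII capital letter.
    for label in ("\uff25\uff38\uff21\uff2d\uff30\uff2c\uff25", "B\u00dcCHER"):
        print(repr(label), "->", repr(sketch_map(label)))
```

Keeping each principle in its own function makes it easy to drop or
replace one of them if there is no consensus for it.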

I also suggest that we automate the process of finding problematic
characters. For example, we have already seen that 3-way relationships
are problematic. One example of this is Final/Normal/Capital Sigma. We
can automatically find these in Unicode's CaseFold tables. We can also
look for cases where one character becomes two when upper- or
lower-cased (e.g. Eszett -> SS).
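
As a rough sketch of what such automation could look like, the Python
below groups characters by their case-folded form and reports groups of
three or more, then lists characters whose upper- or lower-cased form
is longer than one code point. It relies on Python's built-in
str.casefold(), str.upper() and str.lower() and on whatever Unicode
version the interpreter ships, rather than reading CaseFolding.txt
directly; that choice, and the function names, are my assumptions.

```python
import sys
import unicodedata
from collections import defaultdict

# General categories to skip: unassigned, surrogate, private use.
SKIP = {"Cn", "Cs", "Co"}


def assigned_chars():
    """Yield every assigned, non-private-use character."""
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) not in SKIP:
            yield ch


def fold_groups(min_size=3):
    """Map each case-folded form to the characters that fold to it,
    keeping only groups with at least min_size members."""
    groups = defaultdict(list)
    for ch in assigned_chars():
        groups[ch.casefold()].append(ch)
    return {folded: members for folded, members in groups.items()
            if len(members) >= min_size}


def expanding_chars():
    """Map each character whose upper- or lower-cased form has more
    than one code point to its (upper, lower) pair."""
    found = {}
    for ch in assigned_chars():
        upper, lower = ch.upper(), ch.lower()
        if len(upper) > 1 or len(lower) > 1:
            found[ch] = (upper, lower)
    return found


if __name__ == "__main__":
    for folded, members in sorted(fold_groups().items()):
        print(repr(folded), [f"U+{ord(m):04X}" for m in members])
    print(len(expanding_chars()), "characters expand when case-mapped")
```

The output is a candidate list for human review, not a verdict: each
group or expansion still has to be judged against whatever principles
we agree on.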

We should definitely not let the current size of Unicode-related
libraries like ICU affect the decision-making process in IETF. Thin
clients can always let big servers do the heavy lifting.

Erik

On Thu, Apr 2, 2009 at 8:22 AM, Mark Davis <mark at macchiato.com> wrote:
> Mark
>
> On Wed, Apr 1, 2009 at 12:51, John C Klensin <klensin at jck.com> wrote:
>
>> We are, of course, there already.  While
>> NFKC(CaseFold(NFKC(string))) is a good predictor of the
>> Stringprep mappings, it is not an exact one, and IDNA2003
>> implementations already need separate tables for NFKC and IDNA.
>
> True that they are not exact; but the differences are few, and
> extremely rare (not even measurable in practice, since their frequency
> is on a par with random data). Moreover, some implementations already
> use the latest version of NFKC instead of having special old versions,
> because the differences are so small. So given the choice of a major
> breakage or an insignificant breakage, I'd go for the insignificant
> one.
>
>>
>> That is where arguments about complexity get complicated.
>> IDNA2008, even with contextual rules, is arguably less complex
>> than IDNA2003 precisely because, other than that handful of
>> characters, the tables are smaller and the interpretation of an
>> entry in those tables is "valid" or "not".  By contrast,
>> IDNA2003 requires a table that is nearly the size of Unicode
>> with mapping actions for many characters.
>
> I have no idea where you are getting your figures, but they are very
> misleading. I'll assume that was not the intent.
>
> Here are the figures I get.
>
> PValid or Context:         90262
> NFKC-Folded, Remapped:      5290
> NFKC-Lower,  Remapped:      5224
> NFC-Folded,  Remapped:      2485
> NFC-Lower,   Remapped:      2394
>
> A "table that is nearly the size of Unicode"? If you mean possible
> Unicode code points, that's over a million. Even if you mean graphic
> characters, that's somewhat over 100,000
> (http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[^[:c:]]).
>
> NFKC-Folded affects 5,290 characters in Unicode 5.2. That's about 5%
> of graphic characters: in my book at least, 5% doesn't mean "nearly
> all". Or maybe you meant something odd like "the size of the table in
> bytes is nearly as large as the number of Unicode assigned graphic
> characters".
>
> Let's step back a bit. We need to remember that IDNA2008 already
> requires the data in Tables and NFC (for sizing on that, see
> http://www.macchiato.com/unicode/nfc-faq). The additional table size
> for NFKC and Folding is not that big an increase. As a matter of fact,
> if an implementation is tight on space, then having them available
> allows it to substantially cut down on the Table size by
> algorithmically computing the set described in
> http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.2.
>
> If you have different figures, it would be useful to put them out.
>
>> And, of course, a
>> transition strategy that preserves full functionality for all
>> labels that were valid under IDNA2003 means that one has to
>> support both, which is the most complex option possible.
>
> I agree that it has the most overhead, since you have to keep a copy
> of IDNA2003 around. That's why I favor a cleaner approach.
>
>>
>>    john