Q2: What mapping function should be used in a revised IDNA2008 specification?

Thu Apr 2 20:27:20 CEST 2009

I modified the program to add a comparison to IDNA2003. I am only including
cases where the mapping results in A-Label characters. The numbers within
and across rows don't add up as you might expect because of various overlaps
and because only mappings to A-Label characters are counted.

Most of the differences between NFKC-CF-RDI and IDNA2003 are new Unicode 5.2
characters; there are 5 diverging mappings. (As I said before, these figures
don't include the current list of special cases: eszett, final_sigma,
joiners.)
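
For anyone who wants to see why eszett and final sigma end up on the
special-case list, here is a minimal Python sketch. It relies only on the
standard str.casefold() and str.lower(); the helper names fold_expands and
fold_collisions are made up for this illustration and are not part of the
program that produced the figures. Results depend on the Unicode version
bundled with the Python interpreter, not on Unicode 5.2.

```python
def fold_expands(ch):
    """True if case folding turns one character into more than one."""
    return len(ch.casefold()) > 1


def fold_collisions(code_points, minimum=3):
    """Group characters by case-folded form; keep groups of `minimum` or more.

    A group of three or more is the kind of 3-way relationship seen with
    final/normal/capital sigma.
    """
    groups = {}
    for cp in code_points:
        ch = chr(cp)
        groups.setdefault(ch.casefold(), []).append(ch)
    return {folded: chars for folded, chars in groups.items()
            if len(chars) >= minimum}


if __name__ == "__main__":
    # Eszett: folding expands it, lowercasing leaves it alone.
    print(fold_expands("\u00df"), "\u00df".lower() == "\u00df")
    # Greek and Coptic block: sigma shows up as a 3-way group.
    for folded, chars in fold_collisions(range(0x0370, 0x0400)).items():
        print(folded, [f"U+{ord(c):04X}" for c in chars])
```

The joiners (U+200C, U+200D) do not show up in either check because case
folding leaves them unchanged; they matter only once default ignorables
are removed.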

PValid or Context: 90262
IDNA2003,    Remapped:    4337
NFKC-CF-RDI,    Remapped:    5291,    Diverging:    5
NFKC-LC-RDI,    Remapped:    5225,    Diverging:    77
NFKC-CF,    Remapped:    4896,    Diverging:    32
NFKC-LC,    Remapped:    4830,    Diverging:    104
NFC-CF-RDI,    Remapped:    2486,    Diverging:    2663
NFC-LC-RDI,    Remapped:    2395,    Diverging:    2754
NFC-CF,    Remapped:    2091,    Diverging:    2690
NFC-LC,    Remapped:    2000,    Diverging:    2781
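
As a rough illustration of what the rows above are counting, here is a
minimal Python sketch of the mapping variants (CF = case fold, LC =
lowercase, RDI = remove default ignorables). It is not the program that
produced the figures. Assumptions: the order of steps (normalize, case
map, remove ignorables, normalize again) is my guess at a reasonable
pipeline; the default-ignorable ranges are a hand-copied approximation,
since Python's unicodedata does not expose that property; and the names
map_label and count_remapped are hypothetical.

```python
import unicodedata

# Approximate Default_Ignorable_Code_Point ranges (hand-copied assumption).
_DEFAULT_IGNORABLE = [
    (0x00AD, 0x00AD), (0x034F, 0x034F), (0x061C, 0x061C),
    (0x115F, 0x1160), (0x17B4, 0x17B5), (0x180B, 0x180F),
    (0x200B, 0x200F), (0x202A, 0x202E), (0x2060, 0x206F),
    (0x3164, 0x3164), (0xFE00, 0xFE0F), (0xFEFF, 0xFEFF),
    (0xFFA0, 0xFFA0), (0xFFF0, 0xFFF8), (0x1BCA0, 0x1BCA3),
    (0x1D173, 0x1D17A), (0xE0000, 0xE0FFF),
]


def is_default_ignorable(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _DEFAULT_IGNORABLE)


def map_label(text, form="NFKC", case="CF", rdi=True):
    """Apply one mapping variant, e.g. NFKC-CF-RDI or NFC-LC."""
    text = unicodedata.normalize(form, text)
    text = text.casefold() if case == "CF" else text.lower()
    if rdi:
        text = "".join(ch for ch in text if not is_default_ignorable(ch))
    return unicodedata.normalize(form, text)


def count_remapped(code_points, **variant):
    """Count assigned code points that the variant changes in any way."""
    count = 0
    for cp in code_points:
        ch = chr(cp)
        # Skip unassigned, surrogate, and private-use code points.
        if unicodedata.category(ch) in ("Cn", "Cs", "Co"):
            continue
        if map_label(ch, **variant) != ch:
            count += 1
    return count


if __name__ == "__main__":
    for form in ("NFKC", "NFC"):
        for case in ("CF", "LC"):
            for rdi in (True, False):
                name = f"{form}-{case}" + ("-RDI" if rdi else "")
                n = count_remapped(range(0x110000),
                                   form=form, case=case, rdi=rdi)
                print(f"{name},\tRemapped:\t{n}")
```

The counts this prints will not match the figures above: it uses whatever
Unicode version ships with the Python interpreter rather than 5.2, and it
counts every changed code point instead of only those whose mapping lands
on A-Label characters.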

Mark

On Thu, Apr 2, 2009 at 10:07, Mark Davis <mark at macchiato.com> wrote:

> It would be possible to do a Tables section for mappings, that went through
> the same kind of process that we did for Tables, of fine tuning the mapping.
> That is, we could go through all of the mappings and figure out which ones
> we need, and which ones we don't.
>
> Frankly, I don't think we need to go through the effort. The only problem I
> see is where a disallowed character X looks most like one PVALID character
> P1, but maps to a different PVALID character P2, and P1 is not confusable
> with P2 already. I don't know of any cases like that.
>
> BTW, my earlier figures were including the "Remove Default Ignorables" from
> my earlier mail. Here are the figures with that broken out:
>
> NFKC-CF-RDI,    Remapped:    5290
> NFKC-LC-RDI,    Remapped:    5224
> NFKC-CF,    Remapped:    4896
> NFKC-LC,    Remapped:    4830
> NFC-CF-RDI,    Remapped:    2485
> NFC-LC-RDI,    Remapped:    2394
> NFC-CF,    Remapped:    2091
> NFC-LC,    Remapped:    2000
>
> CF = Unicode toCaseFold
> LC = Unicode toLowercase
> RDI = Remove default ignorables
>
> And of course, the mappings would be restricted to only mapping characters
> that were not PVALID in any event, so the above figures would vary depending
> on what we end up with there.
>
> Mark
>
>
>
> On Thu, Apr 2, 2009 at 09:53, Erik van der Poel <erikv at google.com> wrote:
>
>> Please, let's not let size and complexity issues derail this IDNAbis
>> effort. Haste makes waste. IDNA2003 was a good first cut, that took
>> advantage of several Unicode tables, adopting them wholesale. IDNA2008
>> is a much more careful effort, with detailed dissection, as you can
>> see in the Table draft. We should apply similar care to the "mapping"
>> table.
>>
>> I suggest that we come up with principles, that we then apply to the
>> question of mapping. For example, the reason for lower-casing
>> non-ASCII letters is to compensate for the lack of matching on the
>> server side. The reason for mapping full-width Latin to normal is
>> because it is easy to type those characters in East Asian input
>> methods. (Of course, we need to see if there is consensus, for each of
>> these "reasons".)
>>
>> I also suggest that we automate the process of finding problematic
>> characters. For example, we have already seen that 3-way relationships
>> are problematic. One example of this is Final/Normal/Capital Sigma. We
>> can automatically find these in Unicode's CaseFold tables. We can also
>> look for cases where one character becomes two when upper- or
>> lower-cased (e.g. Eszett -> SS).
>>
>> We should definitely not let the current size of Unicode-related
>> libraries like ICU affect the decision-making process in IETF. Thin
>> clients can always let big servers do the heavy lifting.
>>
>> Erik
>>
>> On Thu, Apr 2, 2009 at 8:22 AM, Mark Davis <mark at macchiato.com> wrote:
>> > Mark
>> >
>> > On Wed, Apr 1, 2009 at 12:51, John C Klensin <klensin at jck.com> wrote:
>> >
>> >> We are, of course, there already.  While
>> >> NFKC(CaseFold(NFKC(string))) is a good predictor of the
>> >> Stringprep mappings, it is not an exact one, and IDNA2003
>> >> implementations already need separate tables for NFKC and IDNA.
>> >
>> > True that they are not exact; but the differences are few, and
>> > extremely rare (not even measurable in practice, since their frequency
>> > is on a par with random data). Moreover, some implementations already
>> > use the latest version of NFKC instead of having special old versions,
>> > because the differences are so small. So given the choice of a major
>> > breakage or an insignificant breakage, I'd go for the insignificant
>> > one.
>> >
>> >>
>> >> That is where arguments about complexity get complicated.
>> >> IDNA2008, even with contextual rules, is arguably less complex
>> >> than IDNA2003 precisely because, other than that handful of
>> >> characters, the tables are smaller and the interpretation of an
>> >> entry in those tables is "valid" or "not".  By contrast,
>> >> IDNA2003 requires a table that is nearly the size of Unicode
>> >> with mapping actions for many characters.
>> >
>> > I have just no idea whatever where you are getting your figures, but
>> > they are very misleading. I'll assume that was not the intent.
>> >
>> > Here are the figures I get.
>> >
>> > PValid or Context: 90262
>> > NFKC-Folded,    Remapped:       5290
>> > NFKC-Lower,     Remapped:       5224
>> > NFC-Folded,     Remapped:       2485
>> > NFC-Lower,      Remapped:       2394
>> >
>> > A "table that is nearly the size of Unicode". If you mean possible
>> > Unicode characters, that's over a million. Even if you mean graphic
>> > characters, that's somewhat over 100,000
>> > (http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[^[:c:]]).
>> >
>> > NFKC-Folded affects 5,290 characters in Unicode 5.2. That's about 5%
>> > of graphic characters: in my book at least, 5% doesn't mean "nearly
>> > all". Or maybe you meant something odd like "the size of the table in
>> > bytes is nearly as large as the number of Unicode assigned graphic
>> > characters".
>> >
>> > Let's step back a bit. We need to remember that IDNA2008 already
>> > requires the data in Tables and NFC (for sizing on that, see
>> > http://www.macchiato.com/unicode/nfc-faq). The additional table size
>> > for NFKC and Folding is not that big an increase. As a matter of fact,
>> > if an implementation is tight on space, then having them available
>> > allows it to substantially cut down on the Table size by
>> > algorithmically computing
>> > http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.2.
>> >
>> > If you have different figures, it would be useful to put them out.
>> >
>> >> And, of course, a
>> >> transition strategy that preserves full functionality for all
>> >> labels that were valid under IDNA2003 means that one has to
>> >> support both, which is the most complex option possible.
>> >
>> > I agree that it has the most overhead, since you have to keep a copy
>> > of IDNA2003 around. That's why I favor a cleaner approach.
>> >
>> >>
>> >>    john
>> >>
>> >> _______________________________________________
>> >> Idna-update mailing list
>> >> Idna-update at alvestrand.no
>> >> http://www.alvestrand.no/mailman/listinfo/idna-update
>> >>
>> >
>>
>
>