Q2: What mapping function should be used in a revised IDNA2008 specification?

Thu Apr 2 19:07:40 CEST 2009

It would be possible to write a Tables-style section for mappings and put it
through the same kind of process we used for Tables, fine-tuning the mapping.
That is, we could go through all of the mappings and figure out which ones
we need and which ones we don't.

Frankly, I don't think we need to go through the effort. The only problem I
see is where a disallowed character X looks most like one PVALID character
P1, but maps to a different PVALID character P2, and P1 is not confusable
with P2 already. I don't know of any cases like that.

BTW, my earlier figures included the "Remove Default Ignorables" step from
my earlier mail. Here are the figures with that broken out:

NFKC-CF-RDI,    Remapped:    5290
NFKC-LC-RDI,    Remapped:    5224
NFKC-CF,    Remapped:    4896
NFKC-LC,    Remapped:    4830
NFC-CF-RDI,    Remapped:    2485
NFC-LC-RDI,    Remapped:    2394
NFC-CF,    Remapped:    2091
NFC-LC,    Remapped:    2000

CF = Unicode toCaseFold
LC = Unicode toLowercase
RDI = Remove default ignorables

And of course, the mappings would be restricted to only mapping characters
that were not PVALID in any event, so the above figures would vary depending
on what we end up with there.
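
As a rough illustration of how counts like these could be produced, here is a
minimal sketch using Python's standard unicodedata module. It assumes the
NFKC(CaseFold(NFKC(x))) shape quoted further down in this thread; the names
map_char and count_remapped are made up for this example. It does not
implement the RDI step or the PVALID restriction, and its counts depend on
the Unicode version bundled with the Python in use, so it should not be
expected to reproduce the figures above.

```python
import unicodedata


def map_char(ch, form="NFKC", case="CF"):
    """Normalize, apply case mapping, then normalize again.

    form: "NFKC" or "NFC"; case: "CF" (case fold) or "LC" (lowercase).
    """
    s = unicodedata.normalize(form, ch)
    s = s.casefold() if case == "CF" else s.lower()
    return unicodedata.normalize(form, s)


def count_remapped(form, case, start=0, stop=0x110000):
    """Count assigned code points in [start, stop) that the mapping changes."""
    count = 0
    for cp in range(start, stop):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogates
        ch = chr(cp)
        if unicodedata.category(ch) == "Cn":
            continue  # skip unassigned code points
        if map_char(ch, form, case) != ch:
            count += 1
    return count


if __name__ == "__main__":
    for form in ("NFKC", "NFC"):
        for case in ("CF", "LC"):
            print(f"{form}-{case}, Remapped: {count_remapped(form, case)}")
```

Restricting the loop to code points that are not PVALID would need the
derived property data from the Tables draft, which is left out here.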

Mark

On Thu, Apr 2, 2009 at 09:53, Erik van der Poel <erikv at google.com> wrote:

> Please, let's not let size and complexity issues derail this IDNAbis
> effort. Haste makes waste. IDNA2003 was a good first cut, that took
> advantage of several Unicode tables, adopting them wholesale. IDNA2008
> is a much more careful effort, with detailed dissection, as you can
> see in the Table draft. We should apply similar care to the "mapping"
> table.
>
> I suggest that we come up with principles, that we then apply to the
> question of mapping. For example, the reason for lower-casing
> non-ASCII letters is to compensate for the lack of matching on the
> server side. The reason for mapping full-width Latin to normal is
> because it is easy to type those characters in East Asian input
> methods. (Of course, we need to see if there is consensus, for each of
> these "reasons".)
>
> I also suggest that we automate the process of finding problematic
> characters. For example, we have already seen that 3-way relationships
> are problematic. One example of this is Final/Normal/Capital Sigma. We
> can automatically find these in Unicode's CaseFold tables. We can also
> look for cases where one character becomes two when upper- or
> lower-cased (e.g. Eszett -> SS).
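
A minimal sketch of the automated search described in the paragraph above,
assuming Python's built-in str.casefold(), str.upper() and str.lower() as
stand-ins for the Unicode case tables. The function names are hypothetical,
and the results depend on the Unicode version bundled with the Python in use.

```python
import unicodedata
from collections import defaultdict


def assigned_chars(start=0, stop=0x110000):
    """Yield assigned, non-surrogate characters in [start, stop)."""
    for cp in range(start, stop):
        if 0xD800 <= cp <= 0xDFFF:
            continue
        ch = chr(cp)
        if unicodedata.category(ch) != "Cn":
            yield ch


def fold_groups(min_size=3, start=0, stop=0x110000):
    """Group characters by case-folded form; keep groups of min_size or more."""
    groups = defaultdict(list)
    for ch in assigned_chars(start, stop):
        groups[ch.casefold()].append(ch)
    return {k: v for k, v in groups.items() if len(v) >= min_size}


def one_to_many(start=0, stop=0x110000):
    """Characters whose upper- or lower-cased form is more than one character."""
    return [ch for ch in assigned_chars(start, stop)
            if len(ch.upper()) > 1 or len(ch.lower()) > 1]


if __name__ == "__main__":
    # Greek block: the Final/Normal/Capital Sigma group shows up here.
    for folded, members in fold_groups(start=0x370, stop=0x400).items():
        print(folded, [f"U+{ord(m):04X}" for m in members])
    # Latin-1 Supplement: Eszett uppercases to "SS".
    print([f"U+{ord(c):04X}" for c in one_to_many(0x80, 0x100)])
```

Groups of three or more characters sharing one folded form are the 3-way
relationships; the second function lists the one-to-many case mappings.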
>
> We should definitely not let the current size of Unicode-related
> libraries like ICU affect the decision-making process in IETF. Thin
> clients can always let big servers do the heavy lifting.
>
> Erik
>
> On Thu, Apr 2, 2009 at 8:22 AM, Mark Davis <mark at macchiato.com> wrote:
> > Mark
> >
> > On Wed, Apr 1, 2009 at 12:51, John C Klensin <klensin at jck.com> wrote:
> >
> >> We are, of course, there already.  While
> >> NFKC(CaseFold(NFKC(string))) is a good predictor of the
> >> Stringprep mappings, it is not an exact one, and IDNA2003
> >> implementations already need separate tables for NFKC and IDNA.
> >
> > True that they are not exact; but the differences are few, and
> > extremely rare (not even measurable in practice, since their frequency
> > is on a par with random data). Moreover, some implementations already
> > use the latest version of NFKC instead of having special old versions,
> > because the differences are so small. So given the choice of a major
> > breakage or an insignificant breakage, I'd go for the insignificant
> > one.
> >
> >>
> >> That is where arguments about complexity get complicated.
> >> IDNA2008, even with contextual rules, is arguably less complex
> >> than IDNA2003 precisely because, other than that handful of
> >> characters, the tables are smaller and the interpretation of an
> >> entry in those tables is "valid" or "not".  By contrast,
> >> IDNA2003 requires a table that is nearly the size of Unicode
> >> with mapping actions for many characters.
> >
> > I have no idea where you are getting your figures, but
> > they are very misleading. I'll assume that was not the intent.
> >
> > Here are the figures I get.
> >
> > PValid or Context: 90262
> > NFKC-Folded,    Remapped:       5290
> > NFKC-Lower,     Remapped:       5224
> > NFC-Folded,     Remapped:       2485
> > NFC-Lower,      Remapped:       2394
> >
> > A "table that is nearly the size of Unicode". If you mean possible
> > Unicode characters, that's over a million. Even if you mean graphic
> > characters, that's somewhat over 100,000
> > (http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[^[:c:]]).
> >
> > NFKC-Folded affects 5,290 characters in Unicode 5.2. That's about 5%
> > of graphic characters: in my book at least, 5% doesn't mean "nearly
> > all". Or maybe you meant something odd like "the size of the table in
> > bytes is nearly as large as the number of Unicode assigned graphic
> > characters".
> >
> > Let's step back a bit. We need to remember that IDNA2008 already
> > requires the data in Tables and NFC (for sizing on that, see
> > http://www.macchiato.com/unicode/nfc-faq). The additional table size
> > for NFKC and Folding is not that big an increase. As a matter of fact,
> > if an implementation is tight on space, then having them available
> > allows it to substantially cut down on the Table size by
> > algorithmically computing what is specified in
> > http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.2.
> >
> > If you have different figures, it would be useful to put them out.
> >
> >> And, of course, a
> >> transition strategy that preserves full functionality for all
> >> labels that were valid under IDNA2003 means that one has to
> >> support both, which is the most complex option possible.
> >
> > I agree that it has the most overhead, since you have to keep a copy
> > of IDNA2003 around. That's why I favor a cleaner approach.
> >
> >>
> >>    john
> >>
> >
>