Q2: What mapping function should be used in a revised IDNA2008 specification?

Mark Davis mark at macchiato.com
Thu Apr 2 17:22:45 CEST 2009


Mark

On Wed, Apr 1, 2009 at 12:51, John C Klensin <klensin at jck.com> wrote:

> We are, of course, there already.  While
> NFKC(CaseFold(NFKC(string))) is a good predictor of the
> Stringprep mappings, it is not an exact one, and IDNA2003
> implementations already need separate tables for NFKC and IDNA.

True, they are not exact; but the differences are few and extremely
rare (not even measurable in practice, since their frequency is on a
par with random data). Moreover, some implementations already use the
latest version of NFKC instead of having special old versions,
because the differences are so small. So given the choice of a major
breakage or an insignificant breakage, I'd go for the insignificant
one.

>
> That is where arguments about complexity get complicated.
> IDNA2008, even with contextual rules, is arguably less complex
> than IDNA2003 precisely because, other than that handful of
> characters, the tables are smaller and the interpretation of an
> entry in those tables is "valid" or "not".  By contrast,
> IDNA2003 requires a table that is nearly the size of Unicode
> with mapping actions for many characters.

I have no idea where you are getting your figures, but they are very
misleading. I'll assume that was not the intent.

Here are the figures I get.

PValid or Context:        90262
NFKC-Folded, Remapped:     5290
NFKC-Lower,  Remapped:     5224
NFC-Folded,  Remapped:     2485
NFC-Lower,   Remapped:     2394
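
For anyone who wants to reproduce counts of this kind, here is a rough
Python sketch. It assumes that "Remapped" means a code point that is
changed by normalize(case(normalize(cp))); that reading, the function
name remapped_count, and the choice to skip unassigned, surrogate and
private-use code points are assumptions of this sketch, not something
stated above. It also uses whatever Unicode version ships with the
Python interpreter, so the totals will not match the Unicode 5.2
figures exactly.

```python
import sys
import unicodedata


def remapped_count(form, fold, code_points=range(sys.maxunicode + 1)):
    """Count code points changed by normalize(form, case(normalize(form, cp))).

    form: "NFC" or "NFKC".
    fold: True to use full case folding, False to use lowercasing.
    code_points: iterable of integer code points to examine.
    """
    count = 0
    for cp in code_points:
        ch = chr(cp)
        # Assumption: unassigned (Cn), surrogate (Cs) and private-use (Co)
        # code points are not of interest and are skipped.
        if unicodedata.category(ch) in ("Cn", "Cs", "Co"):
            continue
        step = unicodedata.normalize(form, ch)
        step = step.casefold() if fold else step.lower()
        if unicodedata.normalize(form, step) != ch:
            count += 1
    return count


if __name__ == "__main__":
    for form in ("NFKC", "NFC"):
        for fold in (True, False):
            label = f"{form}-{'Folded' if fold else 'Lower'}"
            print(label, remapped_count(form, fold))
```

Scanning the full code point range takes a few seconds; pass a smaller
range as code_points to spot-check a block.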

You say a "table that is nearly the size of Unicode". If you mean
possible Unicode code points, that's over a million. Even if you mean
graphic characters, that's somewhat over 100,000
(http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[^[:c:]]).

NFKC-Folded affects 5,290 characters in Unicode 5.2. That's about 5%
of graphic characters: in my book at least, 5% doesn't mean "nearly
all". Or maybe you meant something odd like "the size of the table in
bytes is nearly as large as the number of Unicode assigned graphic
characters".
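
The arithmetic behind the "about 5%" can be checked with a few lines
of Python. The denominator is an assumption: "somewhat over 100,000"
is rounded down to 100,000 here, so the percentages come out slightly
high rather than slightly low. The names REMAPPED, GRAPHIC_APPROX and
share are made up for this sketch.

```python
# Figures quoted in this message (Unicode 5.2).
REMAPPED = {
    "NFKC-Folded": 5290,
    "NFKC-Lower": 5224,
    "NFC-Folded": 2485,
    "NFC-Lower": 2394,
}

# Assumption: "somewhat over 100,000" graphic characters, rounded down.
GRAPHIC_APPROX = 100_000


def share(count, total=GRAPHIC_APPROX):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


if __name__ == "__main__":
    for name, count in REMAPPED.items():
        print(f"{name}: {share(count)}% of graphic characters")
```

Even the largest of the four mappings stays in the neighborhood of 5%
of graphic characters, and the NFC-based ones are about half that.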

Let's step back a bit. We need to remember that IDNA2008 already
requires the data in Tables and NFC (for sizing on that, see
http://www.macchiato.com/unicode/nfc-faq). The additional table size
for NFKC and case folding is not that big an increase. As a matter of
fact, if an implementation is tight on space, then having them
available allows it to substantially cut down on the Tables data by
algorithmically computing what is defined in
http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.2.
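
To make the space argument concrete, here is a hedged Python sketch.
It assumes that the linked section defines the "Unstable" category as
NFKC(CaseFold(NFKC(cp))) != cp, which is how the later published
Tables document words it; if the draft differs, adjust accordingly.
The names is_unstable and prune_table, and the idea of a stored set
of disallowed code points, are hypothetical and only for illustration.

```python
import unicodedata


def is_unstable(cp):
    """True if the code point is changed by NFKC(CaseFold(NFKC(cp)))."""
    ch = chr(cp)
    once = unicodedata.normalize("NFKC", ch)
    mapped = unicodedata.normalize("NFKC", once.casefold())
    return mapped != ch


def prune_table(stored):
    """Drop entries that can be recomputed with is_unstable().

    stored: a set of code points that a hypothetical implementation
    keeps in an explicit "disallowed" table.
    """
    return {cp for cp in stored if not is_unstable(cp)}


if __name__ == "__main__":
    # U+0041 and U+FF21 are unstable, so they need not be stored;
    # U+0020 is stable and must stay in the explicit table.
    print(sorted(hex(cp) for cp in prune_table({0x0041, 0xFF21, 0x0020})))
```

An implementation that already carries NFKC and case-folding data can
answer the "unstable" question at lookup time, so only the entries
that survive prune_table need to be shipped as table data.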

If you have different figures, it would be useful to post them.

> And, of course, a
> transition strategy that preserves full functionality for all
> labels that were valid under IDNA2003 means that one has to
> support both, which is the most complex option possible.

I agree that it has the most overhead, since you have to keep a copy
of IDNA2003 around. That's why I favor a cleaner approach.

>
>    john

