Q2: What mapping function should be used in a revised IDNA2008 specification?

Erik van der Poel erikv at google.com
Thu Apr 2 22:21:35 CEST 2009


I am quite strongly opposed to multiple lookups. I think clients
should simply try their current IDNA version and, if that fails, stop,
rather than trying the previous version of IDNA. This will push file
maintainers to fix or remove the stored domain names that stop
working. No pain, no gain.

Having said that, we still need to decide what to do about Eszett,
Final Sigma, ZWJ and ZWNJ. A number of us are still researching these
characters, but I fear that these four may be the only ones where the
Web protocol stack may "fork" from the rest, whatever "the rest" may
be.

And this "fork" may manifest itself in the form that John mentioned,
i.e. protocol-specific mappings at the IRI/URI boundary. Or we may see
the same mappings in http: and mailto: IRIs; I don't know.

Erik

On Thu, Apr 2, 2009 at 12:49 PM, Shawn Steele
<Shawn.Steele at microsoft.com> wrote:
> I'm concerned about additional mappings. If we have to fall back from 2008 to 2003, it's annoying, but what about 2012 to 2008 to 2003?
>
> Also, such fallback would depend on how a number of different clients approached the problem. That'd probably cause lots of variation.
>
> -shawn
>
> Sent from my AT&T Samsung i907 Windows Mobile® Smartphone.
>
> -----Original Message-----
> From: Erik van der Poel <erikv at google.com>
> Sent: Thursday, April 02, 2009 11:54 AM
> To: Mark Davis <mark at macchiato.com>
> Cc: Vint Cerf <vint at google.com>; idna-update at alvestrand.no <idna-update at alvestrand.no>; Martin J. Dürst <duerst at it.aoyama.ac.jp>; John C Klensin <klensin at jck.com>
> Subject: Re: Q2: What mapping function should be used in a revised IDNA2008 specification?
>
>
> Please, let's not let size and complexity issues derail this IDNAbis
> effort. Haste makes waste. IDNA2003 was a good first cut that took
> advantage of several Unicode tables, adopting them wholesale. IDNA2008
> is a much more careful effort, with detailed dissection, as you can
> see in the Tables draft. We should apply similar care to the "mapping"
> table.
>
> I suggest that we come up with principles that we then apply to the
> question of mapping. For example, the reason for lower-casing
> non-ASCII letters is to compensate for the lack of case-insensitive
> matching on the server side. The reason for mapping full-width Latin
> to normal Latin is that those characters are easy to type in East
> Asian input methods. (Of course, we need to see if there is consensus
> for each of these "reasons".)
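To make those two mappings concrete, here is an illustrative Python sketch (not from any draft; the example strings are mine) of what NFKC does to full-width Latin and what simple lower-casing does for non-ASCII letters:

```python
import unicodedata

# Full-width Latin letters, as an East Asian input method might
# produce them (U+FF45 etc.); NFKC maps them to ordinary ASCII.
fullwidth = "ｅｘａｍｐｌｅ"
print(unicodedata.normalize("NFKC", fullwidth))  # -> example

# Lower-casing non-ASCII letters on the client side compensates
# for servers that match only ASCII case-insensitively.
print("ÉXAMPLE".lower())  # -> éxample
```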
>
> I also suggest that we automate the process of finding problematic
> characters. For example, we have already seen that 3-way relationships
> are problematic. One example of this is Final/Normal/Capital Sigma. We
> can automatically find these in Unicode's CaseFold tables. We can also
> look for cases where one character becomes two when upper- or
> lower-cased (e.g. Eszett -> SS).
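The kind of automated scan described above can be sketched in a few lines of Python, using the interpreter's built-in case-folding data rather than parsing Unicode's CaseFolding.txt directly (illustrative only; the variable names are mine):

```python
import sys

# 1. Characters whose case fold expands to more than one character
#    (e.g. Eszett: 'ß'.casefold() == 'ss').
expanding = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if len(chr(cp).casefold()) > 1
]

# 2. Distinct characters that fold to the same result, exposing
#    many-to-one relationships like Final/Normal/Capital Sigma.
folded = {}
for cp in range(sys.maxunicode + 1):
    folded.setdefault(chr(cp).casefold(), []).append(chr(cp))

sigma_group = folded["σ"]  # contains 'Σ', 'σ' and 'ς'
print("ß" in expanding)    # -> True
```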
>
> We should definitely not let the current size of Unicode-related
> libraries like ICU affect the decision-making process in the IETF.
> Thin clients can always let big servers do the heavy lifting.
>
> Erik
>
> On Thu, Apr 2, 2009 at 8:22 AM, Mark Davis <mark at macchiato.com> wrote:
>> Mark
>>
>> On Wed, Apr 1, 2009 at 12:51, John C Klensin <klensin at jck.com> wrote:
>>
>>> We are, of course, there already.  While
>>> NFKC(CaseFold(NFKC(string))) is a good predictor of the
>>> Stringprep mappings, it is not an exact one, and IDNA2003
>>> implementations already need separate tables for NFKC and IDNA.
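For concreteness, the composite mapping John names can be sketched in Python as follows (this uses the interpreter's current Unicode tables, which is exactly why it is only an approximation of the Unicode 3.2-based Stringprep mapping):

```python
import unicodedata

def nfkc_casefold(s: str) -> str:
    # NFKC(CaseFold(NFKC(s))): a good -- but not exact --
    # predictor of the IDNA2003/Stringprep mappings.
    t = unicodedata.normalize("NFKC", s)
    t = t.casefold()  # full Unicode case folding
    return unicodedata.normalize("NFKC", t)

print(nfkc_casefold("Straße"))   # -> strasse
print(nfkc_casefold("Ｅｘample"))  # -> example
```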
>>
>> True, they are not exact; but the differences are few and
>> extremely rare (not even measurable in practice, since their
>> frequency is on a par with random data). Moreover, some
>> implementations already use the latest version of NFKC instead of
>> keeping special old versions, because the differences are so small.
>> So given the choice between a major breakage and an insignificant
>> breakage, I'd go for the insignificant one.
>>
>>>
>>> That is where arguments about complexity get complicated.
>>> IDNA2008, even with contextual rules, is arguably less complex
>>> than IDNA2003 precisely because, other than that handful of
>>> characters, the tables are smaller and the interpretation of an
>>> entry in those tables is "valid" or "not".  By contrast,
>>> IDNA2003 requires a table that is nearly the size of Unicode
>>> with mapping actions for many characters.
>>
>> I have no idea where you are getting your figures, but
>> they are very misleading. I'll assume that was not the intent.
>>
>> Here are the figures I get.
>>
>> PValid or Context: 90262
>> NFKC-Folded,    Remapped:       5290
>> NFKC-Lower,     Remapped:       5224
>> NFC-Folded,     Remapped:       2485
>> NFC-Lower,      Remapped:       2394
>>
>> A "table that is nearly the size of Unicode". If you mean possible
>> Unicode characters, that's over a million. Even if you mean graphic
>> characters, that's somewhat over 100,000
>> (http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[^[:c:]]).
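The UnicodeSet query above can be reproduced with a short Python loop over the whole code space, counting code points whose general category is not one of the C* (Other) categories (a sketch only; the exact count depends on the Unicode version your interpreter ships):

```python
import sys
import unicodedata

# Count graphic characters: every code point whose general category
# is not Cc, Cf, Cs, Co, or Cn -- i.e. the UnicodeSet [^[:c:]].
graphic = sum(
    1
    for cp in range(sys.maxunicode + 1)
    if not unicodedata.category(chr(cp)).startswith("C")
)
print(graphic)  # somewhat over 100,000 in recent Unicode versions
```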
>>
>> NFKC-Folded affects 5,290 characters in Unicode 5.2. That's about 5%
>> of graphic characters: in my book at least, 5% doesn't mean "nearly
>> all". Or maybe you meant something odd like "the size of the table in
>> bytes is nearly as large as the number of Unicode assigned graphic
>> characters".
>>
>> Let's step back a bit. We need to remember that IDNA2008 already
>> requires the data in Tables and NFC (for sizing on that, see
>> http://www.macchiato.com/unicode/nfc-faq). The additional table size
>> for NFKC and Folding is not that big an increase. As a matter of fact,
>> if an implementation is tight on space, then having them available
>> allows it to substantially cut down on the Table size by
>> algorithmically computing
>> http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.2.
>>
>> If you have different figures, it would be useful to put them out.
>>
>>> And, of course, a
>>> transition strategy that preserves full functionality for all
>>> labels that were valid under IDNA2003 means that one has to
>>> support both, which is the most complex option possible.
>>
>> I agree that it has the most overhead, since you have to keep a copy
>> of IDNA2003 around. That's why I favor a cleaner approach.
>>
>>>
>>>    john
>>>
>>> _______________________________________________
>>> Idna-update mailing list
>>> Idna-update at alvestrand.no
>>> http://www.alvestrand.no/mailman/listinfo/idna-update
>>>
>>