Q2: What mapping function should be used in a revised IDNA2008 specification?

Erik van der Poel erikv at google.com
Thu Apr 2 19:50:48 CEST 2009


It may not be necessary to do a character-by-character analysis of NFKC.
We may be able to select a small number of the NFKC compatibility
decomposition tags:

<font>      A font variant (e.g. a blackletter form).
<noBreak>   A no-break version of a space or hyphen.
<initial>   An initial presentation form (Arabic).
<medial>    A medial presentation form (Arabic).
<final>     A final presentation form (Arabic).
<isolated>  An isolated presentation form (Arabic).
<circle>    An encircled form.
<super>     A superscript form.
<sub>       A subscript form.
<vertical>  A vertical layout presentation form.
<wide>      A wide (or zenkaku) compatibility character.
<narrow>    A narrow (or hankaku) compatibility character.
<small>     A small variant form (CNS compatibility).
<square>    A CJK squared font variant.
<fraction>  A vulgar fraction form.
<compat>    Otherwise unspecified compatibility character.

Of these, I would suggest that <wide> and <narrow> are needed for East
Asian input methods.
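
For illustration, here is a rough Python sketch of such a tag-driven
mapping (the map_selected helper is hypothetical; it uses the standard
unicodedata module, whose decomposition() function exposes these tags):

    import unicodedata

    # Tags whose compatibility mappings we would apply (assumption:
    # only <wide> and <narrow>, per the suggestion above).
    SELECTED_TAGS = ("<wide>", "<narrow>")

    def map_selected(label):
        out = []
        for ch in label:
            # decomposition() returns e.g. "<wide> 0057" for U+FF37.
            decomp = unicodedata.decomposition(ch)
            if decomp.startswith(SELECTED_TAGS):
                # Apply the compatibility mapping for this character only.
                out.append(unicodedata.normalize("NFKC", ch))
            else:
                out.append(ch)
        return "".join(out)

    # map_selected("ｗｗｗ") == "www"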

We should also remember that a number of WG participants would have to
compromise to some extent in order to accept mapping as a requirement
on the lookup side. Those who pushed for lookup mappings should also
be willing to make some compromises.

One example where we seem to have consensus on getting "stricter" is
the Tatweel: the consensus seems to be to disallow it.

So my suggestion is that those who are pushing for lookup mapping be
willing to get "stricter" about the input to the mapping function.
Otherwise, I fear that this WG will not reach a final consensus,
possibly leading to a "fork" between the Web protocol stack and
others.

Erik

On Thu, Apr 2, 2009 at 10:07 AM, Mark Davis <mark at macchiato.com> wrote:
> It would be possible to do a Tables section for mappings that went through
> the same kind of fine-tuning process that we did for Tables. That is, we
> could go through all of the mappings and figure out which ones we need and
> which ones we don't.
>
> Frankly, I don't think we need to go through the effort. The only problem I
> see is where a disallowed character X looks most like one PVALID character
> P1, but maps to a different PVALID character P2, and P1 is not confusable
> with P2 already. I don't know of any cases like that.
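>
> Automating that check would be straightforward given a confusability
> function (e.g. the UTS #39 skeleton) and the candidate mapping; a rough
> Python sketch, with all four inputs assumed to be supplied:
>
>     # Find disallowed X that maps to PVALID P2 while looking like a
>     # different PVALID P1 that is not already confusable with P2.
>     def problem_cases(disallowed, pvalid, confusable, mapping):
>         for x in disallowed:
>             p2 = mapping(x)
>             if p2 not in pvalid:
>                 continue
>             for p1 in pvalid:
>                 if (p1 != p2 and confusable(x) == confusable(p1)
>                         and confusable(p1) != confusable(p2)):
>                     yield x, p1, p2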
>
> BTW, my earlier figures included the "Remove Default Ignorables" step
> from my earlier mail. Here are the figures with that broken out:
>
> NFKC-CF-RDI   Remapped: 5290
> NFKC-LC-RDI   Remapped: 5224
> NFKC-CF       Remapped: 4896
> NFKC-LC       Remapped: 4830
> NFC-CF-RDI    Remapped: 2485
> NFC-LC-RDI    Remapped: 2394
> NFC-CF        Remapped: 2091
> NFC-LC        Remapped: 2000
>
> CF = Unicode toCaseFold
> LC = Unicode toLowercase
> RDI = Remove default ignorables
>
> And of course, the mappings would be restricted to only mapping characters
> that were not PVALID in any event, so the above figures would vary depending
> on what we end up with there.
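>
> Counts like these can be recomputed with a short script. A rough
> Python sketch (exact numbers depend on the Unicode version the
> runtime ships with and on the precise operation order, and this
> omits the RDI step):
>
>     import unicodedata
>
>     def count_remapped(form, case_op):
>         n = 0
>         for cp in range(0x110000):
>             ch = chr(cp)
>             # Skip unassigned code points and surrogates.
>             if unicodedata.category(ch) in ("Cn", "Cs"):
>                 continue
>             if case_op(unicodedata.normalize(form, ch)) != ch:
>                 n += 1
>         return n
>
>     print("NFKC-CF:", count_remapped("NFKC", str.casefold))
>     print("NFC-LC:", count_remapped("NFC", str.lower))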
>
> Mark
>
>
> On Thu, Apr 2, 2009 at 09:53, Erik van der Poel <erikv at google.com> wrote:
>>
>> Please, let's not let size and complexity issues derail this IDNAbis
>> effort. Haste makes waste. IDNA2003 was a good first cut that took
>> advantage of several Unicode tables, adopting them wholesale. IDNA2008
>> is a much more careful effort, with detailed dissection, as you can
>> see in the Tables draft. We should apply similar care to the "mapping"
>> table.
>>
>> I suggest that we come up with principles that we then apply to the
>> question of mapping. For example, the reason for lower-casing
>> non-ASCII letters is to compensate for the lack of matching on the
>> server side. The reason for mapping full-width Latin to normal is
>> that it is easy to type those characters in East Asian input
>> methods. (Of course, we need to see if there is consensus for each of
>> these "reasons".)
>>
>> I also suggest that we automate the process of finding problematic
>> characters. For example, we have already seen that 3-way relationships
>> are problematic. One example of this is Final/Normal/Capital Sigma. We
>> can automatically find these in Unicode's CaseFold tables. We can also
>> look for cases where one character becomes two when upper- or
>> lower-cased (e.g. Eszett -> SS).
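>>
>> Such a search is easy to script. A rough Python sketch (using the
>> built-in case operations as a stand-in for the raw CaseFold tables):
>>
>>     import unicodedata
>>     from collections import defaultdict
>>
>>     folds = defaultdict(list)
>>     for cp in range(0x110000):
>>         ch = chr(cp)
>>         if unicodedata.category(ch) in ("Cn", "Cs"):
>>             continue
>>         # Expansion cases, e.g. Eszett: "ß".upper() == "SS".
>>         if any(len(op(ch)) > 1
>>                for op in (str.lower, str.upper, str.casefold)):
>>             print("expands: U+%04X %s" % (cp, unicodedata.name(ch, "?")))
>>         folds[ch.casefold()].append(ch)
>>
>>     # 3-way relationships: more than two characters folding to one
>>     # target, e.g. Capital, Final and small Sigma all fold to U+03C3.
>>     for target, sources in folds.items():
>>         if len(sources) > 2:
>>             print("fold together:", sources)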
>>
>> We should definitely not let the current size of Unicode-related
>> libraries like ICU affect the decision-making process in IETF. Thin
>> clients can always let big servers do the heavy lifting.
>>
>> Erik
>>
>> On Thu, Apr 2, 2009 at 8:22 AM, Mark Davis <mark at macchiato.com> wrote:
>> > Mark
>> >
>> > On Wed, Apr 1, 2009 at 12:51, John C Klensin <klensin at jck.com> wrote:
>> >
>> >> We are, of course, there already.  While
>> >> NFKC(CaseFold(NFKC(string))) is a good predictor of the
>> >> Stringprep mappings, it is not an exact one, and IDNA2003
>> >> implementations already need separate tables for NFKC and IDNA.
>> >
>> > True, they are not exact; but the differences are few and
>> > extremely rare (not even measurable in practice, since their
>> > frequency is on a par with random data). Moreover, some implementations
>> > already use the latest version of NFKC instead of keeping special old
>> > versions, because the differences are so small. So given the choice
>> > between a major breakage and an insignificant one, I'd go for the
>> > insignificant one.
>> >
>> >>
>> >> That is where arguments about complexity get complicated.
>> >> IDNA2008, even with contextual rules, is arguably less complex
>> >> than IDNA2003 precisely because, other than that handful of
>> >> characters, the tables are smaller and the interpretation of an
>> >> entry in those tables is "valid" or "not".  By contrast,
>> >> IDNA2003 requires a table that is nearly the size of Unicode
>> >> with mapping actions for many characters.
>> >
>> > I have no idea where you are getting your figures, but
>> > they are very misleading. I'll assume that was not the intent.
>> >
>> > Here are the figures I get.
>> >
>> > PValid or Context:  90262
>> > NFKC-Folded   Remapped: 5290
>> > NFKC-Lower    Remapped: 5224
>> > NFC-Folded    Remapped: 2485
>> > NFC-Lower     Remapped: 2394
>> >
>> > A "table that is nearly the size of Unicode". If you mean possible
>> > Unicode characters, that's over a million. Even if you mean graphic
>> > characters, that's somewhat over 100,000
>> > (http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[^[:c:]]).
>> >
>> > NFKC-Folded affects 5,290 characters in Unicode 5.2. That's about 5%
>> > of graphic characters: in my book at least, 5% doesn't mean "nearly
>> > all". Or maybe you meant something odd like "the size of the table in
>> > bytes is nearly as large as the number of Unicode assigned graphic
>> > characters".
>> >
>> > Let's step back a bit. We need to remember that IDNA2008 already
>> > requires the data in Tables and NFC (for sizing on that, see
>> > http://www.macchiato.com/unicode/nfc-faq). The additional table size
>> > for NFKC and Folding is not that big an increase. As a matter of fact,
>> > if an implementation is tight on space, then having them available
>> > allows it to substantially cut down on the Table size by
>> > algorithmically computing
>> > http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.2.
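>> >
>> > For instance, the "Unstable" category in that section reduces to a
>> > one-line computation; a Python sketch, using casefold() as a
>> > stand-in for Unicode toCaseFold:
>> >
>> >     import unicodedata
>> >
>> >     def is_unstable(cp):
>> >         # Unstable: cp != toNFKC(toCaseFold(toNFKC(cp)))
>> >         ch = chr(cp)
>> >         inner = unicodedata.normalize("NFKC", ch)
>> >         return unicodedata.normalize("NFKC", inner.casefold()) != ch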
>> >
>> > If you have different figures, it would be useful to put them out.
>> >
>> >> And, of course, a
>> >> transition strategy that preserves full functionality for all
>> >> labels that were valid under IDNA2003 means that one has to
>> >> support both, which is the most complex option possible.
>> >
>> > I agree that it has the most overhead, since you have to keep a copy
>> > of IDNA2003 around. That's why I favor a cleaner approach.
>> >
>> >>
>> >>    john
>> >>
>> >
>
>

