Q2: What mapping function should be used in a revised IDNA2008 specification?

Erik van der Poel erikv at google.com
Thu Apr 2 22:08:42 CEST 2009


My email says "disallow Tatweel", not map Tatweel.

Erik

On Thu, Apr 2, 2009 at 11:44 AM, Mark Davis <mark at macchiato.com> wrote:
> I agree that the construction, like Tables, can be Property based. And large
> swathes of characters can be excluded pretty much out of the box. However, it
> does mean going through the exercise of looking at the excluded characters
> to see if any of them should be exceptions. While I don't think it is worth
> the effort, worth the further incompatibility with IDNA2003, or worth not
> being able to use off-the-shelf NFKC and case-folding code, it is not an
> unreasonable compromise.
>
> As far as Tatweel goes, we really don't want to add any mappings that were
> not in IDNA2003; for security and interoperability we need all Unicode 3.2
> characters to either (a) not map, or (b) map to exactly the same result as
> IDNA2003 has. (You pointed out the problem earlier.)
>
> Mark
>
>
> On Thu, Apr 2, 2009 at 10:50, Erik van der Poel <erikv at google.com> wrote:
>>
>> It may not be necessary to do character-by-character analysis of NFKC.
>> We may be able to select a small number of the NFKC tags:
>>
>> <font>          A font variant (e.g. a blackletter form).
>> <noBreak>       A no-break version of a space or hyphen.
>> <initial>       An initial presentation form (Arabic).
>> <medial>        A medial presentation form (Arabic).
>> <final>         A final presentation form (Arabic).
>> <isolated>      An isolated presentation form (Arabic).
>> <circle>        An encircled form.
>> <super>         A superscript form.
>> <sub>           A subscript form.
>> <vertical>      A vertical layout presentation form.
>> <wide>          A wide (or zenkaku) compatibility character.
>> <narrow>        A narrow (or hankaku) compatibility character.
>> <small>         A small variant form (CNS compatibility).
>> <square>        A CJK squared font variant.
>> <fraction>      A vulgar fraction form.
>> <compat>        Otherwise unspecified compatibility character.
>>
>> Of these, I would suggest that <wide> and <narrow> are needed for East
>> Asian input methods.
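
The tag-based selection suggested above can be sketched with Python's standard `unicodedata` module, which exposes the decomposition field (including the compatibility tag) for each character. This is only an illustration: the `ALLOWED_TAGS` set and the function names are assumptions made for the example, not anything agreed in the WG, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import sys
import unicodedata
from collections import Counter

# Hypothetical policy for illustration: map only <wide> and <narrow> forms.
ALLOWED_TAGS = {"<wide>", "<narrow>"}


def compat_tag(ch):
    """Return the compatibility tag of ch (e.g. '<wide>'), or None.

    Canonical decompositions carry no tag, so they also yield None.
    """
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp.split(" ", 1)[0]
    return None


def tag_counts():
    """Count how many code points carry each compatibility tag."""
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        tag = compat_tag(chr(cp))
        if tag is not None:
            counts[tag] += 1
    return counts


def is_mappable(ch):
    """True if ch would be mapped under the hypothetical ALLOWED_TAGS policy."""
    return compat_tag(ch) in ALLOWED_TAGS
```

Under this sketch, `tag_counts()` shows how many characters each tag would pull in, which is one way to judge how small "a small number of the NFKC tags" really is.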
>>
>> We should also remember that a number of WG participants would have to
>> compromise to some extent, in order to accept mapping as a requirement
>> on the lookup side. Those that pushed for lookup mappings should also
>> be willing to make some compromises.
>>
>> One example where we seem to have consensus for getting "stricter" is
>> the Tatweel. The consensus seems to be to disallow Tatweel.
>>
>> So my suggestion is that those who are pushing for lookup mapping, be
>> willing to get "stricter" about the input to the mapping function.
>> Otherwise, I fear that this WG will not reach a final consensus,
>> possibly leading to a "fork" between the Web protocol stack and
>> others.
>>
>> Erik
>>
>> On Thu, Apr 2, 2009 at 10:07 AM, Mark Davis <mark at macchiato.com> wrote:
>> > It would be possible to do a Tables section for mappings, that went through
>> > the same kind of process that we did for Tables, of fine tuning the mapping.
>> > That is, we could go through all of the mappings and figure out which ones
>> > we need, and which ones we don't.
>> >
>> > Frankly, I don't think we need to go through the effort. The only problem I
>> > see is where a disallowed character X looks most like one PVALID character
>> > P1, but maps to a different PVALID character P2, and P1 is not confusable
>> > with P2 already. I don't know of any cases like that.
>> >
>> > BTW, my earlier figures were including the "Remove Default Ignorables" from
>> > my earlier mail. Here are the figures with that broken out:
>> >
>> > NFKC-CF-RDI,    Remapped:    5290
>> > NFKC-LC-RDI,    Remapped:    5224
>> > NFKC-CF,    Remapped:    4896
>> > NFKC-LC,    Remapped:    4830
>> > NFC-CF-RDI,    Remapped:    2485
>> > NFC-LC-RDI,    Remapped:    2394
>> > NFC-CF,    Remapped:    2091
>> > NFC-LC,    Remapped:    2000
>> >
>> > CF = Unicode toCaseFold
>> > LC = Unicode toLowercase
>> > RDI = Remove default ignorables
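
A rough way to reproduce figures of this kind is to apply each candidate mapping to every assigned code point and count the ones that change. The sketch below assumes the composition order NFKC(CaseFold(NFKC(x))) mentioned elsewhere in this thread, uses Python's `str.casefold()` and `str.lower()` as stand-ins for Unicode toCaseFold and toLowercase, and leaves out the RDI step because the standard library does not expose the Default_Ignorable_Code_Point property. The function names are illustrative, and the counts will differ from the figures above depending on the Unicode version Python ships with.

```python
import sys
import unicodedata


def _compose(form, case_op):
    """Build the mapping form(case_op(form(s))) for a normalization form."""
    def mapping(s):
        inner = unicodedata.normalize(form, s)
        return unicodedata.normalize(form, case_op(inner))
    return mapping


nfkc_cf = _compose("NFKC", str.casefold)  # NFKC-CF
nfkc_lc = _compose("NFKC", str.lower)     # NFKC-LC
nfc_cf = _compose("NFC", str.casefold)    # NFC-CF
nfc_lc = _compose("NFC", str.lower)       # NFC-LC


def remapped(mapping, start=0, stop=sys.maxunicode + 1):
    """Yield each assigned character in [start, stop) that mapping changes.

    Unassigned (Cn), surrogate (Cs) and private-use (Co) code points are
    skipped, since no mapping is defined for them.
    """
    for cp in range(start, stop):
        ch = chr(cp)
        if unicodedata.category(ch) in ("Cn", "Cs", "Co"):
            continue
        if mapping(ch) != ch:
            yield ch
```

For example, `sum(1 for _ in remapped(nfkc_cf))` gives the NFKC-CF count for the interpreter's Unicode version; a scan of the full range takes a few seconds.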
>> >
>> > And of course, the mappings would be restricted to only mapping characters
>> > that were not PVALID in any event, so the above figures would vary depending
>> > on what we end up with there.
>> >
>> > Mark
>> >
>> >
>> > On Thu, Apr 2, 2009 at 09:53, Erik van der Poel <erikv at google.com> wrote:
>> >>
>> >> Please, let's not let size and complexity issues derail this IDNAbis
>> >> effort. Haste makes waste. IDNA2003 was a good first cut, that took
>> >> advantage of several Unicode tables, adopting them wholesale. IDNA2008
>> >> is a much more careful effort, with detailed dissection, as you can
>> >> see in the Table draft. We should apply similar care to the "mapping"
>> >> table.
>> >>
>> >> I suggest that we come up with principles, that we then apply to the
>> >> question of mapping. For example, the reason for lower-casing
>> >> non-ASCII letters is to compensate for the lack of matching on the
>> >> server side. The reason for mapping full-width Latin to normal is
>> >> because it is easy to type those characters in East Asian input
>> >> methods. (Of course, we need to see if there is consensus, for each of
>> >> these "reasons".)
>> >>
>> >> I also suggest that we automate the process of finding problematic
>> >> characters. For example, we have already seen that 3-way relationships
>> >> are problematic. One example of this is Final/Normal/Capital Sigma. We
>> >> can automatically find these in Unicode's CaseFold tables. We can also
>> >> look for cases where one character becomes two when upper- or
>> >> lower-cased (e.g. Eszett -> SS).
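
The automated search described above could look something like the following sketch. It groups characters by their case-folded form to find relationships of three or more characters (such as Final/Normal/Capital Sigma), and separately lists characters whose upper-, lower-, or case-folded form is longer than one character (such as Eszett). It relies on Python's built-in case operations rather than reading Unicode's CaseFolding.txt directly, and the function names and the `min_size` threshold are assumptions made for the example.

```python
import sys
import unicodedata
from collections import defaultdict

_SKIP = ("Cn", "Cs", "Co")  # unassigned, surrogate, private use


def fold_classes(stop=sys.maxunicode + 1):
    """Group assigned characters below stop by their case-folded form."""
    classes = defaultdict(set)
    for cp in range(stop):
        ch = chr(cp)
        if unicodedata.category(ch) in _SKIP:
            continue
        classes[ch.casefold()].add(ch)
    return classes


def multiway(classes, min_size=3):
    """Keep only fold classes with at least min_size members."""
    return {k: v for k, v in classes.items() if len(v) >= min_size}


def expanding(stop=sys.maxunicode + 1):
    """List characters that become more than one character under a case op."""
    result = []
    for cp in range(stop):
        ch = chr(cp)
        if unicodedata.category(ch) in _SKIP:
            continue
        if max(len(ch.upper()), len(ch.lower()), len(ch.casefold())) > 1:
            result.append(ch)
    return result
```

The `stop` argument only exists to restrict the scan to a sub-range for quick experiments, for instance `multiway(fold_classes(0x400))` for Latin and Greek; the output of either function is a candidate list for human review, not a verdict on which characters are problematic.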
>> >>
>> >> We should definitely not let the current size of Unicode-related
>> >> libraries like ICU affect the decision-making process in IETF. Thin
>> >> clients can always let big servers do the heavy lifting.
>> >>
>> >> Erik
>> >>
>> >> On Thu, Apr 2, 2009 at 8:22 AM, Mark Davis <mark at macchiato.com> wrote:
>> >> > Mark
>> >> >
>> >> > On Wed, Apr 1, 2009 at 12:51, John C Klensin <klensin at jck.com> wrote:
>> >> >
>> >> >> We are, of course, there already.  While
>> >> >> NFKC(CaseFold(NFKC(string))) is a good predictor of the
>> >> >> Stringprep mappings, it is not an exact one, and IDNA2003
>> >> >> implementations already need separate tables for NFKC and IDNA.
>> >> >
>> >> > True that they are not exact; but the differences are few, and
>> >> > extremely rare (not even measurable in practice, since their frequency
>> >> > is on a par with random data). Moreover, some implementations already
>> >> > use the latest version of NFKC instead of having special old versions,
>> >> > because the differences are so small. So given the choice of a major
>> >> > breakage or an insignificant breakage, I'd go for the insignificant one.
>> >> >
>> >> >>
>> >> >> That is where arguments about complexity get complicated.
>> >> >> IDNA2008, even with contextual rules, is arguably less complex
>> >> >> than IDNA2003 precisely because, other than that handful of
>> >> >> characters, the tables are smaller and the interpretation of an
>> >> >> entry in those tables is "valid" or "not".  By contrast,
>> >> >> IDNA2003 requires a table that is nearly the size of Unicode
>> >> >> with mapping actions for many characters.
>> >> >
>> >> > I have no idea whatsoever where you are getting your figures, but
>> >> > they are very misleading. I'll assume that was not the intent.
>> >> >
>> >> > Here are the figures I get.
>> >> >
>> >> > PValid or Context: 90262
>> >> > NFKC-Folded,    Remapped:       5290
>> >> > NFKC-Lower,     Remapped:       5224
>> >> > NFC-Folded,     Remapped:       2485
>> >> > NFC-Lower,      Remapped:       2394
>> >> >
>> >> > A "table that is nearly the size of Unicode". If you mean possible
>> >> > Unicode characters, that's over a million. Even if you mean graphic
>> >> > characters, that's somewhat over 100,000
>> >> > (http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[^[:c:]]).
>> >> >
>> >> > NFKC-Folded affects 5,290 characters in Unicode 5.2. That's about 5%
>> >> > of graphic characters: in my book at least, 5% doesn't mean "nearly
>> >> > all". Or maybe you meant something odd like "the size of the table in
>> >> > bytes is nearly as large as the number of Unicode assigned graphic
>> >> > characters".
>> >> >
>> >> > Let's step back a bit. We need to remember that IDNA2008 already
>> >> > requires the data in Tables and NFC (for sizing on that, see
>> >> > http://www.macchiato.com/unicode/nfc-faq). The additional table size
>> >> > for NFKC and Folding is not that big an increase. As a matter of fact,
>> >> > if an implementation is tight on space, then having them available
>> >> > allows it to substantially cut down on the Table size by
>> >> > algorithmically computing
>> >> > http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.2.
>> >> >
>> >> > If you have different figures, it would be useful to put them out.
>> >> >
>> >> >> And, of course, a
>> >> >> transition strategy that preserves full functionality for all
>> >> >> labels that were valid under IDNA2003 means that one has to
>> >> >> support both, which is the most complex option possible.
>> >> >
>> >> > I agree that it has the most overhead, since you have to keep a copy
>> >> > of IDNA2003 around. That's why I favor a cleaner approach.
>> >> >
>> >> >>
>> >> >>    john
>> >> >>
>> >> >> _______________________________________________
>> >> >> Idna-update mailing list
>> >> >> Idna-update at alvestrand.no
>> >> >> http://www.alvestrand.no/mailman/listinfo/idna-update
>> >> >>
>> >> >
>> >
>> >
>
>
