Q2: What mapping function should be used in a revised IDNA2008 specification?

Mark Davis mark at macchiato.com
Thu Apr 2 22:24:40 CEST 2009


Sorry, I misunderstood.

Tatweel is in the same boat as many other characters. It is irrelevant to
mappings, since the only relevant mappings are those where the results are
all valid U-Label characters. So those characters are off the table anyway.

Mark
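
A minimal sketch of the point above: a candidate mapping is only kept when every character it produces is valid in a U-Label. The predicate `toy_is_valid` is a hypothetical stand-in (a rough general-category check plus an explicit Tatweel exclusion), not the real Tables derivation, and `keep_mapping` is an illustrative name, not part of any specification.

```python
import unicodedata

TATWEEL = "\u0640"  # ARABIC TATWEEL


def nfkc_casefold(s):
    """NFKC(CaseFold(NFKC(s))), the composition discussed in this thread."""
    folded = unicodedata.normalize("NFKC", s).casefold()
    return unicodedata.normalize("NFKC", folded)


def toy_is_valid(ch):
    """Hypothetical validity check; an assumption, not the real Tables rules."""
    if ch == TATWEEL:  # assumed disallowed, as proposed in the thread
        return False
    return unicodedata.category(ch) in {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}


def keep_mapping(source, is_valid=toy_is_valid):
    """Return the mapping target if all of its characters are valid, else None."""
    target = nfkc_casefold(source)
    if target != source and all(is_valid(c) for c in target):
        return target
    return None


print(keep_mapping("\uFF21"))  # FULLWIDTH LATIN CAPITAL LETTER A
print(keep_mapping(TATWEEL))   # not remapped at all
print(keep_mapping("\uFE71"))  # ARABIC TATWEEL WITH FATHATAN ABOVE
```

Under these assumptions, Tatweel itself never appears as a mapping source (NFKC and case folding leave it unchanged), and a presentation form whose decomposition contains Tatweel is dropped because its result is not entirely valid.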


On Thu, Apr 2, 2009 at 13:08, Erik van der Poel <erikv at google.com> wrote:

> My email says "disallow Tatweel", not map Tatweel.
>
> Erik
>
> On Thu, Apr 2, 2009 at 11:44 AM, Mark Davis <mark at macchiato.com> wrote:
> > I agree that the construction, like Tables, can be Property based. And
> > large swathes of characters can be excluded pretty much out of the box.
> > However, it does mean going through the exercise of looking at the
> > excluded characters to see if any of them should be exceptions. While I
> > don't think it is worth the effort, worth the further incompatibility
> > with IDNA2003, or worth not being able to use off-the-shelf NFKC and
> > case-folding code, it is not an unreasonable compromise.
> >
> > As far as Tatweel goes, we really don't want to add any mappings that
> > were not in IDNA2003; for security and interoperability we need all
> > Unicode 3.2 characters to either (a) not map, or (b) map to exactly the
> > same thing as in IDNA2003. (You pointed out the problem earlier.)
> >
> > Mark
> >
> >
> > On Thu, Apr 2, 2009 at 10:50, Erik van der Poel <erikv at google.com> wrote:
> >>
> >> It may not be necessary to do character-by-character analysis of NFKC.
> >> We may be able to select a small number of the NFKC tags:
> >>
> >> <font>          A font variant (e.g. a blackletter form).
> >> <noBreak>       A no-break version of a space or hyphen.
> >> <initial>       An initial presentation form (Arabic).
> >> <medial>        A medial presentation form (Arabic).
> >> <final>         A final presentation form (Arabic).
> >> <isolated>      An isolated presentation form (Arabic).
> >> <circle>        An encircled form.
> >> <super>         A superscript form.
> >> <sub>           A subscript form.
> >> <vertical>      A vertical layout presentation form.
> >> <wide>          A wide (or zenkaku) compatibility character.
> >> <narrow>        A narrow (or hankaku) compatibility character.
> >> <small>         A small variant form (CNS compatibility).
> >> <square>        A CJK squared font variant.
> >> <fraction>      A vulgar fraction form.
> >> <compat>        Otherwise unspecified compatibility character.
> >>
> >> Of these, I would suggest that <wide> and <narrow> are needed for East
> >> Asian input methods.
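
A rough sketch of selecting mappings by NFKC tag, as suggested above. It relies on Python's `unicodedata.decomposition`, which returns the tag as the first token of the decomposition field. The names `ALLOWED_TAGS`, `compat_tag` and `map_selected`, and the final NFC pass, are assumptions made for illustration.

```python
import unicodedata

# Assumption: only these compatibility tags are mapped (per the suggestion above).
ALLOWED_TAGS = {"<wide>", "<narrow>"}


def compat_tag(ch):
    """Return the compatibility tag of ch (e.g. '<wide>'), or None if it has none."""
    decomposition = unicodedata.decomposition(ch)
    if decomposition.startswith("<"):
        return decomposition.split()[0]
    return None  # no decomposition, or a canonical (untagged) one


def map_selected(s, allowed=ALLOWED_TAGS):
    """Apply the compatibility mapping only to characters whose tag is allowed."""
    mapped = [
        unicodedata.normalize("NFKC", ch) if compat_tag(ch) in allowed else ch
        for ch in s
    ]
    # Recompose afterwards, since a mapped mark may now combine with its base.
    return unicodedata.normalize("NFC", "".join(mapped))


print(compat_tag("\uFF21"))          # fullwidth A
print(compat_tag("\u00B2"))          # superscript two
print(map_selected("\uFF27\uFF4F\uFF4F\uFF47\uFF4C\uFF45\u00B2"))
```

With this selection, fullwidth and halfwidth forms are mapped while characters with other tags, such as the superscript two, pass through unchanged and would be left for the validity check to reject.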
> >>
> >> We should also remember that a number of WG participants would have to
> >> compromise to some extent, in order to accept mapping as a requirement
> >> on the lookup side. Those that pushed for lookup mappings should also
> >> be willing to make some compromises.
> >>
> >> One example where we seem to have consensus for getting "stricter" is
> >> the Tatweel. The consensus seems to be to disallow Tatweel.
> >>
> >> So my suggestion is that those who are pushing for lookup mapping, be
> >> willing to get "stricter" about the input to the mapping function.
> >> Otherwise, I fear that this WG will not reach a final consensus,
> >> possibly leading to a "fork" between the Web protocol stack and
> >> others.
> >>
> >> Erik
> >>
> >> On Thu, Apr 2, 2009 at 10:07 AM, Mark Davis <mark at macchiato.com> wrote:
> >> > It would be possible to do a Tables section for mappings that went
> >> > through the same kind of process that we did for Tables, fine-tuning
> >> > the mapping. That is, we could go through all of the mappings and
> >> > figure out which ones we need, and which ones we don't.
> >> >
> >> > Frankly, I don't think we need to go through the effort. The only
> >> > problem I see is where a disallowed character X looks most like one
> >> > PVALID character P1, but maps to a different PVALID character P2, and
> >> > P1 is not already confusable with P2. I don't know of any cases like
> >> > that.
> >> >
> >> > BTW, my earlier figures included the "Remove Default Ignorables"
> >> > step from my earlier mail. Here are the figures with that broken out:
> >> >
> >> > NFKC-CF-RDI,    Remapped:    5290
> >> > NFKC-LC-RDI,    Remapped:    5224
> >> > NFKC-CF,    Remapped:    4896
> >> > NFKC-LC,    Remapped:    4830
> >> > NFC-CF-RDI,    Remapped:    2485
> >> > NFC-LC-RDI,    Remapped:    2394
> >> > NFC-CF,    Remapped:    2091
> >> > NFC-LC,    Remapped:    2000
> >> >
> >> > CF = Unicode toCaseFold
> >> > LC = Unicode toLowercase
> >> > RDI = Remove default ignorables
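
One plausible way to compute figures of this kind is sketched below; the method actually used for the numbers above is not given in the thread. Python's `unicodedata` does not expose the Default_Ignorable_Code_Point property, so `SAMPLE_IGNORABLES` is a small hand-picked subset used purely for illustration, and the function names are hypothetical. Counts will also vary with the Unicode version of the Python build.

```python
import unicodedata

# Illustrative subset only; the real default-ignorable set is larger.
SAMPLE_IGNORABLES = {0x00AD, 0x200B, 0x2060, 0xFEFF} | set(range(0xFE00, 0xFE10))


def remap(s, form="NFKC", fold=True, rdi=False):
    """Normalize, then case-fold (CF) or lowercase (LC), optionally drop ignorables."""
    s = unicodedata.normalize(form, s)
    s = s.casefold() if fold else s.lower()
    if rdi:
        s = "".join(c for c in s if ord(c) not in SAMPLE_IGNORABLES)
    return unicodedata.normalize(form, s)


def count_remapped(code_points, **options):
    """Count assigned, non-private code points that the mapping changes."""
    count = 0
    for cp in code_points:
        ch = chr(cp)
        # Skip unassigned, surrogate and private-use code points.
        if unicodedata.category(ch) in ("Cn", "Cs", "Co"):
            continue
        if remap(ch, **options) != ch:
            count += 1
    return count


# Small demonstration on the ASCII capital letters.
print(count_remapped(range(0x41, 0x5B), form="NFC", fold=False))
```

Calling `count_remapped(range(0x110000), form="NFKC", fold=True, rdi=True)` would scan all of Unicode for the NFKC-CF-RDI variant; the other rows correspond to the other combinations of `form`, `fold` and `rdi`.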
> >> >
> >> > And of course, the mappings would be restricted to only mapping
> >> > characters that were not PVALID in any event, so the above figures
> >> > would vary depending on what we end up with there.
> >> >
> >> > Mark
> >> >
> >> >
> >> > On Thu, Apr 2, 2009 at 09:53, Erik van der Poel <erikv at google.com> wrote:
> >> >>
> >> >> Please, let's not let size and complexity issues derail this IDNAbis
> >> >> effort. Haste makes waste. IDNA2003 was a good first cut that took
> >> >> advantage of several Unicode tables, adopting them wholesale.
> >> >> IDNA2008 is a much more careful effort, with detailed dissection, as
> >> >> you can see in the Tables draft. We should apply similar care to the
> >> >> "mapping" table.
> >> >>
> >> >> I suggest that we come up with principles that we then apply to the
> >> >> question of mapping. For example, the reason for lower-casing
> >> >> non-ASCII letters is to compensate for the lack of matching on the
> >> >> server side. The reason for mapping full-width Latin to normal is
> >> >> that it is easy to type those characters in East Asian input
> >> >> methods. (Of course, we need to see if there is consensus for each
> >> >> of these "reasons".)
> >> >>
> >> >> I also suggest that we automate the process of finding problematic
> >> >> characters. For example, we have already seen that 3-way
> >> >> relationships are problematic. One example of this is
> >> >> Final/Normal/Capital Sigma. We can automatically find these in
> >> >> Unicode's CaseFold tables. We can also look for cases where one
> >> >> character becomes two when upper- or lower-cased (e.g. Eszett -> SS).
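
The automated search described above could be sketched roughly as follows, using the case mappings built into Python's `str` type as a stand-in for Unicode's CaseFold tables. The function names are hypothetical, and results depend on the Unicode version of the Python build.

```python
import unicodedata
from collections import defaultdict


def fold_groups(code_points):
    """Group assigned characters by their case-folded form."""
    groups = defaultdict(set)
    for cp in code_points:
        ch = chr(cp)
        if unicodedata.category(ch) in ("Cn", "Cs", "Co"):
            continue  # skip unassigned, surrogate, private-use
        groups[ch.casefold()].add(ch)
    return groups


def multiway(code_points, min_size=3):
    """Fold targets shared by at least min_size characters (3-way or more)."""
    return {
        target: members
        for target, members in fold_groups(code_points).items()
        if len(members) >= min_size
    }


def expanding(code_points):
    """Characters that become more than one code point when upper- or lower-cased."""
    found = {}
    for cp in code_points:
        ch = chr(cp)
        for name, mapped in (("upper", ch.upper()), ("lower", ch.lower())):
            if len(mapped) > 1:
                found.setdefault(ch, {})[name] = mapped
    return found


print(multiway(range(0x0370, 0x0400)))  # Greek and Coptic block
print(expanding(range(0x00DF, 0x00E0)))  # just LATIN SMALL LETTER SHARP S
```

Run over the Greek block, the first search reports the sigma group among its results; the second reports Eszett with its two-letter uppercase form.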
> >> >>
> >> >> We should definitely not let the current size of Unicode-related
> >> >> libraries like ICU affect the decision-making process in IETF. Thin
> >> >> clients can always let big servers do the heavy lifting.
> >> >>
> >> >> Erik
> >> >>
> >> >> On Thu, Apr 2, 2009 at 8:22 AM, Mark Davis <mark at macchiato.com> wrote:
> >> >> > Mark
> >> >> >
> >> >> > On Wed, Apr 1, 2009 at 12:51, John C Klensin <klensin at jck.com> wrote:
> >> >> >
> >> >> >> We are, of course, there already.  While
> >> >> >> NFKC(CaseFold(NFKC(string))) is a good predictor of the
> >> >> >> Stringprep mappings, it is not an exact one, and IDNA2003
> >> >> >> implementations already need separate tables for NFKC and IDNA.
> >> >> >
> >> >> > True that they are not exact; but the differences are few, and
> >> >> > extremely rare (not even measurable in practice, since their
> >> >> > frequency is on a par with random data). Moreover, some
> >> >> > implementations already use the latest version of NFKC instead of
> >> >> > having special old versions, because the differences are so small.
> >> >> > So given the choice of a major breakage or an insignificant
> >> >> > breakage, I'd go for the insignificant one.
> >> >> >
> >> >> >>
> >> >> >> That is where arguments about complexity get complicated.
> >> >> >> IDNA2008, even with contextual rules, is arguably less complex
> >> >> >> than IDNA2003 precisely because, other than that handful of
> >> >> >> characters, the tables are smaller and the interpretation of an
> >> >> >> entry in those tables is "valid" or "not".  By contrast,
> >> >> >> IDNA2003 requires a table that is nearly the size of Unicode
> >> >> >> with mapping actions for many characters.
> >> >> >
> >> >> > I have no idea where you are getting your figures, but they are
> >> >> > very misleading. I'll assume that was not the intent.
> >> >> >
> >> >> > Here are the figures I get.
> >> >> >
> >> >> > PValid or Context: 90262
> >> >> > NFKC-Folded,    Remapped:       5290
> >> >> > NFKC-Lower,     Remapped:       5224
> >> >> > NFC-Folded,     Remapped:       2485
> >> >> > NFC-Lower,      Remapped:       2394
> >> >> >
> >> >> > A "table that is nearly the size of Unicode". If you mean possible
> >> >> > Unicode characters, that's over a million. Even if you mean graphic
> >> >> > characters, that's somewhat over 100,000
> >> >> > (http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[^[:c:]]).
> >> >> >
> >> >> > NFKC-Folded affects 5,290 characters in Unicode 5.2. That's about
> >> >> > 5% of graphic characters: in my book at least, 5% doesn't mean
> >> >> > "nearly all". Or maybe you meant something odd like "the size of the
> >> >> > table in bytes is nearly as large as the number of Unicode assigned
> >> >> > graphic characters".
> >> >> >
> >> >> > Let's step back a bit. We need to remember that IDNA2008 already
> >> >> > requires the data in Tables and NFC (for sizing on that, see
> >> >> > http://www.macchiato.com/unicode/nfc-faq). The additional table
> >> >> > size for NFKC and Folding is not that big an increase. As a matter
> >> >> > of fact, if an implementation is tight on space, then having them
> >> >> > available allows it to substantially cut down on the Table size by
> >> >> > algorithmically computing
> >> >> > http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.2.
> >> >> >
> >> >> > If you have different figures, it would be useful to put them out.
> >> >> >
> >> >> >> And, of course, a
> >> >> >> transition strategy that preserves full functionality for all
> >> >> >> labels that were valid under IDNA2003 means that one has to
> >> >> >> support both, which is the most complex option possible.
> >> >> >
> >> >> > I agree that it has the most overhead, since you have to keep a
> >> >> > copy of IDNA2003 around. That's why I favor a cleaner approach.
> >> >> >
> >> >> >>
> >> >> >>    john
> >> >> >>
> >> >> >> _______________________________________________
> >> >> >> Idna-update mailing list
> >> >> >> Idna-update at alvestrand.no
> >> >> >> http://www.alvestrand.no/mailman/listinfo/idna-update
> >> >> >>
> >> >> >
> >> >
> >> >
> >
> >