Q2: What mapping function should be used in a revised IDNA2008 specification?
"Martin J. Dürst"
duerst at it.aoyama.ac.jp
Fri Apr 3 08:40:21 CEST 2009
On 2009/04/03 2:50, Erik van der Poel wrote:
> It may not be necessary to do character-by-character analysis of NFKC.
> We may be able to select a small number of the NFKC tags:
Thanks for listing them all. Without looking at the details
(which will definitely have to be done), my assessment would be:
> <font> A font variant (e.g. a blackletter form).
Out. Difficult to type,... anyhow.
> <noBreak> A no-break version of a space or hyphen.
Out. Their equivalents are probably out already, so moot.
> <initial> An initial presentation form (Arabic).
> <medial> A medial presentation form (Arabic).
> <final> A final presentation form (Arabic).
> <isolated> An isolated presentation form (Arabic).
Out, I'd say, but it would be better to get input on this
from the Arabic IDN experts.
> <circle> An encircled form.
> <super> A superscript form.
> <sub> A subscript form.
Out. They are clearly visually distinct, difficult to enter,...
> <vertical> A vertical layout presentation form.
Unclear. Out unless they happen to be introduced by IMEs
(my guess is that these are mostly used for glyph identifiers
in fonts, but that's just a guess).
> <wide> A wide (or zenkaku) compatibility character.
> <narrow> A narrow (or hankaku) compatibility character.
In. This has already been pointed out by several people.
> <small> A small variant form (CNS compatibility).
> <square> A CJK squared font variant.
Out. These are several characters squeezed into one basic square.
> <fraction> A vulgar fraction form.
Out. They only produce confusion with slashes and the like.
> <compat> Otherwise unspecified compatibility character.
That's probably the category that we will have to look at most closely.
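As a side note for readers of the archive: the compatibility tags discussed above can be inspected directly with Python's standard `unicodedata` module (a quick sketch, not part of the original thread; the sample characters are my own picks):

```python
import unicodedata

# Print the compatibility decomposition tag for one sample character
# from several of the categories discussed above.
samples = ["ﬁ", "①", "ｱ", "Ｗ", "½", "㌀"]
for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"{unicodedata.decomposition(ch) or '(none)'}")
```

Running this shows, e.g., `<circle>` for U+2460, `<narrow>` for U+FF71, and `<fraction>` for U+00BD, matching the tag list quoted above.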
> Of these, I would suggest that <wide> and <narrow> are needed for East
> Asian input methods.
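The point about East Asian input methods is easy to demonstrate: under NFKC, fullwidth Latin folds to ASCII, and halfwidth katakana folds to the ordinary forms (a minimal illustration, added here for reference):

```python
import unicodedata

# Fullwidth Latin, as commonly produced by East Asian IMEs, maps to ASCII:
print(unicodedata.normalize("NFKC", "ｅｘａｍｐｌｅ"))  # -> example
# Halfwidth katakana maps to the ordinary (fullwidth) katakana:
print(unicodedata.normalize("NFKC", "ﾃｽﾄ"))  # -> テスト
```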
> We should also remember that a number of WG participants would have to
> compromise to some extent, in order to accept mapping as a requirement
> on the lookup side. Those that pushed for lookup mappings should also
> be willing to make some compromises.
> One example where we seem to have consensus for getting "stricter" is
> the Tatweel. The consensus seems to be to disallow Tatweel.
> So my suggestion is that those who are pushing for lookup mapping, be
> willing to get "stricter" about the input to the mapping function.
> Otherwise, I fear that this WG will not reach a final consensus,
> possibly leading to a "fork" between the Web protocol stack and
> On Thu, Apr 2, 2009 at 10:07 AM, Mark Davis <mark at macchiato.com> wrote:
>> It would be possible to do a Tables section for mappings, that went through
>> the same kind of process that we did for Tables, of fine tuning the mapping.
>> That is, we could go through all of the mappings and figure out which ones
>> we need, and which ones we don't.
>> Frankly, I don't think we need to go through the effort. The only problem I
>> see is where a disallowed character X looks most like one PVALID character
>> P1, but maps to a different PVALID character P2, and P1 is not confusable
>> with P2 already. I don't know of any cases like that.
>> BTW, my earlier figures were including the "Remove Default Ignorables" from
>> my earlier mail. Here are the figures with that broken out:
>> NFKC-CF-RDI, Remapped: 5290
>> NFKC-LC-RDI, Remapped: 5224
>> NFKC-CF, Remapped: 4896
>> NFKC-LC, Remapped: 4830
>> NFC-CF-RDI, Remapped: 2485
>> NFC-LC-RDI, Remapped: 2394
>> NFC-CF, Remapped: 2091
>> NFC-LC, Remapped: 2000
>> CF = Unicode toCaseFold
>> LC = Unicode toLowercase
>> RDI = Remove default ignorables
>> And of course, the mappings would be restricted to only mapping characters
>> that were not PVALID in any event, so the above figures would vary depending
>> on what we end up with there.
>> On Thu, Apr 2, 2009 at 09:53, Erik van der Poel <erikv at google.com> wrote:
>>> Please, let's not let size and complexity issues derail this IDNAbis
>>> effort. Haste makes waste. IDNA2003 was a good first cut, that took
>>> advantage of several Unicode tables, adopting them wholesale. IDNA2008
>>> is a much more careful effort, with detailed dissection, as you can
>>> see in the Table draft. We should apply similar care to the "mapping"
>>> I suggest that we come up with principles, that we then apply to the
>>> question of mapping. For example, the reason for lower-casing
>>> non-ASCII letters is to compensate for the lack of matching on the
>>> server side. The reason for mapping full-width Latin to normal is
>>> because it is easy to type those characters in East Asian input
>>> methods. (Of course, we need to see if there is consensus, for each of
>>> these "reasons".)
>>> I also suggest that we automate the process of finding problematic
>>> characters. For example, we have already seen that 3-way relationships
>>> are problematic. One example of this is Final/Normal/Capital Sigma. We
>>> can automatically find these in Unicode's CaseFold tables. We can also
>>> look for cases where one character becomes two when upper- or
>>> lower-cased (e.g. Eszett -> SS).
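Both examples can indeed be checked mechanically from the standard case mappings (a quick sketch, added for reference):

```python
# Eszett expands to two characters on uppercasing:
assert "ß".upper() == "SS"
# Sigma is a 3-way relationship: both the capital and the final form
# relate to the medial small sigma via case operations.
assert "Σ".lower() == "σ"      # capital -> medial (context-free)
assert "ς".upper() == "Σ"      # final -> capital
assert "ς".casefold() == "σ"   # case folding unifies final and medial
```

A script walking Unicode's CaseFolding data for one-to-many or many-to-one entries would surface these cases automatically, as Erik suggests.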
>>> We should definitely not let the current size of Unicode-related
>>> libraries like ICU affect the decision-making process in IETF. Thin
>>> clients can always let big servers do the heavy lifting.
>>> On Thu, Apr 2, 2009 at 8:22 AM, Mark Davis <mark at macchiato.com> wrote:
>>>> On Wed, Apr 1, 2009 at 12:51, John C Klensin <klensin at jck.com> wrote:
>>>>> We are, of course, there already. While
>>>>> NFKC(CaseFold(NFKC(string))) is a good predictor of the
>>>>> Stringprep mappings, it is not an exact one, and IDNA2003
>>>>> implementations already need separate tables for NFKC and IDNA.
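The predictor John mentions can be written out directly (a sketch; the function name is mine, and, as he says, this approximates but does not exactly reproduce the Stringprep/Nameprep mapping):

```python
import unicodedata

def nameprep_approx(s: str) -> str:
    """NFKC(CaseFold(NFKC(s))): a good but inexact predictor of the
    IDNA2003/Stringprep mapping, per the discussion in this thread."""
    return unicodedata.normalize(
        "NFKC", unicodedata.normalize("NFKC", s).casefold())

print(nameprep_approx("Ｅｘａｍｐｌｅ"))  # -> example
print(nameprep_approx("Straße"))          # -> strasse
```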
>>>> True that they are not exact; but the differences are few, and
>>>> extremely rare (not even measurable in practice, since their frequency
>>>> is on a par with random data). Moreover, some implementations already
>>>> use the latest version of NFKC instead of having special old versions,
>>>> because the differences are so small. So given the choice of a major
>>>> breakage or an insignificant breakage, I'd go for the insignificant one.
>>>>> That is where arguments about complexity get complicated.
>>>>> IDNA2008, even with contextual rules, is arguably less complex
>>>>> than IDNA2003 precisely because, other than that handful of
>>>>> characters, the tables are smaller and the interpretation of an
>>>>> entry in those tables is "valid" or "not". By contrast,
>>>>> IDNA2003 requires a table that is nearly the size of Unicode
>>>>> with mapping actions for many characters.
>>>> I have no idea whatsoever where you are getting your figures, but
>>>> they are very misleading. I'll assume that was not the intent.
>>>> Here are the figures I get.
>>>> PValid or Context: 90262
>>>> NFKC-Folded, Remapped: 5290
>>>> NFKC-Lower, Remapped: 5224
>>>> NFC-Folded, Remapped: 2485
>>>> NFC-Lower, Remapped: 2394
>>>> A "table that is nearly the size of Unicode". If you mean possible
>>>> Unicode characters, that's over a million. Even if you mean graphic
>>>> characters, that's somewhat over 100,000.
>>>> NFKC-Folded affects 5,290 characters in Unicode 5.2. That's about 5%
>>>> of graphic characters: in my book at least, 5% doesn't mean "nearly
>>>> all". Or maybe you meant something odd like "the size of the table in
>>>> bytes is nearly as large as the number of Unicode assigned graphic characters".
>>>> Let's step back a bit. We need to remember that IDNA2008 already
>>>> requires the data in Tables and NFC (for sizing on that, see
>>>> http://www.macchiato.com/unicode/nfc-faq). The additional table size
>>>> for NFKC and Folding is not that big an increase. As a matter of fact,
>>>> if an implementation is tight on space, then having them available
>>>> allows it to substantially cut down on the Table size by
>>>> algorithmically computing them.
>>>> If you have different figures, it would be useful to put them out.
>>>>> And, of course, a
>>>>> transition strategy that preserves full functionality for all
>>>>> labels that were valid under IDNA2003 means that one has to
>>>>> support both, which is the most complex option possible.
>>>> I agree that it has the most overhead, since you have to keep a copy
>>>> of IDNA2003 around. That's why I favor a cleaner approach.
>>>>> Idna-update mailing list
>>>>> Idna-update at alvestrand.no
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst at it.aoyama.ac.jp