idna-mapping update

Mon Dec 21 09:17:47 CET 2009

Hello Michel, others,

On 2009/12/19 12:13, Michel SUIGNARD wrote:
>>  From Lisa Dusseault (Dec 1st)
>> I don't believe we know what the WG consensus position is around how
>> strongly pre-lookup mappings are recommended and in what use cases,
>> and how compatible optional pre-lookup mappings are with IDNA2003
>> in-protocol mapping.
>
> I'd like to give some new feedback on that statement. The issue some of us have with the current recommendation in idna-mappings [draft-ietf-idnabis-mappings-05] is that it is vastly different from the mapping done in IDNA 2003, especially concerning compatibility mapping done beyond the narrow/wide mapping suggested in the current document.

It is indeed vastly different on paper. But it is NOT vastly different 
in practice, because most of the NFKC mappings (except for <wide> and 
<narrow>) are either to DISALLOWED or are extremely rare.

> The solution proposes referencing a single mapping table, greatly improving the odds that implementers will do the right thing. Finally, it makes it trivial for the draft Unicode TR46 to refer to a common mapping definition, avoiding potential confusion and unnecessary duplication.
>
> Some examples of characters mapped differently between idna-mappings and IDNA 2003 (in idna-mappings they stay unmapped):
>
> 	00AA ( ª ) =>  0061 ( a ) # FEMININE ORDINAL INDICATOR
> 	00B2 ( ² ) =>  0032 ( 2 ) # SUPERSCRIPT TWO
> 	00B3 ( ³ ) =>  0033 ( 3 ) # SUPERSCRIPT THREE
> 	00B5 ( µ ) =>  03BC ( μ ) # MICRO SIGN
> 	00B9 ( ¹ ) =>  0031 ( 1 ) # SUPERSCRIPT ONE
> 	00BA ( º ) =>  006F ( o ) # MASCULINE ORDINAL INDICATOR

I have no idea why these would be really useful.

> 	0130 ( İ ) =>  0069 0307 ( i̇ ) # LATIN CAPITAL LETTER I WITH DOT ABOVE

Shawn repeatedly asked about the reason for this mapping, which I agree 
doesn't make sense at all, because the lower-case of İ is always i, not i̇.

> 	0132 ( Ĳ ) =>  0069 006A ( ij ) # LATIN CAPITAL LIGATURE IJ

I guess the mapping also includes the lower-case ij ligature. These are 
sometimes used in Dutch. But the Dutch keyboard 
(http://en.wikipedia.org/wiki/File:Nederlandse_toetsenbordindeling_-_tekst_als_paden.svg, 
http://www.microsoft.com/globaldev/keyboards/kbdne.htm) doesn't contain 
them.

> 	013F ( Ŀ ) =>  006C 00B7 ( l· ) # LATIN CAPITAL LETTER L WITH MIDDLE DOT
> 	0140 ( ŀ ) =>  006C 00B7 ( l· ) # LATIN SMALL LETTER L WITH MIDDLE DOT

These may be used in Catalan. Any specific information?

> 	0149 ( ŉ ) =>  02BC 006E ( ʼn ) # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
> 	017F ( ſ ) =>  0073 ( s ) # LATIN SMALL LETTER LONG S
> 	01C4 ( Ǆ ) =>  0064 017E ( dž ) # LATIN CAPITAL LETTER DZ WITH CARON
> 	01F3 ( ǳ ) =>  0064 007A ( dz ) # LATIN SMALL LETTER DZ
>
> By using a mapping table based on the NFKC_CF property already exposed in Unicode (reflecting IDNA mapping as designed in IDNA 2003), modified to improve compatibility with IDNA 2008, it is possible to address the concern expressed above.

What exactly is the concern? How many of these characters are out there? 
How many are likely to be generated by the keyboards that users actually use?
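
For anyone who wants to check the examples quoted above, here is a rough 
Python sketch. It assumes that Python's str.casefold() followed by NFKC is 
a reasonable stand-in for the IDNA 2003 style mapping; that is only my 
approximation, not the definition of NFKC_CF or of the TR46 table, and the 
results depend on the Unicode version of the Python build. The function 
name casefold_nfkc is made up for illustration.

```python
import unicodedata


def casefold_nfkc(ch):
    """Approximate an IDNA 2003 style mapping: case fold, then NFKC.

    This is an assumption for illustration, not the NFKC_CF property itself.
    """
    return unicodedata.normalize("NFKC", ch.casefold())


# Code points taken from the examples quoted in this message.
EXAMPLES = [0x00AA, 0x00B2, 0x00B3, 0x00B5, 0x00B9, 0x00BA, 0x0130,
            0x0132, 0x013F, 0x0140, 0x0149, 0x017F, 0x01C4, 0x01F3]

for cp in EXAMPLES:
    mapped = casefold_nfkc(chr(cp))
    # Print the result as space-separated hexadecimal code points.
    hexes = " ".join(f"{ord(c):04X}" for c in mapped)
    print(f"{cp:04X} => {hexes}  # {unicodedata.name(chr(cp))}")
```

This only shows what the characters would map to; it says nothing about 
how often they actually turn up in practice.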

> The table is available in http://www.unicode.org/Public/idna/5.1.0/IdnaMappingTable.txt

Is there a 5.2 version?

> and its construction is explained in section 7 of the latest TR46 draft in http://www.unicode.org/reports/tr46/

Many aspects of this 'explanation' are difficult to understand. For 
example, why does U+FFFD have to be excluded? It is not allowed in IDNs, 
in either IDNA 2003 or IDNA 2008, and doesn't participate in any 
normalizations.

> The editing instructions for idna-mappings to reference this new mapping table follow:
> <<
> In Section 2, replace items 1-4 and all following text by:
>
> ========================================================================
>
> 1. For each code point in the input string to be used under the IDNA
>     protocol, map the code point using the [IDNA Mapping Table], as follows:
>
>     a. Look up the status value for the code point in the table.
>
>     b. If the status is "ignored", remove the code point from the input
>        string.

Do we still have ignorables in IDNA 2008?

>     c. If the status is "mapped", replace the code point in the input string
>        by the mapped value in the table. Note that the mapped value
>        may consist of more than one code point.
>
>     d. For any other status ("valid", "disallowed", "deviation"), and for
>        any code point which is unassigned for the Unicode version of
>        the table, leave the code point unchanged in the input string.

I think it is very clear that we don't want any special treatment for 
the codepoints that TR46 calls "deviation".

> 2. Normalize the string which results from the mapping in step 1, using
>     Unicode Normalization Form C (NFC).
>
> Note that this mapping and normalization of the input string may result in a string which is not valid per [I-D.ietf-idnabis-protocol], because it may contain disallowed or unassigned code points, or may otherwise fail well-formedness conditions specified in that protocol. Such verification is outside the scope of this document.
>
> If the mappings in this document are applied to versions of Unicode later than Unicode 5.1, the corresponding version of the IDNA Mapping Table for those later versions of the Unicode Standard should be used.

Shouldn't we just say that the newest version is always preferable?
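
To make sure we are all reading the proposed steps the same way, here is a 
minimal Python sketch of steps 1 and 2 as quoted above. The table is a 
tiny hand-written stand-in: the entries, the SAMPLE_TABLE name, and the 
map_and_normalize function are my own assumptions for illustration and are 
not parsed from IdnaMappingTable.txt. Code points missing from the 
stand-in table are simply left unchanged, as in step 1d.

```python
import unicodedata

# Hypothetical excerpt: code point -> (status, mapped value).
# The real statuses and values would come from the [IDNA Mapping Table];
# these few entries are illustrative assumptions only.
SAMPLE_TABLE = {
    0x00AA: ("mapped", "a"),
    0x017F: ("mapped", "s"),
    0x0132: ("mapped", "ij"),
    0x00AD: ("ignored", ""),
    0x00DF: ("deviation", ""),
}


def map_and_normalize(text, table=SAMPLE_TABLE):
    """Apply step 1 (table mapping) and step 2 (NFC) to the input string."""
    out = []
    for ch in text:
        # Step 1a: look up the status; anything not listed is left alone.
        status, value = table.get(ord(ch), ("valid", ""))
        if status == "ignored":
            # Step 1b: drop the code point.
            continue
        if status == "mapped":
            # Step 1c: the mapped value may be more than one code point.
            out.append(value)
        else:
            # Step 1d: valid, disallowed, deviation, unassigned: unchanged.
            out.append(ch)
    # Step 2: normalize the mapped string to NFC.
    return unicodedata.normalize("NFC", "".join(out))
```

As in the quoted text, the sketch does no validity checking at all; a 
string that comes out of it may still be rejected by the protocol.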

Regards,   Martin.

> ====================================
> In Section 6 Normative references
> Add:
>
> [IDNA Mapping Table] add reference to
> http://www.unicode.org/Public/idna/5.1.0/IdnaMappingTable.txt
>
> =========================
> Best regards,
> Michel

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst at it.aoyama.ac.jp