IDNAbis Preprocessing Draft

Mark Davis mark.davis at icu-project.org
Sat Jan 19 20:36:23 CET 2008


IDNAbis Preprocessing Draft
M. Davis, 2007-01-08
(live document at: http://docs.google.com/Doc?id=dfqr8rd5_51c3nrskcx)

*TBD: boilerplate, wordsmithing, references, fleshing out for clarity,...*

When using the IDNAbis specification, some user agents such as browsers may
have a requirement to interoperate compatibly with the prior IDNA2003
specification and/or may operate in an environment that needs to allow
lenient parsing of IDNs. To do so, such user agents need to do a number of
preprocessing steps on URI/IRIs to extract and convert labels. To promote
interoperability among user agents, the specification for such preprocessing
is provided in this document.

Lower-level protocols, such as the SMTP envelope, should require the strict
use of U-labels and thus not use the preprocessing specified here.
Language-specific modifications to the preprocessing specified in this
document are outside of the scope of this document; they are, however,
discouraged because of the problems they pose for interoperability.

1. Preprocessing
The preprocessing consists of the following steps, performed
in order:


   1. Parse the URI/IRI to get the host_name string.
      - *Abort with error if not found.*
   2. Convert the host_name string to Unicode.
      - *Abort with error if there is any conversion problem.*
   3. Convert any escapes in the host_name string to Unicode code points
      as necessary, depending on context (e.g., HTML NCRs like &#x5341; or
      JavaScript escapes like \u5341).
      - *Abort with error if any are malformed (such as "\u123G").*
   4. Convert any %-escapes in the host_name string according to IRI
      (e.g., %2e becomes
      U+002E <http://unicode.org/cldr/utility/character.jsp?a=002E> ( . )
      FULL STOP).
      - *Abort with error if malformed (e.g., "%2", or the bytes are not
        allowed in UTF-8).*
   5. Map the host_name string according to the IDNA Preprocessing Table
      (see below).
   6. Normalize the host_name to Unicode Normalization Form C:
      - *host_name = toNFC(host_name)*
   7. Parse the host_name string into labels, using
      U+002E <http://unicode.org/cldr/utility/character.jsp?a=002E> ( . )
      FULL STOP as the label delimiter.
   8. Each label that contains only characters [\-a-zA-Z0-9] is an ASCII
      label. Every other label is converted to ASCII according to the
      IDNAbis specification. That is:
      1. Verify that the label complies with IDNAbis.
         - *Abort with error if not.*
      2. Convert the label to ASCII according to the Punycode
         specification.
         - *label = ToASCII(label).*
         - *Abort with error if invalid.*
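
The steps above can be sketched in Python roughly as follows. This is only
an illustration under several assumptions, not a conforming implementation:
the names preprocess_host, check_idnabis and SAMPLE_TABLE are hypothetical,
SAMPLE_TABLE holds a handful of entries standing in for the full IDNA
Preprocessing Table, step 3 (context-dependent escapes) is omitted, the
IDNAbis validity check of step 8.1 is a stub, and the "xn--" ACE prefix is
assumed for ToASCII.

```python
import re
import unicodedata
from urllib.parse import urlsplit, unquote

ASCII_LABEL = re.compile(r"[-a-zA-Z0-9]*\Z")       # step 8: ASCII-only label
BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")   # "%" without two hex digits

# Hypothetical stand-in for the IDNA Preprocessing Table (a few entries only).
SAMPLE_TABLE = {
    "\u00AD": "",        # SOFT HYPHEN is ignored
    "\u3002": ".",       # IDEOGRAPHIC FULL STOP => FULL STOP
    "\uFF61": ".",       # HALFWIDTH IDEOGRAPHIC FULL STOP => FULL STOP
    "\u00DC": "\u00FC",  # U WITH DIAERESIS: capital => small
}

def check_idnabis(label):
    """Stub for step 8.1; a real implementation would apply the IDNAbis rules."""
    return True

def preprocess_host(uri, table=SAMPLE_TABLE):
    # Step 1: extract host_name (assumes no IPv6 literal in the authority).
    netloc = urlsplit(uri).netloc
    host = netloc.rpartition("@")[2].partition(":")[0]
    if not host:
        raise ValueError("no host_name found")
    # Step 2 is implicit: Python strings are already Unicode.
    # Step 4: %-escapes; malformed escapes or invalid UTF-8 raise ValueError.
    if BAD_PERCENT.search(host):
        raise ValueError("malformed %-escape")
    host = unquote(host, encoding="utf-8", errors="strict")
    # Step 5: map each code point through the table, in a single pass.
    host = "".join(table.get(ch, ch) for ch in host)
    # Step 6: normalize to NFC.
    host = unicodedata.normalize("NFC", host)
    # Steps 7 and 8: split into labels and convert the non-ASCII ones.
    labels = []
    for label in host.split("."):
        if ASCII_LABEL.match(label):
            labels.append(label)
        else:
            if not check_idnabis(label):
                raise ValueError("label does not comply with IDNAbis")
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)
```

Because the mapping in step 5 runs before the split in step 7, an
ideographic full stop in the input acts as a label delimiter.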

2. IDNA Preprocessing Table
This table provides a combined case folding and NFKC normalization, with
some small modifications for IDNA2003 compatibility. This table will remain
stable for all future versions of Unicode; that is, no mappings will be
changed, and any new mappings will only be added for new assigned
characters. There are more details in each section below.

Note that, because of the way the IDNA Preprocessing Table is constructed,
applying toNFC(output) is sufficient to ensure isNFKC(output). That is,
the extra changes that NFKC makes beyond NFC are already in the
table. It is also necessary to do *at least* toNFC(output), since otherwise
the text may have unordered combining marks and/or uncomposed characters.

2.1 IDNA Preprocessing Table Usage
The IDNA Preprocessing Table, once
constructed, consists of a set of mappings. Each mapping entry has a single
code point as a source, and maps that code point to a result sequence of
zero or more other code points.

To use the table to map a string, walk through the string, one code point at
a time. If there is a mapping entry for that code point, replace that code
point by the result of the mapping entry. Otherwise retain the code point as
is.
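
As a small illustration of this walk, the sketch below keys a table by
source code point and stores each result as a list of code points. The
function name map_string and the three sample entries are hypothetical and
chosen only to show a one-to-one, a one-to-zero and a one-to-many mapping.

```python
def map_string(text, table):
    """Map text one code point at a time; mapped results are not mapped again."""
    out = []
    for ch in text:                      # Python iterates a str by code point
        out.extend(table.get(ord(ch), [ord(ch)]))
    return "".join(map(chr, out))

# Illustrative entries only, not the real table.
SAMPLE = {
    0x0041: [0x0061],           # A => a
    0x00AD: [],                 # SOFT HYPHEN => empty sequence
    0x00DF: [0x0073, 0x0073],   # SHARP S => s s
}
```

For example, map_string("A\u00ADb\u00DF", SAMPLE) yields "abss": the code
point without an entry ("b") is retained as is.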

2.2 IDNA Preprocessing Table Construction
The IDNA Preprocessing Table is
constructed as specified in this section.

Initially, the table is constructed based on Unicode 5.1. But a table for
any version of Unicode subsequent to Unicode 5.1 can be constructed with
exactly the same rules.

Informally, the table is constructed by mapping each Unicode character
through case folding and then normalization to Unicode Normalization Form
KC (NFKC). However, there are some exceptional mappings and exclusions
required for compatibility with IDNA2003. The exceptional mappings
constitute a small list of characters that map to nothing in IDNA2003, plus
full stops and a few normalization corrections requiring special handling.
Those are listed completely in the section "IDNA Preprocessing Exceptions".

The exclusions constitute another small list of characters which map to
themselves under IDNA2003 rules, but which do not map to themselves if
casefolded and normalized by the Unicode 5.1 specification. These are listed
completely in the section "IDNA Preprocessing Exclusions".

Note that unassigned (reserved) code points never get an entry in the IDNA
Preprocessing Table.

Formally, the construction of the IDNA Preprocessing Table is specified as:

For each code point X:

   1. *Exceptions.* If X is in the IDNA Preprocessing Exceptions, use the
      mapping in that table, and continue with the next code point.
   2. *Exclusions.* If X is in the IDNA Preprocessing Exclusions,
      continue with the next code point.
   3. *Normalization and Casefolding.*
      1. Z := X
      2. Do
            a. Y := Z
            b. Z := toNFKC(toCaseFold(Y))
         until (Y == Z)      // at most two iterations are required
      3. If X != Y
         then add the mapping X => Y
         else continue without adding a mapping for X
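
The construction can be sketched with Python's standard library as shown
below. The names build_table and fold_and_normalize are hypothetical.
str.casefold() is assumed to play the role of toCaseFold, and the
unicodedata module uses whatever Unicode version ships with the interpreter
rather than Unicode 5.1, so the resulting entries may differ from a table
built on 5.1 data. Exceptions are passed in as a dict and exclusions as a
set.

```python
import unicodedata

def fold_and_normalize(x):
    """Repeat toNFKC(toCaseFold(...)) until the result stops changing."""
    z = x
    while True:
        y = z
        z = unicodedata.normalize("NFKC", y.casefold())
        if y == z:
            return y

def build_table(code_points, exceptions, exclusions):
    table = {}
    for cp in code_points:
        x = chr(cp)
        if x in exceptions:              # rule 1: exceptions win
            table[x] = exceptions[x]
            continue
        if x in exclusions:              # rule 2: excluded, no entry
            continue
        # Unassigned code points (and lone surrogates) never get an entry.
        if unicodedata.category(x) in ("Cn", "Cs"):
            continue
        y = fold_and_normalize(x)        # rule 3
        if x != y:
            table[x] = y
    return table
```

A call such as build_table(range(0x250), {"\u00AD": ""}, set()) builds the
entries for the first few blocks only; a full table would iterate over
range(0x110000).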


2.3 IDNA Preprocessing Exclusions
Exclude the following characters, for
compatibility with IDNA2003. These are characters that had no lowercase
counterparts in Unicode 3.2, but had them added later. Unicode
has since stabilized case folding, so that this won't happen in the future.
That is, case pairs will be assigned in the same version of Unicode -- so
any newly assigned character will either have a casefolding in that version
of Unicode, or it will never have a casefolding in the future.

U+04C0 <http://unicode.org/cldr/utility/character.jsp?a=04C0> ( Ӏ ) CYRILLIC
LETTER PALOCHKA
U+10A0 <http://unicode.org/cldr/utility/character.jsp?a=10A0> ( Ⴀ ) GEORGIAN
CAPITAL LETTER AN
…{36}…U+10C5 <http://unicode.org/cldr/utility/character.jsp?a=10C5> ( Ⴥ )
GEORGIAN CAPITAL LETTER HOE
U+2132 <http://unicode.org/cldr/utility/character.jsp?a=2132> ( Ⅎ ) TURNED
CAPITAL F
U+2183 <http://unicode.org/cldr/utility/character.jsp?a=2183> ( Ↄ ) ROMAN
NUMERAL REVERSED ONE HUNDRED
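
The rationale can be checked with Python's unicodedata module, which exposes
a Unicode 3.2 view of the database as unicodedata.ucd_3_2_0. The sketch
below is only a sanity check under stated assumptions: the names EXCLUDED
and fold_target_unassigned_in_3_2 are hypothetical, and the elided Georgian
entries are assumed to be the contiguous range U+10A0 through U+10C5.

```python
import unicodedata

# Assumption: the elided list entries are the contiguous range U+10A0..U+10C5.
EXCLUDED = [0x04C0, *range(0x10A0, 0x10C6), 0x2132, 0x2183]

def fold_target_unassigned_in_3_2(cp):
    """True if cp case-folds to characters that Unicode 3.2 did not define."""
    ch = chr(cp)
    folded = ch.casefold()               # uses the interpreter's Unicode data
    return folded != ch and all(
        unicodedata.ucd_3_2_0.category(c) == "Cn" for c in folded
    )
```

Each character in the list is expected to satisfy this predicate, while an
ordinary letter such as U+0041 does not, because its lowercase form already
existed in Unicode 3.2.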

2.3 IDNA Preprocessing Exceptions
For compatibility with IDNA2003, include
the following mappings. The notation [:xxx:] means a Unicode property value.
A mapping is expressed as X => Y, where X is a single code point, and Y is a
sequence of zero or more other code points.

*2.3.1. Ignore (map to an empty sequence) the following characters*

These are specific mappings as part of IDNA2003.

 U+00AD <http://unicode.org/cldr/utility/character.jsp?a=00AD> ( ) SOFT
HYPHEN
U+034F <http://unicode.org/cldr/utility/character.jsp?a=034F> ( ) COMBINING
GRAPHEME JOINER
U+1806 <http://unicode.org/cldr/utility/character.jsp?a=1806> ( ᠆ )
MONGOLIAN TODO SOFT HYPHEN
U+200B <http://unicode.org/cldr/utility/character.jsp?a=200B> ( ) ZERO WIDTH
SPACE
U+2060 <http://unicode.org/cldr/utility/character.jsp?a=2060> ( ) WORD
JOINER
U+FEFF <http://unicode.org/cldr/utility/character.jsp?a=FEFF> ( ) ZERO WIDTH
NO-BREAK SPACE
and Variation Selectors

In UnicodeSet notation: [\u034F\u1806\u200B-\u200D\u2060\uFEFF\u00AD
[:variation_selector:]]

 Note: the following characters were ignored in IDNA2003. They are allowed
in IDNAbis in limited contexts and otherwise ignored.

 U+200C <http://unicode.org/cldr/utility/character.jsp?a=200C> ( ) ZERO
WIDTH NON-JOINER
U+200D <http://unicode.org/cldr/utility/character.jsp?a=200D> ( ) ZERO WIDTH
JOINER

 In UnicodeSet notation: [\u200C \u200D]

*2.3.2. Full Stops*

These are specific mappings as part of IDNA2003, having to do with label
separators.

Map U+3002 <http://unicode.org/cldr/utility/character.jsp?a=3002> ( 。 )
IDEOGRAPHIC FULL STOP (and anything mapped to it by toNFKC) to
U+002E <http://unicode.org/cldr/utility/character.jsp?a=002E> ( . ) FULL
STOP. That is:

U+3002 <http://unicode.org/cldr/utility/character.jsp?a=3002> ( 。 )
IDEOGRAPHIC FULL STOP
=> U+002E <http://unicode.org/cldr/utility/character.jsp?a=002E> ( . ) FULL
STOP

U+FF61 <http://unicode.org/cldr/utility/character.jsp?a=FF61> ( 。 )
HALFWIDTH IDEOGRAPHIC FULL STOP
=> U+002E <http://unicode.org/cldr/utility/character.jsp?a=002E> ( . ) FULL
STOP

*2.3.3. Retain Corrigendum #4: Five Unihan Canonical Mapping Errors
<http://www.unicode.org/versions/corrigendum4.html>*


These are characters whose normalizations changed after Unicode 3.2 (all of
them were in Unicode 4.0.0). While the set of characters that are normalized
to different values has been stable in Unicode, the results have not been.
We anticipate that as of Unicode 5.1, normalization will be completely
stabilized, so these would be the first *and* last such characters.


U+2F868 CJK COMPATIBILITY IDEOGRAPH-2F868
=> U+2136A CJK UNIFIED IDEOGRAPH-2136A

U+2F874 CJK COMPATIBILITY IDEOGRAPH-2F874
=> U+5F33 CJK UNIFIED IDEOGRAPH-5F33

U+2F91F CJK COMPATIBILITY IDEOGRAPH-2F91F
=> U+43AB CJK UNIFIED IDEOGRAPH-43AB

U+2F95F CJK COMPATIBILITY IDEOGRAPH-2F95F
=> U+7AAE CJK UNIFIED IDEOGRAPH-7AAE

U+2F9BF CJK COMPATIBILITY IDEOGRAPH-2F9BF
=> U+4D57 CJK UNIFIED IDEOGRAPH-4D57
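
To see why these five entries need explicit handling, the mappings listed
above can be compared against two normalization data sets. The sketch
assumes Python's unicodedata module: unicodedata.ucd_3_2_0 gives the Unicode
3.2 behavior, and the module itself gives the behavior of whatever Unicode
version ships with the interpreter. The names RETAINED and nfc are
hypothetical.

```python
import unicodedata

# The five mappings listed above, as source => target code points.
RETAINED = {
    0x2F868: 0x2136A,
    0x2F874: 0x5F33,
    0x2F91F: 0x43AB,
    0x2F95F: 0x7AAE,
    0x2F9BF: 0x4D57,
}

def nfc(cp, ucd=unicodedata):
    """NFC of a single code point under the given Unicode database object."""
    return ucd.normalize("NFC", chr(cp))

for src, dst in RETAINED.items():
    old = nfc(src, unicodedata.ucd_3_2_0)   # Unicode 3.2 normalization
    new = nfc(src)                          # interpreter's current data
    print(f"U+{src:04X}: 3.2 => U+{ord(old):04X}, "
          f"current => U+{ord(new):04X}, listed => U+{dst:04X}")
```

Under these assumptions the Unicode 3.2 result should agree with the listed
target for each source, while current normalization data gives a different
result, which is why a table derived purely from toNFKC would not be
compatible with IDNA2003 for these characters.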