<h1>

  IDNAbis Preprocessing <font color="#cc0000">Draft</font>

</h1><i>

M. Davis, 2007-01-08<br>

<span style="font-style: italic;">(live document at: </span></i><a style="font-style: italic;" id="publishedDocumentUrl" class="tabcontent" target="_blank" href="http://docs.google.com/Doc?id=dfqr8rd5_51c3nrskcx">http://docs.google.com/Doc?id=dfqr8rd5_51c3nrskcx

</a><span style="font-style: italic;">)</span><i style="font-style: italic;"><br>

</i><br><i>TBD: boilerplate, wordsmithing, references, fleshing out for clarity,...</i><br style="font-style: italic;">

<br>When using the IDNAbis specification, some user agents, such as
browsers, may need to interoperate with the prior IDNA2003
specification, may operate in an environment that requires lenient
parsing of IDNs, or both. To do so, such user agents need to perform a
number of preprocessing steps on URIs/IRIs to extract and convert
labels. To promote interoperability among user agents, this document
specifies that preprocessing.<br>

<br>

Lower-level protocols, such as the SMTP envelope, should require the

strict use of U-labels and thus not use the preprocessing specified

here. Language-specific modifications to the preprocessing specified in

this document are outside of the scope of this document; they are,

however, discouraged because of the problems they pose for

interoperability.<br>

<h2>1. Preprocessing</h2>The preprocessing consists of the following steps, performed in order:<br>

<br>

<ol><li>

    <div>

      Parse URI/IRI to get the host_name string.<br></div></li><ul><li><div><i>Abort with error if not found.</i>

  </div></li></ul><li>

    <div>

      Convert the host_name string to Unicode.</div></li><ul><li><i>Abort with error if there is any conversion problem.

  </i></li></ul><li>

    <div>Convert
any escapes in the host_name string to Unicode code points as
necessary, depending on context (e.g., HTML NCRs like &amp;#x5341; or
JavaScript escapes like \u5341).</div></li><ul><li><i>Abort with error if any are malformed (such as &quot;\u123G&quot;).

  </i></li></ul><li>

    <div>

      Convert any %-escapes in the host_name string according to IRI (e.g., %2e becomes <code><a href="http://unicode.org/cldr/utility/character.jsp?a=002E" target="c">U+002E</a></code> ( . ) FULL STOP).</div></li><ul><li><i>
Abort with error if malformed (e.g., &quot;%2&quot;, or the bytes are not allowed in UTF-8).

  </i></li></ul><li>

    Map the host_name string according to the IDNA Preprocessing Table (see below).

  </li><li>

    Normalize the host_name to Unicode Normalization Form C:</li><ul><li><i>host_name = toNFC(host_name)<br></i></li></ul><li>

    Parse the host_name string into labels, using <code><a href="http://unicode.org/cldr/utility/character.jsp?a=002E" target="c">U+002E</a></code> ( . ) FULL STOP as the label delimiter.

  </li><li>Each
label that contains only characters [\-a-zA-Z0-9] is an ASCII label.
Every other label is converted to ASCII according to the IDNAbis
specification. That is: </li><ol><li>
      Verify that the label complies with IDNAbis.</li><ul><li><i>Abort with error if not.<br>
    </i></li></ul><li>Convert the label to ASCII according to the Punycode specification.</li><ul><li><i>label = ToASCII(label).</i></li><li><i>Abort with error if invalid.</i></li></ul></ol></ol>
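<p>As a rough illustration of steps 4 through 8 above, the following Python sketch runs an already-extracted host_name string through %-unescaping, table mapping, NFC normalization, label splitting, and Punycode conversion. The function name <code>preprocess_host</code> and its <code>table</code> parameter (a plain dict standing in for the IDNA Preprocessing Table) are assumptions made for this example and are not defined by this document. The IDNAbis validity check of step 8.1 is deliberately left out, and the "xn--" prefix is the ACE prefix used by IDNA2003, assumed here to carry over.</p>

```python
import re
import unicodedata
from urllib.parse import unquote

# A label made up only of these characters is an ASCII label (step 8).
ASCII_LABEL = re.compile(r"[\-a-zA-Z0-9]*")
# A '%' that is not followed by two hex digits is a malformed escape (step 4).
MALFORMED_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def preprocess_host(host_name, table):
    """Sketch of steps 4-8; `table` is a hypothetical dict of mappings."""
    # Step 4: convert %-escapes, aborting on malformed input.
    if MALFORMED_PERCENT.search(host_name):
        raise ValueError("malformed %-escape")
    # errors="strict" raises UnicodeDecodeError (a ValueError) on bad UTF-8.
    host_name = unquote(host_name, encoding="utf-8", errors="strict")

    # Step 5: map each code point through the table, keeping unmapped ones.
    host_name = "".join(table.get(ch, ch) for ch in host_name)

    # Step 6: normalize to NFC.
    host_name = unicodedata.normalize("NFC", host_name)

    # Steps 7 and 8: split on U+002E and convert non-ASCII labels.
    result = []
    for label in host_name.split("."):
        if ASCII_LABEL.fullmatch(label):
            result.append(label)
        else:
            # Step 8.1 (IDNAbis validity check) is omitted in this sketch.
            result.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(result)


if __name__ == "__main__":
    demo_table = {"B": "b", "\u3002": "."}  # tiny stand-in for the real table
    print(preprocess_host("B%C3%BCcher\u3002example", demo_table))
```

<p>The two-entry dict in the usage lines is only a stand-in; a real implementation would pass the full table constructed as described in Section 2.</p>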

<h2>2. IDNA Preprocessing Table<br></h2>This table provides a combined
case folding and NFKC normalization, with some small modifications for
IDNA2003 compatibility. This table will remain stable for all future
versions of Unicode; that is, no existing mappings will be changed, and
new mappings will be added only for newly assigned characters. There are
more details in each section below.<br><br>Note that, because of the way
the IDNA Preprocessing Table is constructed, applying toNFC(output) is
sufficient to ensure isNFKC(output). That is, the extra
changes that are in NFKC but not in NFC are already in the table.
It is also necessary to apply <i>at least</i> toNFC(output), since otherwise the text may have unordered combining marks and/or uncomposed characters.<br><h3>2.1 IDNA Preprocessing Table Usage</h3>The

IDNA Preprocessing Table, once constructed, consists of a set of

mappings. Each mapping entry has a single code point as a source, and

maps that code point to a result sequence of zero or more other code

points.<br><br>To use the table to map a string, walk through the

string, one code point at a time. If there is a mapping entry for that

code point, replace that code point by the result of the mapping entry.

Otherwise retain the code point as is.<br><h3>2.2 IDNA Preprocessing Table Construction</h3>The IDNA Preprocessing Table is constructed as specified in this section.<br><br>Initially,

the table is constructed based on Unicode 5.1. But a table for any

version of Unicode subsequent to Unicode 5.1 can be constructed with

exactly the same rules. <br><br>Informally, the table is constructed by
applying casefolding to each Unicode character and then normalizing the
result to Unicode Normalization Form KC (NFKC). However, there
are some exceptional mappings and exclusions required for compatibility
with IDNA2003. The exceptional mappings constitute a small list of
characters that map to nothing in IDNA2003, plus full stops and a few
normalization corrections requiring special handling. Those are listed
completely in the IDNA Preprocessing Exceptions section.<br><br>The exclusions constitute another

small list of characters which map to themselves under IDNA2003 rules,

but which do not map to themselves if casefolded and normalized by the

Unicode 5.1 specification. These are listed completely in the IDNA Preprocessing Exclusions section.<br><br>Note that unassigned (reserved) code points never get an entry in the IDNA Preprocessing Table.<br><br>Formally, the construction of the IDNA Preprocessing Table is specified as:

<br><br><div style="margin-left: 40px;">For each code point X:<br></div>

<ol><ol type="A"><li><b>Exceptions. </b>If X is in the IDNA Preprocessing Exceptions, use the mapping in that table, and continue with next code point

  </li><li><b>

    Exclusions.</b> If X is in IDNA Preprocessing Exclusions, continue with next code point<br>

  </li><li><b>Normalization and Casefolding.</b></li><ol><li>Z := X</li><li>Do<br>&nbsp;&nbsp; a. Y := Z<br>&nbsp;&nbsp; b. Z := toNFKC(toCaseFold(Y))<br>until (Y == Z)&nbsp;&nbsp; // at most two iterations are required

    </li><li>If X != Y<br>then add the mapping X =&gt; Y<br>else continue without adding a mapping for X<br></li></ol></ol></ol>
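<p>The formal construction above can be sketched in Python as follows. This is only an illustration: <code>build_table</code> and its parameters are names invented for this example, <code>str.casefold()</code> stands in for toCaseFold, and Python's <code>unicodedata</code> module reflects whatever Unicode version ships with the interpreter rather than Unicode 5.1. Skipping surrogate code points is an additional assumption of the sketch.</p>

```python
import unicodedata


def build_table(exceptions, exclusions, code_points=range(0x110000)):
    """Sketch of the table construction.

    `exceptions` is a dict {source: result}; `exclusions` is a set of
    single characters. Both are supplied by the caller (hypothetical API).
    """
    table = {}
    for cp in code_points:
        x = chr(cp)
        # Unassigned code points never get an entry; surrogates are
        # skipped as well (an assumption of this sketch).
        if unicodedata.category(x) in ("Cn", "Cs"):
            continue
        # A. Exceptions: use the supplied mapping as is.
        if x in exceptions:
            table[x] = exceptions[x]
            continue
        # B. Exclusions: no entry, the character maps to itself.
        if x in exclusions:
            continue
        # C. Repeat casefolding plus NFKC until the result is stable.
        z = x
        while True:
            y = z
            z = unicodedata.normalize("NFKC", y.casefold())
            if y == z:
                break
        if x != y:
            table[x] = y
    return table


if __name__ == "__main__":
    sample = build_table(
        exceptions={"\u00AD": "", "\u3002": "."},
        exclusions={"\u04C0"},
        code_points=[0x41, 0x61, 0xAD, 0x3002, 0x04C0, 0xFF21],
    )
    for source, result in sample.items():
        print(f"U+{ord(source):04X} => {[f'U+{ord(c):04X}' for c in result]}")
```

<p>Restricting <code>code_points</code> to a handful of values, as in the usage lines, keeps the demonstration fast; the default walks the whole code space.</p>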

<h3>2.3 IDNA Preprocessing Exclusions

</h3>

Exclude the following characters, for compatibility with IDNA2003.
These are characters that had no lowercase counterparts in Unicode 3.2,
but had lowercase counterparts added later. Unicode has since stabilized
case folding, so this will not happen in the future. That is, both
members of a case pair will be assigned in the same version of Unicode,
so any newly assigned character will either have a casefolding in that
version of Unicode or never have one.<br>

<br>

<code><a href="http://unicode.org/cldr/utility/character.jsp?a=04C0" target="c">U+04C0</a></code> ( Ӏ ) CYRILLIC LETTER PALOCHKA<br>

<code><a href="http://unicode.org/cldr/utility/character.jsp?a=10A0" target="c">U+10A0</a></code> ( Ⴀ ) GEORGIAN CAPITAL LETTER AN<br>

…{36}…<code><a href="http://unicode.org/cldr/utility/character.jsp?a=10C5" target="c">U+10C5</a></code> ( Ⴥ ) GEORGIAN CAPITAL LETTER HOE<br>

<code><a href="http://unicode.org/cldr/utility/character.jsp?a=2132" target="c">U+2132</a></code> ( Ⅎ ) TURNED CAPITAL F<br>

<code><a href="http://unicode.org/cldr/utility/character.jsp?a=2183" target="c">U+2183</a></code> ( Ↄ ) ROMAN NUMERAL REVERSED ONE HUNDRED<br>
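<p>To see why these characters need to be excluded, the following Python sketch checks whether a character case-folds to something other than itself in the Unicode data bundled with the interpreter (assumed to be later than Unicode 3.2). The helper name <code>folds_to_other</code> is invented for this example, and only the code points spelled out in the list above are checked; the elided entries are not filled in.</p>

```python
# Code points that appear explicitly in the list above.
LISTED = ["\u04C0", "\u10A0", "\u10C5", "\u2132", "\u2183"]


def folds_to_other(ch):
    """True if case folding maps `ch` to something other than itself."""
    return ch.casefold() != ch


if __name__ == "__main__":
    for ch in LISTED:
        folded = ch.casefold()
        print(f"U+{ord(ch):04X} casefolds to U+{ord(folded):04X}")
```

<p>Without the exclusion step, each of these characters would receive a mapping in the table, which would differ from its IDNA2003 behavior.</p>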

<h3>2.3 IDNA Preprocessing Exceptions

</h3>

For compatibility with IDNA2003, include the following mappings. The

notation [:xxx:] means a Unicode property value. A mapping is expressed

as X =&gt; Y, where X is a single code point, and Y is a sequence of

zero or more other code points.<br>

<br>

<b>2.3.1. Ignore (map to an empty sequence) the following characters<br></b><br>These are specific mappings as part of IDNA2003.<br><br>

<div style="margin-left: 40px;">

  <code><a href="http://unicode.org/cldr/utility/character.jsp?a=00AD" target="c">U+00AD</a></code> ( ) SOFT HYPHEN<br>

  <code><a href="http://unicode.org/cldr/utility/character.jsp?a=034F" target="c">U+034F</a></code> ( ) COMBINING GRAPHEME JOINER<br>

  <code><a href="http://unicode.org/cldr/utility/character.jsp?a=1806" target="c">U+1806</a></code> ( ᠆ ) MONGOLIAN TODO SOFT HYPHEN<br>

  <code><a href="http://unicode.org/cldr/utility/character.jsp?a=200B" target="c">U+200B</a></code> ( ) ZERO WIDTH SPACE<br>

  <code><a href="http://unicode.org/cldr/utility/character.jsp?a=2060" target="c">U+2060</a></code> ( ) WORD JOINER<br>

  <code><a href="http://unicode.org/cldr/utility/character.jsp?a=FEFF" target="c">U+FEFF</a></code> ( ) ZERO WIDTH NO-BREAK SPACE<br>

  and Variation Selectors<br>

</div>

<br>

In UnicodeSet notation: [\u00AD\u034F\u1806\u200B-\u200D\u2060\uFEFF [:variation_selector:]]<br>

<br>

<div style="margin-left: 40px;"> Note: the following characters were

ignored in IDNA2003. They are allowed in IDNAbis in limited contexts

and otherwise ignored.<br><br>

</div>

<div style="margin-left: 80px;">

  <code><a href="http://unicode.org/cldr/utility/character.jsp?a=200C" target="c">U+200C</a></code> ( ) ZERO WIDTH NON-JOINER<br>

  <code><a href="http://unicode.org/cldr/utility/character.jsp?a=200D" target="c">U+200D</a></code> ( ) ZERO WIDTH JOINER<br><br>

</div>

<div style="margin-left: 40px;">

  In UnicodeSet notation: [\u200C \u200D]<br>

</div>

<br>

<b>2.3.2. Full Stops<br><br></b>These are specific mappings as part of IDNA2003, having to do with label separators.<br><br>Map <code><a href="http://unicode.org/cldr/utility/character.jsp?a=3002" target="c">U+3002</a></code>

 ( 。 ) IDEOGRAPHIC FULL STOP (and anything mapped to it by toNFKC) to <code><a href="http://unicode.org/cldr/utility/character.jsp?a=002E" target="c">U+002E</a></code> ( . ) FULL STOP. That is:<br>

<br>

<code><a href="http://unicode.org/cldr/utility/character.jsp?a=3002" target="c">U+3002</a></code> ( 。 ) IDEOGRAPHIC FULL STOP<br>

=&gt; <code><a href="http://unicode.org/cldr/utility/character.jsp?a=002E" target="c">U+002E</a></code> ( . ) FULL STOP<br>

<br>

<code><a href="http://unicode.org/cldr/utility/character.jsp?a=FF61" target="c">U+FF61</a></code> ( ｡ ) HALFWIDTH IDEOGRAPHIC FULL STOP<br>

=&gt; <code><a href="http://unicode.org/cldr/utility/character.jsp?a=002E" target="c">U+002E</a></code> ( . ) FULL STOP<br>

<br>
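<p>The phrase "anything mapped to it by toNFKC" can be checked mechanically. The sketch below, using Python's <code>unicodedata</code> module (whose Unicode version depends on the interpreter), collects the characters that are, or NFKC-normalize to, U+3002, and turns them into exception entries. The function name <code>full_stop_sources</code> is an assumption of this example.</p>

```python
import unicodedata

IDEOGRAPHIC_FULL_STOP = "\u3002"


def full_stop_sources(code_points):
    """Return the characters that are, or NFKC-normalize to, U+3002."""
    return [
        chr(cp)
        for cp in code_points
        if unicodedata.normalize("NFKC", chr(cp)) == IDEOGRAPHIC_FULL_STOP
    ]


if __name__ == "__main__":
    candidates = [0x002E, 0x3002, 0xFF0E, 0xFF61]
    # Each source found is mapped to U+002E FULL STOP.
    exceptions = {ch: "." for ch in full_stop_sources(candidates)}
    for source in exceptions:
        print(f"U+{ord(source):04X} => U+002E")
```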

<p>

  <b>2.3.3. Retain <a href="http://www.unicode.org/versions/corrigendum4.html">Corrigendum #4: Five Unihan Canonical Mapping Errors</a><br>

  </b></p><p><br></p><p>These

are characters whose normalizations changed after Unicode 3.2 (all of

them were in Unicode 4.0.0). While the set of characters that are

normalized to different values has been stable in Unicode, the results

have not been. We anticipate that as of Unicode 5.1, normalization will

be completely stabilized, so these would be the first <i>and </i>last such characters.<br></p>


<p>

  <code><a target="c">U+2F868</a></code> ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F868<br>

  =&gt; <code><a target="c">U+2136A</a></code> ( ? ) CJK UNIFIED IDEOGRAPH-2136A<br>

  <br>

  <code><a target="c">U+2F874</a></code> ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F874<br>

  <code>=&gt; <a target="c">U+5F33</a></code> ( ? ) CJK UNIFIED IDEOGRAPH-5F33<br>

  <br>

  <code><a target="c">U+2F91F</a></code> ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F91F<br>

  <code>=&gt; <a target="c">U+43AB</a></code> ( ? ) CJK UNIFIED IDEOGRAPH-43AB<br>

  <br>

  <code><a target="c">U+2F95F</a></code> ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F95F<br>

  =&gt; <code><a target="c">U+7AAE</a></code> ( ? ) CJK UNIFIED IDEOGRAPH-7AAE<br>

  <br>

  <code><a target="c">U+2F9BF</a></code> ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F9BF<br>

  <code>=&gt; <a target="c">U+4D57</a></code> ( ? ) CJK UNIFIED IDEOGRAPH-4D57<br>

</p>
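<p>As an illustration, the five retained mappings can be written as a small dict and compared against the normalization performed by a current Unicode library, which follows the corrected data. The names <code>RETAINED</code> and <code>differs_from_current</code> are invented for this sketch; the values are copied from the list above.</p>

```python
import unicodedata

# Retained (pre-correction) mappings, copied from the list above.
RETAINED = {
    "\U0002F868": "\U0002136A",
    "\U0002F874": "\u5F33",
    "\U0002F91F": "\u43AB",
    "\U0002F95F": "\u7AAE",
    "\U0002F9BF": "\u4D57",
}


def differs_from_current(source, retained):
    """True if the interpreter's NFKC result is not the retained value."""
    return unicodedata.normalize("NFKC", source) != retained


if __name__ == "__main__":
    for source, retained in RETAINED.items():
        current = unicodedata.normalize("NFKC", source)
        print(
            f"U+{ord(source):04X}: retained U+{ord(retained):04X}, "
            f"current NFKC U+{ord(current):04X}"
        )
```

<p>Because the two results differ, these entries have to be handled as exceptions rather than derived from toNFKC.</p>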