IDNAbis Preprocessing Draft

Harald Alvestrand harald at alvestrand.no
Sun Jan 20 22:07:56 CET 2008


Mark,

I think developing a draft like this is a reasonable thing to do. But I
do have some problems with the specific approach you are suggesting.

Mark Davis wrote:
>
>
>   IDNAbis Preprocessing Draft
>
> /M. Davis, 2007-01-08
> (live document at:
> http://docs.google.com/Doc?id=dfqr8rd5_51c3nrskcx)/
>
> /TBD: boilerplate, wordsmithing, references, fleshing out for clarity,.../
>
> When using the IDNAbis specification, some user agents such as
> browsers may have a requirement to interoperate compatibly with the
> prior IDNA2003 specification and/or may operate in an environment that
> needs to allow lenient parsing of IDNs.
This is an incorrect characterization: since the strings being parsed do
not conform to the IDN spec, they are not IDNs. What you are instead
advocating (which I think is a reasonable thing) is to parse input that
the user intends as an IDN and try to extract the exact IDN
that was (probably) intended by the user.
> To do so, such user agents need to do a number of preprocessing steps
> on URI/IRIs to extract and convert labels. To promote interoperability
> among user agents, the specification for such preprocessing is
> provided in this document.
Please flip this around - IDNs are not dependent on being inside a
URI/IRI, and in many cases they are not. You should (I think) focus on
the idea that those steps are required when some user input is intended
to be interpreted as an IDN, where one common use case is attempts by
users to type an IRI. (A URI can't contain IDNs, so it's out of scope.)
>
> Lower-level protocols, such as the SMTP envelope, should require the
> strict use of U-labels
or A-labels.
> and thus not use the preprocessing specified here. Language-specific
> modifications to the preprocessing specified in this document are
> outside of the scope of this document; they are, however, discouraged
> because of the problems they pose for interoperability.
I think you need to add a section here giving the requirements on your
mapping algorithm:

- That any characters legal in IDNAbis, if present in the input, are
also present in the output
- That where the user has a reasonable expectation that giving the
character "X" as input will be treated as equivalent to "X'", and it's
possible to determine this unambiguously, this is what should happen.

This may also be a good place to put some text about what kinds of error
recovery are appropriate if the "Abort with error" steps below happen;
again, this depends on context, but some words to show that we've
thought about it are always nice.

I do wonder what your mapping tables look like for the trailing Greek
sigma - that's the canonical case of a context-dependent case-mapping,
just as the dotless I is the canonical case of a language-dependent
case-mapping.
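To make the two cases concrete, here is a small illustration using
Python's built-in string methods, which follow the Unicode default case
algorithms. It says nothing about what the draft's tables actually
contain; Python's Unicode version is also newer than 5.1.

```python
# Context-free folding: both sigma forms fold to U+03C3, so a
# per-code-point table cannot preserve the final-sigma distinction.
final_sigma = "\u03c2"    # GREEK SMALL LETTER FINAL SIGMA
sigma = "\u03c3"          # GREEK SMALL LETTER SIGMA
capital_sigma = "\u03a3"  # GREEK CAPITAL LETTER SIGMA

folded_final = final_sigma.casefold()
folded_capital = capital_sigma.casefold()

# Context-dependent lowercasing: str.lower() looks at the surrounding
# letters and picks final sigma at the end of a word.
lowered_word = "\u039f\u0394\u039f\u03a3".lower()

# Language-dependent mapping: the default lowercase of "I" is dotted "i".
# Turkish or Azeri text would want U+0131 instead, which needs locale
# knowledge that a single global table does not have.
lowered_i = "I".lower()
dotless_i_folded = "\u0131".casefold()
```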
>
>
>     1. Preprocessing
>
> The preprocessing consists of the following steps, performed in order:
>
>   1.
>       Parse URI/IRI to get the host_name string.
>          *
>             /Abort with error if not found./
>
This is only relevant for the IRI case.
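For the IRI case, the extraction step could be sketched like this in
Python. The function name extract_host is hypothetical, and
urllib.parse is only one possible parser; note that it lowercases the
host it returns.

```python
from urllib.parse import urlsplit

def extract_host(iri: str) -> str:
    """Return the host component of an IRI, or raise ValueError if absent."""
    host = urlsplit(iri).hostname  # None when there is no authority part
    if not host:
        raise ValueError(f"no host found in {iri!r}")
    return host
```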
>
>   1.
>       Convert the host_name string to Unicode.
>           * /Abort with error if there is any conversion problem. /
>   2.
>       Convert any escapes in the host_name string to Unicode code
>       points as necessary, depending on context (eg, HTML NCRs like
>       &#x5341; or Javascript escapes like \u5341).
>           * /Abort with error if any are malformed (such as "\u123G")./
>
Again, this is only relevant for the IRI case. HTML NCRs are only
relevant if you assume HTML or another SGML-derived context.
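A minimal sketch of context-dependent escape handling, assuming only
the two syntaxes mentioned in the draft (HTML numeric character
references and JavaScript \uXXXX escapes). All function names here are
made up for illustration, and a malformed escape aborts with ValueError.

```python
import re

# Hypothetical helpers; other escape syntaxes are out of scope here.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")
_JS = re.compile(r"\\u([0-9A-Fa-f]{4})")

def _decode(text, intro, pattern, to_code_point):
    """Replace every escape starting with `intro`; abort if one is malformed."""
    out, pos = [], 0
    while True:
        start = text.find(intro, pos)
        if start < 0:
            out.append(text[pos:])
            return "".join(out)
        match = pattern.match(text, start)
        if match is None:
            raise ValueError(f"malformed escape at offset {start}")
        out.append(text[pos:start])
        out.append(chr(to_code_point(match)))  # chr() raises above 0x10FFFF
        pos = match.end()

def decode_html_ncrs(text):
    """Decode &#xHHHH; and &#DDDD; references."""
    return _decode(text, "&#", _NCR,
                   lambda m: int(m.group(1), 16) if m.group(1) else int(m.group(2)))

def decode_js_escapes(text):
    """Decode JavaScript-style \\uXXXX escapes (exactly four hex digits)."""
    return _decode(text, "\\u", _JS, lambda m: int(m.group(1), 16))
```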
>
>   1.
>       Convert any %-escapes in the host_name string according to IRI
>       (eg, %2e becomes |U+002E
>       <http://unicode.org/cldr/utility/character.jsp?a=002E>| ( . )
>       FULL STOP)
>           * /Abort with error if malformed (eg, "%2" or the bytes are
>             not allowed in UTF-8)./
>
And this one... I think it would be better if you wrote a separate
section on "Extracting IDNs from IRIs in an HTML document", as one
example of preprocessing before you can get to the IDNA Preprocessing part.
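As a sketch of the %-escape step, assuming the escapes encode UTF-8 as
in the IRI spec: the function name is hypothetical, and the explicit
regex check is there because urllib.parse.unquote by itself leaves a
truncated escape such as "%2" untouched instead of failing.

```python
import re
from urllib.parse import unquote

# A "%" that is not followed by exactly two hex digits is malformed.
_BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")

def decode_percent_escapes(host: str) -> str:
    """Decode %-escapes as UTF-8, aborting on malformed input."""
    if _BAD_PERCENT.search(host):
        raise ValueError("malformed %-escape")
    try:
        return unquote(host, encoding="utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError("%-escaped bytes are not valid UTF-8") from exc
```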
>
>    1. Map the host_name string according to the IDNA Preprocessing
>       Table (see below).
>    2. Normalize the host_name to Unicode Normalization Form C:
>           * /host_name = toNFC(host_name)
>             /
>
I assume that you're using toNFC because the difference between toNFC
and toNFKC has already been handled by the IDNA Preprocessing Table. Is
that right?
>
>    1. Parse the host_name string into labels, using |U+002E
>       <http://unicode.org/cldr/utility/character.jsp?a=002E>| ( . )
>       FULL STOP as the label delimiter.
>    2. Each label that contains only characters [\-a-zA-Z0-9] is an
>       ASCII label. Each other label is processed according to the
>       IDNAbis specification to convert to ASCII. That is:
>          1. Verify that the label complies with IDNAbis.
>                 * /Abort with error if not.
>                   /
>          2. Convert the label to ASCII according to the PunyCode
>             specification.
>                 * /label = ToASCII(label)./
>                 * Abort with error if invalid
>
Not in all cases. We only need to convert to ASCII if we're preparing
the IDN for an IDNA-ignorant slot (such as the DNS protocol). In other
cases (such as the EAI UTF8SMTP), it's not needed.
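A rough sketch of the label step with the ASCII conversion made
optional. The names to_a_label, convert_host and the ascii_slot flag are
made up for illustration, the IDNAbis validity check is deliberately
left out, and Python's "punycode" codec stands in for the Punycode
encoder.

```python
import re

# Labels made only of [-a-zA-Z0-9] are ASCII labels; empty labels (for
# example after a trailing dot) are passed through unchanged here.
_ASCII_LABEL = re.compile(r"[-a-zA-Z0-9]*\Z")

def to_a_label(label: str) -> str:
    """Punycode-encode a non-ASCII label and add the ACE prefix."""
    if _ASCII_LABEL.match(label):
        return label
    # NOTE: validation against IDNAbis would have to happen before this.
    return "xn--" + label.encode("punycode").decode("ascii")

def convert_host(host: str, ascii_slot: bool = True) -> str:
    """Split on U+002E and convert labels only for IDNA-ignorant slots."""
    labels = host.split(".")
    if not ascii_slot:
        return ".".join(labels)
    return ".".join(to_a_label(label) for label in labels)
```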
>
>
>
>     2. IDNA Preprocessing Table
>
> This table provides a combined case folding and NFKC normalization,
> with some small modifications for IDNA2003 compatibility. This table
> will remain stable for all future versions of Unicode; that is, no
> mappings will be changed, and any new mappings will only be added for
> new assigned characters. There are more details in each section below.
>
> Note that the way that the IDNA Preprocessing Table is constructed, in
> order to ensure that isNFKC(output) it is sufficient to do
> toNFC(output). That is, the extra changes that are in NFKC that are
> not in NFC are already in the table. It is also necessary to do /at
> least/ toNFC(output), since otherwise the text may have unordered
> combining marks and/or uncomposed characters.
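The claim about toNFC can be checked on a small sample. In this sketch
a one-pass fold-then-NFKC per code point stands in for the real table
lookup; that is an assumption, not the draft's table.

```python
import unicodedata

def map_code_point(ch: str) -> str:
    # Stand-in for a table lookup (assumption): case fold, then NFKC.
    return unicodedata.normalize("NFKC", ch.casefold())

# FULLWIDTH LATIN CAPITAL A, "e" + COMBINING ACUTE ACCENT, ROMAN NUMERAL FOUR
sample = "\uff21e\u0301\u2163"

# Per-code-point mapping leaves "e" and the combining accent uncomposed.
mapped = "".join(map_code_point(ch) for ch in sample)

# NFC composes them; no compatibility characters are left for NFKC to change.
output = unicodedata.normalize("NFC", mapped)
already_nfkc = unicodedata.normalize("NFKC", output) == output
```

Here the per-code-point result still differs from the NFC result, which
is why the toNFC step cannot be dropped.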

>
>       2.1 IDNA Preprocessing Table Usage
>
> The IDNA Preprocessing Table, once constructed, consists of a set of
> mappings. Each mapping entry has a single code point as a source, and
> maps that code point to a result sequence of zero or more other code
> points.
>
> To use the table to map a string, walk through the string, one code
> point at a time. If there is a mapping entry for that code point,
> replace that code point by the result of the mapping entry. Otherwise
> retain the code point as is.
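The table walk described above might look like this, assuming the
table is held as a dict from code point to replacement string. The
name apply_table and the sample entries are illustrative only.

```python
def apply_table(table: dict, text: str) -> str:
    """Map a string one code point at a time; unmapped code points are kept."""
    return "".join(table.get(ord(ch), ch) for ch in text)

# Tiny illustrative table: one-to-one, one-to-many, and one-to-empty entries.
sample_table = {
    ord("A"): "a",
    0x00DF: "ss",   # LATIN SMALL LETTER SHARP S
    0x00AD: "",     # SOFT HYPHEN maps to the empty sequence
}
```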
>
>
>       2.2 IDNA Preprocessing Table Construction
>
> The IDNA Preprocessing Table is constructed as specified in this section.
>
> Initially, the table is constructed based on Unicode 5.1. But a table
> for any version of Unicode subsequent to Unicode 5.1 can be
> constructed with exactly the same rules.
>
> Informally, the table construction is done by mapping each Unicode
> character by applying casefolding and then normalization to Unicode
> Normalization Form KC (NFKC). However, there are some exceptional
> mappings and exclusions required for compatibility with IDNA2003. The
> exceptional mappings constitute a small list of characters that map to
> nothing in IDNA2003, plus full stops and a few normalization
> corrections requiring special handling. Those are listed completely in
> Section 2.3.
>
> The exclusions constitute another small list of characters which map
> to themselves under IDNA2003 rules, but which do not map to themselves
> if casefolded and normalized by the Unicode 5.1 specification. These
> are listed completely in Section 2.3.
>
> Note that unassigned (reserved) code points never get an entry in the
> IDNA Preprocessing Table.
>
> Formally, the construction of the IDNA Preprocessing Table is
> specified as:
>
> For each code point X:
>
>          1. *Exceptions. *If X is in the IDNA Preprocessing
>             Exceptions, use the mapping in that table, and continue
>             with next code point
>          2. * Exclusions.* If X is in IDNA Preprocessing Exclusions,
>             continue with next code point
>          3. *Normalization and Casefolding.*
>                1. Z := X
>                2. Do
>                      a. Y := Z
>                      b. Z := toNFKC(toCaseFold(Y))
>                   until (Y == Z)                  // the maximum
>                   iterations required are two
>
What are the references for toCaseFold() and toNFKC()? Unicode
specification section....?

>
>
>                2. If X != Y
>                   then add the mapping X => Y
>                   else continue without adding a mapping for X
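Read literally, the construction loop could be sketched as follows. I
am assuming str.casefold() for toCaseFold and
unicodedata.normalize("NFKC", ...) for toNFKC; since Python ships a
Unicode version later than 5.1, the resulting table would not be
identical to the draft's. build_table and its parameters are
hypothetical names.

```python
import sys
import unicodedata

def build_table(exceptions, exclusions, code_points=None):
    """exceptions: dict code point -> str; exclusions: set of code points."""
    if code_points is None:
        code_points = range(sys.maxunicode + 1)
    table = {}
    for cp in code_points:
        if cp in exceptions:            # step 1: exceptions win
            table[cp] = exceptions[cp]
            continue
        if cp in exclusions:            # step 2: exclusions get no entry
            continue
        ch = chr(cp)
        # Unassigned code points never get an entry; surrogates are skipped too.
        if unicodedata.category(ch) in ("Cn", "Cs"):
            continue
        z = ch                          # step 3: fold and normalize until stable
        while True:
            y = z
            z = unicodedata.normalize("NFKC", y.casefold())
            if y == z:
                break
        if y != ch:
            table[cp] = y
    return table
```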
>
>
>       2.3 IDNA Preprocessing Exclusions
>
> Exclude the following characters, for compatibility with IDNA2003.
What does "exclude" mean here? Forbid them on input (making the
conversion fail), or something else?

Given that we're making an algorithm for turning "messy user input" into
"clean IDNAbis-compatible domain names", I'm not sure excluding them
will make sense.

> These are characters that didn't have lowercases in Unicode 3.2, but
> had lowercase characters added later. Unicode has since stabilized
> case folding, so that this won't happen in the future. That is, case
> pairs will be assigned in the same version of Unicode -- so any newly
> assigned character will either have a casefolding in that version of
> Unicode, or it will never have a casefolding in the future.
Out of curiosity: what will Unicode do if you discover a case variant
that you didn't know existed? (the one that's been talked about a bit is
an upper-case eszett...)
>
> |U+04C0 <http://unicode.org/cldr/utility/character.jsp?a=04C0>| ( Ӏ )
> CYRILLIC LETTER PALOCHKA
> |U+10A0 <http://unicode.org/cldr/utility/character.jsp?a=10A0>| ( Ⴀ )
> GEORGIAN CAPITAL LETTER AN
> …{36}…|U+10C5 <http://unicode.org/cldr/utility/character.jsp?a=10C5>|
> ( Ⴥ ) GEORGIAN CAPITAL LETTER HOE
> |U+2132 <http://unicode.org/cldr/utility/character.jsp?a=2132>| ( Ⅎ )
> TURNED CAPITAL F
> |U+2183 <http://unicode.org/cldr/utility/character.jsp?a=2183>| ( Ↄ )
> ROMAN NUMERAL REVERSED ONE HUNDRED
>
>
>       2.3 IDNA Preprocessing Exceptions
>
> For compatibility with IDNA2003, include the following mappings. The
> notation [:xxx:] means a Unicode property value. A mapping is
> expressed as X => Y, where X is a single code point, and Y is a
> sequence of zero or more other code points.
>
> *2.3.1. Ignore (map to an empty sequence) the following characters
> *
I would call that "remove", not "ignore". The Bidi algorithm uses the
term "ignore" to mean "act as if the characters were not present, but
still retain them in the string", so using "ignore" in this fashion is a
bit confusing to me.
> These are specific mappings as part of IDNA2003.
>
> |U+00AD <http://unicode.org/cldr/utility/character.jsp?a=00AD>| ( )
> SOFT HYPHEN
> |U+034F <http://unicode.org/cldr/utility/character.jsp?a=034F>| ( )
> COMBINING GRAPHEME JOINER
> |U+1806 <http://unicode.org/cldr/utility/character.jsp?a=1806>| ( ᠆ )
> MONGOLIAN TODO SOFT HYPHEN
> |U+200B <http://unicode.org/cldr/utility/character.jsp?a=200B>| ( )
> ZERO WIDTH SPACE
> |U+2060 <http://unicode.org/cldr/utility/character.jsp?a=2060>| ( )
> WORD JOINER
> |U+FEFF <http://unicode.org/cldr/utility/character.jsp?a=FEFF>| ( )
> ZERO WIDTH NO-BREAK SPACE
> and Variation Selectors
>
> In UnicodeSet notation: [\u034F\u200B-\u200D\u2060\uFEFF\u00AD
> [:variation_selector:]]
>
> Note: the following characters were ignored in IDNA2003. They are
> allowed in IDNAbis in limited contexts and otherwise ignored.
>
> |U+200C <http://unicode.org/cldr/utility/character.jsp?a=200C>| ( )
> ZERO WIDTH NON-JOINER
> |U+200D <http://unicode.org/cldr/utility/character.jsp?a=200D>| ( )
> ZERO WIDTH JOINER
>
> In UnicodeSet notation: [\u200C \u200D]
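Taking the UnicodeSet expressions above as written, the removal could
be sketched like this. The Variation_Selector ranges are hard-coded
from my memory of the property and may not match every Unicode version,
and the sketch removes U+200C and U+200D unconditionally (the IDNA2003
behaviour), ignoring the limited contexts where IDNAbis allows them.

```python
# Code points from the UnicodeSet notation in the draft.
REMOVED = {0x034F, 0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF, 0x00AD}

# Assumed Variation_Selector ranges (approximation, see lead-in).
VARIATION_SELECTOR_RANGES = [
    (0x180B, 0x180D),     # Mongolian free variation selectors
    (0xFE00, 0xFE0F),     # VS1..VS16
    (0xE0100, 0xE01EF),   # VS17..VS256
]

def is_removed(cp: int) -> bool:
    """True if the code point maps to the empty sequence."""
    if cp in REMOVED:
        return True
    return any(lo <= cp <= hi for lo, hi in VARIATION_SELECTOR_RANGES)

def strip_removed(text: str) -> str:
    """Drop every code point that maps to the empty sequence."""
    return "".join(ch for ch in text if not is_removed(ord(ch)))
```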
>
> *2.3.2. Full Stops
>
> *These are specific mappings as part of IDNA2003, having to do with
> label separators.
>
> Map |U+3002 <http://unicode.org/cldr/utility/character.jsp?a=3002>|
> ( 。 ) IDEOGRAPHIC FULL STOP (and anything mapped to it by toNFKC) to
> |U+002E <http://unicode.org/cldr/utility/character.jsp?a=002E>| ( . )
> FULL STOP. That is:
>
> |U+3002 <http://unicode.org/cldr/utility/character.jsp?a=3002>| ( 。 )
> IDEOGRAPHIC FULL STOP
> => |U+002E <http://unicode.org/cldr/utility/character.jsp?a=002E>| ( .
> ) FULL STOP
>
> |U+FF61 <http://unicode.org/cldr/utility/character.jsp?a=FF61>| ( 。 )
> HALFWIDTH IDEOGRAPHIC FULL STOP
> => |U+002E <http://unicode.org/cldr/utility/character.jsp?a=002E>| ( .
> ) FULL STOP
Is this the beginning of the list, or do you intend this to be the full
list?
I think we've discussed before things like ETHIOPIC FULL STOP (U+1362,
looks like 4 dots together) and ARMENIAN FULL STOP (U+0589, looks like a
colon); I'm fine with saying "we are only mapping those that map to the
full stop, not those that function like full stops, and neither are we
guaranteeing that we map things that look like full stops but aren't" -
but as long as we're doing any mapping for label separators, I think we
need to document the logic behind the design choice.
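To make the "only those that map to the full stop" rule checkable, here
is a sketch that applies the draft's wording (U+3002 and anything toNFKC
maps to it) to a few candidates. The function name is hypothetical. A
character like U+FF0E FULLWIDTH FULL STOP is not covered by this rule
but reaches U+002E through ordinary NFKC.

```python
import unicodedata

IDEOGRAPHIC_FULL_STOP = "\u3002"

def maps_to_label_separator(cp: int) -> bool:
    """True if NFKC maps the code point to U+3002 (including U+3002 itself)."""
    return unicodedata.normalize("NFKC", chr(cp)) == IDEOGRAPHIC_FULL_STOP

candidates = {
    0x3002: "IDEOGRAPHIC FULL STOP",
    0xFF61: "HALFWIDTH IDEOGRAPHIC FULL STOP",
    0x0589: "ARMENIAN FULL STOP",
    0x1362: "ETHIOPIC FULL STOP",
}
mapped = {cp for cp in candidates if maps_to_label_separator(cp)}
```

With Python's Unicode data, only the two ideographic full stops end up
in the mapped set; the Armenian and Ethiopic ones are left alone.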
>
> *2.3.3. Retain Corrigendum #4: Five Unihan Canonical Mapping Errors
> <http://www.unicode.org/versions/corrigendum4.html>
> *
>
>
> These are characters whose normalizations changed after Unicode 3.2
> (all of them were in Unicode 4.0.0). While the set of characters that
> are normalized to different values has been stable in Unicode, the
> results have not been. We anticipate that as of Unicode 5.1,
> normalization will be completely stabilized, so these would be the
> first /and /last such characters.
>
> |U+2F868| ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F868
> => |U+2136A| ( ? ) CJK UNIFIED IDEOGRAPH-2136A
>
> |U+2F874| ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F874
> => |U+5F33| ( ? ) CJK UNIFIED IDEOGRAPH-5F33
>
> |U+2F91F| ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F91F
> => |U+43AB| ( ? ) CJK UNIFIED IDEOGRAPH-43AB
>
> |U+2F95F| ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F95F
> => |U+7AAE| ( ? ) CJK UNIFIED IDEOGRAPH-7AAE
>
> |U+2F9BF| ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F9BF
> => |U+4D57| ( ? ) CJK UNIFIED IDEOGRAPH-4D57
>


