Minimal IDNAbis requirements
Erik van der Poel
erikv at google.com
Tue Jan 1 21:53:21 CET 2008
I read through your idnabis issues draft and I have a couple of
comments/corrections. The indented sections are from the draft and the
following unindented sections are my comments:
   Any rules or conventions that apply to DNS labels in general, such as
   rules about lengths of strings, apply to whichever of the U-label or
   A-label would be more restrictive.
For U-labels, string lengths are numbers of codepoints, I suppose. I
wonder whether it is necessary to state that explicitly, i.e., as
opposed to the number of bytes in the UTF-8 encoding of the U-label,
the number of code units in the UTF-16 encoding, etc.
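To make the distinction concrete, here is a minimal Python sketch that
measures one label four ways. The sample label "bücher" and the helper
name label_lengths are illustrative assumptions, not taken from the
draft. The A-label is built by prepending "xn--" to the output of
Python's built-in punycode codec, which skips any mapping or validity
checking.

```python
def label_lengths(u_label: str) -> dict:
    """Return several candidate 'lengths' for a single U-label."""
    # Assumption: the A-label is simply "xn--" + Punycode, with no
    # normalization or validity checks applied beforehand.
    a_label = "xn--" + u_label.encode("punycode").decode("ascii")
    return {
        "codepoints": len(u_label),
        "utf8_bytes": len(u_label.encode("utf-8")),
        # UTF-16 code units: two bytes each, no byte-order mark.
        "utf16_units": len(u_label.encode("utf-16-le")) // 2,
        "a_label_octets": len(a_label),
    }


lengths = label_lengths("b\u00fccher")  # "bücher"
for name, value in lengths.items():
    print(name, value)
```

The four numbers differ even for this short label, so a length rule
that does not name its unit is ambiguous. The 63-octet DNS label limit
applies to the A-label form, which is usually the longest of the four.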
   Strings that do not conform to the rules for one of
   these three categories and, in particular, strings that contain "-"
   in the third or fourth character position but are
      o not A-labels or
      o that cannot be processed as U-labels or A-labels as described in
   are invalid as labels in domain names that identify Internet hosts or
The hyphens in the 3rd and 4th positions refer to the old prefixes
that were used before xn--, right? Since this is the rationale
document, it would be nice if that were made explicit.
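As a rough illustration of the check being discussed, the following
Python sketch separates labels that carry "--" in the third and fourth
positions from those that also carry the xn-- prefix. The function
names and the sample labels are hypothetical; "bq--" is used only as an
assumed example of a pre-xn-- prefix (from the earlier RACE proposal).

```python
def has_hyphens_in_3_and_4(label: str) -> bool:
    """True if the 3rd and 4th characters of the label are both '-'."""
    return len(label) >= 4 and label[2:4] == "--"


def has_xn_prefix(label: str) -> bool:
    """True if the label starts with the ACE prefix, ignoring case."""
    return label[:4].lower() == "xn--"


# Sample labels are made up for illustration only.
samples = ["xn--bcher-kva", "bq--abc", "example", "a--b"]
results = {
    s: (has_hyphens_in_3_and_4(s), has_xn_prefix(s)) for s in samples
}
for label, flags in results.items():
    print(label, flags)
```

A label for which the first flag is true and the second is false falls
into the group the quoted text rules out. Whether a label with the
xn-- prefix is actually a valid A-label needs further checks that this
sketch does not attempt.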
   An "internationalized domain name" (IDN) is a domain name that may
   contain one or more A-labels or U-labels, as appropriate, instead of
"instead of LDH-labels" sounds like you're excluding LDH-labels. How
about "a domain name that contains one or more A-labels, U-labels or
   Because of this condition, which requires evaluation by individual
   script communities of the characters suitable for use in IDNs (not
   just, e.g., the general stability of the scripts in which those
   characters are embedded) it is not feasible to define the boundary
   point between this category and the next one by general properties of
   the characters, such as the Unicode property lists.
This is referring to ALWAYS and MAYBE, but it seems like Patrik was
able to come up with a set of rules based on existing Unicode
properties that ended up placing some characters in the ALWAYS
category and some in MAYBE.
   6.3. The Ligature and Digraph Problem
Maybe this should be called the "variant problem" or something, since
the Scandinavian o-diaeresis and o-slash issue is neither a ligature
nor a digraph issue.
   *  Characters that are unassigned in the version of Unicode being
      used by the registry or application are not permitted, even on
      resolution (lookup). This is because, unlike the conditions
      contemplated in IDNA2003 (except for right-to-left text), we
      now understand that tests involving the context of characters
      (e.g., some characters being permitted only adjacent to other
      ones of specific types) and integrity tests on complete labels
      will be needed. Unassigned code points cannot be permitted
      because one cannot determine the contextual rules that
      particular code points will require before characters are
      assigned to them and the properties of those characters fully
Also, NFC will produce different results if an unassigned codepoint
becomes assigned and then has a combining class that determines its
placement relative to other combining marks. By the way, the protocol
draft only mentions NFC under resolution, not under registration.
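The NFC point can be sketched in Python with the standard unicodedata
module. The sketch assumes that U+0378 is still unassigned in the
Unicode data shipped with the interpreter in use; the variable names
are illustrative. An unassigned code point has combining class 0, so
it currently blocks the reordering of the marks on either side of it.

```python
import unicodedata

# Assumption: U+0378 is unassigned in this interpreter's Unicode data.
UNASSIGNED = "\u0378"


def nfc(text: str) -> str:
    """Normalize to NFC using the interpreter's Unicode data."""
    return unicodedata.normalize("NFC", text)


# Acute (class 230) and dot below (class 220) on "a", in both orders.
# Canonical reordering makes the two orders equivalent.
same_without_gap = nfc("a\u0301\u0323") == nfc("a\u0323\u0301")

# The same marks with the unassigned code point between them. Its
# combining class of 0 stops the marks from being reordered across it.
same_with_gap = (
    nfc("a\u0301" + UNASSIGNED + "\u0323")
    == nfc("a\u0323" + UNASSIGNED + "\u0301")
)

print(unicodedata.category(UNASSIGNED))   # general category
print(unicodedata.combining(UNASSIGNED))  # combining class
print(same_without_gap, same_with_gap)
```

If such a code point were later assigned as a combining mark with a
non-zero class, a newer implementation could reorder the sequence
while an older one would leave it alone, so the two would disagree on
the normalized form.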
   2.  Adjustments in Stringprep tables or IDNA actions, including
       normalization definitions, that do not affect characters that
       have already been invalid under IDNA2003.
Should the "do not affect" be "affect"? (I believe I pointed this out
on 12/31/06, when the word was "impact" instead of "affect".)
   11.2. IDNA Context Registry

   For characters that are defined in the permitted character as
   permitted character list

   When systems use local character sets other than ASCII and Unicode,
   this specification leaves the the problem of transcoding between the

   Some specific suggestion
   about identification and handling of confusable characters appear in
   a Unicode Consortium publication [???]

   A version of this document, is available in HTML format at
I believe the .txt extension should be .html: