Minimal IDNAbis requirements

Erik van der Poel erikv at google.com
Tue Jan 1 21:53:21 CET 2008


John,

I read through your idnabis issues draft and I have a couple of
comments/corrections. The indented sections are from the draft and the
following unindented sections are my comments:

   Any rules or conventions that apply to DNS labels in general, such as
   rules about lengths of strings, apply to whichever of the U-label or
   A-label would be more restrictive.

For U-labels, I suppose string lengths are counted in code points. I
wonder whether that needs to be stated explicitly, i.e., as opposed to
the number of bytes in the UTF-8 encoding of the U-label, the number
of code units in its UTF-16 encoding, etc.
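
To illustrate why the unit matters, here is a minimal Python sketch that
measures one label in several ways. The function name label_lengths is
made up for this example, and the A-label is built by simply prepending
"xn--" to the raw Punycode output, which assumes the label contains at
least one non-ASCII character and skips any mapping or validity checks.

```python
def label_lengths(u_label: str) -> dict:
    """Measure a U-label in several units (illustrative helper only)."""
    # Assumption: the label has at least one non-ASCII character, so that
    # "xn--" + Punycode is a plausible A-label. No mapping or validation
    # is performed here.
    a_label = "xn--" + u_label.encode("punycode").decode("ascii")
    return {
        "code_points": len(u_label),
        "utf8_bytes": len(u_label.encode("utf-8")),
        "utf16_code_units": len(u_label.encode("utf-16-le")) // 2,
        "a_label_octets": len(a_label),
    }


# "b\u00fccher" is the word "bucher" with u-diaeresis in second position.
print(label_lengths("b\u00fccher"))
```

For this label the four counts differ (6 code points, 7 UTF-8 bytes,
6 UTF-16 code units, 13 octets as an A-label), so a length rule reads
differently depending on which one is meant.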

   Strings that do not conform to the rules for one of
   these three categories and, in particular, strings that contain "-"
   in the third or fourth character position but are

   o  not A-labels or

   o  that cannot be processed as U-labels or A-labels as described in
      these specifications,

   are invalid as labels in domain names that identify Internet hosts or
   similar resources.

The hyphens in the 3rd and 4th positions refer to the old prefixes
that were used before xn--, right? Since this is the rationale
document, it would be nice if that were made explicit.
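
As a rough sketch of the test I have in mind, assuming the rule is meant
as "hyphens in both the 3rd and 4th positions", something like the
following Python would separate the xn-- case from other prefixes of
that shape. The function names and the label "ab--example" are
hypothetical, chosen only for illustration.

```python
def has_reserved_hyphens(label: str) -> bool:
    """True if '-' occupies both the 3rd and 4th character positions."""
    return label[2:4] == "--"


def has_ace_prefix(label: str) -> bool:
    """True if the label starts with the xn-- prefix (case-insensitive)."""
    return label[:4].lower() == "xn--"


for label in ["xn--bcher-kva", "ab--example", "my-host"]:
    # A label with the reserved hyphens but without xn-- would fall in the
    # "invalid" bucket described in the quoted text.
    print(label, has_reserved_hyphens(label), has_ace_prefix(label))
```

This only checks the shape of the prefix; it does not decide whether an
xn-- label is actually a valid A-label.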

   An "internationalized domain name" (IDN) is a domain name that may
   contain one or more A-labels or U-labels, as appropriate, instead of
   LDH labels.

"instead of LDH labels" sounds like you're excluding LDH labels. How
about "a domain name that contains one or more A-labels, U-labels or
LDH labels"?

   Because of this condition, which requires evaluation by individual
   script communities of the characters suitable for use in IDNs (not
   just, e.g., the general stability of the scripts in which those
   characters are embedded) it is not feasible to define the boundary
   point between this category and the next one by general properties of
   the characters, such as the Unicode property lists.

This is referring to ALWAYS and MAYBE, but it seems like Patrik was
able to come up with a set of rules based on existing Unicode
properties that ended up placing some characters in the ALWAYS
category and some in MAYBE.

   6.3.  The Ligature and Digraph Problem

Maybe this should be called the "variant problem" or something, since
the Scandinavian o-diaeresis and o-slash issue is neither a ligature
nor digraph issue.

------------------

       *  Characters that are unassigned in the version of Unicode being
          used by the registry or application are not permitted, even on
          resolution (lookup).  This is because, unlike the conditions
          contemplated in IDNA2003 (except for right-to-left text), we
          now understand that tests involving the context of characters
          (e.g., some characters being permitted only adjacent to other
          ones of specific types) and integrity tests on complete labels
          will be needed.  Unassigned code points cannot be permitted
          because one cannot determine the contextual rules that
          particular code points will require before characters are
          assigned to them and the properties of those characters fully
          understood.

Also, NFC will produce different results if an unassigned code point
later becomes assigned with a non-zero combining class, since that
class determines its placement relative to other combining marks. By
the way, the protocol draft only mentions NFC under resolution, not
under registration.
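
To show the mechanism, here is a small Python sketch using the standard
unicodedata module. It only demonstrates how combining classes drive
reordering for two assigned marks; the helper names nfc and
is_unassigned are made up for this example, and what counts as
unassigned depends on the Unicode version shipped with the Python
interpreter in use.

```python
import unicodedata

DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230


def nfc(s: str) -> str:
    """Normalize to NFC using the interpreter's Unicode database."""
    return unicodedata.normalize("NFC", s)


def is_unassigned(ch: str) -> bool:
    """True if this Unicode version has no character at the code point."""
    return unicodedata.category(ch) == "Cn"


# "q" has no precomposed forms with these marks, so only reordering happens.
s1 = "q" + ACUTE + DOT_BELOW
s2 = "q" + DOT_BELOW + ACUTE

print(unicodedata.combining(DOT_BELOW), unicodedata.combining(ACUTE))
print(nfc(s1) == s2)  # marks are sorted by combining class: 220 before 230
print(nfc(s2) == s2)  # already in canonical order

# An unassigned code point is reported with combining class 0, so marks
# are never reordered across it. If it were later assigned a non-zero
# class, the same input could normalize differently.
print(is_unassigned("\u0378"), unicodedata.combining("\u0378"))
```

So a string that passes through NFC unchanged today could be reordered
by a later implementation that knows about the newly assigned character.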

   2.  Adjustments in Stringprep tables or IDNA actions, including
       normalization definitions, that do not affect characters that
       have already been invalid under IDNA2003.

Should the "do not affect" be "affect"? (I believe I pointed this out
on 12/31/06, when the word was "impact" instead of "affect".)

   11.2.  IDNA Context Registry

   For characters that are defined in the permitted character as

permitted character list

   When systems use local character sets other than ASCII and Unicode,
   this specification leaves the the problem of transcoding between the

the the

   Some specific suggestion
   about identification and handling of confusable characters appear in
   a Unicode Consortium publication [???]

suggestions (plural)

http://www.unicode.org/reports/tr36/

              A version of this document, is available in HTML format at
              http://stupid.domain.name/idnabis/draft-faltstrom-idnabis-tables-03.txt

I believe the .txt extension should be .html:

http://stupid.domain.name/idnabis/draft-faltstrom-idnabis-tables-03.html

Erik

