idna-issues-05 (was: Re: Minimal IDNAbis requirements)

Wed Jan 2 18:51:07 CET 2008

--On Tuesday, 01 January, 2008 12:53 -0800 Erik van der Poel
<erikv at google.com> wrote:

> John,
> 
> I read through your idnabis issues draft and I have a couple of
> comments/corrections. The indented sections are from the draft
> and the following unindented sections are my comments:

Erik,

Thanks for the many constructive comments.  Responses to each
one are below.

>    Any rules or conventions that apply to DNS labels in
> general, such as    rules about lengths of strings, apply to
> whichever of the U-label or    A-label would be more
> restrictive.
> 
> For U-labels, string lengths are numbers of codepoints, I
> suppose. I wonder if it is necessary to explicitly state that.
> I.e. as opposed to the number of bytes in the UTF-8 encoding
> of the U-label, or the UTF-16 encoding, etc.

No, actually, given the constraints of the other protocols
involved, the limit on a U-label taken alone would be related to 

   max(number-of-codepoints, octet-length-of-utf8-form)

which, in practice, would always be the utf-8 string length.
And that number would need to be at most 63 octets per label
and 255 octets per FQDN.  The problem is that, if we don't
want users to see the A-labels any more often than necessary, we
need to recognize the restrictions of application protocols and
the presentation forms based on them.  In practice, as we are
discovering with email and the EAI work, it is actually easier
to extend the character set and coding of strings than it is to
change length restrictions.

I've incorporated new text into the working draft to clarify
this.
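
As a rough illustration of the length point above, the following
Python sketch compares the candidate lengths for a single label.
The names `label_lengths`, `fits_in_label`, and `MAX_LABEL_OCTETS`
are illustrative assumptions, not anything from the drafts, and
Python's built-in `punycode` codec is used only to approximate the
A-label form; it applies none of the IDNA validity rules.

```python
MAX_LABEL_OCTETS = 63  # per-label DNS limit (RFC 1035)


def label_lengths(u_label: str) -> dict:
    """Return the lengths relevant to one U-label (illustrative helper)."""
    utf8_octets = len(u_label.encode("utf-8"))
    # Approximate the A-label as "xn--" + Punycode; no IDNA validation here.
    a_label = "xn--" + u_label.encode("punycode").decode("ascii")
    return {
        "codepoints": len(u_label),
        "utf8_octets": utf8_octets,
        "a_label_octets": len(a_label),
    }


def fits_in_label(u_label: str) -> bool:
    """True if both the UTF-8 form and the A-label form fit in 63 octets."""
    n = label_lengths(u_label)
    # Apply whichever of the two forms is more restrictive.
    return max(n["utf8_octets"], n["a_label_octets"]) <= MAX_LABEL_OCTETS
```

The code point count is reported only for comparison: it can never
exceed the UTF-8 octet count, so it never decides the outcome.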

>    Strings that do not conform to the rules for one of
>    these three categories and, in particular, strings that
> contain "-"    in the third or fourth character position but
> are
> 
>    o  not A-labels or
> 
>    o  that cannot be processed as U-labels or A-labels as
> described in       these specifications,
> 
>    are invalid as labels in domain names that identify
> Internet hosts or    similar resources.
> 
> The hyphen in 3rd and 4th positions is referring to the old
> prefixes that were used before xn--, right? Since this is the
> rationale document, it would be nice if that were made
> explicit.

Both past and future.  One might, in principle, make future
changes that would require a prefix change or equivalent.  This
prohibition keeps that possibility open.  In addition,
encouraging a clear prohibition of such labels eliminates
another possible attack vector on the IDNA approach.

I've incorporated new text into the working draft to clarify
this.
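
A minimal sketch of that prohibition, assuming the only prefix
currently in use is "xn--" and using hypothetical function names,
might look like this in Python.  It tests the hyphen positions
only; a label that passes would still have to be checked as an
A-label.

```python
def has_reserved_hyphens(label: str) -> bool:
    """True if the label has "--" in the third and fourth positions."""
    return label[2:4] == "--"


def is_rejected_by_hyphen_rule(label: str) -> bool:
    """Reject "??--" labels unless they carry the "xn--" prefix.

    This covers both old pre-IDNA prefixes and any prefix that
    might be introduced later.
    """
    return has_reserved_hyphens(label) and not label.lower().startswith("xn--")
```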

>    An "internationalized domain name" (IDN) is a domain name
> that may    contain one or more A-labels or U-labels, as
> appropriate, instead of    LDH labels.
> 
> "instead of LDH-labels" sounds like you're excluding
> LDH-labels. How about "a domain name that contains one or more
> A-labels, U-labels or LDH-labels"?

That is one of the things that got us into trouble with IDNA,
which attempted to define "IDNs" such that all-ASCII labels
conforming to the hostname rules were simply a subset of the set
of IDNs.  The new drafts make the three categories disjoint in
the hope of eliminating the confusion so, yes, I'm excluding
LDH-labels.  Suggestions about better ways to say this would be
welcome.
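
One way to picture the disjoint categories is the following Python
sketch.  The function names and the "candidate" results are my own
assumptions for illustration: a real implementation would have to
apply the full U-label and A-label tests from the protocol draft,
which this code does not attempt.

```python
import re

# Letters, digits, hyphen; no leading or trailing hyphen.
LDH_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?", re.IGNORECASE)


def classify_label(label: str) -> str:
    """Place a label in exactly one category (rough sketch)."""
    if not label.isascii():
        return "U-label candidate"   # still needs full IDNA validation
    if not LDH_RE.fullmatch(label):
        return "invalid"
    if label.lower().startswith("xn--"):
        return "A-label candidate"   # still needs Punycode round-trip checks
    if label[2:4] == "--":
        return "invalid"             # reserved "??--" form
    return "LDH-label"


def is_idn(name: str) -> bool:
    """True if at least one label is an A-label or U-label candidate."""
    return any(classify_label(l).endswith("candidate") for l in name.split("."))
```

With this reading, "www.example.com" is an ordinary domain name
rather than an IDN, because every label falls in the LDH category.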

>    Because of this condition, which requires evaluation by
> individual    script communities of the characters suitable
> for use in IDNs (not    just, e.g., the general stability of
> the scripts in which those    characters are embedded) it is
> not feasible to define the boundary    point between this
> category and the next one by general properties of    the
> characters, such as the Unicode property lists.
> 
> This is referring to ALWAYS and MAYBE, but it seems like
> Patrik was able to come up with a set of rules based on
> existing Unicode properties that ended up placing some
> characters in the ALWAYS category and some in MAYBE.

Not really.  What he did was to enumerate subsets of certain
scripts that go to ALWAYS now because we understand the issues
well enough (or think we do).   That is not, as others have
suggested, because of the limited script knowledge of the design
team.  Instead, it is because we have been --often in
conjunction with the relevant language communities-- able to
identify specific open issues with the other scripts we have
examined, in many cases with input from experts on the relevant
writing systems.  Remembering that we started with a script and
block-based approach, that list includes most other scripts used
by major contemporary languages ("major" and "contemporary" are
not a means of excluding anything, either short or long-term,
but just laid the foundation for identifying sufficient cases
from which we could apply induction).

To give a few examples, we know that the subset of CJK
characters that have been identified _by the language
communities_ as appropriate for DNS labels as part of the
table-specification work based on the JET model (see RFC 3743)
are safe enough for "ALWAYS" because the language communities
have told us so.  Other Han-derived characters go to MAYBE and
presumably stay there until and unless the JET-based tables are
expanded.   Once people really sit down and look at the
application and naming issues, rather than thinking in terms of
how things are coded, we expect controversy about whether
position-dependent characters (e.g., final-form ones) match the
corresponding base characters or not, a question whose answers
may be script-dependent.  The answers that might be obvious
based on standard lexicographic, orthographic, or typographic
conventions may not be helpful for DNS use given the history of
people forming domain names by cramming words together without
breaks or other punctuation.  Because of issues similar to
these, until we can identify language communities who can decide
and take responsibility for the decisions, either the whole
script or the pairs of base characters and position-dependent
ones should be kept in MAYBE (and keeping the whole script there
temporarily prevents a number of these complexities).
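
To make the matching question concrete, here is a small Python
sketch using Greek final sigma.  That choice of example is mine,
not something taken from the draft, and the function names are
hypothetical.  Unicode case folding maps the final form to the
base letter, so the two comparisons below give different answers
for the same pair of strings.

```python
def same_exact(a: str, b: str) -> bool:
    """Compare code point by code point; final forms stay distinct."""
    return a == b


def same_folded(a: str, b: str) -> bool:
    """Compare after Unicode case folding, which maps final sigma
    (U+03C2) to the base letter sigma (U+03C3)."""
    return a.casefold() == b.casefold()


with_final_form = "λογος"   # ends in final sigma, U+03C2
with_base_form = "λογοσ"    # ends in base sigma, U+03C3
```

Which of the two behaviors is right for labels is exactly the kind
of decision that belongs with the relevant language community.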

As a final example (but by no means the only other one), Hangul
has some very specific rules about character sequencing in
string formation.  You will recall that there was some fairly
extended conversation while IDNA2003 was being developed about
whether the standard should enforce the rules.   The conclusion
was "no", but that was, in considerable measure, because there
was no mechanism for doing so.  Because the new model permits
imposition of context-dependent restrictions, it would now be
possible to write rules that would prohibit ill-formed Hangul
strings in the protocol.  My personal guess is that this would
(still) be a bad idea and that such restrictions are best left
to registry action.    But I'm not competent to make a decision
on that subject and we are certainly not going to tell the user
community for that particular writing system how their system
should be used or restricted.   We need to hear from them and,
until we do, the script should remain in "MAYBE".
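
For readers who want a feel for what such a context-dependent
rule could look like, here is a deliberately simplified Python
sketch.  It assumes that "well-formed" means each syllable is one
or more leading consonants, then one or more vowels, then optional
trailing consonants, and it covers only the U+1100..U+11FF jamo
block.  Both assumptions are mine; the actual rule, if any, would
have to come from the Hangul user community.

```python
import re
import unicodedata

# Conjoining jamo ranges in the Hangul Jamo block (assumed scope).
_L = "\u1100-\u115f"   # leading consonants (choseong)
_V = "\u1160-\u11a7"   # vowels (jungseong)
_T = "\u11a8-\u11ff"   # trailing consonants (jongseong)
_SYLLABLES = re.compile(f"(?:[{_L}]+[{_V}]+[{_T}]*)+")


def is_well_formed_hangul(text: str) -> bool:
    """True if text is a sequence of L+ V+ T* syllables (sketch only)."""
    # NFD splits precomposed syllables into conjoining jamo, so both
    # precomposed and decomposed input go through the same test.
    decomposed = unicodedata.normalize("NFD", text)
    return _SYLLABLES.fullmatch(decomposed) is not None
```

A vowel with no leading consonant, or a string that starts with a
trailing consonant, fails this test, while ordinary precomposed
text passes.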

Again, better ways to explain this would be appreciated as would
advice as to whether the examples above belong in the document.

>    6.3.  The Ligature and Digraph Problem
> 
> Maybe this should be called the "variant problem" or
> something, since the Scandinavian o-diaeresis and o-slash
> issue is neither a ligature nor digraph issue.

Yes, although the notion of "combining character" tends to turn
it into an issue almost identical to the digraph one, the
terminology is still poor.  Unfortunately, in IDN parlance,
"variant" has become associated with the set of issues and
approaches associated with the JET work.  While one could use
the JET approach at registration time to address the o-diaeresis
and o-slash issues, that has more to do with a possible solution
than a description of the problem.  Other suggestions for
terminology would be appreciated.

> ------------------
> 
>        *  Characters that are unassigned in the version of
> Unicode being           used by the registry or application
> are not permitted, even on           resolution (lookup).
> This is because, unlike the conditions           contemplated
> in IDNA2003 (except for right-to-left text), we           now
> understand that tests involving the context of characters
> (e.g., some characters being permitted only adjacent to other
>           ones of specific types) and integrity tests on
> complete labels           will be needed.  Unassigned code
> points cannot be permitted           because one cannot
> determine the contextual rules that           particular code
> points will require before characters are           assigned
> to them and the properties of those characters fully
> understood.
> 
> Also, NFC will produce different results, if an unassigned
> codepoint becomes assigned and then has a combining class that
> determines its placement relative to other combining marks.
> The protocol draft only mentions NFC under resolution, not
> under registration, by the way.

Good points.  Text has been, or will be, adjusted.  I note,
however, that there have been some "will never happen"
discussions with the Unicode Consortium about assignment of new
characters to codepoints that then normalize to other things.
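
The dependence of NFC on combining classes can be seen with
characters that are already assigned.  The Python sketch below
(function names are illustrative, and the results depend on the
Unicode version bundled with the Python in use) shows two
combining marks being reordered by class, plus a simple test for
unassigned code points.

```python
import unicodedata


def has_unassigned(label: str) -> bool:
    """True if any code point has general category Cn (unassigned,
    which also covers noncharacters) in this Python's Unicode data."""
    return any(unicodedata.category(ch) == "Cn" for ch in label)


def nfc(label: str) -> str:
    """Apply Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", label)


# "a" + COMBINING ACUTE (class 230) + COMBINING DOT BELOW (class 220).
# Canonical ordering moves the lower class first, then "a" and the
# dot below compose into U+1EA1, leaving the acute as a trailing mark.
typed = "a\u0301\u0323"
normalized = nfc(typed)   # "\u1ea1\u0301"
```

An unassigned code point is treated as having combining class 0,
so it is never reordered.  If it were later assigned with a
non-zero class, the same string could normalize differently, which
is the problem Erik describes.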

>    2.  Adjustments in Stringprep tables or IDNA actions,
> including        normalization definitions, that do not affect
> characters that        have already been invalid under
> IDNA2003.
> 
> Should the "do not affect" be "affect"? (I believe I pointed
> this out on 12/31/06, when the word was "impact" instead of
> "affect".)

Yep.  Sorry this got lost due to the other change.  Fixed.

>    11.2.  IDNA Context Registry
> 
>    For characters that are defined in the permitted character
> as
> 
> permitted character list

Yes.  Fixed.

>    When systems use local character sets other than ASCII and
> Unicode,    this specification leaves the the problem of
> transcoding between the
> 
> the the

Sigh.  Fixed.

>    Some specific suggestion
>    about identification and handling of confusable characters
> appear in    a Unicode Consortium publication [???]
> 
> suggestions (plural)
> 
> http://www.unicode.org/reports/tr36/

Fixed (reference already caught, but thanks).

>               A version of this document, is available in HTML
> format at
> http://stupid.domain.name/idnabis/draft-faltstrom-idnabis-tabl
> es-03.txt
> 
> I believe the .txt extension should be .html:
> 
> http://stupid.domain.name/idnabis/draft-faltstrom-idnabis-tabl
> es-03.html

Yes.  Stupid transcription error.  Thanks.

Again, many thanks.
   john