idna-issues-05 (was: Re: Minimal IDNAbis requirements)

Wed Jan 2 23:12:03 CET 2008

On Jan 2, 2008 9:51 AM, John C Klensin <klensin at jck.com> wrote:
> --On Tuesday, 01 January, 2008 12:53 -0800 Erik van der Poel <erikv at google.com> wrote:
> > For U-labels, string lengths are numbers of codepoints, I
> > suppose. I wonder if it is necessary to explicitly state that.
> > I.e. as opposed to the number of bytes in the UTF-8 encoding
> > of the U-label, or the UTF-16 encoding, etc.
>
> No, actually, given the constraints of the other protocols
> involved, the limit on a U-label taken alone would be related to
>
>    max(number-of-codepoints, octet-length-of-utf8-form)
>
> which, in practice, would always be the utf-8 string length.
> And that number would need to be less than 63 octets per label
> and 255 characters per FQDN.  The problem is that, if we don't
> want users to see the A-labels any more often than necessary, we
> need to recognize the restrictions of application protocols and
> the presentation forms based on them.  In practice, as we are
> discovering with email and the EAI work, it is actually easier
> to extend the character set and coding of strings than it is to
> change length restrictions.
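
To make the comparison above concrete, here is a minimal Python sketch.
It assumes the limit is applied to max(code points, UTF-8 octets) and
that a label may be at most 63 octets (the usual DNS reading). The
function names are invented for illustration and do not come from any
draft.

```python
def label_lengths(u_label: str) -> tuple[int, int]:
    """Return (code point count, UTF-8 octet count) for a U-label."""
    return len(u_label), len(u_label.encode("utf-8"))


def fits_default_label_limit(u_label: str, max_octets: int = 63) -> bool:
    """Check a U-label against an assumed 63-octet default limit.

    The limit is applied to the larger of the two counts. UTF-8 never
    uses fewer octets than code points, so in practice this is the
    UTF-8 length.
    """
    codepoints, octets = label_lengths(u_label)
    return max(codepoints, octets) <= max_octets


# "bücher": 6 code points, but 7 octets in UTF-8 (the "ü" takes two).
print(label_lengths("b\u00fccher"))
```

An app with different constraints could pass its own max_octets, which
is roughly what treating 63/255/UTF-8 as a default would allow.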

I see. However, other apps may not have the same issues. For example,
IRIs don't seem to have any of the 63, 255 or UTF-8 restrictions (IRIs
may be in a non-Unicode encoding).

Perhaps the 63/255/UTF-8 rules are appropriate as a *default* for all
apps, and each app can then choose its own particular rules if it
needs to deviate from this default.

Anyway, I don't feel so strongly about this -- I look forward to the
new wording in your next draft.

> >    An "internationalized domain name" (IDN) is a domain name
> >    that may contain one or more A-labels or U-labels, as
> >    appropriate, instead of LDH labels.
> >
> > "instead of LDH-labels" sounds like you're excluding
> > LDH-labels. How about "a domain name that contains one or more
> > A-labels, U-labels or LDH-labels"?
>
> That is one of the things that got us into trouble with IDNA,
> which attempted to define "IDNs" such that all-ASCII labels
> conforming to the hostname rules were simply a subset of the set
> of IDNs.  The new drafts make the three categories disjoint in
> the hope of eliminating the confusion so, yes, I'm excluding
> LDH-labels.  Suggestions about better ways to say this would be
> welcome.

At the moment, most IDNs are mixtures of A-labels and LDH-labels,
since there are very few A-label TLDs and those TLDs are restricted.
Also, the next sentence in the draft says "This implies that every
conventional domain name is an IDN (which implies that it is possible
for a name to be an IDN without it containing any non-ASCII
characters)."

How about "An IDN is a domain name with at least one A-label or at
least one U-label."?
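
As a rough sketch of that definition in Python, with the three label
categories kept disjoint: the classification here is deliberately
simplified and is my assumption, not text from the drafts. It treats
any label with a non-ASCII character as a U-label and any ASCII label
starting with "xn--" as an A-label, without checking that the label is
actually valid.

```python
ACE_PREFIX = "xn--"


def classify_label(label: str) -> str:
    """Put a label into one of three disjoint categories (simplified)."""
    if not label.isascii():
        return "U-label"
    if label.lower().startswith(ACE_PREFIX):
        return "A-label"
    return "LDH-label"


def is_idn(domain: str) -> bool:
    """True if the name has at least one A-label or at least one U-label."""
    labels = [part for part in domain.split(".") if part]
    return any(classify_label(part) != "LDH-label" for part in labels)


print(is_idn("xn--bcher-kva.example"))  # mixture of A-label and LDH-label
print(is_idn("www.example.com"))        # LDH-labels only
```

Under this wording, a name made up only of LDH-labels is not an IDN,
while the common mixture of A-labels and LDH-labels is.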

> > This is referring to ALWAYS and MAYBE, but it seems like
> > Patrik was able to come up with a set of rules based on
> > existing Unicode properties that ended up placing some
> > characters in the ALWAYS category and some in MAYBE.
>
> Not really.  What he did was to enumerate subsets of certain
> scripts that go to ALWAYS now because we understand the issues
> well enough (or think we do).   That is not, as others have
> suggested, because of the limited script knowledge of the design
> team.  Instead, it is because we have been --often in
> conjunction with the relevant language communities-- able to
> identify specific open issues with the other scripts we have
> examined, in many cases with input from experts on the relevant
> writing systems.  Remembering that we started with a script and
> block-based approach, that list includes most other scripts used
> by major contemporary languages ("major" and "contemporary" are
> not a means of excluding anything, either short or long-term,
> but just laid the foundation for identifying sufficient cases
> from which we could apply induction).
>
> To give a few examples, we know that the subset of CJK
> characters that have been identified _by the language
> communities_ as appropriate for DNS labels as part of the
> table-specification work based on the JET model (see RFC 3743)
> are safe enough for "ALWAYS" because the language communities
> have told us so.  Other Han-derived characters go to MAYBE and
> presumably stay there until and unless the JET-based tables are
> expanded.   Once people really sit down and look at the
> application and naming issues, rather than thinking in terms of
> how things are coded, we expect controversy about whether
> position-dependent characters (e.g., final-form ones) match the
> corresponding base characters or not, a question whose answers
> may be script-dependent.  The answers that might be obvious
> based on standard lexicographic, orthographic, or typographic
> conventions may not be helpful for DNS use given the history of
> people forming domain names by cramming words together without
> breaks or other punctuation.  Because of issues similar to
> these, until we can identify language communities who can decide
> and take responsibility for the decisions, either the whole
> script or the pairs of base characters and position-dependent
> ones should be kept in MAYBE (and keeping the whole script there
> temporarily prevents a number of these complexities).
>
> As a final example (but by no means the only other one), Hangul
> has some very specific rules about character sequencing in
> string formation.  You will recall that there was some fairly
> extended conversation while IDNA2003 was being developed about
> whether the standard should enforce the rules.   The conclusion
> was "no", but that was, in considerable measure, because there
> was no mechanism for doing so.  Because the new model permits
> imposition of context-dependent restrictions, it would now be
> possible to write rules that would prohibit ill-formed Hangul
> strings in the protocol.  My personal guess is that would
> (still) be a bad idea and that such restrictions are best left
> to registry action.    But I'm not competent to make a decision
> on that subject and we are certainly not going to tell the user
> community for that particular writing system how their system
> should be used or restricted.   We need to hear from them and,
> until we do, the script should remain in "MAYBE".
>
> Again, better ways to explain this would be appreciated as would
> advice as to whether the examples above belong in the document.

That's a difficult decision. On the one hand, it's great to have
examples in the rationale document, but on the other hand, we don't
want any language community to seize upon any of the examples and try
to delay the table RFC until their characters have been moved out of
the MAYBE category. I would lean towards providing more examples.

> >    6.3.  The Ligature and Digraph Problem
> >
> > Maybe this should be called the "variant problem" or
> > something, since the Scandinavian o-diaeresis and o-slash
> > issue is neither a ligature nor digraph issue.
>
> Yes, although the notion of "combining character" tends to turn
> it into an issue almost identical to the digraph one, the
> terminology is still poor.  Unfortunately, in IDN parlance,
> "variant" has become associated with the set of issues and
> approaches associated with the JET work.  While one could use
> the JET approach at registration time to address the o-diaeresis
> and o-slash issues, that has more to do with a possible solution
> than a description of the problem.  Other suggestions for
> terminology would be appreciated.

How about "The Spelling Variation Problem"?

> >        *  Characters that are unassigned in the version of
> >           Unicode being used by the registry or application
> >           are not permitted, even on resolution (lookup).
> >           This is because, unlike the conditions contemplated
> >           in IDNA2003 (except for right-to-left text), we now
> >           understand that tests involving the context of
> >           characters (e.g., some characters being permitted
> >           only adjacent to other ones of specific types) and
> >           integrity tests on complete labels will be needed.
> >           Unassigned code points cannot be permitted because
> >           one cannot determine the contextual rules that
> >           particular code points will require before
> >           characters are assigned to them and the properties
> >           of those characters fully understood.
> >
> > Also, NFC will produce different results, if an unassigned
> > codepoint becomes assigned and then has a combining class that
> > determines its placement relative to other combining marks.
> > The protocol draft only mentions NFC under resolution, not
> > under registration, by the way.
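
The combining-class point can be illustrated with a short Python
sketch using the standard unicodedata module. The helper names are
made up for this example, and treating general category "Cn" as
"unassigned" is a simplification (it also covers noncharacters). The
result depends on the Unicode version bundled with the Python build,
which is the versioning concern in a nutshell.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220


def is_unassigned(ch: str) -> bool:
    """True if this Python's Unicode database has no assignment for ch."""
    return unicodedata.category(ch) == "Cn"


def nfc_label(label: str) -> str:
    """Reject unassigned code points, then apply NFC."""
    if any(is_unassigned(ch) for ch in label):
        raise ValueError("label contains an unassigned code point")
    return unicodedata.normalize("NFC", label)


# Marks with different nonzero combining classes are put into a
# canonical order, so both input orders normalize to the same string.
print(nfc_label("a" + ACUTE + DOT_BELOW) == nfc_label("a" + DOT_BELOW + ACUTE))
```

An unassigned code point has combining class 0 in the current
database, so it is never reordered. If it is later assigned a nonzero
class, the same input string could normalize differently under the
newer Unicode version.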
>
> Good points.  Text has been, or will be, adjusted.  I note,
> however, that there have been some "will never happen"
> discussions with the Unicode Consortium about assignment of new
> characters to codepoints that then normalize to other things.

The Unicode Consortium only appears to make guarantees about assigned
codepoints in the normalization spec:

http://www.unicode.org/reports/tr15/#Versioning

Will your new protocol draft add NFC to the registration section, or
remove NFC from the resolution section (or ...)?

Erik