My comments on IDNA Definitions (-10.txt)

Mon Aug 24 12:17:53 CEST 2009

[I'm not at all happy about a WG last call of only two weeks in the very 
middle of vacation season!]

These are my comments on IDNA Definitions (-10.txt). Most are editorial 
(but important to improve the readability and usability of the 
document), but there is one very important technical point.

In general, there is too much talking around and about; the document 
would be much easier to read if it used simpler and more direct language 
and shorter sentences.

- Use only one name for talking about the document collection. Currently:
   - 'collection' (Abstract, 1.1)
   - 'series' (1.1)
   - 'set' (1.3)
   - 'and the associated ones' (1.1.1)
   - 'these documents' (2.1; very unclear when reading whether that 
phrase indeed refers to the document collection or to Unicode documents 
or what, similar again in 2.2)
This variability is confusing.

1.1.1 Audiences
"what names are permitted in DNS zone files," -> "what names are 
permitted in DNS zones," (whether these are files or something else is 
implementation-dependent).

This section is very important, and would be much more effective with 
less circumlocution. Just use the straightforward terms for people and 
functions that everybody else uses, such as 'registries', 'registrars', 
'administrators creating subdomains', and so on, and then say that this 
list isn't exhaustive. That will have the additional benefit of bringing 
the document up in more of the relevant searches.

The second paragraph is also overly circumlocutory. Using "the one 
containing explanatory material" to refer to Rationale is a strong 
disservice to every reader, even if, strictly speaking, it may be 
preferable to a forward reference. Please use [] style references, or 
labels such as "Rationale" with a short sentence pointing to 1.3, or 
move the "who should read what" info to 1.3 with a general pointer from 
1.1.1. But please stop talking around things that can easily be 
expressed more directly (this general comment applies in many other 
places, too).

1.1.1 should be 1.2, and 1.1.2 should be 1.3, and 1.3 should be 1.4, to 
simplify structure.

2.1: Say that 0x means hexadecimal (first para)

2.3.1, title: This looks as if this section defines one term, 
"LDH-Label". Change the title to something more general, such as 
"Definitions for ASCII-only Labels".

2.3.1, general (but most urgently 3rd para): Make sure that the terms 
defined stick out, at least the same way as in 2.1 (one para per def, 
defined word is first word of para). Move the clear and simple 
definition to the front, and rationale, relationships, ... to the end of 
the paragraph.

2.3.1: Move normative text to Protocol ("those labels MUST NOT be 
processed as ordinary LDH-labels by IDNA-conforming programs and SHOULD 
NOT be mixed with IDNA-labels in the same zone")

2.3.1, 3rd para: "but which otherwise conform to LDH-label rules" -> 
"but otherwise conform to LDH-label rules"

2.3.1, 3rd para: "case-independent" -> "case-insensitive"

2.3.1, 3rd para: "divided in" -> "divided into"

2.3.1: "for future extensions that use extensions based on the same 
"prefix and encoding" model"": a) 'extensions' is repeated; b) the IETF 
is at its best when it does not talk about future eventualities or 
describe general models that may never be used. In this and other 
sections, such material should be cut.

2.3.1, anchor10: I do not understand why we need the (1)..(4) notes. 
Either the definitions are clear enough, or they should be fixed. 
Something like "NON-RESERVED LDH LABELS (NR-LDH-labels) NR-LDH LABELS" 
is also total overkill. The only thing that's necessary is "NR-LDH 
labels", with exactly the same capitalization and hyphenation as in the 
definition.
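
For readers without the draft at hand, the division of ASCII-only labels 
that these 2.3.1 comments refer to can be sketched roughly as below. 
This is only an illustration of my reading: the function name and the 
category strings are made up for this example (only "NR-LDH label" is a 
term from the draft), and the sketch does not check whether an xn-- 
label actually decodes as Punycode.

```python
import re

# Letters, digits, hyphen; no leading or trailing hyphen; 1 to 63 octets.
_LDH = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def classify_ascii_label(label: str) -> str:
    """Return a rough category for an ASCII-only label (illustrative names)."""
    if not _LDH.fullmatch(label):
        return "not an LDH label"
    # Labels with "--" in the third and fourth positions are reserved.
    if label[2:4] != "--":
        return "NR-LDH label"
    # The prefix comparison is case-insensitive, like DNS matching itself.
    if label[:2].lower() == "xn":
        return "reserved LDH label with xn-- prefix"
    return "reserved LDH label with another prefix"


if __name__ == "__main__":
    for sample in ("example", "xn--bcher-kva", "ab--cd", "-bad"):
        print(sample, "->", classify_ascii_label(sample))
```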

2.3.1, Fig. 2: I'm somewhat confused here. Note (5) seems to suggest 
that U-labels have a fixed binary encoding (e.g. UTF-8) and are used 
directly in the DNS. Otherwise, the note doesn't make sense.

2.3.2.1, "While that constraint may be tested in any of several ways, an 
A-label must be capable of being produced by conversion from a U-label 
and a U-label must be capable of being produced by conversion from an 
A-label.": This puts the cart before the horse. Change to "An A-label 
must be capable of being produced by conversion from a U-label and a 
U-label must be capable of being produced by conversion from an A-label. 
There are several ways in which this constraint may be tested."
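
One of those several ways is a plain round-trip check. The following is 
a minimal sketch using the punycode codec from Python's standard 
library; the helper names are invented for this example, and the sketch 
deliberately leaves out all the other IDNA validity rules (permitted 
code points, bidi, contextual rules, length).

```python
ACE_PREFIX = "xn--"


def to_a_label(u_label: str) -> str:
    """Punycode-encode a label and prepend the ACE prefix."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Strip the ACE prefix (case-insensitively) and Punycode-decode the rest."""
    if a_label[:4].lower() != ACE_PREFIX:
        raise ValueError("label does not start with the ACE prefix")
    return a_label[4:].encode("ascii").decode("punycode")


def round_trips(u_label: str) -> bool:
    """True if converting to A-label form and back yields the same string."""
    return to_u_label(to_a_label(u_label)) == u_label


if __name__ == "__main__":
    print(to_a_label("bücher"), round_trips("bücher"))
```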

2.3.2.1, "Among other things, this implies that both U-labels and 
A-labels must be strings in Unicode NFC [Unicode-UAX15] normalized 
form.": A-labels are by definition in NFC, because they are ASCII-only. 
If you want to say that they must *represent* labels that are in NFC, 
that would be fine, but I think mentioning NFC here isn't really necessary.
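
The point about ASCII-only strings is easy to check. A small sketch with 
Python's unicodedata module (the helper name is made up for this 
example): NFC normalization leaves any ASCII string unchanged, so the 
requirement can only say something about the label that the A-label 
represents.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """True if NFC normalization leaves the string unchanged."""
    return unicodedata.normalize("NFC", s) == s


if __name__ == "__main__":
    print(is_nfc("xn--bcher-kva"))   # ASCII-only A-label form
    print(is_nfc("b\u00fccher"))     # precomposed u-umlaut
    print(is_nfc("bu\u0308cher"))    # u followed by combining diaeresis
```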

MAJOR!!!!!
2.3.2.1 says: "Any rules or conventions that apply to DNS labels in 
general, such as rules about lengths of strings, apply to whichever of 
the U-label or A-label would be more restrictive.  For the U-label, 
constraints imposed by existing protocols and their presentation forms 
make the length restriction apply to the length in octets of the UTF-8 
form of those labels (which will always be greater than or equal to the 
length in code points)."
Now this is TOTALLY NEW to me. There sure is a restriction to 63 octets 
in the DNS itself, but because U-labels don't enter the DNS as such 
(neither as UTF-8 nor as UTF-16 or whatever), an arbitrary UTF-8-based 
length restriction seems totally unjustified. I'm not at all aware of 
such a restriction in IDNA2003.
Indeed, punycode was explicitly designed, among other things, to perform 
well for scripts with few characters. For small scripts that need 3 
bytes per character in UTF-8 (all Indic scripts, Georgian, Sinhala, 
Thai, Lao, Tibetan, Myanmar, Ethiopic, Cherokee, Unified Canadian 
Aboriginal Syllabics, Khmer, ...), this restriction would mean a drastic 
reduction of the number of characters usable in a label. To give an 
example, when I was at W3C, I created some IRI tests 
(http://www.w3.org/2001/08/iri-test/).
The tests use Hiragana
(http://www.ほんとうにながいわけのわからないどめいんめいのらべるまだなが 
くしないとたりない.w3.mag.keio.ac.jp and http://ほんとうにながいわけのわ 
からないどめいんめいのらべるまだながくしないとたりない.ほんとうにながい 
わけのわからないどめいんめいのらべるまだながくしないとたりない.ほんとう 
にながいわけのわからないどめいんめいのらべるまだながくしないとたりな 
い.w3.mag.keio.ac.jp), which is atypical in that Hiragana-only Japanese 
is rarely used except in children's books, but it is typical in that 
punycode is able to represent 41 Hiragana (123 octets in UTF-8) in 58 
octets. Hiragana overall contains about 80 letters in a single block; 
punycode efficiency will vary with the size of the script (more 
efficient for smaller scripts, less efficient for larger scripts) as 
well as of course with every individual label.
Currently, IE7, Mozilla Firefox, Safari, and Opera (all on Windows) 
pass both length tests (single label and multiple labels). It would be 
very counterproductive if IDNA2008 required further artificial 
restrictions which essentially disfavor languages and cultures that 
haven't been lucky enough to get short encodings for their scripts in 
UTF-8.
(I'd be fine if the Security section warns about the potential of some 
protocols or implementations not having appropriate space, but that's on 
a completely different level.)
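
The difference between the two ways of measuring a label can be 
reproduced with a few lines of Python. This is a sketch only: the 
function names are invented for this example, the label is the Hiragana 
test label rejoined from the URLs as they appear in this message, and 
the numbers printed are simply whatever the standard punycode codec 
computes for that string.

```python
DNS_LABEL_LIMIT = 63  # octets per label in the DNS itself

# Test label, rejoined from the URLs quoted in this message.
LABEL = "ほんとうにながいわけのわからないどめいんめいのらべるまだながくしないとたりない"


def label_lengths(u_label: str) -> tuple[int, int, int]:
    """Return (code points, UTF-8 octets, A-label octets incl. 'xn--')."""
    code_points = len(u_label)
    utf8_octets = len(u_label.encode("utf-8"))
    a_label_octets = len("xn--") + len(u_label.encode("punycode"))
    return code_points, utf8_octets, a_label_octets


def fits_as_a_label(u_label: str) -> bool:
    """True if the A-label form stays within the DNS limit."""
    return label_lengths(u_label)[2] <= DNS_LABEL_LIMIT


def fits_as_utf8(u_label: str) -> bool:
    """True if the UTF-8 form would stay within the same limit."""
    return label_lengths(u_label)[1] <= DNS_LABEL_LIMIT


if __name__ == "__main__":
    print(label_lengths(LABEL))
    print(fits_as_a_label(LABEL), fits_as_utf8(LABEL))
```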

2.3.2.2: NR-LDH-label and Internationalized Label: The section doesn't 
say anything about "Internationalized Label"s, although this term 
appears in the title. (the definition is in 2.3.2.3)

2.3.2.3: SRV record labels are not Internationalized labels, and 
therefore domain names used for SRV aren't IDNs. That's fine by me, but 
it should nevertheless be made clear (here or elsewhere) that IDNs can 
be used with SRV, ... (this seems to be done at the end of 2.3.2.6, so 
this should be okay).

2.3.2.4: This seems to say that there is no equivalence between an 
all-lowercase A-label and an otherwise equal label where some letters 
(maybe accidentally) have been upper-cased. I think the cause of the 
problem is (as often in this document) the lack of consistent language. 
Instead of "and then testing for an exact match between the A-labels", 
say "and then testing for equivalence between the A-labels [using normal 
DNS matching rules]". If that's not what's intended, then some more 
background may be appropriate.
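
What I mean by testing for equivalence using normal DNS matching rules 
is roughly the following. This is a sketch of my suggestion, not of what 
the draft currently says, and the function name is made up for this 
example.

```python
def a_labels_match(a: str, b: str) -> bool:
    """Compare two A-labels the way the DNS compares ASCII labels.

    Encoding to ASCII first raises UnicodeEncodeError for anything that
    is not an ASCII-only label; bytes.lower() folds only ASCII letters.
    """
    return a.encode("ascii").lower() == b.encode("ascii").lower()


if __name__ == "__main__":
    print(a_labels_match("xn--bcher-kva", "XN--Bcher-KVA"))
    print(a_labels_match("xn--bcher-kva", "xn--bcher-kvb"))
```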

2.3.2.5: "a string of ASCII characters" -> "the string of ASCII characters"

2.3.3: "Because IDN labels may contain characters that are read, and 
preferentially displayed, from right to left,": Remove 'preferentially'. 
This may refer to some hopelessly broken systems, or to the fact that 
Arabic Braille is LTR, or to something else, but it is totally 
irrelevant and potentially misleading in this context.

2.3.3: Why doesn't this paragraph just refer to 'logical' 
representation, a term that people who know bidi are familiar with and 
that's widely used in Unicode?

2.3.4: "There has been some confusion about whether a "Punycode string" 
does or does not include the ACE prefix and about whether it is required 
that such strings could have been the output of the ToASCII operation":
a) The combination of 'required' and 'could' doesn't make ANY sense to me.
b) It is unclear what "such strings" refers to (with ACE prefix? without 
ACE prefix?)

2.3.4: "much more clear" -> "much clearer"

4: There should be a very short paragraph saying that this section 
provides an overview and pointers into the security sections of the 
other documents. (or whatever else exactly the relationships are)

4.1: "In addition to characters that are permitted by IDNA2003 and its 
mapping conventions": Does this mean "In addition to characters that are 
permitted by (IDNA2003 and its mapping conventions)" or "In addition to 
characters that are permitted by IDNA2003 and [in addition to] its 
mapping conventions"? Please clarify.

4.1: "problems that might raise" -> "problems that might arise"

4.2: "these specifications"? The IDNA2008 collection of specifications? 
Or the specifications for the local character sets?

4.2: "(or different versions of one application)" -> "(or different 
versions or parts of one application)" (yes, this can and does happen)

4.4: "comparisons be done properly, as specified in the Requirements 
section of [IDNA2008-Protocol]": If comparisons are dealt with in 
Protocol, what's the purpose of 2.3.2.4? And what's the purpose of 
trying to explain it all again just after the quoted sentence?

4.5: "Despite that prohibition, there are a significant number of files 
and databases on the Internet in which domain name strings appear in 
native-character form;": This makes it appear as if such files and 
databases are in violation of some spec. But they may simply contain 
IRIs instead of URIs. I would simply start the subsection with something 
like "As long as IDNA2003 labels have been kept in A-label form, the 
only differences in interpretation arise for characters whose ..." and 
then, in a new paragraph, continue "For IDNA2003 labels that have been 
kept in native encoding,..."

Regards,   Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst at it.aoyama.ac.jp