My comments on IDNA Definitions (-10.txt)
"Martin J. Dürst"
duerst at it.aoyama.ac.jp
Mon Aug 24 12:17:53 CEST 2009
[I'm not at all happy about a two-week-only WG last call in the very
middle of vacation season!]
These are my comments on IDNA Definitions (-10.txt). Most are editorial
(but important to improve the readability and usability of the
document), but there is one very important technical point.
In general, there is too much talking around and about; the document
would be much easier to read if it used simpler and more direct language
and shorter sentences.
- Use only one name for talking about the document collection. Currently:
- 'collection' (Abstract, 1.1)
- 'series' (1.1)
- 'set' (1.3)
- 'and the associated ones' (1.1.1)
- 'these documents' (2.1; very unclear when reading whether that
phrase indeed refers to the document collection or to Unicode documents
or what, similar again in 2.2)
This variability is confusing.
1.1.1 Audiences
"what names are permitted in DNS zone files," -> "what names are
permitted in DNS zones," (whether these are files or whatever is
implementation-dependent).
This section is very important, and would be much more effective with
less circumlocution. Just use the straightforward terms for
people/functions that everybody else uses, such as 'registries',
'registrars',
'administrators creating subdomains', and so on, and then say that this
list isn't exclusive. That will have the additional benefit of bringing
the document up in more of the relevant searches.
The second paragraph is also overly circumlocutory. Using "the one
containing explanatory material" to refer to Rationale is a strong
disservice to every reader, even if, strictly speaking, it may be
preferable to a forward reference. Please use [] style references, or labels such
as "Rationale" with a short sentence pointing to 1.3, or move the "who
should read what" info to 1.3 with a general pointer from 1.1.1. But
please stop talking around stuff that can be easily expressed more
directly (this general comment applies in many other places, too).
1.1.1 should be 1.2, and 1.1.2 should be 1.3, and 1.3 should be 1.4, to
simplify structure.
2.1: Say that 0x means hexadecimal (first para)
2.3.1, title: This looks as if this section defines one term,
"LDH-Label". Change the title to something more general, such as
"Definitions for ASCII-only Labels".
2.3.1, general (but most urgently 3rd para): Make sure that the terms
defined stick out, at least the same way as in 2.1 (one para per def,
defined word is first word of para). Move the clear and simple
definition to the front, and rationale, relationships,... to the end of
the paragraph.
2.3.1: Move normative text to Protocol ("those labels MUST NOT be
processed as ordinary LDH-labels by IDNA-conforming programs and SHOULD
NOT be mixed with IDNA-labels in the same zone")
2.3.1, 3rd para: "but which otherwise conform to LDH-label rules" ->
"but otherwise conform to LDH-label rules"
2.3.1, 3rd para: "case-independent" -> "case-insensitive"
2.3.1, 3rd para: "divided in" -> "divided into"
2.3.1: "for future extensions that use extensions based on the same
"prefix and encoding" model"": a) 'extensions' is repeated; b) the IETF
is great at not talking about future eventualities and describing
general models that may never be used. In this and other sections, such
stuff should be cut out.
2.3.1, anchor10: I do not understand why we need the (1)..(4) notes.
Either the definitions are clear enough, or they should be fixed.
Something like "NON-RESERVED LDH LABELS (NR-LDH-labels) NR-LDH LABELS"
is also total overkill. The only thing that's necessary is "NR-LDH
labels", with exactly the same capitalization and hyphenation as in the
definition.
2.3.1, Fig. 2: I'm somewhat confused here. Note (5) seems to suggest
that U-labels have a fixed binary encoding (e.g. UTF-8) and are used
directly in the DNS. Otherwise, the note doesn't make sense.
2.3.2.1, "While that constraint may be tested in any of several ways, an
A-label must be capable of being produced by conversion from a U-label
and a U-label must be capable of being produced by conversion from an
A-label.": This puts the cart before the horse. Change to "An A-label
must be capable of being produced by conversion from a U-label and a
U-label must be capable of being produced by conversion from an A-label.
There are several ways in which this constraint may be tested."
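One of those "several ways" is a plain round trip. The following is only a minimal Python sketch: it assumes the standard library's punycode codec, uses function names made up for illustration, and checks none of the other IDNA2008 rules (permitted code points, bidi, contextual rules).

```python
ACE_PREFIX = "xn--"

def to_a_label(u_label: str) -> str:
    """Convert a (presumed valid) U-label to its ACE form."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")

def to_u_label(a_label: str) -> str:
    """Convert an ACE string back to Unicode; raises ValueError on bad input."""
    if not a_label.lower().startswith(ACE_PREFIX):
        raise ValueError("missing ACE prefix")
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")

def round_trips(a_label: str) -> bool:
    """True if the string survives A-label -> U-label -> A-label unchanged."""
    try:
        return to_a_label(to_u_label(a_label)) == a_label
    except ValueError:  # UnicodeError is a subclass of ValueError
        return False
```

A string that merely starts with the ACE prefix but does not survive this round trip would not qualify as an A-label under the quoted constraint.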
2.3.2.1, "Among other things, this implies that both U-labels and
A-labels must be strings in Unicode NFC [Unicode-UAX15] normalized
form.": A-labels are by definition in NFC, because they are ASCII-only.
If you want to say that they must *represent* labels that are in NFC,
that would be fine, but I think mentioning NFC here isn't really necessary.
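To illustrate why the NFC statement is vacuous for A-labels: NFC leaves every ASCII-only string unchanged, so only the Unicode form can fail the check. A small Python sketch using the standard unicodedata module (the helper name is just for illustration):

```python
import unicodedata

def is_nfc(s: str) -> bool:
    """True if the string is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s) == s

# An ASCII-only string (such as any A-label) is always in NFC.
print(is_nfc("xn--bcher-kva"))   # True
# The precomposed form of the corresponding U-label is in NFC ...
print(is_nfc("b\u00fccher"))     # True
# ... while the decomposed form (u + combining diaeresis) is not.
print(is_nfc("bu\u0308cher"))    # False
```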
MAJOR!!!!!
2.3.2.1 says: "Any rules or conventions that apply to DNS labels in
general, such as rules about lengths of strings, apply to whichever of
the U-label or A-label would be more restrictive. For the U-label,
constraints imposed by existing protocols and their presentation forms
make the length restriction apply to the length in octets of the UTF-8
form of those labels (which will always be greater than or equal to the
length in code points)."
Now this is TOTALLY NEW to me. There sure is a restriction to 63 octets
in the DNS itself, but because U-labels don't enter the DNS as such
(neither as UTF-8 nor as UTF-16 or whatever), an arbitrary UTF-8-based
length restriction seems totally unjustified. I'm not at all aware of
such a restriction in IDNA2003.
Indeed, punycode was explicitly designed, among other things, to perform
well for scripts with few characters. For small scripts that need 3 bytes
per character in UTF-8 (all Indic scripts, Georgian, Sinhala, Thai, Lao,
Tibetan, Myanmar, Ethiopic, Cherokee, Unified Canadian Aboriginal
Syllabics, Khmer,...), this restriction would mean a drastic reduction of
the number of characters usable in a label. To give an example, when at
W3C, I created some IRI tests (http://www.w3.org/2001/08/iri-test/).
The tests use Hiragana
(http://www.ほんとうにながいわけのわからないどめいんめいのらべるまだなが
くしないとたりない.w3.mag.keio.ac.jp and http://ほんとうにながいわけのわ
からないどめいんめいのらべるまだながくしないとたりない.ほんとうにながい
わけのわからないどめいんめいのらべるまだながくしないとたりない.ほんとう
にながいわけのわからないどめいんめいのらべるまだながくしないとたりな
い.w3.mag.keio.ac.jp), which is atypical in that Hiragana-only Japanese
is rarely used except in children's books, but it is typical in that
punycode is able to represent 41 Hiragana (123 octets in UTF-8) in 58
octets. Hiragana overall contains about 80 letters in a single block;
punycode efficiency will vary with the size of the script (more
efficient for smaller scripts, less efficient for larger scripts) as
well as of course with every individual label.
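The three lengths involved can be compared directly. This is a rough Python sketch, assuming the standard library's punycode codec and counting the "xn--" prefix as part of the ACE form; the function name is made up for illustration, and the label is the Hiragana test label as it appears above.

```python
def label_lengths(u_label: str) -> dict:
    """Return the length of a label in code points, UTF-8 octets and ACE octets."""
    ace = "xn--" + u_label.encode("punycode").decode("ascii")
    return {
        "code_points": len(u_label),
        "utf8_octets": len(u_label.encode("utf-8")),
        "ace_octets": len(ace),
    }

# The Hiragana label from the test URLs above.
label = "ほんとうにながいわけのわからないどめいんめいのらべるまだながくしないとたりない"
print(label_lengths(label))
```

For a Hiragana-only label the UTF-8 length is always three times the number of code points, so a 63-octet limit on the UTF-8 form would cap such labels at 21 characters, whatever the length of the ACE form.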
Currently, IE7, Mozilla Firefox, Safari, and Opera (all on Windows)
pass both length tests (single label and multiple labels). It would be
very counterproductive if IDNA2008 required further artificial
restrictions which essentially disfavor languages and cultures that
haven't been lucky enough to get short encodings for their scripts in UTF-8.
(I'd be fine if the Security section warns about the potential of some
protocols or implementations not having appropriate space, but that's on
a completely different level.)
2.3.2.2: NR-LDH-label and Internationalized Label: The section doesn't
say anything about "Internationalized Label"s, although this term
appears in the title. (the definition is in 2.3.2.3)
2.3.2.3: SRV record labels are not Internationalized labels, and
therefore domain names used for SRV aren't IDNs. That's fine by me, but
it should nevertheless be made clear (here or elsewhere) that IDNs can
be used with SRV,... (this seems to be done at the end of 2.3.2.6, so
this should be okay)
2.3.2.4: This seems to say that there is no equivalence between an
all-lowercase A-label and an otherwise equal label where some letters
(maybe accidentally) have been upper-cased. I think the cause of the
problem is (as often in this document) the lack of consistent language.
Instead of "and then testing for an exact match between the A-labels",
say "and then testing for equivalence between the A-labels [using normal
DNS matching rules]". If that's not what's intended, then some more
background may be appropriate.
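Under the reading suggested here, the final comparison step would be an ASCII case-insensitive match of the two A-labels, as in ordinary DNS matching. A minimal Python sketch of that step only (function names are made up for illustration; conversion to A-labels is assumed to have happened already):

```python
def ascii_lower(s: str) -> str:
    """Lower-case A-Z only, leaving all other characters alone."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

def a_labels_match(a: str, b: str) -> bool:
    """Compare two A-labels the way DNS compares ASCII labels."""
    return ascii_lower(a) == ascii_lower(b)

print(a_labels_match("xn--bcher-kva", "XN--BCHER-KVA"))  # True
print(a_labels_match("xn--bcher-kva", "xn--bcher-kvb"))  # False
```

With an "exact match" reading, the first pair would instead be treated as different labels.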
2.3.2.5: "a string of ASCII characters" -> "the string of ASCII characters"
2.3.3: "Because IDN labels may contain characters that are read, and
preferentially displayed, from right to left,": Remove 'preferentially'.
This may refer to some hopelessly broken systems, or to the fact that
Arabic Braille is LTR, or something else, but is totally irrelevant and
potentially misleading in this context.
2.3.3: Why doesn't this paragraph just refer to 'logical'
representation, a term that people who know bidi are familiar with and
that's widely used in Unicode?
2.3.4: "There has been some confusion about whether a "Punycode string"
does or does not include the ACE prefix and about whether it is required
that such strings could have been the output of the ToASCII operation":
a) The combination of 'required' and 'could' doesn't make ANY sense to me.
b) It is unclear what "such strings" refers to (with ACE prefix? without
ACE prefix?)
2.3.4: "much more clear" -> "much clearer"
4: There should be a very short paragraph saying that this section
provides an overview and pointers into the security sections of the
other documents. (or whatever else exactly the relationships are)
4.1: "In addition to characters that are permitted by IDNA2003 and its
mapping conventions": Does this mean "In addition to characters that are
permitted by (IDNA2003 and its mapping conventions)" or "In addition to
characters that are permitted by IDNA2003 and [in addition to] its
mapping conventions"? Please clarify.
4.1: "problems that might raise" -> "problems that might arise"
4.2: "these specifications"? The IDNA2008 collection of specifications?
Or the specifications for the local character sets?
4.2: "(or different versions of one application)" -> "(or different
versions or parts of one application)" (yes, this can and does happen)
4.4: "comparisons be done properly, as specified in the Requirements
section of [IDNA2008-Protocol]": If comparisons are dealt with in
Protocol, what's the purpose of 2.3.2.4? And what's the purpose of
trying to explain it all again just after the quoted sentence?
4.5: "Despite that prohibition, there are a significant number of files
and databases on the Internet in which domain name strings appear in
native-character form;": This makes it appear as if such files and
databases are in violation of some spec. But they may simply contain
IRIs instead of URIs. I would simply start the subsection with something
like "As long as IDNA2003 labels have been kept in A-label form, the
only differences in interpretation arise for characters whose ..." and
then, in a new paragraph, continue "For IDNA2003 labels that have been
kept in native encoding,..."
Regards, Martin.
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst at it.aoyama.ac.jp