IDNA comments

Mon Jul 7 17:05:30 CEST 2008

This document: http://docs.google.com/Doc?id=dfqr8rd5_77hqpj2rfc

Now that the Unicode and CLDR releases are out the door, I've had some time
to turn back to IDNAbis. I did another pass through, and here are my
comments. The documents continue to show progress with each revision, so
that's very promising. Some of the main issues I see that still thread
through all of these documents are:

   1. *Stability of Labels.* I believe quite strongly that once a domain
   name is valid, it should not be invalidated by any later version of IDNA.
   Now, while we cannot prevent a later RFC from doing that, we *can* prevent
   such invalidation by the normal process of updating tables under these RFCs
   for new versions, adding exceptions, and changing contextual rules.
   2. *Instability of Nonlabels.* I also think that making nonlabels stable
   should *not* be a goal. It can't really be achieved anyway, since the
   presence of an UNALLOWED character can make a label invalid in version X
   yet valid in version Y (where that character is defined).
   3. *Management.* The process of adding backwards compatibility
   characters, context conditions, and exceptions needs to be much more
   definitive.

Documents

https://datatracker.ietf.org/drafts/?filename=idnabis

   1. http://tools.ietf.org/html/draft-ietf-idnabis-bidi
   2. http://tools.ietf.org/html/draft-ietf-idnabis-tables
   3. http://tools.ietf.org/html/draft-ietf-idnabis-protocol
   4. http://tools.ietf.org/html/draft-ietf-idnabis-rationale

Comments on bidi-00

   1. Typos (minor) : document needs spell-checking.
      1. ...when it's embedded...
      2.  ...compatibiltiy...
   2. Odd phrasing: should be specification, or document, ...
      - "This memo doesn't propose..."
   3. The following is problematic. It appears that the most we could do is
   place conditions on a label *in the context of the other labels*. That
   is, this would require a test on the entire domain name, not just on a
   single label.
      - "This specification forbids using leading European numbers in
      ASCII-only labels; this is in conflict with a large installed base of
      such labels. The harm resulting from violating this rule is seen when a
      label at the next level down in the hierarchy ends with a number
      (Arabic or European)."
   4. Older Comments on bidi:
   http://www.alvestrand.no/pipermail/idna-update/2008-March/001263.html
      - *So far as I can tell, these have not yet been addressed in the
      text.*

Comments on tables-01

   1. The use of human-readable names in this version is a big plus, thanks!
   2. "Codepoints with this property value will never be permitted in IDNs."
   Aside from the stability issue, this is a promise that cannot be kept,
   since a future RFC could modify this for IDNs (as pointed out on the list).
   Replace by:
   "are not permitted", or something like "will never be permitted unless
   this document were obsoleted".
   3. "It should be suitable for newer revisions of Unicode, as long as the
   Unicode properties on which it is based remain stable."
   Replace by:
   "This is suitable for any newer versions of Unicode as well. Changes in
   Unicode properties that do not affect the outcome of this process do not
   affect IDN. For example, a character can move from So to Sm, or from Lo to
   Lu, without affecting the table results. Moreover, even if such changes were
   to result, the BackwardCompatible list (2.2.3.) will be adjusted to
   ensure the stability of the results."
   4. ... on a two step procedure... => on a two-step procedure
   5. ... That a label consists only of codepoints... => However, that a
   label consists only of codepoints
   6. Section 2.1.3: there was a change in the definition of DICP in
   preparation for IDNA. See Derived Property: Default_Ignorable_Code_Point in
   http://www.unicode.org/Public/5.1.0/ucd/DerivedCoreProperties.txt for the
   updated text.
   7. "In many cases aliases are used in the data in the Unicode Standard.
   This document uses both the alias and the spelled out terms (for example
   alias Ll for the General Category Lowercase_Letter)."
   Replace with:
   "Unicode property names and property value names may have short
   abbreviations, such as gc for the General_Category property, and Ll for the
   Lowercase_Letter property value of that property."
   8. Sort the following by value instead of code point, for clarity.
   Ideally each value would be in its own subsection: PVALID, CONTEXTO,...

      002D; CONTEXTO  # HYPHEN-MINUS
      ...
      3007; PVALID    # IDEOGRAPHIC NUMBER ZERO
      303B; CONTEXTO  # VERTICAL IDEOGRAPHIC ITERATION MARK
      30FB; CONTEXTO  # KATAKANA MIDDLE DOT

   9. "The characters 02B9, 0375 and 0483..." In Unicode we have the
   convention that characters are represented by the format "U+02B9 MODIFIER
   LETTER PRIME" in free-flowing text, that is, always including the name. I
   strongly recommend that practice be followed in all of these documents; it
   makes it far easier for someone to follow what is going on (since most
   people don't memorize these numbers ;-). You can use
   http://unicode.org/cldr/utility/character.jsp?a=02B9 to get the name, or
   just grep the main unicode property file.
   10. "This category includes the codepoints that property values in
   versions of Unicode after 5.0". The 5.0 value was changed to 5.1 in most
   cases, but not here. Search all the documents for 5.0 in case any others
   were missed.
   11. "As the requirement is that codepoints having either of these
   derived..." Missing reference. What requirement?
   12. "This category consists of codepoints in the Unicode character set
   that are not (yet) assigned. It should be noted that the set of unassigned
   characters is the larger set {Cn, Cs}."
   The last sentence needs clarification. I suggest:
   "It should be noted that Unicode distinguishes between 'unassigned code
   points' and 'unassigned characters'. The unassigned code points are all but
   (Cn - Noncharacters), while the unassigned *characters* are all but (Cn +
   Cs)."
   13. "If needed, IANA should (with the help of an appointed expert)
   suggest updates of this RFC where BackwardCompatible (Section 2.2.3) is
   updated, a set that is at release of this document is empty."
   This isn't going to work. I suggest that the backwards compatible
   character list, the exceptions list, and the context rules all be in a
   single document published by IANA, and controlled by the group discussed in
   rationale. We then need to provide guidance and constraints on this group.
   This kind of process is not new: for example, BCP 47 has very stringent
   guidelines on how the IANA language-subtag-registry is to be changed. In
   this case, the text should read something like:
   "If as a result of property changes in a version of Unicode, any assigned
   character under the old version of Unicode would have a different value
   according to this document than in the new version, then the IANA IDNA
   committee must amend the BackwardsCompatible List to ensure that the value
   remains stable. This must be published by IANA immediately upon release of
   the new version of Unicode (such timing is easily feasible because of the
   long lead times for Unicode beta versions)."
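
The stability rule proposed in item 13 amounts to a diff between the derived
tables for two Unicode versions. The following is only a minimal sketch of
that idea, assuming each table is a plain mapping from code point to derived
value. The function name, the "UNASSIGNED" value name, and the sample data
are hypothetical and are not taken from the drafts or from any real Unicode
data.

```python
def backward_compatible_entries(old_table, new_table):
    """Return {code point: old value} for every code point that was assigned
    under the old version and whose derived value differs under the new one.

    Code points that were unassigned in the old version are skipped, since
    they are expected to get a real value once they are assigned.
    """
    pinned = {}
    for cp, old_value in old_table.items():
        if old_value == "UNASSIGNED":
            continue
        new_value = new_table.get(cp, "UNASSIGNED")
        if new_value != old_value:
            # Pin the old value so that existing labels stay valid (or invalid).
            pinned[cp] = old_value
    return pinned


# Hypothetical derived values, for illustration only.
old = {0x002D: "CONTEXTO", 0x3007: "PVALID", 0x0378: "UNASSIGNED", 0x2F00: "DISALLOWED"}
new = {0x002D: "CONTEXTO", 0x3007: "PVALID", 0x0378: "PVALID", 0x2F00: "PVALID"}

pinned = backward_compatible_entries(old, new)
for cp, value in sorted(pinned.items()):
    print(f"{cp:04X}; {value}")
```

In this sketch only the entry whose value changed while already assigned ends
up on the list; the newly assigned code point does not, which matches the
view that nonlabels need not be stable.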

Comments on protocol-01

   1. 4.3.2.4: the bidi constraints apply to more than just single labels.
   2. 4.4: "While exact policies are not specified as part of IDNA2008 and
   it is expected that different registries may specify different policies,
   there SHOULD be policies." This SHOULD is pointless, unless some
   constraints or guidance are given. Otherwise my policy could be "any valid
   IDNA label", which would be precisely the same as no policy at all.
   3. 5.2: "The local character set, character coding conventions, and, as
   necessary, display and presentation conventions, are converted to Unicode
   (without surrogates), paralleling the process described above in Section
   4.2."
   In the vast majority of cases in modern software, the local charset IS
   Unicode, so this may be confusing. Also, UTF-16 does and must use surrogate
   code units, so this needs to be more precise. And excluding surrogate code
   points isn't necessary since gc=Cs are forbidden anyway. Suggest:
   "The string is converted from the local character set into Unicode, if it
   is not already Unicode. The exact nature of this conversion is beyond the
   scope of this document, but may involve normalization, as described in
   Section 4.2."
   4. 5.4: "In general, that conversion and testing should be performed if
   the domain name will later be presented to the user in native character form
   (this requires that the lookup application be IDNA-aware)."
   Suppose that program X creates an A-Label from a U-Label, then sends that
   A-Label to program Y, which sends it to program Z, which sends it to program
   W, which displays it. It sounds like each of Y, Z, and W needs to validate. Is
   that the intent of this text? If it is only W that needs to validate, then
   it gets a bit murky in today's world, where the boundaries between
   cooperating processes and programs are very fuzzy.
   5. 5.5: "In parallel with the registration procedure...". The use of "in
   parallel with", here and elsewhere, sounds like they might be concurrent
   operations, which is not what is intended. I suggest other wording. Also,
   the steps in 5.5 are all the same as in 4.3 -- except for bidi. This fact
   should be very clear in the text.
   6. Regex table. See below for recommendations. Note that what we did in
   BCP 47 is have a separate document specifying the initial contents of the
   registry.
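
To illustrate the point in item 3 about surrogates, here is a small sketch
using Python's standard unicodedata module. The sample character is my own
choice, not something from the draft. It shows that a supplementary code
point is an ordinary letter, while its UTF-16 form necessarily uses surrogate
code units; only a surrogate code point has gc=Cs.

```python
import unicodedata

ch = "\U00010400"  # DESERET CAPITAL LETTER LONG I, outside the BMP

# The code point itself is an uppercase letter, not a surrogate.
category = unicodedata.category(ch)

# Its UTF-16 encoding, split into 16-bit code units.
utf16 = ch.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]


def is_surrogate_code_unit(unit):
    """True if a 16-bit code unit lies in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF


# A lone surrogate code point, by contrast, does have gc=Cs.
lone_category = unicodedata.category("\ud800")

print(category, [f"{u:04X}" for u in units], lone_category)
```

So "Unicode (without surrogates)" cannot be a statement about UTF-16 code
units; at most it can be a statement about code points, and those are
already excluded by their General_Category.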

Older Comments on protocol -- haven't reviewed these yet:
http://www.alvestrand.no/pipermail/idna-update/2008-March/001171.html

Comments on rationale-00

   1. Rationale is currently a mixture of background information, plus text
   required for conformance to the protocol, plus rationale for why to change,
   and meta discussions like "In these cases, we should avoid trying to tell
   implementers what...". It should focus solely on the rationale, and all the
   normative text should be moved into the protocol document.
   2. "IDNA uses the Unicode character repertoire, which avoids the
   significant delays that would be inherent in waiting for a different and
   specific character sets to be defined for IDN purposes, presumably by some
   other standards developing organization. " This is a very strange rationale,
   like saying: "IDNA uses the English language, which avoids the significant
   delays that would be inherent in waiting for a different and specific
   language to be defined for IDN purposes, presumably by some other standards
   developing organization." Delete the "which avoids... organization."
   3. "to reduce the opportunities for attacks on the encoding system." =>
   "to reduce the opportunities for attacks via the encoding system."
   4. "9. Make bidirectional domain names in a paragraph display in a
   non-surprising fashion." This is just a special case of the previous item,
   so delete.
   5. "11. Make some currently-valid labels that are not actually IDNA
   labels invalid." Why do we care that labels invalid under IDNA2003 are also
   invalid under IDNA2008? Why wouldn't they be? Perhaps an example would help
   to clarify this.
   6. "are needed to make reasonable use of some scripts but become
   invisible characters in others." => "are needed to make reasonable use of
   some scripts but have no visible effect in others." These characters are
   always invisible; they cause the surrounding characters to change form.
   7. "that are safe for use only in conjunction". Since you never say why
   they are unsafe, this needs clarification. Do you mean this because of
   visible confusability?
   8. "If a character is classified as "DISALLOWED" in error and the error
   is sufficiently problematic, the only recourse would be either to introduce
   a new code point into Unicode and classify it as "PROTOCOL-VALID...". Unless
   you have some evidence to think that this is a real possibility (I don't),
   it should be removed.
   9. "[anchor25: Note in Draft: the last sentence above basically
   duplicates a comment in Security Considerations. Is it worth having
   in both places??" For my part, yes.
   10. "Applications MAY allow the display and user input of A-labels, but
   are encouraged to not do so except as an interface for special purposes,
   possibly for debugging, or to cope with display limitations." There is
   widespread use of the A-Label to signal a possible spoof -- while you
   discuss that later, I think it's swimming against the tide not to mention it
   here.
   11. "the two-character sequence "ae" is usually treated as a fully
   acceptable alternate orthography." Add: "for the "umlauted a" character".
   12. "They may occur in running text or be processed by one system after
   being provided in another. They may wish to try to normalize..." The two
   "they"s have different subjects; reword.
   13. "use characters that cannot be represented directly in domain names
   but for which interpretations are provided." What is meant by this, and how
   is it different in IDNA2008? In both IDNA2003 and 2008 they are illegal.
   14. "If a domain name appears in an arbitrary context (such as running
   text), one may be faced with the requirement to know that a string is a
   domain name in order to adjust for the different forms of dots but also to
   have traditional dots to recognize that a string is a domain name -- an
   obvious contradiction." This is not a contradiction; remove it. For
   example, if one recognizes full-width dots when detecting URLs, then one
   can clearly also use them when parsing the name into labels.
   15. "None of those local decisions
   are a threat to interoperability as long as (i) only U-labels and
   A-labels are used in interchange with systems outside the local
   environment,...".
   It doesn't really follow that there are no problems. The obvious example of
   an interoperability problem is where a Turkish friend has a URL that works
   in his browser, copies the text into an email, and sends it to me. When I
   click on it, it either returns a 404 or, **much worse**, goes to a
   different website.
   16. "The fact that a word exists is
         not proof that it should be usable in a DNS label and DNS labels
         are not expected to be usable for multiple-word phrases (although
         they are certainly not prohibited if the conventions and
         orthography of a particular language cause that to be possible)."
   Add, to show that we are not playing favorites: "Even very common words in
   English like "can't" and "don't" are not allowed."
   17. "Most Unicode names for letters are, in most cases, fairly
         intuitive, unambiguous and recognizable to users of the relevant
         script....and there are far more squares of various flavors in
         Unicode than there are hearts or stars." This just needs to be
   removed; the argumentation is faulty. For the same pronunciation, Chinese
   has hundreds of possible characters. If you want another reason (and someone
   to point a finger at), you could say: "The Unicode Standard recommends that
   these types of identifiers not contain symbols [UAX31]."

Older Comments on issues (rationale) -- haven't reviewed these yet:
http://www.alvestrand.no/pipermail/idna-update/2008-March/001295.html

Regex format

I suggest that the table be formatted for clarity to not depend on
whitespace -- using names for each field -- and be broken into a list of
condition/result pairs.

Code point: 200C
Name:       ZERO WIDTH NON-JOINER
Lookup:     True

# Allow ZWNJ for breaking cursive connection, as needed in Farsi.
Before:     [[:Joining_Type=Dual_Joining:][:Joining_Type=Left_Joining:]] [:Joining_Type=Transparent:]*
After:      [:Joining_Type=Transparent:]* [[:Joining_Type=Dual_Joining:][:Joining_Type=Right_Joining:]]
Value:      PVALID

# Allow ZWNJ after letter-virama in same script, starting with
# Devanagari, Gurmukhi,...
Before:     [[:gc=Letter:]&[:Script=Deva:]] [[:ccc=Virama:]&[:Script=Deva:]]
Value:      PVALID
Before:     [[:gc=Letter:]&[:Script=Guru:]] [[:ccc=Virama:]&[:Script=Guru:]]
Value:      PVALID

Code point: 200D
Name:       ZERO WIDTH JOINER
...

Code point: 00B7
...

The reason for breaking the condition into two parts is that it is then very
clear exactly what we are testing.
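
One advantage of named fields is that the table becomes easy to read by
machine. As a rough sketch only, assuming each field sits on a single line
and each "Value:" line closes one condition/result pair, a reader for the
format above could look like this. The function name and the resulting
dictionary layout are hypothetical, not part of any draft.

```python
def parse_context_rules(text):
    """Parse 'Field: value' lines into one dict per code point.

    Each dict has the code point, any simple fields (Name, Lookup, ...), and
    a list of rules; each rule holds the optional 'before' and 'after'
    conditions plus the resulting 'value'.
    """
    records = []
    record = None
    conditions = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no data
        # Split at the first colon only; the conditions contain colons too.
        field, _, value = line.partition(":")
        field, value = field.strip(), value.strip()
        if field == "Code point":
            record = {"code_point": int(value, 16), "rules": []}
            records.append(record)
            conditions = {}
        elif record is None:
            raise ValueError(f"field before first code point: {line!r}")
        elif field in ("Before", "After"):
            conditions[field.lower()] = value
        elif field == "Value":
            record["rules"].append({**conditions, "value": value})
            conditions = {}
        else:
            record[field.lower()] = value
    return records


SAMPLE = """\
Code point: 200C
Name:       ZERO WIDTH NON-JOINER
Lookup:     True

# Allow ZWNJ after letter-virama in same script.
Before:     [[:gc=Letter:]&[:Script=Deva:]] [[:ccc=Virama:]&[:Script=Deva:]]
Value:      PVALID
Before:     [[:gc=Letter:]&[:Script=Guru:]] [[:ccc=Virama:]&[:Script=Guru:]]
Value:      PVALID
"""

records = parse_context_rules(SAMPLE)
for rule in records[0]["rules"]:
    print(rule.get("before"), "=>", rule["value"])
```

The sketch does not evaluate the conditions themselves; it only shows that
the layout does not depend on whitespace or column positions.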