Comments on IDNAbis issues-05

Mark Davis mark.davis at icu-project.org
Fri Dec 14 04:48:37 CET 2007


http://www.ietf.org/internet-drafts/draft-klensin-idnabis-issues-05.txt
Overview. Many nice improvements to the text.

This list does not repeat what was already commented on for
draft-klensin-idnabis-protocol-02.txt <http://www.ietf.org/internet-drafts/draft-klensin-idnabis-protocol-02.txt>
and draft-faltstrom-idnabis-tables-03.txt <http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-03.txt>.

Issues-1. IDNAbis has a major backwards-compatibility issue with IDNA2003:
thousands of characters that used to be valid are now excluded. Given that,
what reason do people have to believe that, despite the terms NEVER and
ALWAYS, some future version (an IDNAbis-bis) would not do the same again?

Issues-2. IDNA2003 provided for backwards compatibility by disallowing
unassigned characters in registration while allowing them in lookup; that
let old clients keep working in the presence of newer software. While this
becomes less of a problem once we update to Unicode 5.1, the document should
make clear why this change is being made.

Issues-3. In general, whenever a statement is made about some class of
characters causing a problem, at least one clear example should be provided,
as in draft-alvestrand-idna-bidi-01.txt<http://www.ietf.org/internet-drafts/draft-alvestrand-idna-bidi-01.txt>

Issues-4. I would strongly suggest moving all of the "why did we do this"
and "how is it different from IDNA2003" material into a separate document.
It will be of only historical interest after this becomes final, and will
then only clutter the document.

Details.
Issues-5.

   IDNA uses the Unicode character repertoire, which avoids the
   significant delays that would be inherent in waiting for a different
   and specific character set be defined for IDN purposes, presumably by
   some other standards developing organization.

This seems odd; there are no other contenders in the wings. If this has to
be said at all, it would be better just to cite other IETF documents
describing the reasons for using Unicode.

Issues-6.

   To improve clarity, this document introduces three new terms.  A
   string is "IDNA-valid" if it meets all of the requirements of this
   specification for an IDNA label.  It may be either an "A-label" or a
   "U-label", and it is expected that specific reference will be made to
   the form appropriate to any context in which the distinction is
   important.
...
   A "U-label" is an IDNA-valid string of
   Unicode-coded characters that is a valid output of performing
   ToUnicode on an A-label, again regardless of how the label is
   actually produced.

These definitions appear circular, so they need to be teased out a bit.
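
To illustrate the intended, non-circular reading, here is a rough Python
sketch using the stdlib IDNA2003-era helpers; the sample label is my own
illustration, not text proposed for the draft:

    # Sketch only: an A-label is the ACE (xn--) form, a U-label is its
    # ToUnicode image; each must also satisfy the IDNA validity rules.
    from encodings import idna

    u_label = "bücher"                  # a U-label (Unicode form)
    a_label = idna.ToASCII(u_label)     # its A-label, b'xn--bcher-kva'
    back = idna.ToUnicode(a_label)      # 'bücher' again
    print(a_label, back)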


Issues-7.

   Depending on the system involved, the major difficulty may not lie in
   the mapping but in accurately identifying the incoming character set
   and then applying the correct conversion routine.  It may be
   especially difficult when the character coding system in local use is
   based on conceptually different assumptions than those used by
   Unicode about, e.g., how different presentation or combining forms
   are handled.  Those differences may not easily yield unambiguous
   conversions or interpretations even if each coding system is
   internally consistent and adequate to represent the local language
   and script.

I suggest the following rewrite:

The main difficulty typically lies in accurately identifying the incoming
character set so as to apply the correct conversion routine. In theory,
conversion could be difficult if the non-Unicode character encoding system
were based on conceptually different assumptions than those used by Unicode
about, e.g., how different presentation or combining forms are handled;
examples are the so-called "font-encodings" used on some Indian websites.
However, in modern software, such character sets are rarely used except for
specialized display.
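
As a rough illustration of that point (my own example, not proposed draft
text; the byte string and charset guesses are assumptions):

    # Sketch: the conversion itself is easy; knowing which legacy charset
    # the bytes are in is the hard part.
    raw = b"Gr\xfc\xdfe"                # "Grüße" in ISO 8859-1 / windows-1252

    print(raw.decode("iso-8859-1"))     # right guess -> 'Grüße'
    try:
        raw.decode("utf-8")             # wrong guess: not valid UTF-8
    except UnicodeDecodeError as e:
        print("misidentified charset:", e)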


Issues-8.

   That, in turn, indicates that the script community
   relevant to that character, reflecting appropriate authorities for
   all of the known languages that use that script, has agreed that the
   script and its components are sufficiently well understood.  This
   subsection discusses characters, rather than scripts, because it is
   explicitly understood that a script community may decide to include
   some characters of the script and not others.

   Because of this condition, which requires evaluation by individual
   script communities of the characters suitable for use in IDNs (not
   just, e.g., the general stability of the scripts in which those
   characters are embedded) it is not feasible to define the boundary
   point between this category and the next one by general properties of
   the characters, such as the Unicode property lists.

There is no justification given for this process. Moreover, it is doomed to
failure: merely identifying the "script communities" is an impossible task.
Who speaks for the Arabic-script world? Saudi Arabia (Arabic)? Iran
(Persian, ...)? Pakistan (Urdu, ...)? China (Uighur, ...)?

Issues-9.

      it is removed from Unicode.

(multiple instances)

This is not necessary; characters aren't removed from Unicode. If you really
have to have it, then add "(however, the Unicode stability policies
expressly forbid this)".

Issues-10.

   Applications are expected to not treat "ALWAYS" and "MAYBE"
   differently with regard to name resolution ("lookup").  They may
   choose to provide warnings to users when labels or fully-qualified
   names containing characters in the "MAYBE" categories are to be

In practice, expecting applications to treat these differently is wishful
thinking, especially if the distinction seems Eurocentric to users (see the
other notes on MAYBE). Registries, in any case, always have the ability to
filter characters out. See above on removing MAYBE.


Issues-11.

   5.1.3.  CONTEXTUAL RULE REQUIRED

I know what the point is supposed to be (and don't disagree), but this
section was very hard to make out.


Issues-12.

   Characters that are placed in the "NEVER" category are never removed
   from it or reclassified.  If a character is classified as "NEVER" in
   error and the error is sufficiently problematic, the only recourse is
   to introduce a new code point into Unicode and classify it as "MAYBE"
   or "ALWAYS" as appropriate.

The odds of this happening are extremely low; anything placed in NEVER has
to be a case we are extremely certain about.

Issues-13.

   Instead, we need to have a variety of approaches that, together,
   constitute multiple lines of defense.

Defense against what? Without examples, it is hard to say what the problems
are.

Issues-14.

   Applications MAY
   allow the display and user input of A-labels, but are not encouraged
   to do so except as an interface for special purposes, possibly for
   debugging, or to cope with display limitations.  A-labels are opaque
   and ugly, and, where possible, should thus only be exposed to users
   who absolutely need them.  Because IDN labels can be rendered either
   as the A-labels or U-labels, the application may reasonably have an
   option for the user to select the preferred method of display; if it
   does, rendering the U-label should normally be the default.

Add:


It is, however, now common practice to display a suspect U-label (such as
one mixing Latin and Cyrillic) as an A-label.
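
A rough sketch of that practice (my own illustration; the range test below
is a crude stand-in for a real script check, and the sample label is
invented):

    def looks_mixed(label: str) -> bool:
        # crude heuristic: basic Latin letters plus anything in the Cyrillic block
        has_latin = any("a" <= c.lower() <= "z" for c in label)
        has_cyrillic = any("\u0400" <= c <= "\u04ff" for c in label)
        return has_latin and has_cyrillic

    label = "p\u0430ypal"               # second letter is CYRILLIC SMALL LETTER A
    shown = label.encode("idna").decode("ascii") if looks_mixed(label) else label
    print(shown)                        # shows the xn-- A-label, not the U-label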


Issues-15.

   6.3.  The Ligature and Digraph Problem

   There are a number of languages written with alphabetic scripts in
   which single phonemes are written using two characters, termed a

   "digraph", for example, the "ph" in "pharmacy" and "telephone".

The text has been improved considerably from earlier versions, but the whole
issue is just a special case of the fact that words are spelled in different
ways in different languages or language variants. It really has nothing to
do with ligatures and digraphs: the same issue arises between theatre.com
and theater.com as between a Norwegian URL with ae and a Swedish one with
a-umlaut.

So if you retain this section, it should be recast as something like


6.3 Linguistic Expectations

Users often have certain expectations based on their language. A Norwegian
user might expect a label with the ae-ligature to be treated as the same
label as the Swedish spelling with a-umlaut. A German user might expect a
label with a u-umlaut and the same label spelled with "ue" to resolve the
same. For that matter, an English user might expect "theater.com" and
"theatre.com" to resolve the same. [more in that vein]


Issues-16.

 there is no evidence that
they are important enough to Internet operations or
internationalization to justify large numbers of special cases
and character-specific handling (additional discussion and

I suggest the following wording instead:

 there is no evidence that
 they are important enough to Internet operations or
 internationalization to justify inclusion (additional discussion and

It doesn't actually involve "large numbers of special cases"; there is only
a rather small percentage of demonstrable problems in the symbol/punctuation
area. What we could say is that there is general consensus that removing all
but letters, digits, numbers, and marks (with some exceptions) causes little
damage in terms of backwards compatibility, and does remove some problematic
characters like the fraction slash.
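
For instance, the cut could be described roughly in terms of the Unicode
general categories; this sketch is only illustrative, and the exception list
is invented rather than what the tables draft actually specifies:

    import unicodedata

    ALLOWED_MAJOR = {"L", "M", "N"}     # Letter, Mark, Number
    EXTRA_ALLOWED = {"-"}               # e.g. hyphen-minus stays legal

    def kept(ch: str) -> bool:
        return ch in EXTRA_ALLOWED or unicodedata.category(ch)[0] in ALLOWED_MAJOR

    print(kept("a"), kept("9"), kept("ä"))   # True True True
    print(kept("\u2044"))                    # False: FRACTION SLASH is category Sm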


Issues-17.

   For example, an essential
   element of the ASCII case-mapping functions is that
   uppercase(character) must be equal to
   uppercase(lowercase(character)).

Remove or rephrase. It is a characteristic, but not an essential one. In
fact, case mappings of strings are lossy; once you lowercase "McGowan", you
can't recover the original.
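
A small Python illustration of both points (mine, not draft text):

    s = "McGowan"
    # the ASCII round-trip property does hold ...
    assert s.upper() == s.lower().upper()
    # ... but string case mapping is lossy: from 'mcgowan' alone the
    # original mixed-case form cannot be recovered
    print(s.lower())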


Issues-18.

   o  Unicode names for letters are fairly intuitive, recognizable to
      uses of the relevant script, and unambiguous.  Symbol names are
      more problematic because there may be no general agreement on
      whether a particular glyph matches a symbol, there are no uniform
      conventions for naming, variations such as outline, solid, and

Actually, the formal Unicode names are often far from intuitive to users of
the relevant script. That is because of the constraints of using ASCII for
the names, in order to line up with ISO standards for character encodings.
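
For example (my illustration; which "heart" character a user means is itself
ambiguous):

    import unicodedata
    print(unicodedata.name("\u2665"))   # 'BLACK HEART SUIT'
    print(unicodedata.name("\u2764"))   # 'HEAVY BLACK HEART'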


This section is not really needed. The use of I<heart>NY.com is not really
problematic; the main justification for removing such characters is that we
don't think they are needed (and they have not been used much since IDNA was
introduced). Better to just stick with that.


Issues-19

   11.  IANA Considerations

   11.1.  IDNA Permitted Character Registry

   The distinction between "MAYBE" code points and those classified into
   "ALWAYS" and "NEVER" (see Section 5) requires a registry of
   characters and scripts and their categories.  IANA is requested to


Expecting an IANA registry to maintain this is setting it up for failure. If
this were to be done, precise and lengthy guidance as to the criteria for
removing characters (moving to NEVER) would have to be supplied, because of
the irrevocable nature of this step. The odds of a registry being able to
perform this correctly are very small.


The best alternative would be to simply give all the non-historic scripts
the same status in draft-faltstrom-idnabis-tables-03.txt
<http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-03.txt>,
that is, the same status as Latin, Greek, and Cyrillic.

The second-best alternative would be to have the Unicode Consortium make the
determinations (and take the heat for any objections).


Issues-20

   Some specific suggestion
   about identification and handling of confusable characters appear in
   a Unicode Consortium publication [???]

Use: [UTR36]
      UTR #36: Unicode Security Considerations
      http://www.unicode.org/reports/tr36/