IDNAbis Goals

Mon Nov 27 18:19:17 CET 2006

In order to assess the advantages and disadvantages of any approach, we need
to have a good idea of the goals and the weights attached to them. Here is
an initial take on some of the issues so far discussed, divided into
categories.

A. Loosen some restrictions on IDNA. The goal is to allow, *where feasible*,
the same kind of expressive capability in other languages that is now
provided for in English. It should be recognized that not all reasonable
words of every language will qualify. Even in English, the lack of spaces and
other punctuation forces compromises: words like "can't" are disallowed.

Here is what I've heard so far:

   1. Allow Unicode 5.0 characters.
   2. Provide a mechanism for updating more quickly to successive
   Unicode versions.
   3. Allow combining marks at the end of bidi fields.
   4. Allow ZWJ/ZWNJ in limited contexts (see a previous message).

Except for #4, which probably most people haven't looked through yet, it
appears that these are mostly uncontroversial.

B. Tighten some restrictions on IDNA. The purpose of this appears to be to
reduce the opportunity for spoofing. Thus any proposed restrictions should
be assessed against that metric. That is: (a) does the restriction reduce
spoofing significantly, and (b) are there no other reasonable mechanisms for
doing so?

Here is what I've heard so far:

   1. Remove (or discourage) symbols and (most) punctuation.
      - This appears to be mostly uncontroversial. While the vast
      majority of symbols and punctuation do not cause spoofing problems
      (I♥NY.com is not a problem, for example), there is not enough value
      to having them to be worth the effort.
   2. Remove (or discourage) non-spacing marks.
      - This is quite controversial. These marks are needed by many
      languages; excluding them is like removing vowels from English:
      "microsoft.com" becoming "mcrsft.cm".
      - A very good case has to be made that (a) they cause problems,
      and (b) those problems can't feasibly be handled with other mechanisms.
   3. Remove (or discourage) archaic / technical characters (characters
   not in common modern use).
      - Unicode supplies a proposed list of such characters, in
      http://www.unicode.org/reports/tr39/#General_Security_Profile.
      However, it is recognized that any such list will need refinement and
      extension in the future.
      - Certain scripts are quite clearly archaic, and could be easily
      removed or discouraged.
      - Judging whether a character in a modern script is archaic,
      especially those in broad usage such as Latin, Arabic, and Cyrillic,
      can be quite difficult -- often these characters are pressed into
      use in minority languages.
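
To make categories 1 and 2 above a little more concrete, here is a rough
sketch (not a proposal) of how one might see which code points in a label
would be touched by such restrictions. It assumes the Unicode General
Category is a reasonable first approximation: S* and P* for symbols and
punctuation, Mn for non-spacing marks. The function name and the hyphen
exemption are my own illustrative assumptions, and the results depend on the
Unicode version shipped with the Python interpreter, not on Unicode 5.0
specifically.

```python
import unicodedata

# Assumption: the ASCII hyphen stays allowed, as in ordinary host names.
EXEMPT = {"-"}


def flag_code_points(label):
    """Group the characters of a label by the restriction that would hit them.

    Illustrative only: uses the General Category as a stand-in for the
    classes discussed above, which any real rule would need to refine.
    """
    flagged = {"symbol_or_punctuation": [], "nonspacing_mark": []}
    for ch in label:
        if ch in EXEMPT:
            continue
        category = unicodedata.category(ch)
        if category[0] in ("S", "P"):
            # Symbols (S*) and punctuation (P*): item 1.
            flagged["symbol_or_punctuation"].append(ch)
        elif category == "Mn":
            # Non-spacing marks: item 2.
            flagged["nonspacing_mark"].append(ch)
    return flagged


if __name__ == "__main__":
    for label in ("I\u2665NY", "can't", "e\u0301cole", "microsoft"):
        print(repr(label), flag_code_points(label))
```

A check like this only shows what a blanket restriction would catch; it says
nothing about whether a flagged character actually causes a spoofing
problem, which is the question that has to be answered for each category.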

A major issue is the choice between removal and discouragement. Removal has
the very significant cost of breaking backwards compatibility, so a clear
case has to be made that there is no feasible alternative to handle spoofing
problems that would otherwise occur.

Mark