Comments on IDNAbis tables-03

Mark Davis mark.davis at
Fri Dec 14 04:45:27 CET 2007


There is no operational difference between MAYBE YES and MAYBE NO, and no
characters that are in the latter. This distinction is really only
meaningful as internal tracking information inside whatever group controls
the future allocation of characters and should not appear here. (See also
Ken's email and trail under "Table issues (was: Re: IDNAbis documents)".)

Even further, MAYBE YES should not exist at all: a day or two of work by
script experts would be enough to move the vast majority of the current
'MAYBE YES' to the ALWAYS category.


There is a preference for Latin, Greek, Cyrillic, and Han which has no
principled basis. In particular, Latin, Cyrillic, and Han are some of the
most complicated scripts: Latin and Cyrillic, since they are used to write a
huge number of languages with a large number of variant characters, and Han
because of the history of character variations. Many, many scripts are less
problematic than Latin or Cyrillic, and there is no reason to favor Cyrillic
over, say, Armenian; it also gives the appearance of Eurocentrism where none
is intended.

From an old email:

"No reason is given for the focus on only European scripts; and that focus
will surely raise suspicions in many circles. While I'm sure that the
restriction to European languages is just because those are the ones the
small group of authors is familiar with, it will not be received well. If
"we the community" have "experienced that a number of scripts have issues
that are not resolved", then those problems should be enumerated
*explicitly*, not hidden away.

The situation might be different if we were starting from zero; but we are
not. We already have an IDNA system that works for a great many people. And
while there are security problems with it, those are well known and vendors
are dealing with them. Moreover, of the problems that IDNAbis solves, they
are just the easy ones -- the harder ones are ones like the ""
case, which the current suggestion for IDNAbis doesn't touch. So it feels
like we are looking at a proposal that:

1. doesn't actually help much with the practical problems that people face
2. solves the easy problems, but not the hard ones; so people have to
essentially do the work anyway
3. and removes much of the functionality, except for some favored groups:
Europe and the Americas"


The CONTEXT class should be heavily restricted, as per Ken's email, to only
2 characters (see "Table issues (Part 3: CONTEXT)" for details). Moreover,
the term Context is problematic: **many** characters are disallowed or
allowed, depending on context. Even a-z are disallowed in a field that also
contains RTL characters.


The list of historic scripts is very outdated. See
more details. The characters in Table 3 should also be reviewed as
possible exceptions.


Key to the success of this is the group that determines the future
allocation of characters. It must be very clear precisely what the grounds
are for removing characters (moving from MAYBE to NEVER); otherwise there
will be never-ending battles over individual characters. (Frankly, I believe
that the correct course of action would be to disallow the historic scripts
for now, but allow the characters in all other scripts, with very few


As in draft-alvestrand-idna-bidi-01.txt,
there should be at least one example motivating every case where a class of
characters is removed (this might be in one of the other documents instead
of here).


The entire description of the process is far too complicated for what is, at
core, a relatively simple process. It is further obfuscated by referring to
classes of characters by a letter category instead of a mnemonic.

Take the following from

      *  If the codepoint does not appear in any of the categories B
         (Section 2.1.2), C (Section 2.1.3), D (Section 2.1.4), E
         (Section 2.1.5) or F (Section 2.1.6), the value is ALWAYS.

That formulation is completely opaque. For transparency, I'd strongly
recommend that you reformulate this considerably. You could maintain part of
the structure that you have, if you wanted, by consistently using mnemonics
instead of Sections.

That is, give meaningful names to each Category in Section 2, such as:

A => Language-Characters
B => Unnormalized
C => Ignorable
D => Historical-Scripts
E => Disallowed-Blocks

The formulation can then be something like the following. (This is not
precisely equivalent to your formulation, which I found difficult to follow
-- it is the style of presentation that I'm focusing on).

Use the following procedure to determine the IDNA-Property of any code point
cp. Proceed through the rules, and return a value at the first that applies.

1a. If cp is in Exceptional-Always, return Always
1b. If cp is in Exceptional-Never, return Never
1c. If cp is in Exceptional-Maybe, return Maybe

Functional Exclusions
2. Else if cp is in Unnormalized, return Never
3. Else if cp is in Not-Case-Folded, return Never
4. Else if cp is in Ignorable, return Never

Usage Exclusions
5. Else if cp is in Historical-Scripts, return Never
6. Else if cp is in Disallowed-Blocks, return Never

LMN Inclusion
7. Else if cp is in Language-Characters, return Maybe

Exclude everything else
8. Else return Never

Note: Exceptional-Always would contain your Category H Always characters,
plus grandfathered Always characters, plus a-z, 0-9, -; Exceptional-Maybe
would add the Category H Maybe characters, and so on. The mechanism already
described in email for providing perfect stability would be to add
characters, where necessary, to these classes.
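
To show how little machinery the first-match procedure above needs, here is
a minimal Python sketch. It is only an illustration of the presentation
style, not the draft's actual calculation. The class names are the mnemonics
suggested above; the function name, the enum, and the tiny TOY_CLASSES
membership sets are assumptions made up for the example and are nowhere
near complete.

```python
from enum import Enum


class IdnaProperty(Enum):
    ALWAYS = "Always"
    MAYBE = "Maybe"
    NEVER = "Never"


# Ordered (class name, result) pairs. The first class that contains the
# code point decides the result, mirroring rules 1a through 7 above.
RULES = [
    ("Exceptional-Always", IdnaProperty.ALWAYS),   # 1a
    ("Exceptional-Never", IdnaProperty.NEVER),     # 1b
    ("Exceptional-Maybe", IdnaProperty.MAYBE),     # 1c
    ("Unnormalized", IdnaProperty.NEVER),          # 2
    ("Not-Case-Folded", IdnaProperty.NEVER),       # 3
    ("Ignorable", IdnaProperty.NEVER),             # 4
    ("Historical-Scripts", IdnaProperty.NEVER),    # 5
    ("Disallowed-Blocks", IdnaProperty.NEVER),     # 6
    ("Language-Characters", IdnaProperty.MAYBE),   # 7
]


def idna_property(cp, classes):
    """Return the property of code point cp.

    classes maps a class name to a set of code points; a class that is
    missing from the mapping is treated as empty.
    """
    for name, value in RULES:
        if cp in classes.get(name, ()):
            return value
    return IdnaProperty.NEVER                      # 8: exclude everything else


# Hypothetical, deliberately tiny membership data for demonstration only.
TOY_CLASSES = {
    "Exceptional-Always": set(range(0x61, 0x7B)) | set(range(0x30, 0x3A)) | {0x2D},
    "Not-Case-Folded": set(range(0x41, 0x5B)),
    "Language-Characters": set(range(0x41, 0x5B)) | set(range(0x3B1, 0x3CA)),
}
```

With that toy data, idna_property(ord("a"), TOY_CLASSES) yields ALWAYS via
rule 1a, ord("A") yields NEVER via rule 3 even though it is also in
Language-Characters, and U+03B1 falls through to rule 7 and yields MAYBE.
Because the rules are an ordered list, adding a code point to one of the
Exceptional classes overrides everything that follows.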


      a character is never removed from
      it unless it is removed from Unicode.

This is not necessary. If you really have to have it, then add "(however,
the Unicode stability policies expressly forbid this)"


Re. Appendix A. There seem to be some errors in the generation of this
table. The code point range should be "0x0000 - 0x10FFFF".


The derivation of the table did not correctly distinguish *unassigned* code
points from *noncharacter* code points. Unassigned code points are
"<reserved>" and are available for future encoding of characters, whereas
noncharacter code points are *not* "<reserved (for future assignment)>" --
they are designated functions, constitute a kind of internal private use,
and are disallowed for interchange. (See Table 2-3, TUS 5.0, p. 27.) If PUA
code points (e.g. U+E000..U+F8FF) are to be NEVER in this table, then the
noncharacters must be NEVER, rather than UNASSIGNED.


In general, having this Appendix A listing include UNASSIGNED code points is
both distracting (from the other, more meaningful values) and an error-prone
reduplication of effort. The listing of gc=Cn values is already available
directly from:

And that file *does* make the distinction between true unassigned code
points and noncharacter code points (both of which are gc=Cn, but which
differ in the Noncharacter_Code_Point property [see PropList.txt].) The
derivation for the IDN inclusion table needs to pay attention to *both*
gc=Cn and Noncharacter_Code_Point=True. What *would* make sense is for the
Appendix listing to correctly identify the noncharacters as NEVER. The fact
that it doesn't suggests that there is an error in the way the calculation
is handling Category D.
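
As a rough illustration of the distinction, the following Python sketch
separates noncharacters from truly unassigned code points. The noncharacter
set is fixed by the standard (U+FDD0..U+FDEF plus the last two code points
of each of the 17 planes), so it can be tested arithmetically. The helper
names are my own, and the gc=Cn test relies on whatever Unicode version
Python's unicodedata module ships, which is an assumption that will not
match the version used for the draft.

```python
import unicodedata


def is_noncharacter(cp):
    """True for the 66 code points with Noncharacter_Code_Point=True."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify_cn(cp):
    """Split gc=Cn into 'noncharacter' and 'unassigned'.

    Returns None for code points that are not gc=Cn in the Unicode
    version known to this Python build.
    """
    if is_noncharacter(cp):
        return "noncharacter"
    if unicodedata.category(chr(cp)) == "Cn":
        return "unassigned"
    return None
```

A derivation along these lines would map "noncharacter" to NEVER and leave
only "unassigned" as UNASSIGNED. Private-use code points such as U+E000 are
gc=Co, so classify_cn returns None for them and they are handled by a
separate rule.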


Another general issue with the document, table, and Section 3, Calculation
of the Derived Property: The possible values of the IDN property still
include a value MAYBE NOT, but in fact the calculation has no branch now
that assigns a MAYBE NOT value, and the table contains no MAYBE NOT
characters. Either the thinking about "MAYBE NOT" has changed, and the
document hasn't caught up to that yet, or there is an error in how the
calculation has been set up. As it is now, nearly all of the "MAYBE NOT"
values from the 01 version of this ID are now listed in the Appendix as
"NEVER". As "NEVER", they would be prohibited from any future consideration
for IDN, which seems at odds with the tenor of the text describing "MAYBE


Section 4. Codepoints states:

"The Categories and Rules defined in Section 2 and Section 3 apply to all
assigned Unicode characters." In fact, they apply to *unassigned* code
points as well.

The correct formulation would be:

"The Categories and Rules defined in Section 2 and Section 3 apply to all
Unicode codepoints, assigned or unassigned."

[Note: the Unicode Standard systematically uses a space in the term "code
point", as well as for "code unit", "code position", "code value", etc. But
given that this document uses "codepoint" everywhere, I'm not suggesting
that be changed. Nobody is going to be confused as to what the word means.]


"Once assigned to this category, a character is never removed from it unless
it is removed from Unicode."

The qualification "unless it is removed from Unicode" is vacuous. Since
Unicode 1.1, no character ever has been removed from Unicode, nor will any
be -- in part because no character will ever be removed from ISO/IEC 10646.

So this is a little like qualifying the definition of ASCII LDH
as "{0061..007A, 0030..0039, 002D} and no characters will be removed from
this definition unless they are removed from ASCII."

So I suggest just removing the vacuous qualification.


The grandfathering technique needs to be used so as to preserve stability,
since characters may change script. (See the email trail under "Table issues
(Part 2)" for details).
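
One way to picture grandfathering, as a hedged Python sketch: compare the
values already published with the values a fresh derivation would produce,
and pin every code point whose value would otherwise change by adding it to
the matching Exceptional class. The function name, the string values, and
the sample data are all hypothetical, and the sketch pins every change,
which may be stricter than the policy discussed in the email trail.

```python
from collections import defaultdict


def grandfathered_exceptions(published, rederived):
    """Return {exception class name: set of code points to pin}.

    published and rederived map code points to "Always", "Maybe" or
    "Never". A code point is pinned to its published value whenever the
    fresh derivation disagrees or no longer lists it.
    """
    exceptions = defaultdict(set)
    for cp, old_value in published.items():
        if rederived.get(cp) != old_value:
            exceptions["Exceptional-" + old_value].add(cp)
    return dict(exceptions)


# Hypothetical data: suppose U+0561 changed script between versions and
# would now derive as Never.
published = {0x0061: "Always", 0x0561: "Maybe"}
rederived = {0x0061: "Always", 0x0561: "Never"}
pinned = grandfathered_exceptions(published, rederived)
```

Here pinned contains a single entry, "Exceptional-Maybe", holding U+0561, so
the published value survives the change in the underlying property data.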