Table issues (was: Re: IDNAbis documents)

Wed Dec 5 00:42:46 CET 2007

Patrik,

> We have worked quite hard to, for the first time, really have all four  
> documents that are core of IDNAbis in sync.
> 
> They are:
> 
> - draft-klensin-idnabis-issues-05.txt
> - draft-klensin-idnabis-protocol-02.txt
> - draft-faltstrom-idnabis-tables-03.txt
> - draft-alvestrand-idna-bidi-01.txt

I'll focus on issues I've found in draft-faltstrom-idnabis-tables-03.txt,
leaving to others more qualified the concerns regarding the overall
architecture, the articulation of the four documents, implementation
issues, and so on.

My feedback will come in parts, as my analysis is ongoing.
I just thought, given the time constraints here, that it
might be useful to get some of the more evident feedback
to you quickly.

Re. Appendix A.

There seem to be some errors in the generation
of this table.

The code point range should be "0x0000 - 0x10FFFF", rather
than "0x0000 - 0x10FFFD", as there is no principled reason
to exclude consideration of the last two noncharacter
code points, U+10FFFE..U+10FFFF, when other noncharacter
code points, such as U+FFFFE..U+FFFFF, *are* included
in the table.
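
To make the range point concrete, here is a minimal sketch
(Python, not taken from the draft) that enumerates the
noncharacters from their definition: U+FDD0..U+FDEF plus the
last two code points of each of the 17 planes. The function
and variable names are illustrative only.

```python
def noncharacters():
    """Yield every noncharacter code point, computed from its definition."""
    # The contiguous block of 32 noncharacters in the BMP.
    yield from range(0xFDD0, 0xFDF0)
    # The last two code points of each of the 17 planes (0..16).
    for plane in range(17):
        base = plane << 16
        yield base | 0xFFFE
        yield base | 0xFFFF

nonchars = sorted(noncharacters())

# Split at the upper bound used by the table (0x10FFFD).
covered = [cp for cp in nonchars if cp <= 0x10FFFD]
left_out = [cp for cp in nonchars if cp > 0x10FFFD]

print(len(nonchars), len(covered), [f"U+{cp:04X}" for cp in left_out])
```

With an upper bound of 0x10FFFD, every noncharacter is covered
except the final pair, which is the inconsistency noted above.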

The derivation of the table did not correctly distinguish
*unassigned* code points from *noncharacter* code points.
Unassigned code points are "<reserved>" and are available
for future encoding of characters, whereas noncharacter
code points are *not* "<reserved (for future assignment)>" --
they have designated functions, constitute a kind of internal
private use, and are disallowed for interchange. (See Table 2-3,
TUS 5.0, p. 27.) If PUA code points (e.g. U+E000..U+F8FF)
are to be NEVER in this table, then the noncharacters
should be NEVERNEVERNEVER! ;-), rather than UNASSIGNED.

In general, having this Appendix A listing include UNASSIGNED
code points is both distracting (from the other, more
meaningful values) and an error-prone reduplication of
effort. The listing of gc=Cn values is already available
directly from:

http://www.unicode.org/Public/UNIDATA/extracted/DerivedGeneralCategory.txt

And that file *does* make the distinction between true
unassigned code points and noncharacter code points
(both of which are gc=Cn, but which differ in the
Noncharacter_Code_Point property [see PropList.txt].)
The derivation for the IDN inclusion table needs to
pay attention to *both* gc=Cn and Noncharacter_Code_Point=True.
What *would* make sense is for the Appendix listing to
correctly identify the noncharacters as NEVER. The
fact that it doesn't suggests that there is an error
in the way the calculation is handling Category D.
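
As a rough illustration of checking both properties, the
sketch below splits gc=Cn into noncharacters and true
unassigned code points. It is not the draft's algorithm.
It assumes Python's unicodedata module, which reflects
whatever Unicode version the interpreter ships with (not
necessarily 5.0) and does not expose Noncharacter_Code_Point,
so that property is computed here from its definition. The
function names are hypothetical.

```python
import unicodedata

def is_noncharacter(cp):
    """Noncharacter_Code_Point=True, computed from the definition."""
    # U+FDD0..U+FDEF, or low 16 bits equal to FFFE or FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def classify_cn(cp):
    """Return 'noncharacter' or 'unassigned' for gc=Cn, else None."""
    if unicodedata.category(chr(cp)) != "Cn":
        return None  # assigned, private use, surrogate, etc.
    return "noncharacter" if is_noncharacter(cp) else "unassigned"

for cp in (0x0041, 0xE000, 0x40000, 0xFFFFE, 0x10FFFF):
    print(f"U+{cp:04X}", unicodedata.category(chr(cp)), classify_cn(cp))
```

A derivation that looks only at gc=Cn cannot tell the two
groups apart; the extra test is what would let the table mark
noncharacters as NEVER and leave UNASSIGNED for code points
that are actually reserved.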

Another general issue with the document, table, and
Section 3, Calculation of the Derived Property: The
possible values of the IDN property still include
a value MAYBE NOT, but in fact the calculation has no
branch now that assigns a MAYBE NOT value, and the
table contains no MAYBE NOT characters. Either the
thinking about "MAYBE NOT" has changed, and the
document hasn't caught up to that yet, or there
is an error in how the calculation has been
set up. As it is now, nearly all of the "MAYBE NOT"
values from the 01 version of this ID are now listed
in the Appendix as "NEVER". As "NEVER", they would be
prohibited from any future consideration for IDN, which
seems at odds with the tenor of the text describing "MAYBE NOT".
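
A mismatch of this kind is easy to detect mechanically. The
sketch below is only an assumption about how one might check
it: the full value set in DECLARED is a guess (only MAYBE NOT,
NEVER, UNASSIGNED and CONTEXT are named in this message), the
(first, last, value) row format is hypothetical, and the two
toy rows are not copied from Appendix A.

```python
from collections import Counter

# Assumed set of declared property values (see the caveat above).
DECLARED = {"ALWAYS", "MAYBE YES", "MAYBE NOT",
            "NEVER", "CONTEXT", "UNASSIGNED"}

def unused_values(rows, declared=DECLARED):
    """Count code points per value; report declared values never used."""
    counts = Counter()
    for first, last, value in rows:
        counts[value] += last - first + 1
    return sorted(declared - set(counts)), counts

# Toy rows for illustration only.
rows = [
    (0xE000, 0xF8FF, "NEVER"),         # private use area
    (0xFFFFE, 0xFFFFF, "UNASSIGNED"),  # noncharacters, as currently listed
]

missing, counts = unused_values(rows)
print("declared but never assigned:", missing)
```

Run against the full generated table, a check like this would
flag any declared value, such as MAYBE NOT, that no branch of
the calculation ever produces.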

I have a number of issues with the new Category J
and its relation to the newly suggested "CONTEXT"
value for the property, but I'll take those up
separately.

Regards,

--Ken