<h1><a id="os4l" title="http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-03.txt" href="http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-03.txt">http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-03.txt

</a> </h1>

<h2>Overall Comments:</h2>

<p><br></p>

<p>Tables-1.<br><br>There is no operational difference between MAYBE

YES and MAYBE NO, and there are no characters in the latter. This

distinction is really only meaningful as internal tracking information

inside whatever group controls the future allocation of characters and

should not appear here. (See also Ken&#39;s email and trail under &quot;Table

issues (was: Re: IDNAbis documents)&quot;.)</p>

<p>Even further, MAYBE YES should not exist at all: a day or two of

work by script experts would be enough to move the vast majority of the

current &#39;MAYBE YES&#39; to the ALWAYS category.<br><br>Tables-2.<br><br>There

is a preference for Latin, Greek, Cyrillic, and Han which has no

principled basis. In particular, Latin, Cyrillic, and Han are some of

the most complicated scripts: Latin and Cyrillic, since they are used to

write a huge number of languages with a large number of variant

characters, and Han because of the history of character variations.

Many, many scripts are less problematic than Latin or Cyrillic, and

there is no reason to favor Cyrillic over say Armenian; it also gives

the appearance of Eurocentrism where none is intended.</p>

<p><br>

</p>

<p>From an old email:<br><br><span style="font-style: italic;">&quot;No

reason is given for the focus on only European scripts; and that focus

will surely raise suspicions in many circles. While I&#39;m sure that the

restriction to European languages is just because those are the ones

the small group of authors is familiar with, it will not be received

well. If &quot;we the community&quot; have &quot;experienced that a number of scripts

have issues that are not resolved&quot;, then those problems should be

enumerated *explicitly*, not hidden away.</span><br style="font-style: italic;"><br style="font-style: italic;"><span style="font-style: italic;">The

situation might be different if we were starting from zero; but we are

not. We already have an IDNA system that works for a great many people.

And while there are security problems with it, those are well known and

vendors are dealing with them. Moreover, of the problems that IDNAbis

solves, they are just the easy ones -- the harder ones are ones like

the &quot;<a href="http://paypal.com">paypal.com</a>&quot; case, which the current suggestion for IDNAbis doesn&#39;t

touch. So it feels like we are looking at a proposal that:</span><br style="font-style: italic;"><br style="font-style: italic;"><span style="font-style: italic;">1. doesn&#39;t actually help much with the practical problems that people face

</span><br style="font-style: italic;"><span style="font-style: italic;">2. solves the easy problems, but not the hard ones; so people have to essentially do the work anyway</span><br style="font-style: italic;"><span style="font-style: italic;">

3. and removes much of the functionality, except for some favored groups: Europe and the Americas&quot;</span><br><br>Tables-3. <br><br>The

CONTEXT class should be heavily restricted, as per Ken&#39;s email, to only

2 characters (see &quot;Table issues (Part 3: CONTEXT)&quot; for details).

Moreover, the term Context is problematic: <i>*many*</i> characters

are disallowed or allowed, depending on context. Even a-z are

disallowed in a field that also contains RTL characters. <br><br>Tables-4.<br><br>The list of historic scripts is very outdated. See <a id="ak8w" title="http://www.unicode.org/reports/tr31/tr31-8.html#Specific_Character_Adjustments" href="http://www.unicode.org/reports/tr31/tr31-8.html#Specific_Character_Adjustments">

http://www.unicode.org/reports/tr31/tr31-8.html#Specific_Character_Adjustments</a> for more details. The characters in Table 3 should also be reviewed as possible exceptions. <br><br>Tables-5.<br><br>Key

to the success of this is the group that determines the future

allocation of characters. It must be very clear precisely what the

grounds are for removing characters (moving from MAYBE to NEVER);

otherwise there will be never-ending battles over individual

characters. (Frankly, I believe that the correct course of action would

be to disallow the historic scripts for now, but allow the characters

in all other scripts, with very few exceptions.) <br><br>Tables-6.<br><br>Like <a href="http://www.ietf.org/internet-drafts/draft-alvestrand-idna-bidi-01.txt">draft-alvestrand-idna-bidi-01.txt</a>,

there should be at least one example motivating every case where a

class of characters is removed (this might be in one of the other

documents instead of here).<br><br>Tables-7.<br><br>The entire

description of the process is far too complicated for what is, at core,

a relatively simple process. It is further obfuscated by referring to

classes of characters by a letter category instead of a mnemonic.<br><br></p>

<p>Take the following from <a id="os4l" title="http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-03.txt" href="http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-03.txt">draft-faltstrom-idnabis-tables-03.txt

</a> </p>

<pre>      *  If the codepoint does not appear in any of the categories B<br>         (Section 2.1.2), C (Section 2.1.3), D (Section 2.1.4), E<br>         (Section 2.1.5) or F (Section 2.1.6), the value is ALWAYS.<br></pre>

That

formulation is completely opaque. For transparency, I&#39;d strongly

recommend reformulating it considerably. You could maintain part

of the structure that you have, if you wanted, by consistently using

mnemonics instead of Sections.<br><br>That is, give meaningful names to each Category in Section 2, such as:<br><br>A =&gt; Language-Characters<br>B =&gt; Unnormalized<br>C =&gt; Ignorable<br>D =&gt; Historical-Scripts<br>

E =&gt; Disallowed-Blocks<br>...<br><br>The

formulation can then be something like the following. (This is not

precisely equivalent to your formulation, which I found difficult to

follow -- it is the style of presentation that I&#39;m focusing on).<br><br>

<div>Use the following procedure to determine the IDNA-Property of any

code point cp. Proceed through the rules, and return a value at the

first that applies.<br><br>Exceptions<br>1a. If cp is in Exceptional-Always, return Always<br>1b. If cp is in Exceptional-Never, return Never<br>1c. If cp is in Exceptional-Maybe, return Maybe<br><br>Functional Exclusions

<br>2. Else if cp is in Unnormalized, return Never<br>3. Else if cp is in Not-Case-Folded, return Never<br>4. Else if cp is in Ignorable, return Never<br><br>Usage Exclusions<br>5. Else if cp is in Historical-Scripts, return Never

<br>6. Else if cp is in Disallowed-Blocks, return Never<br><br>LMN Inclusion<br>7. Else if cp is in Language-Characters, return Maybe<br><br>Exclude everything else<br>8. Else return Never<br><br></div>Note:

Exceptional-Always would contain your Category H Always characters,

plus grandfathered Always characters, plus a-z, 0-9, -;

Exceptional-Maybe would add the Category H Maybe characters, and so on.

The mechanism already described in email for providing perfect

stability would be to add characters, where necessary, to these classes.<br><br>
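<p>As a rough illustration of the first-match procedure above, here is a minimal Python sketch. It is not the draft's algorithm. The set contents are tiny hypothetical placeholders (only the a-z, 0-9, and hyphen entries come from the note above); real classes would be derived from Unicode property data. The names <code>IdnaProperty</code>, <code>RULES</code>, and <code>idna_property</code> are invented for this example.</p>

```python
from enum import Enum


class IdnaProperty(Enum):
    ALWAYS = "Always"
    MAYBE = "Maybe"
    NEVER = "Never"


# Hypothetical placeholder members, for illustration only. Real sets would
# be generated from the Unicode Character Database, not written by hand.
EXCEPTIONAL_ALWAYS = set(range(0x61, 0x7B)) | set(range(0x30, 0x3A)) | {0x2D}
EXCEPTIONAL_NEVER = set()
EXCEPTIONAL_MAYBE = set()
UNNORMALIZED = {0x212B}              # ANGSTROM SIGN (illustrative)
NOT_CASE_FOLDED = set(range(0x41, 0x5B))  # A-Z (illustrative)
IGNORABLE = {0x00AD}                 # SOFT HYPHEN (illustrative)
HISTORICAL_SCRIPTS = {0x10330}       # a Gothic letter (illustrative)
DISALLOWED_BLOCKS = {0x2500}         # a box-drawing character (assumed)
LANGUAGE_CHARACTERS = {0x00E9, 0x03B1, 0x0561}  # e-acute, alpha, Armenian ayb

# Rules are tried in order; the first class containing cp decides the value.
RULES = [
    ("Exceptional-Always", EXCEPTIONAL_ALWAYS, IdnaProperty.ALWAYS),   # 1a
    ("Exceptional-Never", EXCEPTIONAL_NEVER, IdnaProperty.NEVER),      # 1b
    ("Exceptional-Maybe", EXCEPTIONAL_MAYBE, IdnaProperty.MAYBE),      # 1c
    ("Unnormalized", UNNORMALIZED, IdnaProperty.NEVER),                # 2
    ("Not-Case-Folded", NOT_CASE_FOLDED, IdnaProperty.NEVER),          # 3
    ("Ignorable", IGNORABLE, IdnaProperty.NEVER),                      # 4
    ("Historical-Scripts", HISTORICAL_SCRIPTS, IdnaProperty.NEVER),    # 5
    ("Disallowed-Blocks", DISALLOWED_BLOCKS, IdnaProperty.NEVER),      # 6
    ("Language-Characters", LANGUAGE_CHARACTERS, IdnaProperty.MAYBE),  # 7
]


def idna_property(cp: int) -> IdnaProperty:
    """Return the value of the first rule whose class contains cp."""
    for _name, members, value in RULES:
        if cp in members:
            return value
    return IdnaProperty.NEVER  # rule 8: exclude everything else
```

<p>Because the exceptional classes are consulted first, grandfathering a character for stability amounts, in this sketch, to adding its code point to one of those sets: a code point placed in <code>EXCEPTIONAL_MAYBE</code> stays Maybe even if a later rule would otherwise return Never.</p>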

<h2>Details:</h2><br>Tables-8.<br><pre>      a character is never removed from<br>      it unless it is removed from Unicode.<br></pre>

<p>This is not necessary. If you really have to have it, then add &quot;(however, the Unicode stability policies expressly forbid this)&quot;</p><br>

<p><br></p>

<p>Tables-9.</p>

<p>Re. Appendix A. There seem to be some errors in the generation of

this table. The code point range should be &quot;0x0000 - 0x10FFFF&quot;.<br></p><br>

<p><br></p>

<p>Tables-10</p>

<p><br></p>

<p>The derivation of the table did not correctly distinguish

*unassigned* code points from *noncharacter* code points. Unassigned

code points are &quot;&lt;reserved&gt;&quot; and are available for future

encoding of characters, whereas noncharacter code points are *not*

&quot;&lt;reserved (for future assignment)&gt;&quot; -- they are designated

functions, constitute a kind of internal private use, and are

disallowed for interchange. (See Table 2-3, TUS 5.0, p. 27.) If PUA

code points (e.g. U+E000..U+F8FF) are to be NEVER in this table, then

the noncharacters must be NEVER, rather than UNASSIGNED.<br><br>Tables-10a</p>

<p><br></p>

<p>In general, having this Appendix A listing include UNASSIGNED code

points is both distracting (from the other, more meaningful values) and

an error-prone reduplication of effort. The listing of gc=Cn values is

already available directly from:<br><br><a href="http://www.unicode.org/Public/UNIDATA/extracted/DerivedGeneralCategory.txt" target="_blank">http://www.unicode.org/Public/UNIDATA/extracted/DerivedGeneralCategory.txt</a><br>

<br>And

that file *does* make the distinction between true unassigned code

points and noncharacter code points (both of which are gc=Cn, but which

differ in the Noncharacter_Code_Point property [see PropList.txt].) The

derivation for the IDN inclusion table needs to pay attention to *both*

gc=Cn and Noncharacter_Code_Point=True. What *would* make sense is for

the Appendix listing to correctly identify the noncharacters as NEVER.

The fact that it doesn&#39;t suggests that there is an error in the way the

calculation is handling Category D.<br></p><br>
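<p>The distinction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the placeholder result <code>"UNDECIDED"</code> are invented here, and the standard <code>unicodedata</code> module reflects whichever Unicode version ships with the Python build, not necessarily the version the draft targets. The noncharacter test uses the fixed set of 66 noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.</p>

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for the 66 code points with Noncharacter_Code_Point=True."""
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each of the 17 planes.
    return cp in range(0xFDD0, 0xFDF0) or cp % 0x10000 in (0xFFFE, 0xFFFF)


def table_value(cp: int) -> str:
    """Classify cp for the part of the table discussed in Tables-10/10a."""
    if cp not in range(0x110000):
        raise ValueError("not a Unicode code point")
    # Check noncharacters first: they are gc=Cn too, so testing gc alone
    # would wrongly report them as UNASSIGNED.
    if is_noncharacter(cp):
        return "NEVER"
    category = unicodedata.category(chr(cp))
    if category == "Cn":
        return "UNASSIGNED"  # truly reserved for future assignment
    if category == "Co":
        return "NEVER"       # private use, e.g. U+E000..U+F8FF
    return "UNDECIDED"       # hypothetical placeholder: left to other rules
```

<p>With this ordering, <code>table_value(0xFFFE)</code> and <code>table_value(0xE000)</code> both yield NEVER, while a code point in a currently empty plane is reported as UNASSIGNED.</p>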

<p><br></p>

<p>Tables-11</p>

<p><br></p>

<p>Another general issue with the document, table, and Section 3,

Calculation of the Derived Property: The possible values of the IDN

property still include a value MAYBE NOT, but in fact the calculation

has no branch now that assigns a MAYBE NOT value, and the table

contains no MAYBE NOT characters. Either the thinking about &quot;MAYBE NOT&quot;

has changed, and the document hasn&#39;t caught up to that yet, or there is

an error in how the calculation has been set up. As it is now, nearly

all of the &quot;MAYBE NOT&quot; values from the 01 version of this ID are now

listed in the Appendix as &quot;NEVER&quot;. As &quot;NEVER&quot;, they would be prohibited

from any future consideration for IDN, which seems at odds with the

tenor of the text describing &quot;MAYBE NOT&quot;.<br><br>Tables-12</p>

<p><br></p>

<p>Section 4. Codepoints states:<br><br>&quot;The Categories and Rules

defined in Section 2 and Section 3 apply to all assigned Unicode

characters.&quot; In fact they apply to *unassigned* code points as

well.<br><br>The correct formulation would be:<br><br>&quot;The Categories and Rules defined in Section 2 and Section 3 apply to all Unicode codepoints, assigned or unassigned.&quot;<br><br>[Note:

the Unicode Standard systematically uses a space in the term &quot;code

point&quot;, as well as for &quot;code unit&quot;, &quot;code position&quot;, &quot;code value&quot;, etc.

But given that this document uses &quot;codepoint&quot; everywhere, I&#39;m not

suggesting that be changed. Nobody is going to be confused as to what

the word means.]<br></p><br>

<p><br></p>

<p>Tables-13<br><br>&quot;Once assigned to this category, a character is never removed from it unless it is removed from Unicode.&quot;<br><br>The

qualification &quot;unless it is removed from Unicode&quot; is vacuous. Since

Unicode 1.1, no character ever has been removed from Unicode, nor will

any be -- in part because no character will ever be removed from

ISO/IEC 10646.<br><br>This is a quibble, but it is a little like qualifying

the definition of ASCII LDH as &quot;{0061..007A, 0030..0039, 002D} and no

characters will be removed from this definition unless they are removed

from ASCII.&quot;<br><br>So I suggest just removing the vacuous qualification.<br></p><br>

<p><br></p>

<p>Tables-14</p>

<p><br>The grandfathering technique needs to be used so as to preserve

stability, since characters may change script. (See the email trail

under &quot;Table issues (Part 2)&quot; for details).</p>
