IDNA protocol checking/processing

Michel Suignard michelsu at windows.microsoft.com
Sat Dec 1 03:36:49 CET 2007


As many have already said, the new documents show clear progress, and all the authors should be thanked for that. It is especially good to see the protocol document. There are, however, missing pieces, notably in the CONTEXT rules for ZWJ and ZWNJ and in the bidi special tests (section 4.4 of draft-klensin-idnabis-protocol-02.txt, tests 3 and 4).

I have proposed in the past new text for the bidi rules that does not require running the Unicode bidi algorithm (as suggested in section 4 of draft-alvestrand-idna-bidi-01.txt, referenced in test 4). By special-casing the NSM category, I believe we can avoid that step.

Concerning the CONTEXT rules (test 3), I am willing to prepare a set of rules for ZWJ/ZWNJ based on recent work by the UTC. These could then be added to the protocol document.

I would still like to see a simple test in the same section 4.4 that excludes a combining mark at the start of a label. It is possible to capture this in a modified bidi rule, but that makes the latter more complicated, so I would prefer a separate rule.
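
A minimal sketch of such a separate test, in Python. It assumes that "combining mark" means any character whose Unicode General_Category is Mn, Mc, or Me; that reading, and the function name `starts_with_combining_mark`, are illustrative assumptions and not text from any of the drafts.

```python
import unicodedata


def starts_with_combining_mark(label: str) -> bool:
    """Return True if the first character of the label is a combining mark.

    Assumption: "combining mark" is taken to mean General_Category
    Mn, Mc, or Me (every category whose name starts with "M").
    """
    if not label:
        return False  # an empty label has no first character to test
    return unicodedata.category(label[0]).startswith("M")
```

A label would be rejected when the function returns True, independently of the outcome of the bidi test.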

I won't enter into the mapping argument. I am OK with omitting the mapping from the normative part of the protocol, as long as the case mapping is specified as a clearly identified optional step, fully referenced in the same document.

Finally, I still think that the data set in draft-falstrom-idnabis-tables-3.txt should eventually be expressed as a new set of Unicode properties that are part of the Unicode database. With Unicode 5.1 coming out, this would be a good opportunity to define a set agreeable to all parties, as well as to create a mechanism for future versions of the standard. There are many more characters in the pipeline, including those used by minorities in Asia and Africa. There are many advantages, including easy data mining in either the flat text format or the forthcoming XML format. As of now, it is also not clear how the current specs avoid a hard link between IDN and a specific version of Unicode. Today the protocol makes normative use of terms such as NEVER, ALWAYS, MAYBE, etc., which are defined in an informative reference. We need to get to a point where the required data set can be defined as a profile registered with IANA, as I suggested in the idnaprep document a while ago.

Best regards,

Michel

---- For reference, my bidi test proposal follows: ----
Bidirectional Characters
Most characters are displayed from left to right, but some are displayed from right to left. This feature of Unicode is called "bidirectional layout", or "bidi" for short. The Unicode Standard has an extensive discussion of how to reorder glyphs for display when dealing with bidirectional text such as Arabic or Hebrew. See the Unicode Bidirectional Algorithm [UAX9] (Davis, M., "The Bidirectional Algorithm," September 2006.) for more information. In particular, all Unicode text is stored in logical order.

[Unicode] (The Unicode Consortium, "The Unicode Standard Version 5.0," October 2006.) defines several bidirectional categories; each character has one bidirectional category assigned to it. For the purposes of the requirements below, three categories are used:

RCat character
    Characters belonging to right-to-left scripts such as Hebrew, Arabic, Thaana, etc.
LCat character
    Characters belonging to left-to-right scripts such as Latin, Greek, Cyrillic, etc.
NSMCat character
    Combining marks.

The character properties "RCat", "LCat", and "NSMCat" are defined in appendix A.

The Unicode Bidirectional Algorithm [UAX9] (Davis, M., "The Bidirectional Algorithm," September 2006.) can result in various rearrangements of characters according to their direction. To prevent characters from rearranging across field boundaries, the following requirements MUST be met. An error is returned if these requirements are not satisfied.

a. The string MUST NOT contain any "RCat" character,
b. Or, if it does, the string must satisfy all of these requirements:
   1) The string MUST NOT contain any "LCat" character,
   2) The string MUST start with an "RCat" character,
   3) The string MUST either end with an "RCat" character, or end with an "RCat" character followed by a sequence of "NSMCat" characters.

Note that requirement 3 prohibits strings such as <U+0627, U+0031> ("aleph 1") but allows strings such as <U+0627, U+0031, U+0628> ("aleph 1 beh") and <U+078B, U+07A8, U+0788, U+07AC, U+0780, U+07A8> ("Divehi" in Thaana script, ending with an "NSMCat" character). [UAX9] (Davis, M., "The Bidirectional Algorithm," September 2006.) goes into great detail about the display order of strings that contain particular categories of characters in particular sequences.
---------
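
The bidi test proposed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: appendix A, which defines "RCat", "LCat", and "NSMCat", is not reproduced here, so the sketch substitutes the Unicode Bidi_Class values R and AL for "RCat", L for "LCat", and NSM for "NSMCat". The function name `bidi_label_ok` is hypothetical, and results depend on the Unicode version shipped with the Python runtime.

```python
import unicodedata

# Assumption: stand-ins for the appendix A properties, which are not
# available here. They are approximated with Unicode Bidi_Class values.
RCAT = {"R", "AL"}
LCAT = {"L"}
NSMCAT = {"NSM"}


def bidi_label_ok(label: str) -> bool:
    """Apply requirements a and b.1 to b.3 to a single string."""
    classes = [unicodedata.bidirectional(ch) for ch in label]

    # a. No "RCat" character at all: nothing further to check.
    if not any(c in RCAT for c in classes):
        return True

    # b.1 No "LCat" character may appear alongside an "RCat" character.
    if any(c in LCAT for c in classes):
        return False

    # b.2 The string must start with an "RCat" character.
    if classes[0] not in RCAT:
        return False

    # b.3 Skip any trailing run of "NSMCat" characters; the character
    # just before that run (or the last one) must be "RCat".
    end = len(classes)
    while end > 0 and classes[end - 1] in NSMCAT:
        end -= 1
    return end > 0 and classes[end - 1] in RCAT
```

Applied to the three strings in the note on requirement 3, the sketch rejects <U+0627, U+0031> and accepts the other two, without running the Unicode Bidirectional Algorithm itself.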

