Comments on IDNA Bidi

Michel Suignard michelsu at windows.microsoft.com
Mon Jan 14 23:01:03 CET 2008


Just a reminder that the bidi rules I proposed in my message dated 11/30/2007 to this list (excerpt below with some further editing) do not require implementing the bidi algorithm; they rely only on bidi properties and some positional conditions, and are much simpler to define and implement than the bidi algorithm itself. As such they are a mere update of the rules expressed in clause 6 of RFC 3454 (stringprep) and can be used as a processing rule in the idna200x protocol definition. I understand that validating these rules as appropriate may require running the bidi algorithm on strings that comply with them, but that is only a validation or proof-of-concept issue, not an implementation issue.

The rules I wrote then also implied that the bidi control characters excluded in idna2003 remain excluded in this new context.

Michel
----edited excerpt from 11/30/2007 message follows:---------
Bidirectional Characters

Most characters are displayed from left to right, but some are displayed from right to left. This feature of Unicode is called "bidirectional layout", or "bidi" for short. The Unicode Standard has an extensive discussion of how to reorder glyphs for display when dealing with bidirectional text such as Arabic or Hebrew. See the Unicode Bidirectional Algorithm [UAX9] (Davis, M., "The Bidirectional Algorithm," September 2006.) for more information. In particular, all Unicode text is stored in logical order.

[Unicode] (The Unicode Consortium, "The Unicode Standard Version 5.0," October 2006.) defines several bidirectional categories; each character has one bidirectional category assigned to it. For the purposes of the requirements below, three categories are used:

RCat character
Characters belonging to right-to-left scripts such as Hebrew, Arabic, Thaana, etc.
LCat character
Characters belonging to left-to-right scripts such as Latin, Greek, Cyrillic, etc.
NSMCat character
Combining marks.

The character properties "RCat", "LCat", and "NSMCat" are defined as follows:
RCat: character with Bidi_Class value of 'R' or 'AL' as specified in UnicodeData.txt.
LCat: character with Bidi_Class value of 'L' as specified in UnicodeData.txt.
NSMCat: character with Bidi_Class value of 'NSM' as specified in UnicodeData.txt.
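
As an illustration only, the three properties can be read off the Bidi_Class value with a few lines of Python. This is a sketch, not part of the proposal: the helper name bidi_cat is hypothetical, and Python's unicodedata module reflects whatever Unicode version the interpreter ships with, which is not necessarily Unicode 5.0.

```python
import unicodedata

def bidi_cat(ch):
    """Return 'RCat', 'LCat', 'NSMCat', or None for any other Bidi_Class.

    Hypothetical helper; relies on the Unicode data bundled with Python.
    """
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in ("R", "AL"):
        return "RCat"
    if bidi_class == "L":
        return "LCat"
    if bidi_class == "NSM":
        return "NSMCat"
    return None  # digits, punctuation, etc. fall in none of the three
```

For example, bidi_cat("\u0627") (Arabic alef) gives "RCat", while a European digit such as "1" falls in none of the three categories.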

The Unicode Bidirectional Algorithm [UAX9] (Davis, M., "The Bidirectional Algorithm," September 2006.) can result in various rearrangements of characters according to their direction. To prevent characters from rearranging across field boundaries, the following alternative requirements MUST be met. An error is returned if these requirements are not satisfied.

a.
The string MUST NOT contain any "RCat" character,

b.
Or, if it does, the string MUST satisfy all of these requirements:
1) The string MUST NOT contain any "LCat" character,
2) The string MUST start with an "RCat" character,
3) The string MUST either end with an "RCat" character, or end with an "RCat" character followed by a sequence of "NSMCat" characters.

Note that requirement b.3 prohibits strings such as <U+0627, U+0031> ("aleph 1") but allows strings such as <U+0627, U+0031, U+0628> ("aleph 1 beh"), and <U+078B, U+07A8, U+0788, U+07AC, U+0780, U+07A8> ("Divehi" in Thaana script, ending with an "NSMCat" character). [UAX9] (Davis, M., "The Bidirectional Algorithm," September 2006.) goes into great detail about the display order of strings that contain particular categories of characters in particular sequences.
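
To show that requirements a and b need only per-character properties and positions, here is a minimal Python sketch of the check. The function name check_bidi is hypothetical, it uses the Bidi_Class data bundled with Python rather than Unicode 5.0 specifically, and it does not cover the separate exclusion of bidi control characters.

```python
import unicodedata

def check_bidi(label):
    """Return True if the string satisfies alternative a or alternative b."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    is_rcat = [c in ("R", "AL") for c in classes]

    # a. No RCat character at all: nothing further to check.
    if not any(is_rcat):
        return True

    # b.1 No LCat character may appear alongside an RCat character.
    if "L" in classes:
        return False

    # b.2 The first character must be RCat.
    if not is_rcat[0]:
        return False

    # b.3 Skip any trailing NSMCat run; the character before it must be RCat.
    i = len(classes) - 1
    while i > 0 and classes[i] == "NSM":
        i -= 1
    return is_rcat[i]
```

Applied to the examples in the note, check_bidi("\u0627\u0031") is rejected, while check_bidi("\u0627\u0031\u0628") and the Thaana string are accepted. No reordering step is involved at any point.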
---------

