Comments on IDNA Bidi

Mon Jan 14 23:40:07 CET 2008

Michel said:

> Just to remember that the bidi rules that I proposed in my message
> dated 11/30/2007 to this list (excerpt below with some further 
> editing) do not require to implement the bidi algorithm but only 
> relies in bidi properties and some positional conditions, and are 
> much simpler to define and implement than the bidi algorithm itself.
> As such there are a mere update of the rules expressed in clause 
> 6 of RFC 3454 (stringprep) and can be used as processing rule in 
> the idna200x protocol definition. 

I want to second Michel's approach here.

If you look at the current text of bidi-02.txt, the
proposed fix for RFC 3454 is to rewrite the definition of
RandALCat and LCat characters as follows:

  For characters that have category "R", "AL" or "L", the
  category is fixed (UAX#9 defines them as having "strong"
  category);...

Note that that much is unchanged in Michel's textual approach.

  ... for characters in category EN, ES, ET, AN, CS, NSM, BN, B,
  S, WS and ON, the category is determined by applying the
  algorithm described in UAX#9 section 3.3 to the string.

But here, Michel's approach is much simpler. It focusses on
the main problem noted in RFC 3454, the problem of not
allowing labels to end with combining marks -- a problem
that was disallowing well-formed Dhivehi and Yiddish labels,
for example. That is also the main problem discussed
in Section 1 (and exemplified in Section 2) of bidi-02.txt.
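To make the problem concrete, here is a minimal sketch in Python of
requirements 2 and 3 of clause 6 of RFC 3454, applied to the Dhivehi
word "dhivehi" written in Thaana. The function name rfc3454_bidi_ok
is made up for illustration, the prohibited-character check
(requirement 1) is left out, and the bidi classes come from whatever
Unicode version the local Python ships with.

```python
import unicodedata

RANDAL = ("R", "AL")

def rfc3454_bidi_ok(label: str) -> bool:
    """Hypothetical helper: RFC 3454 clause 6, requirements 2 and 3 only."""
    cats = [unicodedata.bidirectional(ch) for ch in label]
    if not any(c in RANDAL for c in cats):
        return True                      # no RandALCat characters: nothing to check
    if "L" in cats:
        return False                     # requirement 2: no mixing with LCat
    # requirement 3: first and last characters must both be RandALCat
    return cats[0] in RANDAL and cats[-1] in RANDAL

# Thaana letters are bc=AL; the vowel signs (fili) are bc=NSM.
dhivehi = "\u078b\u07a8\u0788\u07ac\u0780\u07a8"
print([unicodedata.bidirectional(ch) for ch in dhivehi])
# ['AL', 'NSM', 'AL', 'NSM', 'AL', 'NSM']
print(rfc3454_bidi_ok(dhivehi))          # False: the label ends in a combining mark
```

The label is rejected only because its last character is a
combining mark, which is exactly the case at issue.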

Since the treatment of bc=NSM characters in the bidirectional
algorithm is trivial, and can be understood and specified
without applying the full algorithm, it is much, much simpler
to add a definition of NSMCat characters and tighten up the
specification of allowable label strings to include the
appropriate use of NSMCat values than to require an actual
application of the full bidirectional algorithm to determine
the final, contextual resolution of weak types (UAX#9
section 3.3.3) and neutral types (UAX#9 section 3.3.4).
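As a rough illustration of how little machinery this needs, here is
a sketch in Python of a property-plus-position check with an NSMCat
added. This is not Michel's rule text, which is not reproduced in
this message; it assumes, purely for illustration, that the only
change to clause 6 of RFC 3454 is to skip trailing bc=NSM characters
before testing the last character. The name bidi_ok_with_nsm is made
up, and the prohibited-character check is omitted.

```python
import unicodedata

RANDAL = ("R", "AL")

def bidi_ok_with_nsm(label: str) -> bool:
    """Hypothetical check: RFC 3454 clause 6 rules 2 and 3, ignoring trailing NSM."""
    cats = [unicodedata.bidirectional(ch) for ch in label]
    if not any(c in RANDAL for c in cats):
        return True                      # no RandALCat characters: nothing to check
    if "L" in cats:
        return False                     # still no mixing with LCat characters
    # Assumption: trailing NSMCat characters attach to the preceding base
    # character, so skip them when looking for the "last" character.
    end = len(cats)
    while end > 0 and cats[end - 1] == "NSM":
        end -= 1
    return cats[0] in RANDAL and end > 0 and cats[end - 1] in RANDAL

dhivehi = "\u078b\u07a8\u0788\u07ac\u0780\u07a8"   # ends in a Thaana vowel sign
print(bidi_ok_with_nsm(dhivehi))
```

Note that this only consults per-character bidi properties and
positions; no embedding levels or weak/neutral resolution are
computed anywhere.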

Also, as somewhat of an aside, the proposed wording quoted
above from bidi-02.txt is overly broad, even if it is
retained. bc=BN, while a weak bidi type, is never resolved
to a strong type by UAX#9 section 3.3.3; by rule X9 all BN
codes are logically removed from the string *before* any
resolution of weak types. And bc=B, bc=S, and bc=WS can never
occur in IDN labels in the first place, so you don't have to
deal with the complications of rule L1 for those, either.
And finally, bc=AN never gets resolved to one of the strong
types. So at the very least, the statement could be
simplified to:

  ... for characters in category EN, ES, ET, CS, NSM,
  and ON, the category is determined by applying the
  algorithm described in UAX#9 section 3.3 to the string.
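The reduction from the original list to the shorter one can be
written out mechanically. The following Python sketch just records
the reasons given above for dropping each class; the variable names
are made up for illustration.

```python
# Bidi classes named in the bidi-02.txt wording, in the order given there.
original = ["EN", "ES", "ET", "AN", "CS", "NSM", "BN", "B", "S", "WS", "ON"]

# Classes that need no contextual resolution, with the reason for each.
dropped = {
    "BN": "removed by rule X9 before weak types are resolved",
    "B":  "cannot occur in an IDN label",
    "S":  "cannot occur in an IDN label",
    "WS": "cannot occur in an IDN label",
    "AN": "never resolved to a strong type",
}

simplified = [c for c in original if c not in dropped]
print(simplified)   # ['EN', 'ES', 'ET', 'CS', 'NSM', 'ON']
```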

But I think Michel's focus on just dealing with bc=NSM is
much cleaner and still suffices to deal with the problem.

--Ken