Comments on IDNA Bidi

Mark Davis mark.davis at icu-project.org
Sat Jan 12 22:15:40 CET 2008


Fair enough.

If you assume that a URL is valid IDNAbis(draft) when it is actually
IDNA2003, you will break if

   1. you assume it is all case folded
   2. you assume that a non-spacing mark is at the end
   3. you assume that any of some 3000 characters are not in it
   4. you assume that ZWJ/NJ is insignificant
   5. you assume that a trailing combining mark in a label means it is
   not RTL
   6. you assume that a

and so on.

What I'm saying is that essentially all of the incompatible differences
between 2003 and the current bis are potential problems for some
implementation, and once we get done with bis, we will need to list them
all. So just calling out #5 is insufficient.
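
To make #5 concrete, here is a rough sketch of the shortcut in question, using Python's standard unicodedata module. The function names and the sample label are made up for illustration and are not taken from any draft or implementation. IDNA2003 requires a label that contains RTL characters to also end with one, so looking only at the last character is safe there; a label with a trailing NSM defeats it.

```python
import unicodedata

# Sample characters (illustrative choice, not from the drafts)
ALEF = "\u05D0"   # HEBREW LETTER ALEF, bidi class R
BET = "\u05D1"    # HEBREW LETTER BET, bidi class R
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ, bidi class NSM

RTL_CLASSES = {"R", "AL"}


def is_rtl_by_last_char(label: str) -> bool:
    """Hypothetical shortcut: decide direction from the last character only."""
    return unicodedata.bidirectional(label[-1]) in RTL_CLASSES


def is_rtl_by_any_char(label: str) -> bool:
    """Hypothetical safer check: any R or AL character makes the label RTL."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in label)


label = ALEF + BET + HIRIQ          # "ALEF BET <some NSM character>"
print(is_rtl_by_last_char(label))   # False: the shortcut treats it as LTR
print(is_rtl_by_any_char(label))    # True: the label is in fact RTL
```

The same pattern applies to the other items: each is a check that holds for every valid IDNA2003 label and stops holding once the rules change.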

What I'd suggest doing right now is having a placeholder in each of the
documents that says (in better text) that:

   1. Any of the differences from IDNA2003 could cause problems if an
   implementation assumes that a valid IDNA in a URL is actually IDNA2003.
   2. All of these will be called out once the document is further
   along.
   3. In the bidi doc, one example of such a problem is: <the one you
   have currently>.

Mark

On Jan 12, 2008 12:43 AM, Harald Alvestrand <harald at alvestrand.no> wrote:

> Mark Davis wrote:
>
> (deleting matters already covered in the exchange with Ken)
> >
> >
> >     >
> >     > Bidi-5.
> >     >    One particular example of the last case is if a program
> >     >    chooses to examine the last character (in network order) of
> >     >    a string in order to determine its directionality, rather
> >     >    than its first; if it finds an NSM character and tries to
> >     >    display the string as if it was a left-to-right string, the
> >     >    resulting display may be interesting, but not useful.
> >     >
> >     > I don't understand this paragraph. When and why would this
> >     > happen with IDNA-conformant programs?
> >     >
> >     I think the text is clear enough - if you get a label "ALEF BET
> >     <some NSM character>", an IDNA2003 program can look at the last
> >     character in the string and say "this is not a RTL string", and
> >     treat it as if it was LTR. In IDNA2003, that will be a safe
> >     assumption. In IDNAx, it will not be a safe assumption.
> >
> >
> > I find that a bit odd. The case you are taking is
> >
> > A program is looking at an IDNAbis URL, and thinks that it is a valid
> > IDNA2003 URL, and makes some assumptions about it, and things break.
> >
> > This case that you mention is just the tip of an iceberg. There are a
> > *very* large number of assumptions that a program can make about
> > IDNA2003 that will completely break under IDNAbis (as currently
> > drafted). Many, many things would break, not just this, and not just
> > this in BIDI. So I don't see why you are just calling out this one.
> Mark, this is not helpful.
>
> Speaking for IDNAbis-bidi ONLY:
>
> This is the *one* concrete example that people have come up with where
> an implementor could make a choice that might be reasonable to make in
> some context and actually have a concrete difference in behaviour
> between IDNA2003 and IDNAbis-bidi apart from the obvious one (that more
> characters are permitted). The concern was raised, and the text got added.
>
> If you can come up with another reasonable implementation choice that
> people could make because of IDNA2003 that would cause a difference in
> behaviour under IDNAbis-bidi that is not obvious, state it.
>
>                         Harald
>
>


-- 
Mark