Comments on IDNA Bidi

Sun Jan 13 20:20:23 CET 2008

Mark,

Now I'm confused.  Keep in mind that, in an actual URL, having
an IDN in anything but A-label (punycode processed) form is
invalid.  It is invalid under IDNA2003, it is invalid under
(draft)IDNAbis, and, more important than either of those, it is
invalid under RFC3986 (which defines the general syntax for
URIs) for any URI that requires that the "host" be either a
domain name or an IP address literal.  The "http" URL is an
example of a URI with that requirement.

So the first place I'm confused is that, if a string contains
codepoints above U+007F (or any of a significant number of
codepoints at or below it, since RFC3986 references an
instantiation of the "hostname" (LDH) rule), it is in violation
of the URL spec.
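
To make the LDH point concrete, here is a minimal sketch of a
host check.  The function name and the exact regular expression
are illustrative assumptions (my reading of the "hostname" rule,
not text quoted from RFC3986): letters, digits, and hyphens only,
no hyphen at either end of a label, 1 to 63 characters per label.

```python
import re

# One LDH label: 1-63 ASCII letters, digits, or hyphens, with no hyphen
# at either end. Hypothetical pattern, written from the rule as described.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_host(host: str) -> bool:
    """Return True if every dot-separated label of host is an LDH label."""
    if host.endswith("."):  # tolerate a single trailing root dot
        host = host[:-1]
    if not host:
        return False
    return all(_LDH_LABEL.fullmatch(label) for label in host.split("."))


print(is_ldh_host("xn--bcher-kva.example"))   # an A-label passes
print(is_ldh_host("b\u00fccher.example"))     # a non-ASCII label does not
```

Note that the check also rejects ASCII codepoints outside LDH,
such as the underscore, which is the "at or below" case.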

Certainly, as you, Martin, Erik, and others have pointed out in
various ways, there are many places in which strings appear that
look like URLs and don't conform to URL rules.   It may be
perfectly reasonable in some contexts to have a string that
looks like a URL but that contains non-ASCII characters.  But,
unless it is an IRI in a context in which IRIs are permitted,
one gets from such a string to a URL via exactly the sort of
preprocessing that we've been discussing as "user agent"
functionality in the IDNAbis context.
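
As a rough illustration of that preprocessing step, the sketch
below turns a URL-like string into a URL by converting only the
host.  It assumes Python's built-in "idna" codec, which
implements the IDNA2003 (RFC3490) operations; the function name
to_url is hypothetical, and userinfo in the authority is simply
dropped to keep the example short.

```python
from urllib.parse import urlsplit, urlunsplit


def to_url(url_like: str) -> str:
    """Replace a non-ASCII host with its A-label form (IDNA2003 rules)."""
    parts = urlsplit(url_like)
    host = parts.hostname or ""
    # The "idna" codec applies Nameprep and Punycode label by label.
    ascii_host = host.encode("idna").decode("ascii")
    netloc = ascii_host
    if parts.port is not None:
        netloc = f"{ascii_host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


print(to_url("http://b\u00fccher.example/katalog"))
```

Whether an application performs this step at all is exactly the
kind of front-end decision under discussion; nothing in the
sketch is mandated by the URL syntax itself.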

In contexts in which IRIs are permitted, they are still not
URLs.  More important, regardless of what one can write
lexically, any IRI that cannot be transformed into a valid URI
(using a very specific set of operations specified in the IRI
spec and including application of IDNA2003) is not a valid IRI.

Even the question of context is important, since many
applications are expected, in practice, to deduce the presence
of a domain name or URI (or IRI) even when it is placed in
running text and do something special with it.   The heuristics
used to do that are shaky and I think they are likely to get
shakier as we move to top-level IDNs (no clues from a small
set of ASCII domain names in the final positions).  Systems that
simply assume that any string with one or more periods before
trailing white space is a domain name are already in circularity
trouble and that
trouble gets worse if they are expected to apply the RFC3490
"dot" rules (need to know that the string is an IDN to apply the
rules; need to apply the rules to guess whether the string is an
IDN).
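
The circularity can be seen in a small sketch.  The function
name and the treat_as_idn flag are illustrative assumptions; the
four separator characters are the ones RFC3490 lists as dots.

```python
import re

# U+002E plus the three additional characters RFC3490 treats as dots.
IDNA_DOTS = re.compile("[\u002e\u3002\uff0e\uff61]")


def split_labels(text: str, treat_as_idn: bool) -> list:
    """Split text into labels; the result depends on a prior IDN guess."""
    if treat_as_idn:
        return IDNA_DOTS.split(text)
    return text.split(".")


candidate = "\u4f8b\u3002\u30c6\u30b9\u30c8"   # two labels joined by U+3002
print(split_labels(candidate, treat_as_idn=True))
print(split_labels(candidate, treat_as_idn=False))
```

The caller has to supply treat_as_idn before the labels exist,
yet the labels are what one would inspect to decide whether the
string is an IDN.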

Now, with that in mind, let's go through your list, considering
only IDNA2003 and the IDNAbis proposal and how they are
different.

--On Saturday, 12 January, 2008 13:15 -0800 Mark Davis
<mark.davis at icu-project.org> wrote:

> Fair enough.
> 
> If you assume that a URL is valid IDNAbis(draft) when it is
> actually IDNA2003, you will break if
> 
>    1. you assume it is all case folded

Since, if it contains any non-ASCII characters, it is required
to be an A-label, it is entirely case-folded in IDNA2003 _and_
IDNA200X.   So there is no difference.
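
For the IDNA2003 side, this is easy to check with a small
sketch.  It assumes Python's built-in "idna" codec as a
stand-in for ToASCII; the helper name is hypothetical, and the
sketch says nothing about any particular IDNA200X
implementation.

```python
def a_label_form(name: str) -> str:
    """Apply the IDNA2003 ToASCII operation to each label of name."""
    return name.encode("idna").decode("ascii")


upper = a_label_form("B\u00dcCHER.example")
lower = a_label_form("b\u00fccher.example")
print(upper, lower, upper == lower)
```

Both inputs yield the same A-label, so nothing that reaches the
URL in A-label form still carries the original case.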

>    2. you assume that a non-spacing mark is at the end

Since this is invalid in IDNA2003, it will never make it into an
A-label in that context and the assumption is necessarily false.

>    3. you assume that any of some 3000 characters are not in it

If you are referring to the characters mapped out by NFKC, those
characters are all mapped out before an A-label is formed and
none of them can actually appear in a URL.   If you are
referring to the symbols, line-drawing, and punctuation
characters, if they are not part of strings that are registered,
then they will fail on lookup with IDNA2003 and before lookup is
attempted with IDNA200X, which doesn't seem to me like a big
difference.  If strings containing them were registered contrary
to guidelines... see the discussion about those characters in
the comments I posted about your "issues-05" comments.

>    4. you assume that ZWJ/NJ is insignificant

Since these characters cannot be mapped into an A-label in
IDNA2003, the situation is identical to that of a non-spacing
mark.  The assumption is necessarily false.
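
A short sketch shows the joiners disappearing before the label
is formed.  It assumes Python's built-in "idna" codec
(IDNA2003 behavior, where Nameprep maps U+200C and U+200D to
nothing); the helper name is hypothetical.

```python
ZWJ = "\u200d"    # ZERO WIDTH JOINER
ZWNJ = "\u200c"   # ZERO WIDTH NON-JOINER


def to_ascii_2003(name: str) -> str:
    """Apply the IDNA2003 ToASCII operation to each label of name."""
    return name.encode("idna").decode("ascii")


with_joiners = to_ascii_2003("a" + ZWJ + "b" + ZWNJ + "c.example")
without_joiners = to_ascii_2003("abc.example")
print(with_joiners == without_joiners)
```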

>    5. you assume that a trailing combining mark in a label
> means it is not RTL

I'll have to let Harald or Cary respond to this one, but I
suspect it is similar to the other cases.

>    6. you assume that a
> 
> and so on.

It is also possible that I misunderstand what you mean by
"assume".   Neither an implementation of IDNA2003 nor an
implementation of IDNA200X is conformant with the intent of
those specifications if it "assumes" any of these things and
then goes off and behaves as if they are true.  In both cases,
implementations are expected to test the strings they intend to
pass (or intend others to pass) to the DNS so that
non-conforming strings will fail.  In IDNA2003, most of the
testing is built into ToASCII and the operations surrounding it.
In IDNA200X, much of the testing is more explicit.  But neither
assumes things that it doesn't verify.
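
A minimal sketch of "test, don't assume" on the IDNA2003 side
follows.  It relies on Python's built-in "idna" codec, which
raises UnicodeError when a label fails its checks (for example,
a label that mixes right-to-left and left-to-right letters); the
function name is hypothetical.

```python
from typing import Optional


def checked_a_label_form(name: str) -> Optional[str]:
    """Return the A-label form of name, or None if any label is rejected."""
    try:
        return name.encode("idna").decode("ascii")
    except UnicodeError:
        # A non-conforming string fails here instead of reaching the DNS.
        return None


print(checked_a_label_form("b\u00fccher.example"))
print(checked_a_label_form("a\u05d0.example"))   # Latin + Hebrew in one label
```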

Clearly, there is at least one other issue.  It arises for names
that are valid under IDNA200X but not obviously valid under
IDNA2003.  An IDNA2003 lookup implementation will reject some of
them as invalid (some or most of those that merely contain
codepoints that are unassigned in Unicode 3.2 but assigned in
later versions may slip through).  In the long term, the only
way to make all of the newly-available characters and strings
available to IDN-using applications is for implementations of
those applications to upgrade.  That would be true of any update
to IDNA that moves beyond Unicode 3.2, especially since
registration of strings that contain codepoints that are
unassigned at registration time is, fairly obviously, the worst
of bad practices.

Now I'm being a little pedantic here, for which I apologize, but
I think the point is important.   In the majority of the
cases you list above, what the strings occur in is not a URL,
but something that must be transformed into a URL.   The
transformation process, except for IRIs in IRI-valid contexts,
is outside the standards process because, some hand waving in
3490 notwithstanding, there is nothing that says "if you see
something that looks like a URL but isn't because it has
non-ASCII characters in the 'host' part, you are required to go
off and apply your favorite flavor of IDNA even if you haven't
heard of IDNA".  The impossibility of writing such a rule and
applying it to pre-IDNA applications and implementations should
be obvious.  We would hope that would happen, but the decision
to do it is a user interface (or front end, or preprocessor)
decision (assuming the implementation is late enough that its
designers even had the choice as an option).  At the risk of
repeating myself, that decision is a UI (or preprocessing, or
front-end) decision as surely as the UI decisions we have
discussed in the IDNA200X context.

Now I'm going to make two assumptions with which you may
disagree.  The first is that the IDNA200X model is sufficiently
different from the IDNA2003 one that few, if any, applications
are going to switch (or be able or inclined to switch) from
IDNA2003 to IDNA200X by a completely automatic process without
anyone thinking about it or noticing.  That is a disadvantage in
some ways and an advantage in others.   In particular, if the
IDNA2003-based application actually invokes ToASCII or ToUnicode
or any of the tables on which they depend, an automatic and
transparent conversion is impossible because neither those
operations nor Nameprep or Stringprep appear in IDNA200X.
People are going to need to think about how to make the change.

The second assumption is that any implementation that now
depends upon, or offers to users, the input flexibilities of
IDNA2003 (some applications of IDNA2003 do not) would be stupid
to implement IDNA200X in a way that simply drops those
flexibilities.  Whether it should quietly retain them, or
produce more or less subtle warnings to users about the
conversions, becomes a local design matter (and programs that
communicate with users obviously have choices that are not
available to ones that do not).  It appears to me that we are already
heading in the direction of applications (and, if that approach
isn't stopped for other reasons, "smart domain name servers")
making decisions about some things being safer than others and
conditioning their actions on those decisions.

The rationale document doesn't cover that situation nearly well
enough at -05, but there is a new section and extensive text
about it in the working version of -06.  I don't think anything
there will come as a surprise, since all of the issues have been
discussed on this list and much of the text is derived from
discussions on the list.  Unfortunately, there is a tradeoff: if
I have an hour to spend on this work, I can spend it either
working on the document(s) or responding to sometimes-repetitive
questions.

> What I'm saying is that essentially all of the incompatible
> differences between 2003 and the current bis are potential
> problems for some implementation, and once we get done with
> bis, we will need to list them all. So just calling out #5 is
> insufficient.

While our perspective on these "incompatible differences" is
quite different, I hope that the new text in issues-06 will
address many of your concerns.  But it is also true that many of
those differences are differences in how and when IDNA is
applied that are simply not defined by the original protocol or
are differences that are important only if applicability
principles or guidelines about the use of the original protocol
were violated.  If adjustments in those areas are impossible,
then we are in very difficult waters indeed.

>...

best,
    john