Reminder: BIDI inter-label tests in -02
John C Klensin
klensin at jck.com
Fri Sep 5 22:32:20 CEST 2008
--On Friday, 05 September, 2008 11:15 +0200 Harald Alvestrand
<harald at alvestrand.no> wrote:
> For all those of you who care about the bidi interlabel issue,
> the following text is in -02:
>
> o The BIDI test MAY return failure if the BIDI rule is not
> satisfied by the label following the label that contains
> AL, AN or R in the domain name. For all the reasons
> given above, it may be impossible to know the following
> label, but there seems no or negative value to allowing
> the BIDI test to succeed if the following label is
> known. [[POSSIBLY CONTROVERSIAL]]
>
> In the example Alireza gave, this would mean that the bidi
> test is allowed to fail on the <ALEF>.3.com domain name, but
> won't fail on the 3.<ALEF>.com domain name - which would at
> least make sure the document gives guidance on the decision
> on which of the two names is going to be considered "valid"
> by whatever registry-specific or application-specific logic
> people implement to solve the problem elsewhere.
>
> (As a personal preference, I would prefer to make it a SHOULD,
> or even a "MUST if the following label is known by the BIDI
> test" - but that did not seem to be the WG's consensus in
> Dublin, so that's not what the text says).
>
> This editor needs WG direction on whether to either remove
> this bullet, reword this bullet, or remove the [[POSSIBLY
> CONTROVERSIAL]] tag.
Harald,
FWIW, let me describe where I think we emerged from Dublin on
this (some of this falls into the category of what I think of as
corollaries to the discussions, not the meeting discussions
themselves).
(1) Any requirement for inter-label checking is a
showstopper for DNS reasons and will remain a
showstopper regardless of anything this WG may or may
not wish to conclude. Put differently, including a
requirement for inter-label checking in a document is
just a way to ensure that the document will be shot down
by the DNS community during Last Call. Andrew or
others who raised the issue in Dublin might want to
clarify or affirm this, but my impression is any
statement that uses 2119-normative language (other than
MAY) would constitute a requirement in that regard.
(2) URIs do not contain domain names in U-label form.
It is, at best, in poor stylistic taste for them to
contain non-ASCII characters in the domain field using
percent-encoding of U-labels rather than A-labels.
Because there are no manifest RtoL characters in URIs
(because there are no non-ASCII characters), there are
no RtoL-related URI display issues.
(3) Some referral/indirection URIs constitute an
interesting challenge. Regardless of what current (and
draft) versions of the URI and IRI specifications may say (or be
construed as saying), the domain-part of a URI is
clearly a "domain name slot" as that term is defined in
IDNA2003 (the definition in IDNA2008 is no different,
but I want to stress that this is not a new decision).
As such, it is expected to contain a U-label or A-label
(IDNA2008 terminology) and not some other encoding (%nn
form for octets or otherwise). On the other hand, the
tail and its substrings are not inherently domain name
slots. So, with apologies to Frank and RFC 2606 and the
hope that people will have close enough approximations
to UTF-8 MUAs to be able to get the gist of what is
happening here, if one had an IRI similar to
http://www.favorite-search-engine.пример/mumble=fubar&q=http://www.пример.com/
then one would probably expect the last label in
www.favorite-search-engine.пример to be in A-label
form in the URL, but would expect "пример" in the string
"www.пример.com" to be mapped into a string of
%-encoded octets of its UTF-8 form. That poses some
interesting problems for the software trying to un-map
the referral that are independent of any RtoL issues,
but possibly identifies just how complicated it can be
to make subtle inter-label tests even within a URL
(remember that, in principle, nothing other than the
host at www.favorite-search-engine.xn--e1afmkfd knows
that "http://www.пример.com/" is an embedded URL
containing an IDN, even though it obviously looks like
one. As far as anything else is concerned, that latter
string is just running text).
So one can have the "running text" problem, with or
without RtoL characters, even inside a URL and without
worrying about "paragraphs".
If this is a problem for IRIs (and whether it is one is
debatable), it is not a problem for this WG.
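
To make the example in (3) concrete, here is a minimal Python sketch of the two encodings involved. The helper names are hypothetical, and Python's built-in "idna" codec implements IDNA2003 ToASCII, so this only approximates what an IDNA2008 implementation would do. It shows the host label converted to an A-label while the embedded URL, which nothing but the receiving host knows to be a URL, is simply percent-encoded as UTF-8 running text.

```python
from urllib.parse import quote, unquote


def host_to_a_labels(host: str) -> str:
    """Convert each label of a host name to A-label form.

    Assumption: uses Python's built-in "idna" codec (IDNA2003 ToASCII).
    """
    return host.encode("idna").decode("ascii")


def embed_as_running_text(text: str) -> str:
    """Percent-encode arbitrary text as UTF-8 octets for use inside a URL tail."""
    return quote(text, safe="")


# Strings taken from the example in the message.
host = host_to_a_labels("www.favorite-search-engine.пример")
embedded = embed_as_running_text("http://www.пример.com/")

print(host)      # host part: the last label is now an A-label
print(embedded)  # tail part: %-encoded UTF-8, no A-label anywhere

# Un-mapping the tail recovers the original text, but nothing in the
# encoded form marks it as a URL that contains an IDN.
assert unquote(embedded) == "http://www.пример.com/"
```

The asymmetry is the point: the same string "пример" ends up in two unrelated ASCII forms within one URL, so any inter-label test would first have to decide which parts of the URL are domain name slots at all.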
Now, while I have never been an advocate of positions like "we
can't address all of the cases and solve all of the problems,
therefore we should do nothing", I'm finding that this leads me
to a position close to Alireza's conclusion (if I understand
that conclusion correctly). However, I also see zone policies
and registration procedures as an important part of the
protocol. To me, that means removing all of the normative
language from the bulleted paragraph above and replacing it with
some lavish advice that points out the nasty things that can
happen when naive (or not-so-naive) rendering engines display
labels containing certain types of characters in certain
positions next to labels containing certain other types of
characters. I think that advice should explain the cases, give
examples, and indicate (i) that administrators of zones that
contain RtoL characters in labels or that point into such zones
(via CNAME, DNAME, and maybe URI-containing NAPTR records) ought
to be very careful what they do and wish for, lest massive user
confusion and astonishment occur, and (ii) that application
software that renders these strings in native-character form
(certainly including URI-to-IRI conversion and display programs)
ought to be very sensitive to these issues as well, perhaps
contriving to warn users that what they are seeing might not be
what they might expect to see.
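
As an illustration of what such non-normative advice could look like in practice, the following Python sketch produces warnings rather than failures when a label containing RtoL characters is followed by a label that might render confusingly. All function names are hypothetical. The per-label condition used here (first character must have bidi class L, R, or AL) is a deliberately simplified stand-in chosen for this sketch; it is not the BIDI rule as written in the -02 draft.

```python
import unicodedata

# Bidi classes named in the quoted bullet: AL, AN or R.
RTL_CLASSES = {"R", "AL", "AN"}

# Assumption: simplified stand-in for "the BIDI rule", not the draft's text.
ALLOWED_FIRST_CLASSES = {"L", "R", "AL"}


def has_rtl(label: str) -> bool:
    """True if any character in the label has bidi class R, AL or AN."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in label)


def passes_simplified_rule(label: str) -> bool:
    """Stand-in check: the first character has bidi class L, R or AL."""
    return unicodedata.bidirectional(label[0]) in ALLOWED_FIRST_CLASSES


def advisory_warnings(domain: str) -> list[str]:
    """Return warnings (never a hard failure) for risky adjacent labels."""
    labels = [label for label in domain.split(".") if label]
    warnings = []
    for current, following in zip(labels, labels[1:]):
        if has_rtl(current) and not passes_simplified_rule(following):
            warnings.append(
                f"label {following!r} follows RtoL label {current!r} "
                "and may display in an unexpected order"
            )
    return warnings


ALEF = "\u0627"  # ARABIC LETTER ALEF, bidi class AL

print(advisory_warnings(f"{ALEF}.3.com"))  # one warning: "3" follows ALEF
print(advisory_warnings(f"3.{ALEF}.com"))  # no warnings: "com" follows ALEF
```

This mirrors the two names from Alireza's example as Harald describes them: the check complains about <ALEF>.3.com but not about 3.<ALEF>.com. Because it only returns warnings, a zone administrator or a rendering application can decide what to do with them, which keeps the check advisory.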
Much as I'd like to do more, I don't see a path that would
permit us to do so.
john