Reminder: BIDI inter-label tests in -02

Fri Sep 5 22:32:20 CEST 2008

--On Friday, 05 September, 2008 11:15 +0200 Harald Alvestrand
<harald at alvestrand.no> wrote:

> For all those of you who care about the bidi interlabel issue,
> the following text is in -02:
> 
>    o  The BIDI test MAY return failure if the BIDI rule is not
>       satisfied by the label following the label that contains
>       AL, AN or R in the domain name.  For all the reasons
>       given above, it may be impossible to know the following
>       label, but there seems no or negative value to allowing
>       the BIDI test to succeed if the following label is
>       known.  [[POSSIBLY CONTROVERSIAL]]
> 
> In the example Alireza gave, this would mean that the bidi
> test is allowed to fail on the <ALEF>.3.com domain name, but
> won't fail on the 3.<ALEF>.com domain name - which would at
> least make sure the document gives guidance on the decision
> on which of the two names is going to be considered "valid"
> by whatever registry-specific or application-specific logic
> people implement to solve the problem elsewhere.
> 
> (As a personal preference, I would prefer to make it a SHOULD,
> or even a "MUST if the following label is known by the BIDI
> test" - but that did not seem to be the WG's consensus in
> Dublin, so that's not what the text says).
> 
> This editor needs WG direction on whether to either remove
> this bullet, reword this bullet, or remove the [[POSSIBLY
> CONTROVERSIAL]] tag.
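
To make the quoted example concrete, here is a minimal Python sketch
of the inter-label check.  It is an illustration only, not text from
the draft: the function names are hypothetical, and the per-label
BIDI rule is replaced by an assumed simplification (a label passes if
its first character has bidi class L, R, or AL).

```python
import unicodedata

# Bidi classes named in the quoted bullet.
RTL_CLASSES = {"R", "AL", "AN"}
# ASSUMPTION: simplified stand-in for the per-label BIDI rule.
ALLOWED_FIRST = {"L", "R", "AL"}


def has_rtl(label):
    """True if any character of the label has class R, AL, or AN."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in label)


def passes_simplified_bidi_rule(label):
    """Simplified check: an empty label passes; otherwise only the
    first character's bidi class is examined."""
    if not label:
        return True
    return unicodedata.bidirectional(label[0]) in ALLOWED_FIRST


def interlabel_test_may_fail(domain):
    """True if some label containing R/AL/AN is followed by a label
    that does not pass the simplified rule."""
    labels = domain.split(".")
    for current, following in zip(labels, labels[1:]):
        if has_rtl(current) and not passes_simplified_bidi_rule(following):
            return True
    return False


ALEF = "\u0627"  # ARABIC LETTER ALEF, bidi class AL

print(interlabel_test_may_fail(ALEF + ".3.com"))       # True
print(interlabel_test_may_fail("3." + ALEF + ".com"))  # False
```

Under those assumptions the sketch agrees with the quoted reading:
<ALEF>.3.com may fail because "3" follows the label containing AL,
while 3.<ALEF>.com does not because "com" follows it.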

Harald,

FWIW, let me describe where I think we emerged from Dublin on
this (some of this falls into the category of what I think of as
corollaries to the discussions, not the meeting discussions
themselves).

	(1) Any requirement for inter-label checking is a
	showstopper for DNS reasons and will remain a
	showstopper regardless of anything this WG may or may
	not wish to conclude.  Put differently, including a
	requirement for inter-label checking in a document is
	just a way to ensure that the document will be shot down
	by the DNS community during Last Call.  Andrew or
	others who raised the issue in Dublin might want to
	clarify or affirm this, but my impression is that any
	statement that uses 2119-normative language (other than
	MAY) would constitute a requirement in that regard.

	(2) URIs do not contain domain names in U-label form.
	It is, at best, in poor stylistic taste for them to
	contain non-ASCII characters in the domain field using
	percent-encoding of U-labels rather than A-labels.
	Because there are no manifest RtoL characters in URIs
	(because there are no non-ASCII characters), there are
	no RtoL-related URI display issues.  

	(3) Some referral/indirection URIs constitute an
	interesting challenge.  Regardless of what current (and
	draft) versions of the URI and IRI specifications may say
	(or be construed as saying), the domain-part of a URI is
	clearly a "domain name slot" as that term is defined in
	IDNA2003 (the definition in IDNA2008 is no different,
	but I want to stress that this is not a new decision).
	As such, it is expected to contain a U-label or A-label
	(IDNA2008 terminology) and not some other encoding (%nn
	form for octets or otherwise).  On the other hand, the
	tail and its substrings are not inherently domain name
	slots.  So, with apologies to Frank and RFC 2606 and the
	hope that people will have close enough approximations
	to UTF-8 MUAs to be able to get the gist of what is
	happening here, if one had an IRI similar to

	http://www.favorite-search-engine.пример/mumble=fubar&q=http://www.пример.com/

	then one would probably expect the last label in
	www.favorite-search-engine.пример to be in A-label
	form in the URL, but would expect "пример" in the string
	"www.пример.com" to be mapped into a string of
	%-encoded octets of the UTF-8 form.  That poses some
	interesting problems for the software trying to un-map
	the referral that are independent of any RtoL issues,
	but possibly shows just how complicated it can be
	to make subtle inter-label tests even within a URL
	(remember that, in principle, nothing other than the
	host at www.favorite-search-engine.xn--e1afmkfd knows
	that "http://www.пример.com/" is an embedded URL
	containing an IDN, even though it obviously looks like
	one; as far as anything else is concerned, that latter
	string is just running text).

	So one can have the "running text" problem, with or
	without RtoL characters, even inside a URL and without
	worrying about "paragraphs".

	If this is a problem for IRIs (and whether it is one
	is debatable), it is not a problem for this WG.
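
The mapping described in (3) can be sketched in Python as follows.
This is only an illustration under stated assumptions: the helper name
to_a_label is hypothetical, Python's built-in "idna" codec implements
the IDNA2003 ToASCII operation (not IDNA2008), and keeping ":" and "/"
unescaped in the embedded string is a choice made here to mirror the
example, not something any specification requires.

```python
from urllib.parse import quote


def to_a_label(label):
    """Convert one label to its ASCII (A-label) form using Python's
    built-in IDNA2003 codec.  ASCII labels come back unchanged."""
    return label.encode("idna").decode("ascii")


# The host is a domain name slot, so each label goes to A-label form.
host_labels = ["www", "favorite-search-engine", "пример"]
host = ".".join(to_a_label(label) for label in host_labels)

# The tail is not a domain name slot: the embedded string is just
# text, so its non-ASCII characters become %-encoded UTF-8 octets.
embedded = quote("http://www.пример.com/", safe=":/")

url = "http://" + host + "/mumble=fubar&q=" + embedded
print(url)
```

Nothing in the resulting URL marks the percent-encoded run as a
domain label, which is the "running text" problem noted above.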

Now, while I have never been an advocate of positions like "we
can't address all of the cases and solve all of the problems,
therefore we should do nothing", I'm finding that this leads me
to a position close to Alireza's conclusion (if I understand
that conclusion correctly).  However, I also see zone policies
and registration procedures as an important part of the
protocol.  To me, that means removing all of the normative
language from the bulleted paragraph above and replacing it with
some lavish advice that points out the nasty things that can
happen when naive (or not-so-naive) rendering engines display
labels containing certain types of characters in certain
positions next to labels containing certain other types of
characters.  I think that advice should explain the cases, give
examples, and indicate (i) that administrators of zones that
contain RtoL characters in labels or that point into such zones
(via CNAME, DNAME, and maybe URI-containing NAPTR records) ought
to be very careful what they do and wish for lest massive user
confusion and astonishment occur and (ii) that application
software that renders these strings in native-character form
(certainly including URI-to-IRI conversion and display programs)
ought to be very sensitive to these issues as well, perhaps
contriving to warn users that what they are seeing might not be
what they expect to see.

Much as I'd like to do more, I don't see a path that would
permit us to do so.

     john