Label separators (was: Re: Urdu and SPACE, FULL STOP (Re: comments on IDNAbis: draft-faltstrom-idnabis-tables-04.txt Arabic))

Mon Feb 25 18:36:26 CET 2008

--On Monday, 25 February, 2008 09:51 +0800 YAO Jiankang
<yaojk at cnnic.cn> wrote:

> 
> ----- Original Message ----- 
> From: "John C Klensin" <klensin at jck.com>
> To: "Sarmad Hussain" <sarmad.hussain at nu.edu.pk>
> Cc: <idna-update at alvestrand.no>
> Sent: Sunday, February 24, 2008 1:15 AM
> Subject: Label separators (was: Re: Urdu and SPACE, FULL STOP
> (Re: comments on IDNAbis:
> draft-faltstrom-idnabis-tables-04.txt Arabic))
> 
> 
> ...
>> The IDNA200X version is equivalent to "the only valid label
>> separator on the wire or in interchange is ASCII period.
>> However, since we have prohibited all other punctuation
>> characters (other than hyphen) from ever actually appearing
>> in a domain name, if you need to use a convention locally to
>> permit easier typing of that character, you can substitute any
>> convenient punctuation (or other disallowed) character for
>> it...

> if future  IDNA200X version does not allow other label
> separator except the ASCII period, the IDNA application such
> as IE browser would not support other label separator and
> would  regard these separators as illegal ones.

First of all, I think you are significantly overestimating the
relative importance of standards as compared to perceptions of
good user experience and protection of users.  You should note,
for example, that every major browser today will display
punycode under circumstances in which it decides that the
native-character string would be unsafe, despite a rather clear
requirement in RFC 3490 that they not do so.

Second, as has been explained repeatedly, applications, even
IDNA-unaware applications, must be able to parse domain names
into labels to convert to internal form.  Some will need to do
that before applying IDNA; others will not apply IDNA at all.
To do that parsing, the only possible label-separator --at least
without changes to the DNS and DNS interfaces that go well
beyond the bounds of IDNA-- is the label separator specified by
RFC 1034/1035.  Anything else (including the language in RFC
3490 about conversions) is just a temporary typing/substitution
convention that must be converted to ASCII period/dot before any
other application that reads or interprets domain names gets to
it.  The only differences between the proposed IDNA200X approach
and the approach in RFC 3490 in that regard are:

(i) The list of characters that one might sensibly want to treat
as label separators is apparently not correct in RFC 3490 (e.g.,
the Arabic/Urdu Full Stop was not included).  We don't know how
many other characters, in other scripts, people might want to
use as surrogate dots, whether they are identified as "full
stop" or not.   Demanding that people make conversions globally
from a changing list is not a realistic goal, as I hope would be
obvious.

(ii) RFC 3490 is, in retrospect, a little vague about when the
conversion/remapping needs to be made, leading to problems with
software that does not intend to call on IDNA but that does need
to parse labels.  It says (Section 3.1(1)) "recognized as dots"
but then (Section 3.1(2)) "IDN-unaware ... changing all the
label separators to U+002E".   The difficulty is that, while one
can usually know for certain when a "domain name slot" is
IDN-aware, there is no way to know definitively when one is
IDN-unaware.  The reason is that domain names are passed back
and forth among applications, copied and pasted, and otherwise
manipulated in contexts in which a domain lookup (and hence
invocation of IDNA from IDN-aware applications) is not going to
happen any time soon.  Requiring that the conversion be
performed before the domain name is stored --so that a stored
domain name that attempts to use anything but ASCII period as a
separator is simply invalid (possibly modulo very local
application of the robustness principle)-- eliminates that
difficulty.
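
As a rough illustration only (not text from any of the drafts), the
"convert before the name is stored" rule might be sketched in Python
as below.  The function names and the local_extras parameter are
hypothetical; the three built-in characters are the ones RFC 3490
Section 3.1(1) lists alongside U+002E.

```python
# Surrogate dots listed in RFC 3490 Section 3.1(1), besides U+002E itself:
# ideographic full stop, fullwidth full stop, halfwidth ideographic full stop.
RFC3490_DOTS = "\u3002\uff0e\uff61"


def normalize_separators(name, local_extras=""):
    """Replace surrogate label separators with ASCII period (U+002E).

    local_extras is a hypothetical hook for a locally agreed convention
    (for example U+06D4); it is not part of any specification.
    """
    table = {ord(ch): "." for ch in RFC3490_DOTS + local_extras}
    return name.translate(table)


def store_domain_name(name, storage, local_extras=""):
    """Convert at the point of entry so only the ASCII-period form is stored."""
    stored = normalize_separators(name, local_extras)
    storage.append(stored)
    return stored


def split_labels(stored_name):
    """What any later reader, IDN-aware or not, can do with the stored form."""
    return stored_name.split(".")
```

Once the stored form uses only ASCII period, a later consumer can
split it into labels without knowing anything about IDNA or about
which surrogate list was in force when the name was typed.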

> you said "if you need to use a convention locally to permit
>  easier typing of that character, you can substitute any
> convenient punctuation (or other disallowed) character for
> it...", I am a little afraid of how we can do it without the
> support of IDNA200X version ?   Install some plug in to
> translate these separators into ASCII period  before we store
> them in a file or transmit them on the wire?

Whatever works for you.  My instinct, based on some experience,
would be to choose a substitute character that would not
normally appear with characters in your script but that would be
easier to type than the ASCII period and then map to and from it
as a TTY function, not in the application.   If that is
infeasible or provides a poor user experience, then it would
need to be done in the application.   But, again, the
application behavior is no different from the application
behavior specified in RFC 3490: whatever reads the character stream
and identifies a string as a domain name must make the necessary
conversion.
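
A minimal sketch of such a local "map to and from" convention, in
Python, assuming a single substitute character chosen locally.  The
function names are hypothetical, and U+06D4 appears in the usage
below only because it is the character under discussion in this
thread; which character is actually suitable is a local decision.

```python
def from_keyboard(text, local_dot):
    """Input side: turn the locally typed substitute into ASCII period."""
    return text.replace(local_dot, ".")


def to_screen(text, local_dot):
    """Output side: show ASCII period as the local substitute character."""
    return text.replace(".", local_dot)


# Assumed example choice of substitute: ARABIC FULL STOP (U+06D4).
typed = "label1\u06d4label2"
on_the_wire = from_keyboard(typed, "\u06d4")   # ASCII-period form
shown_again = to_screen(on_the_wire, "\u06d4")  # back to the local form
```

Everything between the two functions (files, the clipboard, other
applications, the wire) sees only the ASCII-period form.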

If we could somehow be convinced that 

	-- by adding ARABIC FULL STOP (U+06D4) to the list in
	RFC 3490 Section 3.1(1) and going through the code
	charts and adding any other sentence-separator (whatever
	that means) we find, we would be finished for all time
	_and_ 

	-- we understood how to make domain names organized as
	labels separated by the dots in that list parseable into
	length-label lists by applications that can handle
	Unicode but do not specifically implement IDNA,

then it is clear (at least to me) that the right thing to do
would be to simply update the list and operations of RFC 3490.
But neither is true.  Even if we captured all of the reasonable
surrogate label separators today, Unicode might add additional
ones along with new scripts in future versions.   And, if a
domain name appears in running text -- e.g., if I say something
about idn1.idn2.example.com in this paragraph-- it simply isn't
possible to know whether the context in which the domain name
will be used is IDN-aware or not, or whether an
application that needed those labels in the other format would
have to implement IDNA before it attempted to parse the string.
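
To make the "length-label list" point concrete, here is a small
Python sketch of the RFC 1034/1035 style encoding as an application
that does not implement IDNA might perform it.  The function name is
hypothetical, and the sketch deliberately assumes ASCII input split
on ASCII period only.

```python
def to_length_label(name):
    """Encode a dotted ASCII name as length-prefixed labels plus a root label."""
    out = bytearray()
    for label in name.rstrip(".").split("."):
        # An IDNA-unaware encoder has no rule for non-ASCII input;
        # here that surfaces as UnicodeEncodeError.
        raw = label.encode("ascii")
        if not 1 <= len(raw) <= 63:
            raise ValueError("label length must be 1..63 octets")
        out.append(len(raw))
        out += raw
    out.append(0)  # zero-length root label terminates the name
    return bytes(out)
```

Given "idn1.idn2.example.com" this yields four labels followed by
the root.  Given the same string with U+3002 in place of the first
dot, the split finds only three pieces and the first one is not
ASCII, so the encoder fails rather than producing the intended
labels.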

To repeat something I (and others) have said before, IDNA (in
both the 2003 and proposed 200X versions) is a patch to permit
accommodating non-ASCII DNS labels without making changes to the
DNS.  It is a rather clever patch.  It has done its job of
permitting IDNs to be deployed much more quickly (in the places
that really wanted them) than we would have seen with actual DNS
modifications.   But, like almost all such patches, it has some
rough edges which we have to live with as the price of getting
that quick solution.  In the case of IDNA2003, while there are
some more general problems that many of us believe should be
fixed in a comprehensive way (see RFC 4690 and
draft-klensin-idnabis-issues), there are at least three
fundamental issues that none of us anticipated (at least
specifically) that must be addressed to have the protocol "work"
even at the fairly minimal level anticipated for a Proposed
Standard.   They are:

	(i) IDNA2003 specifies the conditions (very few) under
	which punycode-converted ACE strings may be displayed to
	the user.  Applications often do not have sufficient
	information to make that decision competently.

	(ii) IDNA2003 specifies interpretation of, or conversion
	to, three characters as dots.  The list isn't quite long
	enough (hence the Arabic/Urdu problem) and the
	application may not have sufficient information to know
	how to parse and convert FQDNs at the time these
	characters are encountered, as discussed above and
	elsewhere.

	(iii) IDNA2003 permits some characters to be used in DNS
	labels that pose serious parsing or equivalent problems
	for contexts in which domain names appear.  The most
	prominent examples of such characters are things that
	look like slashes or colons.  This is not merely a
	matter of spoofing but raises more fundamental problems
	in which the parsing that users may expect is different
	from the parsing that will actually occur.
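
The parsing mismatch in (iii) can be illustrated with a short Python
sketch.  U+2044 FRACTION SLASH is used here as an assumed example of
a character that looks like a slash; the message above does not name
specific code points, and the helper name is hypothetical.

```python
FRACTION_SLASH = "\u2044"  # assumed example of a slash look-alike


def host_of(url_tail):
    """Host part as a simple parser sees it: text before the first ASCII '/'."""
    return url_tail.split("/", 1)[0]


# A reader may take this to be host "example.com" followed by a path.
displayed = "example.com" + FRACTION_SLASH + "login.example.net/index.html"

host = host_of(displayed)   # the look-alike does not end the host
labels = host.split(".")    # labels as the parser would actually see them
```

Here the parser sees a single four-label host ending in
"example.net", with the look-alike character buried inside the
second label, while the eye reads "example.com" as the host.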

There is one more that we can't fix within the design concept of
IDNA but that we need to be aware of in these discussions:

	(iv) There are characters that are clearly different,
	and that are assigned to different code points, but that
	should match in comparisons.  With the traditional DNS,
	we get around the distinction between how a character is
	coded (and displayed) and what it matches by using a
	server-side matching procedure that has some small
	intelligence about the matter.   With IDNA, since we
	cannot do sophisticated server-side matching (to do so
	would have required significant changes to the DNS), we
	have to rely entirely on mapping, which loses
	information.   We see this problem most prominently with
	position-sensitive characters, but there are other
	cases.   Registry policies modeled on the JET work can
	help with this problem and some closely-related ones by
	ensuring that, at least, two similar labels cannot be
	registered by different parties.  But there is no
	general solution within the IDNA context.
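
The loss of information through mapping in (iv) can be seen with
Python's built-in "idna" codec, which implements the IDNA2003
(RFC 3490) operations.  The choice of Greek sigma and German sharp s
as test characters is an assumption made for illustration; the
wrapper name is hypothetical.

```python
def idna2003_form(label):
    """ToASCII as performed by Python's built-in IDNA2003 codec."""
    return label.encode("idna")


FINAL_SIGMA = "\u03c2"   # form used at the end of a word
MEDIAL_SIGMA = "\u03c3"  # form used elsewhere

# Both forms map to the same ACE string, so the distinction is gone
# by the time the label reaches the DNS.
same_on_the_wire = idna2003_form(FINAL_SIGMA) == idna2003_form(MEDIAL_SIGMA)

# Converting back cannot recover which form was originally typed.
round_trip = idna2003_form(FINAL_SIGMA).decode("idna")
```

The two sigma forms compare equal after mapping, which gives the
desired matching, but the round trip returns the medial form even
when the final form was entered.  The same codec maps "fu\u00df" to
the plain ASCII label "fuss".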

--john