Rule H (was: Re: New version, draft-faltstrom-idnabis-tables-02.txt, available)

Wed Jun 13 00:01:34 CEST 2007

--On Tuesday, June 12, 2007 18:56 +0200 JFC Morfin 
<jefsey at jefsey.com> wrote:

> At 17:31 12/06/2007, Paul Hoffman wrote:
>> At 3:53 PM +0200 6/12/07, JFC Morfin wrote:
>>> IDNA made a distinction between countries on the ASCII TLD
>>> basis.
>>
>> This is not true, and I believe you are quite aware that it
>> is not true.
>
> Dear Paul,
> May be my Franglish logic was confusing. Anyway, what is of
> concern today is the way Rule H is perceived when reading
> IDNAbis. A blunt list makes a difference from an acceptable
> description of the conditions for a script to be accepted,
> even if the resulting list is the same.

Jefsey,

Now we are getting to the point.

I intensely dislike having Rule H.  I think that dislike is 
shared by Patrik, Harald, Cary, Tina and others.  I also don't 
think we have so far explained it, and the reasons for it, very 
well, and I'd appreciate the help of others in coming up with a 
better explanation.  But we have concluded, sadly and painfully, 
that it is necessary, at least for the short term.

You (and others) may reasonably disagree, but please read 
Harald's recent explanation, the one that follows, or both 
(I hope they are consistent and complementary) and then try to 
help us with this rather than stirring up more FUD.

What we have discovered is that there is controversy about the 
optimal (or even adequate) way to handle many scripts, 
especially when the same script is used as all or part of the 
writing system for different languages.  Sometimes the problem 
involves differences in opinion about presentation forms. 
Sometimes one must know the language in order to sort out 
presentation issues correctly (and the DNS does not provide 
for transmission of language information).  Sometimes, although 
they are primarily unusual edge cases, there are even questions 
about whether the codings and rules present in Unicode 5.0 are 
sufficient to handle some particular writing system adequately, 
whether some of the characters of a script are associated with 
the correct set of properties or not, and so on.   In each case, 
those uncertainties are opportunities for user confusion, 
astonishment, or disappointment and sometimes for not being able 
to write the words of some languages -- should one want to use 
those words in DNS labels -- in a consistent and correct fashion.

Incidentally, the part of the IDNA200x model that is most 
different from the earlier one is another consequence of the 
problem outlined above:  Unicode provides compatibility mappings 
and case mappings that are reasonable but that may not be precisely 
correct (or "as would be expected") in all cases.  IDNA2003 
applies those mappings as part of the protocol.   IDNA200x 
treats them as localization issues -- to be applied as desired 
as part of the localization process, but not to appear "on the 
wire" or as part of references to be used in interchange, 
thereby lowering the risks of incorrect or unpredictable 
behavior.

These distinctions are also important because we have discovered 
cases in which, in order to make it possible to express more than 
a few words in the writing systems of some languages, characters 
must be permitted that, while not problems in those particular 
scripts (and usages more generally), would be problematic if used 
in other contexts.   So, for example, while IDNA2003 prohibited 
zero-width breaking and non-breaking characters entirely, we now 
have special rules that permit those characters only in 
contexts in which they are helpful (or necessary) rather than 
potentially harmful.

In practical terms, Rule H has to be understood in conjunction 
with the implications of its categories for registration and 
lookup.   For lookup, the "permitted", "maybe yes", and "maybe 
no" categories are all equivalent.    A process looking up a 
string need only verify that none of the prohibited characters 
are present and can then rely on the trust that strings that 
should not have been registered will not be found.

By contrast, classification as "maybe yes" or "maybe no" 
implies that entities registering strings should refrain from 
registrations dependent on those scripts until they are 
confident that issues associated with them are resolved or that 
there are no issues.
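
To make the lookup/registration asymmetry concrete, here is a minimal 
sketch in Python.  Everything in it is an assumption made for 
illustration: the toy table, the placeholder characters, and the 
function names are invented here and are not taken from the tables 
draft.  It also shows only the most cautious registration policy 
(accept "permitted" only); a real registry would make its own choices.

```python
from enum import Enum


class Category(Enum):
    PERMITTED = "permitted"
    MAYBE_YES = "maybe yes"
    MAYBE_NO = "maybe no"
    PROHIBITED = "prohibited"


# Hypothetical table: these assignments are placeholders, NOT the
# classifications from the actual property tables.
TOY_TABLE = {
    "a": Category.PERMITTED,
    "b": Category.PERMITTED,
    "x": Category.MAYBE_YES,
    "y": Category.MAYBE_NO,
    "!": Category.PROHIBITED,
}


def classify(ch: str) -> Category:
    # Assumption of this sketch: anything not listed is prohibited.
    return TOY_TABLE.get(ch, Category.PROHIBITED)


def ok_for_lookup(label: str) -> bool:
    # Lookup treats "permitted", "maybe yes", and "maybe no" alike:
    # it only checks that no prohibited character is present.
    return bool(label) and all(
        classify(ch) is not Category.PROHIBITED for ch in label
    )


def ok_for_cautious_registration(label: str) -> bool:
    # A cautious registry holds off on anything still in a "maybe"
    # category and accepts only "permitted" characters.
    return bool(label) and all(
        classify(ch) is Category.PERMITTED for ch in label
    )
```

With this toy table, a label such as "axy" would pass the lookup 
check but be declined by the cautious registration check, which is 
the asymmetry described above.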

Now, in practical terms, the IETF can't dictate policies to 
registries or to ICANN (or to anyone else who thinks they have a 
concern in this area) and (I hope) would not want to try.   My 
personal opinion is that these categories will work out as 
follows on the registry side:

	* A ccTLD will make its own decisions about what
	scripts, used to write languages of importance in that
	country, should be treated as being "permitted",
	rather than "maybe yes".  They will follow the
	tables about other, less familiar, scripts or perhaps
	not register those at all, regardless of where they fall
	in the property table.   I would hope that they --or
	others in their countries-- would participate in the
	effort needed to move their scripts (and produce
	script-specific rules as needed) into the
	"permitted" category, but, if they don't care
	about the use of the script elsewhere in the world or in
	gTLDs enough to do that, perhaps no one should care
	about their opinions.

	I note that, while our understanding has improved in the
	last three or four years, we have always known that safe
	and successful deployment of IDNs was going to be easier
	for a ccTLD that could make "this script, or its use
	by that language, is more important than those other
	ones" decisions than for gTLDs that are presumably
	required to be equally fair to everyone.

	* gTLDs will be held to registrations from the
	"permitted" category only and will hence be strongly
	motivated to work with appropriate language authorities
	to come up with definitions that work well globally.

But those are just my personal opinions.  The more important 
thing is that we figure out, together, how to make this system 
work well -- as a foundation for globally-accessible and usable 
references -- for the Internet.

Your notes, and those of several others on this list and 
otherwise, have raised another issue having to do with the 
relationship of "language" to all of this.  I'll address that in 
a separate note.

    john