character set for Nepali IDN

John C Klensin klensin at jck.com
Tue Feb 19 19:02:56 CET 2008



--On Tuesday, 19 February, 2008 21:01 +0500 Sarmad Hussain
<sarmad.hussain at nu.edu.pk> wrote:

> 
> Dear John and all,
> 
> Agreed that some of the restrictions (PVALID --> DISALLOWED)
> can go to language tables, at the registry level.  
> 
> However, it is important that none of the PVALID
> characters in a language (as determined by its community) are
> labelled as DISALLOWED by the IDNAbis revision process,
> because it would not be possible to override DISALLOWED status
> through the language tables.

In principle, I certainly agree with this.  In practice, we need
to be very careful that we do not escalate that principle to a
firm rule, at least in that form.

I hope the situation I'm concerned about never arises in actual
practice, because the discussions would inevitably be very
painful and hard to resolve.   But suppose a character turns up
that is an important element of the writing system for some
language and yet would be seriously problematic, either for the
Internet as a whole or for some other language that (mostly)
shares the same script.  Treating that character as an
ordinary Protocol-Valid one is then just not going to work, even
if it means that some words of the language simply cannot be
used as IDN labels.

It is probably worth pointing out that this is nothing new:
contrary to a popular assumption, the subset of ASCII that we
call "LDH" is not sufficient to write all words of English.
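
As a small illustration of that point, here is a minimal Python
sketch of an LDH check.  The function name is_ldh_label is just
something made up for this example, and the rule it encodes
(ASCII letters, digits and hyphen, no leading or trailing
hyphen, at most 63 octets) is the conventional hostname-label
rule as I am assuming it here.

```python
import string

# Letters, Digits, Hyphen: the conventional hostname repertoire.
LDH = set(string.ascii_letters + string.digits + "-")

def is_ldh_label(label: str) -> bool:
    """Return True if `label` is a plausible LDH label.

    Hypothetical helper for illustration only: non-empty, at most
    63 characters, LDH characters only, no leading/trailing hyphen.
    """
    return (
        0 < len(label) <= 63
        and all(ch in LDH for ch in label)
        and not label.startswith("-")
        and not label.endswith("-")
    )
```

With this check, "example" passes, while ordinary English
spellings such as "don't" (apostrophe) or "naïve" (diaeresis)
do not.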

ZWJ and ZWNJ are partial examples of the problem because they
are important in the right contexts but completely invisible
(and hence potentially disastrous) when used in other contexts
and with other scripts.   With IDNA2003, those two characters
were simply banned because of that problem: a word that required
either one simply could not be expressed as a valid label.   One
of the big innovations in the IDNA200X proposals that has not,
IMO, been discussed nearly enough is the idea of permitting some
globally-problematic characters by restricting the contexts in
which they can be used.  So ZWJ and ZWNJ are permitted, but
_only_ for scripts in which they have a significant presentation
effect and the need for them is unambiguous when transcribing a
domain name from paper into a computer.
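
To make the idea of a contextual restriction concrete, here is a
rough Python sketch.  It assumes one possible form of the rule,
namely that a joiner is acceptable only immediately after a
virama (canonical combining class 9).  That specific rule and
the function name are assumptions made for the example, not a
statement of what the proposal specifies.

```python
import unicodedata

ZWNJ, ZWJ = "\u200c", "\u200d"
VIRAMA_CCC = 9  # canonical combining class shared by virama characters

def joiners_in_valid_context(label: str) -> bool:
    """Return True if every ZWJ/ZWNJ directly follows a virama.

    Assumed, simplified context rule for illustration only.
    """
    for i, ch in enumerate(label):
        if ch in (ZWNJ, ZWJ):
            # A joiner at the start, or after a non-virama, is rejected.
            if i == 0 or unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
                return False
    return True
```

Under that assumption, the Devanagari sequence
"\u0915\u094d\u200d\u0937" (KA, VIRAMA, ZWJ, SSA) is accepted,
while "ab\u200dcd", where the joiner is invisible and has no
effect, is rejected.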

This is not a complaint about anything anyone has done or not
done, but the most useful data we could get right now (at least
from my perspective) isn't "we need this character" but rather
"that character is problematic for some languages, but we need
it and it seems to us safe to use it if the following
restrictions are applied...".   If the character is something
that would normally be considered a letter or digit, the current
rules are almost certain to pick it up and make it
Protocol-Valid (but your verifying that against Patrik's rules
and tables is very important).  But suppose the character is,
for some reason, an edge case that, in terms of its Unicode
properties, wouldn't fall naturally into "letter".  We need to be
told about those, but it is even more important for us to know
what restrictions might be necessary or appropriate to prevent
problems.
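
One way to find such edge cases in a sample of text is to list
the characters whose Unicode general category is not one of the
letter, digit or combining-mark categories.  The sketch below is
only a crude approximation under that assumption; it is not a
substitute for checking against Patrik's rules and tables, and
the category set and function name are my own choices for the
example.

```python
import unicodedata

# Assumption: "letter or digit" is approximated by these general
# categories (letters, decimal digits, and combining marks).
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Lm", "Nd", "Mn", "Mc"}

def edge_case_characters(text: str) -> list[str]:
    """List characters of `text` outside the assumed categories,
    formatted as 'U+XXXX Category'."""
    return [
        f"U+{ord(ch):04X} {unicodedata.category(ch)}"
        for ch in text
        if unicodedata.category(ch) not in LETTER_DIGIT_CATEGORIES
    ]
```

Run over a word list for a language, the output is a starting
list of characters for which the community would need to say
both "we need this" and "here are the restrictions under which
it is safe".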

Put more broadly, if we are going to have an IDN system that
works well globally, we must take care that global
interoperation is our primary criterion for success rather than
the ability to write the literature of any given language in the
DNS.  The ability to use any valid word as a label should be an
important target, but, if we are to succeed, we must not let it
become the primary goal.

>  That is the case because
> applications will not allow users to type in DISALLOWED
> characters in the IDNs (as is the current practice). 

Applications will do what they do.  As others have heard me say
too often on other lists, we need to avoid too much belief that
the IETF's making a standard will, in and of itself, dictate
behavior that everyone will follow.  Under the current practice,
registries can register characters or sequences that IDNA200X
prohibits as actual DNS entries.  As applications have evolved
and their authors have become convinced that they have an
obligation to protect their users, such registrations may not be
looked up and are likely to be displayed in ACE form, which is
certainly not what anyone wants.  Perhaps worse, different
applications interpret
the rules and their obligations to users in different ways,
leading to somewhat unpredictable behavior as far as the user is
concerned.  

Viewed from that perspective, the IDNA200X proposal attempts to
regularize the situation by giving applications clear guidance
about the labels that they should or should not look up and the
user or page designer the information that explicit use of
U-labels is much less likely to cause problems than dependency
on either local or Nameprep-like mapping.   But, under either
the IDNA2003 protocol or the IDNA200X proposal, if a registry
starts registering labels containing prohibited characters, it
must do so with the understanding that such labels may not be
looked up and handled in the way that the user might expect.
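
For what "displayed in ACE form" looks like from the
application side, here is a hedged Python sketch.  The names
to_ace and display_form, and the is_acceptable callback, are
hypothetical; the sketch also skips any Nameprep-style mapping
and simply Punycode-encodes the label as given.

```python
from typing import Callable

def to_ace(label: str) -> str:
    """Return the ACE ("xn--") form of a label.

    Illustration only: no mapping or validity checks are applied.
    """
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

def display_form(label: str, is_acceptable: Callable[[str], bool]) -> str:
    """Show the native form only if the application's own policy
    (the hypothetical `is_acceptable`) accepts the label; otherwise
    fall back to the ACE form."""
    return label if is_acceptable(label) else to_ace(label)
```

Because each application supplies its own policy, the same
registered label can show up in native form in one program and
as an "xn--" string in another, which is the unpredictability
described above.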

     john



