New version of strawman for IDNAv2

John C Klensin klensin at jck.com
Fri Feb 27 05:17:09 CET 2009



--On Thursday, February 26, 2009 19:46 -0800 Paul Hoffman
<phoffman at imc.org> wrote:

> This tees into John's recent thread on parsing the issues and
> finding a middle ground. I have included many of the
> suggestions from the mailing list and off-line responses. Most
> significantly, I have changed ZWNJ and ZWJ from "mapped to
> nothing" to being allowed so that Arabic labels will be more
> realistic.

Paul,

With the understanding that I still don't believe this is the
right way to go, one technical correction and one issue:

(1) ZWJ and ZWNJ are not needed for Arabic language orthography.
ZWNJ is needed for Persian languages and what are sometimes
called Indo-Arabic ones (e.g., Urdu, but there are _many_
others).  Both ZWJ and ZWNJ are needed for several of the Indic
scripts and associated languages (although slightly fewer with
Unicode 5.1 than with Unicode 3.2).

(2) When one considers the number of registries/zones on the
Internet or even those that exist only at the second level
(i.e., maintaining registrations for third-level names), it is
certain that some of them will be operated by people with bad
intentions.  Given that, are you confident that ZWJ/ZWNJ can
simply be treated as ordinary characters, relying on the
registries to prevent those characters where they would be fully
invisible?

When faced with that question very early in the IDNA2008 design
process, we concluded that there were four possible answers:

	(i) Yes, we trust the registries and are willing to live
	with labels like "ábc" failing to compare equal to
	"áb<ZWJ>c" despite looking exactly the same when
	displayed by normal rendering software.
	
	(ii) We don't quite trust the registries but are
	confident that all rendering software, on all operating
	systems, that encounters strings like "áb<ZWJ>c" will
	get upset in sufficiently vivid ways to warn the user off.
	We didn't think rendering ZWJ as a little box or
	question mark would be adequate for that case because it
	might be a legitimate character for which no font was
	available even though it would at least not be confused
	with "ábc".
	
	(iii) We either leave things as they are in IDNA2003
	(map to nothing) or simply ban the character.  Either
	one puts the scripts that need one or both of these
	characters at an intolerable disadvantage.
	
	(iv) We adopt some sort of "contextual rule" model,
	despite the complexity it adds.

Obviously, we chose the fourth.  We did so because we didn't
believe the assumptions that (i) or (ii) implied and did not
consider (iii) to be acceptable given the number of people who
use the relevant scripts.   As I read your document, you are
proposing (i).   Is that correct and, if so, could you explain a
bit better how you see the tradeoffs?
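
To make the comparison problem in option (i) concrete, here is a minimal Python sketch using the two example labels from the list above. The helper name visible_code_points is hypothetical and exists only for illustration; dropping general category Cf characters is an assumed, rough stand-in for "what a renderer shows", not a model of any real rendering engine.

```python
import unicodedata

ZWJ = "\u200d"  # U+200D ZERO WIDTH JOINER

plain = "\u00e1bc"                 # "ábc"
with_zwj = "\u00e1b" + ZWJ + "c"   # "áb<ZWJ>c"


def visible_code_points(label):
    """Hypothetical helper: drop format (Cf) characters such as ZWJ/ZWNJ.

    This only approximates what a user might see; it is an assumption
    made for this illustration, not a rendering model.
    """
    return "".join(ch for ch in label if unicodedata.category(ch) != "Cf")


# Code-point comparison: the two labels are different strings.
labels_equal = plain == with_zwj

# Unicode normalization does not remove ZWJ, so NFC does not help.
nfc_equal = (unicodedata.normalize("NFC", plain)
             == unicodedata.normalize("NFC", with_zwj))

# With the invisible joiner dropped, the labels become indistinguishable.
look_alike = visible_code_points(plain) == visible_code_points(with_zwj)

print(labels_equal, nfc_equal, look_alike)
```

Under these assumptions the labels differ as strings and after NFC, yet match once the joiner is ignored, which is the situation a registry would have to police under option (i).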

Please also note that, if you permit ZWJ and/or ZWNJ as
characters, we end up in exactly the same situation that you and
others have objected to with Eszett and Final Sigma, i.e., an
input string that converts to a different A-label in IDNA2003
and IDNA2008.  I'm prepared to live with that but, to the degree
to which you consider it a problem so serious as to require
rechartering and a completely different document strategy, I'd
like to better understand the exception and its implications.
In particular, I don't see the section of your outline document
that discusses the transition strategy that many people (I think
including you, but could be wrong about that) have argued is
absolutely essential if there are going to be any
incompatibilities of that sort.
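
The A-label divergence described above can be sketched in Python. The built-in "idna" codec implements the IDNA2003 algorithm, including the nameprep step that maps ZWJ to nothing. The function a_label_keeping_zwj is a hypothetical stand-in for a scheme that treats ZWJ as an ordinary character; it is not an IDNA2008 implementation and applies none of the contextual rules.

```python
ZWJ = "\u200d"  # U+200D ZERO WIDTH JOINER

plain = "\u00e1bc"                 # "ábc"
with_zwj = "\u00e1b" + ZWJ + "c"   # "áb<ZWJ>c"

# IDNA2003 path: Python's built-in "idna" codec runs nameprep,
# which maps ZWJ to nothing before Punycode encoding.
idna2003_plain = plain.encode("idna")
idna2003_zwj = with_zwj.encode("idna")


def a_label_keeping_zwj(label):
    """Hypothetical stand-in: Punycode-encode the label as given.

    No mapping and no contextual checks are applied, so ZWJ survives
    into the encoded form. Assumes a label that is already lowercase
    and contains at least one non-ASCII character.
    """
    return b"xn--" + label.encode("punycode")


kept_plain = a_label_keeping_zwj(plain)
kept_zwj = a_label_keeping_zwj(with_zwj)

# Same input string, two different A-labels depending on the rules.
print(idna2003_zwj == idna2003_plain)   # ZWJ mapped away under IDNA2003
print(kept_zwj == idna2003_zwj)         # differs once ZWJ is kept
```

The point of the sketch is only that the input "áb<ZWJ>c" yields one A-label when ZWJ is mapped to nothing and a different one when it is kept.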

best,
   john


