High-level changes from IDNA2003 in the "current work"

John C Klensin klensin at jck.com
Fri Mar 7 15:51:50 CET 2008



--On Thursday, 06 March, 2008 15:02 -0800 Paul Hoffman
<phoffman at imc.org> wrote:

> Hi again. It would be useful for those coming in late to have
> a summary of what changes are embodied in the current set of
> documents. Here's my first take on such a list. If people like
> this format, it could be used as the beginning of an outline
> for the BoF/WG.

Paul, I think you have stated some of these goals a little more
narrowly than is intended.  I'll try to summarize below, but
strongly suggest that people read both the drafts and RFC 4690
before the BOF.

> a) Update base character set from Unicode 3.2 to Unicode 5.0
> or 5.1

Actually, the goal is to make the revised standard Unicode
version-agnostic.   Getting to 5.0 or 5.1 is a consequence of
that approach.  The underlying issue is discussed in Sections 3
and 5.2 of RFC 4690 (although, if those sections were written
today, I believe they might differ in some details).

> b) Disallow most symbol characters

It is better to describe this as "disallow most characters that
are not letters or digits".   See section 5.1 of RFC 4690.
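
As a very rough illustration, here is a Python sketch of that
distinction.  The general categories used below only approximate
what the documents actually specify; the real derivation
involves additional properties and an exception list.

    import unicodedata

    # Sketch only: approximate the "letters or digits" baseline
    # with Unicode general categories.  The drafts derive the
    # permitted set from several properties plus exceptions.
    LETTER_DIGIT_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}

    def roughly_letter_or_digit(ch):
        return unicodedata.category(ch) in LETTER_DIGIT_CATEGORIES

    print(roughly_letter_or_digit("a"))       # True  (Ll, a letter)
    print(roughly_letter_or_digit("\u2122"))  # False (So, a symbol)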

> c) Remove the mapping and normalization steps from the
> protocol and have them instead done by the applications
> themselves, possibly in a local fashion, before invoking the
> protocol

Yes, but see the forthcoming response to James (in a separate
thread).
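
For readers who want a concrete picture, an application might do
something like the following locally before handing a label to
the protocol.  The particular mapping shown (case folding, then
NFC normalization) is only an assumption for illustration; the
point of the change is that the protocol no longer dictates it.

    import unicodedata

    def local_premap(label):
        # One plausible local mapping: fold case, then normalize
        # to NFC.  Under the proposal this step belongs to the
        # application, and different applications may differ.
        return unicodedata.normalize("NFC", label.casefold())

    print(local_premap("B\u00FCcher"))  # 'bücher'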


> d) Change the way that the protocol specifies which characters
> are allowed in labels from "humans decide what the table of
> codepoints contains" to "decision about codepoints are based
> on Unicode properties plus a small exclusion list created by
> humans"

I'm not sure I know how to express this better, but note that
the apparent issue can change a great deal depending on how one
states it.  While I've had some serious disagreements with some
active participants in the development of Unicode, I have never
had reason to believe that any member of the Unicode Technical
Committee (UTC) is non-human.  Given that the Unicode properties
are created by humans, the distinction that the statement above
seems to make is not really a distinction.

It seems to me that there are two underlying questions.  One is
where the locus of decision-making lies and what the units are
in which decisions are made.  I think we still agree that the
IETF is not the right place to try to construct a consensus
table on a character-by-character basis.  The other is how the
conclusions are expressed, a question described in the proposed
documents as whether the tables are normative or the rules that
generate them are.   That question is somewhat intertwined with
(a), above: if the standard is going to be
Unicode-version-agnostic, then having the IETF adopt a new
standard and new tables for each new version of Unicode (or
trying to freeze things and not move forward... again, see RFC
4690 for a discussion of this) is pretty much a contradiction.
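
To make the "rules, not tables" idea (and its connection to (a))
concrete, consider the following Python sketch.  The category
set and the exception entry are illustrative, not what the
drafts specify; the point is that rerunning the same rules
against a newer Unicode database yields an updated table without
new IETF action.

    import unicodedata

    # A small human-maintained exception list overrides the
    # property-derived default (illustrative entry only).
    EXCEPTIONS = {0x0640: "DISALLOWED"}   # ARABIC TATWEEL

    def derived_status(cp):
        if cp in EXCEPTIONS:
            return EXCEPTIONS[cp]
        cat = unicodedata.category(chr(cp))
        return ("PVALID"
                if cat in {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}
                else "DISALLOWED")

    # The Unicode database shipped with the interpreter
    # determines the generated table:
    print(unicodedata.unidata_version, derived_status(0x0061))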

> e) Allowing typical words and names in languages such as
> Dhivehi and Yiddish to be expressed

> f) Make bidirectional domain names (delimited strings of
> labels, not just labels standing on their own) display in a
> non-surprising fashion
> 
> g) Make bidirectional domain names in a paragraph display in a
> non-surprising fashion
> 
> Is the list a fair categorization? Should more items be added?
> Should some items be removed?

I think others have discussed the bidi-related issues, but it is
probably at least worth noting, for the benefit of new arrivals,
that all of (e)-(g), or any reasonable reformulation of them,
are issues with right-to-left writing systems only.

I would add two more...

(h) Introduce the new concept of characters that can be used
only in specific contexts.

	This is driven primarily by the need to permit the use
	of previously-prohibited (in the obvious specific case,
	mapped-to-nothing) joining characters where they are
	necessary to preserve information because of Unicode
	character shaping and presentation rules.  By way of
	explanation, rather than trying to go into the details:
	this is a particular problem for word formation and use
	in several scripts used in the Indian Subcontinent and
	nearby areas.  For those scripts, it is claimed that
	mapping the zero-width joiners and non-joiners to
	nothing loses too much information to be appropriate
	(as well as creating some serious problems with
	"meaning" when ToUnicode(ToASCII(string)) is applied).
	There are some separate questions (still being
	discussed) about the use of the same or similar
	characters as virtual word-separators in other scripts.
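
To illustrate with a Python sketch (again, not the drafts'
actual rule text): one contextual rule under discussion permits
ZERO WIDTH NON-JOINER only immediately after a virama, and
Python's built-in "idna" codec, which implements IDNA2003, shows
the information loss that motivates the change.

    import unicodedata

    ZWNJ = "\u200C"

    def zwnj_permitted(label, i):
        # Sketch of a contextual rule: ZWNJ is permitted only when
        # the preceding character is a virama (canonical combining
        # class 9).  The rule under discussion also has a broader
        # joining-context alternative, omitted here.
        return (i > 0 and label[i] == ZWNJ
                and unicodedata.combining(label[i - 1]) == 9)

    word = "\u0928\u094D" + ZWNJ + "\u0939"  # NA, virama, ZWNJ, HA
    print(zwnj_permitted(word, 2))   # True: ZWNJ follows a virama

    # IDNA2003's nameprep maps ZWNJ to nothing, so the round trip
    # does not restore it:
    print(word.encode("idna").decode("idna") == word)  # False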

(i) Explicitly separate the definitions of the "registration"
and "lookup" activities and introduce rules for validating IDN
strings before DNS lookup.

	IDNA2003, whether explicitly intended or not, has the
	effect of putting almost all responsibility for
	conformance on the registration process.  If something
	can be registered, even in violation of the standard,
	it will be looked up.  It appears that some registries
	have deliberately violated the registration rules to
	make things consistent with their beliefs about correct
	local conventions.  Some of these violations are mostly
	harmless, except that some applications will find the
	names on lookup and others will not.  Others could
	create significant risks.  The proposed documents
	attempt to identify the latter cases and prohibit such
	strings from even being looked up in the DNS, creating
	an "even if you register that, the lookups will almost
	always fail" situation for rogue registries.
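
A sketch of the shape of that change, reusing the illustrative
derived_status() function from the earlier sketch (none of these
names come from the drafts):

    def valid_for_lookup(label):
        # Conforming applications validate before any DNS query,
        # so a string a rogue registry accepted anyway is still
        # rejected on the lookup side.
        return all(derived_status(ord(ch)) == "PVALID"
                   for ch in label)

    def valid_for_registration(label, registry_policy=None):
        # Registration applies at least the same protocol rules;
        # a registry may layer its own policy on top.
        if not valid_for_lookup(label):
            return False
        return registry_policy(label) if registry_policy else True

    print(valid_for_lookup("b\u00FCcher"))  # True
    print(valid_for_lookup("\u2603"))       # False: never queried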

--john


