High-level changes from IDNA2003 in the "current
John C Klensin
klensin at jck.com
Fri Mar 7 15:51:50 CET 2008
--On Thursday, 06 March, 2008 15:02 -0800 Paul Hoffman
<phoffman at imc.org> wrote:
> Hi again. It would be useful for those coming in late to have
> a summary of what changes are embodied in the current set of
> documents. Here's my first take on such a list. If people like
> this format, it could be used as the beginning of an outline
> for the BoF/WG.
Paul, I think you have stated some of these goals a little more
narrowly than is intended. I'll try to summarize below, but
strongly suggest that people read both the drafts and RFC 4690
before the BOF.
> a) Update base character set from Unicode 3.2 to Unicode 5.0
> or 5.1
Actually, the goal is to make the revised standard Unicode
version-agnostic. Getting to 5.0 or 5.1 is a consequence of
that approach. The underlying issue is discussed in Sections 3
and 5.2 of RFC 4690 (although, if those sections were written
today, I believe they might differ in some details).
> b) Disallow most symbol characters
It is better to describe this as "disallow most characters that
are not letters or digits". See section 5.1 of RFC 4690.
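The "letters or digits" framing can be illustrated with Unicode general
categories. This is only a rough sketch, not the actual IDNA2008
derivation, which also involves an exception list and contextual rules:

```python
import unicodedata

def is_letter_or_digit(cp: str) -> bool:
    """Rough check: Unicode general category L* (any letter) or Nd
    (decimal digit). The real derived-property rules are more involved."""
    cat = unicodedata.category(cp)
    return cat.startswith("L") or cat == "Nd"

# Letters and digits pass; most symbols and punctuation do not.
assert is_letter_or_digit("a")        # Ll (lowercase letter)
assert is_letter_or_digit("9")        # Nd (decimal digit)
assert not is_letter_or_digit("\u2603")  # So (symbol, SNOWMAN)
```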
> c) Remove the mapping and normalization steps from the
> protocol and have them instead done by the applications
> themselves, possibly in a local fashion, before invoking the
Yes, but see my forthcoming response to James (in a separate
message).
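One plausible shape for that application-side preprocessing is a local
mapping step (here, case folding) followed by normalization before the
label is handed to the protocol proper. The exact policy would be up to
the application under this proposal; this is only an illustrative sketch:

```python
import unicodedata

def preprocess_label(label: str) -> str:
    """Hypothetical application-side preprocessing: case-fold, then
    NFC-normalize, before invoking the IDNA protocol itself. Under the
    proposal, this mapping is a local application decision, not part of
    the protocol."""
    return unicodedata.normalize("NFC", label.casefold())

# 'Straße' case-folds to 'strasse'; decomposed 'e' + combining acute
# normalizes to the single precomposed code point.
assert preprocess_label("Stra\u00dfe") == "strasse"
assert preprocess_label("e\u0301") == "\u00e9"
```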
> d) Change the way that the protocol specifies which characters
> are allowed in labels from "humans decide what the table of
> codepoints contains" to "decisions about codepoints are based
> on Unicode properties plus a small exclusion list created by
I'm not sure I know how to express this better, but note that
what the issue is about may seem to change a great deal based on
how one states it. While I've had some serious disagreements
with some active participants in the development of Unicode, I
have never had reason to believe that any member of the Unicode
Technical Committee (UTC) is non-human. Given that the Unicode
properties are created by humans, the distinction that the
statement above seems to make is not really a distinction.
It seems to me that there are two underlying questions. One is
about where the locus of decision-making lies and about the
units in which decisions are made. I think we still agree that the IETF
is not the right place to try to construct a consensus table on
a character-by-character basis. The other is a question of how
the conclusions are expressed, a question described in the
proposed documents as whether the tables are normative or the
rules that generate them are. That question is somewhat
intertwined with (a), above: if the standard is going to be
Unicode-version-agnostic, then having the IETF adopt a new
standard and new tables for each new version of Unicode (or to
try to freeze things and not move forward... again see RFC 4690
for a discussion of this) is pretty much a contradiction.
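The "rules are normative, tables are derived" approach can be sketched as
follows: compute the permitted set from the Unicode properties of whatever
Unicode version is installed, minus a small exclusion list, so the
derivation tracks new Unicode versions without a new table being adopted
each time. The rule below is a deliberately simplified stand-in, not the
actual algorithm in the proposed documents:

```python
import sys
import unicodedata

# Hypothetical small exclusion list maintained by the standards process;
# the real IDNA2008 exception list differs. U+0640 (ARABIC TATWEEL) is a
# letter by category but is excluded here for illustration.
EXCLUDED = {0x0640}

def derive_allowed():
    """Derive the permitted code points from properties of the installed
    Unicode version, rather than from a frozen per-version table."""
    allowed = set()
    for cp in range(sys.maxunicode + 1):
        cat = unicodedata.category(chr(cp))
        if (cat.startswith("L") or cat == "Nd") and cp not in EXCLUDED:
            allowed.add(cp)
    return allowed

allowed = derive_allowed()
assert ord("a") in allowed
assert ord("\u2603") not in allowed  # SNOWMAN: not a letter or digit
assert 0x0640 not in allowed         # excluded despite category Lm
```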
> e) Allowing typical words and names in languages such as
> Dhivehi and Yiddish to be expressed
> f) Make bidirectional domain names (delimited strings of
> labels, not just labels standing on their own) display in a
> non-surprising fashion
> g) Make bidirectional domain names in a paragraph display in a
> non-surprising fashion
> Is the list a fair categorization? Should more items be added?
> Should some items be removed?
I think others have discussed the bidi-related issues, but it is
probably at least worth noting, for the benefit of new arrivals,
that all of (e)-(g), or any reasonable reformulation of them,
are issues with right-to-left writing systems only.
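For new arrivals, the kind of label these bidi items worry about can be
shown with the Unicode bidirectional classes. This is only a loose
approximation of the much more detailed rule in the proposed documents:

```python
import unicodedata

def strong_directions(label: str) -> set:
    """Collect the strong directional classes present in a label:
    L (left-to-right) versus R/AL (right-to-left)."""
    dirs = set()
    for ch in label:
        bc = unicodedata.bidirectional(ch)
        if bc == "L":
            dirs.add("LTR")
        elif bc in ("R", "AL"):
            dirs.add("RTL")
    return dirs

# A Hebrew label is purely right-to-left; mixing it with Latin yields
# both classes, the sort of label that can display surprisingly.
assert strong_directions("\u05e9\u05dc\u05d5\u05dd") == {"RTL"}
assert strong_directions("abc") == {"LTR"}
assert strong_directions("abc\u05e9") == {"LTR", "RTL"}
```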
I would add two more...
(h) Introduce the new concept of characters that can be used
only in specific contexts.
This is driven primarily by the need to permit the use
of previously-prohibited (for the obvious specific case,
mapped-to-nothing) joining characters where they are
necessary to preserve information in scripts because of
	Unicode character shaping and presentation rules.  By
	way of explanation, rather than trying to go into the
	details, this is a particular problem for word formation
and use with several scripts used in the Indian
Subcontinent and nearby areas. For those scripts, it is
claimed that mapping the zero-width joiners and
non-joiners to nothing loses too much information to be
appropriate (as well as creating some serious problems
with "meaning" when ToUnicode(ToASCII(string)) is
applied). There are some separate questions (still
being discussed) about the use of the same or similar
characters as virtual word-separators in other scripts.
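The "permitted only in specific contexts" idea can be sketched for ZERO
WIDTH NON-JOINER: one of the contextual rules under discussion permits
ZWNJ when it immediately follows a virama (canonical combining class 9).
The check below is a simplified rendering of that rule; the full rule
also has a joining-type alternative that this sketch omits:

```python
import unicodedata

ZWNJ = "\u200c"
VIRAMA_CCC = 9  # canonical combining class of virama characters

def zwnj_context_ok(label: str) -> bool:
    """Allow ZERO WIDTH NON-JOINER only when the preceding character is
    a virama (combining class 9) -- a simplified contextual rule."""
    for i, ch in enumerate(label):
        if ch == ZWNJ:
            if i == 0 or unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
                return False
    return True

# Devanagari KA + VIRAMA + ZWNJ + SSA: the ZWNJ follows a virama, so it
# passes; a ZWNJ after a plain Latin letter fails the context check.
assert zwnj_context_ok("\u0915\u094d\u200c\u0937")
assert not zwnj_context_ok("ab\u200cc")
```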
(i) Explicitly separate the definitions for the "registration"
and "lookup" activities and introduce explicit rules for
validation of IDN strings before DNS lookup.
IDNA2003, whether explicitly intended or not, has the
effect of putting almost all responsibility for
conformance on the registration process. If something
can be registered, even in violation of the standard, it
will be looked up. It appears that some registries
have deliberately violated the registration rules to
make things consistent with their beliefs about correct
local conventions. Some of these violations are
harmless except that, for some of them, some
applications will find the names on lookup and others
	will not.  Others could create significant risks.  The
	proposed documents attempt to identify the latter cases
	and prohibit those strings from even being looked up in
	the DNS, creating an "even if you register that, the
	lookups will almost always fail" situation for rogue
	registries.