IDNAbis discussion style, mappings, and (incidentally) Eszett

John C Klensin klensin at jck.com
Fri Nov 30 20:37:06 CET 2007


Erik,

Patrik has already responded to part of this, but let me try on
the rest.  But, first, thanks for the kind words.   It has,
indeed, not been easy although I hope we are now entering the
home stretch.

--On Thursday, 29 November, 2007 19:00 -0800 Erik van der Poel
<erikv at google.com> wrote:

> Hi John,
> 
> Overall, I think the idnabis drafts are on a reasonable track.
> I remain concerned, though, that the mapping spec (case
> mapping and NFKC) is missing. It remains to be seen whether
> the major browser developers will stop mapping characters
> found in domain names in HTML, but if they continue to map
> them, I feel that it would be simpler if they performed
> similar mapping for the post-Unicode-3.2 characters. Such a
> spec could certainly be written, at least as an informative
> document.

Let me tell you what the underlying problem is and what I think
is going to happen (I have not reviewed this with my colleagues,
but it is an impression formed after a lot of conversations with
people around the world).  

First, this whole business about mapping confusion turns out to
almost always be confusion about someone else's script or,
occasionally, about the use of one's script by someone else's
language.  A user, when looking at what has been done by mapping
with her own script, is almost never confused.  She may be
astonished or even angry (e.g., convinced that something is
seriously wrong and that we have been stupid and insulting), but
not confused.  Similarly, if a script is incomprehensible to a
given user, he isn't going to be confused about mappings, just
generally bewildered by the characters (although mappings into
or out of that script into another one could still be
problematic).

Second, there is an issue motivating the move away from
mappings that has more to do with the use of IDNs than with the
protocol.  There are many reasons for wanting URIs (and
presumably IRIs) to be as standardized and consistent as
possible.  While the notion of "I'll write my string correctly
and let IDNA compensate" has worked well for Eszett (see that
long thread), that is the exception rather than the rule.  In
particular, having compatibility characters that will be mapped
out by NFKC in domain names is just a collection of
interoperability problems waiting to happen, problems that get
dramatically worse if a system on which the characters are to
be displayed has fonts and rendering code for the base
characters but not for the compatibility ones.  So, a useful,
and entirely intentional,
side-effect of the "no mappings" principle is to simplify IDNs
that might appear in either protocol (URI or IRI) or
non-protocol (running text, side of lorry, etc.) contexts
without forcing the user any further toward A-labels as the only
interoperable form than we need to go.  Knowing whether the
compatibility characters are likely to be available in fonts,
etc., in a particular environment is a localization issue about
which one probably cannot realistically make global statements.
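
To make the compatibility-character problem concrete, here is a
small Python illustration using the standard library's NFKC
normalization (the example strings are made up):

    # Compatibility characters are folded away by NFKC; if that
    # folding is not applied somewhere, visually similar strings
    # end up as different labels.
    import unicodedata

    samples = [
        "ﬁnance.example",    # U+FB01 LATIN SMALL LIGATURE FI
        "ｅｘａｍｐｌｅ.com",  # fullwidth forms of "example"
        "x².example",        # U+00B2 SUPERSCRIPT TWO
    ]
    for s in samples:
        print(s, "->", unicodedata.normalize("NFKC", s))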

My guess is that, as pressures mount to make Internet
applications (not just browsers, mail clients, and word
processors) more and more friendly to local cultural,
linguistic, and orthographic conventions, applications will be
increasingly localized.   To the extent to which the IDNA2003
mappings make sense in the local environment, I'd expect them to
be applied (and they will certainly be the obvious default).
But, for scripts that are rarely used locally or not used by the
user (defined either by choice of software or by some user
configuration options), I expect that we will evolve toward as
little mapping as possible because that will improve
interoperability.  

As another example, note that, for Roman-derived characters, the
decision about case-mapping is consistent with that analysis.
We use only lower-case, rather than upper-case (and, in
IDNA2003, map to lower-case), because the lower-case characters
are easier to distinguish and have fewer strange properties
(e.g., loss of diacritical marks and consequent overloading)
than the upper-case ones.  Within the subset of "Latin"
characters that are understood by a local population by virtue
of use with the local languages, all of the idiosyncrasies are
well understood and case-mapping is safe, rational, and usually
expected.  But, for populations who can (perhaps only barely)
read and manipulate those characters (perhaps enough to work
SMS on a mobile phone), we really should not be complicating
things and adding confusion by introducing mappings that may be
non-obvious from one character shape/glyph to another.
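
To illustrate why case mapping is safe only when the language
context is understood, a small Python example (the characters
are just convenient instances):

    # Case mapping is neither uniform nor reversible across
    # languages.
    print("ß".upper())         # 'SS': Eszett has no single upper-case form
    print("SS".lower())        # 'ss': the round trip does not return the Eszett
    print("İ".lower())         # 'i' followed by U+0307 COMBINING DOT ABOVE
    print("İ".lower() == "i")  # False: not what a Turkish user would expect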

A different way of looking at this would be to say that we know
that localization is going to happen (and that it is a good
thing) and that it is desirable to eliminate protocol-based
mappings and everything that exploits them in order to permit
better localization and UIs.



> In your email below, you refer to non-protocol text. Does this
> include HTML? It might be nice to give HTML as an example if
> that is what you mean.

It would certainly include text in HTML.  The argument to an
href, for example, is protocol text.  See below for examples.
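
To make the distinction concrete, a small Python sketch (the
name is hypothetical; the built-in "idna" codec applies roughly
the IDNA2003 conversion):

    # The href value is protocol text; the visible link text is
    # ordinary running text in the page.
    u_label = "bücher.example"
    a_label = u_label.encode("idna").decode("ascii")

    print('<a href="http://%s/">%s</a>' % (a_label, u_label))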

> I admire the design team's desire to keep things simple and to
> avoid exceptions but with tongue in cheek I point out that
> Patrik's document does not really seem simple and appears to
> be a long list of exceptions (or exceptional rules). :-)

As Patrik has sort of indicated, we clearly need some additional
explanation about what is going on.

In general, there are three big changes between IDNA2003 and the
new proposal, with almost everything else being a matter of
fixing up details (including, e.g., the Bidi work) or
consequences of those changes.  One is switching from "because
it is in Unicode, it ought to be in IDNA unless there is a
clear reason why not" to a much more limited list that is
based, depending on how one looks at it, either on extending
the LDH rule to non-ASCII characters or on focusing only on
characters that are used to write human languages and that are
not mapped away in compatibility normalization processing.  The
second, closely related, is the "no mapping"
principle.  As discussed earlier, it is not a rigid rule
because, if we really need exceptions --either to make things
right or to achieve an acceptable level of backward
compatibility-- then we will have exceptions (albeit one hopes
very few of them).  The third is that, while we expect real
applications in the real world to use tables rather than
rederiving things each time, the definition is in the rules, not
the tables.  In theory, we might want to make the "tables"
document even longer (and more forbidding) by generating and
including separate sets of tables for Unicode 3.2, 4.0, 4.1,
5.0, and, when it comes along, 5.1.  Modulo that short exception
list, the rules should be able to generate each of those tables
in a straightforward way, and that is precisely where
version-agility comes from.  We don't have to go back into the
standardization process to permit an application to remain
conforming when the relevant libraries are updated to a new
version of Unicode.
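
To make the "rules, not tables" point concrete, here is a rough
Python sketch.  It is a deliberate simplification for
illustration, not the rule set in Patrik's draft, but it shows
how a character's status can be derived from the properties of
whatever Unicode version the local library provides:

    import unicodedata

    def looks_permitted(ch):
        # Roughly "LDH extended beyond ASCII": letters, combining
        # marks, and decimal digits only...
        if unicodedata.category(ch) not in (
                "Ll", "Lo", "Lm", "Mn", "Mc", "Nd"):
            return False
        # ...that survive compatibility normalization unchanged...
        if unicodedata.normalize("NFKC", ch) != ch:
            return False
        # ...and that need no case mapping.
        return ch.lower() == ch

    # The results track whichever Unicode version is installed.
    print(unicodedata.unidata_version)
    for ch in "aA2²ſß":
        print(ch, looks_permitted(ch))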

> Anyway, congrats on a job well done. It could not have been
> easy.

Again, thanks for the kind words.

> If it would be helpful, I could take a look at Google's data
> to see whether any of the characters listed as NEVER or MAYBE
> NOT appear to deserve a "higher" classification (e.g. MAYBE
> YES).

I think any input along those lines that would help calibrate
this work against real-world experience and practice would be
helpful.

I do believe that we need to be cautious about two things.  One
is that, while others may not agree, I'm reasonably comfortable
invalidating screwy registrations that have occurred in
violation of (or pushing the boundaries of) existing guidelines
if it leads to better long-term interoperability.  I get even
more comfortable if those registrations are not associated with
real, functional, hosts and applications and/or if they were
made after the direction of the IDNAbis work became clear in an
attempt to either prevent this work from concluding or to get
some sort of "grandfather" stake in the ground.   Others may, of
course, disagree (and some certainly will).

The second is that the really nasty traps involve characters
that may make perfectly good sense in one context but that may
be problematic in others.  There are not many of them, but they
are bad news.  The various zero-width things are the most common
examples, but there are others, especially where correct
rendering for one language that uses a script differs from
correct rendering for another language that uses the same
script.  Searching Google's databases may turn up uses of those
characters, but it will not tell us where the interpretation,
rendering, or contextual problems might occur.  So we need to
be careful about how we interpret what is found.
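
As a deliberately simplified example, the two Persian-script
strings below differ only by a ZERO WIDTH NON-JOINER (the
particular word is only meant as an illustration).  The joiner
matters for correct rendering, but it is invisible, so
occurrence counts alone will not tell us whether a given use is
legitimate:

    # Two strings that differ only by U+200C ZERO WIDTH NON-JOINER.
    without_zwnj = "نامهای"
    with_zwnj = "نامه\u200cای"

    print(without_zwnj == with_zwnj)          # False: different code points
    print(len(without_zwnj), len(with_zwnj))  # the ZWNJ adds one code point
    print("\u200c" in with_zwnj)              # True, though nothing visible shows it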

Erik, as a more general observation, a lot of this, especially
(if others agree) the material above about mappings,
localization, and UIs, needs to be written up in some coherent
form.  I don't think very much of it belongs in the IDNA200X
documents, even the "issues" one.  I'm happy to keep explaining
and generating text, but would really appreciate a volunteer
who is sympathetic to the general model, has some experience
with the issues, and would take lead responsibility for such a
document.
Do you have any free cycles?

best,
     john


