Mapping tables (was Re: IDNA online tool)

Fri Apr 10 19:01:29 CEST 2009

I usually try to err on the side of concision, since otherwise I know
people's eyes quickly glaze over, but I'll take a bit longer to address the
issue you raised (though I will never be any competition for John ;-). So
please bear with me this time.

The reason that I included the ability to change the mapping filter is just
so that people could try out different choices, and see what an actual
difference it would make. I suspect that we will end up not using the
IDNA2003 mapping style (NFKC+CF+removing default ignorables), but we need to
examine our choices very carefully. For example, I think John's suggestion
of not having the "removing default ignorables" be part of the mapping is a
reasonable one.
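
To make the choice concrete, here is a rough sketch of that mapping style using Python's standard unicodedata module. It is an approximation for experimenting, not the actual IDNA2003 tables: the function name map_label and its flag are made up for illustration, and SAMPLE_IGNORABLES is only a small hand-picked subset of the default ignorable code points.

```python
import unicodedata

# Illustrative subset only: the real Default_Ignorable_Code_Point set is larger.
SAMPLE_IGNORABLES = {0x00AD, 0x034F, 0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}
SAMPLE_IGNORABLES |= set(range(0xFE00, 0xFE10))  # variation selectors 1-16


def map_label(label: str, remove_ignorables: bool = True) -> str:
    """Approximate NFKC + case folding, optionally dropping ignorables first."""
    if remove_ignorables:
        label = "".join(ch for ch in label if ord(ch) not in SAMPLE_IGNORABLES)
    folded = unicodedata.normalize("NFKC", label).casefold()
    # Normalize again, since case folding can denormalize some strings.
    return unicodedata.normalize("NFKC", folded)


if __name__ == "__main__":
    soft_hyphenated = "Ex\u00ADample"  # contains U+00AD SOFT HYPHEN
    print(ascii(map_label(soft_hyphenated)))
    print(ascii(map_label(soft_hyphenated, remove_ignorables=False)))
```

With remove_ignorables=False the soft hyphen stays in the mapped string, so a later validity check has to deal with it instead of the mapping silently dropping it.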

I may seem pretty conservative on this account. This is probably due to the
experiences of over 20 years with Unicode, which is such a core technology
that changes ripple quite broadly. There have been times where we made a
change that looked absolutely reasonable, would be clearly the right thing
to do, and would not affect anyone negatively -- where all reports from
contacts in community X (eg BIDI) said that they didn't use the
characters, and so a change wouldn't matter. Yet once systems start being
deployed, lo and behold the error reports come rolling in -- it turns out
that they *are* used by X community, and users are extremely unhappy. Those
of us who have worked in operating systems are also painfully aware of this
kind of problem.

So even though Unicode could be improved in many ways with incompatible
changes, we have really gotten quite conservative about them, just because
of unintended and unforeseen consequences.

Someone on this list noted that "The registrants are the main clients of
IDN domains, hence they are the main clients of this WG." That is not true:
there are many, many stakeholders involved. Registrants certainly, but also
registries, and *most importantly,* users of programs that accept and
display URLs: browsing, email, chat, IM, and so on. Those users include a
pretty significant portion of the world's population. And we will face
a long transition period where both IDNA2003 and its successor are in play -
we at Google see just how many people are using very old browsers, and of
course the DNS people know just how long it has taken to deploy IPv6.

We need to recognize that the people in this working group are a very small
subset of those affected by the changes we make, and are only partially
representative. Someone whose primary editor is emacs, for example, is
hardly a typical user! Of course the WG is open to any and all comers, but
certainly not all types of people that will be directly or indirectly
affected by these changes will realize that this group even exists, let
alone that it will be making incompatible changes. (When I mention to
ordinary engineers that the changes in IDNA2008 will result in the same URL
going to different IP addresses on different browsers, they look at me like
I'm completely, utterly, crazy.)

Most of the people on this list will not be the ones that get the error
reports, where people are really upset because something used to work and
doesn't anymore. We will only find out after the fact just how bad it is -
sadly, we don't have the luxury of a beta phase, where our users can see
what the effects of these changes are, and let us know where we must make
changes.

That's why we in this group need to be very careful about incompatible changes
to an existing, deployed standard (IDNA2003) except where:

(a) there is clear harm, or
(b) the characters in question occur in demonstrably very low frequency.

The latter is why the Unicode Consortium is ok, for example, with the
removal of the vast majority of symbols and punctuation, because the vast
majority are used with such low frequency, even though only a couple of them
have actually been shown to be at all harmful. And I think it is probably ok
to remove the circled items from mapping (
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:dt=Enc:]), even
though they cause no harm either (removal has little potential upside, only
potential downside).

But before we blithely rush to doing just case+width mapping, we have to do
due diligence: carefully looking over the other instances, and making sure
that exclusion is on balance the right approach. It is probably ok to remove
Arabic presentation forms, but we'd better check first with the
Arabic-script communities to make sure; just as we heard back from the CJK
communities that Width mapping was important.
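
For anyone who wants to see what "just case+width" would leave unmapped, here is a minimal sketch, again with Python's unicodedata module. The helper names are hypothetical, and I am assuming that "width" means the <wide> and <narrow> decomposition types and that "case" means simple lowercasing; a real proposal might define either differently.

```python
import unicodedata

# Assumption: width mapping covers exactly these two decomposition types.
WIDTH_TAGS = ("<wide>", "<narrow>")


def width_map(ch: str) -> str:
    """Replace a fullwidth/halfwidth character by its decomposition."""
    fields = unicodedata.decomposition(ch).split()
    if fields and fields[0] in WIDTH_TAGS:
        return "".join(chr(int(cp, 16)) for cp in fields[1:])
    return ch


def case_width_map(label: str) -> str:
    """Apply width mapping, then lowercase; nothing else is touched."""
    return "".join(width_map(ch) for ch in label).lower()


if __name__ == "__main__":
    # Fullwidth Latin, an Arabic presentation form, a circled digit.
    for sample in ("\uFF25\uFF38", "\uFE8D", "\u2460"):
        print(ascii(sample), "->",
              ascii(case_width_map(sample)), "vs NFKC",
              ascii(unicodedata.normalize("NFKC", sample)))
```

The Arabic presentation form and the circled digit pass through case_width_map unchanged, while NFKC maps both; those are exactly the kinds of cases that need review before they are dropped from the mapping.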

And people cannot simply depend on the Unicode decomposition type (dt) (See
http://unicode.org/cldr/utility/properties.jsp#Decomposition_Type) to make
all the decisions for them; it would not be a good idea to exclude Ǳ from
mapping, for example, and its dt is Compat. The dt property will be useful,
in the same way as other property-based rules in the tables doc are, but
the categories defined by that property may not match exactly what
end users need, and thus need to be reviewed.
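
As a small illustration of the dt point, the sketch below reads the decomposition type of a character from Python's unicodedata module. The function name decomposition_type is made up, and the labels "canonical" and "none" are my own shorthand, not official property value names.

```python
import unicodedata


def decomposition_type(ch: str) -> str:
    """Return the tag of a character's decomposition mapping, if it has one."""
    fields = unicodedata.decomposition(ch).split()
    if not fields:
        return "none"
    if fields[0].startswith("<"):
        return fields[0].strip("<>")  # e.g. "compat", "wide", "circle"
    return "canonical"  # untagged mappings are canonical decompositions


if __name__ == "__main__":
    for ch in ("\u01F1", "\uFF21", "\u2460", "\u00E9", "a"):
        print(ascii(ch), unicodedata.name(ch), decomposition_type(ch))
```

U+01F1 (Ǳ) reports "compat", yet NFKC plus case folding turns it into the plain letters "dz", which is a mapping users would reasonably expect to keep. A rule that excluded everything tagged compat would lose it.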

Mark

PS. And bringing this back to subject of the online tool, if there are
changes to the tools at http://unicode.org/cldr/utility/index.jsp that would
help in such a review, please let me know.