Q2: What mapping function should be used in a revised IDNA2008 specification?

Sat Apr 4 17:29:54 CEST 2009

I agree that mapping ZWJ/ZWNJ to nothing in IDNA2003 has now caused a
problem for us: some languages appear to use these characters in ways
that make a semantic, and sometimes a large visual, difference.

Whether those languages should be using those characters in those ways
is beside the point (unless some of them are as lucky as Malayalam,
and can add Chillu-like characters that solve the problem). The point
is that we now have to decide what to do about ZWJ/ZWNJ, given the
desire stated by some that we maintain compatibility with IDNA2003,
and given the nascent idea of a display mechanism. We should probably
gather more info about ZWJ/ZWNJ in those languages, to see how many
pairs there are of words or names that only differ in the presence or
absence of ZWJ/ZWNJ. If we find that there is only a small number of
such pairs in the real world, and if some of us are convinced that the
display mechanism can be made to work, it's a good bet that some of
the implementers will choose to pursue that. We need to remember that
implementers will be faced with the decision to abandon or maintain
IDNA2003, and we should consider the very real possibility that some
implementers will effectively end up choosing bits and pieces of
IDNA2003 and bits and pieces of IDNAbis.
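
As a rough sketch of how such a survey might be run, the Python below
groups words by the form they take once ZWJ (U+200D) and ZWNJ (U+200C)
are removed, and reports the groups that contain more than one distinct
spelling. The function names and the sample strings are made up for
illustration; real input would be a word list or registration data for
the languages in question.

```python
from collections import defaultdict

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER

def strip_joiners(word):
    """Remove ZWJ/ZWNJ, as IDNA2003's map-to-nothing step does."""
    return word.replace(ZWNJ, "").replace(ZWJ, "")

def joiner_collisions(words):
    """Return groups of distinct words that become identical
    once ZWJ/ZWNJ are removed."""
    groups = defaultdict(set)
    for word in words:
        groups[strip_joiners(word)].add(word)
    return {key: sorted(group)
            for key, group in groups.items() if len(group) > 1}

# Synthetic placeholder strings, not real words from any language.
sample = ["abc", "ab" + ZWJ + "c", "a" + ZWNJ + "bc", "xyz", "pq" + ZWNJ + "r"]
collisions = joiner_collisions(sample)
print(len(collisions), "colliding group(s)")
```

The number of groups found this way in real data would be one input to
the decision, not the decision itself.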

Having said all that, we need to look carefully at all the "map to
nothing" characters in IDNA2003, and all the "default ignorable"
characters that appeared after Unicode 3.2, and make good decisions
for each of them.
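
One starting point for that review is simply to list the IDNA2003 side.
A minimal sketch, assuming Python's standard stringprep module (which
carries the RFC 3454 tables, including table B.1, "commonly mapped to
nothing"); the variable names are mine. The characters that became
default ignorable after Unicode 3.2 are not covered here and would
have to come from the Unicode data files.

```python
import stringprep
import sys
import unicodedata

# Collect every code point that RFC 3454 table B.1 maps to nothing.
mapped_to_nothing = [
    cp for cp in range(sys.maxunicode + 1)
    if stringprep.in_table_b1(chr(cp))
]

# Print one line per character so each can be reviewed individually.
for cp in mapped_to_nothing:
    name = unicodedata.name(chr(cp), "<unnamed>")
    print(f"U+{cp:04X} {name}")
```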

Some of the "map to nothing" characters in IDNA2003 have wasted the
time of people who run into them and of people who have to explain
what happens to those characters. This happened at Google.
Some of our domain names were not being run through IDNA2003 routines,
so we ended up with weird URLs that looked like
http://micro%c2%adsoft.com/ (with the %-escaped UTF-8 encoding of the
soft hyphen U+00AD). Now it may be that some input methods or apps
make it very easy to accidentally enter soft hyphens, which would
explain their presence on the Web. So some implementers may choose to
keep the mapping (to nothing).
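
To make the soft-hyphen case concrete, here is a small Python sketch.
It decodes the %-escapes in the host shown above and then drops the
characters in RFC 3454 table B.1, which is the "map to nothing" part of
IDNA2003's Nameprep step. The helper name is hypothetical, and this is
only that one step, not a full ToASCII implementation.

```python
import stringprep
from urllib.parse import unquote

def drop_mapped_to_nothing(label):
    """Remove characters that RFC 3454 table B.1 maps to nothing."""
    return "".join(ch for ch in label if not stringprep.in_table_b1(ch))

raw_host = "micro%c2%adsoft.com"
decoded = unquote(raw_host)   # %c2%ad is the UTF-8 encoding of U+00AD
cleaned = drop_mapped_to_nothing(decoded)

print(repr(decoded))
print(repr(cleaned))
```

A code path that skips this step keeps the soft hyphen in the host, so
the name no longer matches the one without it.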

Of course, you can argue that the WG should try to reach consensus on
a spec that tries to push implementers in the "right" direction
(taking the "high road"), and that we will always have some
implementers that bend or break the rules. If that is the way this WG
wishes to proceed, that's fine, but implementers will be watching each
other and making their own decisions. Those decisions, if they stick,
will eventually be recorded in "descriptive" docs that appear some
time after the WG's "prescriptive" spec. Witness OSI vs TCP/IP, SGML
vs XML, DSSSL vs CSS and XHTML vs HTML5 (some of which are not so
analogous to the current IDNA situation).

Erik

On Sat, Apr 4, 2009 at 12:51 AM, John C Klensin <klensin at jck.com> wrote:
> FWIW, I agree with the sets of comments in both of the messages
> cited below...
>
> One other observation, just to avoid sending an extra message.
> I really hope that, in whatever mapping we decide is appropriate
> (and wherever we put it), we can avoid getting involved with the
> "maps to nothing/default ignorable" function. While I hope,
> as I trust everyone else does, that we never run into the kind
> of disastrous situation that would cause us to move a character
> from DISALLOWED to PVALID (or CONTEXTx) somewhere down the line,
> I think that one of the things we have learned from the ZWJ/ZWNJ
> situation is that the cases in which a character was discarded,
> leaving us with no clue as to what was intended to be in a
> registration, are even worse and therefore to be avoided in the
> interest of general prudence.
>
>    john
>
>
> --On Thursday, April 02, 2009 09:53 -0700 Erik van der Poel
> <erikv at google.com> wrote:
>
>> IDNA2008
>> is a much more careful effort, with detailed dissection, as
>> you can see in the Table draft. We should apply similar care
>> to the "mapping" table.
>>
>> I suggest that we come up with principles, that we then apply
>> to the question of mapping. For example, the reason for
>> lower-casing non-ASCII letters is to compensate for the lack
>> of matching on the server side. The reason for mapping
>> full-width Latin to normal is because it is easy to type those
>>...
>
> --On Thursday, April 02, 2009 10:50 -0700 Erik van der Poel
> <erikv at google.com> wrote:
>
>> It may not be necessary to do character-by-character analysis
>> of NFKC. We may be able to select a small number of the NFKC
>> tags:
>>
>> <font>        A font variant (e.g. a blackletter form).
>> <noBreak>     A no-break version of a space or hyphen.
>> <initial>     An initial presentation form (Arabic).
>>...
>