Mapping (was: Issues lists and the "preprocessing" topic)

John C Klensin klensin at jck.com
Tue Aug 19 20:31:22 CEST 2008



--On Tuesday, 19 August, 2008 09:47 -0400 Andrew Sullivan
<ajs at commandprompt.com> wrote:

> I've changed the subject line because I am responding to just
> one issue in John's mail.
> 
> On Mon, Aug 18, 2008 at 08:48:04PM -0400, John C Klensin wrote:
> 
>> The implications of the above are that we not only aren't
>> encouraging extensive local-option mapping, we are encouraging
>> no mapping at all except for backward compatibility when
>> necessary and as a user interface convenience.   For the
>> latter, the expectation is that one will make the mappings as
>> early as possible and use only the mapped (U-label or
>> A-label) form in files; storing anything else in a file or
>> sending it across the network is strongly discouraged.
>> Also, even when mappings are done, the rule that is now
>> present in the documents still stands, i.e., one must not map
>> a PVALID or CONTEXT character into anything else -- mapping
>> is permitted only for DISALLOWED characters.
> 
> For me, the nagging worry is that you can pack just about
> anything you like into "user interface convenience".  Why not
> just say that local-option mapping SHOULD NOT be used except
> when required for compatibility with IDNA2003?  I get that
> this could make some interfaces clunkier.  But it seems to me
> that local mapping on the grounds of convenience surely just
> means "map when you like", so we should expect that every
> DISALLOWED character ends up mapped somehow, in different ways
> depending on local policy.  Such a situation seems to me to
> have a great potential for surprising results.

Andrew, 

Reasonable question, and I'd actually be very happy if we could
impose that limit.   First, note that almost any character that
is DISALLOWED in IDNA2008 and that appears in IDNA2003 at all is
going to "map somehow" under IDNA2003 compatibility.  So, at
least statistically, it is IDNA2003 compatibility that opens the
doors to the problems you would like to avoid and not permission
to map other things -- IDNA2003 compatibility just specifies
what mappings are to occur, and there are a few exceptions even
then.

I haven't written things that way because I'm worried about two
cases:

(1) There are a few cases for which mapping at the interface is
clearly required because the characters that people type are
associated with different code points than the results of NFC.
Obviously, the examples we know about (such as the use of
fullwidth or halfwidth forms of Kana) would be covered by an
"IDNA2003 compatibility" exception/position, but I don't know
how to exclude the possibility of similar situations arising in
the future, with newer versions of Unicode.   We could hand-wave
past many of those situations by noting that the characters are
first encoded in some non-Unicode CCS, then converted to
Unicode, and trying to pretend that the mapping occurs during
the conversion.  But, in some cases, that would not be
consistent with reality and, more important, it would leave us
with a definitional problem if those systems were switched to
Unicode-native input and output in the future.
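
To make case (1) concrete, here is a minimal Python sketch, not
anything taken from the IDNA documents.  It assumes that NFKC is an
acceptable stand-in for the compatibility mapping IDNA2003 applies
(Nameprep uses NFKC as one of its steps), and the sample string is
invented for illustration.  It shows that NFC alone leaves halfwidth
and fullwidth forms untouched, so something at the interface has to
do the mapping.

```python
import unicodedata

# What the user "types": HALFWIDTH KATAKANA LETTER KA (U+FF76)
# followed by FULLWIDTH LATIN SMALL LETTER W (U+FF57).
# (Illustrative input only.)
typed = "\uff76\uff57"

# NFC applies canonical mappings only, so the width variants survive.
nfc = unicodedata.normalize("NFC", typed)

# NFKC also applies compatibility mappings, which fold the width
# variants to KATAKANA LETTER KA (U+30AB) and ASCII "w".
nfkc = unicodedata.normalize("NFKC", typed)

print([hex(ord(c)) for c in nfc])   # code points after NFC
print([hex(ord(c)) for c in nfkc])  # code points after NFKC
```

The code points printed for NFC are the same ones that were typed;
only the NFKC result contains the ordinary Katakana and ASCII
characters.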

(2) We've had a number of situations over the years where people
have made very strong claims that the structure of Unicode is
just inappropriate for their scripts.  As Mark has pointed out,
they usually eventually get over it.  But, during whatever
process occurs, it would be good to have a model in the protocol
for making whatever local conversions they are convinced are
appropriate rather than effectively saying "until you completely
accept Unicode as it is defined, you cannot use IDNs with your
script".  Certainly the IETF does not want to be in a position
of having to referee those disputes.  I would prefer to not try
to put this in the documents, lest it be construed as a
criticism of Unicode, but that doesn't make the issue less real.

That said, I'm not sure I see the same risks that you do if
things are defined this way.   If one wants something to work in
all cases, one will use the (reduced) U-label form.   If one
wants something in protocols, one should probably use the
A-label form.  I've got a device here whose "keyboard" contains
specific keys for "www." and for ".com" and several other TLDs.
My using those keys is a UI convention: I press the single keys,
but the four-letter strings go into the file (or browser
location bar, or whatever).  These provisions for local mapping
are not really different from those specialized multi-character
keys if the conversions are done immediately.
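
A rough Python sketch of that "convert immediately, store only the
mapped form" idea follows.  Everything named here is hypothetical:
the key tokens, KEY_EXPANSIONS, expand_keys and to_stored_form are
made up for illustration.  It also leans on Python's built-in "idna"
codec, which implements IDNA2003 (RFC 3490) rather than the IDNA2008
rules under discussion, so treat it as an illustration of the
pattern and not of the new protocol.

```python
# Hypothetical single-key shortcuts, like the "www." and ".com" keys
# on the device described above.
KEY_EXPANSIONS = {"<WWW>": "www.", "<COM>": ".com"}


def expand_keys(keys):
    """Expand shortcut keys to the strings they stand for."""
    return "".join(KEY_EXPANSIONS.get(k, k) for k in keys)


def to_stored_form(ui_text):
    """Map at the interface and return the A-label form.

    Assumption: the stdlib "idna" codec (IDNA2003) stands in for
    whatever local mapping the interface applies.
    """
    return ui_text.encode("idna").decode("ascii")


# The user presses three "keys"; only the mapped result is kept.
typed = expand_keys(["<WWW>", "Bücher", "<COM>"])
stored = to_stored_form(typed)

print(typed)   # what the interface saw
print(stored)  # what goes into the file or onto the network
```

The point of the sketch is only that the interface-level string
never leaves the interface: the file or the wire sees the A-label
form and nothing else.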

   john





