Nameprep and NFKC

John C Klensin klensin at jck.com
Wed Oct 13 23:26:04 CEST 2010



--On Wednesday, October 13, 2010 19:25 +0000 Shawn Steele
<Shawn.Steele at microsoft.com> wrote:

> Many people would use Unicode UTS#46 in addition to the
> IDNA2008 RFCs:
> 
> http://www.unicode.org/reports/tr46/ 
>...

Yes, and some of the tables Abdulrahman's note referred to are
dependent on UTR 46.  But this is exactly where mapping gets us
into trouble:

-- Use of UTR 46 with IDNA2008 reduces some incompatibilities
with IDNA2003 and may cause a few others.   If I correctly
understand Abdulrahman's example, it may be one of those
incompatibilities.

-- While "many people" would use UTR 46, "many people" may use
RFC 5895 instead, or no mapping at all.  They are not compatible.

-- The IDNA2008 specifications themselves (RFCs 5890-5893) are
very clear that input (and U-labels) must already be in NFC form.

So, unless a string that is NFC-valid in one version of Unicode
becomes invalid in the next version (something that is
guaranteed to not occur under the stability rules as I
understand them), programs conforming to IDNA2008 are going to
be completely stable with regard to validity of labels (absent
incompatible changes in Unicode properties that we decide to not
cancel out by adding the changed code points to the exception
list).   In the long term, programs that apply one type of
mapping or another before conceptual handoff to IDNA2008 are
likely to be less stable with regard to identifiers, identifier
interpretation, and identifier accessibility.  How much less
depends on what sort of mapping they do, the script involved,
and the luck of the draw.
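
As a rough illustration of the difference between requiring NFC
input and mapping before handoff, here is a minimal Python sketch
using only the standard unicodedata module.  The names is_nfc and
stand_in_mapping are hypothetical, and the lowercase-then-NFKC step
is only an assumed stand-in for "some mapping": it is not an
implementation of UTR 46, RFC 5895, or Nameprep.  Being in NFC is
also only one of the IDNA2008 requirements; code point validity is
not checked here.

```python
import unicodedata


def is_nfc(label: str) -> bool:
    """Return True if the label is already in NFC form."""
    return unicodedata.normalize("NFC", label) == label


def stand_in_mapping(label: str) -> str:
    """Hypothetical pre-lookup mapping: lowercase, then NFKC.

    Only a stand-in for "some mapping"; this is not UTR 46,
    RFC 5895, or Nameprep.
    """
    return unicodedata.normalize("NFKC", label.lower())


# Illustrative labels chosen for this sketch, not taken from the thread.
samples = {
    "precomposed e-acute": "caf\u00e9",
    "e + combining acute": "cafe\u0301",
    "fi ligature (U+FB01)": "\ufb01le",
    "fullwidth capital A": "\uff21bc",
}

for name, label in samples.items():
    mapped = stand_in_mapping(label)
    print(f"{name}: already NFC? {is_nfc(label)}; "
          f"changed by mapping? {mapped != label}")
```

The ligature and fullwidth cases are the interesting ones: both
strings are already in NFC, yet the stand-in mapping rewrites them.
A program that maps this way and a program that does no mapping
will therefore disagree about what the user typed, even though
both hand NFC strings to the protocol.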

I don't want to re-open the debate about mapping that was so
divisive in the WG --there would be no point having that
discussion again anyway-- but let me make an observation.  There
are, I think, three issues involved in making a decision about
what to map.  One is the quality of user experience, i.e.,
having those things happen which users would expect.  That issue
involves tradeoffs in and of itself: If one were to completely
optimize for the experience of individual users, making things
as predictable and comfortable as possible for each script and
culture, one necessarily ends up with different interpretations
of strings in different areas.  Conversely, having a single
global interpretation means that at least some users are going
to experience behavior they consider unnatural.  

The second is backward compatibility, i.e., between IDNA2003
(and how it was, or may have been, used) and IDNA2008.  The third
is forward compatibility: long-term stability from IDNA2008 (and
Unicode 5.2) into the future.  Backward and forward
compatibility are user experience issues too: a user who has
been using IDNs since 2003 (or earlier) may have learned
expectations that are different from those of someone who starts
seriously using IDNs for the first time with IDNA2008-compliant
systems, especially if the latter minimize mapping.

My own view is strongly (perhaps too strongly) conditioned by
two things.  First, I believe that the number of Internet users
for whom the primary language doesn't use only basic Latin
characters will be much, much larger in the future than it is
today; I prefer to have as good a fit as possible for them even
if it were to mean that some early adopters who have been using
IDNs heavily for the last six or seven years need to do some
relearning.   Second, I had a great deal of experience many
years ago with "DWIM" (do what I mean) programs and got to
experience the user astonishment and confusion when they guessed
wrong about what was intended (which, of course, they did
periodically).  That experience has left me feeling that, in the
long run at least, users prefer predictability and stability,
even if they have to learn some things rather than just having
the computer guess.  The combination makes me resistant to
mappings that are not completely obvious and especially ones
whose obviousness depends more on the design of Unicode than on
local "common knowledge" about how a writing system works.

YMMV and, again, I don't think there would be any value in
trying to reopen the debate to try to determine who is "right".
There is no "right", only different ways to understand and make
the tradeoffs.

best,
   john



