idna-mapping update

Mon Dec 21 07:51:03 CET 2009

--On Friday, December 18, 2009 19:13 -0800 Michel SUIGNARD
<Michel at suignard.com> wrote:

> I'd like to give a new feedback to that statement. The issue
> some of us have with the current recommendation in
> idna-mappings [draft-ietf-idnabis-mappings-05] is that it is
> vastly different from the mapping done in IDNA_2003,
> especially concerning compatibility mapping done beyond the
> narrow/wide mapping suggested in the current document. The
> solution proposes the referencing of a single mapping table,
> improving greatly odds that implementers will do the right
> thing. Finally, it makes trivial for the draft Unicode TR46 to
> refer to a common mapping definition, avoiding potential
> confusion and unnecessary duplication.

Michel,

With the understanding that I'm speaking for myself only, that I
was not significantly involved in the selection of the
recommendations in draft-ietf-idnabis-mappings-05, and that,
while I think my perspective may be shared by others, I'll let
them speak for themselves....

I think these comments are complementary to Paul Hoffman's and
Vint's.  I don't know if the three of us actually agree, but I
find nothing in either of their notes to disagree with.

One of the WG's starting premises is that the range of
characters permitted by IDNA2003, and some of the mappings from
unusual character forms, provided opportunities for problems with
little or no positive payoff.  That is not a criticism of NFKC
or NFKC_CF.  Indeed, it is consistent with the general advice of
TUS and UAX 15 that normalization should be chosen to be
appropriate to the needs of particular applications.  The WG
observed that domain name labels are often short (too short to
establish language context), that they are often not actually
words in any given language, and that there was no practical way
to impose protocol-level restrictions on mixing scripts.  We
also observed that there is a perception in the community that
phishing is a major risk with unrestricted use of IDNs and,
while the WG concluded that it could not solve that problem and
should not make per-character decisions on that basis, there was
no point in going out of our way to make the job of the phisher
easier.

I think there is general agreement in the WG on those
principles.  Not unanimity, but much more than what is often
described in the IETF as "rough consensus".  I note that, had
the WG not wanted to discriminate among characters in those ways
(and to achieve a canonical and fully reversible mapping between
what we now call A-labels and U-labels and achieve other goals),
but instead preferred absolute compatibility with IDNA2003, it
would have been sensible to adopt one of the several proposals,
including yours, to simply update IDNA2003 from Unicode 3.2 to
Unicode 5.x.   That was definitely the path not taken, again
with fairly general support.

Now, speaking as an outsider to internal Unicode decisions, I
see canonical character relationships as very different from
compatibility ones.   The former, paraphrasing statements in
TUS, are used to resolve different codings of exactly the same
characters.  There is no question (at least I think there isn't)
that that adjustment is appropriate.  And its appropriateness is
why IDNA2008 requires that the input to its processing steps be
NFC-compatible strings.    But the compatibility relationships
are more complex, partially because several types of
relationships are lumped together as compatibility (those
different types of relationships were explored by the WG during
the discussions leading up to the Mapping document).   There are
strong arguments for mapping characters together if the
compatibility-equivalent character might be more easily typed
than the base one and there is strong evidence that
substantially all users would consider the two characters ([sets
of] code points) equivalent under all circumstances.  At the
same time, my understanding is that, other than the most obvious
cases, almost all compatibility characters in Unicode are
present because someone thought that they really represented
different characters or concepts.   
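
The canonical/compatibility distinction above can be illustrated
with a minimal sketch, assuming Python's standard unicodedata
module and two sample characters chosen here for illustration
(they are not taken from the discussion itself):

```python
import unicodedata

# Canonical equivalence: two codings of exactly the same character.
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
nfc_same = unicodedata.normalize("NFC", decomposed) == precomposed

# Compatibility equivalence: NFC leaves the character alone,
# while NFKC folds it into its compatibility equivalent.
wide_a = "\uff41"        # FULLWIDTH LATIN SMALL LETTER A
nfc_wide = unicodedata.normalize("NFC", wide_a)
nfkc_wide = unicodedata.normalize("NFKC", wide_a)
```

The wide/narrow case is the kind of mapping the current Mapping
document does suggest; the sketch only shows where NFC stops and
NFKC continues.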

Not mapping those characters together for IDNA purposes lowers
the risks of confusion of the compatibility character with
something else entirely and, should the unlikely circumstance
arise in which someone in the future successfully argues that
the compatibility character really should be distinct, we will
"merely" have to go through the pain of changing a character
from DISALLOWED to PVALID.  We will avoid the issues that have
plagued us with, e.g., Eszett, namely having to guess whether a
different (distinct) character was really intended rather than
the one in the database.

There are also compatibility characters that are mapped under
IDNA2003 that people would use in domain names only with the
intent of causing mischief or in an excess of cuteness, either
of which can turn into a security problem with no real
advantages to identifier quality.  It is consistent with other
WG decisions, IMO, to discourage any use of those characters,
even as mapping sources.

Now, against that backdrop, let's examine the example characters
your note proposed to map (I've reordered your list slightly to
make explanation easier).

> 	00AA ( ª ) => 0061 ( a ) # FEMININE ORDINAL INDICATOR
> 	00BA ( º ) => 006F ( o ) # MASCULINE ORDINAL INDICATOR

No one has provided any justification for using Ordinal
Indicators in domain name labels, and you are proposing to map
them out anyway.  As such, they are essentially just
reduced-size superscript characters.  See below.

> 	00B9 ( ¹ ) => 0031 ( 1 ) # SUPERSCRIPT ONE
> 	00B2 ( ² ) => 0032 ( 2 ) # SUPERSCRIPT TWO
> 	00B3 ( ³ ) => 0033 ( 3 ) # SUPERSCRIPT THREE

No one has provided any justification for having superscripts
appear in domain name labels.  They are likely to be confusing
in IRI contexts (users unable to tell whether they match the
base characters or not). 

The five cases above are problematic for another reason (shared
by a few of those below), which is that they map non-ASCII
characters, which would hence invoke IDN treatment, into
ordinary ASCII strings, which do not.   That makes the potential
for interactions with other issues much more severe, as we have
seen with Sharp-S.  It seems to me that we need to have
DNS/IDN-related reasons to go looking for that kind of trouble.
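
A rough way to see the non-ASCII-to-ASCII effect for these five
code points, sketched with Python's unicodedata module (plain NFKC
is used here as a stand-in for the proposed mapping table, and the
helper name is hypothetical):

```python
import unicodedata

# The five code points discussed above.
examples = ["\u00aa", "\u00ba", "\u00b9", "\u00b2", "\u00b3"]

def nfkc_is_ascii(ch):
    """Return (mapped, flag); flag is True when NFKC turns a
    non-ASCII character into a pure-ASCII string."""
    mapped = unicodedata.normalize("NFKC", ch)
    return mapped, (not ch.isascii()) and mapped.isascii()

results = {ch: nfkc_is_ascii(ch) for ch in examples}
```

Each entry comes back flagged, i.e. a label that would have
invoked IDN treatment ends up as an ordinary ASCII string.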

> 	00B5 ( µ ) => 03BC ( μ ) # MICRO SIGN

"Micro Sign" is a symbol, and hence DISALLOWED under a more
basic rule even if it were not a compatibility equivalent.  By
contrast, U+03BC is a perfectly normal Greek character.  Again,
there is no possible reason for using Micro Sign in a DNS label
unless one intends its symbol meaning or is trying to get around
rules against mixing scripts.  If a lookup client application
wants to test names for reasonableness and to warn against
unreasonable ones (as some clients have done even with
IDNA2003), it would presumably want to test the pre-mapping
strings, because error messages about the target strings would
not be intelligible to users.  It is worth noting that related
issues about error or warning reporting are another reason why
wholesale mapping is undesirable.
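
As a sketch of what testing the pre-mapping string might look
like, the hypothetical helper below reports, by character name,
what a mapping would change in the label the user actually typed.
It assumes Python's unicodedata module, uses plain NFKC as a
stand-in for whatever mapping a client applies, and the sample
label is invented:

```python
import unicodedata

def mapping_report(label):
    """List (original name, mapped names) for each character
    that NFKC would change.  Works per character, so it ignores
    effects that only arise across combining sequences."""
    report = []
    for ch in label:
        mapped = unicodedata.normalize("NFKC", ch)
        if mapped != ch:
            names = [unicodedata.name(m, "UNNAMED") for m in mapped]
            report.append((unicodedata.name(ch, "UNNAMED"), names))
    return report

report = mapping_report("\u00b5sec")
```

A warning built from this report can name the character the user
typed (MICRO SIGN) rather than only the Greek letter it becomes.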

> 	0130 ( İ ) => 0069 0307 ( i̇ ) # LATIN CAPITAL LETTER I WITH DOT ABOVE

This opens up the entire dotted and dotless "i" mess.  Do you
have a substantive, IDN/DNS-related reason to believe the
mapping would be desirable and worth the marginal confusion
opportunities it would cause?
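
One small piece of that mess can be seen with nothing more than
Python's built-in, locale-independent case conversions (a sketch;
the variable names are illustrative only):

```python
dotted_capital = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small = "\u0131"    # LATIN SMALL LETTER DOTLESS I

# Lowercasing the dotted capital yields two code points:
# "i" followed by COMBINING DOT ABOVE (U+0307).
lowered = dotted_capital.lower()

# Uppercasing dotless i gives plain "I", which lowercases to
# plain "i", so the round trip does not return the original.
round_trip = dotless_small.upper().lower()
```

Neither result matches what a user who distinguishes the dotted
and dotless letters would expect.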

>  0132 ( Ĳ ) => 0069 006A ( ij ) # LATIN CAPITAL LIGATURE IJ
> 	01F3 ( ǳ ) => 0064 007A ( dz ) # LATIN SMALL LETTER DZ
> 	017F ( ſ ) => 0073 ( s ) # LATIN SMALL LETTER LONG S 

As you presumably know, these historical ligatures raise complex
issues and, while the communities are smaller (or at least less
present in Unicode and IDN circles so far), issues fully as
passionate as those that surround Sharp-S.  If the
composition/decomposition relationships were uncontroversial,
they would be handled by NFC.  It seems to me to be safer to
DISALLOW and not map them, especially if there is the slightest
possibility of the relevant communities successfully arguing
that they ought to be treated as independent characters (the
argument might be summarized as "why are æ (U+00E6) and œ
(U+0153) treated as independent, PVALID, characters while ĳ,
ǳ, and ſ are not?").  The observation that some of these
ligatures create additional confusion points between
Roman-derived characters and Cyrillic ones is probably an
additional argument to discourage mapping them unless there is a
strong IDN/DNS reason for doing so.
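
The asymmetry behind that question shows up directly in the
Unicode data.  The sketch below uses Python's unicodedata module;
nfkc_fold is a hypothetical helper that only approximates NFKC
plus case folding and is not the exact NFKC_CF mapping:

```python
import unicodedata

def nfkc_fold(s):
    # Rough approximation: NFKC, case fold, NFKC again.
    folded = unicodedata.normalize("NFKC", s).casefold()
    return unicodedata.normalize("NFKC", folded)

ligatures = {
    "\u0132": nfkc_fold("\u0132"),  # LATIN CAPITAL LIGATURE IJ
    "\u01f3": nfkc_fold("\u01f3"),  # LATIN SMALL LETTER DZ
    "\u017f": nfkc_fold("\u017f"),  # LATIN SMALL LETTER LONG S
}
independent = {
    "\u00e6": nfkc_fold("\u00e6"),  # LATIN SMALL LETTER AE
    "\u0153": nfkc_fold("\u0153"),  # LATIN SMALL LIGATURE OE
}
```

The first group is rewritten to plain ASCII letters, while the
second group passes through unchanged.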

>  01C4 ( Ǆ ) => 0064 017E ( dž ) # LATIN CAPITAL LETTER DZ WITH CARON
>  01C4 ( Ǆ ) => 0064 017E ( dž ) # LATIN CAPITAL LETTER DZ WITH CARON

See comments above and the observation about mappings of this
sort that, if not handled properly as part of NFC, are just
invitations to confusion.

>	013F ( Ŀ ) => 006C 00B7 ( l• ) # LATIN CAPITAL LETTER L WITH MIDDLE DOT
>	0140 ( ŀ ) => 006C 00B7 ( l• ) # LATIN SMALL LETTER L WITH MIDDLE DOT

As you probably know, this code point or decomposition gets
involved with the ela geminada digraph problem, which the
Catalan community (and gTLD, incidentally) believes has been
mishandled in Unicode.  In the absence of input from them, it
seems to me to be dangerous to perform this mapping, and we have
had no such input.

> 	0149 ( ŉ ) => 02BC 006E ( ʼn ) # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE

You may reasonably disagree with one or more of the explanations
above, and I imagine we would find many more characters to
disagree about if we compared the full list.  But my point is
that, when looked at primarily from a DNS, IDN, and
anti-confusion perspective, there are sound reasons for not
mapping many of  them.  

And that brings us to the two areas where I think our
assumptions differ in a fundamental way.  I see the principal
goal of the WG as trying to define a model for IDNs that will
serve us well into the very long term future, a future with an
Internet that is much larger and much more diverse along a whole
series of dimensions, languages and writing systems among them.
I see compatibility with IDNA2003 to be part of that goal,
especially when one can reduce confusion by having more
compatibility, but as distinctly subsidiary to having things
work better and more predictably vis-a-vis end user expectations
in that expanded Internet future.  In that regard, conformance
--at the UI level-- to user expectations about identical
characters that might be different as a consequence of entry
conventions (e.g., Asian narrow and wide characters, upper and
lower case equivalences when that does not lead to either
unexpected ambiguity or transformation of what users think of as
one character into another (or a string)) is very important.  To
the extent to which NFKC_CF can contribute to that goal, it is
useful.  But conformance to NFKC_CF as a goal in itself is not
particularly relevant to me if it interferes with those other
objectives.

Now, by inspection (i.e., without making judgments about the
intent of the author(s)), TR46 seems to start from another
assumption: that conformance with Unicode norms generally, and
NFKC_CF in particular, is a useful, if not the primary, goal.
It isn't the goal of the WG.  If it were, we would have accepted
one of those "update IDNA2003 and Stringprep to incorporate
Unicode 5.x" proposals.

In that light, TR46 isn't a well-established and widely
implemented and deployed standard that we should be looking at
as a model for IDNs.  Instead, it is a position of the Unicode
Technical Committee (or some of its members) about what the WG
should have done instead of the Mappings document or, perhaps,
instead of the base IDNA2008 documents themselves.   UTC is
certainly entitled to that opinion but the point remains that it
was derived from fundamentally different base assumptions.
Suggesting that, as an independent goal, the IETF conform to it,
or to NFKC_CF, as ends in themselves (with or without the
exceptions already agreed to about NFKC_CF, exceptions that
include treating Eszett and Final Sigma as separate characters
and not mapping ZWJ and ZWNJ to nothing) just does not seem
reasonable.

regards,
   john