Mark Davis ☕
mark at macchiato.com
Sat Dec 26 00:55:42 CET 2009
My bandwidth is extremely limited until we get back to the states, so I will
be brief. Please forgive me if by being brief, I am also overly brusque.
1. I have not been able to follow the 4 deviation character discussion,
but it appears that there is agreement on some transition strategies that
will work; a key approach appears to be to forgo mapping on the client side
if one is sure that the zone bundles, and otherwise to map.
2. Given that, I'd anticipate that the UTC would modify TR46 to provide (a)
support for symbols for some transitional period, and (b) a standard mapping.
The rest of my comments are on the mapping issue.
3. One uniform mapping would be better than multiple, inconsistent ones.
4. While one could argue either way, the advantage of the TR46 mapping is
that it preserves compatibility with IDNA2003.
5. The current IDNA2008 mapping wouldn't maintain that compatibility,
falls short in a number of cases for languages that don't have case/width
issues, and has a number of formal problems.
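To make the kind of mapping under discussion concrete, here is a rough sketch of the IDNA2003/Nameprep-style operation (case folding followed by NFKC) using Python's standard library. This is illustrative only; it is not the exact table from TR46 or from the IDNA2008 mappings draft.

```python
import unicodedata

def idna2003_style_map(label: str) -> str:
    """Case-fold, then NFKC-normalize: roughly what IDNA2003's
    Nameprep did. Illustrative sketch, not either spec's table."""
    return unicodedata.normalize("NFKC", label.casefold())

# A few of the characters debated in this thread:
for ch in ["\u00aa", "\u017f", "\uff21", "\u00b5"]:  # ª ſ Ａ µ
    print(f"U+{ord(ch):04X} {ch!r} -> {idna2003_style_map(ch)!r}")
```

Under this style of mapping, ª, long s, fullwidth Ａ, and the micro sign all collapse to ordinary letters, which is exactly the compatibility behavior at issue.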
6. We have major vendors that intend to implement the TR46 mappings; I
don't know of any that have signed up to implement the current IDNA2008
mapping.
7. The supposed argument from "harm" is specious.
8. First, there is a mixup below. If X is confusable with a PVALID Y, it
is no problem to map X to Y; it would only be a (theoretical) problem if X
were mapped to a PVALID Z.
9. Vastly more importantly, the argument from "harm" is faith-based, not
data-based. I don't have access here, but I previously posted notes on the
relative frequencies of spoofing techniques. From that data:
1. Spoofing with confusable characters is FAR below spoofing with
syntax (like http://safe-amazon.com) in frequency.
2. There are essentially no letters that can be spoofed with the
mapped characters that can't *also* be spoofed with other letters that
are PVALID anyway.
10. In sum, allowing the additional mappings makes *no* significant
difference in the ability to spoof.
11. Best would be to incorporate the TR46 mappings into IDNA2008. Second
best would be to reference them; third would be to remove the idna2008
mappings document, and fourth would be to leave them as is, and just deal
with the muddle that results.
On Mon, Dec 21, 2009 at 17:51, John C Klensin <klensin at jck.com> wrote:
> --On Friday, December 18, 2009 19:13 -0800 Michel SUIGNARD
> <Michel at suignard.com> wrote:
> > I'd like to give a new feedback to that statement. The issue
> > some of us have with the current recommendation in
> > idna-mappings [draft-ietf-idnabis-mappings-05] is that it is
> > vastly different from the mapping done in IDNA2003,
> > especially concerning compatibility mapping done beyond the
> > narrow/wide mapping suggested in the current document. The
> > solution proposes referencing a single mapping table,
> > greatly improving the odds that implementers will do the
> > right thing. Finally, it makes it trivial for the draft Unicode TR46 to
> > refer to a common mapping definition, avoiding potential
> > confusion and unnecessary duplication.
> With the understanding that I'm speaking for myself only, that I
> was not significantly involved in the selection of the
> recommendations in draft-ietf-idnabis-mappings-05, and that,
> while I think my perspective may be shared by others, I'll let
> them speak for themselves....
> I think these comments are complementary to Paul Hoffman's and
> Vint's. I don't know if the three of us actually agree, but I
> find nothing in either of their notes to disagree with.
> One of the WG's starting premises is that the range of
> characters permitted by IDNA2003, and some of the mappings from
> unusual character forms, provided opportunities for problems with
> little or no positive payoff. That is not a criticism of NFKC
> or NFKC_CF. Indeed, it is consistent with the general advice of
> TUS and UAX 15 that normalization should be chosen to be
> appropriate to the needs of particular applications. The WG
> observed that domain name labels are often short (too short to
> establish language context), that they are often not actually
> words in any given language, and that there was no practical way
> to impose protocol-level restrictions on mixing scripts. We
> also observed that there is a perception in the community that
> phishing is a major risk with unrestricted use of IDNs and,
> while the WG concluded that it could not solve that problem and
> should not make per-character decisions on that basis, there was
> no point in going out of our way to make the job of the phisher easier.
> I think there is general agreement in the WG on those
> principles. Not unanimity, but much more than what is often
> described in the IETF as "rough consensus". I note that, had
> the WG not wanted to discriminate among characters in those ways
> (and to achieve a canonical and fully reversible mapping between
> what we now call A-labels and U-labels and achieve other goals),
> but instead preferred absolute compatibility with IDNA2003, it
> would have been sensible to adopt one of the several proposals,
> including yours, to simply update IDNA2003 from Unicode 3.2 to
> Unicode 5.x. That was definitely the path not taken, again
> with fairly general support.
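For readers following along, the A-label/U-label pairing mentioned above can be demonstrated with Python's built-in idna codec, which implements the IDNA2003-era rules (a sketch for illustration, not IDNA2008 behavior):

```python
# The classic example: the U-label "bücher" and its A-label form.
u_label = "bücher"
a_label = u_label.encode("idna")      # ToASCII: U-label -> A-label
print(a_label)                        # b'xn--bcher-kva'

round_trip = a_label.decode("idna")   # ToUnicode: A-label -> U-label
print(round_trip)                     # 'bücher'
```

The canonical, fully reversible relationship between the two forms is one of the WG goals John refers to.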
> Now, speaking as an outsider to internal Unicode decisions, I
> see canonical character relationships as very different from
> compatibility ones. The former, paraphrasing statements in
> TUS, are used to resolve different codings of exactly the same
> characters. There is no question (at least I think there isn't)
> that that adjustment is appropriate. And its appropriateness is
> why IDNA2008 requires that the input to its processing steps be
> NFC-compatible strings. But the compatibility relationships
> are more complex, partially because several types of
> relationships are lumped together as compatibility (those
> different types of relationships were explored by the WG during
> the discussions leading up to the Mapping document). There are
> strong arguments for mapping characters together if the
> compatibility-equivalent character might be more easily typed
> than the base one and there is strong evidence that
> substantially all users would consider the two characters ([sets
> of] code points) equivalent under all circumstances. At the
> same time, my understanding is that, other than the most obvious
> cases, almost all compatibility characters in Unicode are
> present because someone thought that they really represented
> different characters or concepts.
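The canonical-versus-compatibility distinction drawn above can be checked directly with Python's unicodedata module (an illustrative sketch):

```python
import unicodedata

# Canonical equivalence: two encodings of the *same* character.
composed = "\u00e9"       # é, precomposed
decomposed = "e\u0301"    # e + COMBINING ACUTE ACCENT
assert unicodedata.normalize("NFC", decomposed) == composed

# Compatibility equivalence: arguably *different* characters.
long_s = "\u017f"         # ſ LATIN SMALL LETTER LONG S
assert unicodedata.normalize("NFC", long_s) == long_s   # NFC keeps it
assert unicodedata.normalize("NFKC", long_s) == "s"     # NFKC folds it
```

NFC only merges alternate codings of one character, which is why IDNA2008 can safely require NFC input; NFKC additionally folds characters that some communities consider distinct.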
> Not mapping those characters together for IDNA purposes lowers
> the risks of confusion of the compatibility character with
> something else entirely and, should the unlikely circumstance
> arise in which someone in the future successfully argues that
> the compatibility character really should be distinct, we will
> "merely" have to go through the pain of changing a character
> from DISALLOWED to PVALID. We will avoid the issues that have
> plagued us with, e.g., Eszett, namely having to guess whether a
> different (distinct) character was really intended rather than
> the one in the database.
> There are also compatibility characters that are mapped under
> IDNA2003 that people would use in domain names only with the
> intent of causing mischief or in an excess of cuteness, either
> of which can turn into a security problem with no real
> advantages to identifier quality. It is consistent with other
> WG decisions, IMO, to discourage any use of those characters,
> even as mapping sources.
> Now, against that backdrop, let's examine the example characters
> your note proposed to map (I've reordered your list slightly to
> make explanation easier).
> > 00AA ( ª ) => 0061 ( a ) # FEMININE ORDINAL INDICATOR
> > 00BA ( º ) => 006F ( o ) # MASCULINE ORDINAL INDICATOR
> No one has provided any justification for using Ordinal
> Indicators in domain name labels, and you are proposing to map
> them out anyway. As such, they are essentially just
> reduced-size superscript characters. See below.
> > 00B9 ( ¹ ) => 0031 ( 1 ) # SUPERSCRIPT ONE
> > 00B2 ( ² ) => 0032 ( 2 ) # SUPERSCRIPT TWO
> > 00B3 ( ³ ) => 0033 ( 3 ) # SUPERSCRIPT THREE
> No one has provided any justification for having superscripts
> appear in domain name labels. They are likely to be confusing
> in IRI contexts (users unable to tell whether they match the
> base characters or not).
> The five cases above are problematic for another reason (shared
> by a few of those below), which is that they map non-ASCII
> characters, which would hence invoke IDN treatment, into
> ordinary ASCII strings, which do not. That makes the potential
> for interactions with other issues much more severe, as we have
> seen with Sharp-S. It seems to me that we need to have
> DNS/IDN-related reasons to go looking for that kind of trouble.
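The non-ASCII-to-ASCII concern above is easy to demonstrate: NFKC turns a label containing superscript digits into a plain ASCII string that would receive no IDN treatment at all. The label "shop²" below is a hypothetical example, not from the thread.

```python
import unicodedata

label = "shop\u00b2"   # hypothetical label containing "²"
mapped = unicodedata.normalize("NFKC", label)
print(mapped)            # 'shop2' -- superscript folded to a digit
print(mapped.isascii())  # the mapped label is now ordinary ASCII
```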
> > 00B5 ( µ ) => 03BC ( μ ) # MICRO SIGN
> "Micro Sign" is a symbol, and hence DISALLOWED under a more
> basic rule even if it were not a compatibility equivalent. By
> contrast, U+03BC is a perfectly normal Greek character. Again,
> there is no possible reason for using Micro Sign in a DNS label
> unless one intends its symbol meaning or to try to get around
> rules against mixing scripts (if a lookup client application
> wants to test names for reasonableness and to warn against
> unreasonable ones --as some clients have done even with
> IDNA2003-- they would presumably want to test the pre-mapping
> strings because error messages about the target strings would
> not be intelligible to users (it is worth noting that related
> issues about error or warning reporting are another reason why
> wholesale mapping is undesirable)).
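The script-mixing point about the micro sign can be seen directly: the compatibility mapping silently moves the code point from a Latin-1 symbol to a Greek letter, so a check run on the pre-mapping string and one run on the post-mapping string see different scripts (illustrative sketch):

```python
import unicodedata

micro = "\u00b5"                              # MICRO SIGN
mu = unicodedata.normalize("NFKC", micro)     # folds to U+03BC
print(unicodedata.name(micro))                # MICRO SIGN
print(unicodedata.name(mu))                   # GREEK SMALL LETTER MU
# A label typed with the symbol quietly becomes Greek text, which is
# why warnings based on the post-mapping string may confuse users.
```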
> > 0130 ( İ ) => 0069 0307 ( i̇ ) # LATIN CAPITAL LETTER I
> WITH DOT ABOVE
> This opens up the entire dotted and dotless "i" mess. Do you
> have a substantive, IDN/DNS-related reason to believe the
> mapping would be desirable and worth the marginal confusion
> opportunities it would cause?
> > 0132 ( Ĳ ) => 0069 006A ( ij ) # LATIN CAPITAL LIGATURE IJ
> > 01F3 ( ǳ ) => 0064 007A ( dz ) # LATIN SMALL LETTER DZ
> > 017F ( ſ ) => 0073 ( s ) # LATIN SMALL LETTER LONG S
> As you presumably know, these historical ligatures raise complex
> issues and, while the communities are smaller (or at least less
> present in Unicode and IDN circles so far), issues fully as
> passionate as those that surround Sharp-S. If the
> composition/decomposition relationships were uncontroversial,
> they would be handled by NFC. It seems to me to be safer to
> DISALLOW and not map them, especially if there is the slightest
> possibility of the relevant communities successfully arguing
> that they ought to be treated as independent characters (the
> argument might be summarized as "why are æ (U+00E6) and œ
> (U+0153) treated as independent, PVALID, characters while ĳ,
> ǳ, and ſ are not?"). The observation that some of these
> ligatures create additional confusion points between
> Roman-derived characters and Cyrillic ones is probably an
> additional argument to discourage mapping them unless there is a
> strong IDN/DNS reason for doing so.
> > 01C4 ( Ǆ ) => 0064 017E ( dž ) # LATIN CAPITAL LETTER DZ
> WITH CARON
> See comments above and the observation about mappings of this
> sort that, if not handled properly as part of NFC, are just
> invitations to confusion.
> > 013F ( Ŀ ) => 006C 00B7 ( l· ) # LATIN CAPITAL LETTER L
> WITH MIDDLE DOT
> > 0140 ( ŀ ) => 006C 00B7 ( l· ) # LATIN SMALL LETTER L WITH
> MIDDLE DOT
> As you probably know, this code point or decomposition gets
> involved with the ela geminada digraph problem, which the
> Catalan community (and gTLD, incidentally) believes has been
> mishandled in Unicode. In the absence of input from them, it
> seems to me to be dangerous to perform this mapping, and we have
> had no such input.
> > 0149 ( ŉ ) => 02BC 006E ( ʼn ) # LATIN SMALL LETTER N
> PRECEDED BY APOSTROPHE
> You may reasonably disagree with one or more of the explanations
> above, and I imagine we would find many more characters to
> disagree about if we compared the full list. But my point is
> that, when looked at primarily from a DNS, IDN, and
> anti-confusion perspective, there are sound reasons for not
> mapping many of them.
> And that brings us to the two areas where I think our
> assumptions differ in a fundamental way. I see the principal
> goal of the WG as trying to define a model for IDNs that will
> serve us well into the very long term future, a future with an
> Internet that is much larger and much more diverse along a whole
> series of dimensions, languages and writing systems among them.
> I see compatibility with IDNA2003 to be part of that goal,
> especially when one can reduce confusion by having more
> compatibility, but as distinctly subsidiary to having things
> work better and more predictably vis-a-vis end user expectations
> in that expanded Internet future. In that regard, conformance
> --at the UI level-- to user expectations about identical
> characters that might be different as a consequence of entry
> conventions (e.g., Asian narrow and wide characters, upper and
> lower case equivalences when that does not lead to either
> unexpected ambiguity or transformation of what users think of as
> one character into another (or a string)) is very important. To
> the extent to which NFKC_CF can contribute to that goal, it is
> useful. But conformance to NFKC_CF as a goal in itself is not
> particularly relevant to me if it interferes with those other goals.
> Now, by inspection (i.e., without making judgments about the
> intent of the author(s)), TR46 seems to start from another
> assumption, an assumption that conformance with Unicode norms
> generally and NFKC_CF in particular, is a useful, if not primary
> goal. It isn't the goal of the WG. If it were, we would have
> accepted one of those "update IDNA2003 and Stringprep to
> incorporate Unicode 5.x" proposals.
> In that light, TR46 isn't a well-established and widely
> implemented and deployed standard that we should be looking at
> as a model for IDNs. Instead, it is a position of the Unicode
> Technical Committee (or some of its members) about what the WG
> should have done instead of the Mappings document or, perhaps,
> instead of the base IDNA2008 documents themselves. UTC is
> certainly entitled to that opinion but the point remains that it
> was derived from fundamentally different base assumptions.
> Suggesting that, as an independent goal, the IETF conform to it,
> or to NFKC_CF, as ends in themselves (with or without the
> exceptions already agreed to about NFKC_CF, assumptions that
> include treating Eszett and Final Sigma as separate characters
> and not mapping ZWJ and ZWNJ to nothing) just does not seem
> appropriate.