My bandwidth is extremely limited until we get back to the States, so I will be brief. Please forgive me if by being brief, I am also overly brusque.<div><ol><li>I have not been able to follow the 4 deviation character discussion, but it appears that there is agreement on some transition strategies that will work; a key approach appears to be to map on the client side if one is sure that the zone bundles, otherwise map.</li>

<li>Given that, I&#39;d anticipate that the UTC would modify TR46 to be (a) support of symbols for some transitional period, and (b) a standard mapping. The rest of my comments are on the mapping issue.</li><li>One uniform mapping would be better than multiple, inconsistent mappings.</li>

<li>While one could argue either way, the advantage of the TR46 mapping is that it preserves compatibility with IDNA2003.</li><li>The current IDNA2008 mapping wouldn&#39;t maintain that compatibility, falls short in a number of cases for languages that don&#39;t have case/width issues, and has a number of formal problems.</li>

<li>We have major vendors that intend to implement the TR46 mappings; I don&#39;t know of any that have signed up to implement the current IDNA2008 spec.</li><li>The supposed argument from &quot;harm&quot; is specious.</li>

<li>First, there is a mixup below. If X is confusable with a PVALID Y, it is no problem to map X to Y; it would only be a (theoretical) problem if X were mapped to a PVALID Z.</li><li>Vastly more importantly, the argument from &quot;harm&quot; is faith-based, not data-based. I don&#39;t have access here, but I previously posted notes on the relative frequencies of spoofing techniques. From that data:</li>

<ol><li>Spoofing with confusable characters is FAR below spoofing with syntax (like <a href="http://safe-amazon.com">http://safe-amazon.com</a>) in frequency.</li><li>There are essentially no letters that can be spoofed with the mapped characters that can&#39;t <b><i>also</i></b> be spoofed with other letters that are PVALID.</li>

</ol><li>In sum, allowing the additional mappings makes *no* significant difference in the ability to spoof.</li><li>Best would be to incorporate the TR46 mappings into IDNA2008. Second best would be to reference them; third would be to remove the IDNA2008 mappings document, and fourth would be to leave them as is, and just deal with the muddle that results.</li>

</ol>Mark</div><div>

<br><br><div class="gmail_quote">On Mon, Dec 21, 2009 at 17:51, John C Klensin <span dir="ltr">&lt;<a href="mailto:klensin@jck.com">klensin@jck.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

--On Friday, December 18, 2009 19:13 -0800 Michel SUIGNARD<br>

<div class="im">&lt;<a href="mailto:Michel@suignard.com">Michel@suignard.com</a>&gt; wrote:<br>

<br>

&gt; I&#39;d like to give new feedback on that statement. The issue<br>

&gt; some of us have with the current recommendation in<br>

&gt; idna-mappings [draft-ietf-idnabis-mappings-05] is that it is<br>

&gt; vastly different from the mapping done in IDNA_2003,<br>

&gt; especially concerning compatibility mapping done beyond the<br>

&gt; narrow/wide mapping suggested in the current document. The<br>

&gt; solution proposes the referencing of a single mapping table,<br>

&gt; improving greatly odds that implementers will do the right<br>

&gt; thing. Finally, it makes it trivial for the draft Unicode TR46 to<br>

&gt; refer to a common mapping definition, avoiding potential<br>

&gt; confusion and unnecessary duplication.<br>

<br>

<br>

</div>Michel,<br>

<br>

With the understanding that I&#39;m speaking for myself only, that I<br>

was not significantly involved in the selection of the<br>

recommendations in draft-ietf-idnabis-mappings-05, and that,<br>

while I think my perspective may be shared by others, I&#39;ll let<br>

them speak for themselves....<br>

<br>

I think these comments are complementary to Paul Hoffman&#39;s and<br>

Vint&#39;s.  I don&#39;t know if the three of us actually agree, but I<br>

find nothing in either of their notes to disagree with.<br>

<br>

One of the WG&#39;s starting premises is that the range of<br>

characters permitted by IDNA2003, and some of the mappings from<br>

unusual character form, provided opportunities for problems with<br>

little or no positive payoff.  That is not a criticism of NFKC<br>

or NFKC_CF.  Indeed, it is consistent with the general advice of<br>

TUS and UAX 15 that normalization should be chosen to be<br>

appropriate to the needs of particular applications.  The WG<br>

observed that domain name labels are often short (too short to<br>

establish language context), that they are often not actually<br>

words in any given language, and that there was no practical way<br>

to impose protocol-level restrictions on mixing scripts.  We<br>

also observed that there is a perception in the community that<br>

phishing is a major risk with unrestricted use of IDNs and,<br>

while the WG concluded that it could not solve that problem and<br>

should not make per-character decisions on that basis, there was<br>

no point in going out of our way to make the job of the phisher<br>

easier.<br>

<br>

I think there is general agreement in the WG on those<br>

principles.  Not unanimity, but much more than what is often<br>

described in the IETF as &quot;rough consensus&quot;.  I note that, had<br>

the WG not wanted to discriminate among characters in those ways<br>

(and to achieve a canonical and fully reversible mapping between<br>

what we now call A-labels and U-labels and achieve other goals),<br>

but instead preferred absolute compatibility with IDNA2003, it<br>

would have been sensible to adopt one of the several proposals,<br>

including yours, to simply update IDNA2003 from Unicode 3.2 to<br>

Unicode 5.x.   That was definitely the path not taken, again<br>

with fairly general support.<br>

<br>

Now, speaking as an outsider to internal Unicode decisions, I<br>

see canonical character relationships as very different from<br>

compatibility ones.   The former, paraphrasing statements in<br>

TUS, are used to resolve different codings of exactly the same<br>

characters.  There is no question (at least I think there isn&#39;t)<br>

that that adjustment is appropriate.  And its appropriateness is<br>

why IDNA2008 requires that the input to its processing steps be<br>

NFC-compatible strings.    But the compatibility relationships<br>

are more complex, partially because several types of<br>

relationships are lumped together as compatibility (those<br>

different types of relationships were explored by the WG during<br>

the discussions leading up to the Mapping document).   There are<br>

strong arguments for mapping characters together if the<br>

compatibility-equivalent character might be more easily typed<br>

than the base one and there is strong evidence that<br>

substantially all users would consider the two characters ([sets<br>

of] code points) equivalent under all circumstances.  At the<br>

same time, my understanding is that, other than the most obvious<br>

cases, almost all compatibility characters in Unicode are<br>

present because someone thought that they really represented<br>

different characters or concepts.<br>

<br>
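The canonical/compatibility distinction drawn above can be sketched in a few lines. This is only an illustration, assuming Python 3 and its standard unicodedata module; the sample characters are chosen for the example and do not come from the discussion.<br>

```python
import unicodedata

# Canonical equivalence: two codings of exactly the same character.
decomposed = "e\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
nfc_result = unicodedata.normalize("NFC", decomposed)

# Compatibility equivalence: NFC keeps the fullwidth letter, NFKC folds it.
fullwidth_a = "\uff41"    # FULLWIDTH LATIN SMALL LETTER A
kept_by_nfc = unicodedata.normalize("NFC", fullwidth_a)
folded_by_nfkc = unicodedata.normalize("NFKC", fullwidth_a)

print(nfc_result == precomposed)      # canonical forms unify under NFC
print(kept_by_nfc == fullwidth_a)     # NFC does not touch compatibility forms
print(folded_by_nfkc == "a")          # NFKC maps to the plain letter
```

NFC alone resolves only the canonical case; any compatibility folding has to be an explicit, separate decision.<br>
<br>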

Not mapping those characters together for IDNA purposes lowers<br>

the risks of confusion of the compatibility character with<br>

something else entirely and, should the unlikely circumstance<br>

arise in which someone in the future successfully argues that<br>

the compatibility character really should be distinct, we will<br>

&quot;merely&quot; have to go through the pain of changing a character<br>

from DISALLOWED to PVALID.  We will avoid the issues that have<br>

plagued us with, e.g., Eszett, namely having to guess whether a<br>

different (distinct) character was really intended rather than<br>

the one in the database.<br>

<br>
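As a rough illustration of the Eszett guessing problem, the sketch below uses the &quot;idna&quot; codec built into Python 3, which implements the IDNA2003 (RFC 3490) conversions. The sample label is an assumption made for the example.<br>

```python
# Python's built-in "idna" codec follows IDNA2003 (RFC 3490), including nameprep.
eszett_label = "stra\u00dfe"               # label spelled with U+00DF SHARP S
ascii_form = eszett_label.encode("idna")   # nameprep maps U+00DF to "ss"
round_trip = ascii_form.decode("idna")

print(ascii_form)
print(round_trip == eszett_label)          # False: the original spelling is lost
```

Once the mapped form is what gets stored, nothing in the stored name says which spelling the registrant intended.<br>
<br>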

There are also compatibility characters that are mapped under<br>

IDNA2003 that people would use in domain names only with the<br>

intent of causing mischief or in an excess of cuteness, either<br>

of which can turn into a security problem with no real<br>

advantages to identifier quality.  It is consistent with other<br>

WG decisions, IMO, to discourage any use of those characters,<br>

even as mapping sources.<br>

<br>

Now, against that backdrop, let&#39;s examine the example characters<br>

your note proposed to map (I&#39;ve reordered your list slightly to<br>

make explanation easier).<br>

<div class="im"><br>

&gt;       00AA ( ª ) =&gt; 0061 ( a ) # FEMININE ORDINAL INDICATOR<br>

</div><div class="im">&gt;       00BA ( º ) =&gt; 006F ( o ) # MASCULINE ORDINAL INDICATOR<br>

<br>

</div>No one has provided any justification for using Ordinal<br>

Indicators in domain name labels, and you are proposing to map<br>

them out anyway.  As such, they are essentially just<br>

reduced-size superscript characters.  See below.<br>

<div class="im"><br>

&gt;       00B9 ( ¹ ) =&gt; 0031 ( 1 ) # SUPERSCRIPT ONE<br>

</div><div class="im">&gt;       00B2 ( ² ) =&gt; 0032 ( 2 ) # SUPERSCRIPT TWO<br>

&gt;       00B3 ( ³ ) =&gt; 0033 ( 3 ) # SUPERSCRIPT THREE<br>

<br>

</div>No one has provided any justification for having superscripts<br>

appear in domain name labels.  They are likely to be confusing<br>

in IRI contexts (users unable to tell whether they match the<br>

base characters or not).<br>

<br>

The five cases above are problematic for another reason (shared<br>

by a few of those below), which is that they map non-ASCII<br>

characters, which would hence invoke IDN treatment, into<br>

ordinary ASCII strings, which do not.   That makes the potential<br>

for interactions with other issues much more severe, as we have<br>

seen with Sharp-S.  It seems to me that we need to have<br>

DNS/IDN-related reasons to go looking for that kind of trouble.<br>
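<br>
The non-ASCII-to-ASCII effect of those five mappings can be checked with a short sketch, assuming Python 3 and its standard unicodedata module. NFKC is used here as a stand-in for the compatibility mapping under discussion, not as the exact TR46 or IDNA2003 procedure.<br>

```python
import unicodedata

# The two ordinal indicators and three superscripts from the quoted list.
sources = "\u00aa\u00ba\u00b9\u00b2\u00b3"
mapped = unicodedata.normalize("NFKC", sources)

print(sources.isascii())   # False: the input would invoke IDN treatment
print(mapped.isascii())    # True: the mapped result is an ordinary ASCII string
print(mapped)
```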

<div class="im"><br>

&gt;       00B5 ( µ ) =&gt; 03BC ( μ ) # MICRO SIGN<br>

<br>

</div>&quot;Micro Sign&quot; is a symbol, and hence DISALLOWED under a more<br>

basic rule even if it were not a compatibility equivalent.  By<br>

contrast, U+03BC is a perfectly normal Greek character.  Again,<br>

there is no possible reason for using Micro Sign in a DNS label<br>

unless one intends its symbol meaning or to try to get around<br>

rules against mixing scripts (if a lookup client application<br>

wants to test names for reasonableness and to warn against<br>

unreasonable ones --as some clients have done even with<br>

IDNA2003-- they would presumably want to test the pre-mapping<br>

strings because error messages about the target strings would<br>

not be intelligible to users (it is worth noting that related<br>

issues about error or warning reporting are another reason why<br>

wholesale mapping is undesirable)).<br>
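<br>
The point about testing pre-mapping strings can be sketched as follows, assuming Python 3 and its standard unicodedata module. The first word of each character name is used as a crude script hint, which is an assumption of this example and not a real script-property lookup; the sample label is hypothetical.<br>

```python
import unicodedata

micro = "\u00b5"                                  # MICRO SIGN
mapped_micro = unicodedata.normalize("NFKC", micro)
print(unicodedata.name(micro), "maps to", unicodedata.name(mapped_micro))

def name_prefixes(text):
    # Crude script hint: first word of each Unicode character name.
    return {unicodedata.name(ch).split()[0] for ch in text}

label = "\u00b5soft"                              # hypothetical label
pre_mapping = name_prefixes(label)
post_mapping = name_prefixes(unicodedata.normalize("NFKC", label))

print(sorted(pre_mapping))    # what the user typed
print(sorted(post_mapping))   # what a check on the mapped string would see
```

A check run only on the mapped string would report a Greek letter mixed with Latin ones, although the user never typed a Greek letter.<br>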

<div class="im"><br>

&gt;       0130 ( İ ) =&gt; 0069 0307 ( i̇ ) # LATIN CAPITAL LETTER I<br>

WITH DOT ABOVE<br>

<br>

</div>This opens up the entire dotted and dotless &quot;i&quot; mess.  Do you<br>

have a substantive, IDN/DNS-related reason to believe the<br>

mapping would be desirable and worth the marginal confusion<br>

opportunities it would cause?<br>
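<br>
For readers unfamiliar with that mess, a minimal sketch using the default (locale-independent) case operations of Python 3 shows how the four i-like letters relate. The variable names are illustrative only.<br>

```python
capital_i_dot = "\u0130"       # LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_i = "\u0131"           # LATIN SMALL LETTER DOTLESS I

folded_dotted = capital_i_dot.casefold()   # i followed by COMBINING DOT ABOVE
folded_plain = "I".casefold()              # plain ASCII i
dotless_upper = dotless_i.upper()          # plain ASCII I

print([f"U+{ord(ch):04X}" for ch in folded_dotted])
print(folded_dotted == folded_plain)       # False: two distinct lowercase results
print(dotless_upper == "I")                # True: dotless i and i share a capital
```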

<div class="im"><br>

&gt;  0132 ( Ĳ ) =&gt; 0069 006A ( ij ) # LATIN CAPITAL LIGATURE IJ<br>

</div><div class="im">&gt;       01F3 ( ǳ ) =&gt; 0064 007A ( dz ) # LATIN SMALL LETTER DZ<br>

</div><div class="im">&gt;       017F ( ſ ) =&gt; 0073 ( s ) # LATIN SMALL LETTER LONG S<br>

<br>

</div>As you presumably know, these historical ligatures raise complex<br>

issues and, while the communities are smaller (or at least less<br>

present in Unicode and IDN circles so far), issues fully as<br>

passionate as those that surround Sharp-S.  If the<br>

composition/decomposition relationships were uncontroversial,<br>

they would be handled by NFC.  It seems to me to be safer to<br>

DISALLOW and not map them, especially if there is the slightest<br>

possibility of the relevant communities successfully arguing<br>

that they ought to be treated as independent characters (the<br>

argument might be summarized as &quot;why are æ (U+00E6) and œ<br>

(U+0153) treated as independent, PVALID, characters while ĳ,<br>

ǳ, and ſ are not?&quot;).  The observation that some of these<br>

ligatures create additional confusion points between<br>

Roman-derived characters and Cyrillic ones is probably an<br>

additional argument to discourage mapping them unless there is a<br>

strong IDN/DNS reason for doing so.<br>
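<br>
The asymmetry in that question is easy to reproduce. The sketch below assumes Python 3 and its standard unicodedata module, and approximates NFKC_CF by normalizing, case-folding, and normalizing again; the helper name is hypothetical and this is not the exact Unicode definition.<br>

```python
import unicodedata

def approx_nfkc_casefold(text):
    # Rough stand-in for NFKC_CF: normalize, case-fold, normalize again.
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)

# Three of the characters proposed for mapping, then two PVALID ligature-like letters.
samples = ["\u0132", "\u01f3", "\u017f", "\u00e6", "\u0153"]
results = {ch: approx_nfkc_casefold(ch) for ch in samples}

for ch, out in results.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {out!r}")
```

The first three fold into plain ASCII letters, while U+00E6 and U+0153 come through unchanged.<br>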

<div class="im"><br>

&gt;  01C4 ( Ǆ ) =&gt; 0064 017E ( dž ) # LATIN CAPITAL LETTER DZ<br>

WITH CARON<br>

<br>

</div>See comments above and the observation about mappings of this<br>

sort that, if not handled properly as part of NFC, are just<br>

invitations to confusion.<br>

<div class="im"><br>

&gt;       013F ( Ŀ ) =&gt; 006C 00B7 ( l· ) # LATIN CAPITAL LETTER L<br>

WITH MIDDLE DOT<br>

&gt;       0140 ( ŀ ) =&gt; 006C 00B7 ( l· ) # LATIN SMALL LETTER L WITH<br>

MIDDLE DOT<br>

<br>

</div>As you probably know, this code point or decomposition gets<br>

involved with the ela geminada digraph problem, which the<br>

Catalan community (and gTLD, incidentally) believes has been<br>

mishandled in Unicode.  In the absence of input from them, it<br>

seems to me to be dangerous to perform this mapping, and we have<br>

had no such input.<br>

<div class="im"><br>

&gt;       0149 ( ŉ ) =&gt; 02BC 006E ( ʼn ) # LATIN SMALL LETTER N<br>

PRECEDED BY APOSTROPHE<br>

<br>

</div>You may reasonably disagree with one or more of the explanations<br>

above, and I imagine we would find many more characters to<br>

disagree about if we compared the full list.  But my point is<br>

that, when looked at primarily from a DNS, IDN, and<br>

anti-confusion perspective, there are sound reasons for not<br>

mapping many of  them.<br>

<br>

And that brings us to the two areas where I think our<br>

assumptions differ in a fundamental way.  I see the principal<br>

goal of the WG as trying to define a model for IDNs that will<br>

serve us well into the very long term future, a future with an<br>

Internet that is much larger and much more diverse along a whole<br>

series of dimensions, languages and writing systems among them.<br>

I see compatibility with IDNA2003 to be part of that goal,<br>

especially when one can reduce confusion by having more<br>

compatibility, but as distinctly subsidiary to having things<br>

work better and more predictably vis-a-vis end user expectations<br>

in that expanded Internet future.  In that regard, conformance<br>

--at the UI level-- to user expectations about identical<br>

characters that might be different as a consequence of entry<br>

conventions (e.g., Asian narrow and wide characters, upper and<br>

lower case equivalences when that does not lead to either<br>

unexpected ambiguity or transformation of what users think of as<br>

one character into another (or a string)) is very important.  To<br>

the extent to which NFKC_CF can contribute to that goal, it is<br>

useful.  But conformance to NFKC_CF as a goal in itself is not<br>

particularly relevant to me if it interferes with those other<br>

objectives.<br>

<br>

Now, by inspection (i.e., without making judgments about the<br>

intent of the author(s)), TR46 seems to start from another<br>

assumption, an assumption that conformance with Unicode norms<br>

generally and NFKC_CF in particular, is a useful, if not primary<br>

goal.  It isn&#39;t the goal of the WG.  If it were, we would have<br>

accepted one of those &quot;update IDNA2003 and Stringprep to<br>

incorporate Unicode 5.x&quot; proposals.<br>

<br>

In that light, TR46 isn&#39;t a well-established and widely<br>

implemented and deployed standard that we should be looking at<br>

as a model for IDNs.  Instead, it is a position of the Unicode<br>

Technical Committee (or some of its members) about what the WG<br>

should have done instead of the Mappings document or, perhaps,<br>

instead of the base IDNA2008 documents themselves.   UTC is<br>

certainly entitled to that opinion but the point remains that it<br>

was derived from fundamentally different base assumptions.<br>

Suggesting that, as an independent goal, the IETF conform to it,<br>

or to NFKC_CF, as ends in themselves (with or without the<br>

exceptions already agreed to about NFKC_CF, assumptions that<br>

include treating Eszett and Final Sigma as separate characters<br>

and not mapping ZWJ and ZWNJ to nothing) just does not seem<br>

reasonable.<br>

<br>

regards,<br>

<font color="#888888">   john<br>

</font><div><div></div><div class="h5"><br>

<br>

<br>


</div></div></blockquote></div><br></div>