<br clear="all">Mark<br>
<br><br><div class="gmail_quote">On Tue, Jun 30, 2009 at 02:07, John C Klensin <span dir="ltr"><<a href="mailto:klensin@jck.com">klensin@jck.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<br>
<br>
--On Monday, June 29, 2009 12:46 -0700 Mark Davis ⌛<br>
<div class="im"><<a href="mailto:mark@macchiato.com">mark@macchiato.com</a>> wrote:<br>
<br>
>...<br>
>> > Now, my position is still that the simplest and most<br>
>> > compatible option open to us is to simply map with NFKC +<br>
>> > Casefold.<br>
>><br>
>> I continue to believe that CaseFold is a showstopper. When<br>
>> its results are not identical to those produced by LowerCase,<br>
>> it produces results that are astonishing to some users and<br>
>> leads us into the "is that a separate character or not" trap<br>
>> that we've seen manifested at least twice. I note that TUS<br>
>> recommends against its use for mapping (as distinct from<br>
>> comparison) and appears to do so for just the reason that it<br>
>> involves too much information loss.<br>
<br>
> You need to provide actual data behind this. Please list<br>
> exactly the characters that you mean, and why you think they<br>
> are problematic. Note also that the formulation that I gave<br>
> means that any character that is PVALID would automatically be<br>
> excluded, eg if final-sigma is PVALID then it is unaffected.<br>
> And we can certainly introduce other exceptions.<br>
<br>
</div>I don't like operating by exception when it can be avoided (an<br>
argument you have made as well). Getting into situations in<br>
which exceptions are required is not advantageous if we are<br>
trying to be as Unicode version independent as possible. I<br>
also prefer operations that casual users understand and believe<br>
they do (e.g., a Lower Case operation is fairly comprehensible<br>
to any user of a script that supports case distinctions, while<br>
CaseFolding is dependent on Unicode coding decisions). Final<br>
Sigma and Eszett are, indeed, the current examples but it<br>
continues to appear that LowerCase is both necessary and<br>
sufficient and that, if transformations it does not cover are<br>
needed, it is they that should be handled by exception.</blockquote><div><br>What I am saying is that there would be exceptions to the set of mappings in any event, not particularly just the case mappings.<br><br>I find myself puzzled. You seem to be focused on the name of the property, rather than the results. I suggest that you list the differences between the Lowercase mapping and the CaseFold mapping, and indicate at least one example where there is the possibility of a real problem. The only possible issues I could see would be:<br>
<ul><li>The character would be more likely interpreted as a different valid character than the one it maps to, or<br></li><li>We might add it as PVALID in the future.</li></ul>I don't see either of those cases applying, so if you do, please list them for discussion.<br>
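<br>One way to produce the requested list of differences is sketched below. It assumes Python's built-in <code>str.lower()</code> and <code>str.casefold()</code> are acceptable stand-ins for the Unicode Lowercase and CaseFold mappings; the function name <code>lower_vs_casefold</code> is illustrative only, and the output depends on the Unicode version bundled with the interpreter.<br>
<pre>
```python
import sys
import unicodedata


def lower_vs_casefold():
    """Return (char, lower, casefold) for every code point where the two differ."""
    rows = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Cs":  # skip surrogate code points
            continue
        low, fold = ch.lower(), ch.casefold()
        if low != fold:
            rows.append((ch, low, fold))
    return rows


if __name__ == "__main__":
    # Print one line per differing character for review.
    for ch, low, fold in lower_vs_casefold():
        name = unicodedata.name(ch, "(unnamed)")
        print(f"U+{ord(ch):04X} {name}: lower={low!r} casefold={fold!r}")
```
</pre>
The resulting list includes, among others, eszett (lowercase leaves it alone, case folding yields "ss") and final sigma (lowercase leaves it alone, case folding yields the non-final sigma).<br>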
<br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br>
<div class="im"><br>
> And I know full well about the issues in TUS, having written<br>
> or participated in the writing of them.<br>
<br>
</div>I assumed, given your role in the case-handling material, that<br>
you had written it. What I'm having trouble understanding is<br>
why, given the perfectly logical (at least to me) explanation of<br>
why CaseFold should be used for matching only that appears<br>
there, you keep wanting to use it as a mandatory mapping<br>
operation here.</blockquote><div><br>The problem is that we are calling upon mapping to do the work of matching. As you know, we can't actually change the mapping operation in the DNS, so we are forced into this. Part of what I did was to go through all the mappings and try to make a reasoned judgment about which mapping operations would serve both purposes without introducing complications. What I'd like is a concrete review of the results, rather than vague (untestable) statements of doubt.<br>
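<br>To make such a review concrete, here is a minimal sketch of the "NFKC + Casefold, except for characters that are PVALID" formulation. The exception set is hypothetical (it assumes eszett and final sigma are PVALID), <code>map_label</code> is an illustrative name, and the per-character approach is a simplification rather than the actual IDNA tables.<br>
<pre>
```python
import unicodedata

# Hypothetical exception set: characters assumed to be PVALID, so left unmapped.
PVALID_EXCEPTIONS = frozenset("\u00df\u03c2")  # eszett, final sigma


def map_label(label, exceptions=PVALID_EXCEPTIONS):
    """Map a label with NFKC + case folding, leaving excepted characters alone."""
    out = []
    for ch in label:
        if ch in exceptions:
            out.append(ch)  # PVALID characters pass through unchanged
        else:
            out.append(unicodedata.normalize("NFKC", ch).casefold())
    # Renormalize so the combined result is in NFKC form.
    return unicodedata.normalize("NFKC", "".join(out))


if __name__ == "__main__":
    for sample in ("Stra\u00dfe", "\uff33\uff4f\uff4e\uff59", "\u03bf\u03b4\u03cc\u03c2"):
        print(repr(sample), "maps to", repr(map_label(sample)))
```
</pre>
Passing <code>exceptions=frozenset()</code> shows the unmodified NFKC + Casefold behavior for comparison, which is where eszett becomes "ss" and final sigma becomes the non-final form.<br>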
<br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br>
<br>
A different way of looking at this is that I'm trying to resist<br>
mapping transformations that non-expert users believe lose<br>
significant information unless they can be demonstrated to be<br>
really important. Whatever can be said for, e.g., the<br>
FinalSigma -> Lower Case Sigma transformation and the tradeoffs<br>
between information-preservation and IDNA2003 compatibility, it<br>
seems to be generally understood that the transformation is<br>
information-losing. From that perspective, and the related<br>
perspective of minimizing complexity by choosing simpler<br>
operations rather than more complicated ones and not performing<br>
mappings that are not justified by real-world usage, it seems to<br>
me that it is you who need to make the case for operations that<br>
lose more information, for more complexity, and for mapping of<br>
more characters.</blockquote><div><br>There are at least two different motivations for the mapping.<br><ul><li>Avoid unnecessary compatibility breakage with IDNA2003</li><li>Meet people's expectations. A subcategory of this is where people see a name, paste it in, and it doesn't work because there is a variant character (e.g. µ instead of μ).</li>
</ul>As to performing mappings that are not justified by real-world usage: what data are your claims based on?<br><br>As to losing information, I think you are quite mistaken. The information lost in case folding is far greater than that lost in the other mappings proposed. Look at the following, for example:<br>
<br><a href="http://Therapist.com">http://Therapist.com</a> vs. <br><a href="http://TheRapist.com">http://TheRapist.com</a><br><br>That is a far bigger difference, to more people, than the difference between l+j and the lj character in http://ljubav.rs (the lj being a single-character transliteration of љ).<br>
<br>Or the difference between fullwidth and normal:<br><br>http://<span style="font-weight: normal;"><span class="t_nihongo_kanji" lang="ja">日本Sony....</span></span><br>vs<br>http://<span style="font-weight: normal;"><span class="t_nihongo_kanji" lang="ja">日本</span></span>Sony
<span style="font-weight: normal;"><span class="t_nihongo_kanji" lang="ja">....</span></span><br><br>I'm afraid that a focus on case-mapping is, and will be perceived as, a Western-European language focus, one that excludes mappings that are important to other parts of the world.<br>
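<br>The variant-character cases above (µ versus μ, the single-character lj, fullwidth versus normal letters) can be checked with a short sketch. It assumes Python's standard <code>unicodedata.normalize</code> with the NFKC form; the helper name <code>nfkc</code> and the sample strings are illustrative, with the code points written as escapes so the distinction survives copy and paste.<br>
<pre>
```python
import unicodedata


def nfkc(text):
    """Apply Unicode NFKC normalization to text."""
    return unicodedata.normalize("NFKC", text)


# Illustrative inputs, written as escapes so the variant characters are unambiguous.
examples = {
    "micro sign": "\u00b5",                                # MICRO SIGN
    "lj digraph": "\u01c9ubav.rs",                         # single-character lj
    "fullwidth": "\u65e5\u672c\uff33\uff4f\uff4e\uff59",   # fullwidth Latin letters
}

if __name__ == "__main__":
    for label, text in examples.items():
        print(f"{label}: {text!r} maps to {nfkc(text)!r}")
```
</pre>
NFKC alone folds these compatibility variants into their ordinary counterparts without touching case, so the fullwidth letters come out as "Sony" with the capital S intact.<br>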
<br><br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br>
<br>
>> ...<br>
<div class="im">> You make it sound like final sigma, ZWJ/NJ, eszett and the<br>
> other cases under discussion were oversights in the process of<br>
> developing the current IDNA. That wasn't the case; these were<br>
> deliberate choices made at the time. A case mapping is also a<br>
> 'loss of information', but one that people clearly want.<br>
<br>
</div>Taking the last as an example, I think "a case mapping" was a<br>
deliberate choice, one that I supported at the time and, given<br>
the assumptions behind IDNA2003, would support again. I do<br>
not believe it is plausible to argue that a majority of the<br>
participants in the original IDNA WG, much less in the IETF,<br>
understood the implications of the differences between case<br>
folding and lower case mapping well enough to have exercised<br>
informed consent, much less to have made a "deliberate choice".<br>
Instead, they were informed by experts, yourself included, that<br>
toCaseFold was the correct explanation and went along with it<br>
despite some concerns about individual characters (which most of<br>
the participants did not understand either).<br>
<br>
Obviously one can have both "not an oversight" and "insufficient<br>
understanding to have informed consent", so we are not<br>
necessarily disagreeing.</blockquote><div><br>It would, of course, be useful if we were all experts on all the topics involved. Failing that, we do have to rely on others' information; for example, on your knowledge of the DNS. That doesn't, of course, mean that anyone gets a free pass...<br>
<br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br>
<br>
>...<br>
<div class="im"><br>
>> > The rest of the tests for U-Label remain unchanged.<br>
>><br>
>> I believe that doing this by the type of change to Tables that<br>
>> you recommend either requires a change to the way that the<br>
>> definition of U-label is stated or requires us to abandon the<br>
>> very clear concept of a U-label that is completely symmetric,<br>
>> with no information loss in either direction, with an A-label.<br>
<br>
> I don't see why you would think that. A U-Label remains just<br>
> the way it is, and has a 1-1 relation with an A-Label. The<br>
> only difference is that we have an additional category of<br>
> M-Label; one that is not a U-Label but maps to one.<br>
<br>
</div>At a minimum, the already-complicated pictures in Defs will<br>
require redrawing, which was not mentioned in your list. But,<br>
independent of that bit of work, I still wish we could avoid<br>
introducing yet another label category.</blockquote><div><br>If an ASCII picture were the only thing standing between us and a successful mapping ;-)<br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<br>
<br>
>...<br>
<font color="#888888"><br>
john<br>
<br>
</font></blockquote></div><br>