<div dir="ltr">Mark,<div><br></div><div>I am reading your message as saying "ambiguity is ok if there are few instances of it" while some of us would like the handling of identifiers encoded with Unicode to be unambiguous. </div><div><br></div><div>vint</div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jan 28, 2015 at 3:15 AM, Mark Davis ☕️ <span dir="ltr"><<a href="mailto:mark@macchiato.com" target="_blank">mark@macchiato.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_default" style="font-family:'times new roman',serif">You said earlier:</div><div class="gmail_default" style="font-family:'times new roman',serif"><br></div><div class="gmail_default"><div class="gmail_default" style="font-family:'times new roman',serif">> That's a false alternative, and I don't think it's in any way reasonable. </div><div class="gmail_default" style="font-family:'times new roman',serif">> We don't argue, "<i>Car accidents cause lots of death, so death from influenza isn't important.</i>"</div><div class="gmail_default" style="font-family:'times new roman',serif"><br></div><div class="gmail_default" style="font-family:'times new roman',serif">That is, however, a false analogy. 
A more accurate one would be:</div><div class="gmail_default" style="font-family:'times new roman',serif"><br></div><div class="gmail_default" style="font-family:'times new roman',serif"><i>Car accidents cause lots of death, so in comparison, the sniffles are not that important.</i><br></div><div class="gmail_default" style="font-family:'times new roman',serif"><i><br></i></div><div class="gmail_default" style="font-family:'times new roman',serif"><b>That is:</b></div><div class="gmail_default" style="font-family:'times new roman',serif"><br></div><div class="gmail_default" style="font-family:'times new roman',serif"><i>Many characters and sequences cause confusability problems, and in comparison U+08A1 is not that important.</i></div><div class="gmail_default"><div class="gmail_default"><font face="times new roman, serif"><br></font></div><div class="gmail_default"><font face="times new roman, serif">And, like so many of these discussions, there is <b><i>no</i></b> data behind any of this Sturm und Drang around U+08A1 and related characters</font><span style="font-family:'times new roman',serif">. If the IETF were serious about these issues, it would gather the data to see where the biggest problems are</span><i style="font-family:'times new roman',serif"> in reality</i><span style="font-family:'times new roman',serif">. 
It would then focus on the biggest ticket items, to see if it can come up with solutions to those.</span></div><div class="gmail_default"><font face="times new roman, serif"><br></font></div><div class="gmail_default"><font face="times new roman, serif">With your analogy, it would figure out how many deaths there are due to car accidents, slips in the bathtub, and so on (<a href="http://www.who.int/mediacentre/factsheets/fs310/en/" target="_blank">http://www.who.int/mediacentre/factsheets/fs310/en/</a>), and focus its resources on those big ticket causes of death where it can make a difference, not focus on the sniffles.</font></div></div><div class="gmail_default" style="font-family:'times new roman',serif"><br></div></div></div><div class="gmail_extra"><br clear="all"><div><div><div dir="ltr"><font face="'times new roman', serif"><div style="background-color:transparent;margin-top:0px;margin-left:0px;margin-bottom:0px;margin-right:0px"><div></div></div><div style="background-color:transparent;margin-top:0px;margin-left:0px;margin-bottom:0px;margin-right:0px"><br></div><div style="background-color:transparent;margin-top:0px;margin-left:0px;margin-bottom:0px;margin-right:0px"><a href="https://google.com/+MarkDavis" target="_blank">Mark</a></div><div style="background-color:transparent;margin-top:0px;margin-left:0px;margin-bottom:0px;margin-right:0px"><i><br></i></div><div style="background-color:transparent;margin-top:0px;margin-left:0px;margin-bottom:0px;margin-right:0px"><i>— Il meglio è l’inimico del bene —</i></div></font><div><div><font face="'times new roman', serif"><i><span style="font-style:normal"><i></i></span><i></i></i></font></div></div></div></div></div><div><div class="h5">

<br><div class="gmail_quote">On Wed, Jan 28, 2015 at 3:22 AM, Andrew Sullivan <span dir="ltr"><<a href="mailto:ajs@anvilwalrusden.com" target="_blank">ajs@anvilwalrusden.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span>On Wed, Jan 28, 2015 at 01:13:28AM +0000, Shawn Steele wrote:<br>

> It focuses on edge cases of confusable characters.  These are a very small part of the potential for confusion in IDNA.<br>

><br>

<br>

</span>I'm sorry, but I do not agree that it focusses on that.  In<br>

particular, it says this:<br>

<br>

"What is peculiar about these cases, as distinct from other confusable<br>

cases, is that the decomposed and precomposed forms are in the same<br>

script and cannot be distinguished visually by users, even in large<br>

fonts designed for clarity. It is only by knowing the language that it<br>

is possible to detect whether a use of the character is the correct<br>

one."<br>

<br>

Other kinds of confusable characters are also important, but that is<br>

_not_ what this particular case is about, and just saying, "They're<br>

all the same," doesn't make that true.<br>

<br>

In addition, there is something that is not in the statement but that<br>

occurred to me today because of a conversation.  Some of the examples that have<br>

been used have different properties.  For instance, all the cases that<br>

the IAB statement is talking about are always in the same script.  In<br>

addition, it strikes me, they have a bunch of other properties in<br>

common (for instance, they're all letters).  The basic problem is that<br>

there isn't an algorithmic way to distinguish between them at all;<br>

indeed, that's how several of these came to be PVALID in IDNA2008.  I<br>

think that if this issue had been clear to all of us when working<br>

on that specification, we'd have worked a little harder to determine<br>

whether we had an extra exception class.<br>

<br>

The entire justification that we've seen for these different encodings<br>

is linguistic.  And that is no doubt correct, for the purposes to which<br>

Unicode needs to be put in general.  The problem is that identifiers<br>

aren't _in_ a language, and even if they were most of the time you<br>

can't know what the right language is because there's no such metadata<br>

with the identifier.  This is the nub of the problem.<br>

<span><br>

> Certainly having identifiers that are consistent is good<br>

<br>

</span>Perhaps you have a different understanding of the meaning of<br>

"identifier" than I do.  I do not think that consistent identifiers are<br>

some sort of nice-to-have, pretty-good idea.  I think it is an<br>

essential element of any identifier system that it be as consistent<br>

and predictable as is possible.<br>

<br>

The present example is a case where the critical determining factor --<br>

the linguistic metadata -- is the thing that is necessarily missing.<br>

That's quite different from ï/i or ß/ss because a clued-in user can<br>

handle those things (even if Joe Random Language Speaker can't).  It's<br>

also different from cases like TAMIL LETTER KA vs. TAMIL DIGIT ONE:<br>

the former is general category Other_Letter and the latter is<br>

Decimal_Number.  So in that case, it's at least possible to write some<br>

rules about what things you can use by category.  There's no way to<br>

tell whether you have a two-codepoint composition that renders BEH<br>

with a HAMZA ABOVE or whether you have a single codepoint BEH WITH<br>

HAMZA ABOVE (cf. the other _hamza_ cases), _and_ there's no way in<br>

principle to write any software that could possibly detect that you<br>

might have an issue here, at least without carrying around a big<br>

exception table.<br>

<br>
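The distinction drawn above can be checked with a minimal sketch, assuming<br>
Python's standard unicodedata module.  The variable names are illustrative<br>
only, and the results depend on the Unicode data shipped with the interpreter.<br>
It compares the BEH case with an ALEF _hamza_ case that does have a canonical<br>
composition, and prints the general categories of the Tamil pair.<br>

```python
import unicodedata

# Two-code-point sequence: ARABIC LETTER BEH followed by ARABIC HAMZA ABOVE
beh_hamza_seq = "\u0628\u0654"
# Single code point: ARABIC LETTER BEH WITH HAMZA ABOVE
beh_hamza_one = "\u08A1"


def nfc(text):
    """Apply Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# Normalization does not unify the two spellings of BEH with HAMZA ABOVE.
beh_forms_differ = nfc(beh_hamza_seq) != nfc(beh_hamza_one)

# Contrast: ALEF followed by HAMZA ABOVE composes to the single code point U+0623.
alef_forms_unify = nfc("\u0627\u0654") == "\u0623"

# The Tamil pair can be told apart by general category alone.
ka_category = unicodedata.category("\u0B95")   # TAMIL LETTER KA
one_category = unicodedata.category("\u0BE7")  # TAMIL DIGIT ONE

print(beh_forms_differ, alef_forms_unify, ka_category, one_category)
```

<br>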

Somewhat earlier, Asmus argued that the UTC had discovered that<br>

exception lists are the only thing that would work for some of these<br>

cases.  If so, then good, but it suggests to me that we might need a<br>

new list of exceptions for identifiers.  It appears to me that this<br>

may be a different list of exceptions than any of the existing ones,<br>

but I confess that I have not managed to peruse every single possible<br>

candidate exception list yet.<br>

<br>
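As a rough sketch of what such an exception list could look like in software,<br>
assuming Python: the table EXCEPTION_PAIRS and the functions ambiguous_forms<br>
and fold are hypothetical names, and the table holds only the one pair named<br>
in this thread.  Folding the single code point to the sequence is an arbitrary<br>
choice made for illustration, not a recommendation from any specification.<br>

```python
# Hypothetical exception table: each entry pairs a code point sequence with a
# single code point that renders identically but is not canonically equivalent.
# Only the pair discussed in this thread is listed here.
EXCEPTION_PAIRS = [
    ("\u0628\u0654", "\u08A1"),  # BEH + HAMZA ABOVE vs. BEH WITH HAMZA ABOVE
]


def ambiguous_forms(identifier):
    """Return the table entries whose sequence or single code point occurs."""
    return [
        (seq, single)
        for seq, single in EXCEPTION_PAIRS
        if seq in identifier or single in identifier
    ]


def fold(identifier):
    """Rewrite each listed single code point as its sequence for comparison."""
    for seq, single in EXCEPTION_PAIRS:
        identifier = identifier.replace(single, seq)
    return identifier


# Both spellings compare equal once folded through the table.
same_after_fold = fold("\u08A1x") == fold("\u0628\u0654x")
```

<br>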

Again, remember, this is not just domain names we're talking about (at<br>

least in the IAB statement), so saying "let the registries solve this"<br>

won't automatically work.<br>

<br>

Best regards,<br>

<span><br>

A<br>

<br>

--<br>

Andrew Sullivan<br>

<a href="mailto:ajs@anvilwalrusden.com" target="_blank">ajs@anvilwalrusden.com</a><br>

</span><div><div>_______________________________________________<br>

Idna-update mailing list<br>

<a href="mailto:Idna-update@alvestrand.no" target="_blank">Idna-update@alvestrand.no</a><br>

<a href="http://www.alvestrand.no/mailman/listinfo/idna-update" target="_blank">http://www.alvestrand.no/mailman/listinfo/idna-update</a><br>

</div></div></blockquote></div><br></div></div></div>


<br></blockquote></div><br></div>