<div dir="ltr"><div class="gmail_default" style="font-family:'times new roman',serif">You said earlier:</div><div class="gmail_default" style="font-family:'times new roman',serif"><br></div><div class="gmail_default"><div class="gmail_default" style="font-family:'times new roman',serif">> That's a false alternative, and I don't think it's in any way reasonable. </div><div class="gmail_default" style="font-family:'times new roman',serif">> We don't argue, "<i>Car accidents cause lots of death, so death from influenza isn't important.</i>"</div><div class="gmail_default" style="font-family:'times new roman',serif"><br></div><div class="gmail_default" style="font-family:'times new roman',serif">That is, however, a false analogy. A more accurate one would be:</div><div class="gmail_default" style="font-family:'times new roman',serif"><br></div><div class="gmail_default" style="font-family:'times new roman',serif"><i>Car accidents cause lots of death, so in comparison, the sniffles are not that important.</i><br></div><div class="gmail_default" style="font-family:'times new roman',serif"><i><br></i></div><div class="gmail_default" style="font-family:'times new roman',serif"><b>That is:</b></div><div class="gmail_default" style="font-family:'times new roman',serif"><br></div><div class="gmail_default" style="font-family:'times new roman',serif"><i>Many characters and sequences cause confusability problems, and in comparison U+08A1 is not that important.</i></div><div class="gmail_default"><div class="gmail_default"><font face="times new roman, serif"><br></font></div><div class="gmail_default"><font face="times new roman, serif">And, like so many of these discussions, there is <b><i>no</i></b> data behind any of this Sturm und Drang around U+08A1 and related characters</font><span style="font-family:'times new roman',serif">. 
If the IETF were serious about these issues, it would gather the data to see where the biggest problems are</span><i style="font-family:'times new roman',serif"> in reality</i><span style="font-family:'times new roman',serif">. It would then focus on the biggest ticket items, to see if it can come up with solutions to those.</span></div><div class="gmail_default"><font face="times new roman, serif"><br></font></div><div class="gmail_default"><font face="times new roman, serif">With your analogy, it would figure out how many deaths there are due to car accidents, slips in the bathtub, and so on (<a href="http://www.who.int/mediacentre/factsheets/fs310/en/">http://www.who.int/mediacentre/factsheets/fs310/en/</a>), and focus its resources on those big ticket causes of death where it can make a difference, not focus on the sniffles.</font></div></div><div class="gmail_default" style="font-family:'times new roman',serif"><br></div></div></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature"><div dir="ltr"><font face="'times new roman', serif"><div style="background-color:transparent;margin-top:0px;margin-left:0px;margin-bottom:0px;margin-right:0px"><div></div></div><div style="background-color:transparent;margin-top:0px;margin-left:0px;margin-bottom:0px;margin-right:0px"><br></div><div style="background-color:transparent;margin-top:0px;margin-left:0px;margin-bottom:0px;margin-right:0px"><a href="https://google.com/+MarkDavis" target="_blank">Mark</a></div><div style="background-color:transparent;margin-top:0px;margin-left:0px;margin-bottom:0px;margin-right:0px"><i><br></i></div><div style="background-color:transparent;margin-top:0px;margin-left:0px;margin-bottom:0px;margin-right:0px"><i>— Il meglio è l’inimico del bene —</i></div></font><div><div><font face="'times new roman', serif"><i><span style="font-style:normal"><i></i></span><i></i></i></font></div></div></div></div></div>

<br><div class="gmail_quote">On Wed, Jan 28, 2015 at 3:22 AM, Andrew Sullivan <span dir="ltr">&lt;<a href="mailto:ajs@anvilwalrusden.com" target="_blank">ajs@anvilwalrusden.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Wed, Jan 28, 2015 at 01:13:28AM +0000, Shawn Steele wrote:<br>

> It focuses on edge cases of confusable characters.  These are a very small part of the potential for confusion in IDNA.<br>

><br>

<br>

</span>I'm sorry, but I do not agree that it focusses on that.  In<br>

particular, it says this:<br>

<br>

"What is peculiar about these cases, as distinct from other confusable<br>

cases, is that the decomposed and precomposed forms are in the same<br>

script and cannot be distinguished visually by users, even in large<br>

fonts designed for clarity. It is only by knowing the language that it<br>

is possible to detect whether a use of the character is the correct<br>

one."<br>

<br>

Other kinds of confusable characters are also important, but that is<br>

_not_ what this particular case is about, and just saying, "They're<br>

all the same," doesn't make that true.<br>

<br>

In addition, there is something not in the statement but that occurred<br>

to me today because of a conversation.  Some of the examples that have<br>

been used have different properties.  For instance, all the cases that<br>

the IAB statement is talking about are always in the same script.  In<br>

addition, it strikes me, they have a bunch of other properties in<br>

common (for instance, they're all letters).  The basic problem is that<br>

there isn't an algorithmic way to distinguish between them at all;<br>

indeed, that's how it is that several of these are PVALID in IDNA2008,<br>

because I think if this issue had been clear to all of us when working<br>

on that specification we'd have worked a little harder to determine<br>

whether we had an extra exception class.<br>

<br>

The entire justification that we've seen for these different encodings<br>

is linguistic.  And that is no doubt correct, for the purposes to which<br>

Unicode needs to be put in general.  The problem is that identifiers<br>

aren't _in_ a language, and even if they were, most of the time you<br>

can't know what the right language is because there's no such metadata<br>

with the identifier.  This is the nub of the problem.<br>

<span class=""><br>

> Certainly having identifiers that are consistent is good<br>

<br>

</span>Perhaps you have a different understanding of the meaning of<br>

"identifier" than I do.  I do not think that having consistent identifiers is<br>

some sort of nice-to-have, pretty good idea.  I think it is an<br>

essential element of any identifier system that it be as consistent<br>

and predictable as is possible.<br>

<br>

The present example is a case where the critical determining factor --<br>

the linguistic metadata -- is the thing that is necessarily missing.<br>

That's quite different from ï/i or ß/ss because a clued-in user can<br>

handle those things (even if Joe Random Language Speaker can't).  It's<br>

also different from cases like TAMIL LETTER KA vs. TAMIL DIGIT ONE:<br>

the former is general category Other_Letter and the latter is<br>

Decimal_Number.  So in that case, it's at least possible to write some<br>

rules about what things you can use by category.  There's no way to<br>

tell whether you have a two-codepoint composition that renders BEH<br>

with a HAMZA ABOVE or whether you have a single codepoint BEH WITH<br>

HAMZA ABOVE (cf. the other _hamza_ cases), _and_ there's no way in<br>

principle to write any software that could possibly detect that you<br>

might have an issue here, at least without carrying around a big<br>

exception table.<br>
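
<br>

The contrast above can be illustrated with a minimal sketch, assuming<br>

Python's standard unicodedata module built against Unicode 7.0 or later<br>

(the version that added U+08A1).  The helper names describe and<br>

same_after_normalization are hypothetical, introduced only for this<br>

illustration.  The ALEF pair is included purely as a contrast: U+0623<br>

has a canonical decomposition, whereas U+08A1 has none.<br>

<pre>
```python
import unicodedata

# Category-based rules can separate these two look-alikes.
TAMIL_KA = "\u0B95"    # TAMIL LETTER KA, general category Lo
TAMIL_ONE = "\u0BE7"   # TAMIL DIGIT ONE, general category Nd

# The two encodings discussed above; both base letters are category Lo.
BEH_HAMZA_SINGLE = "\u08A1"          # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_HAMZA_SEQUENCE = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

# Contrast case: a precomposed letter that does decompose canonically.
ALEF_HAMZA_SINGLE = "\u0623"         # ARABIC LETTER ALEF WITH HAMZA ABOVE
ALEF_HAMZA_SEQUENCE = "\u0627\u0654"  # ARABIC LETTER ALEF + ARABIC HAMZA ABOVE


def describe(text):
    """Return (code point, general category, name) for each character."""
    return [
        ("U+%04X" % ord(ch), unicodedata.category(ch),
         unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


def same_after_normalization(a, b):
    """True if some normalization form maps a and b to the same string."""
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    )


if __name__ == "__main__":
    print(describe(TAMIL_KA + TAMIL_ONE))
    print(describe(BEH_HAMZA_SINGLE))
    print(describe(BEH_HAMZA_SEQUENCE))
    print(same_after_normalization(ALEF_HAMZA_SINGLE, ALEF_HAMZA_SEQUENCE))
    print(same_after_normalization(BEH_HAMZA_SINGLE, BEH_HAMZA_SEQUENCE))
```
</pre>

In this sketch the Tamil pair differs by general category, so a rule<br>

keyed on category can tell the two apart, while no normalization form<br>

unifies the two BEH encodings, which is why an explicit exception table<br>

would be needed to relate them.<br>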

<br>

Somewhat earlier, Asmus argued that the UTC had discovered that<br>

exception lists are the only thing that would work for some of these<br>

cases.  If so, then good, but it suggests to me that we might need a<br>

new list of exceptions for identifiers.  It appears to me that this<br>

may be a different list of exceptions than any of the existing ones,<br>

but I confess that I have not managed to peruse every single possible<br>

candidate exception list yet.<br>

<br>

Again, remember, this is not just domain names we're talking about (at<br>

least in the IAB statement), so saying "let the registries solve this"<br>

won't automatically work.<br>

<br>

Best regards,<br>

<span class="im HOEnZb"><br>

A<br>

<br>

--<br>

Andrew Sullivan<br>

<a href="mailto:ajs@anvilwalrusden.com">ajs@anvilwalrusden.com</a><br>

</span><div class="HOEnZb"><div class="h5">_______________________________________________<br>

Idna-update mailing list<br>

<a href="mailto:Idna-update@alvestrand.no">Idna-update@alvestrand.no</a><br>

<a href="http://www.alvestrand.no/mailman/listinfo/idna-update" target="_blank">http://www.alvestrand.no/mailman/listinfo/idna-update</a><br>

</div></div></blockquote></div><br></div>