IAB Statement on Identifiers and Unicode 7.0.0
vint at google.com
Wed Jan 28 09:20:19 CET 2015
I am reading your message as saying "ambiguity is ok if there are few
instances of it" while some of us would like the handling of identifiers
encoded with Unicode to be unambiguous.
On Wed, Jan 28, 2015 at 3:15 AM, Mark Davis ☕️ <mark at macchiato.com> wrote:
> You said earlier:
> > That's a false alternative, and I don't think it's in any way [...]
> > We don't argue, "*Car accidents cause lots of death, so death from
> > influenza isn't important.*"
> That is, however, a false analogy. A more accurate one would be:
> *Car accidents cause lots of death, so in comparison, the sniffles are not
> that important.*
> *That is:*
> *Many characters and sequences cause confusability problems, and in
> comparison U+08A1 is not that important.*
> And, like so many of these discussions, there is *no* data behind any of
> this Sturm und Drang around U+08A1 and related characters. If the IETF
> were serious about these issues, it would gather the data to see where the
> biggest problems are* in reality*. It would then focus on the biggest
> ticket items, to see if it can come up with solutions to those.
> With your analogy, it would figure out how many deaths there are due to
> car accidents, slips in the bathtub, and so on (
> http://www.who.int/mediacentre/factsheets/fs310/en/), and focus its
> resources on those big ticket causes of death where it can make a
> difference, not focus on the sniffles.
> Mark <https://google.com/+MarkDavis>
> *— Il meglio è l’inimico del bene — ("The best is the enemy of the good")*
> On Wed, Jan 28, 2015 at 3:22 AM, Andrew Sullivan <ajs at anvilwalrusden.com> wrote:
>> On Wed, Jan 28, 2015 at 01:13:28AM +0000, Shawn Steele wrote:
>> > It focuses on edge cases of confusable characters. These are a very
>> small part of the potential for confusion in IDNA.
>> I'm sorry, but I do not agree that it focusses on that. In
>> particular, it says this:
>> "What is peculiar about these cases, as distinct from other confusable
>> cases, is that the decomposed and precomposed forms are in the same
>> script and cannot be distinguished visually by users, even in large
>> fonts designed for clarity. It is only by knowing the language that it
>> is possible to detect whether a use of the character is the correct [...]"
>> Other kinds of confusable characters are also important, but that is
>> _not_ what this particular case is about, and just saying, "They're
>> all the same," doesn't make that true.
>> In addition, there is something not in the statement but that occurred
>> to me today because of a conversation. Some of the examples that have
>> been used have different properties. For instance, all the cases that
>> the IAB statement is talking about are always in the same script. In
>> addition, it strikes me, they have a bunch of other properties in
>> common (for instance, they're all letters). The basic problem is that
>> there isn't an algorithmic way to distinguish between them at all;
>> indeed, that's how it is that several of these are PVALID in IDNA2008,
>> because I think if this issue had been clear to all of us when working
>> on that specification we'd have worked a little harder to determine
>> whether we had an extra exception class.
>> The entire justification that we've seen for these different encodings
>> is linguistic. And that is no doubt correct for the purposes to
>> which Unicode needs to be put in general. The problem is that identifiers
>> aren't _in_ a language, and even if they were, most of the time you
>> can't know what the right language is because there's no such metadata
>> with the identifier. This is the nub of the problem.
>> > Certainly having identifiers that are consistent is good
>> Perhaps you have a different understanding of the meaning of
>> "identifier" than I do. I do not think that consistent identifiers are
>> merely some sort of nice-to-have. I think it is an
>> essential element of any identifier system that it be as consistent
>> and predictable as is possible.
>> The present example is a case where the critical determining factor --
>> the linguistic metadata -- is the thing that is necessarily missing.
>> That's quite different from ï/i or ß/ss because a clued-in user can
>> handle those things (even if Joe Random Language Speaker can't). It's
>> also different from cases like TAMIL LETTER KA vs. TAMIL DIGIT ONE:
>> the former is general category Other_Letter and the latter is
>> Decimal_Number. So in that case, it's at least possible to write some
>> rules about what things you can use by category. There's no way to
>> tell whether you have a two-codepoint composition that renders BEH
>> with a HAMZA ABOVE or whether you have a single codepoint BEH WITH
>> HAMZA ABOVE (cf. the other _hamza_ cases), _and_ there's no way in
>> principle to write any software that could possibly detect that you
>> might have an issue here, at least without carrying around a big
>> exception table.
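[The distinction drawn above can be demonstrated concretely. A minimal sketch using Python's `unicodedata` module (this requires a Python build whose Unicode database is 7.0 or later, i.e. Python 3.5+, since U+08A1 was added in Unicode 7.0): NFC normalization does not fold the two-codepoint BEH + HAMZA ABOVE sequence into the precomposed U+08A1, whereas the Tamil letter/digit pair really is separable by General_Category alone.

```python
import unicodedata

# Two visually identical spellings of beh-with-hamza:
decomposed = "\u0628\u0654"   # ARABIC LETTER BEH + ARABIC HAMZA ABOVE
precomposed = "\u08A1"        # ARABIC LETTER BEH WITH HAMZA ABOVE (Unicode 7.0)

# NFC does NOT unify them: U+08A1 was encoded without a canonical
# decomposition, so the two spellings remain distinct codepoint sequences.
assert unicodedata.normalize("NFC", decomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) != precomposed

# Contrast: ALEF + HAMZA ABOVE *does* compose, to U+0623, because that
# precomposed character has a canonical decomposition.
assert unicodedata.normalize("NFC", "\u0627\u0654") == "\u0623"

# The Tamil case, by contrast, is distinguishable by General_Category:
print(unicodedata.category("\u0B95"))  # TAMIL LETTER KA  -> 'Lo' (Other_Letter)
print(unicodedata.category("\u0BE7"))  # TAMIL DIGIT ONE  -> 'Nd' (Decimal_Number)
```

No comparable property separates the two beh-with-hamza spellings: both are same-script, both render identically, and normalization leaves both intact. — ed.]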
>> Somewhat earlier, Asmus argued that the UTC had discovered that
>> exception lists are the only thing that would work for some of these
>> cases. If so, then good, but it suggests to me that we might need a
>> new list of exceptions for identifiers. It appears to me that this
>> may be a different list of exceptions than any of the existing ones,
>> but I confess that I have not managed to peruse every single possible
>> candidate exception list yet.
>> Again, remember, this is not just domain names we're talking about (at
>> least in the IAB statement), so saying "let the registries solve this"
>> won't automatically work.
>> Best regards,
>> Andrew Sullivan
>> ajs at anvilwalrusden.com
>> Idna-update mailing list
>> Idna-update at alvestrand.no