I-D Action: draft-klensin-idna-rfc5891bis-00.txt

Asmus Freytag asmusf at ix.netcom.com
Sun Mar 12 07:14:58 CET 2017

On 3/11/2017 7:23 PM, Shawn Steele wrote:
> My concern is not that people are worried about mixing Latin and Cyrillic.  Clearly it's good to avoid that.
> My concern is that because we can say Latin+Cyrillic is bad, and because we can figure out a rule to avoid it (don't mix scripts), then we presume that there is a solution for all other homograph or other related conditions.  The space is too big.  There cannot be.  Especially if that system must be extended to humans writing URLs on a napkin.  Assuming that everything evil is quantifiable and correctable by some perfect algorithm is a fallacy.  That does not mean we should not try to protect the space, but we need to recognize that there are limits to what is achievable there.

There's an absolute limit on what you can achieve on the "per code 
point" level, because users interact with the system on the "per 
label" level.

Registration allows some per-label techniques. For example, instead of 
focusing on not mixing scripts, it's possible to have one label block 
another if the other is the "same" label, but simply substitutes 
look-alike code points from the other script.
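A minimal sketch of that per-label technique: the registry folds each code point to a Latin look-alike and refuses a new label whose folded form collides with an existing registration. The confusable table below is a tiny illustrative subset I've made up for the example, not the real Unicode confusables data, and the function names are hypothetical.

```python
# Sketch of "blocked variant" checking at registration time. The
# look-alike table is a small illustrative subset, not real data.
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
}

def skeleton(label: str) -> str:
    """Fold each code point to its Latin look-alike, if it has one."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in label)

def may_register(label: str, registered: set) -> bool:
    """Block a label whose skeleton collides with a registered label's."""
    skeletons = {skeleton(r) for r in registered}
    return skeleton(label) not in skeletons

registered = {"paypal"}
print(may_register("p\u0430yp\u0430l", registered))  # Cyrillic 'a' -> False
print(may_register("example", registered))           # -> True
```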

Sometimes, doing both is the right answer.

Back to your "fallacy".

The problem of labels overlapping in perceptual space is not a single 
issue and there's no single best solution for it.

The proper strategy, it seems to me, consists of a set of nested 
defenses. We use normalization to deal with dual encodings of the same 
item; script rules to deal with encoding disambiguation by script; 
limits on the repertoire to simplify the recognition task by removing 
unfamiliar code points; context rules to prevent combinations that 
render unpredictably or identically; blocked variants to deal with 
labels that are perceived as the same; and further techniques to deal 
with labels that are merely "similar", that is, labels with a non-zero 
distance in perceptual space that falls below some threshold.

Each of these strategies reduces the problem space, if designed properly.
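Two of those layers can be sketched as successive filters on a candidate label. This is only an illustration: the script check below crudely approximates the Unicode Script property from character names (the stdlib does not expose Script directly), and handles only three scripts.

```python
# Sketch of two nested defenses applied in sequence. Each layer rejects
# one class of problem; neither is expected to be perfect on its own.
import unicodedata

def layer_normalized(label: str) -> bool:
    # Layer 1: require NFC form, eliminating dual encodings of one item.
    return unicodedata.normalize("NFC", label) == label

def layer_single_script(label: str) -> bool:
    # Layer 2: crude no-script-mixing rule, approximated from character
    # names (LATIN vs CYRILLIC vs GREEK only, for illustration).
    scripts = set()
    for ch in label:
        name = unicodedata.name(ch, "")
        for script in ("LATIN", "CYRILLIC", "GREEK"):
            if name.startswith(script):
                scripts.add(script)
    return len(scripts) <= 1

def accept(label: str) -> bool:
    return layer_normalized(label) and layer_single_script(label)

print(accept("strasse"))     # True: all Latin, already NFC
print(accept("pa\u0440is"))  # False: mixes Latin with CYRILLIC ER
```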

Where I strongly agree with you is that insisting on some "perfection" 
at each individual level is a serious mistake. It misallocates 
resources to chasing down what you call 'seriously esoteric' edge 
cases, while missing the elephant in the room - one that might be 
attacked by the next layer of the defense.
> With Bucureşti side-by-side with București, it may be trivial to tell the difference between them and other possible forms that someone might try, like Bucureṣti, but without context people may just think it's a quirky font.  And on paper the distinction is lost completely.  For some languages the differences are far more confusing because of the number of characters involved.

Chinese is interesting: the set of code points in typical repertoires 
exceeds (and not by a little) the number of characters that an "average" 
user would know. So the majority of those code points are actually 
unfamiliar to the typical user. That has always struck me as backwards, 
even allowing for the fact that ideographs can be taken apart into 
components so that side by side comparisons are possible even if you 
don't know much. However, without a side-by-side comparison, the 
chance that users correctly pick a code point that's not in the set 
they are familiar with seems pretty low.

Compared to that, the poor hamza issue pales into insignificance.

> Others have alternate spellings.
> And then when we get to emoji, well it's pretty hard to mix up a cat with a human with the word "cat", so, whether they seem "serious" or not, it seems to me to be pretty harsh to just get rid of the whole set when people clearly want to use them.  Sure, they're silly, but sometimes people aren't in businesses that deal entirely in life-and-death matters.
Emoji (those based on sequences) have their own issues, but I agree 
that I've never understood why "I <heart> NY" would inherently be more 
problematic than IDNs that use "ROCK DOTS".

> It is unclear to me, since you make the distinction, what you feel the boundaries between the ICANN/IETF/Unicode/Registrar 'problems' are?
> -Shawn
> -----Original Message-----
> From: John C Klensin [mailto:klensin at jck.com]
> Sent: Saturday, March 11, 2017 7:01 PM
> To: Shawn Steele <Shawn.Steele at microsoft.com>; idna-update at alvestrand.no
> Subject: RE: I-D Action: draft-klensin-idna-rfc5891bis-00.txt
> --On Saturday, March 11, 2017 21:52 +0000 Shawn Steele <Shawn.Steele at microsoft.com> wrote:
>> I'm not at all sure where language has anything to do with it.
>> And language independence is critical as the people trying to use the
>> IDN may not be native speakers of the IDN's language.
> You will find an explanation of where language is injected into the discussion in draft-klensin-idna-5892upd-unicode70
>> The fundamental "problem" is that some things seem like other things
>> to some people some of the time.
>> The size of the set of codepoints makes that inevitable.
>> There's been tons of discussion about strange quirks of seriously
>> esoteric characters and how those can lead to identifiers that seem
>> disparate, but aren't (or vice versa.) Kinda like ratholing on naive
>> vs naïve.
> Shawn, I've got a lot of sympathy in a lot of cases for "user beware", but that just isn't how contemporary systems are designed and users expect them to be designed.  As an example, consider the number of "do you really want to do that" messages that pop up by default in what I assume is your favorite operating system.  The examples of cases in which it is important to take some protective measures go back, in some communities, to strings that mix Latin and Cyrillic characters or that use strings of Cyrillic characters that look like Latin ones or even English words.  Those are not esoteric characters, much less "seriously esoteric" ones -- they are used by millions of people who write Slavic and other languages every day.
> There are also a separate set of issues associated with the rather basic Unicode design decision to incorporate both code points representing what, for lack of a better term, are often called precomposed characters and combining sequences consisting of a base character and a combining mark or two (there are other cases, but they are mostly esoteric).  In the overwhelming number of cases, the relationship between the precombined character and the combining sequence is that they are The Same Abstract Character by any reasonable definition... and the key reason for Unicode normalization (certainly not something the IETF invented) is to make sure that, in comparison operations, the two forms are treated as equivalent.  Sometimes normalization doesn't do the job people expect and _that_ is one place where things do get esoteric.
>> Yet we completely ignore other common problems, like Mueller vs Müller
>> - which are admittedly language independent.
> I hope you meant "language dependent".  On the one hand, I agree with you that, if characters, or strings more generally, that are actually different are going to be treated as the same, then it would be much better to support orthographic equivalences like the above and, for that matter, "color" and "colour" too.
> However, at least historically, that is an ICANN and/or registry problem, not an IETF one -- IDNA stops with character identities
> or the lack thereof.   I'm sure ICANN would be happy to hear
> from you on the subject, including your opinion about why orthographic differences are considered topics for matching (and treatment as "the same" to the extent that is possible) in some scripts but not in others.
>> In my opinion, part of the problem is goals:  Human usability vs
>> machine uniqueness.  For machine uniqueness I'd think any set of rules
>> would suffice because it boils down to a bunch of numbers.  For human
>> usability you end up with confusables being an issue.  Those cannot be
>> perfectly resolved because there are lots of minor pixel variations
>> that are perfectly valid yet different.
> But, as long as you expect, e.g., "ö" (U+00F6) and "ö" (U+006F
> U+0308) to be treated as the same character, you have a machine
> uniqueness problem that does not yield to bit string comparison (or "a bunch of numbers").  That is not a human perception or anything having to do with pixels.
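John's two encodings of "ö" make the point concretely: as code point sequences they are different, so a bit-string comparison fails, while NFC normalization maps them to the same form.

```python
# The two encodings of "ö": bitwise different, equal after NFC.
import unicodedata

precomposed = "\u00f6"   # LATIN SMALL LETTER O WITH DIAERESIS
combining = "o\u0308"    # 'o' followed by COMBINING DIAERESIS

print(precomposed == combining)  # False: raw comparison sees two strings
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
```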
>> ...
>> I'd prefer that the IETF standard be very lax with respect to
>> permissible characters, and, coming back to your document,  encourage
>> registrars to do the right thing for their customers  with respect to
>> permissible &/or mapped characters.
> Then, as far as the present document, and the material in IDNA on which it builds, are concerned, we are in agreement.  We may disagree on a lot of code points that IDNA disallows because they are identical to the abstract characters represented by other code points except for presumed use context (have a look at the "Mathematical" representations of some Greek and Latin characters, a distinction that depends largely on distinctive type styling and characters that Unicode doesn't think belong in identifiers either).  But, again, those decisions were not about pixels or even type styles and they have little or nothing to do with potential confusion among similarly-looking characters.
> FWIW, the Unicode "identifier and pattern syntax" spec is fairly restrictive (not "very lax") about the allowable code points too
> -- the rules they use are just different from the ones the IETF uses in IDNA for assorted subtle and not-so-subtle reasons.
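For a rough feel of how restrictive UAX #31-style identifier rules are, note that Python's own identifier syntax follows the XID_Start/XID_Continue properties from that spec. `str.isidentifier()` is therefore a loose proxy only - it is neither UTS #39 confusable detection nor IDNA's derived-property rules - but it shows the flavor:

```python
# str.isidentifier() approximates UAX #31 identifier rules
# (XID_Start/XID_Continue); it is only a proxy for that spec.
print("Müller".isidentifier())     # True: ü is XID_Continue
print("I\u2764NY".isidentifier())  # False: HEAVY BLACK HEART is not XID
```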
>       john
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
