I-D Action: draft-klensin-idna-rfc5891bis-00.txt

Sun Mar 12 04:23:41 CET 2017

My concern is not that people are worried about mixing Latin and Cyrillic.  Clearly it's good to avoid that.

My concern is that because we can say Latin+Cyrillic is bad, and because we can figure out a rule to avoid it (don't mix scripts), then we presume that there is a solution for all other homograph or other related conditions.  The space is too big.  There cannot be.  Especially if that system must be extended to humans writing URLs on a napkin.  Assuming that everything evil is quantifiable and correctable by some perfect algorithm is a fallacy.  That does not mean we should not try to protect the space, but we need to recognize that there are limits to what is achievable there.

With Bucureşti, since the forms are side by side here, it may be trivial to tell the difference between București and other possible forms that someone might try, like Bucureṣti, but without context people may just think it's a quirky font.  And on paper the distinction is lost completely.  For some languages the differences are far more confusing because of the number of characters involved.  Others have alternate spellings.

And then when we get to emoji, well, it's pretty hard to mix up a cat with a human or with the word "cat", so, whether they seem "serious" or not, it seems to me to be pretty harsh to just get rid of the whole set when people clearly want to use them.  Sure, they're silly, but sometimes people aren't in businesses that deal entirely in life-and-death matters.

It is unclear to me, since you make the distinction, where you feel the boundaries lie between the ICANN, IETF, Unicode, and registrar "problems".

-Shawn

-----Original Message-----
From: John C Klensin [mailto:klensin at jck.com] 
Sent: Saturday, March 11, 2017 7:01 PM
To: Shawn Steele <Shawn.Steele at microsoft.com>; idna-update at alvestrand.no
Subject: RE: I-D Action: draft-klensin-idna-rfc5891bis-00.txt

--On Saturday, March 11, 2017 21:52 +0000 Shawn Steele <Shawn.Steele at microsoft.com> wrote:

> I'm not at all sure where language has anything to do with it.
> And language independence is critical as the people trying to use the 
> IDN may not be native speakers of the IDN's language.

You will find an explanation of where language is injected into the discussion in draft-klensin-idna-5892upd-unicode70.

> The fundamental "problem" is that some things seem like other things 
> to some people some of the time.
> 
> The size of the set of codepoints makes that inevitable.   
> 
> There's been tons of discussion about strange quirks of seriously 
> esoteric characters and how those can lead to identifiers that seem 
> disparate, but aren't (or vice versa.) Kinda like ratholing on naive 
> vs naïve.

Shawn, I've got a lot of sympathy in a lot of cases for "user beware", but that just isn't how contemporary systems are designed, nor how users expect them to be designed.  As an example, consider the number of "do you really want to do that" messages that pop up by default in what I assume is your favorite operating system.  The examples of cases in which it is important to take some protective measures go back, in some communities, to strings that mix Latin and Cyrillic characters or that use strings of Cyrillic characters that look like Latin ones or even English words.  Those are not esoteric characters, much less "seriously esoteric" ones -- they are used by millions of people who write Slavic and other languages every day.
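To make the two Latin/Cyrillic cases above concrete, here is a minimal Python sketch.  It is only an illustration under stated assumptions: Python's standard library does not expose the Unicode Script property, so the hypothetical helper `scripts_used` guesses a script from the first word of each character name, and the sample strings are invented.  It is not how any registry or browser actually implements such a check.

```python
import unicodedata

def scripts_used(label):
    """Rough script guess per letter: first word of the Unicode character name.

    Assumption: this is a heuristic stand-in for the real Script property.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_mixed_script(label):
    """True if the letters in the label appear to come from more than one script."""
    return len(scripts_used(label)) > 1

# Invented examples: one Cyrillic letter inside a Latin string, and an
# all-Cyrillic string that looks like the Latin string "coco".
mixed = "p\u0430ypal"                      # U+0430 CYRILLIC SMALL LETTER A
whole_cyrillic = "\u0441\u043e\u0441\u043e"

print(is_mixed_script(mixed))              # True: LATIN and CYRILLIC
print(is_mixed_script(whole_cyrillic))     # False: only CYRILLIC
print(whole_cyrillic == "coco")            # False: different code points
```

A "don't mix scripts" rule flags the first string but says nothing about the second, which is a single-script string that merely resembles a Latin one.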

There is also a separate set of issues associated with the rather basic Unicode design decision to incorporate both code points representing what, for lack of a better term, are often called precomposed characters and combining sequences consisting of a base character and a combining mark or two (there are other cases, but they are mostly esoteric).  In the overwhelming majority of cases, the relationship between the precomposed character and the combining sequence is that they are The Same Abstract Character by any reasonable definition... and the key reason for Unicode normalization (certainly not something the IETF invented) is to make sure that, in comparison operations, the two forms are treated as equivalent.  Sometimes normalization doesn't do the job people expect and _that_ is one place where things do get esoteric.

> Yet we completely ignore other common problems, like Mueller vs Müller 
> - which are admittedly language independent.

I hope you meant "language dependent".  On the one hand, I agree with you that, if characters, or strings more generally, that are actually different are going to be treated as the same, then it would be much better to support orthographic equivalences like the above and, for that matter, "color" and "colour" too.  However, at least historically, that is an ICANN and/or registry problem, not an IETF one -- IDNA stops with character identities or the lack thereof.  I'm sure ICANN would be happy to hear from you on the subject, including your opinion about why orthographic differences are considered topics for matching (and treatment as "the same" to the extent that is possible) in some scripts but not in others.

> In my opinion, part of the problem is goals:  Human usability vs 
> machine uniqueness.  For machine uniqueness I'd think any set of rules 
> would suffice because it boils down to a bunch of numbers.  For human 
> usability you end up with confusables being an issue.  Those cannot be 
> perfectly resolved because there are lots of minor pixel variations 
> that are perfectly valid yet different.

But, as long as you expect, e.g., "ö" (U+00F6) and "ö" (U+006F U+0308) to be treated as the same character, you have a machine uniqueness problem that does not yield to bit string comparison (or "a bunch of numbers").  That is not a matter of human perception or anything having to do with pixels.
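A short Python sketch of that point, using only the standard `unicodedata` module.  The helper name `same_after_nfc` is hypothetical, and NFC is chosen here purely for illustration of normalized comparison.

```python
import unicodedata

precomposed = "\u00f6"     # LATIN SMALL LETTER O WITH DIAERESIS
combining = "o\u0308"      # LATIN SMALL LETTER O + COMBINING DIAERESIS

def same_after_nfc(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(len(precomposed), len(combining))        # 1 2
print(precomposed == combining)                # False: raw code points differ
print(same_after_nfc(precomposed, combining))  # True: equivalent once normalized
```

The raw comparison fails even though nothing about fonts or rendering is involved; only the normalized comparison treats the two forms as the same character.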

>...
> I'd prefer that the IETF standard be very lax with respect to
> permissible characters, and, coming back to your document, encourage
> registrars to do the right thing for their customers with respect to
> permissible &/or mapped characters.

Then, as far as the present document, and the material in IDNA on which it builds, are concerned, we are in agreement.  We may disagree on a lot of code points that IDNA disallows because they are identical to the abstract characters represented by other code points except for presumed use context (have a look at the "Mathematical" representations of some Greek and Latin characters, a distinction that depends largely on distinctive type styling and characters that Unicode doesn't think belong in identifiers either).  But, again, those decisions were not about pixels or even type styles and they have little or nothing to do with potential confusion among similarly-looking characters.
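As a rough illustration of the "Mathematical" case, the Python sketch below checks whether a code point survives compatibility normalization plus case folding unchanged.  This is modeled on my reading of the stability test in the IDNA2008 tables document (RFC 5892), but it is an approximation: the function name `is_nfkc_casefold_stable` is hypothetical, Python's `str.casefold` stands in for Unicode case folding, and the real derivation involves other rules and exceptions not shown here.

```python
import unicodedata

def is_nfkc_casefold_stable(ch):
    """True if ch is unchanged by NFKC, then case folding, then NFKC again."""
    folded = unicodedata.normalize("NFKC", ch).casefold()
    return unicodedata.normalize("NFKC", folded) == ch

math_bold_a = "\U0001d400"    # MATHEMATICAL BOLD CAPITAL A
math_alpha = "\U0001d6fc"     # MATHEMATICAL ITALIC SMALL ALPHA

print(unicodedata.normalize("NFKC", math_bold_a))   # A
print(is_nfkc_casefold_stable(math_bold_a))         # False
print(is_nfkc_casefold_stable(math_alpha))          # False
print(is_nfkc_casefold_stable("\u03b1"))            # True: plain Greek alpha
print(is_nfkc_casefold_stable("a"))                 # True
```

The mathematical letters collapse onto ordinary Latin and Greek letters under normalization, which is a statement about character identity rather than about how similar two glyphs look.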

FWIW, the Unicode "identifier and pattern syntax" spec is fairly restrictive (not "very lax") about the allowable code points too -- the rules they use are just different from the ones the IETF uses in IDNA for assorted subtle and not-so-subtle reasons.

     john