I-D Action: draft-klensin-idna-rfc5891bis-00.txt

Sun Mar 12 04:01:24 CET 2017

--On Saturday, March 11, 2017 21:52 +0000 Shawn Steele
<Shawn.Steele at microsoft.com> wrote:

> I'm not at all sure where language has anything to do with it.
> And language independence is critical as the people trying to
> use the IDN may not be native speakers of the IDN's language.

You will find an explanation of where language is injected into
the discussion in draft-klensin-idna-5892upd-unicode70.

> The fundamental "problem" is that some things seem like other
> things to some people some of the time.
> 
> The size of the set of codepoints makes that inevitable.   
> 
> There's been tons of discussion about strange quirks of
> seriously esoteric characters and how those can lead to
> identifiers that seem disparate, but aren't (or vice versa.)
> Kinda like ratholing on naive vs naïve.

Shawn, I've got a lot of sympathy in a lot of cases for "user
beware", but that just isn't how contemporary systems are
designed or how users expect them to be designed.  As an example,
consider the number of "do you really want to do that" messages
that pop up by default in what I assume is your favorite
operating system.  The examples of cases in which it is
important to take some protective measures go back, in some
communities, to strings that mix Latin and Cyrillic characters
or that use strings of Cyrillic characters that look like Latin
ones or even English words.  Those are not esoteric characters,
much less "seriously esoteric" ones -- they are used by millions
of people who write Slavic and other languages every day.  
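
To make the Latin/Cyrillic case concrete, here is a rough Python
sketch.  The helper name scripts_used is hypothetical, and using the
first word of the Unicode character name as a stand-in for the
Script property is my assumption (the standard library does not
expose that property directly).  It is an illustration, not a
registry policy.

```python
import unicodedata

def scripts_used(label):
    """Guess the scripts in a label from Unicode character names.

    Assumption: the first word of the character name ("LATIN",
    "CYRILLIC", ...) approximates the Script property.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

latin = "pace"
mixed = "p\u0430ce"                        # U+0430 CYRILLIC SMALL LETTER A
all_cyrillic = "\u0440\u0430\u0441\u0435"  # Cyrillic letters resembling "pace"

print(scripts_used(latin))         # Latin only
print(scripts_used(mixed))         # Latin and Cyrillic
print(scripts_used(all_cyrillic))  # Cyrillic only
print(latin == all_cyrillic)       # different code points
```

Note that a mixed-script check of this kind flags the second string
but not the third, which is entirely Cyrillic yet can look like a
Latin word in many fonts.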

There is also a separate set of issues associated with the
rather basic Unicode design decision to incorporate both code
points representing what, for lack of a better term, are often
called precomposed characters and combining sequences consisting
of a base character and a combining mark or two (there are other
cases, but they are mostly esoteric).  In the overwhelming
number of cases, the relationship between the precomposed
character and the combining sequence is that they are The Same
Abstract Character by any reasonable definition... and the key
reason for Unicode normalization (certainly not something the
IETF invented) is to make sure that, in comparison operations,
the two forms are treated as equivalent.  Sometimes
normalization doesn't do the job people expect and _that_ is one
place where things do get esoteric.

> Yet we completely ignore other common problems, like Mueller
> vs Müller - which are admittedly language independent.

I hope you meant "language dependent".  On the one hand, I agree
with you that, if characters, or strings more generally, that
are actually different are going to be treated as the same, then
it would be much better to support orthographic equivalences
like the above and, for that matter, "color" and "colour" too.
However, at least historically, that is an ICANN and/or registry
problem, not an IETF one -- IDNA stops with character identities
or the lack thereof.   I'm sure ICANN would be happy to hear
from you on the subject, including your opinion about why
orthographic differences are considered topics for matching (and
treatment as "the same" to the extent that is possible) in some
scripts but not in others.

> In my opinion, part of the problem is goals:  Human usability
> vs machine uniqueness.  For machine uniqueness I'd think any
> set of rules would suffice because it boils down to a bunch of
> numbers.  For human usability you end up with confusables
> being an issue.  Those cannot be perfectly resolved because
> there are lots of minor pixel variations that are perfectly
> valid yet different.  

But, as long as you expect, e.g., "ö" (U+00F6) and "ö" (U+006F
U+0308) to be treated as the same character, you have a machine
uniqueness problem that does not yield to bit string comparison
(or "a bunch of numbers").  That is not a matter of human
perception or anything having to do with pixels.
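
A minimal Python sketch of that point, using the standard
unicodedata module (the variable names are mine, chosen only for
illustration):

```python
import unicodedata

precomposed = "\u00f6"   # LATIN SMALL LETTER O WITH DIAERESIS
combining = "o\u0308"    # "o" followed by COMBINING DIAERESIS

# Compared as code point sequences or as UTF-8 bytes, the two differ.
raw_equal = precomposed == combining
bytes_equal = precomposed.encode("utf-8") == combining.encode("utf-8")

# After normalization to a common form, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", combining))
nfd_equal = (unicodedata.normalize("NFD", precomposed)
             == unicodedata.normalize("NFD", combining))

print(raw_equal, bytes_equal, nfc_equal, nfd_equal)
```

The first two comparisons fail and the last two succeed, so
uniqueness here depends on agreeing on a normalization form before
comparing, not on the raw numbers.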

>...
> I'd prefer that the IETF standard be very lax with respect to
> permissible characters, and, coming back to your document,
> encourage registrars to do the right thing for their customers
> with respect to permissible &/or mapped characters.

Then, as far as the present document, and the material in IDNA
on which it builds, are concerned, we are in agreement.  We may
disagree on a lot of code points that IDNA disallows because
they are identical to the abstract characters represented by
other code points except for presumed use context (have a look
at the "Mathematical" representations of some Greek and Latin
characters, a distinction that depends largely on distinctive
type styling and characters that Unicode doesn't think belong in
identifiers either).  But, again, those decisions were not about
pixels or even type styles and they have little or nothing to do
with potential confusion among similar-looking characters.
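
As a rough illustration of the "Mathematical" case, the Python
sketch below shows how Unicode's own compatibility data relates two
of those code points to ordinary letters.  The choice of the two
sample code points is mine, and this shows only the relationship
between the characters, not the IDNA derivation rules themselves.

```python
import unicodedata

math_bold_a = "\U0001d400"      # MATHEMATICAL BOLD CAPITAL A
math_bold_alpha = "\U0001d6c2"  # MATHEMATICAL BOLD SMALL ALPHA

# Compatibility normalization (NFKC) folds each onto the plain letter.
folded_a = unicodedata.normalize("NFKC", math_bold_a)
folded_alpha = unicodedata.normalize("NFKC", math_bold_alpha)

# The decomposition is tagged <font>: a styling distinction only.
tag = unicodedata.decomposition(math_bold_a)

print(folded_a, folded_alpha, tag)
```

Canonical normalization (NFC) leaves these code points alone; only
the compatibility forms treat them as styled variants of "A" and
Greek small alpha.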

FWIW, the Unicode "identifier and pattern syntax" spec is fairly
restrictive (not "very lax") about the allowable code points too
-- the rules they use are just different from the ones the IETF
uses in IDNA for assorted subtle and not-so-subtle reasons.

     john