I-D Action: draft-klensin-idna-rfc5891bis-00.txt

Asmus Freytag asmusf at ix.netcom.com
Sun Mar 12 06:40:48 CET 2017


On 3/11/2017 1:52 PM, Shawn Steele wrote:
> I'm not at all sure where language has anything to do with it.  And language independence is critical as the people trying to use the IDN may not be native speakers of the IDN's language.

I strongly agree; in the case of Unicode plain text (that is, without 
font binding) knowing the "language" can, in principle, lead to some 
differences in rendering, but they tend to be truly minor -- like the 
difference in preferred angle of the acute accent for Polish vs. 
French. The same differences can be achieved via font selection; in 
general, the differences in appearance of plain text due to the fonts 
selected when reading it tend to exceed the nominal differences in 
preferred appearance based on "language".

For identifiers, the fonts that truly matter are the standard UI fonts. 
For those, there's not a whole lot of variation, and also not a lot of 
language dependence.
>
> The fundamental "problem" is that some things seem like other things to some people some of the time.
>
> The size of the set of codepoints makes that inevitable.

When humans need to distinguish identifiers, human perception (and 
reading habits) come into play.

>
> There's been tons of discussion about strange quirks of seriously esoteric characters and how those can lead to identifiers that seem disparate, but aren't (or vice versa.)  Kinda like ratholing on naive vs naïve.

"Seriously esoteric" -- I like that.

Some of the esoteric stuff, like historical characters or characters 
for unused or dying orthographies, should be filtered out -- that 
simply removes a set of problems without affecting utility.
>
> Yet we completely ignore other common problems, like Mueller vs Müller - which are admittedly language independent.
>
> In my opinion, part of the problem is goals:  Human usability vs machine uniqueness.  For machine uniqueness I'd think any set of rules would suffice because it boils down to a bunch of numbers.  For human usability you end up with confusables being an issue.  Those cannot be perfectly resolved because there are lots of minor pixel variations that are perfectly valid yet different.

It is useful to think of distances in perceptual space. Things like 
"rn" vs. "m" have surprisingly little distance in perceptual space, 
but "r", "n", and "m", taken by themselves, are readily identified.

This is just one aspect where it's clear that trying to understand the 
issue on the basis of individual code points does not lead to a 
solution of the full problem.
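
To make that concrete, here is a minimal sketch (in Python) of 
confusable detection in the spirit of the UTS #39 "skeleton" approach. 
The folding table below is a tiny made-up subset purely for 
illustration; the real data lives in Unicode's confusables.txt.

    import unicodedata

    # Illustrative subset only; real entries come from confusables.txt.
    CONFUSABLE_MAP = {
        "rn": "m",       # sequence-to-character fold: "rn" looks like "m"
        "cl": "d",       # "cl" can look like "d"
        "\u0131": "i",   # dotless i folds to i
        "\u043e": "o",   # Cyrillic o folds to Latin o
    }

    def skeleton(label: str) -> str:
        """Fold a label into a crude perceptual 'skeleton'."""
        s = unicodedata.normalize("NFD", label.lower())
        for src, dst in CONFUSABLE_MAP.items():
            s = s.replace(src, dst)
        return s

    # Two labels are candidate confusables if their skeletons collide:
    assert skeleton("corn") == skeleton("com")

The point of the skeleton trick is exactly that it operates on 
sequences, not individual code points.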


>
> Another human usability desire is to have the "right" display form, so that my business's advertising "looks right" and yet customers can still find my business online.  IMO, that's the most interesting part of mapping, allowing pretty human forms to turn into machine-readable and somewhat consistent forms.  An unfortunate problem of the mapping is that the mapped form may not be as pretty.

In working on the Root Zone LGR we keep coming across issues that 
never made it into any of these discussions before. For example, in 
Ethiopic, the dominant language Amharic contains many words derived 
from an older language. As a consequence of how pronunciation and 
orthography changed over time, the current situation is that a large 
percentage of words have alternate spellings that are equally 
acceptable (and are sounded out the same) but apparently without a 
canonical form.

It would be like having the spelling alternations "or/our" or "ise/ize" 
in English be personal or even 'ad-hoc' preferences.

The DNS cannot handle an unbounded number of alternates; the best that 
can be done is to prevent malicious registrations of sound-alike 
forms. Variants (what you call "mapped" characters) are not directly 
supported in the DNS, so there is no guarantee that all aliased or 
variant labels (derived from mapped code points) lead to the same 
server. (And in some cases it is necessary for a server to know all 
the ways by which it can be reached.)
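
You can see why directly: labels that users may treat as the "same 
name" encode to entirely distinct A-labels, and nothing in the 
protocol ties them together. A toy illustration using Python's 
built-in IDNA codec (IDNA2003; the stricter idna package produces the 
same A-labels for this pair):

    # Related labels produce unrelated A-labels; the DNS sees two
    # completely independent owner names.
    for label in ("müller", "mueller"):
        a_label = label.encode("idna").decode("ascii")
        print(label, "->", a_label)
    # müller -> xn--mller-kva
    # mueller -> mueller

Any bundling of such labels has to happen in registry policy, not in 
the protocol.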

I don't think we can simply wish away that problem -- as attractive as 
that would be for those cases where "free alternation" of spellings is 
an issue.
>
> There's perhaps another perceived unspoken requirement for uniqueness while still round-tripping through humans.  IMO, that's an unachievable expectation; there's no way a human can transcribe all of the reasonable names perfectly every time.  We can't even get O and 0 right, so dot below vs comma below or other more subtle issues are hopeless when written on a napkin.

Given the way people actually use domain names, writing them on 
napkins is probably not the most common scenario. It does happen, but 
it is far from the majority use case.
>
> I'd prefer that the IETF standard be very lax with respect to permissible characters, and, coming back to your document, encourage registrars to do the right thing for their customers with respect to permissible &/or mapped characters.

I see no reason to add to the perceptual problems by allowing code 
points for which there is no good use case (i.e., they are not part of 
some users' actual orthography, and those users are among the targets 
of that zone).

The protocols, on the other hand, need to serve all possible zones, so 
they must be the least restrictive. What John is after is the reminder 
that, for a given zone, restrictions appropriate for the intended user 
group must be put in place to make the zone as secure as possible.
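
In practice that narrowing is just a per-zone filter layered on top of 
protocol validity. A rough sketch, with a hypothetical repertoire for 
a Latin-script zone (a real policy would come from the zone's LGR, not 
a hard-coded set):

    import unicodedata

    # Hypothetical zone policy: LDH plus the few Latin letters with
    # diacritics actually used by the zone's target community.
    ZONE_REPERTOIRE = set("abcdefghijklmnopqrstuvwxyz0123456789-") | set("äöüß")

    def registry_accepts(label: str) -> bool:
        """Zone-specific filter applied on top of IDNA protocol validity."""
        label = unicodedata.normalize("NFC", label)
        return all(ch in ZONE_REPERTOIRE for ch in label)

    print(registry_accepts("müller"))        # True: within the repertoire
    print(registry_accepts("mu\u0435ller"))  # False: Cyrillic е rejected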

A./
>
> -Shawn
>
> -----Original Message-----
> From: John C Klensin [mailto:klensin at jck.com]
> Sent: Saturday, March 11, 2017 1:27 PM
> To: Shawn Steele <Shawn.Steele at microsoft.com>; idna-update at alvestrand.no
> Subject: RE: I-D Action: draft-klensin-idna-rfc5891bis-00.txt
>
>
>
> --On Saturday, March 11, 2017 19:47 +0000 Shawn Steele <Shawn.Steele at microsoft.com> wrote:
>
>> It makes sense to reinforce that registrars need to do their own
>> narrowing of code points according to their needs.
> That was the position that motivated the document.
>
>> WRT the other issues that are avoided here, IMO the IETF should defer
>> to Unicode as they are the ones that add new codepoints and they fully
>> understand the security and other issues in the space.  Encoding
>> characters is, after all, their expertise.
> No one has questioned their ability to encode characters.  The issues are things that have been issues since IDN work was
> initiated (at least since the decision to use Unicode).   I
> think Unicode is a great system for encoding running text and an even better one for encoding text that is to be rendered and printed or otherwise displayed.  There is no dispute about that.
> However, for identifiers and identifier matching, there are differences in philosophy, several of which have been illustrated by issues that have shown up in the last few years.
> As one example, the DNS, at least as IDNs and IDNA were conceived, does not have any "language" context, so Unicode distinctions among code points, or ways of composing characters, that are based strictly on language distinctions don't work well within IDNA.  (We could have designed IDNA to incorporate language information, but we didn't, for what seemed like good reasons at the time -- and still do to many of us.  But if anyone has the stomach to reopen that design question and start planning a major incompatible change were the decision to go the other way, go for it.)
>
> best,
>     john
>
>
>
>
>



