Visually confusable characters (2)

John C Klensin klensin at jck.com
Mon Aug 11 02:42:59 CEST 2014



--On Sunday, August 10, 2014 12:16 -0700 Asmus Freytag
<asmusf at ix.netcom.com> wrote:

> This message responds to point (2)
> 
> A./
>> (2) ICANN has real authority in this space.
>> 
>> ICANN has real authority in only two areas: decisions about
>> what top-level domains to allocate and delegate and
>> obligations they can impose on "contracted parties".   Even
>> those authorities are limited: for example, in the IDN space,
>...
> John,
> 
> thanks for the additional examples, but so far, this matches my
> understanding of the situation. I certainly wasn't thinking of
> ICANN as an authority, which however does not prevent me from
> applying insights gained while working on projects in its
> sphere.

ICANN's authority is relevant only if one tries to project TLD
LGR tables and rules, or other restrictive global registration
guidelines on DNS registration activities lower in the tree or
by what ICANN sometimes calls non-contracted entities.   I was
responding in that regard to your apparent suggestion that more
general application and enforcement of those sorts of rules
would be helpful.   Perhaps that is covered in one of your other
notes.

> Using text as an identifier (as opposed to IP addresses) drags
> with it all the messy, conflicting and sometimes self-inconsistent
> ways in which communities use text.

Let me say something a little different, from more of a DNS
perspective, and see if that helps us understand whether we
agree or not.  The hierarchical structured identifiers that are
used in and with the DNS are primarily text-based mnemonics and
not primarily words or other language elements.  When people get
those mnemonics confused with elements of languages or expect
them to be able to represent the full range of orthographic
conventions associated with particular languages, they either get
themselves into trouble or get very frustrated, sometimes by
things that they expect to "work" in ordinary language that
don't work in the DNS.    

A large fraction of those issues and surprises don't even have
anything to do with non-ASCII characters.   "O'Reilly's Bar and
Grill" and "Big Company, LLC." have never been able to have a
usable domain name of that form.  Technically, the DNS can
accommodate such names, but protocols like SMTP and HTTP will not
tolerate them:  the apostrophe is prohibited, spaces and commas
are prohibited too, the use of period as part of a label is
problematic in several ways, and various compression rules make
it impossible to preserve case distinctions.   Even if those
organizations are willing to accept "oreillysbarandgrill" and
"bigcompanyllc" as labels, they promptly encounter problems with
global uniqueness (especially when using global, generic, TLDs)
that are not significant issues in what we usually consider the
real world.
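
As a rough illustration of the restrictions just described, here is a
minimal Python sketch.  It assumes the conventional "LDH" host-label
rule (ASCII letters, digits, and hyphen; 1 to 63 characters; no leading
or trailing hyphen).  The names is_ldh_label, same_label, and squash are
hypothetical helpers invented for this example, not part of any
library or specification.

```python
import re

# Assumed rule: conventional LDH host label (letters, digits, hyphen),
# 1 to 63 characters, no hyphen at either end.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_label(label: str) -> bool:
    """Return True if the whole string fits the assumed LDH label rule."""
    return _LDH_LABEL.fullmatch(label) is not None


def same_label(a: str, b: str) -> bool:
    """ASCII labels are matched without regard to case."""
    return a.lower() == b.lower()


def squash(name: str) -> str:
    """Hypothetical helper: keep only ASCII letters and digits, lowercased."""
    return re.sub(r"[^A-Za-z0-9]", "", name).lower()


for name in ("O'Reilly's Bar and Grill", "Big Company, LLC."):
    print(repr(name), is_ldh_label(name), "->", squash(name))
```

Neither business name passes as written; only the squashed forms do,
and "BigCompanyLLC" and "bigcompanyllc" match as the same label.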

> Sometimes that is expressed as being a problem of "Unicode",
> when in fact it is not.

Indeed, as the above examples illustrate.  

At the same time, because the DNS, of necessity, focuses on
mnemonics and character shapes (at least within a script) rather
than on language-specific or phonetically-linked criteria, there
is the potential for Unicode to not have the right facilities to
support what IDNA needs.  The RFC 5892 part of IDNA itself is an
example: it creates new derived properties by mixing and
matching existing Unicode character properties because the right
combination did not exist in the Unicode specs.   Whether that
is a "problem of (or with) Unicode" or not is a matter of
perception.  My preference is to think of the ability to create
such application-specific derived properties (even if some
explicit exceptions are needed, as they were with IDNA2008) as
evidence of Unicode's strength and flexibility.  
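
To make "derived property" concrete, here is a deliberately simplified
Python sketch.  It combines only two ingredients in the spirit of RFC
5892 (a letter/digit test on General_Category and a stability test
under NFKC plus case folding) and two of the explicit exceptions.  It
is not the RFC 5892 algorithm, which has many more rules and contextual
cases; toy_allowed and the other names are hypothetical, and results
depend on the Unicode version shipped with the Python interpreter.

```python
import unicodedata

# General_Category values treated as "letters and digits" in this sketch.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}

# Two explicit exceptions (sharp s and final sigma) that would otherwise
# fail the stability test below.
EXCEPTIONS_ALLOWED = {"\u00DF", "\u03C2"}


def is_letter_digit(ch: str) -> bool:
    return unicodedata.category(ch) in LETTER_DIGIT_CATEGORIES


def is_unstable(ch: str) -> bool:
    """True if ch changes under NFKC(casefold(NFKC(ch)))."""
    step1 = unicodedata.normalize("NFKC", ch)
    step2 = unicodedata.normalize("NFKC", step1.casefold())
    return step2 != ch


def toy_allowed(ch: str) -> bool:
    """Toy derived property: exception, or a stable letter/digit."""
    if ch in EXCEPTIONS_ALLOWED:
        return True
    return is_letter_digit(ch) and not is_unstable(ch)


for ch in ("a", "A", "\u00E9", "\uFB01", "!", "\u00DF"):
    print(f"U+{ord(ch):04X}", toy_allowed(ch))
```

The sharp s is the interesting case: it is unstable under the generic
test (case folding maps it to "ss"), so it is allowed here only because
it is listed as an exception.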

At the same time, as Andrew has just explained, paralleling
things that Patrik, I, and others have also said, IDNA is
hugely dependent on consistency and predictability in Unicode
behavior, including decisions about coding, properties, and
normalization, especially going forward from about 5.0.  If
we have assumed, based on what we read in the standard and what
we were told, that normalization was going to preserve equality
of identical-appearing characters in the same script, or that the
statements about language-insensitivity in Section 2.2 were
primary, and those assumptions turn out not to be correct, then
that still may not be "a problem with Unicode" but it may be a
very significant problem at the Unicode-IDNA boundary (and the
Unicode-IETF coding and protocol boundaries more generally).  As
I have said before, I don't think that justifies trying to
abandon Unicode and I don't think I've heard anyone advocating
abandoning Unicode who hasn't thought Unicode was the wrong
solution for years.  But it may require either excluding
characters that, under different circumstances, we'd want to
allow or developing new mechanisms at the IDNA-Unicode boundary
so that the derived properties we use still end up with as
nearly the right set of behaviors for IDNA as possible.
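
The normalization assumption above can be checked mechanically.  The
following Python sketch compares two pairs of Latin-script strings
under NFC.  The pairs are my own choice of illustration, not cases
taken from this thread, and whether the second pair actually renders
alike depends on the font; nfc_equal is a hypothetical helper name.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Precomposed e-acute versus "e" followed by COMBINING ACUTE ACCENT.
# U+00E9 has a canonical decomposition, so NFC unifies the two.
E_ACUTE = "\u00E9"
E_PLUS_ACUTE = "e\u0301"

# LATIN SMALL LETTER O WITH STROKE versus "o" followed by
# COMBINING LONG SOLIDUS OVERLAY.  U+00F8 has no canonical
# decomposition, so NFC leaves the two sequences distinct.
O_STROKE = "\u00F8"
O_PLUS_OVERLAY = "o\u0338"

print(nfc_equal(E_ACUTE, E_PLUS_ACUTE))
print(nfc_equal(O_STROKE, O_PLUS_OVERLAY))
```

The first comparison holds and the second does not, which is the kind
of gap that a protocol relying on normalization alone would not catch.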

best,
   john


