Confusability (Re: New version, draft-faltstrom-idnabis-tables-02.txt, available)

Tue Jun 19 23:03:46 CEST 2007

--On Monday, 18 June, 2007 12:06 +0100 Gervase Markham
<gerv at mozilla.org> wrote:

>...
> And this, incidentally, is one good reason why the chap whose
> email just came through who bought www.<peace sign>.com as a
> speculator was sadly misguided. How is anyone going to type
> that? Or talk about it in a way which doesn't confuse it with
> www.peacesign.com?

Or "www.inverted-y-in-a-circle.com".  If one can't type them,
one is in big trouble.  If one also can't describe them to
others in a reasonable, clear, and consistent way, one is in
worse trouble.  It may very well be that, by doing a
character-by-character analysis, we could discover symbols that
are safe.  Indeed, I'm sure we could, since characters like the
ASCII plus-sign, which has been prohibited since the dawn of the
host naming rules of the ARPANET, would probably meet that
criterion.   But there has been no persuasive case that such a
search and definition process is worth the trouble.

>...
> It isn't clear to me that we must exclude these. Like Greek
> omicron and Latin o, assuming that these "font variations" are
> in fact characters in someone's written language, can't we
> include them and sort the confusable problem out at a higher
> level?

Maybe.

There are at least two issues here.

The first, and one that I think we need to keep very much in
mind, is that there are two ways of implementing that "higher
level".  In one, applications that encounter and present IDNs
make their own decisions, presumably on a basis that is local to
the application and implementation, as to what should be
displayed and/or processed and how.   I believe that is a useful
tool in our collection of tools for dealing with problem IDNs.
At the same time, it implies that different users are seeing
different presentations under different circumstances and,
indeed, that the same user may see different presentations while
running different software on the same machine.   Users tend to
think that violates the principle of least astonishment, and it
makes the already-difficult problem of transmitting a URI or IRI
containing IDNs in arbitrary scripts to others, and having them
interpreted accurately, even more difficult.   So I think we
need to be extremely careful to understand that there are
tradeoffs and that "let's push it upward rather than addressing
it here" has associated costs and risks.
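
To make the application-level option concrete, here is a minimal
sketch of the kind of local check an application might apply before
displaying a label.  It is illustrative only: the function names are
hypothetical, and taking the first word of each character's Unicode
name as its "script" is a rough assumption made for brevity, not the
Unicode Script property and not anything specified for IDNA.

```python
import unicodedata


def scripts_in_label(label):
    """Return rough script names for the letters in one domain label.

    Assumption: the first word of the Unicode character name
    (e.g. "LATIN", "GREEK") is used as a stand-in for the script.
    """
    scripts = set()
    for ch in label:
        if not ch.isalpha():
            continue  # skip digits and hyphens
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


def is_mixed_script(label):
    """True if the letters of the label come from more than one script."""
    return len(scripts_in_label(label)) > 1


latin = "google"
spoof = "g\u03bfogle"  # second letter is GREEK SMALL LETTER OMICRON

print(latin == spoof)            # the two strings are different
print(scripts_in_label(spoof))   # scripts seen in the spoofed label
print(is_mixed_script(latin), is_mixed_script(spoof))
```

Two applications that pick different heuristics, or different
thresholds, will show the same label differently, which is exactly
the inconsistency described above.  A check like this also says
nothing about a label written entirely in one script that happens
to resemble a label in another.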

The second involves applying restrictions to registrations.  But
registration restrictions are ultimately voluntary with
registries, registrars, or bodies having authority over
registrations.  If we encourage applying different rules to
different scripts and in different environments, as we probably
must, we will certainly get even more variability.  There are
many millions of domains on the Internet, at all levels of the
hierarchy, each of which may potentially have its own
registration policies.  Even if ICANN makes firm policies in
this area and can enforce them (another discussion for other
groups), its ability to do so extends to only about a dozen
domains.  Of the 250-odd ccTLDs, we can assume that most
of those who are committed to the utility of the Internet and
good references will be persuaded by a good explanation of why
restrictions are needed.  But some of those domains are not thus
committed.   And, for the balance of the domains, at all levels
of the tree, all bets are off.

At one level, variation among registries in the restrictions
they apply is not a problem: names that are not registered are
just not found, regardless of the reasons they were not
registered.  But, because users who expected to be protected by
those restrictions may be surprised or even harmed if the
restrictions aren't there, we again need to be careful about a
"just push up" strategy even though I consider registry
restrictions to be a critical tool in the kit.

This combination of things has led several of us to conclude
that any restrictions or provisions that are sensible and can be
incorporated in the protocol should be there, rather than in
upper layers that will inevitably exhibit varying and
inconsistent behavior.  Of course, the devil is in figuring out
what is sensible and appropriate.

      john