Symbols and Line-drawing (was: Re: Proposed Charter for the IDNAbis Working Group)

Thu Mar 27 17:41:31 CET 2008

--On Thursday, 27 March, 2008 11:23 +0900 Martin Duerst
<duerst at it.aoyama.ac.jp> wrote:

> At 20:49 08/03/26, Gervase Markham wrote:
> 
>> However, I don't want to have to have the "why are you
>> allowing line-drawing and other non-letters in your IDNs?" -
>> "Er, it's a business differentiator" conversation with 50
>> different registries.
> 
> I frankly don't understand why some people are afraid of
> line-drawing or smiley or similar characters. Whether they are
> included or excluded doesn't really make any difference. If a
> registry thinks they are a differentiator, they may be, but
> only in the sense that the registry will be able to cheat some
> users into buying domain names with these characters in them.
> 
> Let's face it: These characters, for the most part, are very
> difficult to type (or otherwise input) in most contexts.
> Characters that are difficult to input are almost useless in
> domain names. Domain names containing these characters are
> very much useless, too. As a bottom line, natural selection
> will make sure we won't get too many of these.
> 
> This is different from some very specific, small subset of
> symbols that may be highly similar to e.g. dot or hyphen or
> so, which indeed can create problems.

Martin,

Let me plead, again, for getting the charter under control and
then, as appropriate, dig into these kinds of arguments on a
case-by-case basis.

To review what has been said before, line- (or box-) drawing
characters are not an ideal example, for the reasons you
identify, but they are still relevant.  We could also debate,
probably endlessly, whether the key issue is "easy to type",
"easy to recognize accurately", "easy to deal with in a file",
or "easy to describe in a database record".  Each of those is
important, perhaps most important, to a different community,
and the difference between the first two is key to a collection
of confusability issues (including the phishing ones).

But the reason for banning those characters is a matter of
protocol design that, for the DNS, goes back to the first "host
name" rules.  It goes back much further in programming language
contexts and is reflected in Unicode's "identifier" rules. Even
though each one of those systems may end up with a slightly
different character list, the general principle is that there
are characters that one uses to form identifiers or names and
characters that are reserved for use as delimiters or as other
pieces of syntax.  There have been historical exceptions to the
need to make that distinction, but they have not been successful
without additional assistance (e.g., Publication Algol doesn't
work directly as a programming language unless one can figure
out how to type parts of programs in boldface and even it
doesn't permit, e.g., spaces in identifiers).
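
As a small illustration of the programming-language case, Python's
identifier syntax is based on the Unicode identifier rules, so its
built-in str.isidentifier() shows the same split between name
characters and everything else.  The wrapper name is_identifier is
only a hypothetical helper for this sketch, and the exact character
list differs from what any DNS rule would use:

```python
def is_identifier(text: str) -> bool:
    """Return True if text is a valid Python identifier.

    Python derives its identifier syntax from the Unicode identifier
    properties, so letters and digits from many scripts are accepted
    while symbols, box-drawing characters, and spaces are not.
    """
    return text.isidentifier()


if __name__ == "__main__":
    samples = ["host1", "na\u00efve", "a\u2500b", "\u263a", "two words"]
    for sample in samples:
        print(repr(sample), is_identifier(sample))
```

Here "a\u2500b" contains a box-drawing character and "\u263a" is a
smiley; both are rejected, as is the string with a space.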

Banning them also helps prevent the future incompatibilities
that would occur if some group of people improvised forms for a
few characters using symbols and symbol or other combining
marks, and the characters themselves were then included in a
future version of Unicode.

That leaves a choice.  One can pick out a list of characters and
say "these are reserved for special bits of syntax" and permit
everything else in names/identifiers.  That may be plausible for
a closed 128 or 256 character repertoire, but it gets a little
complicated for Unicode.  Doing that often causes one to end up
with the "sometimes these are delimiters and sometimes they are
not" situation that helps make it impossible to build a general
URI parser that always works.  At the other extreme, one can
say "we permit letters and digits in identifiers, perhaps with a
few exceptions, and reserve everything else for other purposes".
That is, ultimately, the programming language approach, the host
name approach, the Unicode identifier approach, and so on (for a
rather long list).  
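
A minimal sketch of that second choice, under stated assumptions:
the set of Unicode general categories treated as "letters and
digits" and the one-character exception list below are illustrative
guesses, not the actual IDNA200X rules, and label_is_permitted is a
hypothetical name.

```python
import unicodedata

# Assumption: nonspacing and spacing combining marks plus decimal
# digits are allowed alongside every letter category (L*).
ALLOWED_NON_LETTER_CATEGORIES = {"Mn", "Mc", "Nd"}

# Assumption: a tiny exception list; the hyphen is permitted even
# though it is punctuation rather than a letter or digit.
EXCEPTIONS = {"-"}


def label_is_permitted(label: str) -> bool:
    """Permit letters, marks, and digits; reserve everything else."""
    if not label:
        return False
    for ch in label:
        if ch in EXCEPTIONS:
            continue
        category = unicodedata.category(ch)
        if category.startswith("L"):
            continue
        if category in ALLOWED_NON_LETTER_CATEGORIES:
            continue
        # Symbols, punctuation, spaces, box-drawing characters, and
        # so on fall through to here and are rejected.
        return False
    return True


if __name__ == "__main__":
    for label in ["example", "b\u00fccher", "x-y", "a\u2500b", "\u263a", "a.b"]:
        print(repr(label), label_is_permitted(label))
```

Nothing has to be enumerated as a delimiter in advance: the
box-drawing character, the smiley, and the dot are all rejected
simply because they are not letters, marks, or digits.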

We are proposing taking that approach with IDNA200X as well.
The fact that it helps with confusability problems, with
phishing, with database indexing and label description problems
(partially a collation issue and partially one that is often
known around the DNS as the "whois" issue), and with others
reinforces the impression that it is a good idea.

regards,
   john