The lookalike problem(s)

Mon Nov 27 13:11:13 CET 2006

Use of the term "language" in connection with IDNs contributes to the
expectations that many people have of the IDN capability that may prove
possible to implement. We need to remind ourselves and others that IDNs are
not really language in the richest sense. I think we may not have explored
how to express "safe" and "safe enough" in specific terms for the characters
to be included in IDNs. We may need to balance safety against linguistic
expectation or demand (0/O and 1/I in the case of Latin characters might be
examples of risk that has been implicitly accepted).

Vinton G Cerf
Chief Internet Evangelist
Google

-----Original Message-----
From: Michael Everson [mailto:everson at evertype.com] 
Sent: Monday, November 27, 2006 4:48 AM
To: Vint Cerf; idna-update at alvestrand.no
Subject: RE: The lookalike problem(s)

Vint,

>Please stop for a moment and think about the problem the engineers have.
>They are trying to determine whether a relatively simply-described 
>algorithm would produce a suitable subset of the UNICODEs for use in 
>IDNs. This is simply an exercise.

Ah! That was by no means clear.

>If it doesn't work, for a variety of reasons, we will be back to 
>considering every character, one at a time, still trying to group them 
>so as to determine which subsets can be used freely within a given 
>label in a domain name.

My mistake was in thinking that we were already at a place where we were
doing that. It simply wasn't clear that this was only an exercise. Perhaps I
missed a particular e-mail, or this was discussed at one of the two meetings
in Stockholm which I did not attend.

>So far, the exercise seems to me to point in that direction, but this 
>was worth trying.

Yes. The way I work (in analysing and encoding scripts for addition) is hard
to describe, but with experience lots of options are easily ruled out at the
beginning. Sometimes it confuses me when (for instance) the UTC asks for
explanations of the kind "Why didn't you choose Option S?" when that option
is (to me) obviously suboptimal or worse.

It seems to me "obvious" that an algorithmic approach to IDN (without the
use of tables) simply wouldn't work. To explore that option, however, will
be worthwhile if it helps people to understand why. My surprise here has
been that I thought this was understood long ago.

>Moreover, it is vital that you appreciate the difference between the 
>set of expressions that it is reasonable to support for IDNs and the 
>production of general language.

I do understand this, completely and entirely.

>It is NOT the same thing. In fact, it is absolutely clear that we 
>cannot support general language in IDNs, for many of the character sets 
>under consideration.

What is not clear is whether on the IETF side you (any or all) understand
what minimum "general language support" is for a given script; whether you
understand that some scripts may be minimized more easily than others; and
whether you understand how character properties which are declared universal
may differ from script to script.

>The problem of confusables contributes significantly to this 
>limitation. If you continue to view IDN space as a space for general 
>discourse,

... but I don't and I never have done.

>you will come to completely unsuitable conclusions about the pragmatic 
>solution for choice of characters to permit in IDNs.

What I've seen here recently is:

1) a suggestion that character selection might be table-based for Latin
2) a suggestion that combining characters (the U+03xx block) might not be
included

The first one is completely arbitrary and has the effect of making several
official Latin-script languages ineligible for IDN. It also has the effect
of including letters which are not used in contexts other than
transliteration or transcription.

The second excludes at least dozens if not hundreds of languages from IDN.
It is astonishing to me that it is being considered, because the alternative
would be to rescind the normalization stability agreement and add a whole
lot of pre-composed characters to the UCS.
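
To make the second point concrete, here is a minimal sketch, assuming
Python's standard unicodedata module. The function name is invented for this
illustration, and the two sample letters are my own choice: e with acute has
a pre-composed form in the UCS, while open e with tilde does not, so a rule
that forbids combining marks would admit the first and reject the second.

```python
import unicodedata

def survives_without_combining_marks(text: str) -> bool:
    """Return True if the NFC form of text contains no combining marks.

    Hypothetical helper for illustration only: it models a rule that
    rejects any label still containing a mark (general category M*)
    after normalization to NFC.
    """
    nfc = unicodedata.normalize("NFC", text)
    return not any(unicodedata.category(ch).startswith("M") for ch in nfc)

# e + COMBINING ACUTE ACCENT composes to the single character U+00E9.
print(survives_without_combining_marks("e\u0301"))

# LATIN SMALL LETTER OPEN E + COMBINING TILDE has no pre-composed form,
# so the combining mark remains after NFC.
print(survives_without_combining_marks("\u025B\u0303"))
```

Because of the normalization stability agreement, no pre-composed character
added later would change the result for the second string under NFC, which is
why dropping the combining characters shuts such letters out altogether.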

I understand that IDN must be simple to work. I also understand that it must
be safe. I further understand that there are commercial concerns. But if
(regarding the last) <greekcompanyname>-USA is banned on security grounds,
then that's just too bad for the company, which will just have to think up
another IDN.
--
Michael Everson * http://www.evertype.com