The lookalike problem(s)

Michael Everson everson at evertype.com
Mon Nov 27 10:47:58 CET 2006


Vint,

>Please stop for a moment and think about the problem the engineers have.
>They are trying to determine whether a relatively simply-described algorithm
>would produce a suitable subset of the UNICODEs for use in IDNs. This is
>simply an exercise.

Ah! That was by no means clear.

>If it doesn't work, for a variety of reasons, we will be back to 
>considering every character, one at a time, still trying to group 
them so as to determine which subsets can be used freely within a 
>given label in a domain name.

My mistake was in thinking that we were already at a place where we 
were doing that. It simply wasn't clear that this was only an 
exercise. Perhaps I missed a particular e-mail, or this was discussed 
at one of the two meetings in Stockholm which I did not attend.

>So far, the exercise seems to me to point in that direction, but 
>this was worth trying.

Yes. The way I work (in analysing and encoding scripts for addition) 
is hard to describe, but with experience lots of options are easily 
ruled out at the beginning. Sometimes it confuses me when (for 
instance) the UTC asks for explanations of the kind "Why didn't you 
choose Option S?" when that option is (to me) obviously suboptimal or 
worse.

It seems to me "obvious" that an algorithmic approach to IDN 
(without the use of tables) simply wouldn't work. To explore that 
option, however, will be worthwhile if it helps people to understand 
why. My surprise here has been that I thought this was understood 
long ago.

>Moreover, it is vital that you appreciate the difference between the set of
>expressions that it is reasonable to support for IDNs and the production of
>general language.

I do understand this, completely and entirely.

>It is NOT the same thing. In fact, it is absolutely clear that we 
>cannot support general language in IDNs, for many of the character 
>sets under consideration.

What is not clear is whether on the IETF side you (any or all) 
understand what minimum "general language support" is for a given 
script; whether you understand that some scripts may be minimized 
more easily than others; how character properties which are declared 
universal may differ from script to script.

>The problem of confusables contributes significantly to this 
>limitation. If you continue to view IDN space as a space for general 
>discourse,

... but I don't and I never have done.

>you will come to completely unsuitable conclusions about the 
>pragmatic solution for choice of characters to permit in IDNs.

What I've seen here recently is:

1) a suggestion that character selection might be table-based for Latin
2) a suggestion that combining characters (the U+03xx block) might 
not be included

The first one is completely arbitrary and has the effect of making 
several official Latin-script languages ineligible for IDN. It also 
has the effect of including letters which are not used in contexts 
other than transliteration or transcription.

The second excludes at least dozens if not hundreds of languages from 
IDN. It is astonishing to me that it is being considered, because the 
alternative would be to rescind the normalization stability agreement 
and add a whole lot of pre-composed characters to the UCS.
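
A minimal sketch of the second point, assuming Python's standard 
unicodedata module. The helper name needs_combining_mark and the 
sample strings are illustrative only and do not come from the 
discussion. It checks whether a string still contains a combining 
mark after NFC composition, that is, whether it has no fully 
pre-composed form in the UCS.

```python
import unicodedata


def needs_combining_mark(text: str) -> bool:
    """Return True if text still contains a combining mark after NFC.

    Hypothetical helper for illustration: NFC composes base letters and
    marks wherever a pre-composed character exists, so any mark left
    over has no pre-composed alternative.
    """
    composed = unicodedata.normalize("NFC", text)
    # combining() returns the canonical combining class; non-zero means
    # the code point is a combining mark.
    return any(unicodedata.combining(ch) != 0 for ch in composed)


# e + COMBINING ACUTE ACCENT composes to the single code point U+00E9.
print(needs_combining_mark("e\u0301"))

# e + COMBINING DOT BELOW + COMBINING ACUTE ACCENT (as used in Yoruba)
# composes only as far as U+1EB9 followed by U+0301.
print(needs_combining_mark("e\u0323\u0301"))
```

Under these assumptions, a rule that disallows the combining marks 
would still permit the first string in its composed form, but could 
not represent the second at all.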

I understand that IDN must be simple to work. I also understand that 
it must be safe. I further understand that there are commercial 
concerns. But if (regarding the last of these) <greekcompanyname>-USA 
is banned on security grounds, then that's just too bad for the 
company, which will just have to think up another IDN.
-- 
Michael Everson * http://www.evertype.com

