The lookalike problem(s)
Michael Everson
everson at evertype.com
Mon Nov 27 10:47:58 CET 2006
Vint,
>Please stop for a moment and think about the problem the engineers have.
>They are trying to determine whether a relatively simply-described algorithm
>would produce a suitable subset of the UNICODEs for use in IDNs. This is
>simply an exercise.
Ah! That was by no means clear.
>If it doesn't work, for a variety of reasons, we will be back to
>considering every character, one at a time, still trying to group
>them so as to determine which subsets can be used freely within a
>given label in a domain name.
My mistake was in thinking that we were already at a place where we
were doing that. It simply wasn't clear that this was only an
exercise. Perhaps I missed a particular e-mail, or this was discussed
at one of the two meetings in Stockholm which I did not attend.
>So far, the exercise seems to me to point in that direction, but
>this was worth trying.
Yes. The way I work (in analysing and encoding scripts for addition)
is hard to describe, but with experience lots of options are easily
ruled out at the beginning. Sometimes it confuses me when (for
instance) the UTC asks for explanations of the kind "Why didn't you
choose Option S?" when that option is (to me) obviously suboptimal or
worse.
It seems to me "obvious" that an algorithmic approach to IDN
(without the use of tables) simply wouldn't work. To explore that
option, however, will be worthwhile if it helps people to understand
why. My surprise here has been that I thought this was understood
long ago.
>Moreover, it is vital that you appreciate the difference between the set of
>expressions that it is reasonable to support for IDNs and the production of
>general language.
I do understand this, completely and entirely.
>It is NOT the same thing. In fact, it is absolutely clear that we
>cannot support general language in IDNs, for many of the character
>sets under consideration.
What is not clear is whether on the IETF side you (any or all)
understand what minimum "general language support" is for a given
script; whether you understand that some scripts may be minimized
more easily than others; and whether you understand how character
properties which are declared universal may differ from script to
script.
>The problem of confusables contributes significantly to this
>limitation. If you continue to view IDN space as a space for general
>discourse,
... but I don't and I never have done.
>you will come to completely unsuitable conclusions about the
>pragmatic solution for choice of characters to permit in IDNs.
What I've seen here recently is:
1) a suggestion that character selection might be table-based for Latin
2) a suggestion that combining characters (the U+03xx block) might
not be included
The first one is completely arbitrary and has the effect of making
several official Latin-script languages ineligible for IDN. It also
has the effect of including letters which are not used in contexts
other than transliteration or transcription.
The second excludes at least dozens if not hundreds of languages from
IDN. It is astonishing to me that it is being considered, because the
alternative would be to rescind the normalization stability agreement
and add a whole lot of pre-composed characters to the UCS.
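To make the combining-character point concrete, here is a minimal
sketch in Python using the standard unicodedata module. The helper
name needs_combining_mark and the choice of open e with acute as the
sample are assumptions made for illustration, not something taken from
the proposals under discussion. It checks whether a string still
contains a character from U+0300..U+036F after NFC normalization,
which is the case whenever no pre-composed form exists in the UCS.

```python
import unicodedata


def needs_combining_mark(text: str) -> bool:
    """Return True if text, after NFC normalization, still contains a
    character in the Combining Diacritical Marks range U+0300..U+036F.

    Hypothetical helper for illustration only.
    """
    composed = unicodedata.normalize("NFC", text)
    return any(0x0300 <= ord(ch) <= 0x036F for ch in composed)


# "a" + COMBINING ACUTE ACCENT composes to the single code point U+00E1,
# so excluding combining marks would not affect it.
print(needs_combining_mark("a\u0301"))

# LATIN SMALL LETTER OPEN E (U+025B) + COMBINING ACUTE ACCENT has no
# pre-composed form, so the combining mark survives normalization.
print(needs_combining_mark("\u025b\u0301"))
```

Any letter-plus-mark sequence of the second kind would become
unrepresentable in a label if the combining marks were excluded.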
I understand that IDN must be simple to work. I also understand that
it must be safe. I further understand that there are commercial
concerns. But if (regarding the last of these) <greekcompanyname>-USA
is banned on security grounds, then that's just too bad for the
company, which will just have to think up another IDN.
--
Michael Everson * http://www.evertype.com