an interesting ICANN development on similar domain names

John C Klensin klensin at jck.com
Sun Aug 10 02:13:21 CEST 2008



--On Sunday, 10 August, 2008 07:49 +0800 Yao Jiankang
<yaojk at cnnic.cn> wrote:

> If SWORD's verbal search algorithms (or any other algorithms)
> can be used to build a database of similarity word sets, that
> seems to be fine.  What I mean is this:
> 
> For every possible domain label word or TLD word, we can
> classify it into a set of similar words; finally, we can
> build a database including all possible similarity word
> sets.
> So there will be many similarity word sets, for example:
>   similarity word A set (every word in this set is similar to word A)
>   similarity word B set (every word in this set is similar to word B)
>   similarity word C set (every word in this set is similar to word C)
>   similarity word D set (every word in this set is similar to word D)
>   ...
> 
> When a new word X is encountered by SWORD's verbal search
> algorithms, the algorithm can decide whether word X can be
> classified into one of the current similarity word sets.  If
> yes, we add word X to that set; if not, we create a new
> similarity word set.  As this process is repeated, the
> similarity word sets will grow, and the database including
> all the similarity word sets will grow with them.
> 
> This database may help us to decide whether new gTLD strings
> would cause user confusion with existing TLDs.  It can also
> help the registry, registrant, or registrar to register IDNs.
> Of course, that kind of database is not easy to build.
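
A minimal sketch of the incremental classification described in the
quoted text, for illustration only.  SWORD's actual algorithms are not
reproduced here: the standard-library difflib ratio stands in as a
hypothetical similarity test, and the names is_similar and classify,
the 0.8 threshold, and the choice of comparing against the first member
of each set are all assumptions.

```python
from difflib import SequenceMatcher


def is_similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Stand-in similarity test (assumption, not SWORD's algorithm)."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def classify(word: str, similarity_sets: list[list[str]],
             threshold: float = 0.8) -> int:
    """Place `word` in the first set whose representative it resembles.

    The first member of each set is treated as its representative.
    If no set matches, a new set is created.  Returns the set index.
    """
    for index, members in enumerate(similarity_sets):
        if is_similar(word, members[0], threshold):
            if word not in members:
                members.append(word)
            return index
    similarity_sets.append([word])
    return len(similarity_sets) - 1


if __name__ == "__main__":
    sets: list[list[str]] = []
    for label in ("example", "examp1e", "museum"):
        print(label, "-> set", classify(label, sets))
```

Each new word is compared against every existing set, so the cost of
building the database grows with the number of sets as well as the
number of words.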

Sure.  Of course, the matrix of character similarities/distances
would be on the order of 10**12 cells if one were dealing with
all of Unicode.  Given the much more restricted inclusion model
of IDNA2008, that might be only 10**8 or thereabouts.  But, of
course, if one is worried about label similarity, and sounds,
one would need all of the possible combinations.  Unless I've
gotten my arithmetic wrong, if the character matrix were order
10**8, the matrix for two-character labels would be order 10**16,
for three-character ones, an array with 10**24 cells, and so
forth.
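
The arithmetic above can be checked with a short sketch.  It assumes a
full pairwise matrix over every possible label of a given length, and a
repertoire of roughly 10**4 characters (the order of magnitude implied
by a 10**8-cell character matrix); the function name is illustrative.

```python
def similarity_matrix_cells(repertoire_size: int, label_length: int) -> int:
    """Cells in a full pairwise matrix over all labels of one length."""
    labels = repertoire_size ** label_length
    return labels * labels


# Assumption: about 10**4 permitted characters, so that the
# single-character matrix has about 10**8 cells.
REPERTOIRE = 10**4

for length in (1, 2, 3):
    cells = similarity_matrix_cells(REPERTOIRE, length)
    # cells is an exact power of ten here, so digit count gives the exponent
    print(f"{length}-character labels: 10**{len(str(cells)) - 1} cells")
```

In general the count is repertoire_size ** (2 * label_length), so each
additional character multiplies the matrix size by the square of the
repertoire.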

And that ignores the "not easy" part and just considers the sheer
size of the relationships to be figured out and calculated.

Right.

I'm _really_ glad this isn't our problem.

    john




