an interesting ICANN development on similar domain names

Sun Aug 10 05:39:59 CEST 2008

Just FYI, and with the disclaimer that I am on a BlackBerry and have not been able to read all the comments on this topic in full detail:

1) The SWORD algorithm will be used on strings made up of characters, or combinations of characters, that are valid per the IDNA protocol and the IDN guidelines.

2) That said, the algorithm will be applied to selected scripts to begin with and will continue to be built out.

3) The algorithm will not be the sole determinant of whether two strings are confusingly similar, but it will help and give an indication.

Those are the most important points for now. Please let me know if you have additional questions.

Tina

----- Original Message -----
From: idna-update-bounces at alvestrand.no <idna-update-bounces at alvestrand.no>
To: Yao Jiankang <yaojk at cnnic.cn>; idna-update at alvestrand.no <idna-update at alvestrand.no>; Vint Cerf <vint at google.com>
Sent: Sat Aug 09 17:13:21 2008
Subject: Re: an interesting ICANN development on similar domain names

--On Sunday, 10 August, 2008 07:49 +0800 Yao Jiankang
<yaojk at cnnic.cn> wrote:

> If SWORD's verbal search algorithms (or any other algorithms)
> can be used to build a database of similarity word sets, that
> seems to be fine.  What I mean is this:
>
> For every possible domain label word or TLD word, we can
> classify it into a set of similar words; eventually we
> can build a database including all possible similarity word
> sets.
> So there will be many similarity word sets,
> for example,
>  similarity word A set (every word in this set is similar to word A)
>  similarity word B set (every word in this set is similar to word B)
>  similarity word C set (every word in this set is similar to word C)
>  similarity word D set (every word in this set is similar to word D)
>  ...
> ...
>
> When a new word X is encountered by SWORD's verbal search
> algorithms, the algorithm can decide whether word X belongs
> in one of the current similarity word sets. If yes, we
> add word X to that similarity word set; if not, we
> create a new similarity word set. As this process is
> repeated, the similarity word sets will become larger and the
> database including all the similarity word sets will become
> larger.
>
> This database may help us to decide whether new gTLD strings
> would cause user confusion with existing TLDs. It can also help
> the registry, registrant, or registrar to register IDNs. Of
> course, that kind of database is not easy to build.
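
For readers following the thread, the incremental classification described in the quoted message can be sketched roughly as below. This is only an illustration under loud assumptions: `is_similar` is a stand-in built on Python's `difflib` and is not the SWORD algorithm, the 0.8 threshold is arbitrary, and the names `add_word` and `similarity_sets` are hypothetical.

```python
from difflib import SequenceMatcher


def is_similar(a: str, b: str, threshold: float = 0.8) -> bool:
    # Stand-in comparison only; NOT the SWORD algorithm.
    # The 0.8 threshold is an arbitrary assumption for illustration.
    return SequenceMatcher(None, a, b).ratio() >= threshold


def add_word(similarity_sets: dict[str, set[str]], word: str) -> str:
    """Put `word` into the first set whose representative it resembles,
    or start a new set with `word` as its representative.
    Returns the representative of the set the word ended up in."""
    for representative, members in similarity_sets.items():
        if is_similar(word, representative):
            members.add(word)
            return representative
    similarity_sets[word] = {word}
    return word


# Hypothetical labels, processed one at a time as they are "encountered".
similarity_sets: dict[str, set[str]] = {}
for label in ["example", "examp1e", "museum"]:
    add_word(similarity_sets, label)
```

Each new word is compared only against the representative of each existing set, so the work per word grows with the number of sets rather than with the number of words seen so far. A real system would also have to decide what happens when a word resembles more than one representative.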

Sure.  Of course, the matrix of character similarities/distances
would be on the order of 10**12 cells if one were dealing with
all of Unicode.  Given the much more restricted inclusion model
of IDNA2008, that might be only 10**8 or thereabouts.  But, of
course, if one is worried about label similarity, and sounds,
one would need all of the possible combinations.  Unless I've
gotten my arithmetic wrong, if the character matrix were of order
10**8, the matrix for two-character labels would be of order 10**16,
for three-character ones, an array with 10**24 cells, and so
forth.

And that ignores the "not easy" part and just considers sheer
size of the relationships to be figured out and calculated.
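
The size estimate above can be checked with a few lines of arithmetic. The sketch below assumes a round repertoire of 10**4 permitted characters (which is what yields a 10**8-cell character matrix) and a full pairwise matrix over every possible label of a given length; the function name is hypothetical.

```python
def similarity_matrix_cells(repertoire_size: int, label_length: int) -> int:
    """Cells in a full pairwise similarity matrix over all labels
    of `label_length` characters drawn from `repertoire_size` characters."""
    labels = repertoire_size ** label_length
    return labels * labels


# Assumed round figure: 10**4 characters gives a 10**8-cell character matrix.
REPERTOIRE = 10**4

for n in (1, 2, 3):
    cells = similarity_matrix_cells(REPERTOIRE, n)
    # For an exact power of ten, the exponent is the digit count minus one.
    print(f"{n}-character labels: 10**{len(str(cells)) - 1} cells")
```

Each additional character multiplies the number of labels by the repertoire size and therefore the number of matrix cells by its square, which is where the steps of 10**8 come from.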

Right.

I'm _really_ glad this isn't our problem.

    john

_______________________________________________
Idna-update mailing list
Idna-update at alvestrand.no
http://www.alvestrand.no/mailman/listinfo/idna-update