Data on confusables

Mark Davis ⌛ mark at macchiato.com
Mon Jul 27 22:26:18 CEST 2009


> What percentage of domain names contain at least one character which is
> confusable with another character permitted by IDNA2003, but no characters
> which are confusable with characters permitted by IDNA2008?

I don't have a count of domain names. The figures I gave do part of what you
are asking for:

A. characters allowed by IDNA2008 that are confusable with *at least one*
other character allowed by IDNA2008

B.  characters allowed by IDNA200*3* that are confusable with *at least* one
other character allowed by IDNA200*3* (*and* not in A)

And then versions of A and B weighted by frequency, in two different ways.
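
To make the definitions of A and B concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the set membership, the confusable pairs, and the frequencies are toy values rather than the real IDNA tables or the confusables data, and the weighting shown is only one possible scheme (which two were actually used is not stated here).

```python
# Toy data only: membership, pairs, and frequencies are invented for illustration.

def build_partners(pairs):
    """Map each character to the set of characters it is confusable with."""
    partners = {}
    for a, b in pairs:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    return partners

def confusable_within(allowed, partners):
    """Characters in `allowed` confusable with at least one *other* character in `allowed`."""
    return {c for c in allowed if (partners.get(c, set()) - {c}) & allowed}

# Hypothetical repertoires: Latin a/o/z, Cyrillic a (U+0430), Greek omicron (U+03BF),
# plus two symbols (U+2223, U+2758) assumed to be allowed only under IDNA2003.
idna2008 = {"a", "\u0430", "o", "\u03bf", "z"}
idna2003 = idna2008 | {"\u2223", "\u2758"}
pairs = [("a", "\u0430"), ("o", "\u03bf"), ("\u2223", "\u2758")]
partners = build_partners(pairs)

set_a = confusable_within(idna2008, partners)          # group A
set_b = confusable_within(idna2003, partners) - set_a  # group B (and not in A)

# Hypothetical relative frequencies of each character.
freq = {"a": 50, "o": 40, "z": 5, "\u0430": 3, "\u03bf": 1, "\u2223": 0.5, "\u2758": 0.5}

def weighted_share(chars, universe, freq):
    """Fraction of total frequency in `universe` carried by `chars`."""
    return sum(freq[c] for c in chars) / sum(freq[c] for c in universe)

print(sorted(set_a), sorted(set_b))
print(weighted_share(set_a, idna2008, freq), weighted_share(set_b, idna2003, freq))
```

In this toy case B contains only the two symbols, so its frequency-weighted share is small even though it is a nonempty set of characters.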

I just computed the answer to your further question, which is "What is the percentage of
IDNA2008 PVALID characters which are confusable with a PVALID character in
IDNA2003?". That is:

C.  characters allowed by IDNA2008 that are confusable with *at least one*
other character allowed by IDNA200*3* (*and* not in A)

I'm showing no additional characters in that group; that is, any PVALID2008
character with a confusable in PVALID2003 also has a confusable in
PVALID2008. (The number of other characters that each could be confused with
does grow, but that doesn't change whether or not they can be spoofed.)
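
As a rough illustration of why C can come out empty while the number of confusable partners still grows, here is a small Python sketch. The repertoires and pairs are invented toy values (including the assumption that U+237A is allowed only under IDNA2003); they are not the real data behind the figures.

```python
# Toy data only: membership and pairs are invented for illustration.

def build_partners(pairs):
    """Map each character to the set of characters it is confusable with."""
    partners = {}
    for a, b in pairs:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    return partners

# Hypothetical repertoires: Latin a and z, Cyrillic a (U+0430),
# plus a symbol (U+237A) assumed to be allowed only under IDNA2003.
idna2008 = {"a", "\u0430", "z"}
idna2003 = idna2008 | {"\u237a"}
partners = build_partners([("a", "\u0430"), ("a", "\u237a")])

def has_partner_in(c, allowed):
    """True if `c` is confusable with at least one other character in `allowed`."""
    return bool((partners.get(c, set()) - {c}) & allowed)

# Group A: IDNA2008 characters with a confusable inside IDNA2008.
set_a = {c for c in idna2008 if has_partner_in(c, idna2008)}
# Group C: IDNA2008 characters with a confusable inside IDNA2003, and not in A.
set_c = {c for c in idna2008 if has_partner_in(c, idna2003)} - set_a

# The number of partners for "a" grows when the larger repertoire is considered,
# but "a" was already in A, so it adds nothing to C.
partners_in_2008 = len(partners["a"] & idna2008)
partners_in_2003 = len(partners["a"] & idna2003)

print(sorted(set_a), sorted(set_c), partners_in_2008, partners_in_2003)
```

Here "a" gains a second partner under the larger repertoire, yet C stays empty because every character that has an IDNA2003 partner already has an IDNA2008 one.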

Now, the focus in building the confusables data has been on characters that
can be used to spoof modern, most-frequently-used scripts; the figures might
change somewhat if the data is extended to other scripts. For example, that
might add Runic characters that could be spoofed by symbols or punctuation.
But even then, the frequency-weighted figures wouldn't change significantly.

Does that help?

Mark


On Mon, Jul 27, 2009 at 12:07, Gervase Markham <gerv at mozilla.org> wrote:

> On 27/07/09 07:54, Mark Davis ⌛ wrote:
>
>> Now, as with any statistics, the data is only an approximation.
>>
>
> It seems to me that the appropriate question to ask when judging impact is:
>
> What percentage of domain names contain at least one character which is
> confusable with another character permitted by IDNA2003, but no characters
> which are confusable with characters permitted by IDNA2008?
>
> In other words, how many domain names move from the "possibly spoofable"
> category into the "not spoofable" category?
>
> You say that in IDNA2008, 4.17% of PVALID characters have a different
> IDNA2008 PVALID character they are confusable with. What is the percentage
> of IDNA2008 PVALID characters which are confusable with a PVALID character
> in IDNA2003? (Yes, I have asked that question exactly as I meant it.)
>
> Gerv
>


More information about the Idna-update mailing list