Data on confusables

Mon Jul 27 09:12:36 CEST 2009

Mark,

when you say "Unicode confusability data", you're referring to 
http://unicode.org/reports/tr39/data/confusables.txt - right?

I think this just serves to confirm what's been said here many times - 
that confusability is not one of the problems that can be solved by the 
kind of language-independent slicing and dicing that IDNA2008 is doing; 
it basically has to be a registry-level function.

                 Harald

Mark Davis ⌛ wrote:
> I ran some tests today to come up with the impact of IDNA2008 on 
> confusability. These were generated using Unicode confusability data 
> and Google data on character frequency on the web, and within domain 
> names on the web. Note that the "web" is not just html; it includes 
> PDFs, email archives, etc.
>
> Here's the data and a key following.
>                                       Raw%    WeightedWeb% WeightedIdn%
> pValidHasPValidConfusable:            4.17%     94.53%      99.9974%
> cValidHasCValidConfusable:           +0.08%     +0.05%     +1.57E-05%
> pValid2003HasPValid2003Confusable:   +1.36%     +0.20%     +1.08E-05%
> (100% = all PVALID non-Han characters)
>
> /(Set your mailer to a monospaced font to get the columns to line up.)/
>
> *Key*
> It's running a bit late here, so I'm probably not going to explain 
> these figures as clearly as I could, but here's what they mean.
>
> *Rows.* Each row is a comparison of one type of character to those of 
> the same type:
>
>     * comparing PVALID to other PVALID
>     * comparing CVALID to other CVALID (CVALID = PVALID + CONTEXT)
>     * comparing VALID-in-IDNA2003 to other VALID-in-IDNA2003
>
> The percentages are relative to the number of PVALID non-Han 
> characters. (Han is excluded because we don't have good confusability 
> data for it). Each row past the first is an addition; that is, the 
> total RAW value for CVALID is 4.17% + 0.08%; the 0.08% represents the 
> additional confusable characters added by having at least one of the 
> two compared characters be a CONTEXT character.
>
> *Columns.* Each column shows a type of data.
>
>     * The Raw column shows a character count. That is, 4.17% of PVALID
>       characters have a different PVALID character that they are
>       confusable with.
>     * The WeightedWeb column weights those counts by web character
>       frequency.
>     * The WeightedIdn column weights the character counts by character
>       frequency within domain names found on the web.
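
A minimal sketch of how the Raw and weighted columns described above could be computed. The function name, the toy confusable map and the frequency numbers are all hypothetical; none of them come from the Unicode confusables data or the Google frequency data, and the weighting rule (sum the frequency of every character that has a confusable, divide by the total frequency) is an assumption about the method, not a description of it.

```python
def confusable_share(chars, confusables, weight=None):
    """Percentage of `chars` that have a different confusable character in `chars`.

    chars:       set of characters under comparison (e.g. all PVALID non-Han)
    confusables: dict mapping a character to the set of characters it resembles
    weight:      optional dict of character frequencies; None gives the raw count
    """
    if weight is None:
        # Raw column: every character counts exactly once.
        weight = {c: 1.0 for c in chars}
    total = sum(weight.get(c, 0.0) for c in chars)
    hit = sum(
        weight.get(c, 0.0)
        for c in chars
        if any(o != c and o in chars for o in confusables.get(c, ()))
    )
    return 100.0 * hit / total if total else 0.0


# Toy data, purely illustrative (hypothetical frequencies).
# Latin a, Cyrillic a, Latin o, Greek omicron, plus two unconfusable letters.
chars = {"a", "\u0430", "o", "\u03bf", "x", "q"}
confusables = {
    "a": {"\u0430"}, "\u0430": {"a"},
    "o": {"\u03bf"}, "\u03bf": {"o"},
}
web_freq = {"a": 50, "\u0430": 5, "o": 30, "\u03bf": 1, "x": 10, "q": 4}

raw = confusable_share(chars, confusables)                 # unweighted share
weighted = confusable_share(chars, confusables, web_freq)  # frequency-weighted share
```

With this toy input the weighted share comes out higher than the raw share, because the characters that have confusables are also the frequent ones. That is the same pattern as the 4.17% raw versus 94.53% web-weighted figures in the first row of the table.
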
>
> The confusable characters excluded by IDNA2008 account for 1.36% of 
> PVALID non-Han characters /by character count/, which is about 1/3 of 
> the PVALID confusables. That seems pretty good, except that the vast 
> majority of them are rare characters only confusable with other rare 
> characters. So weighted by usage, they are an extremely small 0.20% of 
> additional confusables (and this is counting the weight of the most 
> frequent character each is confusable with). In sum, IDNA2008 reduces 
> the number of confusables by only about 0.20 percentage points when 
> weighted by web frequencies.
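
The "about 1/3" figure can be checked against the table. This is a small sketch, assuming the ratio meant is the IDNA2003-only addition divided by the PVALID row; that reading is an assumption, and the variable names are made up for illustration.

```python
# Figures copied from the table in the quoted message.
pvalid_raw = 4.17     # Raw%: PVALID characters with a PVALID confusable
excluded_raw = 1.36   # Raw%: addition from characters valid only in IDNA2003
pvalid_web = 94.53    # WeightedWeb% for the PVALID row
excluded_web = 0.20   # WeightedWeb% addition from IDNA2003-only characters

# Assumed reading: the size of the addition relative to the PVALID row.
ratio_raw = excluded_raw / pvalid_raw   # roughly one third
ratio_web = excluded_web / pvalid_web   # a fraction of one percent

print(f"raw ratio: {ratio_raw:.3f}, web-weighted ratio: {ratio_web:.4f}")
```

By count the excluded characters are a sizeable fraction of the PVALID confusables, but once weighted by web frequency the same addition is tiny next to the PVALID row.
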
>
> Now, as with any statistics, the data is only an approximation.
>
>     * The Google data, for example, is from a sampling of just about
>       half a billion pages, and very low frequency characters can't be
>       reliably distinguished from noise.
>     * The amount of non-ASCII text in domain names also doesn't yet
>       compare to the proportions on the web as a whole, since IDN
>       usage is still growing. That's why I also include the web
>       statistics, since they are probably closer to the eventual
>       frequencies.
>     * It is difficult to tell what effect the CONTEXT characters would
>       have, because we can't tell what percentage of the characters
>       they are confused with would reasonably occur in the included
>       contexts, but it is clear that the restrictions on these
>       characters will have only a small effect.
>
> Yet the story is pretty clear: the change from IDNA2003 to IDNA2008, 
> and the CONTEXT rules (other than joiners), will have a trifling 
> positive impact on IDNA security. For those of us concerned with 
> security, I'd wager that it is swamped by the negative impact of the 
> security problems caused by the de facto indeterminacy introduced 
> for ς and ß labels.
>
> Mark