Data on confusables
Harald Alvestrand
harald at alvestrand.no
Mon Jul 27 09:12:36 CEST 2009
Mark,
when you say "Unicode confusability data", you're referring to
http://unicode.org/reports/tr39/data/confusables.txt - right?
I think this just serves to confirm what's been said here many times -
that confusability is not one of the problems that can be solved by the
kind of language-independent slicing and dicing that IDNA2008 is doing;
it basically has to be a registry-level function.
Harald
Mark Davis ⌛ wrote:
> I ran some tests today to come up with the impact of IDNA2008 on
> confusability. These were generated using Unicode confusability data
> and Google data on character frequency on the web, and within domain
> names on the web. Note that the "web" is not just html; it includes
> PDFs, email archives, etc.
>
> Here's the data and a key following.
>                                     Raw%     WeightedWeb%   WeightedIdn%
> pValidHasPValidConfusable:          4.17%    94.53%         99.9974%
> cValidHasCValidConfusable:          +0.08%   +0.05%         +1.57E-05%
> pValid2003HasPValid2003Confusable:  +1.36%   +0.20%         +1.08E-05%
> (100% = all PVALID non-Han characters)
>
> /(Set your mailer to a monospaced font to get the columns to line up.)/
>
> *Key*
> It's running a bit late here, so I'm probably not going to explain
> these figures as clearly as I could, but here's what they mean.
>
> *Rows.* Each row is a comparison of one type of character to those of
> the same type:
>
> * comparing PVALID to other PVALID
> * comparing CVALID to other CVALID (CVALID = PVALID + CONTEXT)
> * comparing VALID-in-IDNA2003 to other VALID-in-IDNA2003
>
> The percentages are relative to the number of PVALID non-Han
> characters. (Han is excluded because we don't have good confusability
> data for it). Each row past the first is an addition; that is, the
> total RAW value for CVALID is 4.17% + 0.08%; the 0.08% represents the
> additional confusable characters added by having at least one of the
> two compared characters be a CONTEXT character.
>
> *Columns.* Each column shows a type of data.
>
> * The Raw column shows a character count. That is, 4.17% of PVALID
> characters have a different PVALID character they are confusable with.
> * The WeightedWeb column weights those counts by web character
> frequency.
> * The WeightedIdn column weights the character counts by character
> frequency within domain names found on the web.
>
> The confusable characters excluded by IDNA2008 account for 1.36% of
> PVALID non-Han /-- by character count --/ which is about 1/3 of the
> PVALID confusables. That seems pretty good, except that the vast
> majority of them are rare characters only confusable with other rare
> characters. So weighted by usage, they are an extremely small 0.20% of
> additional confusables (and this is counting the weights of the most
> frequent character they are confusable with). In sum, IDNA2008 reduces
> the weighted number of confusables by only about 0.20%, weighted by
> web frequencies.
>
> Now, as with any statistics, the data is only an approximation.
>
> * The Google data, for example, is from a sampling of just about
> half a billion pages, and very low frequency characters can't be
> reliably distinguished from noise.
> * The amount of non-ASCII text in domain names also doesn't
> compare to the proportions on the web as a whole, since they are
> still growing. So that's why I also include the web statistics,
> since that is probably more like the eventual frequencies.
> * It is difficult to tell what effect the CONTEXT characters would
> have, because we can't tell what percentage of the characters
> they are confused with would reasonably occur in the included
> contexts, but it is clear that the restrictions on these
> characters will have only a small effect.
>
> Yet the story is pretty clear; the change from IDNA2003 to IDNA2008,
> and the CONTEXT rules (other than joiners) will have a trifling
> positive impact on IDNA security. For those of us concerned with
> security, I'd wager that it is swamped by the negative impact of the
> security problems introduced by the de-facto indeterminacy introduced
> for ς and ß labels.
>
> Mark
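
For readers who want to see how the Raw and weighted columns in the key relate to each other, here is a minimal sketch in Python. It is not the code used to produce the figures above. The function name, the toy character set, the confusable pair and the frequency counts are all hypothetical, and the weighting is the simple first-row kind (sum the frequency of every character that has a different confusable in the same set). It does not try to reproduce the "most frequent confusable" weighting mentioned for the additional rows.

```python
from typing import Dict, Iterable, Set, Tuple


def confusable_shares(
    universe: Iterable[str],
    pairs: Iterable[Tuple[str, str]],
    freq: Dict[str, float],
) -> Tuple[float, float]:
    """Return (raw, weighted) shares of characters in `universe` that have
    a *different* confusable character that is also in `universe`."""
    chars: Set[str] = set(universe)
    has_confusable: Set[str] = set()
    for a, b in pairs:
        # Only count pairs where both characters are in the set being compared
        if a != b and a in chars and b in chars:
            has_confusable.update((a, b))

    # Raw: plain character count, as in the Raw% column
    raw = len(has_confusable) / len(chars)

    # Weighted: the same characters, weighted by how often each one occurs
    total = sum(freq.get(c, 0.0) for c in chars)
    weighted = sum(freq.get(c, 0.0) for c in has_confusable) / total
    return raw, weighted


# Hypothetical toy data: four "valid" characters, one confusable pair
# (LATIN SMALL LETTER A and CYRILLIC SMALL LETTER A), made-up frequencies.
toy_universe = ["a", "\u0430", "b", "c"]
toy_pairs = [("a", "\u0430")]
toy_freq = {"a": 80.0, "\u0430": 10.0, "b": 6.0, "c": 4.0}

raw, weighted = confusable_shares(toy_universe, toy_pairs, toy_freq)

# The "about 1/3" remark in the quoted text: 1.36% relative to 4.17%
excluded_vs_pvalid = 1.36 / 4.17
```

With the toy data, half of the characters have a confusable but they carry 90% of the usage, which is the same shape as the gap between the Raw and WeightedWeb figures in the first row of the table.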