Data on confusables

Mon Jul 27 08:54:15 CEST 2009

I ran some tests today to come up with the impact of IDNA2008 on
confusability. These were generated using Unicode confusability data and
Google data on character frequency on the web, and within domain names on
the web. Note that the "web" is not just html; it includes PDFs, email
archives, etc.

Here's the data, followed by a key.
                                      Raw%    WeightedWeb% WeightedIdn%
pValidHasPValidConfusable:            4.17%     94.53%      99.9974%
cValidHasCValidConfusable:           +0.08%     +0.05%     +1.57E-05%
pValid2003HasPValid2003Confusable:   +1.36%     +0.20%     +1.08E-05%
(100% = all PVALID non-Han characters)

*(Set your mailer to a monospaced font to get the columns to line up.)*

*Key*
It's running a bit late here, so I'm probably not going to explain these
figures as clearly as I could, but here's what they mean.

*Rows.* Each row is a comparison of one type of character to those of the
same type:

   - comparing PVALID to other PVALID
   - comparing CVALID to other CVALID (CVALID = PVALID + CONTEXT)
   - comparing VALID-in-IDNA2003 to other VALID-in-IDNA2003

The percentages are relative to the number of PVALID non-Han characters
(Han is excluded because we don't have good confusability data for it). Each
row past the first is an addition; that is, the total Raw value for CVALID
is 4.17% + 0.08% = 4.25%, where the 0.08% represents the additional
confusable characters added by having at least one of the two compared
characters be a CONTEXT character.
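
To make the "addition" reading concrete, here is a small sketch that adds
the CVALID row to the PVALID row. The figures are copied from the table;
the variable names are only illustrative.

```python
# Percentages copied from the table (100% = all PVALID non-Han characters).
pvalid = {"raw": 4.17, "web": 94.53}
cvalid_added = {"raw": 0.08, "web": 0.05}  # the "+" values on the CVALID row

# The CVALID row is an addition on top of the PVALID row.
cvalid_total = {
    col: round(pvalid[col] + cvalid_added[col], 2) for col in pvalid
}

print(cvalid_total)  # {'raw': 4.25, 'web': 94.58}
```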

*Columns.* Each column shows a type of data.

   - The Raw column shows a character count. That is, 4.17% of PVALID
   characters have a different PVALID character that they are confusable
   with.
   - The WeightedWeb column weights those counts by web character frequency.
   - The WeightedIdn column weights the character counts by character
   frequency within domain names found on the web.
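
As a rough illustration of how a raw count and a frequency-weighted count
can differ so much, here is a minimal sketch on toy data. It is not the
script used for the figures above: the function name, the toy character
set, and the made-up frequencies are all assumptions for illustration, and
the weighting is the simplest one possible (each character counts by its
own frequency).

```python
def confusable_share(chars, confusables, freq=None):
    """Percent of `chars` having a *different* confusable also in `chars`.

    With freq=None every character counts once (the Raw column idea);
    otherwise each character counts by its frequency (the Weighted idea).
    """
    chars = set(chars)

    def weight(c):
        return 1.0 if freq is None else float(freq.get(c, 0))

    def has_confusable(c):
        return any(o != c and o in chars for o in confusables.get(c, ()))

    total = sum(weight(c) for c in chars)
    hit = sum(weight(c) for c in chars if has_confusable(c))
    return 100.0 * hit / total if total else 0.0


# Toy data: Latin "a" and Cyrillic "а" (U+0430) look alike; "b", "c" do not.
LATIN_A, CYRILLIC_A = "a", "\u0430"
toy_chars = [LATIN_A, CYRILLIC_A, "b", "c"]
toy_confusables = {LATIN_A: {CYRILLIC_A}, CYRILLIC_A: {LATIN_A}}
toy_freq = {LATIN_A: 90, CYRILLIC_A: 1, "b": 5, "c": 4}  # invented counts

raw_pct = confusable_share(toy_chars, toy_confusables)
web_pct = confusable_share(toy_chars, toy_confusables, toy_freq)
print(f"Raw% = {raw_pct:.2f}, Weighted% = {web_pct:.2f}")
```

In the toy data only half of the characters have a confusable, but one of
them is very frequent, so the weighted share is much higher than the raw
share. That is the same pattern as the 4.17% versus 94.53% in the first row.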

The confusable characters excluded by IDNA2008 account for 1.36% of PVALID
non-Han characters by character count, which is about 1/3 the size of the
PVALID confusables (4.17%). That seems pretty good, except that the vast
majority of them are rare characters only confusable with other rare
characters. Weighted by usage, they are an extremely small 0.20% of
additional confusables (and this is counting the weight of the most
frequent character they are confusable with). In sum, IDNA2008 reduces the
number of confusables by only about 0.20% when weighted by web frequencies.
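
The "about 1/3" figure, and how small the weighted figure is by comparison,
can be checked directly from the table values. This is just arithmetic on
the published numbers; the variable names are illustrative.

```python
# Values copied from the table (percent of PVALID non-Han characters).
pvalid_raw, pvalid_web = 4.17, 94.53
excluded_raw, excluded_web = 1.36, 0.20  # the IDNA2003-only row

# Size of the excluded confusables relative to the PVALID confusables.
raw_ratio = excluded_raw / pvalid_raw
web_ratio = excluded_web / pvalid_web

print(f"raw: {raw_ratio:.3f}")       # about one third
print(f"weighted: {web_ratio:.4f}")  # a fraction of one percent
```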

Now, as with any statistics, the data is only an approximation.

   - The Google data, for example, is from a sampling of just about half a
   billion pages, and very low frequency characters can't be reliably
   distinguished from noise.
   - The proportion of non-ASCII text in domain names also doesn't match
   the proportion on the web as a whole, since IDN usage is still growing.
   That's why I also include the web statistics, which are probably closer
   to the eventual frequencies.
   - It is difficult to tell what effect the CONTEXT characters would have,
   because we can't tell what percentage of the characters they are confused
   with would reasonably occur in the included contexts, but it is clear that
   the restrictions on these characters will have only a small effect.

Yet the story is pretty clear: the change from IDNA2003 to IDNA2008, and the
CONTEXT rules (other than joiners), will have a trifling positive impact on
IDNA security. For those of us concerned with security, I'd wager that it is
swamped by the negative impact of the security problems caused by the
de facto indeterminacy introduced for ς and ß labels.

Mark