For every Unicode character C, I do the following:<br><ul><li>apply the mapping (e.g. NFKC + CaseFolding + DefaultIgnorableRemoval) to C, getting result R, and if the following are all true, I increment a Remapped counter:</li><ol><li>
R != C, and</li><li>every character in R is a possible U-Label character (PVALID or CONTEXTJ or CONTEXTO).<br></li></ol><li>apply the IDNA2003 mapping to C getting R', and if the following are all true, I increment a Diverging counter:</li>
<ul><li>IDNA2003 succeeded (there is an R')</li><li>R' != R, and<br></li><li>every character in R' is a possible U-Label character </li></ul></ul>Mark<br>
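The procedure above can be sketched in Python roughly as follows. This is only an illustration, not the actual program: the names `count_mappings`, `nfkc_cf_rdi`, `idna2003_map`, and `is_label_char` are hypothetical, the default-ignorable set is a small stand-in for the real Unicode property, and the order in which the three mapping steps are combined is an assumption. The caller supplies the IDNA2003 mapping and the PVALID/CONTEXTJ/CONTEXTO test.

```python
import sys
import unicodedata

# Assumption: a tiny stand-in for the Default_Ignorable_Code_Point property,
# which Python's unicodedata module does not expose.
SAMPLE_DEFAULT_IGNORABLES = frozenset("\u00AD\u200B\u2060\uFEFF")


def nfkc_cf_rdi(ch):
    """One plausible NFKC + case folding + default-ignorable removal mapping."""
    folded = unicodedata.normalize("NFKC", ch).casefold()
    # Case folding can denormalize, so normalize once more.
    renormalized = unicodedata.normalize("NFKC", folded)
    return "".join(c for c in renormalized if c not in SAMPLE_DEFAULT_IGNORABLES)


def count_mappings(mapping, idna2003_map, is_label_char, code_points=None):
    """Return (remapped, diverging) counts over the given code points.

    mapping       -- callable str -> str giving R for a character C
    idna2003_map  -- hypothetical callable str -> str, or None on failure
    is_label_char -- hypothetical predicate: PVALID, CONTEXTJ or CONTEXTO
    """
    if code_points is None:
        code_points = range(sys.maxunicode + 1)
    remapped = diverging = 0
    for cp in code_points:
        c = chr(cp)
        r = mapping(c)
        # Remapped: R differs from C and consists only of label characters.
        # Note that an empty R passes the all() test vacuously.
        if r != c and all(is_label_char(x) for x in r):
            remapped += 1
        # Diverging: IDNA2003 succeeded, disagrees with R, and yields only
        # label characters.
        r2 = idna2003_map(c)
        if r2 is not None and r2 != r and all(is_label_char(x) for x in r2):
            diverging += 1
    return remapped, diverging
```

Called with toy stand-ins for the two supplied functions, it counts each single code point once per counter, which is why sequences such as base character plus combining mark are outside what these figures measure.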
<br><br><div class="gmail_quote">On Fri, Apr 3, 2009 at 05:44, Harald Alvestrand <span dir="ltr">&lt;<a href="mailto:harald@alvestrand.no">harald@alvestrand.no</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div class="im">Mark Davis wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
I modified the program to add a comparison to IDNA2003. I am only including cases where the mapping results in A-Label characters. The numbers within and across rows don't add up as you might expect because of various overlaps and because only mappings to A-Label characters are counted.<br>
<br>
Most of the differences between NFKC-CF-RDI and IDNA2003 are new Unicode 5.2 characters; there are 5 diverging mappings. (As I said before, these figures don't include the current list of special cases: eszett, final_sigma, joiners.)<br>
<br>
PValid or Context: 90262<br>
IDNA2003, Remapped: 4337<br>
NFKC-CF-RDI, Remapped: 5291, Diverging: 5<br>
NFKC-LC-RDI, Remapped: 5225, Diverging: 77<br>
NFKC-CF, Remapped: 4896, Diverging: 32<br>
NFKC-LC, Remapped: 4830, Diverging: 104<br>
NFC-CF-RDI, Remapped: 2486, Diverging: 2663<br>
NFC-LC-RDI, Remapped: 2395, Diverging: 2754<br>
NFC-CF, Remapped: 2091, Diverging: 2690<br>
NFC-LC, Remapped: 2000, Diverging: 2781<br>
<br>
Mark<br>
</blockquote></div>
Mark,<br>
<br>
these numbers confuse me a bit - what are you counting?<br>
<br>
Is this the result of applying (for instance) NFKC(LC(char)) for all the characters in Unicode, and counting how many got changed?<br>
<br>
A number of character sequences (the ones with combining marks being the most famous ones) are also changed by NFC or NFKC - is there a means of counting the impact of that?<br><font color="#888888">
<br>
Harald<br>
<br>
</font></blockquote></div><br>