For every Unicode character C, I do the following:<br><ul><li>Apply the mapping (e.g. NFKC + CaseFolding + DefaultIgnorableRemoval) to C, giving a result R. If both of the following are true, I increment the Remapped counter:<ol><li>R != C, and</li><li>every character in R is a possible U-Label character (PVALID, CONTEXTJ, or CONTEXTO).</li></ol></li>

<li>Apply the IDNA2003 mapping to C, giving R&#39;. If all of the following are true, I increment the Diverging counter:<ol><li>IDNA2003 succeeded (there is an R&#39;),</li><li>R&#39; != R, and</li><li>every character in R&#39; is a possible U-Label character.</li></ol></li></ul>Mark<br>
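<br>
The counting procedure above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the program that produced the figures quoted below: nfkc_cf approximates NFKC + CaseFolding and omits default-ignorable removal, idna2003_map relies on the standard library&#39;s Nameprep implementation (Unicode 3.2 data), and is_label_char is a hypothetical stand-in for the real PVALID / CONTEXTJ / CONTEXTO tables, so its counts will not match the real ones.<br>

```python
import unicodedata
from encodings.idna import nameprep  # stdlib Nameprep (RFC 3491), Unicode 3.2 data


def nfkc_cf(ch):
    """Approximate NFKC + case folding (assumption: no default-ignorable removal)."""
    folded = unicodedata.normalize("NFKC", ch).casefold()
    return unicodedata.normalize("NFKC", folded)


def idna2003_map(ch):
    """Return the Nameprep result R', or None when Nameprep rejects the input."""
    try:
        return nameprep(ch)
    except UnicodeError:
        return None


def is_label_char(ch):
    """Hypothetical stand-in for PVALID / CONTEXTJ / CONTEXTO membership."""
    return ch == "-" or (ch.isalnum() and ch == nfkc_cf(ch))


def count(code_points, mapping, old_mapping, valid):
    """Return (remapped, diverging) for the given code points."""
    remapped = diverging = 0
    for cp in code_points:
        c = chr(cp)
        r = mapping(c)
        # Remapped: R != C and every character in R is a possible label character.
        # Assumption: an empty R passes the "every character" test vacuously.
        if r != c and all(valid(x) for x in r):
            remapped += 1
        # Diverging: the old mapping succeeded, R' != R, and R' is all label characters.
        r_old = old_mapping(c)
        if r_old is not None and r_old != r and all(valid(x) for x in r_old):
            diverging += 1
    return remapped, diverging
```

A full run would be count(range(0x110000), nfkc_cf, idna2003_map, is_label_char). Passing the mappings and the validity test as parameters makes it easy to swap in the other variants (NFC or NFKC, LC or CF, with or without RDI) once real tables are available.<br>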

<br><br><div class="gmail_quote">On Fri, Apr 3, 2009 at 05:44, Harald Alvestrand <span dir="ltr">&lt;<a href="mailto:harald@alvestrand.no">harald@alvestrand.no</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="im">Mark Davis wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

I modified the program to add a comparison to IDNA2003. I am only including cases where the mapping results in A-Label characters. The numbers within and across rows don&#39;t add up as you might expect because of various overlaps and because only mappings to A-Label characters are counted.<br>


<br>

Most of the differences between NFKC-CF-RDI and IDNA2003 are new 5.2 characters; there are 5 diverging mappings. (As I said before, these figures don&#39;t include the current list of special cases: eszett, final_sigma, joiners.)<br>


<br>

PValid or Context: 90262<br>

IDNA2003,    Remapped:    4337<br>

NFKC-CF-RDI,    Remapped:    5291,    Diverging:    5<br>

NFKC-LC-RDI,    Remapped:    5225,    Diverging:    77<br>

NFKC-CF,    Remapped:    4896,    Diverging:    32<br>

NFKC-LC,    Remapped:    4830,    Diverging:    104<br>

NFC-CF-RDI,    Remapped:    2486,    Diverging:    2663<br>

NFC-LC-RDI,    Remapped:    2395,    Diverging:    2754<br>

NFC-CF,    Remapped:    2091,    Diverging:    2690<br>

NFC-LC,    Remapped:    2000,    Diverging:    2781<br>

<br>

Mark<br>

</blockquote></div>

Mark,<br>

<br>

these numbers confuse me a bit - what are you counting?<br>

<br>

Is this the result of applying (for instance) NFKC(LC(char)) for all the characters in Unicode, and counting how many got changed?<br>

<br>

A number of character sequences (the ones with combining marks being the most famous ones) are also changed by NFC or NFKC - is there a means of counting the impact of that?<br><font color="#888888">

<br>

                Harald<br>

<br>

</font></blockquote></div><br>