Q2: What mapping function should be used in a revised IDNA2008 specification?

Mark Davis mark at macchiato.com
Fri Apr 3 18:19:17 CEST 2009


What I do is, for every Unicode character C:

   - Apply the mapping (e.g. NFKC + CaseFolding + DefaultIgnorableRemoval)
   to C, getting result R. If the following are all true, I increment a
   Remapped counter:
      1. R != C, and
      2. every character in R is a possible U-Label character (PVALID or
      CONTEXTJ or CONTEXTO).
   - Apply the IDNA2003 mapping to C, getting R'. If the following are all
   true, I increment a Diverging counter:
      1. IDNA2003 succeeded (there is an R'),
      2. R' != R, and
      3. every character in R' is a possible U-Label character.
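
The counting procedure above can be sketched roughly as follows. This is an illustration only, not the actual program: the names count_mappings, nfkc_cf, new_mapping, old_mapping and is_label_char are hypothetical, the two counters are read as independent of each other, and the real IDNA2003 mapping and the PVALID/CONTEXTJ/CONTEXTO tables have to be supplied by the caller. Default-ignorable removal is left out because Python's standard unicodedata module does not expose that property.

```python
import unicodedata
from typing import Callable, Iterable, Optional, Tuple


def nfkc_cf(c: str) -> str:
    """NFKC + case folding (no default-ignorable removal; see lead-in)."""
    folded = unicodedata.normalize("NFKC", c).casefold()
    # Normalize again, since case folding can produce unnormalized text.
    return unicodedata.normalize("NFKC", folded)


def count_mappings(
    chars: Iterable[str],
    new_mapping: Callable[[str], str],
    old_mapping: Callable[[str], Optional[str]],  # None means "mapping failed"
    is_label_char: Callable[[str], bool],
) -> Tuple[int, int]:
    """Return (remapped, diverging) counts over the given characters."""
    remapped = 0
    diverging = 0
    for c in chars:
        r = new_mapping(c)
        # Remapped: the result differs from C and consists only of
        # possible U-Label characters. An empty result passes all()
        # vacuously; a stricter variant could exclude it.
        if r != c and all(is_label_char(ch) for ch in r):
            remapped += 1
        # Diverging: the old mapping succeeded, disagrees with R, and
        # its result also consists only of possible U-Label characters.
        r_old = old_mapping(c)
        if (
            r_old is not None
            and r_old != r
            and all(is_label_char(ch) for ch in r_old)
        ):
            diverging += 1
    return remapped, diverging


# Toy stand-ins for the real tables, for demonstration only.
TOY_NEW = {"A": "a", "B": "b"}
TOY_OLD = {"A": "a", "B": "x"}


def toy_new(c: str) -> str:
    return TOY_NEW.get(c, c)


def toy_old(c: str) -> Optional[str]:
    return TOY_OLD.get(c, c)


def toy_is_label_char(ch: str) -> bool:
    return "a" <= ch <= "z"


toy_counts = count_mappings("ABa", toy_new, toy_old, toy_is_label_char)
```

With the toy tables, "A" and "B" are both remapped, and only "B" diverges, since the two mappings disagree on it. For a real run, chars would range over all assigned code points, and new_mapping would be swapped for each candidate (NFKC-CF, NFKC-LC, and so on).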

Mark


On Fri, Apr 3, 2009 at 05:44, Harald Alvestrand <harald at alvestrand.no> wrote:

> Mark Davis wrote:
>
>> I modified the program to add a comparison to IDNA2003. I am only
>> including cases where the mapping results in A-Label characters. The numbers
>> within and across rows don't add up as you might expect because of various
>> overlaps and because only mappings to A-Label characters are counted.
>>
>> Most of the differences between NFKC-CF-RDI and IDNA2003 are new 5.2
>> characters; there are 5 diverging mappings. (As I said before, these figures
>> don't include the current list of special cases: eszett, final_sigma,
>> joiners.)
>>
>> PValid or Context: 90262
>> Mapping        Remapped    Diverging
>> IDNA2003       4337
>> NFKC-CF-RDI    5291        5
>> NFKC-LC-RDI    5225        77
>> NFKC-CF        4896        32
>> NFKC-LC        4830        104
>> NFC-CF-RDI     2486        2663
>> NFC-LC-RDI     2395        2754
>> NFC-CF         2091        2690
>> NFC-LC         2000        2781
>>
>> Mark
>>
> Mark,
>
> these numbers confuse me a bit - what are you counting?
>
> Is this the result of applying (for instance) NFKC(LC(char)) for all the
> characters in Unicode, and counting how many got changed?
>
> A number of character sequences (the ones with combining marks being the
> most famous ones) are also changed by NFC or NFKC - is there a means of
> counting the impact of that?
>
>                Harald
>
>