kenw at sybase.com
Fri May 27 18:16:18 CEST 2011
On 5/27/2011 8:51 AM, Simon Josefsson wrote:
> The informative tables in RFC 5892 suggest U+0130 is disallowed:
> 0130 ; DISALLOWED # LATIN CAPITAL LETTER I WITH DOT ABOVE
> I cannot figure out why. If I go through the 5892 table generating
> logic (see steps below) I get it to be PVALID because it has
> General_Category of Lu (and nothing else seems to disallow it). Help?
> Exception(U+0130) = UNKNOWN
> BackwardCompatible(U+0130) = UNKNOWN
> General_Category-Cn(U+0130) == FALSE
> Noncharacter_Code_Point(U+0130) == FALSE
> Unassigned (U+0130) == FALSE
> LDH (U+0130) == FALSE
> JoinControl (U+0130) == FALSE
> toNFKC(U+0130) = U+0130
> toCaseFold(toNFKC(U+0130)) = U+0130
I think your problem is there. The entry in CaseFolding.txt for U+0130 is:
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
See rule R4, toCasefold(X), on p. 114 of The Unicode Standard, Version 6.0,
Chapter 3, Section 3.13.
"Case_Folding(C) uses the mappings with the status field value 'C' or 'F' in
the data file CaseFolding.txt in the Unicode Character Database."
Hence, the case folding of U+0130 is *not* stable -- it expands to the
sequence of lowercase "i" plus the combining dot above.
Check your implementation of toCaseFold().
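To make the rule concrete, here is a minimal sketch of a toCaseFold() that honors only the 'C' and 'F' status values. The helper names (load_full_folding, to_casefold) and the inline SAMPLE data are made up for illustration; the 'F' line is the entry quoted above, and the 'T' (Turkic) line is included as I recall it from CaseFolding.txt, only to show that it gets skipped. A real implementation would read the whole data file.

```python
# Sample lines in CaseFolding.txt format: code; status; mapping; # name
# (hypothetical inline excerpt; a real implementation reads the full file)
SAMPLE = """\
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""


def load_full_folding(text):
    """Build a char -> folded-string map from status 'C' and 'F' lines only."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        code, status, mapping = [f.strip() for f in data.split(";")[:3]]
        if status in ("C", "F"):  # 'S' and 'T' lines are ignored
            table[chr(int(code, 16))] = "".join(
                chr(int(cp, 16)) for cp in mapping.split()
            )
    return table


def to_casefold(s, table):
    """Map each character through the table; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in s)


FOLDING = load_full_folding(SAMPLE)
folded = to_casefold("\u0130", FOLDING)
print([f"U+{ord(c):04X}" for c in folded])  # ['U+0069', 'U+0307']
```

An implementation that only picks up single-code-point mappings (status 'C' or 'S') finds nothing for U+0130 and returns it unchanged, which would produce the result in your trace.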
> toNFKC(toCaseFold(toNFKC(U+0130))) = U+0130
> Unstable (U+0130) == FALSE
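With full case folding applied, that last step comes out the other way. As a rough cross-check, the Unstable test from RFC 5892 (cp differs from toNFKC(toCaseFold(toNFKC(cp)))) can be sketched with the Python standard library. This assumes that str.casefold() performs full case folding and that the interpreter's bundled Unicode data agrees with the relevant Unicode version for this character; the function names are my own.

```python
import unicodedata


def nfkc(s):
    """Apply Unicode Normalization Form KC."""
    return unicodedata.normalize("NFKC", s)


def is_unstable(cp):
    """True if cp != toNFKC(toCaseFold(toNFKC(cp))), per RFC 5892 Section 2.2."""
    ch = chr(cp)
    return nfkc(nfkc(ch).casefold()) != ch


print(is_unstable(0x0130))  # True: folds to U+0069 U+0307
print(is_unstable(0x0061))  # False: "a" maps to itself
```

An Unstable code point falls into the DISALLOWED bucket, which matches the table entry in the RFC.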