U+0130

Fri May 27 18:16:18 CEST 2011

On 5/27/2011 8:51 AM, Simon Josefsson wrote:
> The informative tables in RFC 5892 suggest U+0130 is disallowed:
>
> 0130        ; DISALLOWED  # LATIN CAPITAL LETTER I WITH DOT ABOVE
>
> I cannot figure out why.  If I go through the 5892 table generating
> logic (see steps below) I get it to be PVALID because it has
> General_Category of Lu (and nothing else seems to disallow it).  Help?
>
> /Simon
>
> Exception(U+0130) = UNKNOWN
> BackwardCompatible(U+0130) = UNKNOWN
> 	General_Category-Cn(U+0130) == FALSE
> 	Noncharacter_Code_Point(U+0130) == FALSE
> Unassigned (U+0130) == FALSE
> LDH (U+0130) == FALSE
> JoinControl (U+0130) == FALSE
> 	toNFKC(U+0130) = U+0130
> 	toCaseFold(toNFKC(U+0130)) = U+0130

I think your problem is there. The entry in CaseFolding.txt for U+0130 is:

0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE

See R4, toCasefold(X), on p. 114 of The Unicode Standard, Version 6.0,
Chapter 3, Section 3.13.

"Case_Folding(C) uses the mappings with the status field value 'C' or 'F' in
the data file CaseFolding.txt in the Unicode Character Database."
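
To illustrate how that rule reads the data file, here is a minimal Python
sketch. The function names are hypothetical, and the second sample line (the
status 'T' Turkic mapping for U+0130) is included only to show that it is
skipped; a real implementation would read the full CaseFolding.txt.

```python
def parse_case_folding_line(line):
    """Parse one CaseFolding.txt line into (code point, status, mapping).

    Returns None for blank lines and comment-only lines.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    code = int(fields[0], 16)
    status = fields[1]
    mapping = [int(cp, 16) for cp in fields[2].split()]
    return code, status, mapping


def full_case_fold_table(lines):
    """Build a folding table from entries with status 'C' or 'F' only."""
    table = {}
    for line in lines:
        entry = parse_case_folding_line(line)
        if entry is None:
            continue
        code, status, mapping = entry
        if status in ("C", "F"):  # 'S' and 'T' entries are ignored
            table[code] = mapping
    return table


sample = [
    "0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE",
    "0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE",
]
table = full_case_fold_table(sample)
print(["U+%04X" % cp for cp in table[0x0130]])
```

An implementation that loads only the simple ('C' and 'S') mappings would
have no entry for U+0130 and would leave it unchanged.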

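Hence, the case folding of U+0130 is *not* stable -- it expands to the
sequence of lowercase "i" (U+0069) plus the combining dot above (U+0307).
NFKC does not recompose that sequence, so Unstable(U+0130) is TRUE, and
RFC 5892 makes Unstable code points DISALLOWED.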

Check your implementation of toCaseFold().
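
As a quick cross-check, the Unstable test can be sketched in Python. This
assumes Python's built-in str.casefold(), which applies full case folding
using the Unicode version bundled with the interpreter (not necessarily 6.0);
the helper names are hypothetical.

```python
import unicodedata


def to_case_fold(s):
    # Assumption: str.casefold() stands in for toCaseFold(); it uses the
    # full case folding data shipped with the running Python.
    return s.casefold()


def is_unstable(cp):
    """True if toNFKC(toCaseFold(toNFKC(cp))) differs from cp."""
    ch = chr(cp)
    nfkc = unicodedata.normalize("NFKC", ch)
    folded = to_case_fold(nfkc)
    return unicodedata.normalize("NFKC", folded) != ch


folded = to_case_fold("\u0130")
print(["U+%04X" % ord(c) for c in folded])
print(is_unstable(0x0130))
```

If your own toCaseFold() returns U+0130 unchanged here, it is most likely
using only the simple case folding mappings.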

--Ken

> 	toNFKC(toCaseFold(toNFKC(U+0130))) = U+0130
> Unstable (U+0130) == FALSE