IDNNever.txt
Kenneth Whistler
kenw at sybase.com
Sat Feb 3 03:37:36 CET 2007
Since the issue came up today, I have gone ahead and drafted
what a property file for an IDN_Never property would look
like, with a representative (and conservative) first cut
at its content.
See:
http://www.unicode.org/~whistler/IDNNever.txt
That has the same format as:
http://www.unicode.org/~whistler/IDNPermitted.txt
which I explained earlier.
As conservative criteria for what we can absolutely, positively
guarantee belongs in the never, never, ever category, I have
started with:
1. cp != NFKC(cp)
2. cp has Pattern_Syntax property
3. cp has Pattern_White_Space property
4. cp has White_Space property
5. cp has Variation_Selector property
6. cp has Noncharacter_Code_Point property
7. cp has General_Category=Cf (Unicode format controls)
8. cp has General_Category=Cc (ISO controls)
(There is considerable overlap for some of those properties,
so not all of them may be required -- some may be redundant
for the purposes of this derivation. I just haven't done
the detailed analysis on this first cut yet.)
Then the following three exceptions are pulled from the
list:
1. cp = U+002D HYPHEN-MINUS (a Pattern_Syntax character)
2. cp = U+200C ZERO WIDTH NON-JOINER (gc=Cf)
3. cp = U+200D ZERO WIDTH JOINER (gc=Cf)
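
For illustration only, here is a rough Python sketch of part of the
derivation above. It is not the utility used to generate IDNNever.txt.
It assumes Python's standard unicodedata module, which exposes NFKC and
General_Category but not Pattern_Syntax, Pattern_White_Space, or
White_Space, so criteria 2 through 4 are left out. The function name
idn_never_sketch and the hard-coded Variation_Selector ranges are
assumptions made for this example, and results depend on the Unicode
version bundled with the Python build.

```python
import unicodedata

# The three exceptions named in the message: HYPHEN-MINUS, ZWNJ, ZWJ.
EXCEPTIONS = {0x002D, 0x200C, 0x200D}

# Variation_Selector ranges, hard-coded as an assumption for this sketch;
# a real derivation would read them from PropList.txt.
VARIATION_SELECTORS = [(0x180B, 0x180D), (0xFE00, 0xFE0F), (0xE0100, 0xE01EF)]


def is_noncharacter(cp):
    # U+FDD0..U+FDEF plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_variation_selector(cp):
    return any(lo <= cp <= hi for lo, hi in VARIATION_SELECTORS)


def idn_never_sketch(cp):
    """Partial approximation of IDN_Never (criteria 1, 5, 6, 7, 8 only)."""
    if cp in EXCEPTIONS:
        return False
    ch = chr(cp)
    # Criterion 1: the code point is changed by NFKC normalization.
    if unicodedata.normalize("NFKC", ch) != ch:
        return True
    # Criteria 7 and 8: format controls (Cf) and ISO controls (Cc).
    if unicodedata.category(ch) in ("Cf", "Cc"):
        return True
    # Criteria 5 and 6: variation selectors and noncharacters.
    return is_variation_selector(cp) or is_noncharacter(cp)
```

With this sketch, U+FB01 (a compatibility ligature) and U+00AD (SOFT
HYPHEN, gc=Cf) come out True, while U+200C and U+200D come out False
only because of the explicit exception list.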
The listing is not quite complete yet, because my utility
only processed Planes 0, 1, 2, and 14, and there are also
noncharacter code points on the other planes. Also, I
think all user-defined characters must be given IDN_Never=True,
and I haven't done that yet.
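
As a sketch of what is still missing, the noncharacter code points on
all 17 planes can be enumerated arithmetically rather than by scanning
data files. The second half of the example assumes that "user-defined
characters" refers to the private-use ranges; that reading is an
assumption, and the names noncharacters and is_private_use are made up
for this example.

```python
def noncharacters():
    """Yield every Noncharacter_Code_Point across planes 0 through 16."""
    # The contiguous block in the BMP.
    yield from range(0xFDD0, 0xFDF0)
    # The last two code points of each of the 17 planes.
    for plane in range(17):
        base = plane << 16
        yield base | 0xFFFE
        yield base | 0xFFFF


# Private-use ranges: the BMP private use area and planes 15 and 16.
# Treating these as the "user-defined characters" is an assumption.
PRIVATE_USE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def is_private_use(cp):
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE)
```

Enumerating noncharacters() gives 66 code points: 32 in
U+FDD0..U+FDEF plus two at the end of each plane.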
But if you check this list, it should be clear in general
what kinds of characters constitute what I earlier
designated as the ones that *nobody* wants to include
in IDNs and for which a stability guarantee would be easy
to stand by.
--Ken