IDN_Allowed and IDN_Disallowed

Kenneth Whistler kenw at sybase.com
Fri Feb 1 01:41:22 CET 2008


Based on the new consensus that emerged in yesterday's
meeting, I have re-derived the example data files that
I posted for people to take a look at.

To respond to people's discomfort with the terms ALWAYS
and NEVER, I've simply renamed the properties and
files to IDN_Allowed and IDN_Disallowed for the moment.

Under the new derivation, IDN_Allowed is technically
redundant as a property, since it is trivially derivable as:

   ALL - Unassigned - IDN_Disallowed
   
But I've gone ahead and done the derivations and made
the explicit lists for each version, so that the full
repertoire of assigned "Allowed" characters can be seen for each
version, and because diffs between versions highlight the
results of adding new encoded characters and any differences
in the General_Category property between versions.
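The derivation above is plain set subtraction. A minimal sketch in Python, assuming each property is held as a set of integer code points; the function name and the toy repertoire are hypothetical and are not taken from the posted data files:

```python
def derive_allowed(all_code_points, unassigned, disallowed):
    """Return ALL - Unassigned - IDN_Disallowed as a set of code points."""
    return set(all_code_points) - set(unassigned) - set(disallowed)


# Toy repertoire for illustration only; this is not real IDNA data.
ALL = range(0x41, 0x47)        # U+0041..U+0046
UNASSIGNED = {0x45, 0x46}      # pretend these two are unassigned
IDN_DISALLOWED = {0x41, 0x42}  # pretend these two are disallowed

IDN_ALLOWED = derive_allowed(ALL, UNASSIGNED, IDN_DISALLOWED)
print(sorted(hex(cp) for cp in IDN_ALLOWED))
```

Because Unassigned shrinks with each Unicode version, the same subtraction gives a different IDN_Allowed list per version, which is why the explicit per-version lists are useful for diffing.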

The data can be seen here:

http://www.unicode.org/~whistler/idna/IDN_Allowed-3.2.0.txt
http://www.unicode.org/~whistler/idna/IDN_Allowed-4.0.0.txt
http://www.unicode.org/~whistler/idna/IDN_Allowed-4.0.1.txt
http://www.unicode.org/~whistler/idna/IDN_Allowed-4.1.0.txt
http://www.unicode.org/~whistler/idna/IDN_Allowed-5.0.0.txt
http://www.unicode.org/~whistler/idna/IDN_Allowed-5.1.0.txt

http://www.unicode.org/~whistler/idna/IDN_Disallowed-3.2.0.txt
http://www.unicode.org/~whistler/idna/IDN_Disallowed-4.0.0.txt
http://www.unicode.org/~whistler/idna/IDN_Disallowed-4.0.1.txt
http://www.unicode.org/~whistler/idna/IDN_Disallowed-4.1.0.txt
http://www.unicode.org/~whistler/idna/IDN_Disallowed-5.0.0.txt
http://www.unicode.org/~whistler/idna/IDN_Disallowed-5.1.0.txt

There are slightly more characters in the IDN_Disallowed
files than had been in my previous IDN_Never files, because
I went ahead and added General_Category=Sk (modifier symbols)
into IDN_Disallowed, to match the way Patrik has been deriving
his table. Before, I had left them in MAYBE pending further
discussion of the relationship between modifier symbols
and modifier letters. But IDN_Disallowed seems o.k. for them,
given their affinity to symbols that are otherwise already
in IDN_Disallowed.
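As a rough illustration of the Sk step, the sketch below uses Python's standard unicodedata module to pick out General_Category=Sk characters and fold them into a disallowed set. The helper names are hypothetical, and the module reports categories for whatever Unicode version ships with the interpreter, not for the specific versions 3.2 through 5.1 discussed here:

```python
import unicodedata


def is_modifier_symbol(ch):
    """True if the character has General_Category=Sk."""
    return unicodedata.category(ch) == "Sk"


def add_sk_to_disallowed(disallowed, code_points):
    """Return the disallowed set extended with every Sk code point given."""
    sk = {cp for cp in code_points if is_modifier_symbol(chr(cp))}
    return set(disallowed) | sk


# U+005E (^) and U+0060 (`) are Sk; U+0061 (a) is a lowercase letter.
extended = add_sk_to_disallowed({0x24}, [0x5E, 0x60, 0x61])
print(sorted(hex(cp) for cp in extended))
```

Modifier letters such as U+02B0 have General_Category=Lm, so this test leaves them untouched.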

Of course, there are more characters in IDN_Allowed than had
been in the IDN_Always files, because all the historic
scripts have now moved from the MAYBE ("CAUTION") status to simply
IDN_Allowed, where they mingle with the modern scripts.

As before, I included the required exception lists to
make the derivation absolutely stable. As derived, for
any version transition from Unicode 3.2 to Unicode 5.1,
inclusive, any character once in IDN_Allowed
never changes state to IDN_Disallowed, and vice versa.
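The stability claim can be checked mechanically. The following is a sketch only, assuming each version's data has already been loaded into a pair of code point sets; the function name and the tuple layout are hypothetical and say nothing about how the posted files were actually verified:

```python
def find_stability_violations(versions):
    """Report code points that flip between Allowed and Disallowed.

    `versions` is a chronologically ordered list of
    (label, allowed_set, disallowed_set) tuples. The result is a sorted
    list of (code_point, earlier_label, later_label) tuples.
    """
    first_allowed = {}     # code point -> first version where it was Allowed
    first_disallowed = {}  # code point -> first version where it was Disallowed
    violations = []
    for label, allowed, disallowed in versions:
        for cp in allowed:
            if cp in first_disallowed:
                violations.append((cp, first_disallowed[cp], label))
        for cp in disallowed:
            if cp in first_allowed:
                violations.append((cp, first_allowed[cp], label))
        for cp in allowed:
            first_allowed.setdefault(cp, label)
        for cp in disallowed:
            first_disallowed.setdefault(cp, label)
    return sorted(violations)


# Toy data: newly assigned characters may join either set, but none may flip.
stable = [("3.2.0", {1, 2}, {9}), ("4.0.0", {1, 2, 3}, {9, 10})]
print(find_stability_violations(stable))
```

An empty result means no character changed state in either direction across the listed versions.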

For stability between Unicode 5.0 and Unicode 5.1, the
required exception lists are actually empty. Small exception
lists are needed only if you want to retroactively apply a
derivation all the way back to Unicode 3.2 while maintaining
this property stability. They ensure that the listed
characters still end up in the correct class, even though they had
a change in General_Category or casing status sometime
between Unicode 3.2 and Unicode 5.0.
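One way to picture the role of the exception lists is as overrides applied after the property-based derivation. This is a sketch under that assumption; the parameter names and the two-list shape are hypothetical, since the message does not spell out how the exception lists are structured:

```python
def apply_exceptions(derived_disallowed,
                     force_allowed=frozenset(),
                     force_disallowed=frozenset()):
    """Override a derived IDN_Disallowed set with stability exceptions.

    Code points in `force_allowed` are removed from the derived set and
    code points in `force_disallowed` are added to it. With both lists
    empty, the derived set is returned unchanged.
    """
    force_allowed = set(force_allowed)
    force_disallowed = set(force_disallowed)
    overlap = force_allowed & force_disallowed
    if overlap:
        raise ValueError(f"code points in both exception lists: {sorted(overlap)}")
    return (set(derived_disallowed) - force_allowed) | force_disallowed


# Toy data: code point 2 changed category and must stay Allowed,
# while code point 3 must stay Disallowed.
print(sorted(apply_exceptions({1, 2}, force_allowed={2}, force_disallowed={3})))
```

With empty exception lists, as in the 5.0 to 5.1 case, the call is a no-op and the derivation depends on the Unicode properties alone.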

So if we start the IDNA tables from Unicode 5.0, we can
also start with empty stability exception lists,
and likely can keep them empty going forward, as well.

--Ken





More information about the Idna-update mailing list