Table-building
Kenneth Whistler
kenw at sybase.com
Thu Feb 1 04:16:52 CET 2007
To further demonstrate the straightforward implementation
of the inclusion table lookup, when using a simple binary property,
I spent a few minutes adapting some very generic parsing routines.
The result is a little executable that will parse the
property data file I posted yesterday for the proposed
IDN_Permitted Unicode character property:
http://www.unicode.org/~whistler/IDNPermitted.txt
and produce a bit image array of the property for
the range U+0000..U+FFFF:
http://www.unicode.org/~whistler/BitArray.txt
That is to be read as:
. => IDN_Permitted = False
1 => IDN_Permitted = True
1024 lines of bit images, 64 code points per line.
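For anyone who wants to reproduce that kind of output, here is a minimal sketch in Python, not the executable described above. It assumes the property data file uses the usual UCD-style layout ("XXXX..YYYY ; Property # comment", or a single code point before the semicolon); that layout is an assumption, as are the function names parse_ranges and bit_image_lines.

```python
def parse_ranges(text):
    """Parse UCD-style lines into (start, end) code point pairs.

    Assumed layout: 'XXXX..YYYY ; Prop # comment' or 'XXXX ; Prop'.
    """
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        field = line.split(";", 1)[0].strip()  # first field holds the code points
        lo, _, hi = field.partition("..")
        start = int(lo, 16)
        end = int(hi, 16) if hi else start
        ranges.append((start, end))
    return ranges


def bit_image_lines(ranges, limit=0x10000, width=64):
    """Return one string per row: '1' for True, '.' for False."""
    flags = bytearray(limit)
    for start, end in ranges:
        # Anything at or above `limit` is ignored in this sketch.
        for cp in range(start, min(end, limit - 1) + 1):
            flags[cp] = 1
    return [
        "".join("1" if flags[cp] else "." for cp in range(row, row + width))
        for row in range(0, limit, width)
    ]
```

With the defaults this yields 65536 / 64 = 1024 rows, matching the line count given above.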
(I didn't bother with the Plane 2 CJK characters, because
any optimization of the table lookup would handle those
with a single range check -- and we might decide that
having CJK Extension-B characters in IDNs isn't
necessary anyway.)
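The "single range check" could look roughly like the sketch below. The range U+20000..U+2A6D6 is the CJK Extension B extent as of Unicode 5.0; the function name, the allow_ext_b switch, and the choice to treat every other supplementary code point as not permitted are assumptions made only to keep the illustration short.

```python
# CJK Unified Ideographs Extension B, as of Unicode 5.0.
CJK_EXT_B_FIRST = 0x20000
CJK_EXT_B_LAST = 0x2A6D6


def is_permitted(cp, bmp_lookup, allow_ext_b=True):
    """Combine a BMP table lookup with one range check for Plane 2.

    bmp_lookup is any callable answering the property for U+0000..U+FFFF.
    allow_ext_b is a hypothetical switch, since whether Extension B belongs
    in IDNs is left open in the discussion.
    """
    if 0 <= cp < 0x10000:
        return bmp_lookup(cp)
    if CJK_EXT_B_FIRST <= cp <= CJK_EXT_B_LAST:
        return allow_ext_b
    # Assumption for this sketch only: nothing else outside the BMP is permitted.
    return False
```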
If you take a look at BitArray.txt, it should be clear
that the entire property can be stored in a data structure
in only 8K, and that is without *any* compression at all.
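The arithmetic is 65536 code points at one bit each, or 65536 / 8 = 8192 bytes. A minimal sketch of such an uncompressed table follows; the class name BitTable and the least-significant-bit-first ordering within each byte are arbitrary choices for illustration, not anything taken from BitArray.txt.

```python
class BitTable:
    """One bit per code point for U+0000..U+FFFF, stored uncompressed."""

    def __init__(self, size=0x10000):
        self.size = size
        self.bits = bytearray(size // 8)   # 8192 bytes for the default size

    def set_range(self, start, end):
        """Mark start..end (inclusive) as True; out-of-table values are ignored."""
        for cp in range(max(start, 0), min(end, self.size - 1) + 1):
            self.bits[cp >> 3] |= 1 << (cp & 7)

    def contains(self, cp):
        """The whole lookup: one index, one shift, one mask."""
        if not 0 <= cp < self.size:
            return False
        return bool(self.bits[cp >> 3] & (1 << (cp & 7)))
```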
If you use some very generic table compression methods
that people use all the time for property lookup routines,
the data structure storage is only a few K. And the
lookup code is just a few lines in either case.
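One such generic method is a two-stage table: cut the flat bit array into fixed-size blocks, store each distinct block once, and keep a small index that maps block numbers to stored blocks. The sketch below is one common variant, not necessarily the one any particular library uses; the 256-code-point block size and the function names are assumptions. How much it saves depends entirely on how many blocks in the real data turn out to be identical.

```python
def compress(bits, block_bytes=32):
    """Split a flat bit array into blocks and share identical ones.

    bits is a bytes-like object, one bit per code point (LSB first).
    Returns (index, blocks): index[i] is the position in blocks of block i.
    """
    index, blocks, seen = [], [], {}
    for off in range(0, len(bits), block_bytes):
        block = bytes(bits[off:off + block_bytes])
        if block not in seen:
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks


def lookup(index, blocks, cp, block_bytes=32):
    """Two-stage lookup: find the block via the index, then test one bit."""
    per_block = block_bytes * 8            # code points covered by one block
    if not 0 <= cp < len(index) * per_block:
        return False
    block = blocks[index[cp // per_block]]
    off = cp % per_block
    return bool(block[off >> 3] & (1 << (off & 7)))
```

With 32-byte blocks the index for U+0000..U+FFFF has 256 entries, and every all-False stretch of 256 code points shares a single stored block.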
I'm not suggesting that we actually put BitArray.txt
anywhere in a draft or anything like that. The point is
that this kind of table lookup implementation for
properties is really rather straightforward, once the
properties themselves are defined in well-known formats.
And all of the properties needed to support IDNA
nameprep processing can be handled in trivial amounts
of memory, even if an implementer isn't using a
library routine to get the needed values.
--Ken