Table-building

Kenneth Whistler kenw at sybase.com
Thu Feb 1 04:16:52 CET 2007


To further demonstrate how straightforward an inclusion table
lookup is to implement when using a simple binary property,
I spent a few minutes adapting some very generic parsing routines.

The result is a little executable that will parse the
property data file I posted yesterday for the proposed
IDN_Permitted Unicode character property:

http://www.unicode.org/~whistler/IDNPermitted.txt

and produce a bit image array of the property for
the range U+0000..U+FFFF:

http://www.unicode.org/~whistler/BitArray.txt

That is to be read as:

  . => IDN_Permitted = False
  1 => IDN_Permitted = True
  
1024 lines of bit images, 64 code points per line.
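
For illustration, a bit image in that form can be packed into a
byte array with a few lines of Python. This is only a sketch, not the
executable described above. It assumes every non-blank line is exactly
64 characters of '.' or '1' with no header or comment lines, which may
not match the posted file exactly; the names parse_bit_image and
is_permitted are made up for the example.

```python
def parse_bit_image(lines):
    """Pack rows of '.'/'1' (64 code points per row) into a bytearray.

    Assumption: each non-blank row is exactly 64 characters, all '.' or '1'.
    Bit (cp & 7) of byte (cp >> 3) holds the property value for code point cp.
    """
    bits = bytearray()
    for line in lines:
        row = line.strip()
        if not row:
            continue  # skip blank lines
        if len(row) != 64 or set(row) - {".", "1"}:
            raise ValueError("unexpected row format: %r" % row)
        for i in range(0, 64, 8):
            byte = 0
            for j, ch in enumerate(row[i:i + 8]):
                if ch == "1":
                    byte |= 1 << j
            bits.append(byte)
    return bits


def is_permitted(bits, cp):
    """Return True if the bit for code point cp is set."""
    return bool((bits[cp >> 3] >> (cp & 7)) & 1)
```

With 1024 such rows the resulting array is 1024 * 8 = 8192 bytes,
one bit for each code point in U+0000..U+FFFF.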

(I didn't bother with the Plane 2 CJK characters, because
any optimization of the table lookup would handle those
with a single range check -- and we might decide that
having CJK Extension-B characters in IDNs isn't
necessary anyway.)
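
A sketch of what such a range check might look like, in Python.
The names here are hypothetical, and the Extension-B range used
(U+20000..U+2A6D6) is an assumption based on Unicode 5.0. Whether those
characters are permitted at all is left as a switch, since that is
undecided. Any other supplementary ranges the property data lists would
need their own entries in the range list.

```python
# Assumption: CJK Extension B as encoded in Unicode 5.0.
# Other supplementary ranges from the property data would be added here.
SUPPLEMENTARY_RANGES = [(0x20000, 0x2A6D6)]


def idn_permitted(cp, bmp_lookup, allow_supplementary=True):
    """Look up a code point: BMP via the table, Plane 2 CJK via a range check.

    bmp_lookup is any callable returning True/False for U+0000..U+FFFF,
    for example a bit array lookup.
    """
    if cp <= 0xFFFF:
        return bmp_lookup(cp)
    if allow_supplementary:
        for first, last in SUPPLEMENTARY_RANGES:
            if first <= cp <= last:
                return True
    return False
```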

If you take a look at BitArray.txt, it should be clear
that the entire property can be stored in a data structure
in only 8K bytes (65,536 bits), and that is without *any* compression at all.
If you use some very generic table compression methods
that people use all the time for property lookup routines,
the data structure storage is only a few K. And the
lookup code is just a few lines in either case.
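
To make that concrete, here is a minimal Python sketch of both
variants: a flat 8192-byte bit array, and one common kind of generic
compression, a two-stage table that stores each distinct 256-code-point
block only once. The two-stage scheme is just one example of such a
method, not necessarily the one anybody would pick, and the demo data
(only a-z permitted) is made up rather than taken from IDNPermitted.txt.
All names are hypothetical.

```python
BLOCK_BYTES = 32  # 256 code points per block, one bit each


def lookup_flat(bits, cp):
    """Flat bit array lookup: bit (cp & 7) of byte (cp >> 3)."""
    return bool((bits[cp >> 3] >> (cp & 7)) & 1)


def compress(bits):
    """Split the flat array into 32-byte blocks and store each distinct one once.

    Returns (index, blocks): index[cp >> 8] is the position in blocks of
    the block covering code point cp.
    """
    index, blocks, seen = [], [], {}
    for start in range(0, len(bits), BLOCK_BYTES):
        block = bytes(bits[start:start + BLOCK_BYTES])
        if block not in seen:
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks


def lookup_two_stage(index, blocks, cp):
    """Two-stage lookup: pick the block, then test the bit inside it."""
    block = blocks[index[cp >> 8]]
    offset = cp & 0xFF
    return bool((block[offset >> 3] >> (offset & 7)) & 1)


def make_demo_bits():
    """Made-up demo data: only a-z set, everything else clear."""
    bits = bytearray(8192)  # 65536 bits for U+0000..U+FFFF
    for cp in range(ord("a"), ord("z") + 1):
        bits[cp >> 3] |= 1 << (cp & 7)
    return bits
```

Usage would be along these lines:

```python
bits = make_demo_bits()
index, blocks = compress(bits)
lookup_flat(bits, ord("q"))
lookup_two_stage(index, blocks, ord("q"))
```

How much the two-stage form saves depends entirely on how many
identical blocks the real data has.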

I'm not suggesting that we actually put BitArray.txt
anywhere in a draft or anything like that. The point is
that this kind of table lookup implementation for
properties is really rather straightforward, once the
properties themselves are defined in well-known formats.
And all of the properties needed to support IDNA
nameprep processing can be handled in trivial amounts
of memory, even if an implementer isn't using a
library routine to get the needed values.

--Ken


