What rules have been used for the current list of codepoints?

Thu Dec 14 03:10:44 CET 2006

I put the results of a generation for comparison on
http://macchiato.com/idn/UnicodePropertyResults.html

A few notes:

   - We've been forgetting to remove default-ignorable-code-points, so I
   added an exclusion. It only affects variation selectors.
   - We probably want to remove Runic (Runr) as a historic script
   (although I haven't done so yet).
   - I used the block notation instead of the raw ranges that Ken has.
   - This is from a program I used for testing properties, so the lines
   at the top, like the following, are actually executed to produce the
   results:
      - Let $baseId = [$gc:Lu $gc:Ll $gc:Lt $gc:Lo $gc:Lm $gc:Mc
      $gc:Mn $gc:Nd]
      - # this sets a variable $baseId to the union of a number of
      property-based sets based on general category values.
   - The ## comments outline what is done to get the different
   results. I currently generate 3 lists:
      - The base list, by range
      - Then taking out the historic scripts and symbol ranges that
      Ken recommended, by range
      - A detailed version of the second, but skipping the big
      alphabets.
   - The output is in the standard Unicode data file format:

0030..0039;Zyyy #Nd[10] (0..9) DIGIT ZERO..DIGIT NINE

<range> ; <script> # <general category> [<range count>] (<character(s)>) <name(s)>
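
For anyone who wants to reproduce the base set without my test program, here is a rough Python sketch of the same idea: take the union of the general categories listed in $baseId and collapse it into ranges. It is not the program that generated the page. The names BASE_CATEGORIES, base_id_code_points and to_ranges are made up for illustration, the results depend on the Unicode version bundled with your Python, and it collapses ranges purely by contiguity (the generated lists also split by script and general category, which the standard library does not expose).

```python
import sys
import unicodedata

# General categories named in the $baseId definition (hypothetical constant name).
BASE_CATEGORIES = {"Lu", "Ll", "Lt", "Lo", "Lm", "Mc", "Mn", "Nd"}


def base_id_code_points(limit=sys.maxunicode + 1):
    """Return, in ascending order, the code points below `limit` whose
    general category is one of BASE_CATEGORIES."""
    return [
        cp for cp in range(limit)
        if unicodedata.category(chr(cp)) in BASE_CATEGORIES
    ]


def to_ranges(code_points):
    """Collapse an ascending list of code points into inclusive (start, end) runs."""
    ranges = []
    for cp in code_points:
        if ranges and cp == ranges[-1][1] + 1:
            # Extends the current run by one code point.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            # Starts a new run.
            ranges.append((cp, cp))
    return ranges


if __name__ == "__main__":
    # Only the ASCII block here, to keep the demonstration short.
    for start, end in to_ranges(base_id_code_points(0x80)):
        print(f"{start:04X}..{end:04X}")
```

Calling base_id_code_points() with no argument scans the whole code space, which takes noticeably longer but gives the full base list.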

The result is an HTML file, although one could dump it as text. The
characters are also shown, although you'll only see them correctly if you
have a reasonable collection of fonts. Firefox is better than IE at falling
back to whatever fonts are on your system. It is also easy to pull the
output into Excel (or OpenOffice) for sorting or filtering by different
fields, such as the script field or the general category field.
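
If you would rather filter in a script than in a spreadsheet, something along these lines should work on the text form of the output. This is only a sketch based on the one sample line shown above: the Entry type, LINE_RE and parse_line are hypothetical names, and I am assuming that single code points may omit the "..end" part and the bracketed count, and that the character field does not itself contain ") ".

```python
import re
from collections import namedtuple

# Hypothetical record type mirroring the fields of one output line.
Entry = namedtuple("Entry", "start end script category count chars names")

# <range> ; <script> # <general category> [<range count>] (<character(s)>) <name(s)>
LINE_RE = re.compile(
    r"^\s*([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?"  # start and optional end
    r"\s*;\s*(\w+)"                                        # script code, e.g. Zyyy
    r"\s*#\s*(\w+)"                                        # general category, e.g. Nd
    r"\s*(?:\[(\d+)\])?"                                   # optional range count
    r"\s*\((.*?)\)\s+(.*)$"                                # characters, then names
)


def parse_line(line):
    """Parse one output line into an Entry, or return None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    start = int(m.group(1), 16)
    # A line with no "..end" part describes a single code point.
    end = int(m.group(2), 16) if m.group(2) else start
    # Assumed default: a missing count means the size of the range.
    count = int(m.group(5)) if m.group(5) else end - start + 1
    return Entry(start, end, m.group(3), m.group(4), count,
                 m.group(6), m.group(7).strip())


def filter_by_script(lines, script):
    """Return the parsed entries whose script field equals `script`."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and e.script == script]


if __name__ == "__main__":
    sample = "0030..0039;Zyyy #Nd[10] (0..9) DIGIT ZERO..DIGIT NINE"
    print(parse_line(sample))
```

Sorting by general category is then just sorted(entries, key=lambda e: e.category).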

Mark