Table-building

Mark Davis mark.davis at icu-project.org
Thu Feb 1 21:53:01 CET 2007


We may not be that far apart on this -- sometimes the terminology may be
getting in our way.

For example, the notion of a "multi-state" table is not all that different
from what we effectively have in Unicode. Look at the diagram in
http://www.unicode.org/reports/tr31/#Introduction, "Figure 1. Code Point
Categories for Identifier Parsing".  That is, for the purposes of
identifiers, we divide up characters into certain classes:

   1. Identifier characters (roughly letters, marks, decimal numbers)
   2. Pattern characters (whitespace and "syntax" like +, -, ...)
   3. Other (assigned or unassigned)

To see a list of the Pattern characters, see
http://www.unicode.org/Public/UNIDATA/PropList.txt, and search for either:

   - Pattern_White_Space
   - Pattern_Syntax.

To see a list of the ID characters, see
http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt, and search
for

   - XID_Continue
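
For anyone who would rather pull these sets out programmatically than search
by eye, here is a minimal sketch of reading the "range ; property" lines those
files use. The sample text is a short excerpt in the style of PropList.txt,
included only so the example is self-contained; the function name and the
regular expression are my own assumptions, not anything the UCD defines.

```python
import re

# A few lines in the style of PropList.txt (illustrative excerpt, not the full file).
SAMPLE = """\
# comment lines and blank lines are ignored
0009..000D    ; Pattern_White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; Pattern_White_Space # Zs       SPACE
0021..0023    ; Pattern_Syntax # Po   [3] EXCLAMATION MARK..NUMBER SIGN
002D          ; Pattern_Syntax # Pd       HYPHEN-MINUS
"""

# One code point, or a first..last range, then ';' and the property name.
LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def code_points_with_property(text, prop):
    """Return the set of code points that `text` lists for property `prop`."""
    result = set()
    for line in text.splitlines():
        m = LINE.match(line)
        if m and m.group(3) == prop:
            first = int(m.group(1), 16)
            last = int(m.group(2) or m.group(1), 16)
            result.update(range(first, last + 1))
    return result


pattern_white_space = code_points_with_property(SAMPLE, "Pattern_White_Space")
pattern_syntax = code_points_with_property(SAMPLE, "Pattern_Syntax")
```

The same function should work on DerivedCoreProperties.txt with "XID_Continue"
as the property name, assuming the file keeps this line layout.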

We also have constraints on future changes (as one can see reading further
down in http://www.unicode.org/reports/tr31/). In particular, the set of
Pattern characters is one that we will never change. I think one difference
might be the motivation. The Pattern characters were designed to be an
immutable set that could be used syntactically without worrying that one of
them might later be included in identifiers, and for that purpose they were
put out of bounds for inclusion in future identifiers. Because of those
strict guarantees, we were extremely conservative about their contents. (The
definition of this property was produced in response to requests from the
W3C.)

It certainly would be possible to have a similar set of characters for IDN,
one that we guaranteed would never be added into IDNs in the future. But
we'd have to be quite careful that we didn't include by mistake the
equivalent of the middle-dot.

So if in the development of IDN tables, we had 3 classes of characters,
listed below, I don't think it is much of a problem, as long as we are
extremely conservative about class #2.

   1. characters in IDN
   2. characters that will never be added to IDN
   3. characters (and unassigned code points) that could be added to IDN
   in the future
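
To make the constraint on class #2 concrete, here is a small sketch of the
three-way classification and of the one check that matters for it: nothing
promised as "never in IDN" may show up as allowed in a later table. The
function names and the sample tables are hypothetical; the middle dot is used
only as the kind of mistake mentioned above.

```python
def idn_class(cp, in_idn, never_in_idn):
    """Return 1, 2 or 3 for code point `cp`, following the list above."""
    if cp in in_idn:
        return 1
    if cp in never_in_idn:
        return 2
    return 3  # everything else, including unassigned code points


def broken_promises(old_never_in_idn, new_in_idn):
    """Code points an older table put in class #2 that a newer table puts in class #1."""
    return sorted(old_never_in_idn & new_in_idn)


# Hypothetical tables: the old one wrongly promised that U+00B7 MIDDLE DOT
# would never be in IDN; the new one wants to allow it.
old_never = {ord("+"), 0x00B7}
new_allowed = set(range(ord("a"), ord("z") + 1)) | {0x00B7}

conflicts = broken_promises(old_never, new_allowed)
```

With these sample tables, `conflicts` contains only U+00B7, which is exactly
the case where being too generous with class #2 would have hurt us.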

I agree with Ken that as far as the implementer is concerned, class #1 is
the key issue. And thus my main trepidation about spending time on #2 is
just that it diverts us from #1. If people really felt that #2 was important
for development, I'd suggest using the following set as a basis:

   - Pattern_Syntax
   - minus "-"
   - plus ASCII characters currently disallowed by IDN (that is, ASCII
   except -, a-z, A-Z, 0-9)
   - plus control & format characters (except for ZWJ, ZWNJ)
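
As a rough sketch of how that set could be assembled: the function below takes
the Pattern_Syntax code points as an argument (Python's standard library does
not expose that property, so it would have to be read from PropList.txt), and
it assumes that "control & format characters" means General_Category Cc and
Cf. The function name is hypothetical, and the result depends on the Unicode
version built into the Python being used.

```python
import string
import unicodedata

ZWNJ, ZWJ = 0x200C, 0x200D
HYPHEN = ord("-")


def never_in_idn_basis(pattern_syntax):
    """Build the suggested basis set from a set of Pattern_Syntax code points."""
    # Pattern_Syntax minus "-"
    result = set(pattern_syntax) - {HYPHEN}

    # plus ASCII other than -, a-z, A-Z, 0-9
    allowed_ascii = {ord(c) for c in string.ascii_letters + string.digits + "-"}
    result |= set(range(0x80)) - allowed_ascii

    # plus control (Cc) and format (Cf) characters, except ZWJ and ZWNJ
    # (assumption: "control & format" is read as these two general categories)
    result |= {
        cp
        for cp in range(0x110000)
        if unicodedata.category(chr(cp)) in ("Cc", "Cf") and cp not in (ZWNJ, ZWJ)
    }
    return result


# Tiny stand-in for the real Pattern_Syntax set, for illustration only.
basis = never_in_idn_basis({ord("+"), ord("-"), 0x2190})
```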

Mark