Table-building

Fri Feb 2 00:53:40 CET 2007

--On Thursday, 01 February, 2007 12:53 -0800 Mark Davis
<mark.davis at icu-project.org> wrote:

>...
> It certainly would be possible to have a similar set of characters
> for IDN, one that we guaranteed would never be added into IDNs in
> the future. But we'd have to be quite careful that we didn't include
> by mistake the equivalent of the middle-dot.
> 
> So if in the development of IDN tables, we had 3 classes of
> characters, listed below, I don't think it is much of a problem, as
> long as we are extremely conservative about class #2.
> 
>    1. characters in IDN
>    2. characters that will never be added to IDN
>    3. characters (and unassigned code points) that could be added to
>       IDN in the future
> 
> I agree with Ken that as far as the implementer is concerned, class
> #1 is the key issue. And thus my main trepidation about spending
> time on #2 is just that it diverts us from #1. If people really felt
> that #2 was important for development, I'd suggest using for a basis
> the following set:
> 
>    - Pattern_Syntax
>    - minus "-"
>    - plus ASCII characters currently disallowed by IDN (that is,
>      ASCII except -, a-z, A-Z, 0-9)
>    - plus control & format characters (except for ZWJ, ZWNJ)
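
The basis set proposed above can be sketched as a membership test. This
is only an illustration under stated assumptions: "control & format" is
read here as general categories Cc and Cf, and because Python's
unicodedata module does not expose Pattern_Syntax, a small hand-picked
sample (PATTERN_SYNTAX_SAMPLE, a hypothetical name) stands in for the
real property. The function name in_never_class is likewise made up for
this example.

```python
import string
import unicodedata

ZWNJ, ZWJ = 0x200C, 0x200D

# ASCII code points IDN already permits: "-", a-z, A-Z, 0-9.
ALLOWED_ASCII = frozenset(map(ord, string.ascii_letters + string.digits + "-"))

# ASSUMPTION: a small hand-picked sample standing in for the Unicode
# Pattern_Syntax property, which unicodedata does not expose. A real
# table would be generated from the Unicode Character Database.
PATTERN_SYNTAX_SAMPLE = frozenset(
    map(ord, "!\"#$%&'()*+,-./:;<=>?@[\\]^`{|}~\u00a7\u00b6\u2190\u2500")
)


def in_never_class(cp, pattern_syntax=PATTERN_SYNTAX_SAMPLE):
    """Return True if code point cp falls in the proposed class #2 basis set."""
    if cp in (ZWJ, ZWNJ) or cp == ord("-"):
        return False  # explicit exceptions in the proposal
    if cp < 0x80:
        return cp not in ALLOWED_ASCII  # ASCII currently disallowed by IDN
    if unicodedata.category(chr(cp)) in ("Cc", "Cf"):
        return True  # control and format characters (assumed reading)
    return cp in pattern_syntax
```

Passing a complete Pattern_Syntax set as the second argument would
replace the sample without changing the logic.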

Mark,

I don't know whether we are far apart or not, but let me
identify at least one difference in perspective and assumptions.

We have an external mandate to get the symbols, drawing
characters, punctuation, dingbats, etc., forever out of IDNs.
"Out" as in "banned from registration, banned from lookup".
That list, in terms of number of code points, is somewhat larger
than the one you have suggested above.  It is also likely to
grow if you add characters of those varieties to future versions
of Unicode.  

If one could assume that those characters could be handled by
simply banning their registrations, then I would agree with you
and Ken -- that "banned" ("#2") list would not be a matter of
great concern, especially for implementers.  But, as we have
discussed in another context, there is no enforcement mechanism
that permits us to assume that all registries, at all levels of
the DNS tree, will behave reasonably, nor can we assume that some
of these characters will not turn out to be good ways to spoof
other things (the standard example for this has become "things
that look like '/'", but there are others -- how many depends on
how paranoid one is and what assumptions are made about fonts and
glyphs).
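
To make "banned from lookup" concrete, here is a rough sketch of a
client-side check that refuses a name before any DNS query is made,
whatever a registry may have allowed. It assumes, purely for
illustration, that the banned list can be approximated by the Unicode
general categories for punctuation (P*) and symbols (S*), with
hyphen-minus exempted; the real list would be a published table. The
function names are hypothetical.

```python
import unicodedata


def banned_at_lookup(ch):
    """Rough stand-in for the banned list: punctuation and symbols."""
    if ch == "-":
        return False  # hyphen-minus is permitted in labels
    # ASSUMPTION: general categories P* and S* approximate the banned set.
    return unicodedata.category(ch)[0] in ("P", "S")


def offending_characters(domain):
    """Return (label, code point) pairs that should block a lookup."""
    hits = []
    for label in domain.split("."):
        for ch in label:
            if banned_at_lookup(ch):
                hits.append((label, f"U+{ord(ch):04X}"))
    return hits


# U+2215 DIVISION SLASH is one of the "things that look like '/'".
print(offending_characters("bank.example\u2215login.test"))
```

A resolver library applying this check would reject the name above even
if some zone had been willing to register the label.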

As far as the middle-dot is concerned as an example of why one
can't do this, I believe it is an example of something else --
something that goes back to the intent of the original IETF-UTC
agreement about stability.   To get away from that particular
example, if you identify MARTIAN LEFT WIGGLE at U+90005 as
punctuation in one version of Unicode, and then change your
minds and decide it is really a letter (with or without some
specific adjacency requirements), our expectation is that you
will deprecate it in place and allocate a new MARTIAN LETTER
LEFT WIGGLE at some other code point.  That new code point would
then go into either "pending" or "ok", depending on other
decisions.

So I don't see it as a problem if the UTC can accept the
position that, as long as applications of various sorts are
dependent on the property list associated with a given
character, you cannot, in general, change the properties: a
serious enough mistake means that you need to allocate a new
code point with a new set of properties.  If that isn't a
reasonable model, then I think we are in considerable trouble.
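
The stability model described above can be sketched as a table in which
a class, once decided, is final. This is an illustration only: the
class names "ok", "banned", and "pending" follow this message, while
the StableTable type and the second code point (U+90006, standing in
for a newly allocated MARTIAN LETTER LEFT WIGGLE) are hypothetical.

```python
class StableTable:
    """Code point classes where "ok" and "banned" are final once assigned."""

    CLASSES = ("ok", "banned", "pending")

    def __init__(self):
        self._classes = {}

    def classify(self, cp):
        # Anything not yet decided, including unassigned code points, is pending.
        return self._classes.get(cp, "pending")

    def assign(self, cp, new_class):
        if new_class not in self.CLASSES:
            raise ValueError(f"unknown class: {new_class}")
        current = self.classify(cp)
        if current != "pending" and new_class != current:
            # A decided class never changes; the fix is a new code point.
            raise ValueError(
                f"U+{cp:04X} is already {current}; allocate a new code point"
            )
        self._classes[cp] = new_class


table = StableTable()
table.assign(0x90005, "banned")      # MARTIAN LEFT WIGGLE, classed as punctuation
try:
    table.assign(0x90005, "ok")      # a change of mind in place is refused
except ValueError as err:
    print(err)
table.assign(0x90006, "pending")     # hypothetical replacement code point
```

The only transitions allowed are out of "pending", which matches the
expectation that a corrected character arrives at a new code point
rather than by reclassifying the old one.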

     john