Table-building

Mark Davis mark.davis at icu-project.org
Fri Feb 2 02:03:35 CET 2007


You raise two issues. I'll give the classes I numbered earlier names, just
to make reference clearer:

   1. ForeverIDN: characters in IDN; once added never removed
   2. NeverIDN: characters that will never be added to IDN once they are
   in this set
   3. MaybeFutureIDN: characters (and unassigned code points) that could
   be added to IDN in the future
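
As a rough illustration of what these three definitions imply for table
maintenance, here is a minimal Python sketch. It is not part of any
specification; the names IdnClass and stability_violations, and the idea of
representing a table version as a dict from code point to class, are
assumptions made only for this example. It checks that between two table
versions only MaybeFutureIDN members change class.

```python
from enum import Enum


class IdnClass(Enum):
    FOREVER_IDN = "ForeverIDN"            # in IDN; once added, never removed
    NEVER_IDN = "NeverIDN"                # once in this set, never added to IDN
    MAYBE_FUTURE_IDN = "MaybeFutureIDN"   # could be added to IDN later


def stability_violations(old, new):
    """Return (code point, before, after) for each disallowed class change.

    `old` and `new` are hypothetical dicts mapping code point -> IdnClass
    for two successive table versions. A code point missing from a table
    is treated as MaybeFutureIDN (this covers unassigned code points).
    """
    violations = []
    for cp in sorted(set(old) | set(new)):
        before = old.get(cp, IdnClass.MAYBE_FUTURE_IDN)
        after = new.get(cp, IdnClass.MAYBE_FUTURE_IDN)
        # Only MaybeFutureIDN members may move; the other two are frozen.
        if before is not IdnClass.MAYBE_FUTURE_IDN and after is not before:
            violations.append((cp, before, after))
    return violations
```

Under this sketch, moving a code point from MaybeFutureIDN into either of the
other classes is fine, while any move out of ForeverIDN or NeverIDN is
reported. That is why a mistaken entry in NeverIDN is so costly.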

1. You maintain that we can broaden NeverIDN beyond what I've suggested.
I tend to agree with Ken that this list has little technical value. But if
you really feel that you need it for political reasons, I don't have a
strong objection. It does need more work, and will take very careful review,
however, since putting something in the set that doesn't belong will cause
problems in the future. And the political problems that arise when some small
language community really needs a character that's been put into the set are
not to be discounted.
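
For reference, the base set suggested for NeverIDN in the quoted message
below could be sketched roughly as follows. This is only an illustration
under stated assumptions: Python's standard unicodedata module does not
expose the Pattern_Syntax property, so the caller must supply that set
(the pattern_syntax parameter is hypothetical), and "control & format
characters" is read here as general categories Cc and Cf.

```python
import unicodedata

HYPHEN = 0x2D
ZWNJ, ZWJ = 0x200C, 0x200D


def is_ascii_allowed(cp):
    """True for the ASCII characters IDN allows: -, a-z, A-Z, 0-9."""
    ch = chr(cp)
    return (cp == HYPHEN or "a" <= ch <= "z"
            or "A" <= ch <= "Z" or "0" <= ch <= "9")


def in_never_idn_base(cp, pattern_syntax):
    """Rough membership test for the suggested NeverIDN base set.

    `pattern_syntax` is a caller-supplied set of code points having the
    Pattern_Syntax property (an assumption: it is not derived here).
    """
    # Pattern_Syntax minus "-"
    if cp in pattern_syntax and cp != HYPHEN:
        return True
    # plus ASCII characters currently disallowed by IDN
    if cp < 0x80 and not is_ascii_allowed(cp):
        return True
    # plus control and format characters, except ZWJ and ZWNJ
    if unicodedata.category(chr(cp)) in ("Cc", "Cf") and cp not in (ZWNJ, ZWJ):
        return True
    return False
```

Anything broader than this (symbols, dingbats, drawing characters) would
need additional rules, and each added rule is another place where a
character could be put into the set by mistake.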

2. I think part of this is based on your view that if a character is
miscategorized as punctuation (or a symbol), and we later realize that it
should be a letter, we should split it: deprecate the old character and
encode a duplicate with different properties. We have had a certain amount of
experience with these types of situations, and what you propose is not quite
as simple as one might think. We have considered it in some cases, but:

   - In the Unicode world, "deprecate" doesn't mean remove; we can't ever
   do that because of existing data. So the character will always be defined,
   even if its use is discouraged.
   - Every time we have characters that are visually identical, but
   differ in behavior, it causes problems: both security problems and
   simple usability problems. The user sees X on the screen, but searches
   don't find it, or it doesn't word-wrap as expected, or ....
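
A small Python sketch of that second point, using an existing pair chosen
only for illustration (it is not one of the cases referred to above):
U+02BC MODIFIER LETTER APOSTROPHE is a letter, U+2019 RIGHT SINGLE
QUOTATION MARK is punctuation, and many fonts render them identically or
nearly so. The helper name words is an assumption of this example.

```python
import re
import unicodedata

LETTER_APOSTROPHE = "\u02bc"   # MODIFIER LETTER APOSTROPHE, category Lm
QUOTE_MARK = "\u2019"          # RIGHT SINGLE QUOTATION MARK, category Pf


def words(text):
    """Split text into runs of word characters, as a naive tokenizer would."""
    return re.findall(r"\w+", text)


as_letter = "don" + LETTER_APOSTROPHE + "t"
as_punct = "don" + QUOTE_MARK + "t"

# The two strings look alike on screen, but a plain search for one
# does not find the other.
found = as_letter in ("I " + as_punct + " know")

# They also segment differently: the letter stays inside the word,
# while the punctuation mark splits it.
letter_words = words(as_letter)
punct_words = words(as_punct)

print(unicodedata.category(LETTER_APOSTROPHE), unicodedata.category(QUOTE_MARK))
print(found, letter_words, punct_words)
```

The search fails and the tokenizer returns one word in the first case and
two in the second, even though a user would see the same text both times.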

Mark

On 2/1/07, John C Klensin <klensin at jck.com> wrote:
>
>
>
> --On Thursday, 01 February, 2007 12:53 -0800 Mark Davis
> <mark.davis at icu-project.org> wrote:
>
> >...
> > It certainly would be possible to have a similar set of
> > characters for IDN,
> > one that we guaranteed would never be added into IDNs in the
> > future. But
> > we'd have to be quite careful that we didn't include by
> > mistake the
> > equivalent of the middle-dot.
> >
> > So if in the development of IDN tables, we had 3 classes of
> > characters,
> > listed below, I don't think it is much of a problem, as long
> > as we are
> > extremely conservative about class #2.
> >
> >    1. characters in IDN
> >    2. characters that will never be added to IDN
> >    3. characters (and unassigned code points) that could be
> > added to IDN
> >    in the future
> >
> > I agree with Ken that as far as the implementer is concerned,
> > class #1 is
> > the key issue. And thus my main trepidation about spending
> > time on #2 is
> > just that it diverts us from #1. If people really felt that #2
> > was important
> > for development, I'd suggest using for a basis the following
> > set:
> >
> >    - Pattern_Syntax
> >    - minus "-"
> >    - plus ASCII characters currently disallowed by IDN (that
> > is, ASCII
> >    except -, a-z, A-Z, 0-9)
> >    - plus control & format characters (except for ZWJ, ZWNJ)
>
> Mark,
>
> I don't know whether we are far apart or not, but let me
> identify at least one difference in perspective/assumptions.
>
> We have an external mandate to get the symbols, drawing
> characters, punctuation, dingbats, etc., forever out of IDNs.
> "Out" as in "banned from registration, banned from lookup".
> That list, in terms of number of code points, is somewhat larger
> than the one you have suggested above.  It is also likely to
> grow if you add characters of those varieties to future versions
> of Unicode.
>
> If one could assume that those characters could be handled by
> simply banning their registrations, then I would agree with you
> and Ken -- that "banned" ("#2") list would not be a matter of
> great concern, especially for implementers.  But, as we have
> discussed in another context, there is no enforcement mechanism
> that permits us to assume that all registries, at all levels of
> the DNS tree, will behave reasonably, nor that some of these
> characters will not turn out to be good ways to spoof other
> things (the standard example for this has become "things that
> look like '/'", but there are others -- how many depends on how
> paranoid one is and what assumptions are made about fonts and
> glyphs).
>
> As far as the middle-dot is concerned as an example of why one
> can't do this, I believe it is an example of something else --
> something that goes back to the intent of the original IETF-UTC
> agreement about stability.   To get away from that particular
> example, if you identify MARTIAN LEFT WIGGLE at U+90005 as
> punctuation in one version of Unicode, and then change your
> minds and decide it is really a letter (with or without some
> specific adjacency requirements), our expectation is that you
> will deprecate it in place and allocate a new MARTIAN LETTER
> LEFT WIGGLE at some other code point.  That new code point would
> then go into either "pending" or "ok", depending on other
> decisions.
>
> So I don't see it as a problem if the UTC can accept the
> position that, as long as applications of various sorts are
> dependent on the property list associated with a given
> character, you cannot, in general, change the properties: a
> serious enough mistake means that you need to allocate a new
> code point with a new set of properties.  If that isn't a
> reasonable model, then I think we are in considerable trouble.
>
>      john
>
>


-- 
Mark
More information about the Idna-update mailing list