There are two issues you raise. I'll name the three classes of characters I listed, just to make references clearer:<br><ol><li><span style="font-weight: bold;">ForeverIDN: </span>characters in IDN; once added, never removed<br>
</li><li><span style="font-weight: bold;">NeverIDN: </span>characters that will never be added to IDN once they are in this set<br></li><li><span style="font-weight: bold;">MaybeFutureIDN: </span>characters (and unassigned code points) that could be added to IDN in the future
</li></ol>1. You maintain that we can broaden NeverIDN beyond what I've suggested. I tend to agree with Ken that this list has little technical value. But if you really feel that you need this for political reasons, I don't have a strong objection. It does need more work, however, and will take very careful review, since putting something in the set that doesn't belong there will cause problems in the future. And the political problems that arise when some small language community really needs a character that's been put into the set are not to be discounted.
<br><br>2. I think part of this is based on your view that if a character is miscategorized as punctuation (or a symbol), and we later realize that it should be a letter, we should split it: deprecate the old character and encode a duplicate with different properties. We have had a certain amount of experience with these types of situations, and what you propose is not quite as simple as one might think. We have considered it in some cases, but:
<br><ul><li>In the Unicode world, "deprecate" doesn't mean remove; we can't ever do that because of existing data. So the character will always be defined, even if its use is discouraged.<br></li><li>Every time we have characters that are visually identical but differ in behavior, it
<span style="font-style: italic;">always </span>causes problems: both security problems and simple usability problems. The user sees X on the screen, but searches don't find it, or it doesn't word-wrap as expected, or ....
</li></ul>Mark<br><br><div><span class="gmail_quote">On 2/1/07, <b class="gmail_sendername">John C Klensin</b> <<a href="mailto:klensin@jck.com">klensin@jck.com</a>> wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<br><br>--On Thursday, 01 February, 2007 12:53 -0800 Mark Davis<br><<a href="mailto:mark.davis@icu-project.org">mark.davis@icu-project.org</a>> wrote:<br><br>>...<br>> It certainly would be possible to have a similar set of
<br>> characters for IDN,<br>> one that we guaranteed would never be added into IDNs in the<br>> future. But<br>> we'd have to be quite careful that we didn't include by<br>> mistake the<br>> equivalent of the middle-dot.
<br>><br>> So if in the development of IDN tables, we had 3 classes of<br>> characters,<br>> listed below, I don't think it is much of a problem, as long<br>> as we are<br>> extremely conservative about class #2.
<br>><br>> 1. characters in IDN<br>> 2. characters that will never be added to IDN<br>> 3. characters (and unassigned code points) that could be<br>> added to IDN<br>> in the future<br>><br>
> I agree with Ken that as far as the implementer is concerned,<br>> class #1 is<br>> the key issue. And thus my main trepidation about spending<br>> time on #2 is<br>> just that it diverts us from #1. If people really felt that #2
<br>> was important<br>> for development, I'd suggest using for a basis the following<br>> set:<br>><br>> - Pattern_Syntax<br>> - minus "-"<br>> - plus ASCII characters currently disallowed by IDN (that
<br>> is, ASCII<br>> except -, a-z, A-Z, 0-9<br>> - plus control & format characters (except for ZWJ, ZWNJ)<br><br>Mark,<br><br>I don't know whether we are far apart or not, but let me<br>identify at least one difference in perspective/ assumptions.
<br><br>We have an external mandate to get the symbols, drawing<br>characters, punctuation, dingbats, etc., forever out of IDNs.<br>"Out" as in "banned from registration, banned from lookup".<br>That list, in terms of number of code points, is somewhat larger
<br>than the one you have suggested above. It is also likely to<br>grow if you add characters of those varieties to future versions<br>of Unicode.<br><br>If one could assume that those characters could be handled by<br>simply banning their registrations, then I would agree with you
<br>and Ken -- that "banned" ("#2") list would not be a matter of<br>great concern, especially for implementers. But, as we have<br>discussed in another context, there is no enforcement mechanism<br>that permits us to assume that all registries, at all levels of
<br>the DNS tree, will behave reasonably, nor that some of these<br>characters will not turn out to be good ways to spoof other<br>things (the standard example for this has become "things that<br>look like '/'", but there are others -- how many depends on how
<br>paranoid one is and what assumptions are made about fonts and<br>glyphs).<br><br>As far as the middle-dot is concerned as an example of why one<br>can't do this, I believe it is an example of something else --<br>
something that goes back to the intent of the original IETF-UTC<br>agreement about stability. To get away from that particular<br>example, if you identify MARTIAN LEFT WIGGLE at U+90005 as<br>punctuation in one version of Unicode, and then change your
<br>minds and decide it is really a letter (with or without some<br>specific adjacency requirements), our expectation is that you<br>will deprecate it in place and allocate a new MARTIAN LETTER<br>LEFT WIGGLE at some other code point. That new code point would
<br>then go into either "pending" or "ok", depending on other<br>decisions.<br><br>So I don't see it as a problem if the UTC can accept the<br>position that, as long as applications of various sorts are
<br>dependent on the property list associated with a given<br>character, you cannot, in general, change the properties: a<br>serious enough mistake means that you need to allocate a new<br>code point with a new set of properties. If that isn't a
<br>reasonable model, then I think we are in considerable trouble.<br><br> john<br><br></blockquote></div><br><br clear="all"><br>-- <br>Mark