Reserved general punctuation

John C Klensin klensin at jck.com
Wed Apr 30 15:02:05 CEST 2008



--On Wednesday, 30 April, 2008 04:16 -0700 Vint Cerf
<vint at google.com> wrote:

> My naïve assumption is that anything unassigned has the
> potential to become assigned so we need to have a state in
> which the code point is not allowed for current use but could
> be permitted at a later time. Do we have the semantics to
> accommodate that? V

Short answer: No.  I presume that is why we are having this
discussion.

Longer answer:

While we have concluded that the problems MAYBE categories would
cause outweigh their advantages, these areas of uncertainty are a
large part of what motivated having them.

I think that putting anything into UNASSIGNED that isn't
actually unassigned (i.e., given no code point assignment in the
then-current version of Unicode) is looking for trouble.  As you
point out, such code points have the potential to become
assigned.  While one might make some educated guesses from the
block context in which the code point is located, we can't
predict, with 100% certainty, the properties that a code point
will have if and when it is assigned in the future.

So, for a code point that is actually assigned, I think we have
only three choices:

	* Allow it, as Protocol-Valid.  For general punctuation
	this is, I hope obviously, not a good idea.
	
	* Disallow it and assume that, if we discover we need it
	enough later, we will do whatever drastic revisions or
	disaster corrections are required.  Of course, that sets
	a very high bar to ever allowing those characters, but
	that may not be unreasonable.
	
	* Assign it to "context required" but do not assign a
	rule.   Under the current proposed model, that means
	that it can neither be registered nor looked up.  On the
	other hand, we could, in the future, allow it in the
	cases where it is actually required by assigning an
	appropriate rule and then waiting for software to be
	upgraded (something that would presumably happen more
	quickly in places where the character is important than
	in places where it isn't).
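
To make the third choice concrete, here is a minimal sketch of the
behaviour described above: a code point in a "context required"
category with no rule can be neither registered nor looked up, and
becomes usable only once a rule is supplied.  The names Category,
is_permitted and between_letters, the toy category table, and the
example rule are all hypothetical and for illustration only; none of
them is taken from the drafts.

```python
from enum import Enum
from typing import Callable, Dict


class Category(Enum):
    PVALID = "PROTOCOL-VALID"
    DISALLOWED = "DISALLOWED"
    CONTEXTO = "CONTEXT-OTHER"
    UNASSIGNED = "UNASSIGNED"


# A contextual rule inspects the whole label and the position of the
# character in question, and says whether it is acceptable there.
ContextRule = Callable[[str, int], bool]


def is_permitted(label: str, index: int,
                 categories: Dict[int, Category],
                 rules: Dict[int, ContextRule]) -> bool:
    """Return True if label[index] may be registered or looked up."""
    cp = ord(label[index])
    # Anything missing from this toy table is treated as UNASSIGNED.
    category = categories.get(cp, Category.UNASSIGNED)
    if category is Category.PVALID:
        return True
    if category is Category.CONTEXTO:
        # No rule assigned: the code point is effectively unusable.
        rule = rules.get(cp)
        return rule is not None and rule(label, index)
    # DISALLOWED and UNASSIGNED are both rejected.
    return False


# Toy table (assumption): 'a' is valid, '!' is disallowed, and
# U+2060 is "context required" with no rule yet.
categories = {
    ord("a"): Category.PVALID,
    ord("!"): Category.DISALLOWED,
    0x2060: Category.CONTEXTO,
}
rules: Dict[int, ContextRule] = {}
label = "a\u2060a"

before = is_permitted(label, 1, categories, rules)


def between_letters(label: str, index: int) -> bool:
    # Hypothetical rule: allowed only between two letters.
    return (0 < index < len(label) - 1
            and label[index - 1].isalpha()
            and label[index + 1].isalpha())


rules[0x2060] = between_letters  # a later revision assigns a rule
after = is_permitted(label, 1, categories, rules)
```

In this sketch, moving a character out of "context required, no
rule" only means adding an entry to the rules table; the category
table itself does not change, which is the flexibility the third
choice is meant to buy.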

It is that area of flexibility with CONTEXT, especially
CONTEXT-OTHER, where my view that "Disallowed" is permanent,
with no path (or a very difficult one) out of that category,
converges with what I understand of Mark's desire to make
migration out of DISALLOWED relatively easy.  In the middle
ground, we try to identify the characters about which we may be
uncertain and identify them as CONTEXTO with no expectation of
assigning rules unless it turns out that they are really needed.
That approach assumes that we can anticipate characters that
_might_ need to be moved, i.e., characters about which we are
not certain that DISALLOWED is globally correct.  I think that
is probably correct.  Indeed, I believe that, if it is not
correct, this entire approach is built on a house of cards and
we may need to drop it.

And, FWIW, the argument for putting Cf into CONTEXTO precisely
follows the reasoning above -- these odd and sometimes-invisible
cases (see U+2060, 2062..2064; WORD JOINER, INVISIBLE TIMES/
SEPARATOR/ PLUS) are precisely the sorts of thing that someone
might, conceivably, argue passionately are required in some IDN
contexts.   If I correctly understand the use of these
characters, my own view is that I would argue strongly against
permitting them.  But I think it would be better to have that
argument on the basis of a substantive requirement to have the
characters in IDNs versus risks and complexity and not on the
basis of an artifact of how we had defined things.
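
For anyone who wants to check the code points mentioned above, the
following small sketch prints their names and General_Category using
Python's standard unicodedata module.  The results reflect whatever
Unicode version the local Python was built with, and the helper name
describe is just an illustration.

```python
import unicodedata

# The code points cited above: U+2060 and U+2062..U+2064.
CODE_POINTS = [0x2060, 0x2062, 0x2063, 0x2064]


def describe(cp: int) -> tuple:
    """Return (U+XXXX, character name, General_Category) for cp."""
    ch = chr(cp)
    return (f"U+{cp:04X}",
            unicodedata.name(ch, "<unnamed>"),
            unicodedata.category(ch))


for cp in CODE_POINTS:
    print(*describe(cp))
```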

     john



