Tables and contextual rule for IDEOGRAPHIC ITERATION MARKs

John C Klensin klensin at jck.com
Thu Apr 9 12:10:19 CEST 2009



--On Thursday, April 09, 2009 16:59 +0900 "Martin J. Dürst"
<duerst at it.aoyama.ac.jp> wrote:

> I understand that there is a desire to add some context
> constraints for middle dot, but I don't understand why we
> need constraints for Ideographic Iteration Mark. In my
> opinion, the context given by Yoshiro is correct, but the
> chance that this character gets confused with something else
> is as big or as little as any other randomly picked
> character, so I don't see why we would need context. Is it
> that this is a punctuation character, that we can only
> exceptionally include punctuation characters, and only if
> they have context?

Middle dot (U+30FB) is a punctuation character (Po), so it is
allowed only by exception and, for the reasons mentioned
earlier, it makes sense to make the exception as narrow as
possible.

I no longer remember why we treated U+3005 as requiring context.
It is Lm in the tables, which brings it under Category A
(Section 2.1) in Tables, so, absent other considerations, it
ought to default to PVALID.  I note that there are several other
iteration marks that are just PVALID.  I imagine that U+3005 was
called out for special treatment because the Unicode Standard
identifies it as part of a "CJK Symbols and Punctuation" block
(see page 830 of TUS 5.0). Its presence in the Contextual rule
list may consequently be an artifact of the time in which we
were still treating the Unicode block structure as significant.
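The classification argument above can be checked mechanically.
The following is a minimal sketch, not the Tables algorithm
itself: it uses Python's standard unicodedata module, the
LETTER_DIGITS set is an assumption about which general categories
Category A covers, and default_status is a hypothetical helper
that ignores the exception list and every other rule.

```python
import unicodedata

# Assumed membership of Category A (Section 2.1 of Tables).
LETTER_DIGITS = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}

def default_status(cp):
    """Hypothetical helper: classify by general category alone."""
    cat = unicodedata.category(chr(cp))
    return cat, ("PVALID" if cat in LETTER_DIGITS else "not Category A")

# U+3005 and U+303B are the ideographic iteration marks, U+309D and
# U+30FD are kana iteration marks, U+30FB is the middle dot.
for cp in (0x3005, 0x303B, 0x309D, 0x30FD, 0x30FB):
    cat, status = default_status(cp)
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {cat} -> {status}")
```

Under those assumptions the iteration marks come out as Lm and
so default to PVALID, while the middle dot comes out as Po and
does not.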

On a fast scan, there doesn't seem to be anything in Stringprep
that calls it out for special treatment.  At least at the
registry level, none of the iteration marks appear to be
Preferred Variants for Chinese (see
http://www.iana.org/domains/idn-tables/tables/cn_zh-cn_4.0.html
or the identical table for .TW).  Some, but not all, of them
appear in the .JP Preferred Variants list for Japanese (see
http://www.iana.org/domains/idn-tables/tables/jp_ja-jp_1.2.html).
.KR has filed only a Hangul table with IANA, so I can make no
inferences there.

So, if I can ask your indulgence to satisfy my curiosity and
slightly reduce my ignorance:

	(i) Are these iteration marks used with Japanese only
	(out of the CJK script group)?
	
	(ii) How are they used?   It may be just an incorrect
	inference from terminology, but, if I saw something
	called an "iteration mark", I'd normally expect it to be
	associated with a numeral that would tell me how many
	copies of an associated character or string to infer.
	With the understanding that this is strictly a registry
	matter (unless someone comes forward with a very strong
	argument for DISALLOWing something), wouldn't that
	require a rather special normalization process to
	prevent conceptually-identical strings being treated
	differently?  Or, if my model is correct, would the
	relevant iteration mark always be required rather than
	repeating whatever it is that is repeated?
	
	(iii) Is there any possible reason why some of the
	iteration marks should be treated as PVALID and others
	should be CONTEXTO? 

	(iv) If "vertical" really means that, is U+303B needed
	in domain names at all?  Are domain names ever, in practice,
	written vertically?  I note that the .JP table
	(reference above) does not permit that character at all.
	If it is not used, not useful, and could cause
	conceptual confusion (can it?), then should it be
	DISALLOWED rather than PVALID or CONTEXTO?
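To make question (ii) concrete, here is a minimal sketch of the
kind of special normalization it has in mind.  It assumes, purely
for illustration, that U+3005 stands for one repetition of the
character immediately before it (so U+4EBA U+3005 would be
equivalent to U+4EBA U+4EBA) rather than for a counted
repetition; expand_iteration_marks is a hypothetical
registry-side helper, not part of any IDNA specification.

```python
import unicodedata

ITERATION_MARK = "\u3005"  # IDEOGRAPHIC ITERATION MARK

def expand_iteration_marks(label):
    """Hypothetical folding: replace each U+3005 with the character before it."""
    out = []
    for ch in label:
        if ch == ITERATION_MARK and out:
            out.append(out[-1])   # repeat the preceding character
        else:
            out.append(ch)        # a leading mark has nothing to repeat
    return "".join(out)

with_mark = "\u4eba\u3005"   # ideograph followed by the iteration mark
spelled_out = "\u4eba\u4eba"  # the same ideograph written twice

# Standard NFKC leaves the mark alone, so the two labels stay distinct.
print(unicodedata.normalize("NFKC", with_mark) == spelled_out)
# The hypothetical folding maps both labels to the same string.
print(expand_iteration_marks(with_mark) == spelled_out)
```

If the assumption holds, a registry wanting the two forms to be
treated as the same label would need something like this folding
(or a variant rule) in addition to the standard normalization.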

I think that this takes us in the direction of removing U+3005
and U+303B from the exception list, letting them fall into
PVALID because of their Lm classification (unless U+303B should
be DISALLOWED as discussed above).  But, to the extent possible,
it would be good to understand a bit more about the situation
first, even though this takes us rather far into the
character-by-character analysis that we try to avoid.

   best,
    john








