Tables and contextual rule for IDEOGRAPHIC ITERATION MARKs

Fri Apr 10 04:24:13 CEST 2009

Hello John,

On 2009/04/09 19:10, John C Klensin wrote:
>
> --On Thursday, April 09, 2009 16:59 +0900 "Martin J. Dürst"
> <duerst at it.aoyama.ac.jp>  wrote:
>
>> I understand that there is a desire to add some context
>> constraints for middle dot, but I don't understand why we
>> need constraints for Ideographic Iteration Mark. In my
>> opinion, the context given by Yoshiro is correct, but the
>> chance that this character gets confused with something else
>> is as big or as little as for any other randomly picked
>> character, so I don't see why we would need context. Is it
>> that this is a punctuation character, that we can only
>> exceptionally include punctuation characters, and only if
>> they have context?
>
> Middle dot (U+30FB) is a punctuation character (Po), so it is
> allowed only by exception and, for the reasons mentioned
> earlier, it makes sense to make the exception as narrow as
> possible.

Agreed.

> I no longer remember why we treated U+3005 as requiring context.
> It is Lm in the tables, which brings it under Category A
> (Section 2.1) in Tables, so, absent other considerations, it
> ought to default to PVALID.  I note that there are several other
> iteration marks that are just PVALID.  I imagine that U+3005 was
> called out for special treatment because the Unicode Standard
> identifies it as part of a "CJK Symbols and Punctuation" block
> (see page 830 of TUS 5.0). Its presence in the Contextual rule
> list may consequently be an artifact of the time in which we
> were still treating the Unicode block structure as significant.
>
> On a fast scan, there doesn't seem to be anything in Stringprep
> that calls it out for special treatment.  At least at the
> registry level, none of the iteration marks appear to be
> Preferred Variants for Chinese (see
> http://www.iana.org/domains/idn-tables/tables/cn_zh-cn_4.0.html
> or the identical table for .TW); some, but not all, of them
> appear in the .JP Preferred Variants list for Japanese (see
> http://www.iana.org/domains/idn-tables/tables/jp_ja-jp_1.2.html).
> .KR has filed only a Hangul table with IANA, so I can make no
> inferences there.
>
> So, if I can ask your indulgence to satisfy my curiosity and
> slightly reduce my ignorance,
>
> 	(i) Are these iteration marks used with Japanese only
> 	(out of the CJK script group)?

I don't remember having seen it in Chinese, and I have seen explicit
character repetition in Chinese, but I rarely look at Chinese (and don't
read it), so that doesn't mean too much.
But http://en.wiktionary.org/wiki/Category:Japanese-only_CJKV_Characters
also lists it as a Japanese-only character.

> 	(ii) How are they used?   It may be just an incorrect
> 	inference from terminology, but, if I saw something
> 	called an "iteration mark", I'd normally expect it to be
> 	associated with a numeral that would tell me how many
> 	copies of an associated character or string to infer.

That's thinking too far. 々 (U+3005) is simply used to repeat the
previous character. So 人 (hito) means man, person, and 人々 (hitobito;
note the assimilation from h to b) means men, people (only used in
certain cases; in general, 人 can be used for the plural, too). 人々 may
have originally been written 人人, but these days that would be
orthographically wrong. There is no device for, e.g., a threefold
repetition, which is not too surprising, because such repetitions don't
occur in practice. See also http://en.wiktionary.org/wiki/々.
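
The repetition rule described above can be sketched in a few lines of
Python. This is only an illustration of the orthographic convention, not
anything that IDNA processing does; the function name
expand_iteration_marks is made up for this example, and leaving a leading
々 (with nothing to repeat) untouched is an assumption of the sketch.

```python
ITERATION_MARK = "\u3005"  # 々 IDEOGRAPHIC ITERATION MARK


def expand_iteration_marks(text: str) -> str:
    """Replace each 々 with a copy of the character that precedes it.

    Hypothetical helper for illustration only. A 々 at the very start of
    the string has nothing to repeat and is kept as-is.
    """
    out = []
    for ch in text:
        if ch == ITERATION_MARK and out:
            # Repeat the most recent character already written out.
            out.append(out[-1])
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(expand_iteration_marks("人々"))  # prints 人人
```

The sketch works purely on characters; it says nothing about
pronunciation changes such as hito + hito becoming hitobito.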

> 	(iii) Is there any possible reason why some of the
> 	iteration marks should be treated as PVALID and others
> 	should be CONTEXTO?

Not as far as I can imagine. There are good reasons for having some
PVALID, and there are good reasons for having others disallowed, but not
CONTEXTO.

> 	(iv) If "vertical" really means that, is U+303B needed
> 	in domain names at all?  Are they ever, in practice,
> 	written vertically?  I note that the .JP table
> 	(reference above) does not permit that character at all.
> 	If it is not used, not useful, and could cause
> 	conceptual confusion (can it?), then should it be
> 	DISALLOWED rather than PVALID or CONTEXTO?

I think Yoshiro already said that the vertical ones are not needed and
should be disallowed. That applies to all of U+3031-3035. They are
used only in vertical text, and therefore don't work for domain
names (which are usually horizontal).

> I think that this takes us in the direction of removing U+3005
> and U+303B from the exception list, letting them fall into
> PVALID because of their Lm classification (unless U+303B should
> be DISALLOWED as discussed above).  But, to the extent possible,
> it would be good to understand a bit more about the situation
> first, even though this takes us rather far into the
> character-by-character analysis that we try to avoid.

If we don't want to go too far with character-by-character analysis, we 
can leave the business of excluding U+3031-3035 to registries.
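
For anyone who wants to check the General Category values mentioned in
this thread, here is a minimal sketch using Python's standard unicodedata
module. It only looks up names and categories for the code points
discussed above; it does not implement the Tables rules or any contextual
rule, and the results reflect whatever Unicode version the local Python
ships with. The helper name describe and the list DISCUSSED are made up
for this example.

```python
import unicodedata

# Code points mentioned in this thread (the list is an assumption of this
# sketch, not an official table).
DISCUSSED = [0x30FB, 0x3005, 0x303B] + list(range(0x3031, 0x3036))


def describe(cp: int) -> tuple:
    """Return (character name, General Category) for a code point."""
    ch = chr(cp)
    return unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch)


if __name__ == "__main__":
    for cp in DISCUSSED:
        name, category = describe(cp)
        print(f"U+{cp:04X}  {category}  {name}")
```

In the output, a category of Lm is what would put a character under
Category A of Tables by default, and Po is what makes middle dot an
exception case, as discussed above.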

Regards,    Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst at it.aoyama.ac.jp