Tables and contextual rule for IDEOGRAPHIC ITERATION MARKs
"Martin J. Dürst"
duerst at it.aoyama.ac.jp
Fri Apr 10 04:24:13 CEST 2009
On 2009/04/09 19:10, John C Klensin wrote:
> --On Thursday, April 09, 2009 16:59 +0900 "Martin J. Dürst"
> <duerst at it.aoyama.ac.jp> wrote:
>> I understand that there is a desire to add some context
>> constraints for middle dot, but I don't understand why we
>> need constraints for Ideographic Iteration Mark. In my
>> opinion, the context given by Yoshiro is correct, but the
>> chance that this character gets confused with something else
>> is as big or as little as any other randomly picked
>> character, so I don't see why we would need context. Is it
>> that this is a punctuation character, that we can only
>> exceptionally include punctuation characters, and only if
>> they have context?
> Middle dot (U+30FB) is a punctuation character (Po), so it is
> allowed only by exception and, for the reasons mentioned
> earlier, it makes sense to make the exception as narrow as
> I no longer remember why we treated U+3005 as requiring context.
> It is Lm in the tables, which brings it under Category A
> (Section 2.1) in Tables, so, absent other considerations, it
> ought to default to PVALID. I note that there are several other
> iteration marks that are just PVALID. I imagine that U+3005 was
> called out for special treatment because the Unicode Standard
> identifies it as part of a "CJK Symbols and Punctuation" block
> (see page 830 of TUS 5.0). Its presence in the Contextual rule
> list may consequently be an artifact of the time in which we
> were still treating the Unicode block structure as significant.
> On a fast scan, there doesn't seem to be anything in Stringprep
> that calls it out for special treatment. At least at the
> registry level, none of the iteration marks appear to be
> Preferred Variants for Chinese (see
> or the identical table for .TW), some, but not all, of them
> appear in the .JP Preferred Variants list of Japanese (see
> .KR has filed only a Hangul table with IANA, so I can make no
> inferences there.
> So, if I can ask your indulgence to satisfy my curiosity and
> slightly reduce my ignorance,
> (i) Are these iteration marks used with Japanese only
> (out of the CJK script group)?
I don't remember having seen it in Chinese, and I have seen explicit
character repetition in Chinese, but I rarely look at Chinese (and don't
read it), so that doesn't mean too much.
also lists it as a Japanese-only character.
> (ii) How are they used? It may be just an incorrect
> inference from terminology, but, if I saw something
> called an "iteration mark", I'd normally expect it to be
> associated with a numeral that would tell me how many
> copies of an associated character or string to infer.
That's thinking too far. 々 (U+3005) is simply used to repeat the
previous character. So 人 (hito) means man, person and 人々 (hitobito,
note the assimilation from h to b) means men, people (only used in
certain cases; in general, 人 can be used for the plural, too). 人々 may
have originally been written 人人, but these days that would be
orthographically wrong. There is no device e.g. for a threefold
repetition, which is not too surprising, because such repetitions don't
occur in practice. See also http://en.wiktionary.org/wiki/々.
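
As a rough illustration of "repeat the previous character", here is a
minimal Python sketch. The function name expand_iteration_marks is made
up for this example, and it models only the written form, not readings
(it knows nothing about hito becoming hitobito):

```python
ITERATION_MARK = "\u3005"  # 々 IDEOGRAPHIC ITERATION MARK


def expand_iteration_marks(text: str) -> str:
    """Replace each 々 with a copy of the character written before it.

    Hypothetical helper for illustration only. A mark with nothing in
    front of it is left unchanged.
    """
    out = []
    for ch in text:
        if ch == ITERATION_MARK and out:
            # Copy the most recent character already in the output.
            out.append(out[-1])
        else:
            out.append(ch)
    return "".join(out)


print(expand_iteration_marks("人々"))
```

With the input above, the mark is replaced by a second 人, i.e. the
explicit spelling that is no longer considered correct orthography.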
> (iii) Is there any possible reason why some of the
> iteration marks should be treated as PVALID and others
> should be CONTEXTO?
Not as far as I can imagine. There are good reasons for having some
PVALID, and there are good reasons for having others disallowed, but not
> (iv) If "vertical" really means that, is U+303B needed
> in domain names at all? Are they ever, in practice,
> written vertically? I note that the .JP table
> (reference above) does not permit that character at all.
> If it is not used, not useful, and could cause
> conceptual confusion (can it?), then should it be
> DISALLOWED rather than PVALID or CONTEXTO?
I think Yoshiro already said that the vertical ones are not needed and
should be disallowed. That applies to all of U+3031-3035. They are
used only in vertical text, and therefore don't work for domain
names (which are usually horizontal).
> I think that this takes us in the direction of removing U+3005
> and U+303B from the exception list, letting them fall into
> PVALID because of their Lm classification (unless U+303B should
> be DISALLOWED as discussed above). But, to the extent possible,
> it would be good to understand a bit more about the situation
> first, even though this takes us rather far into the
> character-by-character analysis that we try to avoid.
If we don't want to go too far with character-by-character analysis, we
can leave the business of excluding U+3031-3035 to registries.
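
For anyone who wants to check the general categories mentioned in this
thread, a small Python sketch follows. It only reports the General
Category from whatever Unicode version the local Python ships with; it
does not implement the Tables rules, and the helper name
general_categories is just something chosen for this example:

```python
import unicodedata

# Code points discussed in this thread.
CODE_POINTS = [0x3005, 0x303B, 0x30FB, 0x3031, 0x3032, 0x3033, 0x3034, 0x3035]


def general_categories(code_points):
    """Map 'U+XXXX' labels to Unicode General Category values.

    Hypothetical helper; the values come from Python's bundled Unicode
    database, which may be newer than Unicode 5.0.
    """
    return {f"U+{cp:04X}": unicodedata.category(chr(cp)) for cp in code_points}


for label, category in general_categories(CODE_POINTS).items():
    print(label, category)
```

A category of Lm is what puts a character under Category A by default;
whether it then ends up PVALID, CONTEXTO or DISALLOWED still depends on
the exception list discussed above.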
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst at it.aoyama.ac.jp