Tables and contextual rule for IDEOGRAPHIC ITERATION MARKs
John C Klensin
klensin at jck.com
Fri Apr 10 17:24:13 CEST 2009
This all makes sense. The information that 人人 would be
orthographically wrong is one of the important bits I wanted to
confirm. Based on your note and Yoneya-san's, I think we should
get the iteration marks out of the CONTEXT category entirely,
making the vertical ones DISALLOWED and the others that are in
thanks to both you, Yoneya-san, and the others who have
commented for your patience.
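
To make the classification discussed in this thread concrete, here is a minimal sketch in Python. It looks up the Unicode general category of the code points mentioned and applies only the rule of thumb under discussion: vertical repeat marks are disallowed, and other Lm characters default to PVALID. The helper name `sketch_disposition` and the set `VERTICAL_MARKS` are hypothetical, and including U+303B among the disallowed marks is an assumption, since the thread leaves that question open. This is not the Tables algorithm itself.

```python
import unicodedata

# Vertical repeat marks named in the thread (U+3031-3035). Treating U+303B
# the same way is an assumption of this sketch, not a settled decision.
VERTICAL_MARKS = set(range(0x3031, 0x3036)) | {0x303B}


def sketch_disposition(cp: int) -> str:
    """Hypothetical helper: applies only the rule of thumb discussed here."""
    if cp in VERTICAL_MARKS:
        return "DISALLOWED"
    if unicodedata.category(chr(cp)) == "Lm":
        # Modifier letters default to PVALID absent other considerations.
        return "PVALID"
    # Anything else (e.g. Po punctuation such as U+30FB) is outside this
    # sketch and would be handled by exception rules.
    return "OTHER"


for cp in (0x3005, 0x303B, 0x3031, 0x3035, 0x30FB):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.category(ch)} "
          f"{sketch_disposition(cp)} {unicodedata.name(ch)}")
```

The general categories come from whatever Unicode version the local Python ships with, so they may differ from the version the tables were derived from.
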
--On Friday, April 10, 2009 11:24 +0900 "Martin J. Dürst"
<duerst at it.aoyama.ac.jp> wrote:
> Hello John,
> On 2009/04/09 19:10, John C Klensin wrote:
>> --On Thursday, April 09, 2009 16:59 +0900 "Martin J. Dürst"
>> <duerst at it.aoyama.ac.jp> wrote:
>>> I understand that there is a desire to add some context
>>> constraints for middle dot, but I don't understand why we
>>> need constraints for Ideographic Iteration Mark. In my
>>> opinion, the context given by Yoshiro is correct, but the
>>> chance that this character gets confused with something else
>>> is as big or as little as any other randomly picked
>>> character, so I don't see why we would need context. Is it
>>> that this is a punctuation character, that we can only
>>> exceptionally include punctuation characters, and only if
>>> they have context?
>> Middle dot (U+30FB) is a punctuation character (Po), so it is
>> allowed only by exception and, for the reasons mentioned
>> earlier, it makes sense to make the exception as narrow as
>> I no longer remember why we treated U+3005 as requiring
>> context. It is Lm in the tables, which brings it under
>> Category A (Section 2.1) in Tables, so, absent other
>> considerations, it ought to default to PVALID. I note that
>> there are several other iteration marks that are just PVALID.
>> I imagine that U+3005 was called out for special treatment
>> because the Unicode Standard identifies it as part of a "CJK
>> Symbols and Punctuation" block (see page 830 of TUS 5.0). Its
>> presence in the Contextual rule list may consequently be an
>> artifact of the time in which we were still treating the
>> Unicode block structure as significant.
>> On a fast scan, there doesn't seem to be anything in
>> Stringprep that calls it out for special treatment. At least
>> at the registry level, none of the iteration marks appear to
>> be Preferred Variants for Chinese (see
>> ml or the identical table for .TW), some, but not all, of them
>> appear in the .JP Preferred Variants list of Japanese (see
>> ml). .KR has filed only a Hangul table with IANA, so I can
>> make no inferences there.
>> So, if I can ask your indulgence to satisfy my curiosity and
>> slightly reduce my ignorance,
>> (i) Are these iteration marks used with Japanese only
>> (out of the CJK script group)?
> I don't remember having seen it in Chinese, and I have seen
> explicit character repetition in Chinese, but I rarely look at
> Chinese (and don't read it), so that doesn't mean too much.
> also lists it as a Japanese-only character.
>> (ii) How are they used? It may be just an incorrect
>> inference from terminology, but, if I saw something
>> called an "iteration mark", I'd normally expect it to be
>> associated with a numeral that would tell me how many
>> copies of an associated character or string to infer.
> That's thinking too far. 々 (U+3005) is simply used to repeat
> the previous character. So 人 (hito) means man, person and
> 人々 (hitobito, note the assimilation from h to b) means
> men, people (only used in certain cases; in general, 人 can
> be used for plural, too). 人々 may have originally been written
> 人人, but these days, that would be orthographically wrong.
> There is no device e.g. for a threefold repetition, which is
> not too surprising, because such repetitions don't occur in
> practice. See also http://en.wiktionary.org/wiki/々.
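
The repetition rule described above can be illustrated with a short sketch in Python. The function name `expand_iteration_marks` is hypothetical. It only replaces each 々 with the character before it and does not model the change in reading (hito to hitobito), nor does it claim that the expanded spelling is orthographically acceptable.

```python
ITERATION_MARK = "\u3005"  # 々 IDEOGRAPHIC ITERATION MARK


def expand_iteration_marks(text: str) -> str:
    """Hypothetical helper: replace each 々 with the preceding character."""
    out = []
    for ch in text:
        if ch == ITERATION_MARK and out:
            # Repeat whatever character was emitted just before the mark.
            out.append(out[-1])
        else:
            # Ordinary characters, and a mark with nothing before it,
            # pass through unchanged.
            out.append(ch)
    return "".join(out)


print(expand_iteration_marks("人々"))
```
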
>> (iii) Is there any possible reason why some of the
>> iteration marks should be treated as PVALID and others
>> should be CONTEXTO?
> Not as far as I can imagine. There are good reasons for
> having some PVALID, and there are good reasons for having
> others disallowed, but not CONTEXTO.
>> (iv) If "vertical" really means that, is U+303B needed
>> in domain names at all? Are they ever, in practice,
>> written vertically? I note that the .JP table
>> (reference above) does not permit that character at all.
>> If it is not used, not useful, and could cause
>> conceptual confusion (can it?), then should it be
>> DISALLOWED rather than PVALID or CONTEXTO?
> I think Yoshiro already said that the vertical ones are not
> needed and should be disallowed. That applies to all of
> U+3031-3035. They are used only in vertical text, and
> therefore don't work for domain names (which are usually
>> I think that this takes us in the direction of removing U+3005
>> and U+303B from the exception list, letting them fall into
>> PVALID because of their Lm classification (unless U+303B
>> should be DISALLOWED as discussed above). But, to the extent
>> possible, it would be good to understand a bit more about the
>> situation first, even though this takes us rather far into the
>> character-by-character analysis that we try to avoid.
> If we don't want to go too far with character-by-character
> analysis, we can leave the business of excluding U+3031-3035
> to registries.
> Regards, Martin.