Tables and contextual rule for IDEOGRAPHIC ITERATION MARKs

John C Klensin klensin at jck.com
Fri Apr 10 17:24:13 CEST 2009


Martin,

This all makes sense.  The information that 人人、would be
orthographically wrong is one of the important bits I wanted to
confirm.  Based on your note and Yoneya-san's, I think we should
get the iteration marks out of the CONTEXT category entirely,
making the vertical ones DISALLOWED and the others that are in
Lm PVALID.
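
To make that proposal concrete, here is a rough sketch in Python. It is not part of the Tables derivation. The names VERTICAL, CANDIDATES, and proposed_status are made up for this illustration, and treating U+303B together with U+3031-3035 as "the vertical ones" is an assumption based on the discussion quoted below. The only library call is unicodedata.category from the standard library.

```python
import unicodedata

# Assumption for this sketch: "the vertical ones" are U+3031-3035 plus U+303B.
VERTICAL = {0x3031, 0x3032, 0x3033, 0x3034, 0x3035, 0x303B}

# Code points named in this thread (hypothetical list, not a complete inventory).
CANDIDATES = [0x3005, 0x3031, 0x3032, 0x3033, 0x3034, 0x3035, 0x303B]


def proposed_status(cp: int) -> str:
    """Return the status proposed in this note for one iteration mark."""
    if cp in VERTICAL:
        return "DISALLOWED"
    if unicodedata.category(chr(cp)) == "Lm":
        # Lm falls under Category A in Tables, so it defaults to PVALID.
        return "PVALID"
    return "UNDECIDED"  # outside the scope of this note


for cp in CANDIDATES:
    category = unicodedata.category(chr(cp))
    print(f"U+{cp:04X} {category} {proposed_status(cp)}")
```

Under these assumptions only U+3005 comes out PVALID, and none of the listed code points needs a contextual rule.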

Thanks to you, Yoneya-san, and the others who have
commented for your patience.

    john


--On Friday, April 10, 2009 11:24 +0900 "\"Martin J. Dürst\""
<duerst at it.aoyama.ac.jp> wrote:

> Hello John,
> 
> On 2009/04/09 19:10, John C Klensin wrote:
>> 
>> --On Thursday, April 09, 2009 16:59 +0900 "\"Martin J.
>> Dürst\"" <duerst at it.aoyama.ac.jp>  wrote:
>> 
>>> I understand that there is a desire to add some context
>>> constraints for middle dot, but I don't understand why we
>>> need constraints for Ideographic Iteration Mark. In my
>>> opinion, the context given by Yoshiro is correct, but the
>>> chance that this character gets confused with something else
>>> is as big or as little as for any other randomly picked
>>> character, so I don't see why we would need context. Is it
>>> that this is a punctuation character, that we can only
>>> exceptionally include punctuation characters, and only if
>>> they have context?
>> 
>> Middle dot (U+30FB) is a punctuation character (Po), so it is
>> allowed only by exception and, for the reasons mentioned
>> earlier, it makes sense to make the exception as narrow as
>> possible.
> 
> Agreed.
> 
>> I no longer remember why we treated U+3005 as requiring
>> context. It is Lm in the tables, which brings it under
>> Category A (Section 2.1) in Tables, so, absent other
>> considerations, it ought to default to PVALID.  I note that
>> there are several other iteration marks that are just PVALID.
>> I imagine that U+3005 was called out for special treatment
>> because the Unicode Standard identifies it as part of a "CJK
>> Symbols and Punctuation" block (see page 830 of TUS 5.0). Its
>> presence in the Contextual rule list may consequently be an
>> artifact of the time in which we were still treating the
>> Unicode block structure as significant.
>> 
>> On a fast scan, there doesn't seem to be anything in
>> Stringprep that calls it out for special treatment.  At least
>> at the registry level, none of the iteration marks appear to
>> be Preferred Variants for Chinese (see
>> http://www.iana.org/domains/idn-tables/tables/cn_zh-cn_4.0.html
>> or the identical table for .TW); some, but not all, of them
>> appear in the .JP Preferred Variants list for Japanese (see
>> http://www.iana.org/domains/idn-tables/tables/jp_ja-jp_1.2.html).
>> .KR has filed only a Hangul table with IANA, so I can
>> make no inferences there.
>> 
>> So, if I can ask your indulgence to satisfy my curiosity and
>> slightly reduce my ignorance,
>> 
>> 	(i) Are these iteration marks used with Japanese only
>> 	(out of the CJK script group)?
> 
> I don't remember having seen it in Chinese, and I have seen
> explicit character repetition in Chinese, but I rarely look at
> Chinese (and don't read it), so that doesn't mean too much.
> But
> http://en.wiktionary.org/wiki/Category:Japanese-only_CJKV_Characters
> also lists it as a Japanese-only character.
> 
>> 	(ii) How are they used?   It may be just an incorrect
>> 	inference from terminology, but, if I saw something
>> 	called an "iteration mark", I'd normally expect it to be
>> 	associated with a numeral that would tell me how many
>> 	copies of an associated character or string to infer.
> 
> That's thinking too far. 々 (U+3005) is simply used to repeat
> the previous character. So 人 (hito) means man, person and
> 人々 (hitobito, note the assimilation from h to b) means
> men, people (only used in certain cases; in general, 人 can
> be used for plural, too). 人々 may have originally been written
> 人人、but these days, that would be orthographically wrong.
> There is no device e.g. for a threefold repetition, which is
> not too surprising, because such repetitions don't occur in
> practice. See also http://en.wiktionary.org/wiki/々.
> 
>> 	(iii) Is there any possible reason why some of the
>> 	iteration marks should be treated as PVALID and others
>> 	should be CONTEXTO?
> 
> Not as far as I can imagine. There are good reasons for
> having some PVALID, and there are good reasons for having
> others disallowed, but not CONTEXTO.
> 
>> 	(iv) If "vertical" really means that, is U+303B needed
>> 	in domain names at all?  Are they ever, in practice,
>> 	written vertically?  I note that the .JP table
>> 	(reference above) does not permit that character at all.
>> 	If it is not used, not useful, and could cause
>> 	conceptual confusion (can it?), then should it be
>> 	DISALLOWED rather than PVALID or CONTEXTO?
> 
> I think Yoshiro already said that the vertical ones are not
> needed and should be disallowed. That applies to all of
> U+3031-3035. They are only used in vertical text, and
> therefore don't work for domain names (which are usually
> horizontal).
> 
> 
>> I think that this takes us in the direction of removing U+3005
>> and U+303B from the exception list, letting them fall into
>> PVALID because of their Lm classification (unless U+303B
>> should be DISALLOWED as discussed above).  But, to the extent
>> possible, it would be good to understand a bit more about the
>> situation first, even though this takes us rather far into the
>> character-by-character analysis that we try to avoid.
> 
> If we don't want to go too far with character-by-character
> analysis, we can leave the business of excluding U+3031-3035
> to registries.
> 
> Regards,    Martin.





