Tables and contextual rule for IDEOGRAPHIC ITERATION MARKs
John C Klensin
klensin at jck.com
Thu Apr 9 12:10:19 CEST 2009
--On Thursday, April 09, 2009 16:59 +0900 "Martin J. Dürst"
<duerst at it.aoyama.ac.jp> wrote:
> I understand that there is a desire to add some context
> constraints for middle dot, but I don't understand why we
> need constraints for Ideographic Iteration Mark. In my
> opinion, the context given by Yoshiro is correct, but the
> chance that this character gets confused with something else
> is as big or as little as any other randomly picked
> character, so I don't see why we would need context. Is it
> that this is a punctuation character, that we can only
> exceptionally include punctuation characters, and only if
> they have context?
Middle dot (U+30FB) is a punctuation character (Po), so it is
allowed only by exception and, for the reasons mentioned
earlier, it makes sense to make the exception as narrow as
possible.
I no longer remember why we treated U+3005 as requiring context.
It is Lm in the tables, which brings it under Category A
(Section 2.1) in Tables, so, absent other considerations, it
ought to default to PVALID. I note that there are several other
iteration marks that are just PVALID. I imagine that U+3005 was
called out for special treatment because the Unicode Standard
identifies it as part of a "CJK Symbols and Punctuation" block
(see page 830 of TUS 5.0). Its presence in the Contextual rule
list may consequently be an artifact of the time in which we
were still treating the Unicode block structure as significant.
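
The General_Category values mentioned above can be checked
directly. What follows is only a minimal sketch in Python, assuming
the interpreter's bundled unicodedata tables (which track a later
Unicode version than TUS 5.0, so values could in principle differ).
The kana iteration marks in the list are an assumed reading of
"several other iteration marks", and describe() is a hypothetical
helper name.

```python
import unicodedata

# Code points discussed in this thread. The four kana iteration marks
# (U+309D, U+309E, U+30FD, U+30FE) are an assumption about which
# "other iteration marks" are meant.
CODE_POINTS = [0x3005, 0x303B, 0x309D, 0x309E, 0x30FD, 0x30FB]


def describe(cp):
    """Return (U+XXXX label, General_Category, character name)."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.category(ch), unicodedata.name(ch))


for cp in CODE_POINTS:
    print(*describe(cp))
```

With current tables this should report Lm for the iteration marks
and Po for U+30FB, matching the classification described above.
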
On a fast scan, there doesn't seem to be anything in Stringprep
that calls it out for special treatment. At least at the
registry level, none of the iteration marks appear to be
Preferred Variants for Chinese (see
http://www.iana.org/domains/idn-tables/tables/cn_zh-cn_4.0.html
or the identical table for .TW); some, but not all, of them
appear in the .JP Preferred Variants list for Japanese (see
http://www.iana.org/domains/idn-tables/tables/jp_ja-jp_1.2.html).
.KR has filed only a Hangul table with IANA, so I can make no
inferences there.
So, if I can ask your indulgence to satisfy my curiosity and
slightly reduce my ignorance,
(i) Are these iteration marks used with Japanese only
(out of the CJK script group)?
(ii) How are they used? It may be just an incorrect
inference from terminology, but, if I saw something
called an "iteration mark", I'd normally expect it to be
associated with a numeral that would tell me how many
copies of an associated character or string to infer.
With the understanding that this is strictly a registry
matter (unless someone comes forward with a very strong
argument for DISALLOWing something), wouldn't that
require a rather special normalization process to
prevent conceptually-identical strings being treated
differently? Or, if my model is correct, would the
relevant iteration mark always be required rather than
repeating whatever it is that is repeated?
(iii) Is there any possible reason why some of the
iteration marks should be treated as PVALID and others
should be CONTEXTO?
(iv) If "vertical" really means that, is U+303B needed
in domain names at all? Are domain names ever, in practice,
written vertically? I note that the .JP table
(referenced above) does not permit that character at all.
If it is not used, not useful, and could cause
conceptual confusion (can it?), then should it be
DISALLOWED rather than PVALID or CONTEXTO?
I think that this takes us in the direction of removing U+3005
and U+303B from the exception list, letting them fall into
PVALID because of their Lm classification (unless U+303B should
be DISALLOWED as discussed above). But, to the extent possible,
it would be good to understand a bit more about the situation
first, even though this takes us rather far into the
character-by-character analysis that we try to avoid.
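
The fall-through described above can be sketched as follows. This
is a deliberately simplified, hypothetical model and not the Tables
algorithm itself: derived_status(), the two exception dictionaries,
and the exact set of Letter/Digit categories are illustrative
assumptions, and every rule other than "exception first, then
General_Category" is ignored.

```python
import unicodedata

# Assumed set of General_Category values treated as letters/digits
# (the category that Lm falls under); illustrative only.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}

# Hypothetical exception lists: before and after dropping U+3005 and
# U+303B, with the middle dot U+30FB kept as a contextual exception.
EXCEPTIONS_BEFORE = {
    "\u3005": "CONTEXTO",
    "\u303b": "CONTEXTO",
    "\u30fb": "CONTEXTO",
}
EXCEPTIONS_AFTER = {"\u30fb": "CONTEXTO"}


def derived_status(ch, exceptions):
    """Exception list wins; otherwise decide from General_Category."""
    if ch in exceptions:
        return exceptions[ch]
    if unicodedata.category(ch) in LETTER_DIGIT_CATEGORIES:
        return "PVALID"
    return "DISALLOWED"  # simplification: all other rules are skipped


for ch in ("\u3005", "\u303b", "\u30fb"):
    print(f"U+{ord(ch):04X}",
          derived_status(ch, EXCEPTIONS_BEFORE),
          derived_status(ch, EXCEPTIONS_AFTER))
```

Under these assumptions, removing the two entries moves U+3005 and
U+303B from CONTEXTO to PVALID purely on the strength of Lm, while
U+30FB, being Po, would be DISALLOWED without its exception.
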
best,
john