Katakana Middle Dot again (Was: tables-06b.txt: A.5, A.6, A.9)

Kenneth Whistler kenw at sybase.com
Fri Aug 7 21:35:07 CEST 2009


O.k., it looks like I have to wade in on this thread now. :-)

John said:

> If this is really a symbol, punctuation, or spacing mark --as
> the name implies-- then our general principles would argue for
> banning it entirely.

O.k., first let's get this misconception off the table.
The IDEOGRAPHIC CLOSING MARK has *nothing* whatsoever to
do with punctuation. This isn't "CLOSING" in the sense of
"closing punctuation" or anything of the sort.

U+3006 IDEOGRAPHIC CLOSING MARK is an abbreviated form that
Japanese shopkeepers hang up on their doors to indicate
the shop is closed. It is literally read "shime", which
means 'closed (not open for business)', from the verb
"shimeru" 'to close'. It is basically the Japanese equivalent
of this:

http://www.nottsprepared.gov.uk/np_home/closed_sign2.jpg

When Yoneya-san talks about this "shime" being equivalent
to the CJK ideograph U+7DE0, it isn't that U+7DE0 is
a *character* equivalent to U+3006 per se, but rather that
U+7DE0 is the ordinary kanji used to write the verb
"shime(ru)" (or "shima(ru)") -- in actual writing U+7DE0 is
used just for the "shi" root part of "shimeru", and you
would follow it by U+3081 to write the Hiragana
syllable "me". And a shopkeeper might post a sign that
has just U+7DE0 as another way to indicate a shop is closed.

> Unless someone makes the case for its
> having been misclassified, I don't see a reason to make an
> exception to Unicode's classification of it as "Lo", so it would
> remain a PVALID character. 

It isn't misclassified. In origin, U+3006 is a handwriting
abbreviation for "shime", so it has something in common
with other digraphic abbreviatory forms like the more
recently encoded U+309F HIRAGANA DIGRAPH YORI.

U+3006 has the additional attribute that it has long been
treated as a kind of honorary ideograph, because it stands
for the verb "shime(ru)" in the same way that the actual,
traditional, correct CJK ideograph U+7DE0 does. And because
of its use as a "content" element, it is classed in the
UCD as General_Category=Lo, but it is also classed as
Ideographic=True.
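
For anyone who wants to check the General_Category values mentioned here, a minimal sketch using Python's standard unicodedata module follows. The helper name describe() is made up for illustration, and the results reflect whatever UCD version ships with the local Python build. Note that unicodedata does not expose the Ideographic or Script properties, so those are not checked.

```python
import unicodedata


def describe(cp):
    """Return (General_Category, character name) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.name(ch)


# Code points discussed in this message.
for cp in (0x3006, 0x7DE0, 0x309F, 0x30FB):
    category, name = describe(cp)
    print(f"U+{cp:04X}  {category}  {name}")
```

U+3006 should come back as Lo and U+30FB as Po, which is the distinction at issue in this thread.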

The reason why U+3006 is given Script=Common, instead of
Script=Han, is that it is in origin a derivative of
Hiragana forms, but isn't formally Hiragana, nor is it
formally a CJK Ideograph. Think of it as being a kind
of letterlike symbol, but one which is used in context
of Han, Hiragana, and Katakana in the Japanese writing
system, like a number of other letterlike symbols or
actual symbol-symbols in the 30XX blocks in Unicode.

> But, just as was the case for
> Middle Dot, I think we need to hear a compelling argument for
> why it is actually necessary to have labels that consist only of
> one or more closing marks and middle dots.

On that point, I would differ somewhat with Yoneya-san on
whether there is anything compelling about this.

> 
> At least for me, it would help to know how a label consisting of
>    U+3006 U+30FB
> would be pronounced and what it would mean. 

It would be pronounced "shime", but that is somewhat beside
the point.
 
> 
> It would also help me to understand how a normal (not computer
> expert) reader of Japanese would read 
>   U+30A2 U+30AA U+30FB U+30A2 

ao-a

> as different from
>   U+30A2 U+30AA U+30FB U+30A2 U+3006

ao-ashime

> in a label.

both of which are nonsensical, of course.

It would be possible to make a case for just U+3006 all by
itself in a label, although odd -- the way someone has registered
and used the radical sign U+221A as a label, and actually has
a website up for it. Since U+3006 is PVALID and otherwise
unconstrained, that is allowed by IDNA2008 currently.

I don't see any strong case for U+3006 *and* U+30FB without
any other Han or Hiragana characters. It just wouldn't mean
much. The U+30FB is a little like adding a "-", and unless
you connect it to something meaningful, there isn't much
point to it.

> Otherwise, I think that the observation that Harald and I have
> made in different ways should probably apply: It is in
> everyone's interest to minimize the number of exceptions to
> those that are really needed to support the writing system.  In
> this example, I believe that case has been made for permitting
> Katakana Middle Dot despite the fact that it is classified as
> punctuation.  But the idea of making a second exception just to
> support an exception makes me very nervous, especially if the
> argument for it is that someone might desire such a label.

My opinion is that it isn't worth making an exception to
the exception for U+30FB, just to allow it to work with
U+3006 "shime" alone. That is an edge case of an edge case, and
I cannot envision a strong enough claim on its necessity to
justify yet another exception to the rule. Anybody who wanted
to use U+3006 in a label with the katakana middle dot could
make it work by simply adding one more Japanese character --
virtually any other Japanese character would work.
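
To make that concrete, here is a rough sketch of the kind of contextual check being discussed. It assumes the rule is "a label containing U+30FB must also contain at least one Hiragana, Katakana, or Han character". The function names and the code point ranges are illustrative assumptions only: the ranges are a deliberately incomplete stand-in for the real Script property, not the actual table text.

```python
KATAKANA_MIDDLE_DOT = 0x30FB

# Assumption: simplified, incomplete ranges standing in for
# Script=Hiragana, Script=Katakana, and Script=Han.
# U+3006 (Script=Common) is intentionally in none of them.
JAPANESE_SCRIPT_RANGES = [
    (0x3041, 0x3096),  # Hiragana letters
    (0x30A1, 0x30FA),  # Katakana letters
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]


def in_japanese_script(cp):
    """True if cp falls in one of the approximate script ranges."""
    return any(lo <= cp <= hi for lo, hi in JAPANESE_SCRIPT_RANGES)


def middle_dot_allowed(label):
    """Apply the assumed context rule for U+30FB to a single label."""
    cps = [ord(ch) for ch in label]
    if KATAKANA_MIDDLE_DOT not in cps:
        return True  # the rule only constrains labels that use U+30FB
    return any(in_japanese_script(cp) for cp in cps)


print(middle_dot_allowed("\u3006"))              # U+3006 alone
print(middle_dot_allowed("\u3006\u30fb"))        # U+3006 U+30FB
print(middle_dot_allowed("\u3006\u30fb\u7de0"))  # plus one kanji
```

Under these assumptions, U+3006 alone passes, U+3006 U+30FB fails, and adding a single kanji such as U+7DE0 makes the label pass again.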

--Ken


