Tables and contextual rule for Katakana middle dot
John C Klensin
klensin at jck.com
Wed Apr 8 00:24:58 CEST 2009
--On Tuesday, April 07, 2009 11:29 -0700 Mark Davis
<mark at macchiato.com> wrote:
> First off, the Katakana middle dot is distinctive enough that
> I see no problem with visual confusion. Take a look at:
> a ・c.com
> The width and positioning is far different.
So, you are relying on a particular font presentation, one that
is certainly not required by The Unicode Standard or anything
else, in a script that is famous for artistic calligraphy and
font design to determine distinctiveness? One might certainly
have a different opinion.
I also note that, in some circles, IP addresses have often been
written (in handwriting) with the dots above the baseline to
make the presence of those dots clear (just as zeros are
sometimes written with slashes). I tried two separate
handwriting-to-text conversion programs that thought they were
processing Latin-script characters with a midline dot separating
digits and both of them decided it was U+002E. So at least
something non-visual is easily confused.
> There are other dot-like characters that are far more visually
> similar to dot, like Arabic zero.
Sure. But a no-mixed-script registration rule would prohibit
the use of that one outside an Arabic script context, where it
would be expected and easily distinguished. My problem, and I
think Harald's, is not with permitting this character; it is with
permitting it in non-Japanese contexts. Of course, if Unicode
had considered Romaji a separate script from Latin, we might be
having a different discussion.
> But more importantly, there is a real lack of data presented
> for these kinds of positions. When excluding characters that
> are in common use on the basis of visual confusability, such
> as Katakana middle dot, let's see some real data on what a
> difference this would make in overall visual confusability of
> characters. Of all of the visually confusable characters in
> PVALID, what would be the percentage difference by adding or
> removing Katakana middle dot? And why do people think this
> can't be handled by exactly the same mechanisms that programs
> have to handle the visually confusable characters that *are*
Mark, there is a problem with any sort of "prove this with data"
situation, which is that one can take the position that
data/proof are required for either inclusion or exclusion. I
think that we can propose a guideline about that and try to
stick with it. My proposed guideline is that we use the rest
of the classification rules in Tables (and, effectively, the
Unicode character properties) to govern construction of the
appropriate rule. If a character would be PVALID under those
rules, then it takes a demonstration that it is harmful,
dangerous, or seriously confusing to justify DISALLOWING it or
requiring a contextual rule. If a character would be
DISALLOWED, it takes a demonstration that it is both safe and
important to justify making an exception or a special rule (to
make it either PVALID or CONTEXTx).
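To make the proposed guideline concrete, here is a minimal sketch of how the default disposition and the resulting burden of proof might be computed from the Unicode General_Category. The category set below is a toy subset chosen for illustration and is an assumption, not the actual Tables rules; the function names are hypothetical, and "Arabic zero" is assumed to mean U+0660.

```python
import unicodedata

# Toy subset of categories treated as PVALID by default. This set is an
# illustrative assumption; the real Tables rules are far more detailed.
PVALID_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}


def default_status(ch):
    """Return the default disposition implied by the General_Category."""
    if unicodedata.category(ch) in PVALID_CATEGORIES:
        return "PVALID"
    return "DISALLOWED"


def burden_of_proof(ch):
    """State what must be demonstrated to depart from the default."""
    if default_status(ch) == "PVALID":
        return "show harm or serious confusion to restrict it"
    return "show it is both safe and important to permit it"


# U+0660 (assumed "Arabic zero"), U+30FB (Katakana middle dot), U+002E
for cp in (0x0660, 0x30FB, 0x002E):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.category(ch)} "
          f"{default_status(ch)}: {burden_of_proof(ch)}")
```

The point of the sketch is only that the default comes from the character properties, and the side arguing against the default carries the burden.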
If that rule is not reasonable, please propose a different one.
If it is reasonable, it provides the discrimination function
that you suggest is needed:
* An Arabic Zero is a Digit (Nd), digits are, by default,
PVALID, and no one has made a convincing case for excluding it.
Such a case would presumably meet great resistance because
digits in labels are common, well-established (as a class), and
required for names that are formed according to the requirements
of some international networking standards.
* The Katakana Middle Dot is Punctuation (Po). That category
is, by default, DISALLOWED. So the discussion is whether it is
important enough to make into an exception (whether that
exception makes it PVALID or CONTEXTO) and whether it is safe to
do so. Yoneya-san has made the argument that it is important as
a word separator. Absent anything else, I would personally
accept that while noting that there are lots of punctuation
characters in lots of scripts and contexts that people would
like to use as word separators, including, curiously, U+002E.
While it isn't an IDNA context, I note that I might want to put
something that I can normally only write as
John\.Klensin.jck.com in an SOA record and that a number of
standards and conventions would make Joe\.Bloggs.example.com a
nice third-level subdomain, quite distinguished from
Joe.Bloggs.example.com, which is a fourth-level one. Also note
that, in the DNS's internal format, neither of these examples
uses the backslash _and_ that the backslash is not required by
any standard. For that reason, I think it is reasonable to
accommodate the Japanese requirement by permitting that
character as long as it is sufficiently restricted to Japanese
contexts that a non-reader of Japanese is unlikely to encounter
it and confuse it with a label-separator, i.e., that it is
acceptable if handled with a contextual rule although not
acceptable as PVALID for use anywhere. But, if "Japanese
context" is expanded to include labels that consist only of one
or more middle dots, in any position, plus undecorated Latin
(because there is no way to separate Romaji from Latin
(decorated or otherwise)), then I can see no plausible way to
write the contextual rule and we are back to the DISALLOWED
Whether you agree or not, does that line of reasoning make sense?
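For illustration only, one possible shape for such a contextual rule is sketched below: the middle dot is accepted only when the label also contains at least one character that is recognizably Japanese. This is a hypothetical rule, not one the list has agreed on. Python's standard library does not expose the Unicode Script property, so the sketch approximates "Japanese" by character-name prefix, which is an assumption and only a rough stand-in for a real script lookup.

```python
import unicodedata

MIDDLE_DOT = "\u30fb"  # KATAKANA MIDDLE DOT

# Approximation of Hiragana/Katakana/Han membership by character name.
# A real implementation would consult the Script property instead.
JAPANESE_NAME_PREFIXES = ("HIRAGANA", "KATAKANA", "CJK UNIFIED IDEOGRAPH")


def is_japanese_char(ch):
    """Rough test for a Japanese-script character, excluding the dot itself."""
    if ch == MIDDLE_DOT:
        return False
    return unicodedata.name(ch, "").startswith(JAPANESE_NAME_PREFIXES)


def middle_dot_allowed(label):
    """Hypothetical contextual rule for U+30FB within a single label."""
    if MIDDLE_DOT not in label:
        return True  # rule does not apply
    return any(is_japanese_char(ch) for ch in label)


print(middle_dot_allowed("a\u30fbc"))            # Latin plus middle dot
print(middle_dot_allowed("\u30c6\u30fb\u30b9"))  # Katakana plus middle dot
print(middle_dot_allowed("\u30fb"))              # middle dot alone
```

Under this sketch a label of undecorated Latin plus middle dots is rejected, which is exactly the case that cannot be expressed if Romaji labels must also be accepted.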