The lookalike problem(s)

Sun Nov 26 20:45:00 CET 2006

--On Sunday, 26 November, 2006 02:33 +0000 Michael Everson
<everson at evertype.com> wrote:

> At 18:23 -0800 2006-11-25, Paul Hoffman wrote:
> 
>> I'm not sure why you are insisting on seeing the example
>> visually.  Take any Greek word (even "Hellas") and append
>> some Latin numerals,  such as the current year.
> 
> The common "European" digits do not belong to the Latin
> script, and do not have a "Latin" character property.

Paul,

To amplify a bit on Michael's comment:

(1) I don't see any way to put a "one label, one script" rule
into the protocol in a way that would do us any good.   As
examples of reasons (you pick your preferred one or invent
another, even if you don't believe one or more of the others or
see a workaround for them):

(i) with regard to any given language, label, or use, there is
enough extra "stuff" identified as "Latin Script" to make an
"entirely Latin Script" rule nearly useless.  Probably, similar
comments could be made about other scripts, but that one is
probably the worst.  That identification is almost certainly
entirely appropriate given the Unicode definition of "script",
but it makes the category less useful for our purposes than it
might be.

(ii) We have the well-known case of needing to form labels that
mix base Latin characters with the script in which the Japanese
language is traditionally written.  By any Unicode, linguistic,
or other definition, those characters come from different
scripts.  But, pragmatically, they are used together and should
probably be permitted in the DNS together.  Of course, the
rules might be different in some domain that uses CJK characters
differently, and Japanese is certainly not the only case.

(iii) If we really wanted, and could implement, such a rule, it
should --I think clearly-- be tied to the characters of the
writing system associated with a particular language, or, in
some cases, the characters of the writing system associated with
a particular language when some particular script is used.   And
that can't work at lookup time because there is no writing
system or language information in the DNS and one would need to
have it prior to lookup, rather than after getting the results,
anyway.

(2)  Were we, nonetheless, to try to specify and implement such
a rule, it would likely be as sensible, and would certainly be
as feasible, to decide that "European" digits were usable with
characters that were otherwise in the Greek script as it would
be to decide that they were usable with characters in the
"Latin" script.  Partially because of (1), above, we are a great
distance from actually defining such a rule.

But, Michael, this is, I believe, why you and Paul have been
confusing each other.  I think that he assumed, when we have
said "one label, one script", that the  test would be made on
"Greek script" only. The fact that the European digits aren't
"Latin script" is irrelevant -- from the standpoint of that test
as he (reasonably) interpreted it, it is important only that
they aren't Greek.   I infer that your assumption was that "one
label, one script" would simply pass things that were not
properly part of any script (e.g., were "Script Common" or some
other case) or perhaps just things of class Nd.  Perfectly
reasonable and a much more plausible rule than [strictly] "one
label, one script", but perhaps something we should be calling
"one label, all letters from the same script" to be more clear.
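
To make the difference between the two readings concrete, here
is a minimal sketch in Python.  It is illustrative only: the
function names are made up for this example, and rough_script()
guesses a script from the first word of the character name,
which is an assumption standing in for the real Unicode Script
property (Python's standard library does not expose it).

```python
import unicodedata


def rough_script(ch):
    """Rough script guess (an assumption, not the Unicode Script property).

    Letters get the first word of their character name ("GREEK",
    "LATIN", ...); everything else, including digits of class Nd,
    is lumped together as "Common".
    """
    if unicodedata.category(ch).startswith("L"):
        return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "Common"


def strictly_one_script(label, script):
    """Strict reading: every character must belong to `script`."""
    return all(rough_script(ch) == script for ch in label)


def letters_from_one_script(label):
    """Looser reading: only letters are compared; the rest passes."""
    scripts = {rough_script(ch) for ch in label} - {"Common"}
    return len(scripts) <= 1


# "Hellas" in Greek letters followed by the year, as in Paul's example.
label = "\u0395\u03bb\u03bb\u03ac\u03c2" + "2006"

strict_result = strictly_one_script(label, "GREEK")
loose_result = letters_from_one_script(label)
```

With this sketch the strict test rejects the label, because the
digits are not Greek, while the "all letters from the same
script" test accepts it, because the digits are never compared.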

(3) While I am getting skeptical about the feasibility of
applying label-homogeneity rules in the basic protocol, these
sorts of things make perfectly good sense as registry
restrictions.  Registries have access to language knowledge,
local knowledge, and an ability to make judgments about
circumstances that DNS or IDNA protocol lookups will lack.  For
this example, that would leave the question of whether
"European" digits belong in the same label as Greek characters
in the hands of the registry or domain administrator who decides
to permit registration of Greek characters.
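
As a rough illustration of what such a registry-level decision
could look like, here is a small Python sketch.  Everything in
it is hypothetical: RegistryPolicy, label_allowed, and the
choice of lowercase Greek letters U+03B1..U+03C9 as the
permitted repertoire are assumptions made up for the example,
not any registry's actual table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryPolicy:
    """Hypothetical per-registry rule: permitted letters plus a digit switch."""
    letters: frozenset
    allow_european_digits: bool


def label_allowed(label, policy):
    """Return True if every character of the label is permitted by the policy."""
    permitted = set(policy.letters)
    if policy.allow_european_digits:
        permitted |= set("0123456789")
    return bool(label) and all(ch in permitted for ch in label)


# Assumed repertoire: lowercase Greek alpha..omega (includes final sigma).
GREEK_LOWER = frozenset(chr(cp) for cp in range(0x03B1, 0x03CA))

with_digits = RegistryPolicy(GREEK_LOWER, allow_european_digits=True)
without_digits = RegistryPolicy(GREEK_LOWER, allow_european_digits=False)

# Unaccented lowercase "hellas" followed by the year.
label = "\u03b5\u03bb\u03bb\u03b1\u03c2" + "2006"

allowed_with_digits = label_allowed(label, with_digits)
allowed_without_digits = label_allowed(label, without_digits)
```

The point of the sketch is only that the digit question becomes
a single local setting that each registry can answer for
itself, without the protocol having to take a position.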

     john