The lookalike problem(s)
sam.vilain at catalyst.net.nz
Sun Nov 26 22:10:44 CET 2006
John C Klensin wrote:
> (ii) We have the well-known case of needing to form labels that
> mix base Latin characters with the script in which the Japanese
> language is traditionally written. By any Unicode, linguistic,
> or other definition, those characters come from different
> scripts. But, pragmatically, they are used together and should
> probably be permitted in the DNS together. Of course, the
> rules might be different in some domain that uses CJK characters
> differently and Japanese is certainly not the only case.
There are some cross-script confusables that I found between Latin and
CJK characters: ⻈ vs i, ⺃ vs L, ⻖ vs ß. Well, those are radicals, which I
think are excluded (though they seem to be repeated elsewhere, e.g. as
U+8BA0 讠, and I saw the ß look-alike twice as well). Beyond the
radicals: 七 vs t, 丅 vs T, 丿 vs J, 亅 vs l, 丨 or 工
vs I, 乚 vs L, 乙 or 己 vs Z, 丫 vs Y, 凵 vs U (but definitely not 凹 vs
U :)), 冫 vs i (perhaps), 匚 or 匸 vs C, 卜 vs t, 口 vs O (which,
incidentally, looks like a "b" as normally hand-written), 爪 vs M. It
has the feeling of an NP-complete problem to find them all...
I even saw one that is confusable with punctuation, e.g. 丶 vs `. And 灬
might get confused with MediaWiki mark-up :-).
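
To make the idea concrete, here is a rough Python sketch of how a look-alike check could work: fold each listed character to its Latin counterpart and compare the results. The table holds only the non-radical pairs listed above, so it is nowhere near complete, and the names CONFUSABLES, skeleton and looks_like are made up for this illustration rather than taken from any existing tool.

```python
# Hypothetical table built only from the pairs listed in this message.
# It is illustrative and far from complete.
CONFUSABLES = {
    "七": "t", "丅": "T", "丿": "J", "亅": "l", "丨": "I", "工": "I",
    "乚": "L", "乙": "Z", "己": "Z", "丫": "Y", "凵": "U", "冫": "i",
    "匚": "C", "匸": "C", "卜": "t", "口": "O", "爪": "M",
}


def skeleton(label: str) -> str:
    """Replace every listed look-alike with its Latin counterpart."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in label)


def looks_like(candidate: str, existing: str) -> bool:
    """True if the labels differ but fold to the same skeleton."""
    return candidate != existing and skeleton(candidate) == skeleton(existing)
```

With this table, looks_like("丫AH口口", "YAHOO") is true, while comparing a label with itself is not flagged. The hard part is the table, not the comparison.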
> (2) Were we, nonetheless, to try to specify and implement such
> a rule, it would likely be as sensible, and would certainly be
> as feasible, to decide that "European" digits
You mean "Arabic" numerals? :)
> were usable with characters that were otherwise in the Greek script
> as it would be to decide that they were usable with characters in
> the "Latin" script. Partially because of (1), above, we are a great
> distance from actually defining such a rule.
> (3) While I am getting skeptical about the feasibility of
> applying label-homogeneity rules to the basic protocol, these
> sorts of things make perfectly good sense as a registry
> restriction. Registries have access to language knowledge,
> local knowledge, and an ability to make judgments about
> circumstances that DNS or IDNA protocols lookups will lack. For
> this example, that would leave the question of whether
> "European" digits belong in the same label as Greek characters
> in the hands of the registry or domain administrator who decides
> to permit registration of Greek characters.
Agreed, it's a pretty heavy thing to expect client software to get
right, and any client-side check will always be incomplete. I think that
so long as the "important" ones (like ⁄, ／, ． and ：) are covered (as of
course they are by now), the rest can be done by registries.
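
As a small illustration of covering the "important" ones, the sketch below flags the four characters named above when they turn up in a label. The set SYNTAX_LOOKALIKES and the function syntax_lookalikes are hypothetical names for this example, and a real list would be longer. It only shows the shape of such a check, not how IDNA itself handles these code points.

```python
import unicodedata

# Only the four characters named in this message: look-alikes for "/",
# "." and ":". Assumed for illustration; a real list would be longer.
SYNTAX_LOOKALIKES = {"\u2044", "\uFF0F", "\uFF0E", "\uFF1A"}


def syntax_lookalikes(label: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) for each flagged character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in label
        if ch in SYNTAX_LOOKALIKES
    ]
```

For example, syntax_lookalikes("example／path") reports U+FF0F FULLWIDTH SOLIDUS, and a plain ASCII label yields an empty list.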
Sam Vilain, Systems Architect, Catalyst IT (NZ) Ltd.
phone: +64 4 499 2267 PGP ID: 0x66B25843