The lookalike problem(s)

John C Klensin klensin at jck.com
Sun Nov 26 23:49:17 CET 2006



--On Monday, 27 November, 2006 10:10 +1300 Sam Vilain
<sam.vilain at catalyst.net.nz> wrote:

> John C Klensin wrote:
>> (ii) We have the well-known case of needing to form labels
>> that mix base Latin characters with the script in which the
>> Japanese language is traditionally written.  By any Unicode,
>> linguistic, or other definition, those characters come from
>> different scripts.  But, pragmatically, they are used
>> together and should probably be permitted in the DNS
>> together.    Of course, the rules might be different in some
>> domain that uses CJK characters differently and Japanese is
>> certainly not the only case.

> There are some cross-script confusables that I found between
> Latin and Chinese;
> 
> ⻈ vs i, ⺃ vs L, ⻖ vs ß, - well, those are radicals,
> which I think are excluded (though they seem to be repeated -
> U+8BA0 讠, and I saw the ß look-alike twice as well). 七 vs
> t, 丅 vs T, 丿 vs J, 亅 vs l, 丨 or 工 vs I, 乚 vs L,
> 乙 or 己 vs Z, 丫 vs Y, 凵 vs U (but definitely not 凹 vs
> U :)), 冫 vs i (perhaps), 匚 or 匸 vs C, 卜 vs t, 口 vs O
> (which, incidentally, looks like a "b" as normally
> hand-written), 爪 vs M. It has the feeling of an NP-complete
> problem to find them all...

It probably is.  Michael certainly knows more about this than I
do, but there is a tendency, probably historical and/or
anthropological, for relatively simple patterns, such as more or
less vertical and horizontal strokes, perhaps with simple
decorations, to appear in many writing systems.  Whether they
are then confusable depends on fonts, rendering, and
presentation: I don't have any trouble distinguishing any of the
characters in the message above, but I can easily imagine fonts
(or handwriting) in which the pairs would be indistinguishable...
just as I can imagine (and have seen) typefaces that make some
Roman characters indistinguishable from some Thai characters.

Because Unicode contains only a finite number of characters and
one need only look at pairs, the problem is merely very large,
not actually NP-complete.  On the other hand, _I_ certainly
don't want to look at all of the relevant pairs (something on
the order of 10**9 if my quick mental arithmetic is correct),
nor to have to repeat the job with each revision of Unicode.
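(A back-of-the-envelope check of that arithmetic, as a Python
sketch; the figure of 100,000 assigned characters is an
assumption, roughly right for recent versions of Unicode, and
the exact count changes with each revision:)

    # Rough count of the character pairs a complete confusability
    # review would have to inspect.  n = 100,000 assigned characters
    # is an assumption; the exact number varies by Unicode version.
    n = 100000
    pairs = n * (n - 1) // 2
    print(pairs)    # 4999950000, i.e. on the order of 10**9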

As a matter of personal speculation, there is a piece of this
puzzle beyond what can go into the algorithm and what should be
done on a registry basis, and that is to think very seriously
about presentation and display.  If the system is to present
punycode unless the user pre-identifies the relevant script(s)
as well-known, and possibly to present punycode whenever the
application decides a string is risky for reasons independent of
those that would have excluded the name at the protocol or the
registry, then I predict that IDNs will die except for some
strictly intra-language-community applications.  As I have
commented several times in different contexts, going to a user
and saying "We have improved your life.  Instead of those
horrible Romanized transliterations that you have gotten used to
and hate every time, but that you, your correspondents, and your
users can predict, we have these lovely, almost-random-looking
(unless you are _really_ unlucky) xn--nonsense punycode strings"
is not going to produce a positive result.  (For the record, I
have no idea what the four Chinese characters represented by
xn--nonsense mean, but it is valid punycode.)
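(To make the mechanics concrete, a minimal sketch using Python's
built-in punycode codec and an arbitrary example label of my own,
not any string from this discussion:)

    # Round-trip an arbitrary non-ASCII label through punycode to
    # see the "xn--" form a user would be shown.  The label here
    # is just an illustrative example, not a real domain.
    label = u"\u4f8b\u3048"
    ace = "xn--" + label.encode("punycode").decode("ascii")
    print(ace)                      # the nearly-opaque display form
    # ...and back, stripping the IDNA-specific "xn--" prefix first:
    print(ace[4:].encode("ascii").decode("punycode"))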

Curiously we have solved this problem before, although only for
much smaller character sets.  It didn't take long after
computers were introduced for people to discover that
distinguishing between l and 1 and between O and 0 was important
and to design both fonts and conventions about handwriting to
make the distinctions clear (Michael is the author of what, IMO,
is one of the better such fonts).  It didn't take much longer
for people to figure out that, at least for debugging purposes,
we needed conventions to represent, identify, and distinguish
otherwise-invisible control characters and text and control
languages.

Perhaps part of the _real_ future for IDNs and the confusability
problem lies in combining font design for maximum
distinguishability with a mechanism that explicitly identifies a
script (or something else) with each character as presented, and
in using that arrangement any time a domain name or IRI appears
and can be identified as such.  Some months ago, someone (I
think Mark Davis) suggested color-coding scripts within a domain
name to identify mixed-script situations.  I don't think that
scales and generalizes sufficiently, and color is problematic
for accessibility and cultural reasons, but it was certainly a
useful idea as part of this more general notion.
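(As a rough sketch of the per-character identification idea: the
standard Python unicodedata module does not expose the Unicode
Script property, so the first word of the character name serves
here as a crude stand-in for it:)

    import unicodedata

    def rough_script(ch):
        # Crude stand-in for the real Script property: the first
        # word of the character's name (LATIN, CYRILLIC, CJK, ...).
        try:
            return unicodedata.name(ch).split()[0]
        except ValueError:
            return "UNKNOWN"

    def scripts_in(label):
        return set(rough_script(ch) for ch in label if ch.isalpha())

    # A label with a Cyrillic U+0430 masquerading as Latin "a"
    # shows up as mixed and could be highlighted accordingly:
    label = u"p\u0430ypal"
    if len(scripts_in(label)) > 1:
        print("mixed scripts:", sorted(scripts_in(label)))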

If we could do that, it would quickly take us back toward the
plan that end-users should see punycode only on request or under
very unusual circumstances... a plan that I think will turn out
to be essential if IDNs are going to be viable globally.

The idea, if it is worth anything, is a gift to the community.
But I, for one, would be happy to pay good money for such a font.

>...
>> (2)  Were we, nonetheless, to try to specify and implement
>> such a rule, it would likely be as sensible, and would
>> certainly be as feasible, to decide that "European" digits
> 
> You mean "Arabic" numerals? :)

For those who don't immediately understand the joke: as long as
you put "Arabic" in quotes and understand that Arabic (no
quotes) digits are different, yes.  That is one of the other
"features" of this business -- words often don't mean what you
think they mean, and precision is _really_ important.  For those
digits, the convention among Unicode experts seems to be to use
"European", which strikes me as sensible.

>...
>> (3) While I am getting skeptical about the feasibility of
>> applying a label-homogeneity rules to the basic protocol,
>> these sorts of things makes perfectly good sense as a registry
>> restriction.  Registries have access to language knowledge,
>> local knowledge, and an ability to make judgments about
>> circumstances that DNS or IDNA protocols lookups will lack.
>> For this example, that would leave the question of whether
>> "European" digits belong in the same label as Greek characters
>> in the hands of the registry or domain administrator who
>> decides to permit registration of Greek characters.  
> 
> Agreed, it's a pretty heavy thing to expect client software to
> get right, and will be so incomplete. I think that so long as
> the "important" ones (like ⁄, /, .,:) are covered (as
> of course they are by now), the rest can be done by registries.

Yes, but we also have to be aware that they raise another
problem.  To take a handy example, there are a lot of characters
that are confusable with "/" (see comment about strokes above).
If we permit one of them into the allowed set of characters, the
opportunities for URL-spoofing would be just too much for any
mildly immature or malevolent person to resist. And registration
restrictions that are not enforced (or enforceable) below the
second level won't help with that.  For example, if there were a
TLD "badguy" then consider 
    http://www.mal-site.badguy/foo.bar.com/
if the third slash-lookalike isn't a slash.  Of course, if
presentation software says "evil! mixed script!" and drops
"badguy/foo" into punycode, that solves the problem... unless
"bar.com" is also a maliciously-populated domain and users have
gotten so used to seeing, and ignoring, punycode that it is just
ignored and taken as an IDN in a non-local script.  The hardest
problems here are not, IMO, going to be around IDNA and its
tables and we are going to all need to get very clever about
this... much more clever than we have been so far.  And, by the
way, for that one, painting the spoofed slash bright red, or
displaying it in a font that is impressively different from the
font of the other slashes, is probably a _much_ better solution
than punycode display.
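(A small sketch of the mechanics, in Python, using the fraction
slash U+2044 mentioned earlier as the lookalike:)

    from urllib.parse import urlsplit

    real = u"http://www.mal-site.badguy/foo.bar.com/"
    # The same string, but with U+2044 FRACTION SLASH where the
    # third "/" appears: the lookalike is part of the hostname,
    # so the apparent path never leaves the attacker's domain.
    fake = u"http://www.mal-site.badguy\u2044foo.bar.com/"
    print(urlsplit(real).hostname)   # www.mal-site.badguy
    print(urlsplit(fake).hostname)   # the whole spoofed string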

     john




