Tue Jul 6 12:14:17 CEST 2010

2010/6/26 John C Klensin <klensin at jck.com>:
> --On Saturday, June 26, 2010 11:54 +0200 Nick Teint
> <nick.teint at googlemail.com> wrote:
>> Sometimes, you _do_ want ACEs to leak into the UI:
>> 1. Your user does not know the script. Displaying an ugly ACE
>> string is better than displaying some
>> known-to-be-unrecognisable characters.*
>> 2. You don't have the fonts. Displaying an ugly ACE string is
>> better than displaying "???????".
>
> Both of these points have been made many times before.  Note
> that "???????" and its equivalents (e.g., row of little boxes)
> are universal confusables -- they can be confused with, and
> match in user perception, _any_ string for which the user does
> not have fonts.  For the cautious user, that should be a strong
> warning.

These two arguments are not about phishing.

While certainly not the preferred form for human consumption, the ACE
form does enable some basic usage patterns such as noting the name
down on paper, reading it aloud over the phone, etc. These tasks are
simply not possible with an unfamiliar script or ersatz characters.

Gibberish made of known characters (a subset of the Latin script) is
still better than gibberish that consists of unknown or
indistinguishable characters.

>> 3. The string contains conspicuous confusables. Displaying an
>> ugly ACE string is better than displaying a
>> maliciously-crafted string.
>
> As long as one can know that the string is maliciously-crafted,
> sure.  A big warning that says "this is malicious, you aren't
> looking at what you think you are, and you are likely to damage
> your machine, your identity, your finances, or your soul if you
> continue" would be even better and does not require displaying
> an ACE.

I doubt that would be better. Big warnings have problems of their own,
especially when there are too many false alarms.

Falling back to ACE is more neutral: it does not claim that anything
is malicious; it simply disables IDN display and thus enables the user
to see the difference between domain names. While paypal.com and
payрal.com look similar, paypal.com and xn--payal-xye.com don't.
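To make the difference concrete, here is a minimal Python sketch of the
fallback. It assumes the standard library's "punycode" codec and a
hypothetical helper named to_ace; it deliberately skips the IDNA mapping
and validation steps, so it illustrates the idea rather than a complete
IDNA implementation.

```python
def to_ace(label: str) -> str:
    """Return the ACE form of one label (hypothetical helper).

    Only the raw Punycode step is shown; a real user agent would also
    apply the IDNA mapping and validation rules first.
    """
    if label.isascii():
        return label  # plain LDH labels are displayed unchanged
    return "xn--" + label.encode("punycode").decode("ascii")


genuine = "paypal"
spoofed = "pay\u0440al"  # U+0440 CYRILLIC SMALL LETTER ER instead of Latin "p"

print(to_ace(genuine))  # paypal
print(to_ace(spoofed))  # xn--payal-xye
```

The two U-labels render almost identically, but their ACE forms cannot
be mistaken for one another.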

While user agents can use other methods to make the difference visible
(e.g. colours), these don't work universally.

Being able to express the name in ASCII is also useful for further
investigation and debugging.

>> PS: * For this purpose, it might even make sense to define
>> Script-Compatible Encodings (SCEs) for scripts other than
>> Latin/ASCII.
>
> I'd be interested in understanding what you have in mind.

ACEs map the full Unicode range onto characters from the Latin script
(a subset of the best-known characters, of course). This not only
helps machines that can't cope with non-Latin characters but also
humans who don't know the original script.

SCEs would map the full Unicode range onto characters from a non-Latin
script, helping users who know neither the original nor the Latin
script. For example, müller.example.net, the ACE of which is
xn--mller-kva.example.net, could also have SCEs such as
χν--μλλερ-κβα.χω--εχαμφλε.χω--νετ for people who don't know the Latin
script but do know the Greek one.
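As a rough sketch of how such an SCE could be derived, the Python below
re-spells an ACE label letter by letter in Greek. The substitution table
is purely hypothetical (no such encoding is defined anywhere); it is
chosen only so that it reproduces the first label of the example above.
The Greek alphabet has fewer letters than the 26 that an ACE can
contain, so the table borrows final sigma and the archaic digamma;
digits and hyphens are left untouched.

```python
LATIN = "abcdefghijklmnopqrstuvwxyz"
# Hypothetical one-to-one table, position by position against LATIN.
# Final sigma (for c) and digamma (for f) fill the gaps.
GREEK = "απςδεϝγηιξκλμνοφθρστυβωχψζ"

TO_SCE = str.maketrans(LATIN, GREEK)
TO_ACE = str.maketrans(GREEK, LATIN)


def ace_to_sce(ace_label: str) -> str:
    """Re-spell an ACE label with Greek letters (illustrative only)."""
    return ace_label.lower().translate(TO_SCE)


def sce_to_ace(sce_label: str) -> str:
    """Invert ace_to_sce; possible because the table is one-to-one."""
    return sce_label.translate(TO_ACE)


sce = ace_to_sce("xn--mller-kva")
print(sce)                                        # χν--μλλερ-κβα
ace = sce_to_ace(sce)
print(ace)                                        # xn--mller-kva
print(ace[4:].encode("ascii").decode("punycode"))  # müller
```

Because every step is reversible, the U-label can be recovered from the
Greek spelling just as it can from the ACE.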

> However, the conventional/historical way to do this is by
> transliteration into Latin characters.

Transliterations are lossy. ACEs and SCEs are less comprehensible but
allow a full reconstruction of the U-label.
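A small Python illustration of the difference, under stated assumptions:
strip_diacritics is a hypothetical and deliberately crude stand-in for
transliteration (real schemes differ, e.g. "mueller" in German), and the
ACE side uses the standard library's "punycode" codec without the
"xn--" prefix.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Crude stand-in for transliteration: drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


u_label = "m\u00fcller"  # müller

# Lossy: "müller" and "muller" collapse to the same ASCII string,
# so the original cannot be recovered from the result.
print(strip_diacritics(u_label))   # muller
print(strip_diacritics("muller"))  # muller

# Lossless: the Punycode form decodes back to the exact U-label.
ace = u_label.encode("punycode")
print(ace)                              # b'mller-kva'
print(ace.decode("punycode") == u_label)  # True
```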

NT