UTF-8

Tue Jul 6 18:51:07 CEST 2010

On Tue, Jul 06, 2010 at 12:14:17PM +0200, Nick Teint wrote:
> 2010/6/26 John C Klensin <klensin at jck.com>:
> 
> These two arguments are not about phishing.
> 
> While certainly not the preferred form for human consumption, the ACE
> form does enable some basic usage patterns such as noting the name
> down on paper, reading it aloud over the phone, etc. These tasks are
> simply not possible with an unfamiliar script or ersatz characters.
> 
> Gibberish out of known characters (a subset of the Latin script) is
> still better than gibberish that consists of unknown or
> indistinguishable characters.

ACE certainly can look like gibberish, and ACE can be confusable.  For
example, xn--fo-6ja, xn--fo-3ja, xn--fo-oja and so on -- pretty close,
but not the same.
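
To make that concrete, here is a minimal Python sketch that decodes those three labels and checks that they really are different names.  It assumes Python's built-in "punycode" codec, which handles only the part after the "xn--" prefix; the helper name decode_ace_label is hypothetical, made up for this illustration.

```python
# Sketch: decode similar-looking ACE labels and compare the results.
ACE_PREFIX = "xn--"

def decode_ace_label(label):
    """Return the Unicode form of a single ACE label (hypothetical helper)."""
    if not label.lower().startswith(ACE_PREFIX):
        return label  # not an ACE label, show it as-is
    return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")

labels = ["xn--fo-6ja", "xn--fo-3ja", "xn--fo-oja"]
decoded = {ace: decode_ace_label(ace) for ace in labels}

for ace, uni in decoded.items():
    print(ace, "->", uni)

# Three distinct Unicode labels hide behind three near-identical ACE strings.
print(len(set(decoded.values())) == len(labels))
```

The ACE forms differ in a single character each, so a user comparing them by eye has little to go on.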

In other words, I'm not sure we can win here.

As for '?' and such: if users treat them as wildcards, that is bad,
but if users treat them as an indication that something is broken, then
that's much better.  It'd be nice to know which way most users are
likely to respond.  My intuition says that if you display garbage that
can't even be cut-n-pasted correctly then we're reasonably safe.  But
that's just intuition, and it only applies when you either can't
represent a label in the user's locale or lack the fonts...

Hmmm, maybe apps and even systems could have a setting where a user can
say that any domain name labels using scripts that the user doesn't
understand, or containing characters that are confusable to the user,
will be displayed in such a way as to make that clear to the user, or
possibly as garbage.
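
Such a setting could be sketched roughly as follows in Python.  Everything here is an assumption for illustration: the names scripts_in and display_label are hypothetical, and since the standard library does not expose the Unicode Script property, the first word of each character's Unicode name is used as a crude stand-in for its script.  Confusables within a known script are not handled at all.

```python
import unicodedata

def scripts_in(label):
    """Rough guess at the scripts used in a label (hypothetical helper)."""
    scripts = set()
    for ch in label:
        if ch.isascii():
            # ASCII letters, digits and hyphen: count them as Latin here.
            scripts.add("LATIN")
        else:
            # E.g. "CYRILLIC SMALL LETTER ER" -> "CYRILLIC" (a heuristic only).
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def display_label(label, known_scripts):
    """Show the label as-is only if the user knows every script in it."""
    if scripts_in(label) <= known_scripts:
        return label
    # Otherwise fall back to the ACE form.
    return "xn--" + label.encode("punycode").decode("ascii")

# A user who has declared only Latin sees the Cyrillic look-alike as ACE.
print(display_label("m\u00fcller", {"LATIN"}))
print(display_label("pay\u0440al", {"LATIN"}))
print(display_label("pay\u0440al", {"LATIN", "CYRILLIC"}))
```

A user who lists both Latin and Cyrillic still gets the look-alike displayed unchanged, which is the mixed-script case this sketch does not address.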

> >> 3. The string contains conspicuous confusables. Displaying an
> >> ugly ACE string is better than displaying a
> >> maliciously-crafted string.
> >
> > As long as one can know that the string is maliciously-crafted,
> > sure.  A big warning that says "this is malicious, you aren't
> > looking at what you think you are, and you are likely to damage
> > your machine, your identity, your finances, or your soul if you
> > continue" would be even better and does not require displaying
> > an ACE.
> 
> I doubt it would be better. Big warnings have problems of their own,
> especially when there are too many false alarms.

I agree... unless the user positively selected warnings instead of ACE
or garbage.

> Falling back to ACE is more neutral: It does not claim that anything
> be malicious, it simply disables IDN display and thus enables the user

I would rather see more options (see above).

> to see the difference between domain names. While paypal.com and
> payрal.com look similar, paypal.com and xn--payal-xye.com don't.

Oh, but that's just about confusables with non-IDNs.  It doesn't help in
IDN<->IDN confusable situations.

> While user agents can use other methods to make the difference visible
> (e.g. colours), these don't work universally.

Right, but in most UIs there will be a method of flagging problematic
IDNs that _users_ can enable.  Once the user has picked a warning method
they can be trusted to understand it (perhaps I'm being a bit naïve).

> Being able to express the name in ASCII is also useful for further
> investigation and debugging.

Yes.

> >> PS: * For this purpose, it might even make sense to define
> >> Script-Compatible Encodings (SCEs) for scripts other than
> >> Latin/ASCII.
> >
> > I'd be interested in understanding what you have in mind.
> 
> ACEs map the full Unicode range onto characters from the Latin script
> (a subset of the best-known characters, of course). This not only
> helps machines that can't cope with non-Latin characters but also
> humans who don't know the original script.
> 
> SCEs would map the full Unicode range onto characters from a non-Latin
> script, helping users who know neither the original nor the Latin
> script. For example, müller.example.net, the ACE of which is
> xn--mller-kva.example.net, could also have SCEs such as
> χν--μλλερ-κβα.χω--εχαμφλε.χω--νετ for people that don't know the Latin
> but do know the Greek script.

Interesting.  It shouldn't be hard to define SCEs, and we could leave
the bootstring parameters and base character set to national standards
orgs.  But we'd need to make sure that SCEs do not leak onto the wire,
and therein lies the problem: if we can't prevent leaking of IDNs and
ACE into the wrong places then we likely won't be able to prevent
leakage of SCEs.  And SCEs would be IDNs from IDNA's p.o.v., so they'd
get re-encoded as ACE, adding more garbage/confusability.
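
The re-encoding problem can be seen with a small Python sketch.  It takes the Greek-script label from the quoted example and runs it through Python's built-in "punycode" codec by hand; a real IDNA implementation would also apply mapping and validation steps, which are skipped here.

```python
# The Greek-script SCE label from the quoted example.
sce_label = "χν--μλλερ-κβα"

# To IDNA this is just another non-ASCII label, so it would be turned
# into ACE again before going on the wire.
reencoded = "xn--" + sce_label.encode("punycode").decode("ascii")
print(reencoded)

# The SCE is still recoverable from the result, but what gets displayed
# in ACE form is now two encodings away from the original U-label.
roundtrip = reencoded[len("xn--"):].encode("ascii").decode("punycode")
print(roundtrip == sce_label)
```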

It may seem really unfair that everyone has to be familiar with the
basic Latin alphabet.  But right now I think that's the best we can do.

> > However, the conventional/historical way to do this is by
> > transliteration into Latin characters.
> 
> Transliterations are lossy. ACEs and SCEs are less comprehensible but
> allow a full reconstruction of the U-label.

Transliterations work well when they are chosen by the owner of the
non-ASCII name in question.
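
The lossy-versus-reversible distinction is easy to show in Python.  This is only a sketch: accent stripping stands in for transliteration (an owner named Müller might well prefer "mueller"), strip_accents is a hypothetical helper, and Python's built-in "idna" codec is used for the ACE round trip.

```python
import unicodedata

def strip_accents(label):
    """Crude transliteration: drop combining marks after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", label)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

for name in ["müller", "muller"]:
    ace = name.encode("idna").decode("ascii")
    back = ace.encode("ascii").decode("idna")
    # Two different names collapse to one transliteration,
    # while each ACE form decodes back to the name it came from.
    print(name, "| translit:", strip_accents(name), "| ACE:", ace, "| back:", back)
```

An automatic rule like this one cannot know which spelling the owner would have picked, which is why owner-chosen transliterations fare better.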

Nico
--