UTF-8

Shawn Steele Shawn.Steele at microsoft.com
Tue Jul 6 18:45:01 CEST 2010


>> Sometimes, you _do_ want ACEs to leak into the UI:
>> 1. Your user does not know the script. Displaying an ugly ACE
>> string is better than displaying some
>> known-to-be-unrecognisable characters.*
>> 2. You don't have the fonts. Displaying an ugly ACE string is
>> better than displaying "???????".

I somewhat disagree.  To most users, xn--qwerty is no different than xn--querty.  In that case, EVERYTHING is confusable, which is no different from your case #2.
(In case #1 the user can probably still notice at least a few differences).

> While certainly not the preferred form for human consumption, the ACE
> form does enable some basic usage patterns such as noting the name
> down on paper, reading it aloud over the phone, etc. These tasks are
> simply not possible with an unfamiliar script or ersatz characters.

Possibly true, but is that at all helpful?  If I can't read the script of a web site's domain name, then very likely that web site is unusable to me, so it matters very little whether I can transcribe it.  If your web site is useful to me, you probably have a CNAME, DNAME, or something else with an ASCII address or an address in another script I can read.  Even if I'm collecting data for third-party users, like maybe an index of all the breweries I can find, I'd likely still want to publish the URLs in a script targeted at the end users who can use the web site.

> Gibberish out of known characters (a subset of the Latin script) is
> still better than gibberish that consists of unknown or
> indistinguishable characters.

Just barely.

> I doubt it be better. Big warnings have problems of their own,
> especially when there are too many false alarms.

> Falling back to ACE is more neutral: It does not claim that anything
> be malicious, it simply disables IDN display and thus enables the user
> to see the difference between domain names. While paypal.com and
> payрal.com look similar, paypal.com and xn--payal-xye.com don't.

Huh, treating everything as an ACE makes everything a false alarm.  By displaying Cyrillic as Punycode, you're saying "Hey, we don't trust Russians, and you shouldn't either".  And it does nothing about paypal.safe.com, which is just as likely to catch most users.
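
To make the quoted paypal example concrete, here is a minimal Python sketch.  It assumes the standard library's built-in "idna" codec, which implements the older IDNA 2003 rules, so treat it as illustrative rather than as what any particular browser does.  The helper name to_ace is hypothetical.

```python
def to_ace(domain: str) -> str:
    """Return the ASCII-compatible (ACE) form of a domain name.

    Uses Python's built-in "idna" codec (IDNA 2003); hypothetical helper.
    """
    return domain.encode("idna").decode("ascii")


latin = "paypal.com"
mixed = "pay\u0440al.com"      # U+0440 CYRILLIC SMALL LETTER ER in place of "p"
subdomain = "paypal.safe.com"  # all ASCII, so ACE display changes nothing

print(to_ace(latin))      # paypal.com
print(to_ace(mixed))      # xn--payal-xye.com
print(to_ace(subdomain))  # paypal.safe.com
```

The mixed-script name becomes visibly different in ACE form, while the all-ASCII paypal.safe.com passes through unchanged.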

> Being able to express the name in ASCII is also useful for further
> investigation and debugging.

> SCEs would map the full Unicode range onto characters from a non-Latin
> script, helping users who know neither the original nor the Latin
> script. For example, müller.example.net, the ACE of which is
> xn--mller-kva.example.net, could also have SCEs such as
> χν--μλλερ-κβα.χω--εχαμφλε.χω--νετ for people that don't know the Latin
> but do know the Greek script.

That's an interesting usability exercise, but what's it got to do with IDN?  If I need to encode things in other scripts, a web address is about the least likely thing I need to encode.  Web sites I just click on.  Email I just reply to.  A much larger problem is when I'm supposed to call someone and I have no idea how to read the script their name is in.

> > However, the conventional/historical way to do this is by
> > transliteration into Latin characters.

> Transliterations are lossy. ACEs and SCEs are less comprehensible but
> allow a full reconstruction of the U-label.

Transliteration's pronounceable.  And if a domain name is useful in the transliterated script, then the site should register another name.  ACE-encoding a Cyrillic URL isn't going to allow me to use a Russian web site.  That requires the site to be translated, at least a little bit.
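
For reference, the lossy-versus-reversible trade-off in the quoted text can be sketched in Python.  This assumes the standard library's "punycode" codec for the ACE side, and uses accent stripping as a crude stand-in for real transliteration; strip_accents is a hypothetical helper, not a proper transliteration scheme.

```python
import unicodedata


def strip_accents(label: str) -> str:
    """Crude stand-in for transliteration: drop combining marks after NFKD."""
    decomposed = unicodedata.normalize("NFKD", label)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


u_label = "m\u00fcller"  # "müller", with U+00FC LATIN SMALL LETTER U WITH DIAERESIS

# ACE form: "xn--" prefix plus the Punycode encoding of the label.
a_label = "xn--" + u_label.encode("punycode").decode("ascii")
print(a_label)  # xn--mller-kva

# Decoding the Punycode part recovers the original U-label exactly.
restored = a_label[4:].encode("ascii").decode("punycode")
print(restored == u_label)  # True

# The accent-stripped form is readable but cannot be reversed:
# "müller" and "muller" both collapse to the same string.
print(strip_accents(u_label))   # muller
print(strip_accents("muller"))  # muller
```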

Just my own 2 cents,

-Shawn

