Stop me if I've misunderstood...

Shawn Steele Shawn.Steele at microsoft.com
Fri Jul 10 18:48:10 CEST 2009


I guess you read more into my camel-cased example than I really meant.  Of course if it doesn't work, marketing won't use it.  The IDNA2003 mappings do work, though, and someone probably uses them.  Make A-Z illegal and I'll consider the "mapping isn't needed" argument.  (Yeah, I know it's "different", but to the end user it's not.  WE get blamed when English "works" but Russian doesn't.)

I'm less concerned about Q and O looking alike to a non-Latin-script user.  It doesn't matter what Chinese character it is, I can't type any of them.  I can follow a link mailed to me in Chinese, though.  I'm more worried about script-literate users having predictable, "reasonable" mappings and somewhat expected behavior (I realize it can't be perfect).

To clarify my unclear definition of when I think mappings should happen:  Mapping MUST NOT happen when the DNS client is querying the DNS server for a name (the protocol specifies Punycode there).  Mapping SHOULD/MAY happen when an application calls GetAddrInfoW() (or the equivalent), which is the Windows API that sends that query to the DNS server.  (Which is why I've been avoiding saying that mapping shouldn't happen at the DNS layer; it should happen on the client side of the request.)
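
As a rough illustration of that split (a sketch only, not how GetAddrInfoW() or any real resolver is implemented), the Python below separates a client-side step that is allowed to map from a wire-side step that refuses to. The function names map_for_lookup and wire_query_name are hypothetical. The only library behavior relied on is Python's built-in "idna" codec, which implements IDNA2003 (nameprep plus Punycode) for labels containing non-ASCII characters.

```python
def map_for_lookup(name: str) -> str:
    """Client-side step (hypothetical name): mapping MAY happen here.

    Python's built-in "idna" codec applies the IDNA2003 mappings
    (case folding, NFKC) to non-ASCII labels and Punycode-encodes them.
    """
    return name.encode("idna").decode("ascii")


def wire_query_name(name: str) -> bytes:
    """Wire-side step (hypothetical name): mapping MUST NOT happen here.

    Anything that is not already plain ASCII (i.e. already in A-label
    form) is rejected rather than silently mapped.
    """
    if not name.isascii():
        raise ValueError("wire names must already be in A-label (Punycode) form")
    return name.encode("ascii")


# The application hands over whatever the user typed or clicked ...
a_label = map_for_lookup("BÜCHER.example")
print(a_label)  # xn--bcher-kva.example

# ... and only the already-mapped form reaches the query layer.
print(wire_query_name(a_label))
```

With this arrangement every application that goes through the client-side call gets the same mapping, while the query itself only ever carries the Punycode form.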

You indicated that you don't want UI applications to ignore rules.

I believe that the only IDNA2003 rule that IE "breaks" is by displaying Punycode in the address bar and/or blocking easy navigation if the user has not configured that script.  I wouldn't really call this breaking, as a user can fix it, but I personally would have chosen a different approach (make it red or something).  Additionally, IE blocks phishing and other malicious web sites, but that happens to ASCII addresses as well.  The user can easily add additional http-accept-lang languages for languages they can read, in which case they won't see Punycode for those scripts.

The way for rules to not be ignored is to have good rules :)  Entire RFCs are ignored when they don't have enough industry support.  Certainly if you specify that mappings can only happen on user-initiated events (like a click or typing), and not when parsing an href, then the browsers/crawlers will ignore the rules.  Nobody's really said that, but it seems that several vendors' representatives have made that fairly clear by their positions.  Mapping in GetAddrInfoW() will impact pretty much all Windows applications.

The 3 things that seem at risk of being ignored in the current drafts are:  1) several vendors have stated their intent to support IDNA2003 mappings for back-compat reasons (at least when they don't conflict with the IDNAbis code point changes), 2) several vendors have indicated that they require flexibility in where mappings are applied, and 3) several vendors have also indicated that consistent mappings are a necessity.

Those could probably be met by supporting ONLY the 2003 mappings and not adding new ones, but I don't see how that helps your no-mapping goal.

-Shawn

________________________________________
From: John C Klensin [klensin at jck.com]
Sent: Friday, July 10, 2009 8:57 AM
To: Shawn Steele; Gervase Markham; idna-update at alvestrand.no
Subject: RE: Stop me if I've misunderstood...

--On Thursday, July 09, 2009 21:13 +0000 Shawn Steele
<Shawn.Steele at microsoft.com> wrote:

>
>> I don't think IDNA2008, with or without the most recent
>> proposals, changes that property.  The main thing IDNA2008
>> does that is different from IDNA2003 is to strongly
>> discourage any string that requires mapping from those
>> adverts.
>
> That's not gonna happen.  Burger King isn't going to write
> "haveityourway.com" on the side of the bus, it's gonna be
> "HaveItYourWay.com".  Sure, mapping in ASCII is free, but
> there's a need for mapping in non-ASCII contexts as well.
> Specifying or recommending something we know is going to be
> ignored is bad.  A) it encourages people to interpret the
> standard how they see fit, and B) developers can't count on
> the language because they know it'll be ignored.

And you are reasoning from analogies that may not hold up.
Before I try to explain that (following up part of Elizabeth's
note), I want to stress that...

First of all, our role here is to make things work well and
predictably, with catering to the inclinations of various
marketing and branding departments (Burger King or otherwise) a
secondary goal at best.  For whatever it is worth, we get more
predictability when we have fewer variations in what is
possible.   Probably we all believe the latter; the question is
how it should properly interact with the user experience.  That
is not an easy question and I don't think that hyperbole (from
either side) or games about who has to prove what moves us
forward.

Second, my experience with marketing people is that, while they
would like a perfect world in which every campaign was
successful and competitors were stupid and ineffective, and will
make all sorts of demands in the hope of realizing one or both,
they are ultimately very pragmatic.    If one is faced with a
choice between "haveityourway.com" or "have-it-your-way.com",
either of which works 100% of the time, and "HaveItYourWay.com"
that works only fairly often, I know that they --or at least the
subset who expect to survive in the business-- will pick one of
the first two.   I note that, as far as the DNS is concerned
"Have It Your Way.com" is a perfectly valid domain name.

While ICANN rules prohibit names with embedded blanks at the
second level, just as they prohibit raw, non-ASCII, UTF-8, I
assume that I'm not the only one here who has had to listen to
some marketing type complain that "Our Favorite Slogan" could
not be used as a domain name and that "we" had to be smart
enough to make it happen and just weren't trying hard enough.
The response is to explain that

    Our Favorite Slogan.MyCompany.com

is actually a valid domain name that they are welcome to use if
they like, they just wouldn't find that it was very useful in
practice.   Each time I've had that conversation --and there
have been several times-- there has been much complaining but,
eventually, there has been no insistence on domain names with
embedded spaces.

Now, coming back to your example, we have to realize how
culturally- and historically-sensitive this is.  A decision was
made in the early 1970s that names of hosts and networks were
going to be treated case-insensitively.  At the time, that
decision had very little to do with user experiences: we had
hosts that really couldn't handle lower case, hosts that could
but treated the two cases as globally equivalent, and hosts that
were case-sensitive but on which upper case was considered a
little strange.   Case-insensitive identifiers seemed to be the
way to go.  A decade later, that decision was carried forward
into the DNS world without, if I recall, a lot of thought or
discussion, largely because, by then, it had been embedded into
a number of application protocols.  Had the original decision
been made differently -- either to treat identifiers as
case-sensitive or to prohibit one case or the other-- we
probably would be having a different discussion today (not
necessarily an easier one, but different).

Second, the way in which one gets the equivalent of
"HaveItYourWay" in German is traditionally to make a new word,
"haveityourway", with no capital letters in the middle.  If one
wants to maintain distinct word-components, one uses spaces or
maybe hyphens.   There are, in principle, two ways to do it in
Arabic -- the use of initial-form and final-form characters to
denote boundaries or the use of ZWNJ.  But we've been told by
Unicode experts that initial, final, isolated, and medial forms
should all match and the Arabic language community has been
reasonably clear that they do not want or need ZWNJ for writing
the Arabic language.

So I wouldn't generalize much from "Have It Your Way" (with or
without spaces).

> I'm not saying that the U-label form shouldn't be encouraged
> in the bowels of the system, that'd clearly be good.  I am
> saying that anything potentially user facing shouldn't have
> this recommendation.  Especially if "marketing" is going to
> have a voice ;-)

I think maybe we agree, but I'm not sure which "this
recommendation" you are referring to, partially because, as you
and others have pointed out, "user facing" is not itself
unambiguous.

Because of the greater distinguishability of lower case
characters and because it makes reverse-mapping work out, I would
tend to recommend that those who are more worried about
precision and avoidance of attacks based on recognition of
characters stick with U-labels and hence with lower case.  I
would not require that in UIs, but I would probably recommend it
to both advertisers and users.

Where the design questions get controversial, and despite many
concerns, I'd encourage people who are designing highly
localized UIs to consider forgoing case mapping (and to present
lower case) where the community involved was extra-vulnerable to
confusion in scripts with which they were not familiar enough to
easily do the case conversions without looking (e.g., "Q" and
"q" may look alike to you or me, but, to someone with very low
familiarity with Latin scripts and fonts, "Q" might look a lot
more like "o" than it does like "q").    That is a tradeoff with
the principle that anyone who types a given string should get
the same interpretation as anyone else who types that string,
but it may be worth pointing out that, if familiarity with Latin
characters is low enough for my suggestion to apply, there
probably are no Latin characters on the keyboard, so the same
string is _not_ being typed as would be typed by someone with a
Latin-based keyboard.    I don't think we should be trying to
make the decisions involved in this, or in forcing one
particular UI behavior, in the protocol -- partially because I'm
convinced that, after a few bad experiences, we will find UI
software ignoring any rules we write in favor of protecting
users (either by reducing the amount of mapping that is done or
by insisting on user entry of A-labels for labels in unfamiliar
scripts).

To turn that same comment around, I'd think that the designers
of any localized UI that is expected to be used in locales with
Latin-based scripts, or scripts that have variant-width
characters in Unicode, would be nuts not to make the obvious
mappings.  Clearly the spec permits that.
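
One possible reading of "the obvious mappings" for such a UI is case mapping plus width mapping. The sketch below is only an approximation under that assumption: obvious_mappings is a hypothetical helper, and NFKC followed by lower() is not the exact IDNA2003 nameprep table (the two differ for some characters), so treat it as an illustration rather than a conforming implementation.

```python
import unicodedata


def obvious_mappings(label: str) -> str:
    """Hypothetical UI-side helper: fold width variants, then lower-case.

    NFKC replaces fullwidth and halfwidth compatibility characters with
    their ordinary forms; lower() then maps upper case to lower case.
    """
    return unicodedata.normalize("NFKC", label).lower()


# Fullwidth Latin letters, as typed from some East Asian input methods.
print(obvious_mappings("ＨａｖｅＩｔＹｏｕｒＷａｙ"))

# Plain mixed-case ASCII takes the same path.
print(obvious_mappings("HaveItYourWay"))
```

Both calls yield the same lower-case ASCII string, which is the kind of predictability a user typing on a Latin-based or variant-width keyboard would expect.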

If you can suggest a better way to make this clear, I'm
listening and I assume that Pete and Paul are too.

    john

