Mapping?

Shawn Steele Shawn.Steele at microsoft.com
Wed Dec 2 20:10:57 CET 2009


I like that you're trying to clarify the thinking, but I reach the same conclusion that a standardized mapping is necessary.

You are brainstorming about typed in links vs followed links.  IF we could know that "followed links" were already in a canonical form, that distinction might be interesting.  Unfortunately I don’t think that distinction can be made since many editors don't "correct" the typed in links that web page authors create.  (Including wikis, blog tools, etc.)  Even worse, many "links" aren't even really links, but rather they're the client UI trying to be smart about an address it saw in a document or email, in which case the distinction is lost.

I'm also pretty sure that it won't help much for "typed in" names.  If we could guarantee the users' expectations, then perhaps; however, there are several factors that make variations on typed-in names difficult.

* I might just happen to be using a different computer, like a kiosk.  In that case I may not know the local rules and the machine may not be configured with my expectations.  Additionally "my" locale might not even be installed, even if the kiosk was configurable or asked me where I was from.  This is any public scenario like an airport, hotel or library.

* I had to get the URL from somewhere.  The advertisement I saw may not know my locale limitations.  Specifically for eszett, consider a billboard near the Swiss-German border.  If a Swiss customer sees it, their machine may map ß to ss, however a German customer's machine might keep it as is.  (Obviously if that mapping was illegal that wouldn't happen, but maybe it's a billboard "MICROSOFT" in Turkey (hopefully our ad people would know the difference, but that's not the point, most won't, and even our ad people don't know the guts of IDNA)).

* You could exchange the name through another mechanism.  I could be trying to type it off your letterhead or a bill from an import/export firm that crosses locale boundaries.

* In short, even for typed-in names, a "locale sensitive" operation works only for locale-specific data.  In a global context/operation, the locale expectations break when boundaries are crossed.
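The billboard cases above can be sketched roughly as follows.  This is only an illustration: keep_eszett_ace and turkish_lower are hypothetical helpers standing in for "a client with different locale rules", the names are made up, and the only real mapping used is Python's built-in "idna" codec, which implements IDNA2003.

```python
# Sketch only: two hypothetical clients apply different rules to the same
# name read off a billboard, then convert it to its ASCII form.

def keep_eszett_ace(label: str) -> bytes:
    """Hypothetical client that keeps ß as-is and Punycode-encodes the label."""
    return b"xn--" + label.encode("punycode")

def turkish_lower(name: str) -> str:
    """Hypothetical Turkish-rules lower-casing: I -> dotless ı, İ -> i."""
    return name.replace("I", "\u0131").replace("\u0130", "i").lower()

# Eszett: the built-in "idna" codec (IDNA2003) maps ß to ss.
mapped = "straße".encode("idna")      # client that maps ß to ss
kept = keep_eszett_ace("straße")      # client that keeps ß

# Turkish casing: the same capitals lower-case to different strings.
default_name = "MICROSOFT.example".lower().encode("idna")
turkish_name = turkish_lower("MICROSOFT.example").encode("idna")

# Each pair differs, so the two clients would look up different names.
print(mapped, kept)
print(default_name, turkish_name)
```

In both pairs the two clients end up asking the DNS for different names, which is the boundary-crossing problem described above.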

The previous part, I think, is more about mapping in general than for eszett.  

> One slightly more solid question for browsers is, would it be entirely crazy to have different mapping algorithms for typed-in domain names than for links followed?  

As above, I think it would be well-intentioned, but cause undue chaos.  (A qualified "entirely crazy" :)

> how often do we know the locale where the link was authored?

Ask the search guys (I'm not in touch with that data).  My guess is very rarely.  Additionally it's probably often wrong.  (If I enter a comment on a German blog, then I was thinking English, not German, but the server can't tell that.  Also blogs.msdn.com seems to have Cyrillic on it.  I don't have a config page to specify my locale, so my guess is those Cyrillic pages are treated as en-US.)
 
> Do any authoring software clients fix up links as the user types?

I'm sure some do, however the problem is that a lot do not.

> When I type a link in a document, the authoring software often makes that link active.  Is there any software that automatedly lower-cases?

Some.  Specifically IE displays the domain in lower case.  This is intentional, although it might use the IDNA2003 APIs to get to that point.  It's also probably a problem for some cases, like CamelCased Greek where the l was supposed to be a final sigma.  (Not trying to say what should be done there, just pointing out it's a problem.  One thought would be for to bundle just in case someone types in all caps.)
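As a rough illustration of the final sigma issue, here is a small sketch using Python's built-in "idna" codec (IDNA2003).  The label is made up, and this is not meant to describe what IE actually does; it only shows that a context-aware lower-casing produces a final sigma while the IDNA2003 mapping folds it back to the ordinary one.

```python
# Sketch only: context-sensitive lower-casing vs. the IDNA2003 mapping.
label = "ΟΔΟΣ"            # all caps, as a user might type it

# Python's str.lower() applies the final-sigma rule to the last letter.
lowered = label.lower()

# The IDNA2003 mapping folds both Σ and ς to σ, so both spellings
# convert to the same ASCII form and the distinction is lost.
ace_upper = (label + ".example").encode("idna")
ace_lower = (lowered + ".example").encode("idna")

# Converting back for display yields the ordinary sigma, not the final one.
round_trip = ace_lower.decode("idna")

print(lowered, ace_upper, ace_lower, round_trip)
```

So software that displays the converted name shows a different spelling from the one a careful author would have written.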

>  If so, would such software also be likely to map to PVALID characters before the doc is finished?

I would expect that 'good' authoring tools would allow this, however the problem is likely that there are a lot of tools (maybe even most?) that don't provide that kind of behavior.

-Shawn

-----Original Message-----
From: Lisa Dusseault [mailto:lisa.dusseault at gmail.com] 
Sent: ,  02,  2009 7:07
To: Shawn Steele
Cc: Eric Brunner-Williams; Andrew Sullivan; idna-update at alvestrand.no
Subject: Re: Mapping?

I'd like to try to unpack some of the different use cases we're talking about a little more.

ISTM that use cases where the person following the link is the person who is typing it in are the ones where locale-dependent mapping might be most useful.  If I'm in a locale where Ȱ (x230) is considered to be the capitalized version of o (ASCII o), it might very well be most helpful to make that mapping.  Use cases where the same user types in the domain names and then looks them up include:
 - typing links in the address bar
 - typing a mail address in the To field of an email
 - Writing a Web page, blog post or email, wherein I check that the links work before posting/sending my document

In contrast, when the person looking up the domain FȰȰ.example is not the person who typed it in, in most cases we no longer know the intent or locale of the person who typed in the domain.  It may be the same locale as the person who is looking up the domain but it may not be.  The person who typed in the domain may have intended fȱȱ.example or foo.example, and may have tested that before sending/posting the link, but we no longer have that information.  Use cases include:
 - Following an HTTP link in any Web page, document, blog post, email, etc
 - Using a mailto link (explicit or implicit), e.g. when one person sends me another person's email address

We probably would all agree that people follow links while Web browsing far more often than they type them in, and even when typing in, auto-complete probably drastically reduces the new cases of from-scratch mapping and lookup.

However, we probably have quite different assumptions about how much Internet activity takes place among users of a consistent locale.  Can we assume that Patrik wants ß interpreted as ß because he communicates mostly in Swedish with Swedish users and mostly reads Swedish Web pages?  Or must we assume that Patrik also gets email from German and Swiss senders, and also reads Web pages (perhaps in English!) written by German users who expected different mappings?  I am sure this depends heavily on our model of a user, and whether we're using ourselves as hypothetical examples or not.

One slightly more solid question for browsers is, would it be entirely crazy to have different mapping algorithms for typed-in domain names than for links followed?  There might be a locale-dependent mapping as well as a global mapping.  (I assume that having every established locale mapping installed would be complete craziness.)

Another question is: when posted links are followed, how often do we know the locale where the link was authored?  Not that the browser following the link would necessarily be able to apply the mappings of the locale in which it was authored, but would it be slightly better to apply a global mapping than a mapping from a different locale?

Do any authoring software clients fix up links as the user types?
When I type a link in a document, the authoring software often makes that link active.  Is there any software that automatedly lower-cases?
 If so, would such software also be likely to map to PVALID characters before the doc is finished?

Lisa

On Tue, Dec 1, 2009 at 12:45 PM, Shawn Steele <Shawn.Steele at microsoft.com> wrote:
>
> One example I discussed with Patrik yesterday, was whether locale 
> might affect mapping. I'd like to get better insight into the general 
> understanding of that.
>
>> 1. Could locale determine whether a PVALID character should be mapped 
>> into another PVALID character prior to following the rules to turn 
>> into an ALABEL?  I believe the consensus answer is probably SHOULD 
>> NOT or MUST NOT because that would make domains with that valid 
>> character unreachable by software using those locale rules.
>
> I agree.
>
>> 2. Could locale determine whether, or how, a DISALLOWED character is 
>> mapped into a PVALID character prior to getting an ALABEL?
>
> No, for several reasons:
>
> A) If I email you a link that contains a DISALLOWED character, your machine/environment MUST map it to the same thing my machine did.  Otherwise I say "you have funny charges from travelling, visit Bank.org to correct it."  You are trying to pay for your flight home so you type "Bank.org" into the computer in the kiosk in the foreign airport, and if it uses different mapping rules you could end up at a phishing site.  You don't want VISA.com to go to a vısa.com just because you're using a Turkish airport browser.
>
> B) If I travel myself, I need consistent behavior regardless of the machine I'm using.
>
> C) If I see an international advertisement, the domains need to go to the same server, regardless of who and how and where the person is typing in the link.
>
> D) A server or relay wouldn't necessarily know the context the user expected when interpreting a forwarded request.
>
> E) It'd be a support nightmare.
>
> F) I'm not sure if it is practical to create APIs that enable this 
> distinction.  (We (software community, not just my company) already 
> have problems selecting the correct locale specific behavior for 
> sorting and formatting, etc., so we'd be bound to get it wrong at 
> least some of the time.)
>
> -Shawn
>


