Browser IDN display policy: opinions sought

"Martin J. Dürst" duerst at it.aoyama.ac.jp
Wed Dec 14 12:02:01 CET 2011


On 2011/12/12 19:54, Gervase Markham wrote:
> On 09/12/11 18:10, Mark Davis ☕ wrote:
>> I'm not familiar with the code, but I think that (A) may actually be:
>>
>> A (IE, Chrome): Unicode if the (single) 'script' of the string matches
>> one of the scripts of the user's language(s) in the options,
>> Punycode otherwise.
>>
>> It is pretty easy and reliable to detect the script of the string,
>> whereas language detection would be unreliable.

I have to correct myself. In another mail, I wrote that I was quite
sure that Mark's correction applied. But after playing around with IE,
I found that this may be only partially the case.

I looked at http://www.viagénie.com/ in IE (IE8 on Win7), and it showed
punycode. I then added "en" (English) to my language preferences (which
were just "ja" (Japanese) out of the box because I rarely use IE).
viagénie was still shown in punycode. Then I added "de" (German), and
now viagénie was shown in Unicode. So either IE uses a separate "script"
category "ASCII-only" for English (in which case the algorithm would
still be script-oriented at its core), or the set of letters for a
language is taken rather widely, with German including French accented
letters and so on (which would be a language-only algorithm).
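
To make the first hypothesis concrete, here is a rough sketch in Python.
It is not IE's code and I don't claim it matches IE's algorithm. The
table LANG_CATEGORIES is invented purely to reproduce the three
observations above, and rough_category() guesses a script from the
first word of the Unicode character name, which is only a stand-in for
the real Unicode Script property. As a simplification, a label is shown
in Unicode when every character falls into a category allowed by at
least one configured language.

```python
import unicodedata

# Hypothetical table, invented to reproduce the observations in this
# mail. It is NOT taken from IE or from any published data.
LANG_CATEGORIES = {
    "ja": {"ASCII", "Hani", "Hira", "Kana"},
    "en": {"ASCII"},
    "de": {"ASCII", "Latn"},
}

# Rough heuristic: first word of the character name -> script code.
# A real implementation would use the Unicode Script property instead.
NAME_PREFIX_TO_SCRIPT = {
    "LATIN": "Latn",
    "CJK": "Hani",
    "HIRAGANA": "Hira",
    "KATAKANA": "Kana",
}


def rough_category(ch):
    """Return "ASCII" for ASCII characters, else a guessed script code."""
    if ch.isascii():
        return "ASCII"
    words = unicodedata.name(ch, "").split()
    first_word = words[0] if words else ""
    return NAME_PREFIX_TO_SCRIPT.get(first_word, "Unknown")


def display_form(label, languages):
    """Return the label as Unicode if allowed, else its ASCII (xn--) form."""
    allowed = set()
    for lang in languages:
        allowed |= LANG_CATEGORIES.get(lang, set())
    if all(rough_category(ch) in allowed for ch in label):
        return label
    # Python's built-in "idna" codec produces the xn-- (punycode) form.
    return label.encode("idna").decode("ascii")


for langs in (["ja"], ["ja", "en"], ["ja", "en", "de"]):
    print(langs, display_form("viag\u00e9nie", langs))
```

With ["ja"] and ["ja", "en"] the sketch falls back to the xn-- form, and
only after "de" is added does it return the label unchanged, which is
the pattern I saw. A language-only algorithm with wide per-language
letter sets would produce the same three results, so this experiment
alone cannot tell the two apart.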

Michel, if you know any details (that you can talk about), it would be 
nice to hear from you.

When showing punycode, IE also displayed a one-line message just above
the page itself and below the chrome (tabs and stuff), saying
(translating back from Japanese) "This Web address contains letters or
symbols that cannot be displayed with the current language settings. If
you click here, options will be displayed...". When I clicked, I got the
options of changing my language settings, of not displaying the message
anymore, or of getting some further explanations or help.


> I can quite believe it may be something like this; but how does one deal
> with the impedance mismatch that users think they are defining
> languages, but what you need is scripts? Does IE keep a script/language
> mapping? Is that data (perhaps compiled by others) publicly available
> somewhere, e.g. from the Unicode consortium?

Some of the data is in the suppress-script fields in the language subtag 
registry at IANA. At 
http://www.iana.org/assignments/language-subtag-registry, if you see 
something like:

%%
Type: language
Subtag: af
Description: Afrikaans
Added: 2005-10-16
Suppress-Script: Latn
%%

then Suppress-Script: Latn tells you that Afrikaans is, for all intents 
and purposes, written with the Latin script. This information isn't 
complete (given the number of languages in the subtag registry, that 
shouldn't be a surprise), but I'd say it's highly accurate where it's 
there, and it's there for most of the major languages for which it can 
be reasonably provided.
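
As an illustration of how this field could be pulled out, here is a
minimal Python sketch that reads records in the format shown above and
builds a language-to-script table. The function names parse_records()
and suppress_scripts() are my own, and the sample text is just the
Afrikaans record from above. For real use you would read the full
registry file from the IANA URL, and you would still need another
source for languages that have no Suppress-Script field.

```python
SAMPLE = """\
%%
Type: language
Subtag: af
Description: Afrikaans
Added: 2005-10-16
Suppress-Script: Latn
%%
"""


def parse_records(text):
    """Split registry text into records; each field maps to a list of values."""
    records, current, last_key = [], {}, None
    for line in text.splitlines():
        if line.strip() == "%%":
            # "%%" separates records.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif line[:1].isspace() and last_key:
            # A line starting with whitespace continues the previous field.
            current[last_key][-1] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            last_key = key.strip()
            # Fields such as Description may occur more than once.
            current.setdefault(last_key, []).append(value.strip())
    if current:
        records.append(current)
    return records


def suppress_scripts(records):
    """Map language subtag -> script for records with Suppress-Script."""
    return {
        rec["Subtag"][0]: rec["Suppress-Script"][0]
        for rec in records
        if rec.get("Type") == ["language"]
        and "Subtag" in rec
        and "Suppress-Script" in rec
    }


print(suppress_scripts(parse_records(SAMPLE)))
```

For the sample record this gives a table with the single entry
af -> Latn.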

For character coverage needed for a language, CLDR (the Unicode Common 
Locale Data Repository, http://cldr.unicode.org) provides quite a lot of 
data to work with, although you may want to have a closer look or talk 
with somebody more familiar with the data and processes before you work 
on a particular application.

While I'm mentioning data sources, I also want to point to
http://www.unicode.org/reports/tr36/, Unicode Security Considerations,
and http://www.unicode.org/reports/tr39/, Unicode Security Mechanisms,
and the data sources mentioned there. I'm very surprised that nobody has
mentioned them, because I think they are extremely relevant and helpful
for our discussion and for actual implementations.

Regards,   Martin.

