Browser IDN display policy: opinions sought
"Martin J. Dürst"
duerst at it.aoyama.ac.jp
Wed Dec 14 12:02:01 CET 2011
On 2011/12/12 19:54, Gervase Markham wrote:
> On 09/12/11 18:10, Mark Davis ☕ wrote:
>> I'm not familiar with the code, but I think that (A) may actually be:
>>
>> A (IE, Chrome): Unicode if the (single) 'script' of the string matches
>> one of the scripts of the user's language(s) in the options,
>> Punycode otherwise.
>>
>> It is pretty easy and reliable to detect the script of the string,
>> whereas language detection would be unreliable.
I have to correct myself. In another mail, I wrote that I was quite
sure that Mark's correction applied. But after playing around with IE,
I found that this may only partially be the case.
I looked at http://www.viagénie.com/ in IE (IE8 on Win7), and it showed
punycode. I then added "en" (English) to my language preferences (which
were just "ja" (Japanese) out of the box because I rarely use IE).
viagénie was still shown in punycode. Then I added "de" (German), and
now viagénie was shown in Unicode. So either IE uses a separate "script"
category "ASCII-only" (but the algorithm would still be script-oriented
at the core), or the letters for a language are taken rather widely,
with German including French accented letters and so on (which would be
a language-only algorithm).
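To make the script-oriented reading concrete, here is a minimal Python
sketch of policy (A) as Mark described it. It is not IE's or Chrome's
actual code. The LANGUAGE_SCRIPTS table and the function names are
hypothetical, and the script of a character is only approximated by the
first word of its Unicode character name, which is not the real Unicode
Script property.

```python
import unicodedata

# Hypothetical language-to-script table, for illustration only.
LANGUAGE_SCRIPTS = {
    "en": {"LATIN"},
    "de": {"LATIN"},
    "ja": {"HIRAGANA", "KATAKANA", "CJK"},
}

def label_scripts(label):
    """Approximate the set of scripts used in a label.

    Assumption: the first word of the Unicode character name stands in
    for the script. Digits and hyphen are treated as script-neutral.
    """
    scripts = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue
        name = unicodedata.name(ch, "")
        scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts

def display_form(label, user_languages):
    """Return the label in Unicode if its single script is allowed,
    otherwise its ASCII (punycode) form."""
    allowed = set()
    for lang in user_languages:
        allowed |= LANGUAGE_SCRIPTS.get(lang, set())
    scripts = label_scripts(label)
    if len(scripts) == 1 and scripts <= allowed:
        return label
    # Python's built-in "idna" codec implements IDNA 2003 ToASCII.
    return label.encode("idna").decode("ascii")

print(display_form("viagénie", ["ja"]))
print(display_form("viagénie", ["ja", "en"]))
```

Under this pure script-matching sketch, adding "en" is already enough to
show viagénie in Unicode. That is not what IE8 did in the experiment
above, which is why a separate "ASCII-only" category or a per-language
letter set seems to be involved.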
Michel, if you know any details (that you can talk about), it would be
nice to hear from you.
When showing punycode, IE also displayed a one-line message just above
the page itself and below the chrome (tabs and stuff), saying
(translating back from Japanese) "This Web address contains letters or
symbols that cannot be displayed with the current language settings. If
you click here, options will be displayed...". When clicking, I get the
options of changing my language settings, of not displaying the message
anymore, or of getting some further explanations or help.
> I can quite believe it may be something like this; but how does one deal
> with the impedance mismatch that users think they are defining
> languages, but what you need is scripts? Does IE keep a script/language
> mapping? Is that data (perhaps compiled by others) publicly available
> somewhere, e.g. from the Unicode consortium?
Some of the data is in the suppress-script fields in the language subtag
registry at IANA. At
http://www.iana.org/assignments/language-subtag-registry, if you see
something like:
%%
Type: language
Subtag: af
Description: Afrikaans
Added: 2005-10-16
Suppress-Script: Latn
%%
then Suppress-Script: Latn tells you that Afrikaans is, for all intents
and purposes, written with the Latin script. This information isn't
complete (given the number of languages in the subtag registry, that
shouldn't be a surprise), but I'd say it's highly accurate where it's
there, and it's there for most of the major languages for which it can
be reasonably provided.
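As a rough illustration of how this data could be extracted, the
following Python sketch builds a language-to-script map from text in
the registry's record format. The SAMPLE string is just the Afrikaans
record quoted above, and the function names are made up for this
example. A real application would read the full registry file and
would need to handle its folded continuation lines more carefully than
this sketch, which simply skips them.

```python
SAMPLE = """\
%%
Type: language
Subtag: af
Description: Afrikaans
Added: 2005-10-16
Suppress-Script: Latn
%%
"""

def parse_records(registry_text):
    """Yield one dict of fields per '%%'-separated record."""
    fields = {}
    for line in registry_text.splitlines():
        if line.strip() == "%%":
            if fields:
                yield fields
            fields = {}
        elif line.startswith((" ", "\t")) or ":" not in line:
            continue  # skip folded continuation lines and blank lines
        else:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    if fields:
        yield fields

def suppress_scripts(registry_text):
    """Map language subtags to their Suppress-Script value, if any."""
    result = {}
    for record in parse_records(registry_text):
        if record.get("Type") == "language" and "Suppress-Script" in record:
            result[record["Subtag"]] = record["Suppress-Script"]
    return result

print(suppress_scripts(SAMPLE))
```

Languages without a Suppress-Script field are simply absent from the
resulting map, so a caller has to decide what to do for them.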
For character coverage needed for a language, CLDR (the Unicode Common
Locale Data Repository, http://cldr.unicode.org) provides quite a lot of
data to work with, although you may want to have a closer look or talk
with somebody more familiar with the data and processes before you work
on a particular application.
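The language-oriented alternative could be sketched as a per-character
coverage check, shown here in Python. The letter sets below are
hand-written placeholders, not CLDR data. The "de" set deliberately
includes a few French accented letters, purely to model the "letters
taken rather widely" hypothesis from the IE experiment. In practice the
sets would have to come from a source such as CLDR.

```python
# Placeholder letter sets, NOT taken from CLDR. The "de" entry is
# intentionally wide to model the second hypothesis above.
PLACEHOLDER_LETTERS = {
    "en": set("abcdefghijklmnopqrstuvwxyz"),
    "de": set("abcdefghijklmnopqrstuvwxyzäöüß") | set("éèêàâç"),
}

def covered(label, user_languages):
    """True if every character of the label is in the letter set of at
    least one of the user's languages (digits and hyphen always pass)."""
    allowed = set("0123456789-")
    for lang in user_languages:
        allowed |= PLACEHOLDER_LETTERS.get(lang, set())
    return all(ch in allowed for ch in label.lower())

print(covered("viagénie", ["ja", "en"]))
print(covered("viagénie", ["ja", "en", "de"]))
```

With these placeholder sets the check fails for "ja" plus "en" and
passes once "de" is added, which matches the behaviour seen in IE8, but
that only shows the hypothesis is consistent with the observation, not
that IE works this way.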
While I'm on the subject of data sources, I also want to mention
http://www.unicode.org/reports/tr36/, Unicode Security Considerations,
and http://www.unicode.org/reports/tr39/, Unicode Security Mechanisms,
and the data sources mentioned there. I'm very surprised that nobody has
mentioned them, because I think they are extremely relevant and helpful
for our discussion and for actual implementations.
Regards, Martin.