Browser IDN display policy: opinions sought

Wed Dec 14 23:36:12 CET 2011

On 12/14/2011 3:02 AM, "Martin J. Dürst" wrote:
> On 2011/12/12 19:54, Gervase Markham wrote:
>
>> I can quite believe it may be something like this; but how does one deal
>> with the impedance mismatch that users think they are defining
>> languages, but what you need is scripts? Does IE keep a script/language
>> mapping? Is that data (perhaps compiled by others) publicly available
>> somewhere, e.g. from the Unicode consortium?
>
> For character coverage needed for a language, CLDR (the Unicode Common
> Locale Data Repository, http://cldr.unicode.org) provides quite a lot
> of data to work with, although you may want to have a closer look or
> talk with somebody more familiar with the data and processes before
> you work on a particular application.

Just following up on this particular query about publicly available
script/language mapping data: CLDR also publishes charts that list the
(commonly used) scripts for a large number of languages, including
nearly all of the languages that would be used for IDNs. See:

http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/languages_and_scripts.html

and the reverse index:

http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/scripts_and_languages.html
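
To illustrate how the two charts relate, the reverse index can be derived
from the forward mapping. This is a minimal Python sketch, not anything
CLDR ships: the table and the names LANGUAGE_SCRIPTS and invert() are
hypothetical, and the entries are hand-written samples rather than data
copied from the charts.

```python
from collections import defaultdict

# Hypothetical sample data: language code -> ISO 15924 script codes.
# Illustrative only; consult the CLDR charts for the real entries.
LANGUAGE_SCRIPTS = {
    "en": ["Latn"],
    "ru": ["Cyrl"],
    "sr": ["Cyrl", "Latn"],
    "ja": ["Jpan"],
}


def invert(language_scripts):
    """Build the reverse index: script code -> sorted list of language codes."""
    index = defaultdict(list)
    for language, scripts in language_scripts.items():
        for script in scripts:
            index[script].append(language)
    return {script: sorted(langs) for script, langs in index.items()}


SCRIPT_LANGUAGES = invert(LANGUAGE_SCRIPTS)
```
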

Although this data is not perfect or complete for *all* languages, it is
a very good statement of 99.9% of the significant facts of usage
relevant to the issues being debated on this thread, IMO.

Anyone making use of this data would need to become familiar with its
source, supplementalData.xml in the CLDR releases, and know something
about the extensions which CLDR makes to the Unicode notion of "script",
before just blindly implementing it. For example, the Japanese
*language* is identified as being written with the Japanese *script* in
languages_and_scripts.html. The Japanese "script" actually refers to the
Japanese writing system, which combines several scripts, but which, for
various implementation reasons, is identified in CLDR with an aggregated
script identifier. And so on.
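
To make the aggregated-identifier caveat concrete, here is a minimal
Python sketch of reading language/script pairs and expanding aggregated
codes before comparing them against the scripts of individual characters.
The XML excerpt is a hypothetical sample modeled on the <languageData>
element of supplementalData.xml, and the handling of an "alt" attribute
is an assumption; verify both against the actual file in the CLDR
release you use. The function names are made up for illustration. The
expansion table reflects the ISO 15924 definitions of Jpan (Han +
Hiragana + Katakana) and Kore (Hangul + Han).

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt modeled on <languageData> in supplementalData.xml.
# Check the real file for the actual entries and attributes.
SAMPLE = """
<supplementalData>
  <languageData>
    <language type="en" scripts="Latn"/>
    <language type="ja" scripts="Jpan"/>
    <language type="ko" scripts="Kore"/>
    <language type="sr" scripts="Cyrl Latn"/>
  </languageData>
</supplementalData>
"""

# Aggregated ISO 15924 codes and the individual scripts they combine.
AGGREGATES = {
    "Jpan": ["Hani", "Hira", "Kana"],
    "Kore": ["Hang", "Hani"],
}


def language_scripts(xml_text):
    """Return {language code: [script codes]} from a languageData excerpt."""
    mapping = {}
    root = ET.fromstring(xml_text.strip())
    for elem in root.iter("language"):
        # Assumption: entries carrying an "alt" attribute describe
        # secondary usage, so this sketch skips them.
        if elem.get("alt") is not None:
            continue
        scripts = elem.get("scripts", "").split()
        if scripts:
            mapping.setdefault(elem.get("type"), []).extend(scripts)
    return mapping


def expand(scripts):
    """Replace aggregated codes with their component scripts, keeping order."""
    expanded = []
    for code in scripts:
        for component in AGGREGATES.get(code, [code]):
            if component not in expanded:
                expanded.append(component)
    return expanded


MAPPING = language_scripts(SAMPLE)
JAPANESE_SCRIPTS = expand(MAPPING["ja"])
```

Without the expansion step, a check that compares "Jpan" directly with
the Unicode Script property of a character would never match, since no
character has Jpan as its script value.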

However, I think this is the kind of machine-readable information that
Gervase was inquiring about.

Note also that CLDR is an ongoing project responsive to public input and
feedback, so if there are deficiencies, omissions, or outright errors in
the script and language data, the CLDR project would like to hear about
them via bug reports. See:

http://cldr.unicode.org/

--Ken