Language Identifier List up for comments

Thu Dec 16 16:14:52 CET 2004

> Since there are only two tags for CN, zh-CN and zh-hans-CN, 
> would those who argue for not overdifferentiating tags, 
> recommend just the simpler zh-CN?
> Similarly for TW, just zh-TW?

What does zh-CN mean? 

It is most commonly used, as far as I'm aware, to indicate text written in
the Simplified Chinese script.  For identification of the script I think we
should recommend zh-Hans first these days, although we need to add caveats
about the fact that some applications won't recognise it, eg. for automatic
application of fonts in Unicode-encoded Web pages on some browsers (see
http://www.w3.org/International/tests/results/lang-and-cjk-font).  There are
not a huge number of such applications, as far as I'm aware.
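
To make the difference between a tag like zh-CN and one like zh-hans-CN
concrete, here is a rough sketch of how a tag could be pulled apart into
its subtags.  It is an illustration only, not anything taken from the page
under discussion.  The function name split_tag and the shape-based rule
(4 letters for a script, 2 letters or 3 digits for a region) are my
assumptions, following the conventions of later language tag
specifications; whole registered tags such as zh-guoyu just end up with
their second part under "other".

```python
def split_tag(tag):
    """Split a language tag into labelled subtags, judging by shape alone.

    Hypothetical helper for illustration: it does not check any registry,
    so it cannot tell a valid code from an invalid one.
    """
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None,
              "region": None, "other": []}
    for sub in parts[1:]:
        # Script and region are only recognised before any "other" subtag.
        in_prefix = result["region"] is None and not result["other"]
        is_script = len(sub) == 4 and sub.isalpha()
        is_region = (len(sub) == 2 and sub.isalpha()) or \
                    (len(sub) == 3 and sub.isdigit())
        if is_script and in_prefix and result["script"] is None:
            result["script"] = sub.title()   # e.g. 'hans' -> 'Hans'
        elif is_region and in_prefix:
            result["region"] = sub.upper()   # e.g. 'cn' -> 'CN'
        else:
            result["other"].append(sub.lower())
    return result


if __name__ == "__main__":
    for tag in ("zh-CN", "zh-hans-CN", "zh-Hans", "zh-guoyu"):
        print(tag, split_tag(tag))
```

Seen this way, zh-CN carries a region but says nothing explicit about the
script, whereas zh-Hans names the script and leaves the region open.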

Use of zh-CN doesn't seem to make sense for identifying spoken Chinese,
since there are many dialects in China.  I think one should recommend
zh-guoyu, zh-yue, etc. for this purpose.

Note also that Mandarin, Cantonese, Hakka, etc. are spoken in many parts of
the world.  My expectation is that the use of CN would only be appropriate
if one wanted to explicitly make the point that one was referring to the
language as spoken in Mainland China - ie. that there is some particular
characteristic of the instance of text or audio recording that was
idiosyncratic to that particular area as a whole.

And now what does zh-TW mean?  Well, usually text written in the Traditional
Chinese script, although we could repeat much of what I wrote above
about zh-CN for this too.  zh-TW taken literally means the Chinese spoken in
Taiwan - which happens to be Mandarin.  So unless you have particular
distinguishing features in mind, perhaps, again, you should just use
zh-guoyu.

Then there's the question: what are we doing with this page?  Describing
current usage or recommending best practice?  If the latter, perhaps zh-CN
and zh-TW should only appear on the page if clearly marked as edge cases.

Btw, what does de-CH represent in the table?  Swiss German is different from
de-DE, is rarely written, and when it is written has little consistency in
its orthography.  There are also many local variants of Swiss German across
Switzerland, which would seem to invite a large number of additions to this
table.  But presumably de-CH refers to the way de-DE German is written in
Switzerland or spoken by newsreaders there (there are a small number of
significant differences here from de-DE)?  If so, we ought to clarify that
in the table.

I think this kind of questioning could be applied to many other parts of the
second table, which worries me.  I can't help thinking that it might be
better to talk through some examples of when to use en and when to use en-GB
or en-US, talk through the choices for particular problem areas like Chinese
and Swiss German, and so on, rather than just to list these combinations.
Most of them you could work out pretty easily anyway if you gave what you
were doing a small amount of thought and had access to a list of country
codes.

What might be more useful is to say: here is the simplest form to identify
this language (eg. 'en'), and in the next column are a bunch of potential
country or other codes you may want to consider using in conjunction with
it.  That would be better than "This table lists the languages" and "...
require a language subtag and country subtag".
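
As a rough sketch of the kind of table I have in mind, the simplest form
could be the key and the codes worth considering could sit alongside it.
The name SUGGESTIONS, the function candidate_tags and the handful of
entries are hypothetical and only cover the examples mentioned in this
mail; this is not a proposal for the actual content of the page.

```python
# Simplest form of each language, plus subtags one might consider adding.
# Entries are illustrative only and limited to the examples in this mail.
SUGGESTIONS = {
    "en": {"scripts": [], "regions": ["GB", "US"]},
    "de": {"scripts": [], "regions": ["DE", "CH"]},
    "zh": {"scripts": ["Hans"], "regions": ["CN", "TW"]},
}


def candidate_tags(language):
    """Return the bare language tag followed by the suggested combinations."""
    entry = SUGGESTIONS[language]
    tags = [language]  # the simplest form always comes first
    tags += [f"{language}-{script}" for script in entry["scripts"]]
    tags += [f"{language}-{region}" for region in entry["regions"]]
    return tags


if __name__ == "__main__":
    for language in SUGGESTIONS:
        print(language, "->", ", ".join(candidate_tags(language)))
```

The point of laying it out this way is that the reader starts from the
bare language and only adds a subtag when there is a reason to, instead of
picking a ready-made combination from a long list.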

RI