Language Identifier List up for comments

Fri Dec 17 19:23:55 CET 2004

Martin,
You are right of course.  The specifics of the rendering belong in the 
style sheet.
But this does bear out that taggers need to think about the likely use 
of the document and tag accordingly.  If the likely use is as a written 
text, then the tag should reflect readability.  If it's more likely to 
go to voice, then the tag should indicate the voice language.  Obviously 
it is up to the discretion of the tagger.
Perhaps there could be some mention of this in the language tag list?
Andrea

Martin Duerst wrote:

> Hello Andrea,
> 
> Good thoughts, so here are some arrows.
> 
> At 04:30 04/12/17, A. Vine wrote:
>  >
>  >In all this discussion, I've been thinking that it makes sense to have
>  >(OK, I'm waiting for the arrows) 2 separate identifiers; one for written
>  >and one for voice.
> 
> Do you mean two distinct sets of identifier values? Or two different
> attributes in XML, e.g. xml:lang and xml:spokenlang? Or what else?
> 
> It should be kept in mind that xml:lang is just something declarative.
> I.e. you say what language you think it is, you don't say what processing
> you expect. If you have an en-us text, and you want it to be read with
> a NY accent, then the best way to do this is to e.g. declare a NY
> accent voice in the stylesheet or other mechanism that reads the
> document.
> 
> For recorded speech, this is of course different; in that case,
> the spoken language/dialect/accent isn't the result of some processing,
> but it is a primary property of the data. But then, such data rarely
> gets encoded directly into XML.
> 
> Regards,    Martin.
> 
>  >It's very clear that the written language is much more generic than
>  >the voice language, and that being too specific in tagging may cause the
>  >document to be passed over when it could actually be useful.
>  >
>  >True there are other issues with language identification, but this one
>  >stands out as having a solution.  Something to think about bringing up
>  >to the HTML/XML tag folks...
>  >
>  >:-D just a thought,
>  >Andrea
>  >
>  >Richard Ishida wrote:
>  >
>  >>
>  >>>Since there are only two tags for CN, zh-CN and zh-hans-CN, would
>  >>>those who argue for not overdifferentiating tags recommend just the
>  >>>simpler zh-CN?
>  >>>Similarly for TW, just zh-TW?
>  >>
>  >> What does zh-CN mean?
>  >> It is most commonly used, as far as I'm aware, to indicate text written
>  >> in the Simplified Chinese script.  For identification of the script I
>  >> think we should recommend zh-Hans first these days - although we need to
>  >> add caveats about the fact that some applications won't recognise it (eg.
>  >> for automatic application of fonts in Unicode encoded Web pages on some
>  >> browsers (see
>  >> http://www.w3.org/International/tests/results/lang-and-cjk-font).  There
>  >> are not a huge number of applications, as far as I'm aware.)
>  >> Use of zh-CN doesn't seem to make sense for identifying spoken Chinese,
>  >> since there are many dialects in China.  I think one should recommend
>  >> zh-guoyu, zh-yue, etc. for this purpose.
>  >> Note also that Mandarin, Cantonese, Hakka, etc are spoken in many parts
>  >> of the world.  My expectation is that the use of CN would only be
>  >> appropriate if one wanted to explicitly make the point that one was
>  >> referring to the language as spoken in Mainland China - ie. that there is
>  >> some particular characteristic of the instance of text or audio recording
>  >> that was idiosyncratic to that particular area as a whole.
>  >> And now what does zh-TW mean?  Well, usually text written in Traditional
>  >> Chinese script, although we could repeat much of what I wrote above
>  >> about zh-CN for this too.  zh-TW taken literally means the Chinese spoken
>  >> in Taiwan - which happens to be Mandarin.  So unless you have particular
>  >> distinguishing features in mind, perhaps, again, you should just use
>  >> zh-guoyu.
>  >> Then there's the question: what are we doing with this page?  Describing
>  >> current usage or recommending best practices?  If the latter, perhaps
>  >> zh-CN and zh-TW should only appear on the page if clearly marked as edge
>  >> cases.
>  >>
>  >> Btw, what does de-CH represent in the table?  Swiss German is different
>  >> from de-DE, and rarely written, and then has little consistency to its
>  >> orthography.  There are also many local variants to Swiss German across
>  >> Switzerland, which would seem to invite a large number of additions to
>  >> this table.  But presumably de-CH refers to the way de-DE German is
>  >> written in Switzerland or spoken by newsreaders there (and there are a
>  >> small number of significant differences here from de-DE.)?  If so, we
>  >> ought to clarify that in the table.
>  >> I think this kind of process could be applied to many other parts of the
>  >> second table, which worries me.  I can't help thinking that it might be
>  >> better to talk through some examples of when to use en and when to use
>  >> en-GB or en-US, talk through the choices for particular problem areas like
>  >> Chinese and Swiss German, and so on, rather than to just list these
>  >> combinations, most of which you could determine pretty easily anyway if
>  >> you gave what you were doing a small amount of thought and had access to a
>  >> list of country codes.
>  >> What might be more useful is to say, here is the simplest form to
>  >> identify this language (eg. 'en'), and in the next column are a bunch of
>  >> potential country or other codes you may want to consider using in
>  >> conjunction with this.  Rather than, "This table lists the languages" and
>  >> " require a language subtag and country subtag".
>  >> RI
>  >>
>  >>
>  >
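
To make the "too specific a tag may get the document passed over" point concrete, here is a rough sketch in Python.  It is not taken from any specification or from anyone's implementation in this thread; the function names split_tag and matches are made up for illustration, and the subtag classification is a simplifying assumption (a 4-letter subtag is treated as a script, a 2-letter subtag as a country, anything else as "other").  The matching rule assumed here is plain prefix matching on hyphen boundaries.

```python
def split_tag(tag):
    """Split a hyphen-separated language tag into rough parts.

    Assumption for illustration only: 4 letters = script,
    2 letters = country/region, everything else = "other".
    """
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None,
              "region": None, "other": []}
    for sub in parts[1:]:
        if (len(sub) == 4 and sub.isalpha()
                and result["script"] is None and result["region"] is None):
            result["script"] = sub.title()      # e.g. "hans" -> "Hans"
        elif len(sub) == 2 and sub.isalpha() and result["region"] is None:
            result["region"] = sub.upper()      # e.g. "cn" -> "CN"
        else:
            result["other"].append(sub.lower()) # e.g. "guoyu"
    return result


def matches(requested, tag):
    """True if tag equals the requested range or extends it after a hyphen."""
    r, t = requested.lower(), tag.lower()
    return t == r or t.startswith(r + "-")


if __name__ == "__main__":
    print(split_tag("zh-hans-CN"))
    # A request for plain "zh" finds a document tagged zh-TW ...
    print(matches("zh", "zh-TW"))
    # ... but a request for "zh-TW" passes over a document tagged just "zh".
    print(matches("zh-TW", "zh"))
```

Under these assumptions, a generic tag such as 'en' or 'zh' is found by both generic and specific-but-broader requests only in one direction: a search for the simple form picks up the more specific tags, while a search for a very specific form misses anything tagged differently or more simply.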