LANGUAGE SUBTAG REGISTRATION FORM: pinyin

Tracey, Niall niall.tracey at logica.com
Tue Aug 5 10:15:17 CEST 2008


From: mark.edward.davis at gmail.com [mailto:mark.edward.davis at gmail.com] On Behalf Of Mark Davis
Sent: 04 August 2008 18:57

> Mandarin text has been validly tagged as 'zh', and will continue to be validly tagged as 'zh'. 
 
But what *is* Mandarin text?
 
The point of "zh" is that a text written in Chinese logograms is not necessarily Mandarin. As I understand it, there are many Chinese languages that share a mutually comprehensible written mode -- it's pretty much impossible to point to a Chinese text and identify it unambiguously as Mandarin, unless the writer uses a lot of slang or colloquial idioms.
 
However, once we write something in a pinyin, it is clear which Chinese language it is, so we really should be more specific. If we skip a step in the hierarchy, searching becomes more complicated.
 
Surely the point of a hierarchical naming convention is to allow rapid pruning of a dataset without having to examine all levels? With an explicit hierarchy we can do this very efficiently: if we want to search for text that a Mandarin speaker is likely to understand, we can do two steps to prune the search-space:
1) Cut any texts not marked ZH
2) Cut any texts with a variant other than CMN
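
As a rough illustration of the two steps above, here is a minimal Python sketch. It assumes tags shaped like "zh-cmn-...", where a three-letter alphabetic second subtag names the specific Chinese language; that layout, the function names and the sample tags are all assumptions made for illustration, not a statement of how the registry or any existing library works. Texts tagged with plain "zh" are kept, since they carry no narrower subtag to cut on.

```python
def split_tag(tag):
    """Split a language tag into lowercase subtags."""
    return tag.lower().split("-")


def specific_language(tag):
    """Return the three-letter second subtag, if any (assumed layout), else None."""
    parts = split_tag(tag)
    if len(parts) > 1 and len(parts[1]) == 3 and parts[1].isalpha():
        return parts[1]
    return None


def prune_for_mandarin(tagged_texts):
    """Apply the two pruning steps to a list of (tag, text) pairs."""
    # Step 1: cut any texts not marked zh.
    chinese = [(tag, text) for tag, text in tagged_texts
               if split_tag(tag)[0] == "zh"]
    # Step 2: cut any texts whose narrower subtag is something other than cmn.
    # Texts with no narrower subtag (plain "zh", "zh-Hant", ...) are kept.
    return [(tag, text) for tag, text in chinese
            if specific_language(tag) in (None, "cmn")]


# Hypothetical sample data, for illustration only.
samples = [
    ("zh-cmn-Latn-pinyin", "text A"),
    ("zh-yue", "text B"),
    ("zh", "text C"),
    ("en", "text D"),
    ("zh-Hant", "text E"),
]

kept_tags = [tag for tag, _ in prune_for_mandarin(samples)]
print(kept_tags)
```

Each step looks at one level of the tag only, which is the whole attraction: no list of "but not xx, yy, zz" exclusions has to be maintained by the searcher.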
 
I'm sure there's some intricacy of the current system that I've missed that already makes this impossible in practice, but I feel we should aim to get closer to this state of affairs. Having to search for zh with cmn and/or pinyin and/or ..., but not xx, yy, zz, etc., is overcomplicated and will lead to errors. Beyond that, it arguably doesn't make the job of tagging the text any easier in the first place: it's confusing when there are three or four "correct" ways of doing something.

I'm opposed to hiding any data when hiding it makes everyone's job harder.
 
Níall.




