Request for advice: IANA subtag registry, zh / cmn / yue

Fri Jul 15 17:01:44 CEST 2011

Hi Debbie,

If you carefully read BCP 47, you will find that, although the ‘Prefix’ field is recommended for use in forming language tags, there is no prohibition on using subtags with something other than their listed prefixes. Both of the tags you mention below are well-formed and valid under BCP 47.

The tag “zh-Latn-jyutping” makes sense because it conveys what your correspondent wants the tag to convey: the subtag ‘zh’ implies “standard Chinese” while ‘jyutping’ implies a Cantonese pronunciation. The tag “cmn-pinyin” also makes sense (cmn-Latn-pinyin would be better), although “zh-Latn-pinyin” would probably be preferred for most applications. The explicit use of the ‘cmn’ subtag is useful in distinguishing Mandarin Chinese from “standard Chinese”. It puts sort of an exclamation point on it. The registry doesn’t list these combinations as Prefixes because these uses are slightly unusual.

Finally, I should mention that extlang subtags might also be useful here. For example, the tag “zh-cmn-Latn-pinyin” might be used in place of “cmn-pinyin”: it includes the ‘pinyin’ subtag’s Prefix *and* clarifies that it applies to Mandarin.
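To make the subtag roles in these tags concrete, here is a minimal Python sketch that splits a tag into language, extlang, script, region, variant, and private-use parts. It is a simplification that relies on subtag length and position only. `split_tag` is a hypothetical helper, not part of any library; it does not consult the registry, and it ignores extensions and grandfathered tags.

```python
def split_tag(tag):
    """Split a language tag into rough BCP 47 components (simplified sketch)."""
    parts = tag.split("-")
    out = {
        "language": parts[0].lower(),
        "extlang": [],
        "script": None,
        "region": None,
        "variants": [],
        "private": [],
    }
    i = 1
    # Extlang: up to three 3-letter alphabetic subtags right after the language.
    while (i < len(parts) and len(out["extlang"]) < 3
           and len(parts[i]) == 3 and parts[i].isalpha()):
        out["extlang"].append(parts[i].lower())
        i += 1
    # Script: one 4-letter alphabetic subtag.
    if i < len(parts) and len(parts[i]) == 4 and parts[i].isalpha():
        out["script"] = parts[i].title()
        i += 1
    # Region: 2 letters or 3 digits.
    if i < len(parts) and (
        (len(parts[i]) == 2 and parts[i].isalpha())
        or (len(parts[i]) == 3 and parts[i].isdigit())
    ):
        out["region"] = parts[i].upper()
        i += 1
    # Assumption: everything up to a private-use 'x' is treated as a variant,
    # without checking its length or whether it is registered.
    while i < len(parts) and parts[i].lower() != "x":
        out["variants"].append(parts[i].lower())
        i += 1
    # Private use: whatever follows the 'x' singleton.
    if i < len(parts):
        out["private"] = [p.lower() for p in parts[i + 1:]]
    return out
```

With this sketch, “zh-cmn-Latn-pinyin” splits into language ‘zh’, extlang ‘cmn’, script ‘Latn’ and variant ‘pinyin’, while “cmn-pinyin” has ‘cmn’ as its primary language and no script subtag at all.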

In any case, the main point here is that the two tags in question are not illegal. It is possible to request that additional Prefix fields be added to the registry to cover these cases, but anyone can already use these subtags in the suggested tags to convey the suggested meaning.
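The difference between “matches a listed Prefix” and “is a valid tag” can be sketched roughly as follows. The Prefix values below are only the ones mentioned in this thread, not a copy of the registry. The matching rule (same first subtag, remaining Prefix subtags appearing in order before the variant) is an assumption made for illustration, and `prefix_matches` is a hypothetical name.

```python
# Prefix values as described in this thread (illustrative excerpt only).
VARIANT_PREFIXES = {
    "pinyin": ["zh-Latn"],
    "jyutping": ["yue"],
}


def prefix_matches(tag, variant):
    """Return True if the subtags before `variant` match one of its listed Prefixes."""
    subtags = tag.lower().split("-")
    if variant not in subtags:
        raise ValueError(f"{variant!r} does not occur in {tag!r}")
    head = subtags[:subtags.index(variant)]
    for prefix in VARIANT_PREFIXES.get(variant, []):
        wanted = prefix.lower().split("-")
        # The primary language subtag must be identical.
        if not head or head[0] != wanted[0]:
            continue
        # Remaining Prefix subtags must appear in order; other subtags
        # (for example an extlang) may sit in between.
        rest = iter(head[1:])
        if all(w in rest for w in wanted[1:]):
            return True
    return False
```

Under these assumptions, “zh-Latn-pinyin”, “yue-jyutping” and “zh-cmn-Latn-pinyin” match a listed Prefix, while “zh-Latn-jyutping” and “cmn-pinyin” do not. A mismatch here only means the tag departs from a recommendation; it does not make the tag invalid.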

Addison

From: ietf-languages-bounces at alvestrand.no [mailto:ietf-languages-bounces at alvestrand.no] On Behalf Of Debbie Garside
Sent: Friday, July 15, 2011 1:23 AM
To: ietf-languages at alvestrand.no
Subject: FW: Request for advice: IANA subtag registry, zh / cmn / yue

Hi

I have received the following enquiry from a colleague (see full query below).  He appears to have a need for the following language tags:

zh-Latn-jyutping

and

cmn-pinyin

Should I advise that he complete request forms for these new tags?  Can anyone see a problem with these?  Is it just a matter of updating the subtag records for jyutping and pinyin to allow them to be used with the primary language subtags zh and cmn respectively?

Debbie

---------------
Hello Debbie!

I hope you are well. I am emailing you because I have a particular problem with language tagging and I'm getting contradictory answers from my searches of standards and mailing list archives. I'm trying to write software and data for English-speaking learners of Cantonese, and I would appreciate your advice as to how to tag some of the data.

My Question: Short version

The IANA subtag registry seems to have an asymmetry about Chinese romanizations. If I've understood correctly, it allows:

 zh-Latn-pinyin ("Chinese written in pinyin romanization"), and
 yue-jyutping ("Cantonese written in jyutping romanization"),

but not

 zh-Latn-jyutping ("Chinese written in jyutping romanization"), nor
 cmn-pinyin ("Mandarin written in pinyin romanization").

Pinyin is a romanization method for Mandarin, whereas Jyutping is a romanization method for Cantonese.

Is this asymmetry an error?

What should I use instead of "zh-Latn-jyutping"?

My Question: Long version

Background Facts

You probably know all this already, but I'll go over it to make it clear where I'm coming from.

Cantonese speakers usually speak with one lexicon (which I will call "Cantonese dialect words") -- used in all registers of speech from teenage chats to courts of law -- but they usually write with a different lexicon (which I'll call "standard Chinese words"), which is essentially the same lexicon used for writing across China (and, not-so-coincidentally, essentially the same lexicon as that used in spoken Mandarin). Here is an example of a lexical difference:

   Standard Chinese word: 怎麼, "what" [int. pron.], pronounced "zam2 mo1"
  Cantonese Dialect word: 乜嘢, "what" [int. pron.], pronounced "mat1 je5"

A Cantonese speaker would usually write "怎麼" (the standard Chinese word), but would usually say "mat1 je5" (the Cantonese dialect word). However, as seen above, the reverse is possible: you can write the Cantonese dialect word in Chinese characters (乜嘢), and you can speak the standard Chinese word with a Cantonese pronunciation ("zam2 mo1" -- which is different from the Mandarin pronunciation, which would be "zen me"). Neither would be comprehensible to a Mandarin speaker. You might do the former when writing the script for a play, and the latter when reading a standard Chinese book out loud.

Note that the two lexicons overlap. There are non-lexical grammar differences but these aren't important for the particular work I'm doing. There's also a completely separate difference between Traditional Chinese characters and Simplified Chinese characters, but I'll ignore this for now by only considering Traditional Chinese characters.

My issue

Learners of Cantonese need to become familiar with both lexicons in order to function fully in Cantonese-speaking situations.

Therefore, disregarding Simplified Chinese characters for now, the lexicon data I have falls into five varieties:

1. Standard Chinese words, written in Traditional Chinese characters
2. Standard Chinese words, written in jyutping (Cantonese pronunciation)
3. Dialect words, written in Traditional Chinese characters
4. Dialect words, written in jyutping (Cantonese pronunciation)
5. English words

I am trying to decide which BCP 47 language tags to use for all these. So far, I have:

1. zh-Hant
2. yue-jyutping (???)
3. yue-Hant
4. yue-jyutping
5. en
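As an illustration only, the draft tagging above could be recorded like this, using the example words from earlier in this message. `Entry` and `group_by_tag` are hypothetical names, and the tag for variety 2 is the unsettled one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    text: str  # the word as written
    tag: str   # BCP 47 language tag (draft choice)


# One entry per variety, in the order listed above.
ENTRIES = [
    Entry("怎麼", "zh-Hant"),           # 1. standard Chinese word, Traditional characters
    Entry("zam2 mo1", "yue-jyutping"),  # 2. standard Chinese word in jyutping (tag unsettled)
    Entry("乜嘢", "yue-Hant"),          # 3. dialect word, Traditional characters
    Entry("mat1 je5", "yue-jyutping"),  # 4. dialect word in jyutping
    Entry("what", "en"),                # 5. English word
]


def group_by_tag(entries):
    """Group entry texts by their language tag, preserving input order."""
    grouped = {}
    for entry in entries:
        grouped.setdefault(entry.tag, []).append(entry.text)
    return grouped
```

With this draft, varieties 2 and 4 end up under the same tag ("yue-jyutping"), so the tag alone does not say which lexicon a romanized word comes from.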

I'm really unsure about what tag to use for 2, i.e. the Cantonese pronunciation for reading aloud standard Chinese words which do not appear in the Cantonese dialect lexicon. The options seem to be:

(A) yue-jyutping (weird)
(B1) zh-Latn-jyutping  (illegal, but analogous to zh-Latn-pinyin)
(B2) zh-Latn-x-jyutping
(C) cmn-jyutping (bizarre, illegal)

Option (A) seems the only standard option but it is a little weird: we would be using "zh-Hant" to tag a word like 怎麼 (which is in the standard Chinese lexicon but not in the Cantonese dialect lexicon), but then "yue-jyutping" to tag exactly the same word when romanized. The language has changed from "zh" to "yue" just because we've written it down differently.

(An alternative would be to tag the word 怎麼 as "yue-Hant", even though it is not in the Cantonese dialect lexicon, because a Cantonese person might potentially read it aloud. But by the same logic, you could tag any Chinese character text whatsoever as "yue", even if it was written by a non-Cantonese speaker and uses only words which are not in the Cantonese dialect lexicon, which is completely absurd).

Option (B1) best expresses the dialect-neutrality of the standard Chinese lexicon, and seems analogous to zh-Latn-pinyin, which is allowed in the IANA subtag registry. But for some reason, although the "pinyin" subtag is allowed with the prefix "zh-Latn", the "jyutping" subtag is only allowed with the prefix "yue". I can't see why "zh-Latn-jyutping" is not allowed -- it seems to say "some sort of Chinese written in jyutping" which seems perfectly reasonable. Should we ask IANA to allow this?

(There is also a different argument: if I can legally use the "zh" tag for Cantonese-only characters such as 乜嘢, then why can't I use "zh-Latn-jyutping" for exactly the same word written in the jyutping romanization?)

Option (C) seems bizarre ("Mandarin words as pronounced in Cantonese"), but would be the logical consequence if people say that standard written Chinese is actually Mandarin (which they might say on the grounds that the standard written Chinese lexicon is the same as the spoken Mandarin lexicon).

So my question is: which tag should I be using (for Standard Chinese words, written in jyutping)?

I can see why this is complicated; there is inherently some sort of hybridization going on when a Cantonese person reads standard Chinese text aloud with a Cantonese pronunciation. On the other hand, it's not some sort of weird and unusual edge case. People quote written text all the time, and 300 million Chinese people speak a dialect other than Mandarin (which therefore has a different lexicon to standard written Chinese).

Sorry to take up your time, and thanks for reading all this!
--
David Chan