FW: Request for advice: IANA subtag registry, zh / cmn / yue

Fri Jul 15 10:22:35 CEST 2011

Hi

I have received the following enquiry from a colleague (see full query
below).  He appears to have a need for the following language tags:

zh-Latn-jyutping

and

cmn-pinyin

Should I advise that he complete request forms for these new tags?  Can
anyone see a problem with these?  Is it just a matter of updating the subtag
records for jyutping and pinyin to allow them to be used with the primary
language subtags zh and cmn respectively?

Debbie

---------------

Hello Debbie!

I hope you are well. I am emailing you because I have a particular problem
with language tagging and I'm getting contradictory answers from my searches
of standards and mailing list archives. I'm trying to write software and
data for English-speaking learners of Cantonese, and I would appreciate your
advice as to how to tag some of the data.

My Question: Short version

The IANA subtag registry seems to have an asymmetry about Chinese
romanizations. If I've understood correctly, it allows:

 zh-Latn-pinyin ("Chinese written in pinyin romanization"), and 
 yue-jyutping ("Cantonese written in jyutping romanization"),

but not 

 zh-Latn-jyutping ("Chinese written in jyutping romanization"), nor
 cmn-pinyin ("Mandarin written in pinyin romanization").

Pinyin is a romanization method for Mandarin, whereas Jyutping is a
romanization method for Cantonese.
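
The asymmetry can be sketched as a small prefix check. This is only an illustration: the VARIANT_PREFIXES table is a hypothetical hand-copied excerpt holding just the two prefixes mentioned in this message, not the full registry, and prefix_ok is a made-up helper that uses a simplified leading-subtag comparison.

```python
# Hypothetical excerpt: only the prefixes mentioned in this message,
# not the full IANA subtag registry.
VARIANT_PREFIXES = {
    "pinyin": ["zh-Latn"],
    "jyutping": ["yue"],
}

def prefix_ok(tag):
    """Return True if every known variant in `tag` follows one of its listed prefixes."""
    subtags = tag.lower().split("-")
    for i, sub in enumerate(subtags):
        if sub not in VARIANT_PREFIXES:
            continue
        before = subtags[:i]  # everything that precedes the variant subtag
        allowed = [p.lower().split("-") for p in VARIANT_PREFIXES[sub]]
        # Simplified rule: the tag must begin with all subtags of some prefix.
        if not any(before[:len(p)] == p for p in allowed):
            return False
    return True

for tag in ["zh-Latn-pinyin", "yue-jyutping", "zh-Latn-jyutping", "cmn-pinyin"]:
    print(tag, prefix_ok(tag))
```

Under these assumptions the first two tags pass the check and the last two do not, which is the asymmetry described above.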

Is this asymmetry an error?

What should I use instead of "zh-Latn-jyutping"?

My Question: Long version

Background Facts

You probably know all this already, but I'll go over it to make it clear
where I'm coming from.

Cantonese speakers usually speak with one lexicon (which I will call
"Cantonese dialect words") -- used in all registers of speech from teenage
chats to courts of law -- but they usually write with a different lexicon
(which I'll call "standard Chinese words"), which is essentially the same
lexicon used for writing across China (and, not-so-coincidentally,
essentially the same lexicon as that used in spoken Mandarin). Here is an
example of a lexical difference:

   Standard Chinese word: 怎麼, "what" [int. pron.], pronounced "zam2 mo1"
  Cantonese Dialect word: 乜嘢, "what" [int. pron.], pronounced "mat1 je5"

A Cantonese speaker would usually write "怎麼" (the standard Chinese word),
but would usually say "mat1 je5" (the Cantonese dialect word). However, as
seen above, the reverse is possible: you can write the Cantonese dialect
word in Chinese characters (乜嘢), and you can speak the standard Chinese
word with a Cantonese pronunciation ("zam2 mo1" -- which is different from
the Mandarin pronunciation, which would be "zen me"). Neither would be
comprehensible to a Mandarin speaker. You might do the former when writing
the script for a play, and the latter when reading a standard Chinese book
out loud.

Note that the two lexicons overlap. There are non-lexical grammar
differences but these aren't important for the particular work I'm doing.
There's also a completely separate difference between Traditional Chinese
characters and Simplified Chinese characters, but I'll ignore this for now
by only considering Traditional Chinese characters.

My issue

Learners of Cantonese need to become familiar with both lexicons in order to
function fully in Cantonese-speaking situations.

Therefore, disregarding Simplified Chinese characters for now, the lexicon
data I have falls into five varieties:

1. Standard Chinese words, written in Traditional Chinese characters
2. Standard Chinese words, written in jyutping (Cantonese pronunciation)
3. Dialect words, written in Traditional Chinese characters
4. Dialect words, written in jyutping (Cantonese pronunciation)
5. English words

I am trying to decide which BCP 47 language tags to use for all these. So
far, I have:

1. zh-Hant
2. yue-jyutping (???)
3. yue-Hant
4. yue-jyutping
5. en

I'm really unsure about what tag to use for 2, i.e. the Cantonese
pronunciation for reading aloud standard Chinese words which do not appear
in the Cantonese dialect lexicon. The options seem to be:

(A) yue-jyutping (weird)
(B1) zh-Latn-jyutping  (illegal, but analogous to zh-Latn-pinyin)
(B2) zh-Latn-x-jyutping
(C) cmn-jyutping (bizarre, illegal)
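
For the private-use route in option (B2), a minimal sketch of how software might separate the private-use part from the rest of the tag. The helper split_private_use is hypothetical; the only rule it relies on is the BCP 47 grammar for private-use subtags (1 to 8 ASCII letters or digits after the "x" singleton), which "jyutping" just fits at 8 characters.

```python
import re

# Private-use subtags are 1 to 8 ASCII letters or digits.
PRIVATE_SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")

def split_private_use(tag):
    """Split a tag into (public part, list of private-use subtags)."""
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" not in lowered:
        return tag, []
    i = lowered.index("x")  # the "x" singleton starts the private-use section
    private = subtags[i + 1:]
    if not private or not all(PRIVATE_SUBTAG.fullmatch(s) for s in private):
        raise ValueError(f"malformed private-use section in {tag!r}")
    return "-".join(subtags[:i]), private

print(split_private_use("zh-Latn-x-jyutping"))
```

Software that does not know the private agreement could then fall back to the public part ("zh-Latn") alone.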

Option (A) seems the only standard option but it is a little weird: we would
be using "zh-Hant" to tag a word like 怎麼 (which is in the standard Chinese
lexicon but not in the Cantonese dialect lexicon), but then "yue-jyutping"
to tag exactly the same word when romanized. The language has
changed from "zh" to "yue" just because we've written it down differently.

(An alternative would be to tag the word 怎麼 as "yue-Hant", even though it
is not in the Cantonese dialect lexicon, because a Cantonese person might
potentially read it aloud. But by the same logic, you could tag any Chinese
character text whatsoever as "yue", even if it was written by a
non-Cantonese speaker and uses only words which are not in the Cantonese
dialect lexicon, which is completely absurd.)

Option (B1) best expresses the dialect-neutrality of the standard Chinese
lexicon, and seems analogous to zh-Latn-pinyin, which is allowed in the IANA
subtag registry. But for some reason, although the "pinyin" subtag is
allowed with the prefix "zh-Latn", the "jyutping" subtag is only allowed
with the prefix "yue". I can't see why "zh-Latn-jyutping" is not allowed --
it seems to say "some sort of Chinese written in jyutping" which seems
perfectly reasonable. Should we ask IANA to allow this?

(There is also a different argument: if I can legally use the "zh" tag for
Cantonese-only characters such as 乜嘢, then why can't I use
"zh-Latn-jyutping" for exactly the same word written in the jyutping
romanization?)

Option (C) seems bizarre ("Mandarin words as pronounced in Cantonese"), but
would be the logical consequence if people say that standard written Chinese
is actually Mandarin (which they might say on the grounds that the standard
written Chinese lexicon is the same as the spoken Mandarin lexicon).

So my question is: which tag should I be using (for Standard Chinese words,
written in jyutping)?

I can see why this is complicated; there is inherently some sort of
hybridization going on when a Cantonese person reads standard Chinese text
aloud with a Cantonese pronunciation. On the other hand, it's not some sort
of weird and unusual edge case. People quote written text all the time, and
300 million Chinese people speak a dialect other than Mandarin (which
therefore has a different lexicon to standard written Chinese).

Sorry to take up your time, and thanks for reading all this!
-- 
David Chan
