My old paper, feedback on registration proposals

Thu May 29 12:51:48 CEST 2003

1. My paper

Okay, I was finally able to work through the maze of Apple process to get my old paper on RFC 3066 extensions posted. It dates from February, but has been updated to include some annotations from e-mail discussions in March with Peter Constable, Michael Everson, and others. The URL is:
  ftp://text:ftp2apple@tondero.apple.com/LocaleNotes-Tags-2pke.pdf

(i.e. ftp to tondero.apple.com with username 'text', password 'ftp2apple'; the document is LocaleNotes-Tags-2pke.pdf; sorry for the awkward access method).

Note: References in this thread to "Peter's paper" occasionally mean the paper above, but more often mean the wonderful papers by Peter Constable, which now constitute Unicode tech note #8:
  http://www.unicode.org/notes/tn8/

2. Proposed registrations feedback & discussion

I strongly support the idea of extending RFC 3066 by using ISO 15924 script tags (and tags for other kinds of orthographic information) in conjunction with ISO 639 language codes; my old paper above proposed a model somewhat like the one proposed by John Cowan on April 7. An RFC 3066 language tag "always defines a language as spoken (or written, signed or otherwise signaled)", and one cannot define a language as written without script information and other necessary orthographic information.

I also support the need to have tags for all of the particular cases requested by Mark Davis:
 az-latn, az-cyrl, az-arab
 sr-cyrl, sr-latn
 uz-cyrl, uz-latn
 zh-hans, zh-hant

(Apple also needs to identify and distinguish most of these cases, plus a few more I will mention below).

My only questions are about the specific form of the tags used for the Chinese cases. These registrations are intended to be steps on the path to a productive model using ISO 15924 script tags. However, "Hans" and "Hant" are not currently in ISO 15924; is there a commitment to add them?

My paper had suggested a slightly different approach, based on existing 15924 tags:
- Using subtags in conjunction with existing ISO 15924 codes in order to further qualify them; thus for the above Chinese cases the full form could be something like "zh-Hani-simplified" and "zh-Hani-traditional".
- However, it also suggested that tags for scripts (and other kinds of information) could be eliminated when they indicated clear defaults (and "Hani" is a clear default for "zh"); thus the tags above would reduce to simply "zh-simplified" and "zh-traditional".

The same scheme could be used to eliminate the other variants in ISO 15924, thus giving "en-Latn-Fraktur" -> "en-Fraktur" instead of "en-Latf", etc. It could also be used to cover other cases, such as polytonic/monotonic modern Greek: "el-Grek-poly" -> "el-poly".
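
To make the reduction rule above concrete, here is a minimal Python sketch. It is only an illustration of the idea, not part of any registration text: the DEFAULT_SCRIPT table and the function name reduce_tag are hypothetical, and the defaults listed are limited to the examples in this message.

```python
# Hypothetical table of default scripts per language. This is an assumption
# made for illustration, limited to the examples in this message; it is not
# taken from any registry.
DEFAULT_SCRIPT = {
    "zh": "Hani",
    "en": "Latn",
    "el": "Grek",
}


def reduce_tag(tag: str) -> str:
    """Drop the script subtag when it is the assumed default for the language."""
    subtags = tag.split("-")
    default = DEFAULT_SCRIPT.get(subtags[0].lower())
    if default is None:
        # No known default for this language: leave the tag untouched.
        return tag
    for i, sub in enumerate(subtags[1:], start=1):
        # Tags are compared case-insensitively.
        if sub.lower() == default.lower():
            return "-".join(subtags[:i] + subtags[i + 1:])
    return tag


print(reduce_tag("zh-Hani-simplified"))  # zh-simplified
print(reduce_tag("en-Latn-Fraktur"))     # en-Fraktur
print(reduce_tag("el-Grek-poly"))        # el-poly
print(reduce_tag("mn-Cyrl"))             # mn-Cyrl (no default listed)
```

A language with no entry in the table keeps its script subtag, which is the behavior wanted for cases where no single script is a clear default.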

Also, if these are steps toward a productive model, another question to consider is how the subtags for simplified & traditional would interact with the subtags used to disambiguate the various Chinese languages: e.g. "zh-yue-Hant" or "zh-yue-traditional" (I favor having all of the language-specific information precede all of the script/orthographic information). This may be moot if ISO 639-3 adds codes for the specific Chinese languages and RFC 3066bis adopts ISO 639-3.

By the way, even if 15924 adds Hant and Hans, I think Hani is still very useful.
- For specifying content: I see many documents that include both simplified and traditional characters. Consider an academic paper discussing the use of Chinese character puns and double-entendres during the last 150 years of Chinese literature; the paper needs to use the exact character form used in the source material, whether simplified or traditional.
- For specifying accept-language: Many scholars and literati are familiar with both forms, but might want to specify that they prefer Chinese in Han script only (as opposed to a website that has Chinese in pinyin, etc.) - thus zh-Hani as opposed to just zh, etc.
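
For the accept-language case, it may help to recall how RFC 3066 (section 2.5) matches a language-range against a tag: the range matches if it equals the tag, or if it equals a prefix of the tag that is immediately followed by "-"; the special range "*" matches any tag. A minimal Python sketch of that rule follows; the function name range_matches is hypothetical.

```python
def range_matches(language_range: str, tag: str) -> bool:
    """Prefix matching of a language-range against a tag, per RFC 3066 sec. 2.5."""
    r = language_range.lower()  # matching is case-insensitive
    t = tag.lower()
    if r == "*":
        return True
    # Exact match, or the range is a prefix ending on a subtag boundary.
    return t == r or t.startswith(r + "-")


print(range_matches("zh-Hani", "zh-Hani-traditional"))  # True
print(range_matches("zh-Hani", "zh"))                   # False
print(range_matches("zh", "zh-Hani"))                   # True
print(range_matches("zh-Hani", "zh-Latn"))              # False
```

Under this rule a request for zh-Hani is narrower than a request for zh, which is the distinction the accept-language example relies on.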

Now, I would like to mention a few other cases for which Apple needs extended RFC 3066 tags that include script or other orthographic information (this is not yet a request for registration, just material for discussion):

- For Mongolian, we need mn-Cyrl and mn-Mong.
- For Malay, we need ms-Latn and ms-Arab.
- For Tatar, we will probably need tt-Latn and tt-Cyrl.
- For Irish, we need a way to indicate the old orthography (with dots above) and the modern orthography (using h instead): e.g. "ga-Latn-dots" -> "ga-dots" ??
- For English, we need a way to indicate English written in an orthography that is restricted to the ASCII subset of characters only, versus the full range of possible characters (curly quotes, em-dashes, etc). This is actually one of the localizations that we support. Perhaps "en-Latn-ASCII" -> "en-ASCII" ??
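
As a small illustration of the English case, the following Python sketch checks whether a piece of text stays within the ASCII subset. The function name is_ascii_only is hypothetical, and "en-ASCII" remains only a suggested tag form, not a registered one.

```python
def is_ascii_only(text: str) -> bool:
    """True if every character is in the 7-bit ASCII range (U+0000..U+007F)."""
    return all(ord(ch) < 128 for ch in text)


# Straight quotes and a hyphen stay within ASCII.
print(is_ascii_only('He said "hi" - ok'))
# Curly quotes (U+201C, U+201D) and an em-dash (U+2014) do not.
print(is_ascii_only("He said \u201chi\u201d \u2014 ok"))
```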

-Peter
-- 
----------------------------------------------
Peter Edberg  .  .  .  .  Apple Computer, Inc.
Mac OS Engineering: International & Text Group
Tel: +1 (408) 974-4275, Fax: +1 (408) 862-4566
----------------------------------------------