Script codes in RFC 3066

Mon Apr 7 12:21:31 CEST 2003

Since I am someone who works best when staring at syntactic schemas and
actual examples, I would like to propose the following trial balloon
for RFC 3066bis, specifically the kinds of codes that would be available
without IANA registration.

First of all, I adopt the categories of Peter Constable's admirable
papers which constitute UTN #8 (http://www.unicode.org/notes/tn8).
He characterizes the relevant kinds of language-related categories as
"language", "writing system", "orthography", and "domain-specific data
set".  I am proposing a standardized, backward-compatible method of
writing down tags that represent the first three of these.  The fourth
category, along with any tags the scheme below does not handle properly,
remains a matter for one-off IANA registration.

In this scheme, a language is encoded by either an ISO 639 tag or an ISO
639 tag followed by an Ethnologue tag.  A writing system is encoded by
an ISO 15924 script code.  An orthography is encoded by an ISO 3166-1
country code.  All these are separated by hyphens, though an underscore
is recognized as equivalent to a hyphen.  Tags consist of Latin letters
(case-insensitive) and/or European digits.
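
As a rough illustration of the separator and case rules, here is a
minimal Python sketch.  The function name normalize_tag is hypothetical,
and rejecting empty parts (as in "en--us") is an assumption of the
sketch rather than something the proposal spells out.

```python
import re

def normalize_tag(tag: str) -> str:
    """Lowercase a tag and map underscores to hyphens (hypothetical helper)."""
    normalized = tag.replace("_", "-").lower()
    # Each hyphen-separated part must be Latin letters and/or digits.
    # Assumption: empty parts are treated as malformed.
    if not re.fullmatch(r"[a-z0-9]+(?:-[a-z0-9]+)*", normalized):
        raise ValueError(f"not a well-formed tag: {tag!r}")
    return normalized
```

With this helper, "EN_US" and "en-us" normalize to the same string, so
later steps can compare tags directly.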

Specifically, a language is encoded by the shortest of the following
four sequences:  an ISO 639-1 tag, an ISO 639-2 tag, an ISO 639-1 tag
followed by a hyphen followed by an Ethnologue tag, or an ISO 639-2 tag
followed by a hyphen followed by an Ethnologue tag.  Examples:

	en is English (not eng, not en-eng, not eng-eng).
	ast is Asturian (not ast-aub).
	qu-qho is Imbabura Quechua (not que-qho).
	cmc-cjm is Eastern Cham.
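
The "shortest sequence" rule can be sketched as follows.  This is only
an illustration: the function language_tag and its parameters are
hypothetical, and the caller is assumed to know whether the ISO 639
code alone identifies the language (in which case no Ethnologue code
is passed).

```python
from typing import Optional

def language_tag(iso1: Optional[str], iso2: Optional[str],
                 ethnologue: Optional[str]) -> str:
    """Pick the shortest permitted form for a language (hypothetical helper).

    Pass ethnologue=None when the ISO 639 code alone identifies the language.
    """
    bases = [code for code in (iso1, iso2) if code]
    if not bases:
        raise ValueError("at least one ISO 639 code is required")
    if ethnologue:
        candidates = [f"{base}-{ethnologue}" for base in bases]
    else:
        candidates = bases
    # The shortest candidate wins, so a 2-letter ISO 639-1 code is
    # preferred over a 3-letter ISO 639-2 code whenever both exist.
    return min(candidates, key=len)
```

Called with ("qu", "que", "qho") this yields qu-qho, and with
(None, "cmc", "cjm") it yields cmc-cjm, matching the examples above.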

To a language tag in one of the above forms may be appended a script
code representing the writing system, or a country code representing the
orthography, or both in that order.  There is no ambiguity, because the
length of each part after the first identifies it: a 2-letter part is an
orthography, a 3-letter part is an Ethnologue code, and a 4-letter part
is a script code.
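
The length rule lends itself to a very small classifier.  The sketch
below is an assumption-laden illustration: classify_subtags is a
hypothetical name, the input is assumed to be already lowercased and
hyphen-separated, and neither the required ordering of parts nor the
grandfathered exceptions are checked.

```python
def classify_subtags(tag: str) -> dict:
    """Sort the parts of a normalized tag by length (hypothetical helper)."""
    parts = tag.split("-")
    result = {"language": parts[0], "script": None, "orthography": None}
    for part in parts[1:]:
        if len(part) == 3:
            # A 3-letter part is an Ethnologue code refining the language.
            result["language"] += "-" + part
        elif len(part) == 4:
            result["script"] = part        # ISO 15924 script code
        elif len(part) == 2:
            result["orthography"] = part   # ISO 3166-1 country code
        else:
            raise ValueError(f"unexpected part length: {part!r}")
    return result
```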

Here are examples.

With specified writing system:

	az-latn is Azerbaijani (North or South unspecified) in Latin script.
	jpr-arab is Judeo-Persian written in Arabic script.
	az-aze-latn is Northern Azerbaijani in Latin script.
	I can't find a useful example of iso-eth-scrp coding.

With specified orthography:

	en-us is U.S. English.
	I can't find useful examples of the other possibilities.

With both:

	zh-hant-cn is PRC-orthography Traditional Chinese.
	I can't find useful examples of the other possibilities.

Grandfathered exceptions:

The "sgn" prefix is followed directly by a country code to signify a
sign language associated with that country.  In this case, then, the
country code comes before the script code.

The three tags zh-gan, zh-min, and zh-yue for the Sinitic languages Gan,
Min, and Yue are grandfathered, even though the codes are not Ethnologue
codes.  Consequently, zh-knn, zh-cfr, and zh-yuh are forbidden.  There is
no conflict, because the Ethnologue languages assigned to gan, min,
and yue (Aten, Milikin, Yuracare) would be tagged as nic-gan, map-min,
and sai-yue.

Private use:

Any tag that begins with x-, or that contains a private-use code from
any of these standards, is for private use.  The private-use codes of
ISO 639 are qaa through qtz; of the Ethnologue, qva through qzz; of
ISO 15924, qaaa through qtzz; of ISO 3166-1, aa, qm through qz, xa
through xz, and zz.
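
A private-use check along these lines might look like the sketch
below.  The name is_private_use is hypothetical.  The sketch assumes
a normalized tag made of lowercase letters, treats a 3-letter first
part as an ISO 639 code and a 3-letter later part as an Ethnologue
code, and applies the country-code list only to parts after the first
(a 2-letter first part is an ISO 639-1 language code, not a country).

```python
from string import ascii_lowercase

# ISO 3166-1 private-use codes as listed above: aa, qm-qz, xa-xz, zz.
PRIVATE_COUNTRIES = (
    {"aa", "zz"}
    | {"q" + c for c in ascii_lowercase if c >= "m"}
    | {"x" + c for c in ascii_lowercase}
)

def is_private_use(tag: str) -> bool:
    """Report whether a normalized tag is private-use (hypothetical helper)."""
    parts = tag.split("-")
    first, rest = parts[0], parts[1:]
    if first == "x":
        return True
    # First part: ISO 639 private-use range.
    if len(first) == 3 and "qaa" <= first <= "qtz":
        return True
    for part in rest:
        if len(part) == 3 and "qva" <= part <= "qzz":    # Ethnologue
            return True
        if len(part) == 4 and "qaaa" <= part <= "qtzz":  # ISO 15924
            return True
        if len(part) == 2 and part in PRIVATE_COUNTRIES: # ISO 3166-1
            return True
    return False
```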

Parsing:

The following Perl-type regular expression will parse tags of this form,
provided that case and underscore mapping have already been done.
It returns up to three values: language, writing system, orthography.

	/^(sgn-[a-z]{2}|[a-z]{2,3}(?:-[a-z]{3})?)
	  (?:-([a-z]{4}))?(?:-([a-z]{2}))?$/x;
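
For readers who would rather experiment outside Perl, here is a Python
sketch of a pattern of the same shape, with the case and underscore
mapping folded in.  The names TAG_RE and parse_tag are hypothetical.
The sketch tries the sgn alternative first so that sgn-us comes back
whole as the language; that ordering is a choice of this sketch.

```python
import re

TAG_RE = re.compile(
    r"""^(sgn-[a-z]{2}                 # grandfathered sign-language form
        |[a-z]{2,3}(?:-[a-z]{3})?)     # ISO 639 code, optional Ethnologue code
        (?:-([a-z]{4}))?               # optional ISO 15924 script code
        (?:-([a-z]{2}))?$              # optional ISO 3166-1 country code
    """,
    re.VERBOSE,
)

def parse_tag(tag):
    """Return (language, writing system, orthography), or None if no match."""
    normalized = tag.replace("_", "-").lower()
    match = TAG_RE.match(normalized)
    return match.groups() if match else None
```

For instance, parse_tag("zh-Hant-CN") gives ("zh", "hant", "cn"), and
parts that are absent come back as None.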

-- 
John Cowan  jcowan at reutershealth.com  http://www.ccil.org/~cowan
O beautiful for patriot's dream that sees beyond the years
Thine alabaster cities gleam undimmed by human tears!
America! America!  God mend thine every flaw,
Confirm thy soul in self-control, thy liberty in law!
        -- one of the verses not usually taught in U.S. schools