Script codes in RFC 3066, 4 issues

Mark Davis mark.davis at
Wed Apr 9 08:06:42 CEST 2003

This scheme looks good to me.

1. One small quibble:

> A writing system is encoded by
> an ISO 15924 script code.  An orthography is encoded by an ISO 3166-1

The term writing system is often contrasted with script. There is no need to
identify them; it is simpler to always use script:

"A script is encoded by an ISO 15924 script code."

2. Shortest form

> Specifically, a language is encoded by the shortest of the following
> four sequences:

There are compatibility issues with this: both ISO 639 codes and Ethnologue
codes are growing, so a "shortest" code may change over time, which makes my
use of a code legal today and illegal tomorrow. (E.g., a new ISO 639 code
subsumes an Ethnologue code.) There are two possibilities:

A. Restrict both the ISO codes and Ethnologue codes so that no new
combinations are shorter than an older combination. Politically, I suspect
the chances of this, the nicest tack, approach nil.

B. Allow non-shortest forms (but keep the shortest-form restriction on ISO
639 codes), and provide a table of equivalences somewhere (not necessarily
associated with 3066bis). Not as nice, but politically feasible.
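
To make the problem concrete, here is a rough sketch of how the "shortest" answer can shift when a new code is assigned, and of what the equivalence table in option B might look like. Everything in it is hypothetical: the function names, the table, and the codes themselves (qaa, qab, and qva are taken from the private-use ranges listed in the quoted proposal, so they stand in for no real language).

```python
def shortest_form(candidates, assigned):
    """Return the shortest candidate whose subtags are all currently assigned."""
    usable = [c for c in candidates
              if all(part in assigned for part in c.split("-"))]
    return min(usable, key=len) if usable else None


# Hypothetical language: reachable today only as ISO code + Ethnologue code.
CANDIDATES = ["qab", "qaa-qva"]
ASSIGNED_NOW = {"qaa", "qva"}
# Later, a standalone ISO 639 code (here "qab") is assigned to the same language.
ASSIGNED_LATER = ASSIGNED_NOW | {"qab"}

# Option B: an equivalence table (assumed format) maps older forms to the
# current shortest form, so the older form can stay legal.
EQUIVALENCES = {"qaa-qva": "qab"}


def to_shortest(tag):
    """Map a tag to its current shortest equivalent, if one is recorded."""
    return EQUIVALENCES.get(tag, tag)


print(shortest_form(CANDIDATES, ASSIGNED_NOW))    # qaa-qva
print(shortest_form(CANDIDATES, ASSIGNED_LATER))  # qab
print(to_shortest("qaa-qva"))                     # qab
```

Under a strict shortest-form rule, data tagged qaa-qva before the new assignment would become invalid afterwards; with the table, it stays valid and can still be matched against qab.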

3. For compatibility, we also need the rule that once a 3066bis code, forever
a 3066bis code. That is, even if the Ethnologue or ISO removes or deprecates
a code, that code remains forever valid for use in a 3066bis subtag.
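
One way to picture this stability rule is a registry from which nothing is ever deleted, only flagged. The sketch below is purely illustrative; the class and method names are invented for this message and do not come from any draft, and the code qva is again a private-use placeholder.

```python
class SubtagRegistry:
    """Hypothetical registry: codes can be deprecated but never removed."""

    def __init__(self):
        self._ever_valid = set()
        self._deprecated = set()

    def add(self, code):
        self._ever_valid.add(code.lower())

    def deprecate(self, code):
        # The source standard withdrew the code; it is flagged, not deleted.
        code = code.lower()
        if code not in self._ever_valid:
            raise KeyError(code)
        self._deprecated.add(code)

    def is_valid(self, code):
        # Validity depends only on whether the code was ever registered.
        return code.lower() in self._ever_valid

    def is_deprecated(self, code):
        return code.lower() in self._deprecated


registry = SubtagRegistry()
registry.add("qva")
registry.deprecate("qva")
print(registry.is_valid("qva"), registry.is_deprecated("qva"))  # True True
```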

4. As now, any strings would be compared case-insensitively. However,
customarily the casing would be en-foo-Cyrl-CH.
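
As a small sketch of what I mean, assuming the subtag lengths from the quoted proposal (4 letters for a script, 2 letters after the first subtag for a country), something like the following would do. The function names are mine and purely illustrative.

```python
def same_tag(a, b):
    """Tags are compared without regard to case."""
    return a.lower() == b.lower()


def customary_case(tag):
    """Apply the customary casing: language and Ethnologue subtags in
    lowercase, 4-letter script subtags in title case, 2-letter country
    subtags in uppercase."""
    parts = tag.split("-")
    cased = []
    for index, part in enumerate(parts):
        if index == 0:
            cased.append(part.lower())        # primary language subtag
        elif len(part) == 4:
            cased.append(part.capitalize())   # script code
        elif len(part) == 2:
            cased.append(part.upper())        # country code
        else:
            cased.append(part.lower())        # e.g. Ethnologue code
    return "-".join(cased)


print(customary_case("ZH-HANT-cn"))   # zh-Hant-CN
print(customary_case("AZ-AZE-LATN"))  # az-aze-Latn
print(same_tag("en-US", "EN-us"))     # True
```

The casing is only a display convention; same_tag is what governs matching.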

(مرقص بن داود)
mark.davis at
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799

----- Original Message -----
From: "John Cowan" <jcowan at>
To: <ietf-languages at>
Sent: Monday, April 07, 2003 08:21
Subject: Re: Script codes in RFC 3066

> Since I am someone who works best when staring at syntactic schemas and
> actual examples, I would like to propose the following trial balloon
> for RFC 3066bis, specifically the kinds of codes that would be available
> without IANA registration.
> First of all, I adopt the categories of Peter Constable's admirable
> papers which constitute UTN #8 (
> He characterizes the relevant kinds of language-related categories as
> "language", "writing system", "orthography", and "domain-specific data
> set".  I am proposing a standardized, backward-compatible method of
> writing down tags that represent the first three of these.  The fourth
> category, plus tags that aren't properly handled by the scheme below,
> are still meat for IANA standardization as one-offs.
> In this scheme, a language is encoded by either an ISO 639 tag or an ISO
> 639 tag followed by an Ethnologue tag.  A writing system is encoded by
> an ISO 15924 script code.  An orthography is encoded by an ISO 3166-1
> country code.  All these are separated by hyphuses, though underscore
> is recognized as an equivalent of hyphus.  Tags are Latin letters
> (case-insensitive) and/or European digits.
> Specifically, a language is encoded by the shortest of the following
> four sequences:  an ISO 639-1 tag, an ISO 639-2 tag, an ISO 639-1 tag
> followed by hyphus followed by an Ethnologue tag, or an ISO 639-2 tag
> followed by a hyphus followed by an Ethnologue tag.  Examples:
> en is English (not eng, not en-eng, not eng-eng).
> ast is Asturian (not ast-aub).
> qu-qho is Imbabura Quechua (not que-qho).
> cmc-cjm is Eastern Cham.
> To a language tag in one of the above forms may be appended a script
> code representing the writing system, or a country code representing the
> orthography, or both in that order.  There is no ambiguity, because if
> the second part of the tag is 2-letter it is an orthography, if 3-letter
> it is an Ethnologue code, and if 4-letter it is a script code.
> Here are examples.
> With specified writing system:
> az-latn is Azerbaijani (North or South unspecified) in Latin script.
> jpr-arab is Judeo-Persian written in Arabic script.
> az-aze-latn is Northern Azerbaijani in Latin script.
> I can't find a useful example of iso-eth-scrp coding.
> With specified orthography:
> en-us is U.S. English.
> I can't find useful examples of the other possibilities.
> With both:
> zh-hant-cn is PRC-orthography Traditional Chinese.
> I can't find useful examples of the other possibilities.
> Grandfathered exceptions:
> The "sgn" prefix is followed directly by a country code to signify a
> sign language associated with that country.  In this case, then, the
> country code comes before the script code.
> The three tags zh-gan, zh-min, and zh-yue for the Sinitic languages Gan,
> Min, and Yue are grandfathered, even though the codes are not Ethnologue
> codes.  Consequently, zh-knn, zh-cfr, and zh-yuh are forbidden.  There is
> no conflict, because the Ethnologue languages assigned to gan, min,
> and yue (Aten, Milikin, Yuracare) would be tagged as nic-gan, map-min,
> and sai-yue.
> Private use:
> Any tag beginning x-, or with a private use code from any standard in
> it, is for private use.  The private-use tags of ISO 639 are qaa-qtz;
> of the Ethnologue, qva-qzz; of ISO 15924, qaaa-qtzz; of ISO 3166-1, aa,
> qm-qz, xa-xz, zz.
> Parsing:
> The following Perl-type regular expression will parse tags of this form,
> provided that case and underscore mapping have already been done.
> It returns up to three values: language, writing system, orthography.
> /^((?:[a-z][a-z][a-z]?(?:-[a-z][a-z][a-z])?)|sgn(?:-[a-z][a-z]))
> (?:-([a-z][a-z][a-z][a-z]))?(?:-([a-z][a-z]))?$/x;
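
For readers who would rather try this outside Perl, here is a rough Python transcription of the regular expression quoted above. The pattern is copied unchanged; the parse_tag wrapper, which does the case and underscore mapping first, is an illustrative assumption and not part of the proposal.

```python
import re

# Same pattern as the quoted Perl expression, in verbose mode so that the
# line break inside the pattern is ignored.
TAG_RE = re.compile(
    r"""^((?:[a-z][a-z][a-z]?(?:-[a-z][a-z][a-z])?)|sgn(?:-[a-z][a-z]))
        (?:-([a-z][a-z][a-z][a-z]))?(?:-([a-z][a-z]))?$""",
    re.VERBOSE,
)


def parse_tag(tag):
    """Return (language, writing system, orthography), or None if no match."""
    normalized = tag.lower().replace("_", "-")  # case and underscore mapping
    match = TAG_RE.match(normalized)
    return match.groups() if match else None


print(parse_tag("zh-hant-cn"))   # ('zh', 'hant', 'cn')
print(parse_tag("az-aze-latn"))  # ('az-aze', 'latn', None)
print(parse_tag("EN_US"))        # ('en', None, 'us')
```

Note that for sgn-us the first alternative matches before the sgn alternative is tried, so this returns ('sgn', None, 'us'); the sgn branch is only reached when a script code follows the country code, as in sgn-us-latn.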
> --
> John Cowan  jcowan at
> O beautiful for patriot's dream that sees beyond the years
> Thine alabaster cities gleam undimmed by human tears!
> America! America!  God mend thine every flaw,
> Confirm thy soul in self-control, thy liberty in law!
>         -- one of the verses not usually taught in U.S. schools
