RFC3066bis: looking ahead

Peter Constable petercon at microsoft.com
Tue Jan 20 01:03:27 CET 2004


A ballot on ISO 639-3 closed last month, and the project is moving forward; a new draft is expected to be circulated as a DIS very soon.

One of the innovations in ISO 639-3 is to recognize a third scope for identifiers, in addition to the individual-language and collective scopes included in ISO 639-2. This third scope is between the other two, and is being called "macrolanguage". The idea of a macrolanguage identifier is that the thing it represents is considered an individual language in some usage context, though it encompasses two or more individual languages listed in ISO 639-3.

A good example of this is Chinese: it has "individual-language" identifiers in ISO 639-1 and ISO 639-2 (zh, zho), though we know that there are distinct, individual languages within its scope, such as Yue, Hokkien, etc. Thus, when ISO 639-3 is published, there will be alpha-2 and alpha-3 identifiers for "Chinese", but there will also be alpha-3 identifiers for the various Chinese languages such as Yue and Hokkien.

I have started thinking about how to integrate ISO 639-3 into the RFC 3066 framework once ISO 639-3 has been published. Ignoring for the moment the existence of registered tags such as zh-yue, it would become possible to use a three-letter identifier for Yue (let's say it's "yue" for the sake of discussion), but there will also be prior implementations that use "zh", and thus a need to relate "zh" and "yue".

It occurred to me that an easy way to do this would be to require hierarchical tagging, such as "zh-yue", in these situations (i.e. only where a macrolanguage identifier exists). This would make use of the existing language-range matching mechanism; so, for instance, an HTTP request for "zh" would match content tagged with "zh-yue" (which wouldn't happen if the content were tagged as "yue").
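
As a minimal sketch of the matching behaviour described above: RFC 3066 says a language-range matches a tag if it equals the tag, or equals a prefix of the tag that is immediately followed by "-". The function name range_matches is hypothetical, chosen only for illustration, and "yue" is the placeholder identifier used for discussion in this message, not an assigned code.

```python
def range_matches(language_range: str, tag: str) -> bool:
    """Prefix matching of a language-range against a language tag.

    Hypothetical helper for illustration; comparison is case-insensitive.
    """
    rng = language_range.lower()
    candidate = tag.lower()
    if rng == "*":
        # The wildcard range matches every tag.
        return True
    # Match on equality, or on a prefix that ends at a sub-tag boundary.
    return candidate == rng or candidate.startswith(rng + "-")


# A request for "zh" finds content tagged hierarchically...
print(range_matches("zh", "zh-yue"))
# ...but not content tagged with the bare three-letter identifier.
print(range_matches("zh", "yue"))
```

The boundary check matters: a range of "zh" must not match an unrelated tag that merely begins with the same letters, such as "zha".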

This has a potential implication for the syntax being proposed in RFC 3066bis, which allows sub-tags for language, then script, then region, then variant. Something like "zh-yue" would involve another sub-tag between the first one, for language, and a subsequent one for script, region or variant; this extra sub-tag would also be for language. I don't think this is a serious problem, however: if RFC 3066bis were published with syntax as in the current draft, it would simply be a matter of revising the expansion given for "lang" so as to allow terminals of the form 2*3ALPHA "-" 3ALPHA. As long as we don't end up using alpha-3 country IDs, which would be indistinguishable from a three-letter language sub-tag in that position, this should work without problems.
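
A rough sketch of how a parser might apply the revised "lang" expansion follows. It rests on one assumption, labelled in the code as well: any second sub-tag of exactly three letters is read as a language sub-tag. The function name split_lang and its return shape are hypothetical, not taken from the draft.

```python
import re

# ASCII letters only, matching the ALPHA terminal.
_ALPHA = re.compile(r"[A-Za-z]+")


def split_lang(tag: str):
    """Split a tag into (primary language, extended language, remaining sub-tags).

    Hypothetical helper. ASSUMPTION: a second sub-tag of exactly three
    letters is always a language sub-tag.
    """
    subtags = tag.split("-")
    primary = subtags[0]
    if not (2 <= len(primary) <= 3 and _ALPHA.fullmatch(primary)):
        raise ValueError(f"not a 2*3ALPHA primary sub-tag: {primary!r}")
    if len(subtags) > 1 and len(subtags[1]) == 3 and _ALPHA.fullmatch(subtags[1]):
        # Form 2*3ALPHA "-" 3ALPHA: the second sub-tag is also a language.
        return primary.lower(), subtags[1].lower(), subtags[2:]
    # No extended language; everything after the primary is left for
    # script, region or variant handling.
    return primary.lower(), None, subtags[1:]


print(split_lang("zh-yue-HK"))
print(split_lang("zh-Hant-HK"))
# An alpha-3 country ID in second position would be misread as a language:
print(split_lang("zh-CHN"))
```

The last call shows why alpha-3 country IDs would be a problem: by length and character class alone, "CHN" cannot be told apart from a language sub-tag.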


Peter
 
Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division


