language tag structure

Jon Hanna jon at hackcraft.net
Tue Jan 18 15:18:38 CET 2005


Jefsey,
This list is concerned with three topics.

1. RFC 3066 - a system for labelling languages, dialects, and groups of dialects in a shorthand primarily intended to be machine readable and with an intended degree of precision applicable for most users (it will never be adequate for a lot of linguistic research, and doesn't try to be) to enable automated systems to treat text and other communiqués appropriately.

Examples of this appropriate behaviour include using the correct spell-checking dictionary, using appropriate glyphs in cases where the variant used varies from language to language within the same script (e.g. the difference in the angle of the acute accent in Polish and French or the differences in the same ideographs as used in Japanese and Chinese), and returning the correct choice of document to a user when several translations are available.

This system uses and extends ISO 639. It also uses ISO 3166, because dialects or groups of dialects often coincide closely enough with geographic borders that they can be usefully, if somewhat imprecisely, identified with reference to a country.

The use of RFC 3066 is deliberately not limited by RFC 3066 itself. It plays a role in how other technologies work, and does so according to the generally accepted (sometimes after learning the hard way) principle that technologies, and Internet technologies in particular, work better and are more extensible if they do one small job well and provide well-defined ways to interact with other technologies than if they try to do everything. Because of this you do not need to be an expert in RDF, XML, HTML or HTTP to contribute usefully to this list, even though all four of those technologies use RFC 3066 (of course we do need to keep an eye on RFC 3066 remaining useful in those places, which is why someone like me, who knows a lot about RDF, XML, HTML and HTTP and general issues about encoding data but very little about languages, can contribute too).

Similarly, it does not try to reflect every possible datum about the language in question. Nor could it. For a reductio ad absurdum, consider that a document could contain the slang term "grok" if it was written by one person (generally a techie or a member of certain religious groups) while a member of that person's family might not know the word. To try to reflect the language difference between these two people, rather than the broad brushstrokes of their dialect (and rather vaguely defining "dialect" at that), would not only be technologically extremely difficult, it would also be pointless for all but a few applications used in researching idiolects, and would actually hinder usefulness elsewhere.

If a greater level of linguistic detail is needed than is catered for by RFC 3066 then another technology is needed, though it could perhaps build on RFC 3066 much as RFC 3066 builds on ISO 639.
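To make the structure described under topic 1 concrete, here is a minimal Python sketch, not a definitive implementation. It relies on the RFC 3066 rules that a tag is a hyphen-separated run of subtags of at most eight characters, that a two-letter second subtag is read as an ISO 3166 country code, and that a language range matches a tag when it equals the tag or a prefix of it ending at a hyphen. The function names split_tag and matches_range are made up for illustration, and nothing here checks the codes against the ISO lists or the IANA registry.

```python
import re

# Well-formedness only: a primary subtag of 1-8 letters, followed by any
# number of hyphen-separated subtags of 1-8 letters or digits.
_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def split_tag(tag):
    """Return (primary, country, rest) for an RFC 3066-style tag.

    country is filled in only when the second subtag has exactly two
    letters; everything else is returned untouched in rest.
    """
    if not _TAG.fullmatch(tag):
        raise ValueError(f"not a well-formed tag: {tag!r}")
    subtags = tag.split("-")
    primary = subtags[0].lower()
    rest = subtags[1:]
    country = None
    if rest and len(rest[0]) == 2 and rest[0].isalpha():
        country = rest[0].upper()
        rest = rest[1:]
    return primary, country, rest


def matches_range(language_range, tag):
    """Prefix matching: 'en' matches 'en' and 'en-GB', but not 'eng'."""
    r, t = language_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


print(split_tag("en-GB"))       # ('en', 'GB', [])
print(split_tag("de-AT-1996"))  # ('de', 'AT', ['1996'])
print(matches_range("en", "en-GB"))  # True
```

A server choosing between translations could run matches_range over each available document's tag against the user's preferred range; a spell-checker could fall back from the full tag to the primary subtag alone when it has no dictionary for the country variant.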

2. Potential successors to RFC 3066; currently this primarily refers to an internet draft that is reasonably likely to be accepted. Such a successor is referred to as RFC 3066bis (though I must admit I don't know what the "bis" stands for).

The successor currently being mooted differs from RFC 3066 primarily in formalising the use of ISO 15924 to identify scripts (e.g. Latin, Ogham, Simplified Han, Cyrillic) used in written forms of the language. While I have expressed some reservations about this, it is reasonable given how closely script and language are tied in practice, and given that some registered tags build on combinations of languages and scripts (e.g. in talking of the orthographic change in German of the 1990s we are talking about how a particular language [German, tagged as 'de'] is written in a particular script [Latin, tagged as 'Latn']).
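As a rough illustration of what formalising the script position buys an implementer, the following Python sketch picks the script out of a tag. It assumes that a script, when present, is a four-letter ISO 15924 code placed directly after the language subtag; the draft's full grammar is not reproduced here, and the function name script_of is hypothetical.

```python
def script_of(tag):
    """Return the script subtag in ISO 15924 casing (e.g. 'Latn'), or None.

    Assumption: the script is the second subtag and is exactly four
    ASCII letters. The code is not checked against the ISO 15924 list.
    """
    subtags = tag.split("-")
    if len(subtags) > 1:
        candidate = subtags[1]
        if len(candidate) == 4 and candidate.isascii() and candidate.isalpha():
            return candidate.title()
    return None


print(script_of("sr-Latn"))     # Latn
print(script_of("zh-hans-CN"))  # Hans
print(script_of("de-1996"))     # None
```

Because the script is recognised by position and length alone, software can tell 'sr-Latn' from 'sr-Cyrl' without consulting a list of individually registered tags.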

3. Occasionally, topics likely to be of interest to people with enough interest in RFC 3066 and its potential successors (viz. techies with an interest in languages, linguists with an interest in tech, and those suitably expert in both matters to be considered both techies and linguists) will be mentioned here (such as Misha's interesting link to Polari, a subject I've always found fascinating myself). Unless Misha has an intention of registering Polari (a matter that has been discussed before, but more in terms of discussing the boundaries of registration than as an actual request) it should probably have been marked as [Off-Topic] or [OT], but with a short mail with just a link, such as Misha's, this isn't important. Appropriately labelling off-topic mails does remain an important act of courtesy for any mail longer than a couple of sentences.

I'm not at all sure where your recent statements fit into any of this.

Regards,
Jon Hanna
Work: <http://www.selkieweb.com/>
Play: <http://www.hackcraft.net/>
Chat: <irc://irc.freenode.net/selkie>


