Language tag structure

Martin Duerst
Sat, 27 Apr 2002 13:33:56 +0900

Hello Johannes,

At 17:39 02/04/26 -0100, J.Wilkes wrote:
> > On 04/25/2002 09:53:07 PM Torsten Bronger wrote:
> >>  The very important
> >> "en-GB"--"en-US" thing supports this assumption.  Most implementors
> >> (including myself) realise that by some sort of "longest match".
>Unfortunately, yes. It's easy to implement this way, but unclean and not
>valid. It would be if the tags were defined this way (first fragment
>language, second country, ...), but they are not. Not exactly.

If the second code is a two-letter code, it's a country code.
Otherwise not (necessarily).
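
A minimal sketch of that rule in Python, for illustration only. The
function name second_subtag_kind and its return labels are made up
here; treating x- and i- tags as opaque is an assumption based on
those prefixes not following the language/country pattern.

```python
def second_subtag_kind(tag: str) -> str:
    """Classify the second subtag of an RFC 3066 style tag (sketch only)."""
    parts = tag.split("-")
    if len(parts) < 2:
        return "none"      # primary subtag only, e.g. "en"
    primary, second = parts[0].lower(), parts[1]
    if primary in ("x", "i"):
        return "opaque"    # assumption: no country reading for x-/i- tags
    if len(second) == 2 and second.isascii() and second.isalpha():
        return "country"   # two letters: read as a country code
    return "other"         # anything else: no guaranteed meaning


print(second_subtag_kind("en-GB"))    # country
print(second_subtag_kind("de-1901"))  # other
```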

In general, the hierarchy may be used for fallbacks, but it's
not guaranteed to work. RFC 3066 says so explicitly. For example,
for x-, the only thing you know is that it's experimental. Similarly
for i-. For sgn-, you don't know much more.
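
The kind of "longest match" fallback discussed here could be sketched
as follows, in Python. This is only an illustration: fallback_chain,
lookup and the resources mapping are hypothetical names, and skipping
truncation for x- and i- tags is an assumption drawn from the caveat
above, not something the RFC prescribes as an algorithm.

```python
def fallback_chain(tag):
    """List candidate tags from most to least specific (sketch only)."""
    parts = tag.split("-")
    if parts[0].lower() in ("x", "i"):
        # Assumption: truncating these tags yields nothing meaningful.
        return [tag]
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]


def lookup(tag, resources):
    """Return the first resource found along the fallback chain, or None."""
    available = {key.lower(): value for key, value in resources.items()}
    for candidate in fallback_chain(tag):
        hit = available.get(candidate.lower())
        if hit is not None:
            return hit
    return None  # the fallback is not guaranteed to find anything


print(fallback_chain("de-AT-1901"))  # ['de-AT-1901', 'de-AT', 'de']
```

A caller still has to decide what to do when lookup returns None, or
when the truncated tag matches something unsuitable.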

>It is desirable to have a language tagging method that allows for all the 
>discussed. For example, "language=de;year=1901;country=AT" would do, defining
>the order of the fragments separated by ";" as irrelevant.

I'm not sure you get much more than now. 'year=1901' doesn't say
what it looks like. It's not that there is a generic 'year' field,
it's just that we took a number (trying to get as non-arbitrary as
possible, but not actually being very successful) to stand for
a collection of orthographic conventions (with a certain internal

And even if it were something that we could expand in a regular
fashion, adding years to all tags where needed, or even if you
add a table lookup from the tags to this kind of information,
you haven't really achieved anything. What you need is more
tables or other information, to get at data for spell-checking,
text-to-speech processing, hyphenation, styling, ...
Because you need that anyway, the intermediate structure,
or the tags themselves, may not be that important.

>If anyone wants to go into detail about my example, or a revision of RFC 3066,
>please set a different topic or direct me to a place/mailinglist more
>appropriate for

For a revision of RFC 3066, this would be the list that would
be used, I guess.

>I think it's better to get tags like de-AT-1901 or de-1901-AT registered
>soon and revise the standard separately, so these tags can be put to use
>already.

Yes indeed. A revision easily takes several years.

Regards,    Martin.