Proposed Successor to RFC 3066 (language tags)

Thu Nov 20 03:23:32 CET 2003

From: "Addison Phillips [wM]" <aphillips at webmethods.com>
> 2. The use of "_" and "." would also be illegal under the existing
> standard.
> Pick your favorite ASCII characters... we chose these two, but any two you
> pick other than ALPHANUM and HYPHEN-MINUS would be additions to
> the range of characters.

You're right that you needed two other characters. But I would not have
chosen characters that are already used in URL escaping (so I would have
eliminated "&", "+", "=", and "%").
That's why it would probably have been more stable to avoid collisions
between escaping mechanisms by choosing "." and "_".
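
To illustrate the escaping concern, here is a minimal Python sketch that percent-encodes a few candidate tags with the standard `urllib.parse.quote` function. The tag strings are hypothetical and serve only to compare separator characters; they are not taken from the draft.

```python
from urllib.parse import quote


def survives_url_escaping(tag: str) -> bool:
    """Return True if percent-encoding leaves the tag unchanged."""
    # safe="" asks quote() to escape everything outside its always-safe
    # set (ASCII letters, digits, and "_", ".", "-", "~").
    return quote(tag, safe="") == tag


# Hypothetical tags, used only to compare separator characters.
for tag in ["sr-CS", "sr_CS.1994", "sr=CS+1994", "sr&CS%1994"]:
    print(tag, "->", quote(tag, safe=""), survives_url_escaping(tag))
```

Under these assumptions, tags built with "_" and "." pass through a URL unchanged, while tags using "=", "+", "&" or "%" have to be escaped.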

> 3. Note that "SP" is unassigned. Click here for the semi-official table
> http://www.iso.ch/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/iso_3166-1_decoding_table.html.
> According to this draft, the code for "Serbia and Montenegro" would be
> CS1, a specially registered value.

I know that the former Yugoslavia has changed its official name to Serbia &
Montenegro, and I also knew that a new code was requested, but I did not know
which one. I have seen several proposals, including requests to reserve two
codes: SP for Serbia, and another, unknown to me, for Montenegro, since it has
its own local government and may someday split from Serbia after a referendum
that some local parties still plan, the same way the Czech Republic and
Slovakia separated.

For now, both regions have decided to continue living together while they
restructure their economies in peace, prepare their future and restore
relations with their neighbors. This is what is reflected in the new
name of the country, which definitely says goodbye to the Yugoslav
Federation but paves the way for a future split.

> 4. URLs must obey the rules for URLs. Values placed in a "GET" URL that
> used an extended language tag as a value would be no exception. This may
> be unattractive, but doesn't invalidate the design. I'm more concerned
> about direct consumers, such as the Accept-Language header in HTTP, than
> I am about transport encodings. The characters chosen were picked from
> the list of characters valid in HTTP headers and URIs.

Well, both "_" and "." are also valid in HTTP headers and URIs, no?

> 5. RFC 3066 requires the shortest form of both ISO639 and ISO3166 to be
> used (the latter being a moot point currently).
> The new draft does not change this, but does extend it in the area of
> stability.

What does stability mean for region codes when regions change over their
history, splitting or joining together?
For language codes, the relevant definition of regions should be the
ethnographic one, which official countries do not cover exactly.

Say, for example, that someone wants to encode Catalan spoken in
Catalonia, Spain, separately from Catalan spoken in France. One has to use
the country code before the effective region code (which one? apart from a
few countries such as France, and federal countries such as the USA or
Switzerland, does ISO-3166-2 provide a good standard for every country?).
What if an autonomous region gains the status of an independent country?

ISO3166 is not an excellent match for language tags, and
cultural/ethnographic regions are not coded by it.

> 6. No, these are tags for the identification of languages.
> A "locales extension" is something Mark and I have in mind, but this
> isn't it.

This is so closely related... For a language like Chinese, with its many
regional dialects, the ISO3166 region code will not map them correctly. How
do we encode the Han dialects (Yue, Wu, ...) for now? You propose standard
(registered) variants, but they collide with the numeric year specifier.

Maybe the indication of the year should have been separated from the
registered variant code.

> > and for documenting the legal values of variant codes (either a year
> > with possible era, or a registered tag)
>
> 7. Don't forget private use variant tags.

I have not forgotten them. It is a good thing to provide a way to associate
semantics with them without relying only on their encoding order. Including
wildcards for language ranges is also a good point. But language ranges are
more complex than they appear. The Ethnologue provides a good
classification system, but classifying the regional dialects of a single
language is quite difficult (sometimes the main ISO639 language code is very
poor, for example when it already defines very imprecise languages, or
includes codes for groups of languages).
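
As a point of comparison, the simple prefix matching that RFC 3066 describes for language ranges can be sketched in a few lines of Python. The function name `range_matches` is hypothetical, and the sample tags are only illustrations.

```python
def range_matches(language_range: str, tag: str) -> bool:
    """Sketch of RFC 3066 style matching of a language range against a tag.

    A range matches a tag if it is "*", if it equals the tag, or if it is
    a prefix of the tag and the next character in the tag is "-".
    Comparison is case-insensitive.
    """
    r = language_range.lower()
    t = tag.lower()
    if r == "*":
        return True
    return t == r or t.startswith(r + "-")


print(range_matches("zh", "zh-yue"))   # prefix followed by "-"
print(range_matches("zh", "zho"))      # prefix, but no "-" boundary
print(range_matches("*", "ca-ES"))     # wildcard
```

This kind of matching only groups tags that share a literal prefix, so it cannot express a family of dialects or languages whose codes are unrelated strings.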

I just wonder whether the semantic variant tags should be managed by a
separate language properties database, like the Unicode character
properties, in which standard groups could be inferred and referred to, if
needed, by a specific code. But this is a long-term study, with its own
implementation issues, which should be discussed together with the need to
represent and process locales and user preferences in software.

I just remembered another use of "=" with language codes:
the "Accept-Language:" header can specify a list of alternative language
codes, each one with its own qualifier (a "q=" fractional value between 0.0
and 1.0 after a semicolon). I fear that some browsers may try to parse a
numeric value found in the variant tag of a language code.

This is another form of language range (alternates are comma-separated, and
qualified by a semi-colon separator), not discussed in the RFC proposal
which just speaks about the "*" wildcard and prefixes...
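
The comma and semicolon structure of that header can be sketched as follows in Python. This is a simplified illustration, not a complete HTTP parser: the function name `parse_accept_language` is hypothetical, and treating a malformed "q" value as 0.0 is an assumption made here.

```python
from typing import List, Tuple


def parse_accept_language(header: str) -> List[Tuple[str, float]]:
    """Split an Accept-Language value into (range, quality) pairs.

    Alternates are comma-separated; each may carry ";q=<value>".
    A missing q defaults to 1.0. Results are sorted by descending quality.
    """
    entries = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the first ";" only, so "=" inside the tag part is
        # never mistaken for the start of a parameter.
        tag, _, params = item.partition(";")
        quality = 1.0
        for param in params.split(";"):
            name, sep, value = param.strip().partition("=")
            if sep and name.lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # assumption: malformed q means unacceptable
        entries.append((tag.strip(), quality))
    # sort() is stable, so equal qualities keep their header order.
    entries.sort(key=lambda entry: entry[1], reverse=True)
    return entries


print(parse_accept_language("fr-CA, fr;q=0.8, en;q=0.5, *;q=0.1"))
# A hypothetical tag containing "=" and digits, followed by a real q value:
print(parse_accept_language("fr=1994;q=0.8"))
```

A parser that splits on ";" before looking for "=" keeps the tag intact, whereas one that searches the whole item for "=" or for a number could confuse part of the tag with the quality value.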