Proposed Successor to RFC 3066 (language tags)

Thu Nov 20 06:39:12 CET 2003

Hi Philippe,

Comments interlinearly below.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.

> 
> I know that the former Yugoslavia has changed its official name into
> Serbia & Montenegro. I also knew that a new code was requested, but I did
> not know which: I have seen several proposals, including requests to
> reserve two codes, one for Serbia coded SP, one unknown other for
> Montenegro as it has
> etc.

It helps to use real values. The country you chose is particularly interesting for this document... the country code actually assigned was "CS".

> 
> Well, both "_" and "." are also valid in HTTP headers and URIs, no?

Yes. My point is not to belittle your suggestion. Mark and I considered several formulations, and there is no reason the characters ultimately chosen could not be different. We should weigh the options for expanding the character range carefully before settling on a formulation. The characters = and % were chosen precisely because they are used for the same purpose in other applications. That doesn't make them the right choice, but it does make them consistent with those applications.

> 
> What is stability for region codes when regions change in their history,
> split or join together?

"Stability" is not for the regions themselves. It's for the codes. The problem with Serbia and Montenegro is that the country was assigned 'CS' as a code. 'CS' recently referred to Czechoslovakia. There is data in the world labeled with 'CS' that pertains to the late union of Czechs and Slovaks. It is very bad, in our opinion, for the meaning of a tag to change due to external events, hence the creation of stability rules. Consider that one might want to tag historical data with a tag that is somewhat contemporaneous with the text, say a Czech document from 1968. What does one do? Under the stability rules, once a code is assigned to a region or country, it is permanent. This is also why the "shortest form" rule is still applied to ISO3166 despite the fact that all countries currently get alpha2 and alpha3 codes: if stability applies to ISO3166, there will come a day when the alpha2 codes are exhausted.

> For language codes, the relevant definition of regions should be the
> ethnographic definition, that official countries do not cover exactly.

This is the object of some debate. The variant and extension mechanisms in this draft were proposed precisely to deal with issues such as these (where lang+region is not precise enough).
> 
> Say for example...

I'm not going to address any of this, except to say that:

1. The region code is NOT obligatory. Only the primary language tag is obligatory.
2. There seems to be some utility in the existing system, which is, after all, an abstraction.
> 
> > 6. No, these are tags for the identification of languages.
> > A "locales extension" is something Mark and I have in mind, but
> > this isn't it.
> 
> This is so much related... 

But it is good to keep the applications apart. In particular, I would expect that variant subtags won't be registered solely to provide locale information. For example, I don't expect to see 'fr-FR-EURO' as a language tag. We might see someone use something like: 'fr-FR-x-currency=EUR', but this is a different application.

> For a language like Chinese, with its many regional dialects, the
> ISO3166 region code will not map them correctly. How do we encode for
> now the Han dialects (Yue, Wu, ...)? You propose standard (registered)
> variants but they collide with the numeric Year specifier.

I don't see what this has to do with ISO3166. The ISO3166 code is not obligatory. The year subtags and variant subtags can be mixed, and neither of them is obligatory either. In addition, the extension mechanism provides a way for a community of users to standardize references that go beyond the already very fine-grained abilities of unlimited variant subtags. I would expect highly technical users (such as scholars interested in the historical evolution of a dialect) to create specific subtags (either variant or extension) to describe very specific sets of documents. These would not rise to the level of a registration because they are not of general enough interest.

I think the proposed set of registrations for Resian (sl-rozaj) is a good example of how the new draft will help resolve many of the difficulties that the current RFC3066 regime imposes.

In particular: RFC3066 was designed as an abstraction of language. It does not (and indeed, no tagging mechanism could) address every possible valid language and dialect in the world today (let alone historically). The registration system was (if I may so infer, since I didn't write it) designed for a limited number of exceptional cases.

The new draft attempts to address the need for more complex registrations.

By way of example, all of these are 'valid' (if some are silly):

zh-xiang
zh-CN-xiang
zh-xiang-2003
zh-xiang-750BCE  # this date is random.
zh-CN-xiang-2003
zh-Hant-CN-2003-xiang
zh-x-dialect=xiang
zh-Hant-CN-xiang-scouse-2003-boont

Each of these pertains to a very specific example of the xiang dialect of Chinese. If that isn't specific enough...
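To illustrate how subtags of different kinds can be told apart by shape alone, here is a minimal sketch. It is not the draft's ABNF: the shapes it tests for (a 4-letter title-case script, a 2-letter upper-case region, digits with an optional BCE suffix for a year, and 'x' introducing private-use material) are assumptions read off the example tags above, and the names `classify` and `YEAR` are hypothetical. Real tags are compared case-insensitively, so a real processor could not lean on case the way this one does.

```python
import re

# Assumed shape for a year subtag: 1-4 digits, optionally followed by "BCE".
YEAR = re.compile(r"^\d{1,4}(BCE)?$")

def classify(tag):
    """Label each subtag of a hyphen-separated tag by its shape alone."""
    subtags = tag.split("-")
    labels = [("language", subtags[0])]   # only the primary subtag is obligatory
    private = False
    for sub in subtags[1:]:
        if private:
            # Everything after the 'x' marker is treated as private use.
            labels.append(("private-use", sub))
        elif sub.lower() == "x":
            private = True
            labels.append(("private-use-marker", sub))
        elif YEAR.match(sub):
            labels.append(("year", sub))
        elif len(sub) == 4 and sub.isalpha() and sub.istitle():
            labels.append(("script", sub))
        elif len(sub) == 2 and sub.isalpha() and sub.isupper():
            labels.append(("region", sub))
        else:
            labels.append(("variant", sub))
    return labels

for example in ("zh-Hant-CN-2003-xiang", "zh-xiang-750BCE", "zh-x-dialect=xiang"):
    print(example, classify(example))
```

Because the year test depends only on the subtag's own shape, year and variant subtags can appear in either order without confusing the classifier.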
> 
> Maybe the indication of year should have been separated from the
> registered variant code.

Why? The ABNF allows them to be used interchangeably (see the examples above), and either kind is easy to detect. This is more flexible than forcing the order (as separation would require).

> >
> > 7. Don't forget private use variant tags.
> 
> I don't forget them. That's a good thing to give a way to associate a
> semantic to them without relying only on their encoding order. Also a good

Order *is* important for matching. 

Private use variant subtags allow the UNREGISTERED use of a subtag by a user community that would otherwise have to register a tag. This circumvents problems with 'non-standard' language tags, such as Microsoft's use of RFC3066-like identifiers, in an open environment. My software may not recognize all of Microsoft's tags, but it can still perform operations on them.

> point to include wildcards for language ranges. But languages ranges are

Language ranges do NOT apply to tags. Wildcards are not permitted in tags. Ranges and wildcards apply only to matching.
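As a small illustration of why order matters for matching, here is a sketch of the prefix matching rule from RFC3066: a range matches a tag if it equals the tag, or if it equals a prefix of the tag that is immediately followed by "-"; the range "*" matches any tag. The function name `matches` is hypothetical, and the tags are the examples from this message.

```python
def matches(language_range, tag):
    """RFC3066-style match of a language range against a language tag."""
    if language_range == "*":
        return True                      # the wildcard range matches every tag
    r, t = language_range.lower(), tag.lower()   # comparison ignores case
    # Equal, or a prefix that ends exactly on a subtag boundary.
    return t == r or t.startswith(r + "-")

print(matches("zh-CN", "zh-CN-xiang"))   # True
print(matches("zh-CN", "zh-xiang-CN"))   # False
print(matches("zh-x", "zh-xiang"))       # False
print(matches("*", "zh-xiang"))          # True
```

The second call shows the point about order: the same subtags in a different sequence no longer match the range "zh-CN". The wildcard appears only on the range side, never inside a tag.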

> more complex than what it appears. The Ethnologue provides a good
> classification system, but the classification of regional dialects of a
> same language is quite difficult (sometimes the main ISO639 language code
> is so poor, for example when it already defines very imprecise languages,
> and includes codes for groups of languages)

Opinions differ (I don't have a fully formed one about Ethnologue, since I'm not an expert in the areas that are involved). Note that there are examples of how Ethnologue could be used as an extension in the draft.
> 
> I just wonder if the semantic variant tags should not be managed by a
> separate language properties database, like the Unicode character
> properties, in which standard groups could be inferred and referred to
> if needed

Part of the point of the registration mechanism in this draft is that processors need not (for the first time) have the latest registration table in order to process all of the possible, legal tags. The semantic meaning of a specific tag, especially a more complex one, may not be something that can be readily described. Presumably standardization of the use of tags will obviate the need for a large, complex database.

In other words: the attraction of RFC3066 is that it is simple. We have, in our little way, tried to maintain that simplicity and not create a huge database. In part this is because human language is a complex topic and any attempt to map it to something like a language tag will require a level of abstraction that will not please someone, somewhere.

> I just remember another use of "=" with language codes:
> the "Accept-Language:" header can specify a list of alternative language
> codes, each one with its own qualifiers (a "q=" fractional value between
> 0.0 and 1.0 after a semi-colon). I fear that some browsers may try to
> parse a numeric value found in the variant tag of a language code.

We knew about that one too. We can consider various characters for a future draft.
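To make the concern concrete, here is a rough sketch of a header parser that is not confused by "=" inside a tag. It splits each list item on ";" first and only then looks for a "q" parameter, so any "=" in the tag itself is never inspected. The function name `parse_accept_language` is hypothetical, the sketch does no validation, and it makes no claim about how any real browser or server parses the header.

```python
def parse_accept_language(header):
    """Return (tag, q) pairs from an Accept-Language style list.

    q defaults to 1.0 when no q parameter is given.
    """
    result = []
    for item in header.split(","):                 # alternatives are comma-separated
        parts = [p.strip() for p in item.split(";")]
        tag, q = parts[0], 1.0
        for param in parts[1:]:                    # parameters follow the semicolon
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        if tag:
            result.append((tag, q))
    return result

print(parse_accept_language("fr-FR-x-currency=EUR;q=0.8, en"))
```

A parser that instead searched the whole item for the first "=" would pick up "EUR;q=0.8" as a value, which is the kind of misreading the question anticipates.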
> 
> This is another form of language range (alternates are comma-separated,
> and qualified by a semi-colon separator), not discussed in the RFC
> proposal which just speaks about the "*" wildcard and prefixes...

That is beyond the scope of the draft, just as it was beyond the scope of RFC3066 before it.