Language / Locale identifiers

Sun Dec 12 02:09:08 CET 2010

Peter Constable <petercon at microsoft dot com> wrote:

>> The new extension, identified by the singleton 'u', allows a wide 
>> variety of locale-related data items to be included in language tags, 
>> completely within the framework of BCP 47.
>
> From the perspective of BCP 47, such tags are appropriately described 
> as "language tags" with an extension not interpretable in terms of BCP 
> 47. In other words, a language tag with some extra black-box stuff. 
> But in terms of the extension, such tags are not "language tags" but 
> rather are "locale identifiers".

Section 3.7 of RFC 5646 says extension subtags "are reserved for the 
generation of identifiers that contain a language component and are 
compatible with applications that understand language tags."  A tag like 
"he-IL-u-ca-hebrew" is indeed a BCP 47-conformant identifier, 
functionally equivalent to a language tag, although it is true that the 
extension part cannot be interpreted using BCP 47 and the LSR alone. 
Before RFC 6067 was published, a tag like this would not have been BCP 
47-conformant.
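
To make the split concrete, here is a minimal sketch of how an application might separate the part of such a tag that BCP 47 and the LSR can interpret from the extension part. The function name split_extensions is hypothetical, and the sketch assumes a well-formed tag with no private-use ('x') sequence and no grandfathered tags; it is not a full BCP 47 parser.

```python
def split_extensions(tag):
    """Split a tag into its core subtags and a dict of extension subtags.

    Assumption: the tag is well-formed, has no private-use ('x') sequence,
    and is not a grandfathered tag. This is an illustration only.
    """
    core = []
    extensions = {}
    current = None
    for index, subtag in enumerate(tag.split("-")):
        # A one-character subtag after the first position is a singleton
        # that introduces an extension (for example 'u').
        if index > 0 and len(subtag) == 1:
            current = subtag.lower()
            extensions[current] = []
        elif current is None:
            core.append(subtag)
        else:
            extensions[current].append(subtag)
    return core, extensions


core, extensions = split_extensions("he-IL-u-ca-hebrew")
print(core)        # ['he', 'IL']
print(extensions)  # {'u': ['ca', 'hebrew']}
```

An application that knows nothing about extension 'u' can still use the core subtags ("he", "IL") as an ordinary language tag and treat the rest as opaque.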

>> For example, in an environment where language tags are used as locale 
>> identifiers,
>
> That should be, "in an environment in which locale tags are used"

If these were not intended to be valid language tags, there would have 
been no point in creating RFC 6067.  Plus, as I said, at least a few of 
the extension keys do represent language-identification data. 
Specifying "Traditional Chinese financial numerals" may add the same 
sort of language-tagging value that specifying "Traditional Chinese 
script" adds.

> And this characterization of the meaning
>
>> he-IL-u-ca-hebrew
>> (Hebrew as used in Israel, using the traditional Hebrew calendar)
>
> Is also not quite right: "he-IL-u-ca-hebrew" denotes the _locale_ 
> 'Hebrew-Israel with Hebrew calendar'. One can infer the language 
> 'Hebrew as used in Israel' from that, but in the context in which the 
> extension is interpretable this is not a _language tag_ but rather a 
> _locale identifier_.

I concede this relatively minor distinction.  Calendar information is 
clearly about locales and not languages.  Nevertheless, the tag itself, which 
is obviously intended to identify a locale or locale setting, is now 
syntactically valid as a language tag, in a way that was not true a week 
ago.

> A detail regarding the Unicode language and locale identifiers worth 
> pointing out is that not all valid BCP 47 language tags are valid 
> Unicode language IDs, and there is a special-case ID that is permitted 
> that is not a valid BCP 47 language tag. The syntax is:
>
> ="root"
> / unicode_language_subtag
>  [sep unicode_script_subtag]
>  [sep unicode_region_subtag]
>  *(sep unicode_variant_subtag)
>
> where sep is "-" and unicode_language_X is any LSTR subtag of type X.

This can't be the most current spec, since it doesn't provide for 
extension U at all.
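
As a rough illustration, the quoted syntax can be approximated with a regular expression. This is only a sketch: the subtag shapes (2-3 letters for language, 4 letters for script, 2 letters or 3 digits for region, and the BCP 47 variant shapes) are my assumption taken from BCP 47, the name matches_quoted_grammar is hypothetical, and the check looks at shapes only without consulting the registry, so it cannot tell whether a given subtag is actually registered.

```python
import re

# Subtag shapes are assumptions borrowed from BCP 47, not from the quoted text.
ALPHA = "[A-Za-z]"
ALNUM = "[A-Za-z0-9]"
LANGUAGE = f"{ALPHA}{{2,3}}"
SCRIPT = f"{ALPHA}{{4}}"
REGION = f"(?:{ALPHA}{{2}}|[0-9]{{3}})"
VARIANT = f"(?:{ALNUM}{{5,8}}|[0-9]{ALNUM}{{3}})"

# "root" / language [sep script] [sep region] *(sep variant), with sep = "-"
QUOTED_GRAMMAR = re.compile(
    f"(?:root|{LANGUAGE}(?:-{SCRIPT})?(?:-{REGION})?(?:-{VARIANT})*)"
)


def matches_quoted_grammar(identifier):
    """Return True if the whole identifier fits the quoted syntax (shape only)."""
    return QUOTED_GRAMMAR.fullmatch(identifier) is not None


print(matches_quoted_grammar("root"))               # True
print(matches_quoted_grammar("he-IL"))              # True
print(matches_quoted_grammar("he-IL-u-ca-hebrew"))  # False
```

Under these assumptions the last identifier is rejected because the quoted syntax has no production for a singleton such as 'u'.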

The latest version of the spec (1.9) does show that "root" and "en_US" 
are valid in CLDR while "cmn" and "zh-cmn" are not.  You can also use 
AALAND and SAAHO in some CLDR identifiers.  Certainly the CLDR committee 
can declare any syntax valid or invalid as it sees fit, so there must 
have been some reason for Mark, Addison, and Yoshito Umaoka to go 
through the RFC process in order to align these identifiers with BCP 47.

I don't recommend that people start tagging their Web pages with time 
zone or collation-strength information.  I just want to point out that 
the definition of "BCP 47 language tag" has expanded.

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s