Language / Locale identifiers
Doug Ewell
doug at ewellic.org
Sun Dec 12 02:09:08 CET 2010
Peter Constable <petercon at microsoft dot com> wrote:
>> The new extension, identified by the singleton 'u', allows a wide
>> variety of locale-related data items to be included in language tags,
>> completely within the framework of BCP 47.
>
> From the perspective of BCP 47, such tags are appropriately described
> as "language tags" with an extension not interpretable in terms of BCP
> 47. In other words, a language tag with some extra black-box stuff.
> But in terms of the extension, such tags are not "language tags" but
> rather are "locale identifiers".
Section 3.7 of RFC 5646 says extension subtags "are reserved for the
generation of identifiers that contain a language component and are
compatible with applications that understand language tags." A tag like
"he-IL-u-ca-hebrew" is indeed a BCP 47-conformant identifier,
functionally equivalent to a language tag, although it is true that the
extension part cannot be interpreted using BCP 47 and the LSR alone.
Before RFC 6067 was published, a tag like this would not have been BCP
47-conformant.
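
To make the "language part plus black-box extension" structure concrete, here is a minimal sketch in Python that splits a tag at its singletons. The function name split_extensions is hypothetical, and the check is purely syntactic: it assumes a well-formed tag, does not consult the Language Subtag Registry, and gives no special handling to grandfathered tags.

```python
def split_extensions(tag):
    """Split a BCP 47-style tag into (language part, extensions).

    Hypothetical helper: syntax only, no registry lookup.
    """
    base = []          # subtags before the first singleton
    extensions = {}    # singleton -> list of subtags that follow it
    current = None     # singleton whose subtags are being collected
    for i, subtag in enumerate(tag.split("-")):
        if current == "x":
            # Everything after 'x' is private use, even one-character subtags.
            extensions["x"].append(subtag)
        elif len(subtag) == 1 and i > 0:
            current = subtag.lower()
            if current in extensions:
                raise ValueError("duplicate singleton: " + current)
            extensions[current] = []
        elif current is None:
            base.append(subtag)
        else:
            extensions[current].append(subtag)
    return "-".join(base), extensions


print(split_extensions("he-IL-u-ca-hebrew"))
```

A processor that knows only BCP 47 and the LSR can work with the first element of the result ("he-IL") and treat the contents of the 'u' entry as opaque, while a processor that knows RFC 6067 can go on to interpret "ca" and "hebrew".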
>> For example, in an environment where language tags are used as locale
>> identifiers,
>
> That should be, "in an environment in which locale tags are used"
If these were not intended to be valid language tags, there would have
been no point in creating RFC 6067. Plus, as I said, at least a few of
the extension keys do represent language-identification data.
Specifying "Traditional Chinese financial numerals" may add the same
sort of language-tagging value that specifying "Traditional Chinese
script" adds.
> And this characterization of the meaning
>
>> he-IL-u-ca-hebrew
>> (Hebrew as used in Israel, using the traditional Hebrew calendar)
>
> Is also not quite right: "he-IL-u-ca-hebrew" denotes the _locale_
> 'Hebrew-Israel with Hebrew calendar'. One can infer the language
> 'Hebrew as used in Israel' from that, but in the context in which the
> extension is interpretable this is not a _language tag_ but rather a
> _locale identifier_.
I concede this relatively minor distinction. Calendar information is
clearly about locales and not languages. Nevertheless, the tag itself, which
is obviously intended to identify a locale or locale setting, is now
syntactically valid as a language tag, in a way that was not true a week
ago.
> A detail regarding the Unicode language and locale identifiers worth
> pointing out is that not all valid BCP 47 language tags are valid
> Unicode language IDs, and there is a special-case ID that is permitted
> that is not a valid BCP 47 language tag. The syntax is:
>
> unicode_language_id = "root"
> / unicode_language_subtag
> [sep unicode_script_subtag]
> [sep unicode_region_subtag]
> *(sep unicode_variant_subtag)
>
> where sep is "-" and unicode_language_X is any LSTR subtag of type X.
This can't be the most current spec, since it doesn't provide for
extension U at all.
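
As a rough illustration, the quoted syntax can be written as a regular expression. This is only a sketch: the subtag shapes (2-3 letters for language, 4 letters for script, 2 letters or 3 digits for region, and the two variant shapes) are assumed from the registered BCP 47 subtag forms, and the name looks_like_unicode_language_id is hypothetical. It checks shape only, with no lookup in the registry or in CLDR data, so it cannot show why a well-shaped subtag such as "cmn" might still be rejected.

```python
import re

# Shape-only pattern for the quoted syntax (assumed subtag shapes, sep = "-").
_UNICODE_LANGUAGE_ID = re.compile(
    r"""
    root                                       # the special-case identifier
    |
    [a-z]{2,3}                                 # language subtag
    (?:-[a-z]{4})?                             # optional script subtag
    (?:-(?:[a-z]{2}|[0-9]{3}))?                # optional region subtag
    (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*   # zero or more variant subtags
    """,
    re.VERBOSE | re.IGNORECASE,
)


def looks_like_unicode_language_id(text):
    """Return True if text has the shape described by the quoted syntax."""
    return _UNICODE_LANGUAGE_ID.fullmatch(text) is not None


for candidate in ("root", "he-IL", "zh-Hant-TW", "he-IL-u-ca-hebrew"):
    print(candidate, looks_like_unicode_language_id(candidate))
```

Under these assumptions "he-IL-u-ca-hebrew" is rejected, because the quoted syntax has no production that can absorb the "u" singleton and the subtags after it.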
The latest version of the spec (1.9) does show that "root" and "en_US"
are valid in CLDR while "cmn" and "zh-cmn" are not. You can also use
AALAND and SAAHO in some CLDR identifiers. Certainly the CLDR committee
can declare any syntax valid or invalid as it sees fit, so there must
have been some reason for Mark, Addison, and Yoshito Umaoka to go
through the RFC process in order to align these identifiers with BCP 47.
I don't recommend that people start tagging their Web pages with time
zone or collation-strength information. I just want to point out that
the definition of "BCP 47 language tag" has expanded.
--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s