Language / Locale identifiers

Sun Dec 12 19:28:32 CET 2010

From: ietf-languages-bounces at alvestrand.no [mailto:ietf-languages-bounces at alvestrand.no] On Behalf Of Doug Ewell

>> From the perspective of BCP 47, such tags are appropriately described 
>> as "language tags" with an extension not interpretable in terms of BCP 
>> 47. In other words, a language tag with some extra black-box stuff.
>> But in terms of the extension, such tags are not "language tags" but 
>> rather are "locale identifiers".

> Section 3.7 of 5646 says extension subtags "are reserved for the 
> generation of identifiers that contain a language component and are 
> compatible with applications that understand language tags."  

What I said above is consistent with that.

> A tag like "he-IL-u-ca-hebrew" is indeed a BCP 47-conformant identifier, 
> functionally equivalent to a language tag, 

Not just functionally equivalent to, but indeed itself a veritable BCP 47 language tag, just one with a non-language extension.
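
To illustrate the "black-box" view, here is a minimal Python sketch, not a full BCP 47 parser. It assumes a well-formed, non-grandfathered tag and relies on the fact that in RFC 5646 a single-character subtag after the first position always introduces an extension or private-use sequence. The function name split_at_singleton is hypothetical.

```python
from typing import Tuple


def split_at_singleton(tag: str) -> Tuple[str, str]:
    """Split a tag into (language part, extension/private-use part).

    Assumption: the tag is well-formed and not grandfathered, so the first
    one-character subtag after the primary subtag starts the "extra" part.
    """
    subtags = tag.split("-")
    for i, subtag in enumerate(subtags[1:], start=1):
        if len(subtag) == 1:
            return "-".join(subtags[:i]), "-".join(subtags[i:])
    # No singleton found: the whole tag is the language part.
    return tag, ""


language_part, extra_part = split_at_singleton("he-IL-u-ca-hebrew")
print(language_part)  # he-IL
print(extra_part)     # u-ca-hebrew
```

A processor that knows nothing about extension U can use "he-IL" and carry "u-ca-hebrew" along unchanged.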

>>> For example, in an environment where language tags are used as locale 
>>> identifiers,
>>
>> That should be, "in an environment in which locale tags are used"

> If these were not intended to be valid language tags, there would have 
> been no point in creating RFC 6067.  Plus, as I said, at least a few of the 
> extension keys do represent language-identification data. 

I didn't say that they weren't intended to be valid language tags. I was only saying that in environments in which the extension is understood, these are not considered language tags but rather locale IDs.

>> And this characterization of the meaning
>>
>>> he-IL-u-ca-hebrew
>>> (Hebrew as used in Israel, using the traditional Hebrew calendar)
>>
>> Is also not quite right: "he-IL-u-ca-hebrew" denotes the _locale_ 
>> 'Hebrew-Israel with Hebrew calendar'. One can infer the language 
>> 'Hebrew as used in Israel' from that, but in the context in which the 
>> extension is interpretable this is not a _language tag_ but rather a 
>> _locale identifier_.
>
> I concede this relatively minor distinction.  Calendar information is 
> clearly about locales and not languages.  Nevertheless, the tag itself 
> -- which is obviously intended to identify a locale or locale setting --
> is now syntactically valid as a language tag, in a way that was not true 
> a week ago.

Agreed.

>> A detail regarding the Unicode language and locale identifiers worth 
>> pointing out is that not all valid BCP 47 language tags are valid 
>> Unicode language IDs, and there is a special-case ID that is permitted 
>> that is not a valid BCP 47 language tag. The syntax is:
>>
>> unicode_language_id = "root"
>> / unicode_language_subtag
>>  [sep unicode_script_subtag]
>>  [sep unicode_region_subtag]
>>  *(sep unicode_variant_subtag)
>>
>> where sep is "-" and unicode_X_subtag is any LSTR subtag of type X.
>
> This can't be the most current spec, since it doesn't provide for extension 
> U at all.
>
> The latest version of the spec (1.9)...

This is version 1.9 of that spec. Note that it defines both unicode_language_id and unicode_locale_id. The latter includes the U extension:

unicode_locale_id = unicode_language_id
 [unicode_locale_extensions]

The former does not.
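
The difference between the two productions can be sketched with a pair of regular expressions in Python. This is a shape check only, under stated assumptions: the subtag shapes are borrowed from RFC 5646 (2-3 letter language, 4-letter script, 2-letter or 3-digit region, 5-8 character or digit-led 4-character variant), the extension is reduced to the generic BCP 47 form of "u" followed by 2-8 character subtags, and nothing is looked up in the registry. The names LANGUAGE_ID, LOCALE_ID, is_language_id and is_locale_id are hypothetical.

```python
import re

# unicode_language_id: "root", or language [-script] [-region] *(-variant)
LANGUAGE_ID = (
    r"(?:root"
    r"|[a-z]{2,3}"                                # language subtag
    r"(?:-[a-z]{4})?"                             # optional script subtag
    r"(?:-(?:[a-z]{2}|[0-9]{3}))?"                # optional region subtag
    r"(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*"   # zero or more variants
    r")"
)

# unicode_locale_id: unicode_language_id [unicode_locale_extensions]
# Assumption: the extension is "-u" plus one or more 2-8 character subtags.
LOCALE_ID = LANGUAGE_ID + r"(?:-u(?:-[a-z0-9]{2,8})+)?"


def is_language_id(tag: str) -> bool:
    return re.fullmatch(LANGUAGE_ID, tag, re.IGNORECASE) is not None


def is_locale_id(tag: str) -> bool:
    return re.fullmatch(LOCALE_ID, tag, re.IGNORECASE) is not None


print(is_language_id("he-IL"))              # True
print(is_language_id("he-IL-u-ca-hebrew"))  # False
print(is_locale_id("he-IL-u-ca-hebrew"))    # True
print(is_language_id("root"))               # True
```

Under this sketch "he-IL-u-ca-hebrew" matches only the locale production, while "root" matches the language production even though it is not a valid BCP 47 language tag.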

> I don't recommend that people start tagging their Web pages 
> with time zone or collation-strength information.  

Indeed, no!

Peter