Problems deciding if az- should have multiple registrations...

Fri Apr 11 13:27:59 CEST 2003

Well, I'm probably going to regret replying to this.
> 
> I asked whether "az-Latn" and "az-Latn-AZ" differed in any way. If they 
> do not, then the codes are duplicates.

Okay, but the argument has been presented that 3166 coded values are 
registered to distinguish national or official languages. So:

1. Is 'az' the official or national language of 'AZ'? If yes then 
register 'az-AZ'...

2. Is there a distinction between 'az-latn' and 'az-cyrl'? If yes, then 
register 'az-latn' and 'az-cyrl'.

3. See #1. Repeat as needed.

>>
>> Locale identifiers are hobbled by a long-term confusion with language 
>> tags. Fixing locales requires either parallel changes to language tags 
>> or divergence.
> 
> Language tags are language tags, not locale tags. If the computer 
> industry or some players in it have gunked-up software because 
> programmers made erroneous assumptions about the structure of "locale" 
> with regard to its elements, it is incumbent on the industry or those 
> players to structure their software more accurately with regard to good 
> localization and internationalization practice.

I'm not arguing that point. Only that fixing locales relies at least to 
some extent on fixing language identifiers. I need a way to distinguish 
'zh-Hant' from 'zh-Hans'. The best way to do that would be to get an 
Official Language Tag because it is the language tag 'zh' that is 
gunking things up, if you will.
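
To make the 'zh-Hant' versus 'zh-Hans' point concrete, here is a minimal 
sketch of how software might pull a script out of a tag. It assumes that a 
four-letter subtag is read as a script code and a two-letter subtag as a 
3166 region code; the function name split_tag is hypothetical and is not 
taken from any existing library.

```python
def split_tag(tag):
    """Split a hyphen-separated tag into language, script and region guesses.

    Assumption: a four-letter alphabetic subtag is a script code and a
    two-letter alphabetic subtag is a region code. Other subtags are ignored.
    """
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None, "region": None}
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha() and result["script"] is None:
            # Normalize case so 'latn' and 'LATN' both become 'Latn'.
            result["script"] = part.title()
        elif len(part) == 2 and part.isalpha() and result["region"] is None:
            result["region"] = part.upper()
    return result


traditional = split_tag("zh-Hant")
simplified = split_tag("zh-Hans")
bare = split_tag("zh")
```

With plain 'zh' the script field stays empty, which is exactly the gap: 
nothing in the tag says which script the resources are written in.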

> 
>> If you examine the case for divergence (which is a case I've made 
>> forcefully for the past year or so, so I've spent a lot of time 
>> thinking about it), you eventually end up with problems related to the 
>> fact that the language tag is necessarily part of the locale--and it 
>> conflicts with portions of the locale ID designed to solve this same 
>> problem.
> 
> Language tags are there to tag languages. They are not there to solve 
> everyone's locale problems.

But one thing language tags tag is software resources contained in 
locales. Tagging those resources is a valid use, even by your 
definition. I'm trying to solve locale interoperability problems and my 
point is that introducing a field in the locale structure called 
"script" is solving a problem that language tags really ought to, hence 
this discussion. We can solve the problem separately or together. 
Together looks like a better choice, given that this forum is already 
working on it and changes in language tags will affect locale 
identifiers anyway.

>> I imagine that there are systems with locales that look like:
>>
>> az.ISO8859_1@latin
>> az-AZ.ISO8859_1@latin
> 
> Ghastly.

Exactly so. My problem is how to run my Java program over the top of 
that mess and still get system messages in the right language, script, 
and orthography.
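
A rough sketch of the kind of mapping I have in mind, assuming the usual 
POSIX-style shape language[_TERRITORY][.codeset][@modifier] (with a hyphen 
tolerated in place of the underscore). The modifier-to-script table and the 
name locale_to_tag are hypothetical; a real system would need a much longer 
table and would have to decide what to do with the codeset.

```python
# Hypothetical table: which locale modifiers imply which script subtag.
MODIFIER_TO_SCRIPT = {"latin": "Latn", "cyrillic": "Cyrl"}


def locale_to_tag(name):
    """Turn a POSIX-style locale name into a language-script-region tag."""
    # Peel off the '@modifier' and '.codeset' parts, if present.
    base, _, modifier = name.partition("@")
    base, _, _codeset = base.partition(".")  # codeset is dropped on purpose

    # Accept both 'az_AZ' and 'az-AZ' spellings of language plus territory.
    parts = base.replace("_", "-").split("-")
    language = parts[0].lower()
    region = parts[1].upper() if len(parts) > 1 else None
    script = MODIFIER_TO_SCRIPT.get(modifier.lower())

    subtags = [language]
    if script:
        subtags.append(script)
    if region:
        subtags.append(region)
    return "-".join(subtags)


first = locale_to_tag("az.ISO8859_1@latin")
second = locale_to_tag("az-AZ.ISO8859_1@latin")
```

The two locale names come out as two different tags, one with a region and 
one without, so the question of whether those tags mean the same thing 
lands right back on the language tag.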
> 
>> These are not different on some level recognized as linguistic, but 
>> the data files for these locales are actually not the same and may 
>> actually *be* different in some recognizably linguistic manner.
> 
> May it, indeed?

Sure, why not? We're both speculating here! I didn't go and compare all 
the files either, but I'm pretty sure I know of at least two fields in 
the above speculated locales (yes, I know, I know...) that are different.

> 
>> Japanese has similar problems. There are many systems that have both 
>> 'ja' and 'ja_JP' locales. These are not linguistically different unless 
>> you follow Martin's argument that number formats and the like are 
>> language or orthographic differences.
> 
> 
> 639 and SIL and 3066 specify language tags, Addison, not locales.

No kidding...? Didn't I say that a few messages back (and in fact 
propose language to put into a 3066bis to deal with that)?

I think that is the point I'm making: locales often use one or the other 
form interchangeably, even when there is no reason to. Hence, there is 
likely to be data (software resources in this case, but also content, 
etc.) tagged with the 'ja-JP' form.

In fact, I *know* that there is (plain ol' textual) XML and HTML content 
tagged as ja-JP, because I've seen it. It's pointless to tag it that 
way, but there it is!

Your argument is essentially that text tagged 'az-latn' is different 
than 'az-AZ' somehow. We shouldn't modify the tag 'az-AZ' to identify 
the real differences in language, which may be better conveyed by the 
latin or cyrillic identifiers.

Okay, language tags not only aren't locale tags, but they must never 
touch those icky locale things, even obliquely... but I suspect that 
this distinction is lost on average users trying to get the content they 
want via Accept-Language.
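
To illustrate what those users run into, here is a small sketch of the 
prefix rule RFC 3066 gives for matching a language-range against a tag: 
the range matches if it equals the tag, or equals a prefix of the tag that 
is followed by "-". The function name range_matches is hypothetical, and 
actual servers may well be more lenient than this.

```python
def range_matches(language_range, tag):
    """Return True if a language-range matches a tag under prefix matching."""
    r = language_range.lower()  # tags are compared case-insensitively
    t = tag.lower()
    if r == "*":
        return True  # the wildcard range matches any tag
    # Exact match, or the range is a prefix that ends at a subtag boundary.
    return t == r or t.startswith(r + "-")


# A user asking for 'ja' gets content tagged 'ja-JP' ...
generic_request = range_matches("ja", "ja-JP")
# ... but a user asking for 'ja-JP' does not get content tagged plain 'ja'.
specific_request = range_matches("ja-JP", "ja")
# Likewise a request for 'az-AZ' does not match content tagged 'az-Latn-AZ'.
script_in_middle = range_matches("az-AZ", "az-Latn-AZ")
```

So under strict prefix matching, the choice between 'ja' and 'ja-JP' (or 
where a script subtag sits) changes what the user receives, whether or not 
anyone thinks the tags differ linguistically.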

>> So I guess:
>>
>> 1. *Are* we in agreement that RFC3066bis needs writing?
> 
> In order to permit a greater flexibility in tagging LANGUAGES, yes. In 
> order to extend it to solve the woes of misbegotten 
> locale-identification systems, no.

Okay, that's what I want. But it is worth noting that fixing language 
tags helps those misbegotten systems too. Knock-on benefits are good.

> 
>> Only if the locale specification doesn't rely on the entities. If the 
>> case is that locales and RFC3066's use of ISO639 and ISO3166 as 
>> Ur-standards is just happenstance, then you are correct. It is my 
>> belief (and I believe Mark's) that the similarity is not actually 
>> accidental.
> 
> I think that is, if you will forgive me, sloppy reasoning. The reality 
> is more subtle and complex than that.

How long do you want the email to be?

I don't see this as sloppy. The fact is that 3066 tags (based on 
639/3166) and locale identifiers (often based on 639/3166) are both 
similarly constituted and that originally the 639/3166 portion of most 
kinds of locale identifier was supposed to identify the 'natural 
language' portion of the locale. This is not an accident. It is by design.

I can't speak to the originators of RFC1766. Probably they will come out 
of the woodwork to inform me that they intentionally chose the same pair 
of Ur standards but for different, incompatible reasons. But I'm not 
sure what difference that makes: both systems are trying to identify 
language preferences on some level. It makes sense to compromise on a 
system that satisfies as many as possible. Having locales use a 
different language identifier seems sloppy to me.

> 
>> That is, fixing language tags and then defining them as the 
>> Ur-standard for locale identifiers solves a lot of long-standing 
>> problems and hurts almost no one.
> 
> 
> I do not believe that a language-tag = locale. Many users are 
> multilingual. Many users use languages in places where other languages 
> are spoken in the majority.

Hence efforts like ULocale, one version of which is at 
http://www.inter-locale.com/whitepaper/localeTags.jsp

You'll note that I separate language and region. But I still need a 
language tag for the language material (!!)
> 
>> If Serbian, Uzbek, and Azeri form the complete list of languages that 
>> require some additional registration, then I think we could register 
>> these, given some demonstration of need, and move along. Obviously the 
>> fudge in that sentence is "some". Mark has "some" justification. You 
>> would like "more". Given that no one is likely to research tiny 
>> orthographic differences, the justification proposed is that some form 
>> of unknown-but-real legacy (computer) differentiation is still a 
>> difference.
> 
> I know what a language is and what a locale is, and I'm here to judge 
> the registration of codes for languages.

Lucky you. On both counts.

> 
>> The counter argument appears to be "the computer distinction does not 
>> mark a real human-language distinction". The long list of English 
>> codes suggests that this argument is actually empty: a country *could* 
>> legislate something, but none appear to have done so to the extent 
>> that a separate code need be summarily registered *in advance* of the 
>> difference appearing. Or am I reading this wrong?
> 
> 
> I don't understand your "counter argument".

See 1-3 above. I guess the problem is that you want evidence of a 
"different actual language" in order to create two more codes to fill 
out the pattern of tags, but at the same time:

a) there are examples of not actually different languages that have 
already been registered or are extant.

b) there is no close definition of how separate the language has to be 
before the languages are "different enough" or "actual enough". It 
appears to be a "duck test". So what constitutes a duck?

-- 
Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.

+1 408.962.5487  mailto:aphillips at webmethods.com
-------------------------------------------
Internationalization is an architecture. It is not a feature.

Chair, W3C I18N WG Web Services Task Force
http://www.w3.org/International/ws