Problems deciding if az- should have multiple registrations...

Fri Apr 11 11:46:33 CEST 2003

Michael Everson wrote:
>>
>> Is there a distinction in orthography between each pair of the following?
> 
> Your answer is "No", then.

Mark's example, though, seems to indicate that the existing regime has 
not made a hard-and-fast distinction either. The orthographic 
distinction as justification for not registering seems untenable. The 
real distinction appears to be whether the code would be worth 
registering as a special case because there is demand for using it as a 
separate identifier.
> 
>> If we can only
>> get az-Cyrl and az-Latn registered, or if the end goal for 3066bis 
>> will not
>> permit both #5 and #6, then we would probably be forced to define our
>> language codes as "based on" RFC 3066, but not identical.
> 
> Or you could alter your software and make it work properly with language 
> codes and locales.

But that's the point, isn't it? It isn't ICU that is being dealt with 
here, but the underlying system that ICU (or my software, for that 
matter) is running on. ICU could be modified, but if it can't 
interoperate then there will be problems.

Locale identifiers are hobbled by a long-standing confusion with language 
tags. Fixing locales requires either parallel changes to language tags 
or divergence.

If you examine the case for divergence (which is a case I've made 
forcefully for the past year or so, so I've spent a lot of time thinking 
about it), you eventually end up with problems related to the fact that 
the language tag is necessarily part of the locale, and it conflicts 
with the portions of the locale ID designed to solve this same problem. 
Long discussions with Mark and others have led me to conclude that the 
simpler and more satisfying approach is to treat language as the locale 
identifier and all the other things as preferences, and to fix the 
language tags themselves (to deal with an obvious glaring omission) 
rather than try to circle around the problem in the locale tags.

You might not reach the same conclusion.

In any event, if language==locale ID, we really should fix the edge 
cases of language tagging. There appears to be no resistance to the one 
case I actually care about today (zh-*), but I find the problems with 
the parallel example of az-* disturbing.

I imagine that there are systems with locales that look like:

az.ISO8859_1@latin
az-AZ.ISO8859_1@latin

These do not differ on any level recognized as linguistic, but the 
data files for these locales are not the same and may in fact 
*be* different in some recognizably linguistic manner.
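
As a rough illustration of the mapping at issue, here is a minimal Python 
sketch that splits POSIX-style locale names of the form 
language[_territory][.codeset][@modifier] and derives a language tag with 
a script subtag. Everything in it is an assumption made for illustration: 
the modifier-to-script table, the function names, and the 
language-script-region subtag order are hypothetical and are not taken 
from RFC 3066 or from any particular system's locale database.

```python
import re

# Hypothetical table: POSIX locale modifiers mapped to ISO 15924 script codes.
MODIFIER_TO_SCRIPT = {"latin": "Latn", "cyrillic": "Cyrl"}

# language[_territory][.codeset][@modifier]; "-" is also accepted before
# the territory because identifiers like "az-AZ.ISO8859_1@latin" occur.
LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:[_-](?P<territory>[A-Za-z]{2}))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


def parse_locale(name):
    """Split a POSIX-style locale name into its four optional parts."""
    match = LOCALE_RE.match(name)
    if match is None:
        raise ValueError("unrecognized locale name: %r" % name)
    return match.groupdict()


def to_language_tag(name):
    """Build a language tag, assuming language-script-region order."""
    parts = parse_locale(name)
    subtags = [parts["language"].lower()]
    script = MODIFIER_TO_SCRIPT.get((parts["modifier"] or "").lower())
    if script:
        subtags.append(script)
    if parts["territory"]:
        subtags.append(parts["territory"].upper())
    # The codeset is dropped: it describes an encoding, not a language.
    return "-".join(subtags)


if __name__ == "__main__":
    for name in ("az.ISO8859_1@latin", "az-AZ.ISO8859_1@latin", "ja", "ja_JP"):
        print(name, "->", to_language_tag(name))
```

Under these assumptions the two Azeri locales come out as distinct tags 
(one with a region subtag, one without) even though both name the same 
language in the same script, which is exactly the kind of legacy 
distinction being argued over.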

Japanese has similar problems. There are many systems that have both 
'ja' and 'ja_JP' locales. These are not linguistically different unless 
you follow Martin's argument that number formats and the like are 
language or orthographic differences.

Nonetheless, if we are all in agreement that a generative RFC3066bis 
should be created, registering these temporary markers seems either a) 
irrelevant [possibly a waste of time if the standards process can go 
fast enough] or b) forward-looking, depending on your viewpoint.

So I guess:

1. *Are* we in agreement that RFC3066bis needs writing?

2. Why not register things that will become sanctioned in #1?

> 
> These codes are to encode language distinctions, and are NOT intended to 
> become the catch-all for locale identification. The point is, locale 
> specification has not been done correctly, and it should not be on the 
> entity-coders to fix it. It should be on the people who have botched 
> their locale identification structures.

Only if the locale specification doesn't rely on the entities. If the 
shared use of ISO639 and ISO3166 as Ur-standards by both locales and 
RFC3066 is just happenstance, then you are correct. It is my belief 
(and I believe Mark's) that the similarity is not actually accidental.

That is, fixing language tags and then defining them as the Ur-standard 
for locale identifiers solves a lot of long-standing problems and hurts 
almost no one.

If Serbian, Uzbek, and Azeri form the complete list of languages that 
require some additional registration, then I think we could register 
these, given some demonstration of need, and move along. Obviously the 
fudge in that sentence is "some". Mark has "some" justification. You 
would like "more". Given that no one is likely to research tiny 
orthographic differences, the justification proposed is that some form 
of unknown-but-real legacy (computer) differentiation is still a 
difference.

The counterargument appears to be "the computer distinction does not 
mark a real human-language distinction". The long list of English codes 
suggests that this argument is actually empty: a country *could* 
legislate something, but none appear to have done so to the extent that 
a separate code need be summarily registered *in advance* of the 
difference appearing. Or am I reading this wrong?

Best regards,

Addison

-- 
Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.

+1 408.962.5487  mailto:aphillips at webmethods.com
-------------------------------------------
Internationalization is an architecture. It is not a feature.

Chair, W3C I18N WG Web Services Task Force
http://www.w3.org/International/ws