Message-Id: <6.2.3.4.2.20050823165114.04e03b70@mail.jefsey.com>
Date: Tue, 23 Aug 2005 18:32:45 +0200
To: "Philippe Verdy" <verdy_p@wanadoo.fr>,
	"Doug Ewell" <dewell@adelphia.net>
From: "JFC (Jefsey) Morfin" <jefsey@jefsey.com>
Subject: Re: Questions re ISO-639-1,2,3
Cc: "Peter Constable" <petercon@microsoft.com>,
	<unicode@unicode.org>, <a12n-collaboration@bisharat.net>
In-Reply-To: <011a01c5a7bd$8f363880$0b01a8c0@rodage.dyndns.org>
References: <F8ACB1B494D9734783AAB114D0CE68FE06DB8462@RED-MSG-52.redmond.corp.microsoft.com>
 <00ab01c5a756$43f23520$0a01a8c0@rodage.dyndns.org>
 <002501c5a794$673a4e80$030aa8c0@DEWELL>
 <00bf01c5a79f$9b096510$0b01a8c0@rodage.dyndns.org>
 <084801c5a7a6$9d919260$030aa8c0@DEWELL>
 <011a01c5a7bd$8f363880$0b01a8c0@rodage.dyndns.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format=flowed
Sender: unicode-bounce@unicode.org
Errors-To: unicode-bounce@unicode.org
Precedence: bulk

Philippe,
The problem with using alpha-3 codes is that they are three letters
long. An IETF Draft, supported by Doug and Peter, proposes a strict
variation of the RFC 3066 ABNF (structured format) in which subtags
are identified partly by their size and partly by their relative
position. I say "variation" because, although it includes some
additions (which result in changes to the RFC 3066 ABNF), it does not
want to be an evolution, which would permit other much needed changes
(IMHO) and support innovation, for reasons I will not discuss here.
The use of alpha-3 codes in that ABNF could at some stage be confused
with other information, all the more so since in Internet protocols
case must not be treated as significant.
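
To make the size/position point concrete, here is a minimal sketch in
Python. The rules are a deliberate simplification assumed for
illustration (they are not the Draft's actual ABNF), and the function
name classify_subtags is hypothetical. It shows how a three-letter
subtag after the first position cannot be told apart by length alone
once case is ignored.

```python
def classify_subtags(tag):
    """Label each subtag using only its length, character class and position.

    Simplified, assumed rules (NOT the Draft's real ABNF):
      first subtag            -> language
      4 letters               -> script
      2 letters or 3 digits   -> region
      3 letters (not first)   -> ambiguous
    """
    subtags = tag.lower().split("-")  # case is not significant
    labels = []
    for position, subtag in enumerate(subtags):
        if position == 0:
            labels.append((subtag, "language"))
        elif len(subtag) == 4 and subtag.isalpha():
            labels.append((subtag, "script"))
        elif (len(subtag) == 2 and subtag.isalpha()) or (
            len(subtag) == 3 and subtag.isdigit()
        ):
            labels.append((subtag, "region"))
        elif len(subtag) == 3 and subtag.isalpha():
            # Could be an alpha-3 country code or some other 3-letter subtag.
            labels.append((subtag, "ambiguous"))
        else:
            labels.append((subtag, "other"))
    return labels


print(classify_subtags("zh-Hant-TW"))
print(classify_subtags("en-USA"))
```

With these assumed rules, "TW" and "840" are recognised as regions by
size alone, while "USA" is not.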

This calls for several considerations:

- this Draft wants to make this format the sole format to be used in 
the IANA registry. This worryingly leaves only two possibilities if 
you are not satisfied with that particular format: to defeat the 
Draft, or to build an open alternative to the IANA registry (I have
also been engaged in supporting the Draft ABNF as one of the
deprecating propositions, and in working on the necessary
distribution and extension of the IANA system).

- the format lacks several important pieces of information, such as
the referent of the language (is it English, Basic English, by which
publisher, using which dictionary, etc.), the context of the exchange
(style, special words, etc.) and the date of the standard of
reference (which may not be the date of the document, which is often
ignored anyway).

- the format is supposed to be multimodal, but only limited script
information is supported (fonts are not documented) and no space is
reserved for voice, sign or icon attributes.

- but most of all, this proposition does not consider the designated
content from the perspective of networked relational exchanges. This
is a very important point when designating a language. Languages were
never made to be identified but to be used. They were made to permit
face-to-face relations. They were extended (in distance, audience and
time) through scripts. Today they are being extended by an evolution
far more complex than the one from voice to script. Scripts
introduced memory and communication. Communication is totally changed
today, as is memory. Scripts are much more complex and have changed.
The introduction of relational services changes the nature of the
exchanges. Languages themselves change in nature as multilingualism
extends the capability of language negotiation and adaptation, from
language to language and therefore within what one understood as a
single language. The number of terms to be used or known is
drastically extended too, and as a result leads to various views (not
versions) of a language.

Languages are brain-to-brain interintelligibility protocols. Wanting
to describe the evolution of language and culture, which tries to
support the increase in exchanges (number, density, complexity), with
designations from the preceding language era (script) is awkward. It
would be like trying to describe the Internet using a postal paradigm
(I use this because it is, to date, unfortunately the main problem of
the end-to-end interoperability layer). Like every protocol,
languages have parameters. These parameters can include the country
codes: the interest of a numeric code of some size is its stability,
its multilingualism and its script independence.

Another problem we face in trying to build information databases
rather than object databases (I suggest you consider the ISO 11179
effort - not the result but the area of concern in TC32) is the
versatility of the content. We still live with the idea that we use
"texts". We actually use "architexts" (what is going to produce the
vision/version of the text we use, and more and more the interaction
of our rendering tools). If you say you do not want to consider
computer languages, as the IETF Draft does, you deprive yourself of
the very HTML, XML, etc. you want to document (it is an architext and
uses computer [ASCII] language - bravo bisharat!). The same architext
may include successive pieces of information related to several
countries, regions, ethnolinguistic zones, etc., and languages. They
will have to be decoded by an OPES (open pluggable edge service)
reader. The IETF Charter adequately quotes the relation with the
locale, but the locale itself is subject to a possibly complex,
versatile and adaptive negotiation and to interrelation with the
other systems the computer is related to.

Trying to manage this information with script/text related concepts,
even by overloading them with a lot of information, would be like
wanting to ride on a highway with a bicycle.

ISO 639-1, -2 and -3 are not appropriate to support this. They are,
however, all we have as long as ISO 639-6 is not available. ISO 3166
is not appropriate either; it is, however, a localisation tool of
interest, being the most used ISO standard. But others such as ISO
3166-2, E.164, X.121, geographical coordinates, etc. are of use. What
the IETF Draft should have provided was an ISO 3166 equivalent
adapted to the Multilingual Internet. This work is still to be done:
it has unfortunately been delayed (I started working on a Draft
addressing the need 13 months ago), but at the same time the
(sometimes hot) debate over the IETF Draft was not a complete waste,
as it gave some good experience.

But we now have to leave the bicycle in peace and look for some
good Ferrari/Renault.

jfc

At 10:32 23/08/2005, Philippe Verdy wrote:
>From: "Doug Ewell" <dewell@adelphia.net>
>>ISO 3166-1 alpha-2 and alpha-3 code elements are almost identical in
>>their stability (or lack thereof).  I can find no instances in the
>>31-year history of ISO 3166 where an alpha-3 code element was changed
>>while the corresponding alpha-2 code was left unchanged.  (If you can
>>find one, please accept my apologies.)
>
>Yes, alpha-3 codes can change for a country, but in fact alpha-3 
>codes have still not been reassigned to different countries, unlike 
>alpha-2 codes. So a change of alpha-3 code just turns the old 
>official code into an alias.
>
>For example ROM changed to ROU, but ROM was not reassigned to another country.
>
>The reassignment of alpha-2 codes to different countries is the 
>main problem for use in locale codes, which require longer stability 
>than dated statistics.
>
>What this means is that the alpha-2 codes need to be dated to be 
>disambiguated.
>
>>The numeric code elements (henceforth "codes"), which are really UN
>>codes rather than ISO codes
>
>That's what I said (UNSD means United Nations' Statistics Division 
>if this was not clear)
>
>>are usually considered more stable, but it
>>depends on what kind of stability you are looking for.  ISO alpha codes
>>change when the name of a country changes (or whenever the country feels
>>like changing it; see Romania).  UN numeric codes change when the
>>territory covered by the code changes.  Normally the latter event is
>>less frequent than the former, but the reverse can also happen; in 1993,
>>the numeric code for Ethiopia changed from 230 to 231 (because of the
>>loss of territory to Eritrea) while the alpha codes remained ET and ETH.
>
>OK, but 230 has *still* not been reassigned (it easily could be, given 
>the much smaller encoding space for numeric codes, which are 
>geographically structured), so it has become an alias for Ethiopia 
>(such an alias would remain valid for references to documents speaking 
>about the country before the split, or composed with localization 
>meta-data; of course documents speaking about the country after the 
>split should use the new code, to avoid the ambiguity with Eritrea, 
>but this would not invalidate the past references; and this would be 
>true for any country code, including the IOC 3-letter country codes, 
>or other standards).
>
>My opinion is that the UNSD wants to keep the possibility of making 
>historical searches in its data, without mixing in the same result 
>list the statistics of unrelated countries or territories. This is 
>however less of a problem for the UN, given that statistics are 
>necessarily dated (this is not the case for many documents needing 
>locale code markup or meta-data).
>
>
>
>
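
The quoted point that a reassignable code needs a date to be
disambiguated can be sketched as a dated lookup. This is only an
illustration in Python under stated assumptions: the table layout and
the function name resolve are hypothetical, the exact day boundaries
are assumed (the thread only gives the year 1993 for the 230 to 231
change), and "XX" is a made-up code standing in for a reassigned
alpha-2 code.

```python
from datetime import date

# Each row: (code, valid_from, valid_until, meaning); None means open-ended.
# Day boundaries are assumptions: the thread only mentions the year 1993.
# "XX" is a hypothetical code used to show a reassignment, not a real one.
ASSIGNMENTS = [
    ("230", None, date(1993, 1, 1), "Ethiopia (before the split)"),
    ("231", date(1993, 1, 1), None, "Ethiopia (after the split)"),
    ("XX", None, date(1990, 1, 1), "hypothetical country A"),
    ("XX", date(2000, 1, 1), None, "hypothetical country B"),
]


def resolve(code, on):
    """Return what `code` designated on the date `on`, or None if unassigned."""
    for row_code, start, end, meaning in ASSIGNMENTS:
        if row_code != code:
            continue
        # Validity interval is half-open: start <= on < end.
        if (start is None or start <= on) and (end is None or on < end):
            return meaning
    return None


print(resolve("230", date(1985, 6, 1)))
print(resolve("XX", date(1985, 1, 1)))
print(resolve("XX", date(2005, 1, 1)))
```

Without the date argument, "XX" alone would match two different
entities; with it, each reference resolves to at most one.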