Message-Id: <6.2.3.4.2.20050823165114.04e03b70@mail.jefsey.com>
Date: Tue, 23 Aug 2005 18:32:45 +0200
From: "JFC (Jefsey) Morfin"
To: "Philippe Verdy", "Doug Ewell"
Cc: "Peter Constable"
Subject: Re: Questions re ISO-639-1,2,3
X-list: unicode

Philippe,

The problem with using alpha-3 codes is that they are three letters long. An IETF Draft, supported by Doug and Peter, proposes a strict variation of the RFC 3066 ABNF (structured format) in which subtags are identified partly by their size and partly by their relative position. I say "variation" because, although it includes some additions (which result in changes to the RFC 3066 ABNF), it does not want to be an evolution that would permit other much-needed changes (IMHO) and support innovation, for reasons I will not discuss here. The use of alpha-3 codes in that ABNF could at some stage be confused with other information, all the more so because in Internet protocols case must not be treated as significant.

This calls for several considerations:

- This Draft wants to make this format the sole format to be used in the IANA registry.
  This worryingly leaves only two possibilities if you are not satisfied with that particular format: to defeat the Draft, or to build an open alternative to the IANA registry. (I was engaged both in supporting the Draft ABNF as one of the deprecating propositions and in working on the necessary distribution and extension of the IANA system.)

- The format lacks several important pieces of information, such as the referent of the language (is it English or Basic English, from which publisher, using which dictionary, etc.), the context of the exchange (style, special words, etc.) and the date of the standard of reference (which may not be the date of the document, which is often ignored anyway).

- The format is supposed to be multimodal, but only limited script information is supported (fonts are not documented), and no space is reserved for voice, sign or icon attributes.

- But most of all, this proposition does not consider the designated content from the perspective of networked relational exchanges.

This is a very important point when designating a language. Languages were never made to be identified but to be used. They were made to permit face-to-face relations. They were extended (in distance, audience and time) through scripts. Today they are being extended by an evolution far more complex than the one from voice to script. Scripts introduced memory and communication. Communication is totally changed today, as is memory. Scripts are much more complex and have changed. The introduction of relational services changes the nature of the exchanges. Languages themselves change in nature as multilingualism extends the capability for language negotiation and adaptation, from language to language and therefore within what one used to understand as a single language. The number of terms to be used or known is drastically extended too, which leads to various views (and not versions) of a language. Languages are brain-to-brain interintelligibility protocols.

To want to describe the language and cultural evolution, which tries to support the increase in exchanges (number, density, complexity), with the designations of the preceding language era (script) is awkward. It would be like trying to describe the Internet using a postal paradigm (I use this comparison because it is, to date, unfortunately the main problem of the end-to-end interoperability layer).

Like every protocol, languages have parameters. These parameters can include country codes; the interest of a numeric code of some size is its stability, its multilingualism and its script independence.

Another problem we face in trying to build information databases rather than object databases (I suggest you consider the ISO 11179 effort: not the result, but the area of concern in TC32) is the versatility of the content. We still live with the idea that we use "texts". We actually use "architexts" (what is going to produce the vision/version of the text we use, and more and more the interaction of our rendering tools). If you say you do not want to consider computer languages, as the IETF Draft does, you deprive yourself of the very HTML, XML, etc. you want to document: it is an architext and uses computer [ASCII] language (bravo bisharat!). The same architext may include successive information related to several countries, regions, ethnolinguistic zones, etc., and languages. It will have to be decoded by an OPES (open pluggable edge service) reader.

The IETF Charter adequately quotes the relation with the locale, but the locale itself is subject to a possibly complex, versatile and adaptive negotiation, and to interrelation with the other systems the computer is related to. Trying to manage this information with script/text-related concepts, even by overloading them with a lot of information, would be like wanting to ride a bicycle on a highway.

ISO 639-1, -2 and -3 are not appropriate to support this.
They are, however, all that we have as long as ISO 639-6 is not available. ISO 3166 is not appropriate either; it is, however, a localisation tool of interest, being the most used ISO standard. But others such as ISO 3166-2, E.164, X.121, geographical coordinates, etc. are of use.

What the IETF Draft should have provided was an ISO 3166 equivalent adapted to the Multilingual Internet. This work is still to be done: it has unfortunately been delayed (I started working on a Draft addressing the need 13 months ago), but at the same time the (sometimes heated) debate over the IETF Draft was not a complete waste, as it gave some good experience. But we now have to leave the bicycle in peace and look for some good Ferrari/Renault.

jfc

At 10:32 23/08/2005, Philippe Verdy wrote:
>From: "Doug Ewell"
>>ISO 3166-1 alpha-2 and alpha-3 code elements are almost identical in
>>their stability (or lack thereof). I can find no instances in the
>>31-year history of ISO 3166 where an alpha-3 code element was changed
>>while the corresponding alpha-2 code was left unchanged. (If you can
>>find one, please accept my apologies.)
>
>Yes, alpha-3 codes can change for a country, but in fact alpha-3
>codes have still not been reassigned to different countries, unlike
>alpha-2 codes. So a change of alpha-3 code just turns the old
>official code into an alias.
>
>For example ROM changed to ROU, but ROM was not reassigned to another country.
>
>The reassignment of alpha-2 codes to different countries is the
>main problem for use in locale codes that require longer stability
>than dated statistics.
>
>What this means is that the alpha-2 codes need to be dated to be
>disambiguated.
>
>>The numeric code elements (henceforth "codes"), which are really UN
>>codes rather than ISO codes
>
>That's what I said (UNSD means United Nations Statistics Division,
>if this was not clear)
>
>>are usually considered more stable, but it
>>depends on what kind of stability you are looking for.
>>ISO alpha codes change when the name of a country changes (or whenever
>>the country feels like changing it; see Romania). UN numeric codes
>>change when the territory covered by the code changes. Normally the
>>latter event is less frequent than the former, but the reverse can also
>>happen; in 1993, the numeric code for Ethiopia changed from 230 to 231
>>(because of the loss of territory to Eritrea) while the alpha codes
>>remained ET and ETH.
>
>OK, but 230 has *still* not been reassigned (it could easily be, given
>the much smaller encoding space for numeric codes, which are
>geographically structured), so it has become an alias for Ethiopia.
>Such an alias would remain valid for references to documents speaking
>about the country before the split, or composed with localization
>meta-data. Of course, documents speaking about the country after the
>split should use the new code, to avoid the ambiguity with Eritrea,
>but this would not invalidate the past references. This would be
>true for any country code, including the IOC 3-letter country codes,
>or other standards.
>
>My opinion is that the UNSD wants to keep the possibility of making
>historical searches in its data, without mixing in the same result
>list the statistics of unrelated countries or territories. This is
>however less of a problem for the UN, given that statistics are
>necessarily dated (this is not the case for many documents needing
>locale code markup or meta-data).
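
To make the distinction in the quoted exchange concrete, here is a minimal Python sketch of the two cases: a retired code that was never reassigned can be resolved as a plain alias, whereas a reassigned code can only be resolved against a reference date. The two alias pairs (ROM/ROU and 230/231) come from the thread; the function name `resolve`, the table names, the code "XX" and its dates are all hypothetical placeholders and not real ISO 3166 or UN data.

```python
from datetime import date
from typing import Optional

# Retired codes that were never reassigned: a plain alias is enough.
# Both pairs are the examples given in the thread. Note that mapping
# 230 to 231 ignores the territorial difference before and after 1993.
RETIRED_ALIASES = {
    "ROM": "ROU",  # Romania, alpha-3
    "230": "231",  # Ethiopia, UN numeric
}

# Reassigned codes need validity periods. "XX", its owners and its dates
# are made up for illustration only.
DATED_ASSIGNMENTS = {
    "XX": [
        (date(1974, 1, 1), date(1990, 1, 1), "country-a"),
        (date(2000, 1, 1), None, "country-b"),  # None = still current
    ],
}


def resolve(code: str, as_of: Optional[date] = None) -> Optional[str]:
    """Return a stable identifier for `code`, or None if it was unassigned."""
    code = code.upper()  # codes are compared case-insensitively
    if code in RETIRED_ALIASES:
        return RETIRED_ALIASES[code]
    periods = DATED_ASSIGNMENTS.get(code)
    if periods is None:
        return code  # not known to be retired or reassigned
    if as_of is None:
        raise ValueError(f"{code!r} was reassigned; a reference date is required")
    for start, end, owner in periods:
        if start <= as_of and (end is None or as_of < end):
            return owner
    return None  # the date falls in a gap between assignments
```

With this layout, `resolve("ROM")` needs no date, while `resolve("XX")` refuses to answer until the caller supplies one, which is the "codes need to be dated" requirement expressed as an interface constraint.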