Date: Mon, 09 May 2005 12:05:03 +0200
From: "JFC (Jefsey) Morfin"
To: Martin Duerst
Cc: ltru@ietf.org
Subject: Re: [Ltru] RFC 2277 - considerations

Dear Martin,

What is confusing about the IETF compared with other standardisation organisations is that it deals with networks (which means complexity and change) and does not produce structured, updated documents. So the only way to deal with its deliverables is to consider them as the centre of their own world, with their own logic, and to confront their various expressions, not to dwell on outside considerations; internal RFC archaeology is enough. For three good reasons:
- this network-centric approach has done well for the last 30 years and produced a consistent, even if sometimes complex, system;
- the user approach is user-centric: the user has his machine and his ADSL plug, and looks in the manual expecting the authors to have resolved all the issues, confusions and conflicts;
- this is most probably the secret of scalability.

Let us review this. There is a concept named "charset", with five understandings plus one comment (maybe more?):

1. Yours: "The term 'charset' got to where it is because when the MIME specifications was created, the people creating it were thinking in terms of 7-bit and 8-bit encodings, and in addition just (most probably unconsciously) shortened 'coded character set' to 'character set'. So the term is there mostly by accident."
- an unconscious shorthand for a particular encoding scheme.

2. One which seems to be that of the W3C, Unicode and ISO members of this WG. I suppose Misha will not object to my quoting him: http://czyborra.com/charsets/iso8859.html#ISO-8859-3 . This quote is independent of this debate, is dated 1998 and covers IMHO the whole issue (charset/ISO 10646): "The ISO 8859 charsets are not even remotely as complete as the truly great Unicode but they have been around and usable for quite a while (first registered Internet charsets for use with MIME) and have already offered a major improvement over the plain 7bit US-ASCII. Unicode (ISO 10646) will make this whole chaos of mutually incompatible charsets superfluous because it unifies a superset of all established charsets and is out to cover all the world's languages. But I still haven't seen any software to display all of Unicode on my Unix screen. We're working on it."
- limited lists of characters, an improvement over US-ASCII, but smaller than ISO 10646.

3. RFC 2978 says: "The term 'charset' (referred to as a 'character set' in previous versions of this document) is used here to refer to a method of converting a sequence of octets into a sequence of characters. This conversion may also optionally produce additional control information such as directionality indicators." See http://www.iana.org/assignments/character-sets (last updated on 28 January 2005).
- N. Freed (with Jon Postel) consciously defines it as a general encoding system and implicitly acknowledges the multiplicity of such lists by defining an IANA registry for them (a small octet-level illustration follows this list).

4. RFC 2277, which says: "This document uses the term 'charset' to mean a set of rules for mapping from a sequence of octets to a sequence of characters, such as the combination of a coded character set and a character encoding scheme; this is also what is used as an identifier in MIME 'charset=' parameters, and registered in the IANA charset registry [REG]. (Note that this is NOT a term used by other standards bodies, such as ISO)."
- Harald Alvestrand gives a very clear, comprehensive definition: "a set of rules" which permits building any of them, with examples: your first example (MIME) and the IANA registry.

5. The IANA definition: "These are the official names for character sets that may be used in the Internet and may be referred to in Internet documentation. These names are expressed in ANSI_X3.4-1968 which is commonly called US-ASCII or simply ASCII. The character set most commonly use in the Internet and used especially in protocol standards is US-ASCII, this is strongly encouraged. The use of the name US-ASCII is also encouraged. // The character set names may be up to 40 characters taken from the printable characters of US-ASCII. However, no distinction is made between use of upper and lower case letters."
- followed by a list of some 250 charsets.
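To make understandings 3 and 4 concrete (my own small illustration, nothing normative; Python is only used as a convenient notation), the same octets interpreted under two names taken from the IANA registry yield different characters:

    # A "charset", in the RFC 2978 / RFC 2277 sense, is a set of rules
    # mapping a sequence of octets to a sequence of characters.  The same
    # octets, read under two different registered names, give different text.
    octets = b"caf\xe9"                     # four octets

    print(octets.decode("iso-8859-1"))      # 'café'  -- 0xE9 is 'é' under Latin-1
    print(octets.decode("iso-8859-5"))      # 'cafщ'  -- 0xE9 is Cyrillic 'щ' under Latin/Cyrillic
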
The comment comes from RFC 2130 (a Californian workshop with Harald, Crispin, and others): "The term 'language tag' should be reserved for the short identifier of RFC 1766 [RFC-1766] that only serves to identify the language. While there may be other text attributes intimately associated with the language of the document, such as desired font or text direction, these should be specified with other identifiers rather than overloading the language tag." It then confirms the duality of:
"3.4.2: Default Coded Character Set: The default Coded Character Set is the repertoire of ISO-10646."
"3.4.3: Default Character Encoding Scheme: For text-oriented protocols, new protocols should use UTF-8, and protocols that have a backwards compatibility requirement should use the default of the existing protocol, e.g. US-ASCII for mail, and ISO-8859-1 for HTTP."

Further to this I understand the global picture as follows.

1. There are several partial, complementary understandings of the term "charset". Harald's is the most comprehensive and interesting one. It does not conflict with any of the others and enriches them. I stick to it.

2. There are two approaches:
- one is that of ISO 10646, which is to get a global charset;
- one is that of the IANA registry, which is the specialised charset.

3. There is a problem, which is the consistency between these two approaches, i.e. a flexible partition between them. I think Harald's definition permits addressing this: a global charset results from the application of an encoding rule to every character (ISO 10646), and a specialised charset results from the application of additional rules.

4. New charsets can be registered through an RFC or through a registration procedure.
- I see no reason why ISO 15924, as well as many other standard character sets, could not be registered as UTF-8-15924_CODES/NRs and permit complete support of the language.
- I agree with Harald Alvestrand, M. Crispin, etc. in not overloading the langtag with information which does not really belong there. Actually I do not see the reason why layout information would be included there.

I comment on your other remarks in the text.

On 09:01 08/05/2005, Martin Duerst said:
> >I first considered that W3C was right in having identified its need of "scripts" support, however the idea of making them dependent from languages seemed to be strange.
>
>It wasn't W3C which identified the need for scripts support. Some people that you associate with W3C may have been involved, but that doesn't mean that it was W3C.

Addison documented the problem he wants to solve as an XML problem, and the undiscussed charter only quotes HTML, XML and CLDR. I see no other documented need. I only see objections from other applications, which are answered with "not our concern". I also see pressure from Unicode for reasons I do not understand, since they do not want to document CLDR needs.

>The need for scripts in language tags (or separate) is an old issue.

Yes. I first had to address it in 1982 for French Videotex, Katakana, etc. This does not make it a good point. It was ruled out by RFC 2130.

>I remember a discussion with Michael Everson, who proposed an HTTP header Accept-Script (or some such) years ago on the ietf-languages list.

His privilege. I understand it, since he is the Unicode proponent for ISO 15924. This WG charter proposed to use ISO 15924, but I do not think it is to be used in the langtag (this calls for replacing RFC 2277 and 2130, discussing several other RFCs for compatibility, etc.). This should be discussed on ietf-charsets@iana.org.
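For concreteness, here is a rough, hypothetical sketch (my own, not taken from the draft nor from any published parser) of what the draft's language-Script-REGION structure amounts to, with a 4-letter subtag carrying the ISO 15924 script code:

    # Rough, hypothetical sketch of the subtag structure discussed in the
    # draft: a 2-3 letter ISO 639 language code, an optional 4-letter
    # ISO 15924 script code, an optional 2-letter ISO 3166 region code.
    def split_langtag(tag):
        parts = tag.split("-")
        result = {"language": parts[0], "script": None, "region": None}
        for sub in parts[1:]:
            if len(sub) == 4 and sub.isalpha():
                result["script"] = sub      # e.g. "Latn" (ISO 15924)
            elif len(sub) == 2 and sub.isalpha():
                result["region"] = sub      # e.g. "FR" (ISO 3166)
        return result

    print(split_langtag("fr-Latn-FR"))
    # {'language': 'fr', 'script': 'Latn', 'region': 'FR'}
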
>The discussion at that time came to the conclusion that this (having script orthogonal to language) might be a nice idea in principle, but that in practice, the connection is extremely strong, and in most cases, script info isn't necessary because it can easily be implied.

There is an obvious confusion between charset and script here (at least if you consider that "script" is attached to a language, as you seem to imply; ISO 15924 does, and this WG's debate leads one to think others do too). There may be many different charsets to support a language. In French alone I can quote obvious ones:
- legal
- non-accented upper case
- upper case only
- AFNIC

> >But the more I think of them, the more I have difficulty understanding what the "script" notion, introduced in the Draft, brings in addition to the charsets: it belongs to it.
>
>There is some connection, in that many "charset"s only encode one script, or to be more precise, one script + basic ASCII + some symbols. But there are some important "charset"s (in particular UTF-8 and UTF-16, ...) where this doesn't apply. Also, there are many other encodings that contain multiple scripts (e.g. you can write Greek with iso-2022-jp, and so on).

We may discuss that at length... what matters is that you acknowledge "there is some connection" while language and scripts are said to be "orthogonal"; this can only lead to conflicts...

> >The more I see sources of conflicts if this is not respected.
>
>If you see such conflicts, could you give an actual example?

Latin-1 does not fully support French. There is a conflict between charset: 8859-1 and langtag: fr-Latn-FR (a small check at the end of this exchange makes this concrete).

> >The more I see that the script is one of the rules which shares in the definition of the character set, and the more I fail to see where the W3C has a problem (except maybe in confusing charset with encoding scheme only; however http://www.w3.org/TR/REC-html40/charset.html starts with a clear "character set" part where it specifically quotes Latin and Cyrillic).
> >
> >I come back to a normal process of access to a page/document.
> >
> >1. to be able to read it I need to know the charset. This is the first information. It tells me the rules for mapping from a sequence of bytes to a sequence of characters (character encoding scheme: ex. UTF-8
>
>Yes indeed. Having the correct 'charset' is extremely important.
>
> >and combination of coded characters, ex: ISO 15924). Ex: UTF-8-Latin
>
>No. If you know it's UTF-8, you can just look inside the document, and check what script(s!) are used.

Orthogonal to the discussion. You can do that with any RFC 3066 langtag. This is what is supposed to be corrected by the Draft.

> >2. then when I read I need to understand. I have the language. And possibly the region. As per the existing RFC 3066 scheme, and not calling for a modification of the existing libraries.
> >
> >3. the interest is that this is compatible with IDN tables (and permits addressing the high-level IDN homograph problem, since charsets are documented everywhere). I also note that RFC 2277 and 3066 seem to address the locales need (however CLDR may have some proprietary special needs the authors have not documented?)
> >
> >I therefore tend to think the "script" information is to be located in the charset tag.
>
>The Web, email, and a lot of other things have worked extremely well without script information in charset tags, and I don't see why this would not continue.

True.
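As announced above, a small check (my own example, nothing more) of the Latin-1 / French conflict: text legitimately tagged fr-Latn-FR can contain the oe ligature, which ISO-8859-1 cannot encode (it only appeared in ISO-8859-15):

    # "cœur" is ordinary French, yet the ligature œ is absent from ISO-8859-1.
    word = "c\u0153ur"                      # "cœur"

    try:
        word.encode("iso-8859-1")
    except UnicodeEncodeError as exc:
        print("iso-8859-1 cannot represent it:", exc)

    print(word.encode("iso-8859-15"))       # b'c\xbdur' -- fine under Latin-9
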
This is why the question is: what is the script information? A real novelty, or an additional rule to better tune existing parameters? And which parameter? If a novelty, why not discuss it in the langtag (however opposed by RFC 2130)? If an additional rule refining existing information, better to put it with that information rather than in an orthogonal container.

> >I suppose they are able to understand UTF-8.latin as UTF-8 and that legacy is transparent?
>
>Definitely not. For language tags, quite a few applications understand subtag-based prefixes, as the specs have been defined with subtags in mind from the start. For charsets, they do not. Charsets do not have and never had subtags.

OK. Then it is UTF-8-Latn and 100 charset registrations, either through ietf-charsets@iana.org or through an appendix to RFC 2277 bis.

jfc

_______________________________________________
Ltru mailing list
Ltru@lists.ietf.org
https://www1.ietf.org/mailman/listinfo/ltru