Date: Mon, 09 May 2005 12:05:03 +0200
From: "JFC (Jefsey) Morfin"
To: Martin Duerst
Cc: ltru@ietf.org
Subject: Re: [Ltru] RFC 2277 - considerations

Dear Martin,

What is confusing about the IETF compared with other standardisation organisations is that it deals with networks (which means complexity and change) and does not produce structured, updated documents. So the only way to deal with its deliverables is to consider them as the centre of their own world, with their own logic, and to confront their various expressions, not to dwell on outside considerations; internal RFC archaeology is enough. For three good reasons:
- this network-centric approach has done well for the last 30 years and produced a consistent, even if sometimes complex, system;
- the user approach is user-centric: the user has his machine and his ADSL plug, and looks in the manual expecting the authors to have resolved all the issues, confusions and conflicts;
- this is most probably the secret of scalability.

Let us review this. There is a concept named "charset", with five understandings plus one comment (maybe more?):

1. Yours: "The term 'charset' got to where it is because when the MIME specifications was created, the people creating it were thinking in terms of 7-bit and 8-bit encodings, and in addition just (most probably unconsciously) shortened 'coded character set' to 'character set'. So the term is there mostly by accident."
- an unconscious shorthand for a particular encoding scheme.

2. One which seems to be that of the W3C, Unicode and ISO members of this WG. I suppose Misha will not object to my quoting him: http://czyborra.com/charsets/iso8859.html#ISO-8859-3 . This quote is independent of this debate, is dated 1998 and covers IMHO the whole issue (charset/ISO 10646): "The ISO 8859 charsets are not even remotely as complete as the truly great Unicode but they have been around and usable for quite a while (first registered Internet charsets for use with MIME) and have already offered a major improvement over the plain 7bit US-ASCII. Unicode (ISO 10646) will make this whole chaos of mutually incompatible charsets superfluous because it unifies a superset of all established charsets and is out to cover all the world's languages. But I still haven't seen any software to display all of Unicode on my Unix screen. We're working on it."
- limited lists of characters, an improvement over US-ASCII, but smaller than ISO 10646.

3. RFC 2978 says: "The term 'charset' (referred to as a 'character set' in previous versions of this document) is used here to refer to a method of converting a sequence of octets into a sequence of characters. This conversion may also optionally produce additional control information such as directionality indicators." See http://www.iana.org/assignments/character-sets (last updated on 28 January 2005).
- N. Freed (with Jon Postel) consciously defines it as a general encoding system and implicitly acknowledges the multiplicity of such lists by defining an IANA registry for them (a small octet-level illustration follows this list).

4. RFC 2277, which says: "This document uses the term 'charset' to mean a set of rules for mapping from a sequence of octets to a sequence of characters, such as the combination of a coded character set and a character encoding scheme; this is also what is used as an identifier in MIME 'charset=' parameters, and registered in the IANA charset registry [REG]. (Note that this is NOT a term used by other standards bodies, such as ISO)."
- Harald Alvestrand gives a very clear, comprehensive definition: "a set of rules" which permits building any of them, with examples: your first example (MIME) and the IANA registry.

5. The IANA definition: "These are the official names for character sets that may be used in the Internet and may be referred to in Internet documentation. These names are expressed in ANSI_X3.4-1968 which is commonly called US-ASCII or simply ASCII. The character set most commonly use in the Internet and used especially in protocol standards is US-ASCII, this is strongly encouraged. The use of the name US-ASCII is also encouraged. // The character set names may be up to 40 characters taken from the printable characters of US-ASCII. However, no distinction is made between use of upper and lower case letters."
- followed by a list of some 250 charsets.
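To make understandings 3 and 4 concrete (my own small illustration, nothing normative; Python is only used as a convenient notation), the same octets interpreted under two names taken from the IANA registry yield different characters:

    # A "charset", in the RFC 2978 / RFC 2277 sense, is a set of rules
    # mapping a sequence of octets to a sequence of characters.  The same
    # octets, read under two different registered names, give different text.
    octets = b"caf\xe9"                     # four octets

    print(octets.decode("iso-8859-1"))      # 'café'  -- 0xE9 is 'é' under Latin-1
    print(octets.decode("iso-8859-5"))      # 'cafщ'  -- 0xE9 is Cyrillic 'щ' under Latin/Cyrillic
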
The comment comes from RFC 2130 (a Californian workshop with Harald, Crispin, and others): "The term 'language tag' should be reserved for the short identifier of RFC 1766 [RFC-1766] that only serves to identify the language. While there may be other text attributes intimately associated with the language of the document, such as desired font or text direction, these should be specified with other identifiers rather than overloading the language tag." It then confirms the duality of:
"3.4.2: Default Coded Character Set: The default Coded Character Set is the repertoire of ISO-10646."
"3.4.3: Default Character Encoding Scheme: For text-oriented protocols, new protocols should use UTF-8, and protocols that have a backwards compatibility requirement should use the default of the existing protocol, e.g. US-ASCII for mail, and ISO-8859-1 for HTTP."

Further to this I understand the global picture as follows.

1. There are several partial, complementary understandings of the term "charset". Harald's is the most comprehensive and interesting one. It does not conflict with any of the others and enriches them. I stick to it.

2. There are two approaches:
- one is that of ISO 10646, which is to get a global charset;
- one is that of the IANA registry, which is the specialised charset.

3. There is a problem, which is the consistency between these two approaches, i.e. a flexible partition between them. I think Harald's definition permits addressing this: a global charset results from the application of an encoding rule to every character (ISO 10646), and a specialised charset results from the application of additional rules.

4. New charsets can be registered through an RFC or through a registration procedure.
- I see no reason why ISO 15924, as well as many other standard character sets, could not be registered as UTF-8-15924_CODES/NRs and permit complete support of the language.
- I agree with Harald Alvestrand, M. Crispin, etc. in not overloading the langtag with information which does not really belong there. Actually I do not see the reason why layout information would be included there.

I comment on your other remarks in the text.

On 09:01 08/05/2005, Martin Duerst said:
> >I first considered that W3C was right in having identified its need of "scripts" support, however the idea of making them dependent from languages seemed to be strange.
>
>It wasn't W3C which identified the need for scripts support. Some people that you associate with W3C may have been involved, but that doesn't mean that it was W3C.

Addison documented the problem he wants to solve as an XML problem, and the undiscussed charter only quotes HTML, XML and CLDR. I see no other documented need. I only see objections from other applications, which are answered with "not our concern". I also see pressure from Unicode for reasons I do not understand, since they do not want to document CLDR needs.

>The need for scripts in language tags (or separate) is an old issue.

Yes. I first had to address it in 1982 for French Videotex, Katakana, etc. This does not make it a good point. It was ruled out by RFC 2130.

>I remember a discussion with Michael Everson, who proposed an HTTP header Accept-Script (or some such) years ago on the ietf-languages list.

His privilege. I understand it, since he is the Unicode proponent for ISO 15924. This WG charter proposed to use ISO 15924, but I do not think it is to be used in the langtag (this calls for replacing RFC 2277 and 2130, discussing several other RFCs for compatibility, etc.). This should be discussed on ietf-charsets@iana.org.
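For concreteness, here is a rough, hypothetical sketch (my own, not taken from the draft nor from any published parser) of what the draft's language-Script-REGION structure amounts to, with a 4-letter subtag carrying the ISO 15924 script code:

    # Rough, hypothetical sketch of the subtag structure discussed in the
    # draft: a 2-3 letter ISO 639 language code, an optional 4-letter
    # ISO 15924 script code, an optional 2-letter ISO 3166 region code.
    def split_langtag(tag):
        parts = tag.split("-")
        result = {"language": parts[0], "script": None, "region": None}
        for sub in parts[1:]:
            if len(sub) == 4 and sub.isalpha():
                result["script"] = sub      # e.g. "Latn" (ISO 15924)
            elif len(sub) == 2 and sub.isalpha():
                result["region"] = sub      # e.g. "FR" (ISO 3166)
        return result

    print(split_langtag("fr-Latn-FR"))
    # {'language': 'fr', 'script': 'Latn', 'region': 'FR'}
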
>The discussion at that time came to the conclusion that this (having script orthogonal to language) might be a nice idea in principle, but that in practice, the connection is extremely strong, and in most cases, script info isn't necessary because it can easily be implied.

There is an obvious confusion between charset and script here (at least if you consider that "script" is attached to a language, as you seem to imply; ISO 15924 does, and this WG's debate leads one to think others do too). There may be many different charsets to support a language. In French alone I can quote obvious ones:
- legal
- non-accented upper case
- upper case only
- AFNIC

> >But the more I think of them, the more I have difficulty understanding what the "script" notion, introduced in the Draft, brings in addition to the charsets: it belongs to it.
>
>There is some connection, in that many "charset"s only encode one script, or to be more precise, one script + basic ASCII + some symbols. But there are some important "charset"s (in particular UTF-8 and UTF-16, ...) where this doesn't apply. Also, there are many other encodings that contain multiple scripts (e.g. you can write Greek with iso-2022-jp, and so on).

We may discuss that at length... what matters is that you acknowledge "there is some connection" while language and scripts are said to be "orthogonal"; this can only lead to conflicts...

> >The more I see sources of conflicts if this is not respected.
>
>If you see such conflicts, could you give an actual example?

Latin-1 does not fully support French. There is a conflict between charset: 8859-1 and langtag: fr-Latn-FR (a small check at the end of this exchange makes this concrete).

> >The more I see that the script is one of the rules which shares in the definition of the character set, and the more I fail to see where the W3C has a problem (except maybe in confusing charset with encoding scheme only; however http://www.w3.org/TR/REC-html40/charset.html starts with a clear "character set" part where it specifically quotes Latin and Cyrillic).
> >
> >I come back to a normal process of access to a page/document.
> >
> >1. to be able to read it I need to know the charset. This is the first information. It tells me the rules for mapping from a sequence of bytes to a sequence of characters (character encoding scheme: ex. UTF-8
>
>Yes indeed. Having the correct 'charset' is extremely important.
>
> >and combination of coded characters, ex: ISO 15924). Ex: UTF-8-Latin
>
>No. If you know it's UTF-8, you can just look inside the document, and check what script(s!) are used.

Orthogonal to the discussion. You can do that with any RFC 3066 langtag. This is what is supposed to be corrected by the Draft.

> >2. then when I read I need to understand. I have the language. And possibly the region. As per the existing RFC 3066 scheme, and not calling for a modification of the existing libraries.
> >
> >3. the interest is that this is compatible with IDN tables (and permits addressing the high-level IDN homograph problem, since charsets are documented everywhere). I also note that RFC 2277 and 3066 seem to address the locales need (however CLDR may have some proprietary special needs the authors have not documented?)
> >
> >I therefore tend to think the "script" information is to be located in the charset tag.
>
>The Web, email, and a lot of other things have worked extremely well without script information in charset tags, and I don't see why this would not continue.

True.
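As announced above, a small check (my own example, nothing more) of the Latin-1 / French conflict: text legitimately tagged fr-Latn-FR can contain the oe ligature, which ISO-8859-1 cannot encode (it only appeared in ISO-8859-15):

    # "cœur" is ordinary French, yet the ligature œ is absent from ISO-8859-1.
    word = "c\u0153ur"                      # "cœur"

    try:
        word.encode("iso-8859-1")
    except UnicodeEncodeError as exc:
        print("iso-8859-1 cannot represent it:", exc)

    print(word.encode("iso-8859-15"))       # b'c\xbdur' -- fine under Latin-9
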
This is why the question is: what is the script information? A real novelty, or an additional rule to better tune existing parameters? And which parameter? If a novelty, why not discuss it in the langtag (however opposed by RFC 2130)? If an additional rule refining existing information, better to put it with that information rather than in an orthogonal container.

> >I suppose they are able to understand UTF-8.latin as UTF-8 and that legacy is transparent?
>
>Definitely not. For language tags, quite a few applications understand subtag-based prefixes, as the specs have been defined with subtags in mind from the start. For charsets, they do not. Charsets do not have and never had subtags.

OK. Then it is UTF-8-Latn and 100 charset registrations, either through ietf-charsets@iana.org or through an appendix to RFC 2277 bis.

jfc

_______________________________________________
Ltru mailing list
Ltru@lists.ietf.org
https://www1.ietf.org/mailman/listinfo/ltru