From: "JFC (Jefsey) Morfin"
To: ned.freed@mrochek.com, Martin Duerst
Cc: LTRU Working Group <ltru@lists.ietf.org>
Date: Tue, 10 May 2005 00:53:22 +0200
Subject: Re: [Ltru] RFC 2277 - considerations
Message-Id: <6.2.1.2.2.20050509181241.048ab7f0@mail.jefsey.com>
In-Reply-To: <01LO1QSCZ7S800004T@mauve.mrochek.com>

Dear Ned,
Thank you for confirming that the definition of a charset, as comprising two pieces of information:
- a coded character set, with the ISO 10646 repertoire as the default;
- a character encoding scheme, with UTF-8 as the default;
is a stable and well-accepted matter.

This means that one can build from here.
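
As a rough illustration of that two-part definition (my sketch, not part of the thread): the coded character set assigns an abstract number, the ISO 10646 / Unicode code point, while a character encoding scheme such as UTF-8 maps that number to concrete bytes.

```python
# Sketch: one character seen through the two halves of a "charset".
# Coded character set (CCS): the ISO 10646 / Unicode code point.
# Character encoding scheme (CES): the bytes produced by a scheme such as UTF-8.

ch = "é"

code_point = ord(ch)                 # CCS view: 0xE9 (U+00E9)
utf8_bytes = ch.encode("utf-8")      # CES view under UTF-8: b'\xc3\xa9'
latin1_bytes = ch.encode("latin-1")  # same code point, different CES: b'\xe9'

print(f"U+{code_point:04X}", utf8_bytes, latin1_bytes)
```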

At 16:23 09/05/2005, ned.freed@mrochek.com wrote:
>> > But the more I think of them, the more I have difficulty understanding
>> > what the "script" notion, introduced in the Draft, brings in addition to
>> > the charsets: it belongs to it.
>
>>There is some connection, in that many "charset"s only encode one script,
>>or to be more precise, one script + basic ASCII + some symbols.
>>But there are some important "charset"s (in particular UTF-8 and
>>UTF-16,...) where this doesn't apply.
>>Also, there are many other
>>encodings that contain multiple scripts (e.g. you can write
>>Greek with iso-2022-jp, and so on).
>
>And since the trend is (hopefully) towards using all-inclusive charsets like
>utf-8,

Yes. Using a default all-inclusive charset makes it easy to define specialised charsets. Scripts.txt (if I read it correctly) documents the ISO 15924-based charsets.
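
As a side note (my illustration, not from the thread): Unicode's Scripts.txt assigns a script property to each code-point range, in lines of the form `0041..005A ; Latin`. A minimal sketch of reading it, assuming a local copy of the file:

```python
# Sketch: read Unicode's Scripts.txt and build {script_name: [(start, end), ...]}.
# Assumes a local copy of the file (e.g. from unicode.org/Public/UNIDATA/Scripts.txt).
from collections import defaultdict

def load_scripts(path="Scripts.txt"):
    ranges = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()   # drop comments and blanks
            if not line:
                continue
            codes, script = (part.strip() for part in line.split(";", 1))
            start, _, end = codes.partition("..")  # "0041..005A" or a single "00AA"
            ranges[script].append((int(start, 16), int(end or start, 16)))
    return ranges

def script_of(ranges, ch):
    cp = ord(ch)
    for script, spans in ranges.items():
        if any(lo <= cp <= hi for lo, hi in spans):
            return script
    return "Unknown"

# Example: script_of(load_scripts(), "Ω") -> "Greek"
```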

>the ability to determine script from the charset label alone, which as
>you say never worked all that well, is going to disappear over time. OTOH, as
>things coalesce around Unicode (irrespective of encoding), the need to know
>lots of charsets in order to dig script information out of the actual content
>is going to decrease.

Let us not confuse standardisation with uniformisation. The strength of ISO 10646 is that it permits an unlimited number of charsets without any problem. ISO 15924 gives the names of 102 of them, but there are many more.

>> > I therefore tend to think the "script" information is to be located in
>> > the charset tag.
>
>>The Web, email, and a lot of other things have worked extremely well without
>>script information in charset tags, and I don't see why this would not
>>continue.
>
>Absolutely. Charsets were carefully defined to provide the information
>necessary to display or process a given object. It is very intentionally
>NOT defined to be a label describing the specific content of a particular
>document.

Correct. Languages and contents are orthogonal to layout. We have a good example with the IDN Tables and the anti-phishing lists discussed on the IDN list: they define charsets. The ietf-languages@alvestrand.no list and RFC 3066bis refuse to discuss IDNs, even though the DNS is affected by the Draft.

This is because scripts do not belong to RFC 3066 but to RFC 2277.

>> > I suppose they are able to understand UTF-8.latin as UTF-8 and that
>> > legacy is transparent?
>
>>Definitely not. For language tags, quite a few applications understand
>>subtag-based prefixes, as the specs have been defined with subtags in mind
>>from the start.
>
>And those which do not at least ignore subtags are therefore broken.

We will then have to use registered charsets UTF-8-XXX, which calls for a list of registered charsets. But subtags will be necessary anyway for third-level IDNs, in order to filter out the code points used for phishing; they will be used by some applications.
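
A minimal sketch of the difference being discussed (my own illustration, not from the thread): language tags fall back through subtag prefixes, whereas a charset label is matched as a single opaque token, so a hypothetical "UTF-8-LATN" would simply be an unknown charset to existing software.

```python
# Sketch (my illustration): language tags are matched by subtag prefixes
# (RFC 3066 / RFC 3066bis style fallback), while charset labels are opaque
# strings with no subtag structure at all.

def language_fallback_chain(tag: str):
    """'sr-Latn-CS' -> ['sr-Latn-CS', 'sr-Latn', 'sr']"""
    subtags = tag.split("-")
    return ["-".join(subtags[:i]) for i in range(len(subtags), 0, -1)]

def charset_matches(label_a: str, label_b: str) -> bool:
    """Charset labels compare as whole, case-insensitive tokens: no prefixes."""
    return label_a.lower() == label_b.lower()

print(language_fallback_chain("sr-Latn-CS"))   # ['sr-Latn-CS', 'sr-Latn', 'sr']
print(charset_matches("UTF-8", "UTF-8-LATN"))  # False: 'UTF-8-LATN' is a different label
```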

>>For charsets, they do not. Charsets do not have and
>>never had subtags.
>
>Absolutely. And moreover, the rules for what constitutes a "charset" are
>intentionally pretty narrow, so as to prevent creep of stuff into charset-space
>that properly belongs elsewhere.

Right. This is what is good: it is clean. There is none of the confusion that comes from trying to locate this information in the langtag.

>(Sadly, there was a period during which the rules weren't being properly applied to the charset registration process, so there is some amount of cruft in the registry.)

The Unicode scripts may not fully match ISO 15924; I quote Mark Davis (http://www.unicode.org/reports/tr24/tr24-7.html): "In some cases the match between these script values and the ISO 15924 codes is not precise, because the goals are somewhat different. ISO 15924 is aimed primarily at the bibliographic identification of scripts; consequently it occasionally identifies varieties of scripts that may be useful for book cataloging, but which are not considered distinct scripts in the Unicode Standard. For example, ISO 15924 has separate script codes for the Fraktur and Gaelic varieties of the Latin script. Where there are no corresponding ISO 15924 codes, the private use ones starting with Q are used."

However, Mark continues: "Such values are likely to change in the future. In such a case, the Q-names will be retained as aliases in the [PropValue] for backwards compatibility."

So we may expect that, at the end of the day, Unicode's Scripts.txt will exactly define the ISO 15924-described UTF-8 charsets ("partitions" of the UCS, in Unicode wording).

I therefore have some difficulty understanding why we should discuss non-content-oriented script information within content-oriented langtags (while they could not support the content-oriented referent/style info?), when users are fully able to understand, choose and document the 102 charsets (i.e. a 102x102 multilingual matrix) listed below (a sketch of how such labels could be handled follows the list):

UTF-8-ARAB
UTF-8-ARMN
UTF-8-BALI
UTF-8-BATK
UTF-8-BENG
UTF-8-BLIS
UTF-8-BOPO
UTF-8-BRAH
UTF-8-BRAI
UTF-8-BUGI
UTF-8-BUHD
UTF-8-CANS
UTF-8-CHAM
UTF-8-CHER
UTF-8-CIRT
UTF-8-COPT
UTF-8-CPRT
UTF-8-CYRL
UTF-8-CYRS
UTF-8-DEVA
UTF-8-DSRT
UTF-8-EGYD
UTF-8-EGYH
UTF-8-EGYP
UTF-8-ETHI
UTF-8-GEOK
UTF-8-GEOR
UTF-8-GLAG
UTF-8-GOTH
UTF-8-GREK
UTF-8-GUJR
UTF-8-GURU
UTF-8-HANG
UTF-8-HANI
UTF-8-HANO
UTF-8-HANS
UTF-8-HANT
UTF-8-HEBR
UTF-8-HIRA
UTF-8-HMNG
UTF-8-HRKT
UTF-8-HUNG
UTF-8-INDS
UTF-8-ITAL
UTF-8-JAVA
UTF-8-KALI
UTF-8-KANA
UTF-8-KHAR
UTF-8-KHMR
UTF-8-KNDA
UTF-8-LAOO
UTF-8-LATF
UTF-8-LATG
UTF-8-LATN
UTF-8-LEPC
UTF-8-LIMB
UTF-8-LINA
UTF-8-LINB
UTF-8-MAND
UTF-8-MAYA
UTF-8-MERO
UTF-8-MLYN
UTF-8-MONG
UTF-8-MYMR
UTF-8-NKOO
UTF-8-OGAM
UTF-8-ORKH
UTF-8-ORYA
UTF-8-OSMA
UTF-8-PERM
UTF-8-PHAG
UTF-8-PHNX
UTF-8-PLRD
UTF-8-QAAA
UTF-8-QABX
UTF-8-RORO
UTF-8-RUNR
UTF-8-SARA
UTF-8-SHAW
UTF-8-SINH
UTF-8-SYLO
UTF-8-SYRC
UTF-8-SYRE
UTF-8-SYRJ
UTF-8-SYRN
UTF-8-TAGB
UTF-8-TALE
UTF-8-TALU
UTF-8-TAML
UTF-8-TELU
UTF-8-TENG
UTF-8-TFNG
UTF-8-TGLG
UTF-8-THAA
UTF-8-THAI
UTF-8-TIBT
UTF-8-UGAR
UTF-8-VAII
UTF-8-VISP
UTF-8-XPEO
UTF-8-XSUX
UTF-8-YIII

which raise no problems of defaults, implied values, etc.?
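
To make the proposal concrete (a sketch only; "UTF-8-Xxxx" labels are hypothetical and not IANA-registered, and the ISO 15924 mapping shown is a small illustrative sample), here is how such a label could be split into its encoding scheme and script code and checked against text, reusing the load_scripts()/script_of() helpers from the earlier Scripts.txt sketch:

```python
# Sketch: handle hypothetical "UTF-8-<ISO 15924 code>" charset labels.
# Reuses load_scripts() and script_of() from the earlier Scripts.txt sketch.

# Illustrative mapping of a few ISO 15924 codes to Unicode script property names.
ISO15924_TO_UNICODE = {"LATN": "Latin", "GREK": "Greek", "CYRL": "Cyrillic", "ARAB": "Arabic"}

def parse_label(label: str):
    """'UTF-8-GREK' -> ('utf-8', 'GREK'); plain 'UTF-8' -> ('utf-8', None)."""
    if label.upper().startswith("UTF-8-"):
        return "utf-8", label[6:].upper()
    return label.lower(), None

def conforms(raw: bytes, label: str, ranges) -> bool:
    """True if the bytes decode under the scheme and, when a script code is
    present, every non-ASCII character belongs to that script (ASCII is allowed,
    matching the 'one script + basic ASCII' description above)."""
    scheme, script_code = parse_label(label)
    text = raw.decode(scheme)  # raises UnicodeDecodeError on malformed bytes
    if script_code is None:
        return True
    wanted = ISO15924_TO_UNICODE.get(script_code, script_code.title())
    return all(ord(c) < 128 or script_of(ranges, c) == wanted for c in text)

# Example: conforms("Ωμέγα".encode("utf-8"), "UTF-8-GREK", load_scripts()) -> True
```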

jfc