From: "JFC (Jefsey) Morfin"
To: ned.freed@mrochek.com, Martin Duerst
Cc: LTRU Working Group <ltru@lists.ietf.org>
Date: Tue, 10 May 2005 00:53:22 +0200
Subject: Re: [Ltru] RFC 2277 - considerations
Message-Id: <6.2.1.2.2.20050509181241.048ab7f0@mail.jefsey.com>
In-Reply-To: <01LO1QSCZ7S800004T@mauve.mrochek.com>

Dear Ned,
Thank you for confirming that the definition of a charset, as comprising two pieces of information:
- a coded character set, with the ISO 10646 repertoire as the default;
- a character encoding scheme, with UTF-8 as the default;
is a stable and well-accepted matter.

This means that one can build from here.
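
As a rough illustration of that two-part definition (my sketch, not part of the thread): the coded character set assigns an abstract number, the ISO 10646 / Unicode code point, while a character encoding scheme such as UTF-8 maps that number to concrete bytes.

```python
# Sketch: one character seen through the two halves of a "charset".
# Coded character set (CCS): the ISO 10646 / Unicode code point.
# Character encoding scheme (CES): the bytes produced by a scheme such as UTF-8.

ch = "é"

code_point = ord(ch)                 # CCS view: 0xE9 (U+00E9)
utf8_bytes = ch.encode("utf-8")      # CES view under UTF-8: b'\xc3\xa9'
latin1_bytes = ch.encode("latin-1")  # same code point, different CES: b'\xe9'

print(f"U+{code_point:04X}", utf8_bytes, latin1_bytes)
```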

At 16:23 09/05/2005, ned.freed@mrochek.com wrote:
>> > But the more I think of them, the more I have difficulty understanding
>> > what the "script" notion, introduced in the Draft, brings in addition to
>> > the charsets: it belongs to it.
>
>>There is some connection, in that many "charset"s only encode one script,
>>or to be more precise, one script + basic ASCII + some symbols.
>>But there are some important "charset"s (in particular UTF-8 and
>>UTF-16,...) where this doesn't apply.
>>Also, there are many other
>>encodings that contain multiple scripts (e.g. you can write
>>Greek with iso-2022-jp, and so on).
>
>And since the trend is (hopefully) towards using all-inclusive charsets like
>utf-8,

Yes. Using a default all-inclusive charset makes it easy to define specialised charsets. Scripts.txt (if I read it correctly) documents the ISO 15924-based charsets.
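
As a side note (my illustration, not from the thread): Unicode's Scripts.txt assigns a script property to each code-point range, in lines of the form `0041..005A ; Latin`. A minimal sketch of reading it, assuming a local copy of the file:

```python
# Sketch: read Unicode's Scripts.txt and build {script_name: [(start, end), ...]}.
# Assumes a local copy of the file (e.g. from unicode.org/Public/UNIDATA/Scripts.txt).
from collections import defaultdict

def load_scripts(path="Scripts.txt"):
    ranges = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()   # drop comments and blanks
            if not line:
                continue
            codes, script = (part.strip() for part in line.split(";", 1))
            start, _, end = codes.partition("..")  # "0041..005A" or a single "00AA"
            ranges[script].append((int(start, 16), int(end or start, 16)))
    return ranges

def script_of(ranges, ch):
    cp = ord(ch)
    for script, spans in ranges.items():
        if any(lo <= cp <= hi for lo, hi in spans):
            return script
    return "Unknown"

# Example: script_of(load_scripts(), "Ω") -> "Greek"
```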

>the ability to determine script from the charset label alone, which as
>you say never worked all that well, is going to disappear over time. OTOH, as
>things coalesce around Unicode (irrespective of encoding), the need to know
>lots of charsets in order to dig script information out of the actual content
>is going to decrease.

Let us not confuse standardisation with uniformisation. The strength of ISO 10646 is that it permits an unlimited number of charsets without any problem. ISO 15924 gives the names of 102 of them, but there are many more.

>> > I therefore tend to think the "script" information is to be located in
>> > the charset tag.
>
>>The Web, email, and a lot of other things have worked extremely well without
>>script information in charset tags, and I don't see why this would not
>>continue.
>
>Absolutely. Charsets were carefully defined to provide the information
>necessary to display or process a given object. It is very intentionally
>NOT defined to be a label describing the specific content of a particular
>document.

Correct. Languages and contents are orthogonal to layout. We have a good example with the IDN Tables and the anti-phishing lists discussed on the IDN list: they define charsets. The ietf-languages@alvestrand.no list and RFC 3066bis refuse to discuss IDNs, even though the DNS is affected by the Draft.

This is because scripts do not belong to RFC 3066 but to RFC 2277.

>> > I suppose they are able to understand UTF-8.latin as UTF-8 and that
>> > legacy is transparent?
>
>>Definitely not. For language tags, quite a few applications understand
>>subtag-based prefixes, as the specs have been defined with subtags in mind
>>from the start.
>
>And those which do not at least ignore subtags are therefore broken.

We will then have to use registered charsets UTF-8-XXX, which calls for a list of registered charsets. But subtags will be necessary anyway for third-level IDNs, in order to filter out the code points used for phishing; they will be used by some applications.
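
A minimal sketch of the difference being discussed (my own illustration, not from the thread): language tags fall back through subtag prefixes, whereas a charset label is matched as a single opaque token, so a hypothetical "UTF-8-LATN" would simply be an unknown charset to existing software.

```python
# Sketch (my illustration): language tags are matched by subtag prefixes
# (RFC 3066 / RFC 3066bis style fallback), while charset labels are opaque
# strings with no subtag structure at all.

def language_fallback_chain(tag: str):
    """'sr-Latn-CS' -> ['sr-Latn-CS', 'sr-Latn', 'sr']"""
    subtags = tag.split("-")
    return ["-".join(subtags[:i]) for i in range(len(subtags), 0, -1)]

def charset_matches(label_a: str, label_b: str) -> bool:
    """Charset labels compare as whole, case-insensitive tokens: no prefixes."""
    return label_a.lower() == label_b.lower()

print(language_fallback_chain("sr-Latn-CS"))   # ['sr-Latn-CS', 'sr-Latn', 'sr']
print(charset_matches("UTF-8", "UTF-8-LATN"))  # False: 'UTF-8-LATN' is a different label
```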

>>For charsets, they do not. Charsets do not have and
>>never had subtags.
>
>Absolutely. And moreover, the rules for what constitutes a "charset" are
>intentionally pretty narrow, so as to prevent creep of stuff into charset-space
>that properly belongs elsewhere.

Right. This is what is good: it is clean. There is none of the confusion that comes from trying to locate this information in the langtag.

>(Sadly, there was a period during which the rules weren't being properly applied to the charset registration process, so there is some amount of cruft in the registry.)

The Unicode scripts may not fully match ISO 15924; I quote Mark Davis (http://www.unicode.org/reports/tr24/tr24-7.html): "In some cases the match between these script values and the ISO 15924 codes is not precise, because the goals are somewhat different. ISO 15924 is aimed primarily at the bibliographic identification of scripts; consequently it occasionally identifies varieties of scripts that may be useful for book cataloging, but which are not considered distinct scripts in the Unicode Standard. For example, ISO 15924 has separate script codes for the Fraktur and Gaelic varieties of the Latin script. Where there are no corresponding ISO 15924 codes, the private use ones starting with Q are used."

However, Mark continues: "Such values are likely to change in the future. In such a case, the Q-names will be retained as aliases in the [PropValue] for backwards compatibility."

So we may expect that, at the end of the day, Unicode's Scripts.txt will exactly define the ISO 15924-described UTF-8 charsets ("partitions" of the UCS, in Unicode wording).

I therefore have some difficulty understanding why we should discuss non-content-oriented script information within content-oriented langtags (while they could not support the content-oriented referent/style info?), when users are fully able to understand, choose and document the 102 charsets (i.e. a 102x102 multilingual matrix) listed below (a sketch of how such labels could be handled follows the list):

UTF-8-ARAB
UTF-8-ARMN
UTF-8-BALI
UTF-8-BATK
UTF-8-BENG
UTF-8-BLIS
UTF-8-BOPO
UTF-8-BRAH
UTF-8-BRAI
UTF-8-BUGI
UTF-8-BUHD
UTF-8-CANS
UTF-8-CHAM
UTF-8-CHER
UTF-8-CIRT
UTF-8-COPT
UTF-8-CPRT
UTF-8-CYRL
UTF-8-CYRS
UTF-8-DEVA
UTF-8-DSRT
UTF-8-EGYD
UTF-8-EGYH
UTF-8-EGYP
UTF-8-ETHI
UTF-8-GEOK
UTF-8-GEOR
UTF-8-GLAG
UTF-8-GOTH
UTF-8-GREK
UTF-8-GUJR
UTF-8-GURU
UTF-8-HANG
UTF-8-HANI
UTF-8-HANO
UTF-8-HANS
UTF-8-HANT
UTF-8-HEBR
UTF-8-HIRA
UTF-8-HMNG
UTF-8-HRKT
UTF-8-HUNG
UTF-8-INDS
UTF-8-ITAL
UTF-8-JAVA
UTF-8-KALI
UTF-8-KANA
UTF-8-KHAR
UTF-8-KHMR
UTF-8-KNDA
UTF-8-LAOO
UTF-8-LATF
UTF-8-LATG
UTF-8-LATN
UTF-8-LEPC
UTF-8-LIMB
UTF-8-LINA
UTF-8-LINB
UTF-8-MAND
UTF-8-MAYA
UTF-8-MERO
UTF-8-MLYN
UTF-8-MONG
UTF-8-MYMR
UTF-8-NKOO
UTF-8-OGAM
UTF-8-ORKH
UTF-8-ORYA
UTF-8-OSMA
UTF-8-PERM
UTF-8-PHAG
UTF-8-PHNX
UTF-8-PLRD
UTF-8-QAAA
UTF-8-QABX
UTF-8-RORO
UTF-8-RUNR
UTF-8-SARA
UTF-8-SHAW
UTF-8-SINH
UTF-8-SYLO
UTF-8-SYRC
UTF-8-SYRE
UTF-8-SYRJ
UTF-8-SYRN
UTF-8-TAGB
UTF-8-TALE
UTF-8-TALU
UTF-8-TAML
UTF-8-TELU
UTF-8-TENG
UTF-8-TFNG
UTF-8-TGLG
UTF-8-THAA
UTF-8-THAI
UTF-8-TIBT
UTF-8-UGAR
UTF-8-VAII
UTF-8-VISP
UTF-8-XPEO
UTF-8-XSUX
UTF-8-YIII

which raise no problems of defaults, implied values, etc.?
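
To make the proposal concrete (a sketch only; "UTF-8-Xxxx" labels are hypothetical and not IANA-registered, and the ISO 15924 mapping shown is a small illustrative sample), here is how such a label could be split into its encoding scheme and script code and checked against text, reusing the load_scripts()/script_of() helpers from the earlier Scripts.txt sketch:

```python
# Sketch: handle hypothetical "UTF-8-<ISO 15924 code>" charset labels.
# Reuses load_scripts() and script_of() from the earlier Scripts.txt sketch.

# Illustrative mapping of a few ISO 15924 codes to Unicode script property names.
ISO15924_TO_UNICODE = {"LATN": "Latin", "GREK": "Greek", "CYRL": "Cyrillic", "ARAB": "Arabic"}

def parse_label(label: str):
    """'UTF-8-GREK' -> ('utf-8', 'GREK'); plain 'UTF-8' -> ('utf-8', None)."""
    if label.upper().startswith("UTF-8-"):
        return "utf-8", label[6:].upper()
    return label.lower(), None

def conforms(raw: bytes, label: str, ranges) -> bool:
    """True if the bytes decode under the scheme and, when a script code is
    present, every non-ASCII character belongs to that script (ASCII is allowed,
    matching the 'one script + basic ASCII' description above)."""
    scheme, script_code = parse_label(label)
    text = raw.decode(scheme)  # raises UnicodeDecodeError on malformed bytes
    if script_code is None:
        return True
    wanted = ISO15924_TO_UNICODE.get(script_code, script_code.title())
    return all(ord(c) < 128 or script_of(ranges, c) == wanted for c in text)

# Example: conforms("Ωμέγα".encode("utf-8"), "UTF-8-GREK", load_scripts()) -> True
```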

jfc