Perhaps a draft RFC on Script codes?

Fri, 06 Dec 2002 13:28:00 GMT

See also immediately previous email from John Clews
to ietf-languages@iana.org

------------------------------------------------------------
           START OF PROPOSED RFC ON SCRIPT CODES
------------------------------------------------------------

Preliminary information

John Clews has suggested that due to unresolved problems in ISO in
developing ISO 15924: Codes for representation of names of scripts
(now delayed in ISO at the FDIS stage), and the need for such codes
by the Internet community, the Internet community will be better
served by developing its own RFC instead.

As ISO DIS 15924 has been stable for some time, it is suggested that
any such RFC should draw on work in ISO DIS 15924.

A copy of DIS (not FDIS) 15924 is available at:
http://www.evertype.com/standards/iso15924/document/dis15924.pdf

The following initial draft is suggested as a first attempt at an RFC
for script codes, and uses some of the RFC conventions such as having
text only in English, and the use of US-ASCII as the character set,
but it is heavily dependent on information in ISO DIS 15924.

Perhaps an even simpler solution would be to add the main table and
minimal text from this, in a revision of RFC 3066.

Michael Everson is the editor of the ISO 15924 project and is also the
IANA Language Tag Reviewer. The whole Internet community owes him an
enormous debt of gratitude for his work on language codes and script
codes. If an RFC comes to pass, the Internet community is likely to
benefit from his further comments in this area.

Michael Everson at Evertype would be an obvious IANA Script Tag
Reviewer, if one were to be appointed under a Script Tags RFC, just
as he is already IANA Language Tag Reviewer, in relation to RFC 3066.

This document also owes an enormous debt of gratitude to the work of
Michael Everson, as much of the text is taken from, or is close to,
ISO DIS 15924.

Note: John Clews was formerly Chair of ISO/TC46/SC2 (Conversion of
Written Languages), which was responsible (in ISO committee terms)
for overseeing the development of earlier drafts of ISO 15924.

John Clews
6 December 2002

------------------------------------------------------------

Contents

1 Scope
2 Normative references
3 Definitions
4 Script codes: methodology
5 Script codes: table

Annexes (informative)

Annex A: Typology of scripts
Annex B: Scripts under consideration for future addition to this draft RFC
Annex C: References
Annex D: Relationship between this proposed RFC and ISO DIS 15924

------------------------------------------------------------

Code for the representation of names of scripts

1 Scope

This draft RFC provides a code for the representation of
names of scripts. The codes were devised for use in terminology,
lexicography, bibliography, and linguistics, but they may be used for
any application requiring the expression of scripts in coded form.
This draft RFC also includes guidance on the use of script codes in
some of these applications.

NOTE: In principle, this draft RFC is intended to provide coded
representation for the names of all the scripts of the world. Unique
identification of a script is not always straightforward and obvious;
therefore a number of scripts have been listed in Annex B pending
further study before codes are standardized for them.

------------------------------------------------------------

2 Normative references

ISO 639:1988 Code for the representation of the
names of languages. 

ISO 639-2:1998 Codes for the representation of 
names of languages - Part 2: Alpha-3 codes.

ISO 3166-1:1997 Codes for the representation of 
the names of countries and their subdivisions -
Part 1: Country codes. 

ISO/IEC 10646-1:2000 Information technology -
Universal Multiple-Octet Coded Character Set 
(UCS) - Part 1: Architecture and Basic Multilingual
Plane. 

------------------------------------------------------------

3 Definitions 

For the purpose of this draft RFC the
following definitions apply: 

3.1 
alias 
A script code which is a collection of two or more 
script codes.

3.2 
code 
Data representation in different forms according to 
a pre-established set of rules. (ISO 639-2:1998)

3.3
country code 
A combination of characters used to designate the 
name of a country.

3.4 
font 
A collection of glyph images having the same 
basic design, e.g. Courier Bold Oblique. (ISO/IEC 
9541-1:1991)

3.5 
glyph 
A recognizable abstract graphic symbol which is 
independent of any specific design. (ISO/IEC 
9541-1:1991)

3.6 
language code 
A combination of characters used to represent [the 
name of] a language or languages. (ISO 639-2:1998)

3.7 
script 
A set of graphic characters used for the written 
form of one or more languages. (ISO/IEC 10646-1)

NOTE 1: A script, as opposed to an arbitrary subset of 
characters, is defined in distinction to other scripts; in 
general, readers of one script may be unable to read the 
glyphs of another script easily, even where there is a 
historic relation between them (see 3.9). 

NOTE 2: In certain cases, this draft RFC provides codes
which are not subsumed under this definition. 
Examples: the codes for aliases and the variant codes. 

3.8 
script code 
A combination of characters used to represent the 
name of a script.

3.9 
script variant 
A particular form of one script that is so distinctive
that it can almost be considered to be a
distinct script in itself.

------------------------------------------------------------

4 Script codes: methodology

4.1 Structure of the alphabetic script codes

The alphabetic script codes are created from the 
original script name in the language commonly 
used for it, transliterated or transcribed into Latin 
letters. If a country, where the script concerned 
has the status of a national script, requests a 
certain script code, preference is given to this 
code whenever possible. The four-letter codes 
shall be written with an initial capital Latin letter 
and final small Latin letters (taken from the range 
Aaaa to Zzzz). This serves to help differentiate 
script codes from language codes and country 
codes: so, for example, Mong mon MON or Mong 
mn MN would refer to a book in the Mongolian 
script, in the Mongolian language, originating in 
Mongolia. 
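
The syntactic rule above (one capital Latin letter followed by three
small Latin letters) can be checked mechanically. The following Python
sketch is illustrative only: the function names are hypothetical and
are not part of this draft RFC. It also shows how letter case alone
can separate the three code types, on the assumption that language
codes are written in small letters and country codes in capitals, as
in the Mongolian example.

```python
import re

# One capital Latin letter followed by three small Latin letters (Aaaa to Zzzz).
SCRIPT_CODE = re.compile(r"[A-Z][a-z]{3}")


def is_script_code_syntax(code: str) -> bool:
    """Return True if `code` has the four-letter shape described in 4.1.

    This checks syntax only; it does not check that the code is assigned.
    """
    return SCRIPT_CODE.fullmatch(code) is not None


def classify_by_case(code: str) -> str:
    """Guess the code type from letter case, following the example in 4.1.

    Assumption (not stated normatively in the draft): language codes are
    2 or 3 small letters and country codes are 2 or 3 capital letters.
    """
    if is_script_code_syntax(code):
        return "script"
    if re.fullmatch(r"[a-z]{2,3}", code):
        return "language"
    if re.fullmatch(r"[A-Z]{2,3}", code):
        return "country"
    return "unknown"
```

Applied to the example, [classify_by_case(c) for c in "Mong mn MN".split()]
yields "script", "language" and "country" in that order.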

[[NOTE: See 4.7 regarding changes to the codes.]]

[[4.2 Typology of scripts

Information on the types of scripts that may be encountered can be
found in Annex A, and in the references in Annex C.]]

4.3 Relation of the script codes to ISO standards

The four-letter codes are derived from ISO 639-2
where the name of a script and the name of a 
language using the script are identical
(example: Gujarati - ISO 639-2: guj; this draft RFC: Gujr).

In cases where there is no identity, the script name may
have a unique form (examples:
Korean  kor, Hangul Hang;
Punjabi pan, Gurmukhi Guru;
Dhivehi div, Thaana Thaa).

Where possible, the first three letters of the
four-letter code correspond to the three-letter
code. Preference is given to the Bibliographical 
codes given in ISO 639-2 in deriving the codes 
specified in this draft RFC.
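
The correspondence described in 4.3 can be tested with a simple
case-insensitive prefix comparison. This is a minimal sketch; the
function name is hypothetical, and the check says nothing about
whether either code is actually assigned.

```python
def shares_language_prefix(script_code: str, language_code: str) -> bool:
    """True if the first three letters of the script code, compared without
    regard to case, equal the three-letter ISO 639-2 language code."""
    if len(language_code) != 3:
        return False
    return script_code[:3].lower() == language_code.lower()
```

For the pairs given above, Gujr and guj correspond, whereas Hang and
kor, Guru and pan, and Thaa and div do not, because in those cases the
script name differs from the language name.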

4.4 Adaptation of the script codes 

[[If adapting this draft RFC to other
scripts (for example, Cyrillic or Greek), codes shall 
be formed according to the principles of this 
RFC]].

4.5 Addition of new script codes 

[[For the purpose of allocating additional script
codes, a Script Tag Reviewer shall be appointed]]

4.6 Application of script codes 
Script codes can be used in the following 
particular instances. 

4.6.1 To indicate generally the scripts in which 
documents are or have been written or recorded. 

Example: 
<META HTTP-EQUIV="Content-Language"
CONTENT="ga, ru">
<META NAME="Content-Script"
CONTENT="Latg, Cyrl">
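
A consumer of such a document could recover the declared script codes
as sketched below, using the html.parser module from the Python
standard library. The class and function names are hypothetical, and
"Content-Script" is simply the META name used in the example above,
not an established HTML convention.

```python
from html.parser import HTMLParser


class ScriptMetaParser(HTMLParser):
    """Collects script codes from <META NAME="Content-Script" ...> tags."""

    def __init__(self):
        super().__init__()
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "content-script":
            return
        content = attributes.get("content") or ""
        # The value is a comma-separated list, as in the example.
        self.scripts.extend(
            part.strip() for part in content.split(",") if part.strip()
        )


def scripts_declared(html: str) -> list:
    """Return the script codes declared in the given HTML text."""
    parser = ScriptMetaParser()
    parser.feed(html)
    parser.close()
    return parser.scripts
```

Fed the two META elements of the example, scripts_declared returns the
codes Latg and Cyrl; the Content-Language element is ignored.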

4.6.2 To indicate the script specified in document 
holding records (order records, bibliographic 
records, etc.). 

Examples: 
In bibliographies:

Ryte pomerantsen: Jewish folk humor. New York:
Schocken Books, 1965. xxvi, 203 p.; 20 cm. In 
Yiddish (Latn) and English. 

Kroatisch-Deutsch und Deutsch-Kroatisch: mit einem 
Anhang der wichtigeren Neubildungen des 
Kroatischen und Deutschen. - Berlin: Axel Juncker, 
1941. vi, 302, 314, 32 p.; 15 cm. In Croatian (Latn) 
and German (Latf) 

4.6.3 To indicate the script used by an application.

Example: 

"Laser Syriac: The fonts supplied in this package are
coded according to collection 85 of Annex A of ISO/IEC 
10646 and provide a complete set of glyphs in all three 
of the styles used to write Syriac (Syre, Syrn, Syrj)."

[[4.7 Changes of script codes
In order to preserve the integrity of data coded 
using the codes set forth in this draft RFC, it is intended
that the four-letter and
numeric codes specified herein shall not be 
changed unless there be extraordinarily 
compelling reasons to do so.]]

------------------------------------------------------------
5 Script codes: table

Alphabetical table of four-letter script codes

[lists in other orders, e.g. by script name, can be generated from this]

------------------------------------------------------------
Code  Name
------------------------------------------------------------

Arab  Arabic 
Aram  Aramaic
Armn  Armenian
Aves  Avestan 
Batk  Batak 
Beng  Bengali
Blis  Blissymbols
Bopo  Bopomofo 
Brah  Brahmi
Brai  Braille
Bugi  Buginese
Buhd  Buhid 
Cans  Unified Canadian Aboriginal Syllabics
Cham  Cham
Cher  Cherokee
Cirt  Cirth
Cpmn  Cypro-Minoan
Cprt  Cypriot syllabary
Cyrl  Cyrillic
Cyrs  Cyrillic (Old Church Slavonic variant)
Deva  Devanagari
Dsrt  Deseret (Mormon)
Egyd  Egyptian demotic
Egyh  Egyptian hieratic
Egyp  Egyptian hieroglyphs
Ethi  Ethiopic (Ge'ez)
Geoa  Georgian (Asomtavruli)
Geon  Georgian (Nuskhuri)
Geor  Georgian (Mkhedruli)
Glag  Glagolitic
Goth  Gothic
Grek  Greek
Gujr  Gujarati
Guru  Gurmukhi
Hang  Hangul
Hani  Han ideographs
Hano  Hanunoo
Hebr  Hebrew
Hira  Hiragana
Hmng  Pahawh Hmong
Hrkt  (alias for Hiragana + Katakana)
Hung  Old Hungarian
Inds  Indus (Harappan)
Ital  Old Italic (Etruscan, Oscan, etc.)
Java  Javanese 
Jpan  (alias for Han + Hiragana + Katakana)
Kali  Kayah Li
Kana  Katakana
Khar  Kharosthi
Khmr  Khmer
Knda  Kannada
Laoo  Lao
Latf  Latin (Fraktur variant)
Latg  Latin (Gaelic variant)
Latn  Latin
Lepc  Lepcha (Rong)
Lina  Linear A
Linb  Linear B
Mand  Mandaean
Maya  Mayan hieroglyphs
Mero  Meroitic
Mlym  Malayalam
Mong  Mongolian
Mymr  Myanmar (Burmese)
Ogam  Ogham 
Orkh  Orkhon
Orya  Oriya
Osma  Osmanya
Palv  Pahlavi
Perm  Old Permic
Phnx  Phoenician
Phst  Phaistos
Plrd  Pollard Phonetic

Qaaa- Reserved for private use
Qtzz

Roro  Rongorongo 
Runr  Runic
Shaw  Shavian (Shaw)
Sinh  Sinhala
Syrc  Syriac
Syrj  Syriac (Western variant)
Syrn  Syriac (Eastern variant)
Syre  Syriac (Estrangelo variant)
Tagb  Tagbanwa
Taml  Tamil
Telu  Telugu
Teng  Tengwar
Tfng  Tifinagh (Berber)
Tglg  Tagalog
Thaa  Thaana
Thai  Thai
Tibt  Tibetan
Vaii  Vai
Visp  Visible Speech
Xpeo  Cuneiform, Old Persian
Xsux  Cuneiform, Sumero-Akkadian
Xuga  Cuneiform, Ugaritic
Yiii  Yi
Zxxx  Code for unwritten languages
Zyyy  Code for undetermined script
Zzzz  Code for uncoded script
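
Three kinds of entry in the table above are not ordinary script codes:
the aliases (Hrkt, Jpan), the private-use range (Qaaa to Qtzz) and the
special codes (Zxxx, Zyyy, Zzzz). The following Python sketch shows one
way an application might treat them. All names are hypothetical, and
the range is assumed to be inclusive at both ends.

```python
import re

# Alias codes and their component scripts, as listed in the table.
ALIASES = {
    "Hrkt": ("Hira", "Kana"),
    "Jpan": ("Hani", "Hira", "Kana"),
}

# Special codes at the end of the table.
SPECIAL = {
    "Zxxx": "unwritten languages",
    "Zyyy": "undetermined script",
    "Zzzz": "uncoded script",
}

# Qaaa to Qtzz: "Q", then a letter from a to t, then any two small letters.
PRIVATE_USE = re.compile(r"Q[a-t][a-z]{2}")


def is_private_use(code: str) -> bool:
    """True if `code` lies in the range reserved for private use."""
    return PRIVATE_USE.fullmatch(code) is not None


def expand(code: str) -> tuple:
    """Return the component scripts of an alias, or the code itself."""
    return ALIASES.get(code, (code,))
```

For example, expand("Jpan") gives Hani, Hira and Kana, while
expand("Latn") gives Latn alone; is_private_use("Qabc") is true and
is_private_use("Quaa") is false.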

------------------------------------------------------------
Annex A: Typology of scripts (informative)
------------------------------------------------------------

The following information is included to indicate types of scripts
that may be encountered.

Any numeric codes here are informative only, and should NOT be used
as code elements in Internet use.

The INFORMATIVE numeric script codes below have been assigned to
provide some measure of mnemonicity to the codes used. The following
ranges have been used:

000-099 Hieroglyphic and cuneiform scripts
100-199 Right-to-left alphabetic scripts
200-299 Left-to-right alphabetic scripts
300-399 Brahmi-derived scripts
400-499 Syllabic scripts
500-599 Ideographic scripts
600-699 Undeciphered scripts
700-799 (unassigned)
800-899 (unassigned)
900-999 Private use, aliases, special codes
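
Because each range spans exactly one hundred values, the category of a
numeric code follows from its hundreds digit. The sketch below is
informative only, like the numeric codes themselves, and the function
name is hypothetical.

```python
# Categories indexed by the hundreds digit of the numeric code.
CATEGORIES = (
    "Hieroglyphic and cuneiform scripts",       # 000-099
    "Right-to-left alphabetic scripts",         # 100-199
    "Left-to-right alphabetic scripts",         # 200-299
    "Brahmi-derived scripts",                   # 300-399
    "Syllabic scripts",                         # 400-499
    "Ideographic scripts",                      # 500-599
    "Undeciphered scripts",                     # 600-699
    "(unassigned)",                             # 700-799
    "(unassigned)",                             # 800-899
    "Private use, aliases, special codes",      # 900-999
)


def category_for(number: int) -> str:
    """Return the range description for a numeric code from 0 to 999."""
    if not 0 <= number <= 999:
        raise ValueError("numeric script codes run from 000 to 999")
    return CATEGORIES[number // 100]
```

A code such as 215 therefore falls among the left-to-right alphabetic
scripts.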

NOTE 1: ISO/IEC 10646 uses the character-glyph 
model (defined in ISO/IEC TR 15285:1998) to classify 
the characters used to write different languages. This draft RFC
does not attempt to apply the character-glyph
model, because it is sometimes important to identify 
certain script variants regardless of the encoding a 
given text may employ. For example, a Syriac book may 
be written in one of the three variants of the Syriac script 
(Estrangelo, Eastern, Western). Identification of such
script variants, while outside the scope of ISO/IEC 
10646, is relevant to the content of script codes. For 
example, a user ordering a book through interlibrary 
loan may prefer, or may wish to exclude, the Gaelic 
variant of the Latin script for reasons of ease of legibility 
or familiarity with one of the variants. 

NOTE 2: The classifications here reflect the chief 
attribute of the scripts so classified, and are not 
necessarily comprehensive of the ways in which the 
scripts are used. For example, while Ogham may be 
written from left to right, it is also written vertically from 
bottom to top. Similarly, the Hangul (Hangŭl, Hangeul) 
alphabet is sometimes written in vertical columns, and 
the letters of its alphabet are arranged in syllabic 
clusters. 

NOTE 3: Within each category numeric identifiers 
assigned to scripts have followed a principle of 
chronology, and genetic relationship, though this 
principle cannot be established by any hard and fast 
rule, since scripts may have many different
characteristics. Codes have been assigned by spacing them 
out so that scripts encoded in future may be assigned to 
appropriate places in the range. 

------------------------------------------------------------
Annex B: Scripts under consideration for future addition to this draft RFC
(informative)
------------------------------------------------------------

[[In order to make sure that ISO 15924 is useful for
people needing to provide codes for all the scripts 
of the world, scripts included in ISO 15924 were 
selected with input and feedback from national 
standards organizations and/or qualified experts.]]

[[Some scripts were not included in this edition
because sufficient information was not available 
during the preparation and review stages. ]]

It is intended that codes for the names of these 
scripts will be allocated when available information 
is judged to be sufficient. Such scripts include: 

Ahom 
Aiha (Kesh) 
Alvan 
Aymara pictograms 
Aztec pictograms 
Balinese 
Balti
Bamum (Cameroon)
[etc].

------------------------------------------------------------
Annex C: References
------------------------------------------------------------

Auge, Claude, ed. 1898-1907. Nouveau Larousse illustre.
7 volumes et 1 volume de supplement. Paris: Larousse.

Barry, Randall K., ed. 1997. ALA-LC romanization tables:
transliteration schemes for non-Roman scripts. Washington, DC:
Library of Congress Cataloging Distribution Service. ISBN
0-8444-0940-5

Calvet, Louis-Jean. 1998. Histoire de l'ecriture. Paris: Hachette
Litteratures. ISBN 2-01-278943-9

Coyaud, Maurice. 1987. Les langues dans le monde chinois. Tome 2,
Pour l'analyse du folklore. Paris: [s.n.].

Daniels, Peter T., and William Bright, eds. 1996. The world's writing
systems. New York; Oxford: Oxford University Press. ISBN
0-19-507993-0

Diringer, David. 1996. The alphabet: a key to the history of mankind.
New Delhi: Munshiram Manoharlal. ISBN 82-215-0780-0

Encyclopaedia universalis. 1994. 28 volumes. Paris: Encyclopaedia
universalis France. ISBN 2-85229-240-4

Faulmann, Carl. 1990 (1880). Das Buch der Schrift. Frankfurt am Main:
Eichborn. ISBN 3-8218-1720-8

Fevrier, James. 1995. Histoire de l'ecriture. Paris: Payot. ISBN
2-228-88976-8.

Gaur, Albertine. 1992. A history of writing. Revised ed. London: The
British Library. ISBN 0-7123-0270-0

Gelb, I. J. 1952. A study of writing: the foundations of
grammatology. Chicago: University of Chicago Press.

Haarmann, Harald. 1990. Universalgeschichte der Schrift.
Frankfurt/Main; New York: Campus. ISBN 3-593-34346-0

Imprimerie Nationale. 1990. Les caracteres de l'Imprimerie Nationale.
Paris: Imprimerie Nationale Editions. ISBN 2-11-081085-8

ISO/IEC Technical Report 15285:1998. An operational model for
characters and glyphs.

Jensen, Hans. 1969. Die Schrift in Vergangenheit und Gegenwart.
3., neubearbeitete und erweiterte Auflage.
Berlin: VEB Deutscher Verlag der Wissenschaften.

Malherbe, Michel. 1995. Les langues du monde. Paris: Robert Laffont.
ISBN 2-221-05947-6

Nakanishi, Akira. 1990. Writing systems of the world: alphabets,
syllabaries, pictograms. Rutland, VT: Charles E. Tuttle. ISBN
0-8048-1654-9

Robinson, Andrew. 1995. The story of writing. London: Thames and
Hudson. ISBN 0-500-01665-8

Tamiseur, Jean-Christophe, ed. 1998. Dictionnaire des peuples.
Paris: Larousse. ISBN 2-03-720240-7.

Unicode Consortium. 2000. The Unicode standard, Version 3.0. Reading,
MA: Addison-Wesley. ISBN 0-201-61633-5

------------------------------------------------------------
Annex D: Relationship between this proposed RFC and ISO DIS 15924
------------------------------------------------------------

This proposed RFC has aligned its text and tables to ISO DIS 15924.
ISO preparatory matter, and matters relating to ISO procedures
(especially Annex A of ISO DIS 15924) are not aligned.

RFCs are usually written in English, and in US-ASCII text.
Tables in this proposed RFC do not include French language names, and
names of scripts use US-ASCII encoding.

The words "this draft RFC" or similar appear in comparable places where
ISO DIS 15924 text might read "this standard" or similar.

Numeric codes would give each script a second identifier alongside
its alphabetic code, so that the representation would no longer be
unique; they are therefore not included in the table of codes.

Other reference to numeric codes (in section 4.2) has been moved to
an informative part of this proposed RFC. It may be useful to
consider whether, for the purposes of this draft RFC, this
arrangement suffices, or whether the numeric information should be
removed there too, to ensure absolute uniqueness.

Information in Section 4.2 of ISO DIS 15924 is cross referenced in
Section 4.2 of this proposed RFC, and similar information appears in
Annex A of this proposed RFC.

This proposed RFC has its text in sections 1-4 and its main table of
codes in section 5 (whereas ISO DIS 15924 has this table as part of
section 4).

Annex A of ISO DIS 15924 largely deals with ISO procedures, and
similar information is not included in this proposed RFC.

Annexes B and C of this proposed RFC are similar to Annexes B and C
of ISO DIS 15924.

This proposed RFC also includes this Annex D, which indicates
similarities to ISO DIS 15924.

[[ ]] Small sections marked with this device are kept at this stage to
allow comparison between this proposed RFC and ISO DIS 15924, and to
maintain continuity of paragraph numbering etc. Some of these may
benefit from removal at an early stage, as they are more relevant
to ISO development than to RFC development.

The device indicates either that the section is not particularly
relevant to the development of this proposed RFC, or that the text
of that section needs to be updated in ISO 15924, if ISO 15924 is
ever pursued beyond its current stage of progress within ISO.

This proposed RFC has no equivalents of the following tables in
ISO DIS 15924:

Table 2 Numeric list of script codes
Table 3 Alphabetical list of English script names
Table 4 Alphabetical list of French script names

The reasons for this are that
(a) an alphabetical list of English script names can be autogenerated
    from the main table;
(b) given the need for a single unique representation of a
    script name in Internet services, a numeric identifier
    might not be used;
(c) French names are not used in this version of this proposed RFC.

Note: those tables were mistakenly numbered in the contents of an
earlier draft of ISO DIS 15924 as tables 3-5.
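
Point (a) can be illustrated with a short Python sketch that reads
rows in the "Code  Name" layout of section 5 and re-sorts them by
script name. The function names are hypothetical. Rows that do not
begin with a well-formed four-letter code followed by a name, such as
the two-line private-use range entry, are skipped rather than
interpreted.

```python
import re

# Shape of a four-letter script code: one capital, three small letters.
CODE = re.compile(r"[A-Z][a-z]{3}")


def parse_table(text: str) -> list:
    """Extract (code, name) pairs from lines of the form 'Code  Name'."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # blank lines, rules, or a code with no name
        code, name = parts[0], parts[1].strip()
        if (code, name) == ("Code", "Name"):
            continue  # the column header has the same shape as a code
        if CODE.fullmatch(code):
            entries.append((code, name))
    return entries


def by_name(entries: list) -> list:
    """Return the entries ordered alphabetically by script name."""
    return sorted(entries, key=lambda entry: entry[1].casefold())
```

Given the rows for Latn, Arab and Cyrl in any order, by_name returns
them as Arabic, Cyrillic, Latin.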

------------------------------------------------------------
           END OF PROPOSED RFC ON SCRIPT CODES
------------------------------------------------------------

Best regards

John Clews

--
John Clews,
Keytempo Limited (Information Management),
8 Avenue Rd, Harrogate, HG2 7PG
Email: Scripts2@sesame.demon.co.uk
tel: +44 1423 888 432;

Committee Member of ISO/IEC/JTC1/SC22/WG20: Internationalization;
Committee Member of ISO/TC37/SC2/WG1: Language Codes