Perhaps a draft RFC on Script codes?
John Clews
Scripts2@sesame.demon.co.uk
Fri, 06 Dec 2002 13:28:00 GMT
See also immediately previous email from John Clews
to ietf-languages@iana.org
------------------------------------------------------------
START OF PROPOSED RFC ON SCRIPT CODES
------------------------------------------------------------
Preliminary information
John Clews has suggested that due to unresolved problems in ISO in
developing ISO 15924: Codes for representation of names of scripts
(now delayed in ISO at the FDIS stage), and the need for such codes
by the Internet community, the Internet community will be better
served by developing its own RFC instead.
As ISO DIS 15924 has been stable for some time, it is suggested that
any such RFC should draw on work in ISO DIS 15924.
A copy of DIS (not FDIS) 15924 is available at:
http://www.evertype.com/standards/iso15924/document/dis15924.pdf
The following initial draft is suggested as a first attempt at an RFC
for script codes, and uses some of the RFC conventions such as having
text only in English, and the use of US-ASCII as the character set,
but it is heavily dependent on information in ISO DIS 15924.
Perhaps an even simpler solution would be to add the main table and
minimal text from this, in a revision of RFC 3066.
Michael Everson is the editor of the ISO 15924 project, and also IANA
Language Tag Reviewer. He is owed an enormous debt of gratitude
by the whole Internet community for his work on language codes and
script codes, and if an RFC comes to pass, the Internet community
is likely to benefit from further comment by him in this
area too.
Michael Everson at Evertype would be an obvious IANA Script Tag
Reviewer, if one were to be appointed under a Script Tags RFC, just
as he is already IANA Language Tag Reviewer, in relation to RFC 3066.
This document also owes an enormous debt of gratitude to the work of
Michael Everson, as much of the text is from, or is close to,
ISO DIS 15924.
Note: John Clews was formerly Chair of ISO/TC46/SC2 (Conversion of
Written Languages), which was responsible (in ISO committee terms)
for overseeing the development of the earlier drafts of ISO 15924.
John Clews
6 December 2002
------------------------------------------------------------
Contents
1 Scope
2 Normative references
3 Definitions
4 Script codes: methodology
5 Script codes: table
Annexes (informative)
Annex A: Typology of scripts
Annex B: Scripts under consideration for future addition to this draft RFC
Annex C: References
Annex D: Relationship between this proposed RFC and ISO DIS 15924
------------------------------------------------------------
Code for the representation of names of scripts
1 Scope
This draft RFC provides a code for the representation of
names of scripts. The codes were devised for use in terminology,
lexicography, bibliography, and linguistics, but they may be used for
any application requiring the expression of scripts in coded form.
This draft RFC also includes guidance on the use of script codes in
some of these applications.
NOTE: In principle, this draft RFC is intended to provide coded
representation for the names of all the scripts of the world. Unique
identification of a script is not always straightforward and obvious;
therefore a number of scripts have been listed in Annex B pending
further study before codes are standardized for them.
------------------------------------------------------------
2 Normative references
ISO 639:1988 Code for the representation of the
names of languages.
ISO 639-2:1998 Codes for the representation of
names of languages - Part 2: Alpha-3 codes.
ISO 3166-1:1997 Codes for the representation of
the names of countries and their subdivisions -
Part 1: Country codes.
ISO/IEC 10646-1:2000 Information technology -
Universal Multiple-Octet Coded Character Set
(UCS) -Part 1: Architecture and Basic Multilingual
Plane.
------------------------------------------------------------
3 Definitions
For the purpose of this draft RFC the
following definitions apply:
3.1
alias
A script code which is a collection of two or more
script codes.
3.2
code
Data representation in different forms according to
a pre-established set of rules. (ISO 639-2:1998)
3.3
country code
A combination of characters used to designate the
name of a country.
3.4
font
A collection of glyph images having the same
basic design, e.g. Courier Bold Oblique. (ISO/IEC
9541-1:1991)
3.5
glyph
A recognizable abstract graphic symbol which is
independent of any specific design. (ISO/IEC
9541-1:1991)
3.6
language code
A combination of characters used to represent [the
name of] a language or languages. (ISO 639-2:1998)
3.7
script
A set of graphic characters used for the written
form of one or more languages. (ISO/IEC 10646-1)
NOTE 1: A script, as opposed to an arbitrary subset of
characters, is defined in distinction to other scripts; in
general, readers of one script may be unable to read the
glyphs of another script easily, even where there is a
historic relation between them (see 3.9).
NOTE 2: In certain cases, this draft RFC provides codes
which are not subsumed under this definition.
Examples: the codes for aliases and the variant codes.
3.8
script code
A combination of characters used to represent the
name of a script.
3.9
script variant
A particular form of one script that is so unique
that it can almost be considered itself to be a
distinct script.
------------------------------------------------------------
4 Script codes: methodology
4.1 Structure of the alphabetic script codes
The alphabetic script codes are created from the
original script name in the language commonly
used for it, transliterated or transcribed into Latin
letters. If a country, where the script concerned
has the status of a national script, requests a
certain script code, preference is given to this
code whenever possible. The four-letter codes
shall be written with an initial capital Latin letter
and final small Latin letters (taken from the range
Aaaa to Zzzz). This serves to help differentiate
script codes from language codes and country
codes: so, for example, Mong mon MNG or Mong
mn MN would refer to a book in the Mongolian
script, in the Mongolian language, originating in
Mongolia.
[[NOTE: See 4.7 regarding changes to the codes.]]
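The shape rule in 4.1 (one capital Latin letter followed by three small
Latin letters) can be checked mechanically. The following is a minimal
sketch in Python, not part of the draft itself; the function names are
hypothetical and chosen only for illustration. It tests the form of a
code, not whether the code is actually assigned in the table.

```python
import re

# Shape described in 4.1: initial capital Latin letter, then three
# small Latin letters (the range Aaaa to Zzzz).
SCRIPT_CODE_RE = re.compile(r"[A-Z][a-z]{3}")


def is_well_formed_script_code(code):
    """Return True if `code` has the Aaaa..Zzzz shape (it may be unassigned)."""
    return SCRIPT_CODE_RE.fullmatch(code) is not None


def split_triple(text):
    """Split a 'script language country' triple such as 'Mong mn MN'."""
    script, language, country = text.split()
    if not is_well_formed_script_code(script):
        raise ValueError("not a well-formed script code: %r" % script)
    return {"script": script, "language": language, "country": country}
```

Because the three kinds of code differ in case, a triple such as
"Mong mn MN" can be split without any further labelling of its parts.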
[[4.2 Typology of scripts
Information on types of scripts which may be encountered may be found
in Annex A, and in the references in Annex C.]]
4.3 Relation of the script codes to ISO standards
The four-letter codes are derived from ISO 639-2
where the name of a script and the name of a
language using the script are identical
(example: Gujarati - ISO 639-2: guj; this draft RFC: Gujr).
In cases where there is no identity, the script name may
have a unique form (examples:
Korean kor, Hangul Hang;
Punjabi pan, Gurmukhi Guru;
Dhivehi div, Thaana Thaa).
Where possible the first three letters of the
four-letter code correspond to the three-letter
code. Preference is given to the Bibliographical
codes given in ISO 639-2 in deriving the codes
specified in this draft RFC.
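The relation described in 4.3 can be illustrated with the examples given
there. The sketch below, in Python, uses a hypothetical helper name and
only the code pairs quoted in this section; it shows which script codes
share their first three letters with the corresponding ISO 639-2 code and
which have a form of their own.

```python
def shares_language_prefix(script_code, language_code):
    """True if the first three letters of the script code match the
    three-letter language code, ignoring case."""
    return script_code[:3].lower() == language_code.lower()


# (script code, ISO 639-2 code) pairs taken from the examples in 4.3.
EXAMPLES = [("Gujr", "guj"), ("Hang", "kor"), ("Guru", "pan"), ("Thaa", "div")]

for script, language in EXAMPLES:
    relation = "derived from" if shares_language_prefix(script, language) else "independent of"
    print(script, relation, language)
```

Only Gujr is reported as derived from its language code; the other three
are cases where the script name and the language name are not identical.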
4.4 Adaptation of the script codes
[[If adapting this draft RFC to other
scripts (for example, Cyrillic or Greek), codes shall
be formed according to the principles of this
RFC]].
4.5 Addition of new script codes
[[For the purpose of allocating additional script
codes, a Script Tag Reviewer shall be appointed]]
4.6 Application of script codes
Script codes can be used in the following
particular instances.
4.6.1 To indicate generally the scripts in which
documents are or have been written or recorded.
Example:
<META HTTP-EQUIV="Content-Language"
CONTENT="ga, ru">
<META NAME="Content-Script"
CONTENT="Latg, Cyrl">
4.6.2 To indicate the script specified in document
holding records (order records, bibliographic
records, etc.).
Examples:
In bibliographies:
Ryte pomerantsen: Jewish folk humor. New York:
Schocken Books, 1965. xxvi, 203 p.; 20 cm. In
Yiddish (Latn) and English.
Kroatisch-Deutsch und Deutsch-Kroatisch: mit einem
Anhang der wichtigeren Neubildungen des
Kroatischen und Deutschen. - Berlin: Axel Juncker,
1941. vi, 302, 314, 32 p.; 15 cm. In Croatian (Latn)
and German (Latf).
4.6.3 To indicate the script used by an application.
Example:
"Laser Syriac: The fonts supplied in this package are
coded according to collection 85 of Annex A of ISO/IEC
10646 and provide a complete set of glyphs in all three
of the styles used to write Syriac (Syre, Syrn, Syrj)."
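The META example in 4.6.1 can be produced and read back by a program.
The sketch below is in Python and makes two assumptions: that
"Content-Script" is used exactly as in that example (it is a name
proposed in this draft, not an established HTML convention), and that
the CONTENT value is a comma-separated list. The function names are
hypothetical.

```python
from html import escape


def content_script_meta(script_codes):
    """Build the META element shown in 4.6.1 from a list of script codes."""
    value = ", ".join(script_codes)
    return '<META NAME="Content-Script" CONTENT="{}">'.format(escape(value, quote=True))


def parse_content_script(value):
    """Split a CONTENT value such as 'Latg, Cyrl' into individual codes."""
    return [part.strip() for part in value.split(",") if part.strip()]
```

For instance, content_script_meta(["Latg", "Cyrl"]) yields the second
META element of the example in 4.6.1 on a single line.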
[[4.7 Changes of script codes
In order to preserve the integrity of data coded
using the codes set forth in this draft RFC, it is intended
that the four-letter and
numeric codes specified herein shall not be
changed unless there be extraordinarily
compelling reasons to do so.]]
------------------------------------------------------------
5 Script codes: table
Alphabetical table of four-letter script codes
[lists in other orders, e.g. by script name, can be generated from this]
------------------------------------------------------------
Code Name
------------------------------------------------------------
Arab Arabic
Aram Aramaic
Armn Armenian
Aves Avestan
Batk Batak
Beng Bengali
Blis Blissymbols
Bopo Bopomofo
Brah Brahmi
Brai Braille
Bugi Buginese
Buhd Buhid
Cans Unified Canadian Aboriginal Syllabics
Cham Cham
Cher Cherokee
Cirt Cirth
Cpmn Cypro-Minoan
Cprt Cypriot syllabary
Cyrl Cyrillic
Cyrs Cyrillic (Old Church Slavonic variant)
Deva Devanagari
Dsrt Deseret (Mormon)
Egyd Egyptian demotic
Egyh Egyptian hieratic
Egyp Egyptian hieroglyphs
Ethi Ethiopic (Ge'ez)
Geoa Georgian (Asomtavruli)
Geon Georgian (Nuskhuri)
Geor Georgian (Mkhedruli)
Glag Glagolitic
Goth Gothic
Grek Greek
Gujr Gujarati
Guru Gurmukhi
Hang Hangul
Hani Han ideographs
Hano Hanunoo
Hebr Hebrew
Hira Hiragana
Hmng Pahawh Hmong
Hrkt (alias for Hiragana + Katakana)
Hung Old Hungarian
Inds Indus (Harappan)
Ital Old Italic (Etruscan, Oscan, etc.)
Java Javanese
Jpan (alias for Han + Hiragana + Katakana)
Kali Kayah Li
Kana Katakana
Khar Kharosthi
Khmr Khmer
Knda Kannada
Laoo Lao
Latf Latin (Fraktur variant)
Latg Latin (Gaelic variant)
Latn Latin
Lepc Lepcha (Rong)
Lina Linear A
Linb Linear B
Mand Mandaean
Maya Mayan hieroglyphs
Mero Meroitic
Mlym Malayalam
Mong Mongolian
Mymr Myanmar (Burmese)
Ogam Ogham
Orkh Orkhon
Orya Oriya
Osma Osmanya
Palv Pahlavi
Perm Old Permic
Phnx Phoenician
Phst Phaistos
Plrd Pollard Phonetic
Qaaa-Qtzz Reserved for private use
Roro Rongorongo
Runr Runic
Shaw Shavian (Shaw)
Sinh Sinhala
Syrc Syriac
Syre Syriac (Estrangelo variant)
Syrj Syriac (Western variant)
Syrn Syriac (Eastern variant)
Tagb Tagbanwa
Taml Tamil
Telu Telugu
Teng Tengwar
Tfng Tifinagh (Berber)
Tglg Tagalog
Thaa Thaana
Thai Thai
Tibt Tibetan
Vaii Vai
Visp Visible Speech
Xpeo Cuneiform, Old Persian
Xsux Cuneiform, Sumero-Akkadian
Xuga Cuneiform, Ugaritic
Yiii Yi
Zxxx Code for unwritten languages
Zyyy Code for undetermined script
Zzzz Code for uncoded script
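The table lends itself to a simple lookup structure. The following
Python sketch holds only a small excerpt of the rows above, together
with the two aliases and the private-use range; the variable and
function names are hypothetical. It also shows how a list ordered by
script name can be generated from the table.

```python
import re

# Excerpt of the table above; a full implementation would list every row.
SCRIPT_NAMES = {
    "Arab": "Arabic",
    "Cyrl": "Cyrillic",
    "Grek": "Greek",
    "Hani": "Han ideographs",
    "Hira": "Hiragana",
    "Kana": "Katakana",
    "Latn": "Latin",
}

# Aliases stand for a collection of two or more script codes (see 3.1).
ALIASES = {
    "Hrkt": ("Hira", "Kana"),
    "Jpan": ("Hani", "Hira", "Kana"),
}

# Qaaa to Qtzz: 'Q', then a letter from a to t, then two small letters.
PRIVATE_USE_RE = re.compile(r"Q[a-t][a-z]{2}")


def is_private_use(code):
    """True for codes in the range reserved for private use."""
    return PRIVATE_USE_RE.fullmatch(code) is not None


def expand(code):
    """Return the constituent script codes of an alias, or the code itself."""
    return ALIASES.get(code, (code,))


def by_name():
    """Return (code, name) pairs ordered by script name."""
    return sorted(SCRIPT_NAMES.items(), key=lambda item: item[1])
```

With this excerpt, expand("Jpan") gives the three codes Hani, Hira and
Kana, while expand("Latn") gives Latn alone.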
------------------------------------------------------------
Annex A: Typology of scripts (informative)
------------------------------------------------------------
The following information is included to indicate types of scripts
that may be encountered.
Any numeric codes here are informative only, and should NOT be used
as code elements in Internet use.
The INFORMATIVE numeric script codes below have been assigned to
provide some measure of mnemonicity to the codes used. The following
ranges have been used:
000-099 Hieroglyphic and cuneiform scripts
100-199 Right-to-left alphabetic scripts
200-299 Left-to-right alphabetic scripts
300-399 Brahmi-derived scripts
400-499 Syllabic scripts
500-599 Ideographic scripts
600-699 Undeciphered scripts
700-799 (unassigned)
800-899 (unassigned)
900-999 Private use, aliases, special codes
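The ranges above can be written as a small lookup. The Python sketch
below is informative only, in keeping with this annex, and the function
name is hypothetical; it maps a numeric value to the category of its
range and does not assign numbers to any particular script.

```python
# (first, last, category) triples copied from the ranges listed above.
NUMERIC_RANGES = [
    (0, 99, "Hieroglyphic and cuneiform scripts"),
    (100, 199, "Right-to-left alphabetic scripts"),
    (200, 299, "Left-to-right alphabetic scripts"),
    (300, 399, "Brahmi-derived scripts"),
    (400, 499, "Syllabic scripts"),
    (500, 599, "Ideographic scripts"),
    (600, 699, "Undeciphered scripts"),
    (700, 799, "(unassigned)"),
    (800, 899, "(unassigned)"),
    (900, 999, "Private use, aliases, special codes"),
]


def category_for(number):
    """Return the category of the range containing `number` (0 to 999)."""
    for first, last, category in NUMERIC_RANGES:
        if first <= number <= last:
            return category
    raise ValueError("numeric code out of range: %r" % number)
```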
NOTE 1: ISO/IEC 10646 uses the character-glyph
model (defined in ISO/IEC TR 15285:1998) to classify
the characters used to write different languages. This draft RFC
does not attempt to apply the character-glyph
model, because it is sometimes important to identify
certain script variants regardless of the encoding a
given text may employ. For example, a Syriac book may
be written in one of the three variants of the Syriac script
(Estrangelo, Eastern, Western). Identification of such
script variants, while outside the scope of ISO/IEC
10646, is relevant to the content of script codes. For
example, a user ordering a book through interlibrary
loan may prefer, or may wish to exclude, the Gaelic
variant of the Latin script for reasons of ease of legibility
or familiarity with one of the variants.
NOTE 2: The classifications here reflect the chief
attribute of the scripts so classified, and are not
necessarily comprehensive of the ways in which the
scripts are used. For example, while Ogham may be
written from left to right, it is also written vertically from
bottom to top. Similarly, the Hangul (Hangŭl, Hangeul)
alphabet is sometimes written in vertical columns, and
the letters of its alphabet are arranged in syllabic
clusters.
NOTE 3: Within each category numeric identifiers
assigned to scripts have followed a principle of
chronology, and genetic relationship, though this
principle cannot be established by any hard and fast
rule, since scripts may have many different
characteristics. Codes have been assigned by spacing them
out so that scripts encoded in future may be assigned to
appropriate places in the range.
------------------------------------------------------------
Annex B: Scripts under consideration for future addition to this draft RFC (informative)
------------------------------------------------------------
[[In order to make sure that ISO 15924 is useful for
people needing to provide codes for all the scripts
of the world, scripts included in ISO 15924 were
selected with input and feedback from national
standards organizations and/or qualified experts.]]
[[Some scripts were not included in this edition
because sufficient information was not available
during the preparation and review stages.]]
It is intended that codes for the names of these
scripts will be allocated when available information
is judged to be sufficient. Such scripts include:
Ahom
Aiha (Kesh)
Alvan
Aymara pictograms
Aztec pictograms
Balinese
Balti
Bamum (Cameroon)
[etc].
------------------------------------------------------------
Annex C: References
------------------------------------------------------------
Auge, Claude, ed. 1898-1907. Nouveau Larousse illustre.
7 volumes et 1 volume de supplement. Paris: Larousse.
Barry, Randall K., ed. 1997. ALA-LC romanization tables:
transliteration schemes for non-Roman scripts. Washington, DC:
Library of Congress Cataloging Distribution Service. ISBN
0-8444-0940-5
Calvet, Louis-Jean. 1998. Histoire de l'ecriture. Paris: Hachette
Litteratures. ISBN 2-01-278943-9
Coyaud, Maurice. 1987. Les langues dans le monde chinois. Tome 2,
Pour l'analyse du folklore. Paris: [s.n.].
Daniels, Peter T., and William Bright, eds. 1996. The world's writing
systems. New York; Oxford: Oxford
University Press. ISBN 0-19-507993-0
Diringer, David. 1996. The alphabet: a key to the history of mankind.
New Delhi: Munshiram Manoharlal. ISBN 82-215-0780-0
Encyclopaedia universalis. 1994. 28 volumes. Paris: Encyclopaedia
universalis France. ISBN 2-85229-240-4
Faulmann, Carl. 1990 (1880). Das Buch der Schrift. Frankfurt am Main:
Eichborn. ISBN 3-8218-1720-8
Fevrier, James. 1995. Histoire de l'ecriture. Paris: Payot. ISBN
2-228-88976-8.
Gaur, Albertine. 1992. A history of writing. Revised ed. London: The
British Library. ISBN 0-7123-0270-0
Gelb, I. J. 1952. A study of writing: the foundations of
grammatology. Chicago: University of Chicago Press.
Haarmann, Harald. 1990. Universalgeschichte der Schrift.
Frankfurt/Main; New York: Campus. ISBN 3-593-34346-0
Imprimerie Nationale. 1990. Les caracteres de l'Imprimerie Nationale.
Paris: Imprimerie Nationale Editions. ISBN 2-11-081085-8
ISO/IEC Technical Report 15285:1998. An operational model for
characters and glyphs.
Jensen, Hans. 1969. Die Schrift in Vergangenheit und Gegenwart.
3., neubearbeitete und erweiterte Auflage.
Berlin: VEB Deutscher Verlag der Wissenschaften.
Malherbe, Michel. 1995. Les langues du monde. Paris: Robert Laffont.
ISBN 2-221-05947-6
Nakanishi, Akira. 1990. Writing systems of the world: alphabets,
syllabaries, pictograms. Rutland, VT: Charles E. Tuttle. ISBN
0-8048-1654-9
Robinson, Andrew. 1995. The story of writing. London: Thames and
Hudson. ISBN 0-500-01665-8
Tamiseur, Jean-Christophe, ed. 1998. Dictionnaire des peuples.
Paris: Larousse. ISBN 2-03-720240-7.
Unicode Consortium. 2000. The Unicode standard, Version 3.0. Reading,
MA: Addison-Wesley. ISBN 0-201-61633-5
------------------------------------------------------------
Annex D: Relationship between this proposed RFC and ISO DIS 15924
------------------------------------------------------------
This proposed RFC has aligned its text and tables to ISO DIS 15924.
ISO preparatory matter, and matters relating to ISO procedures
(especially Annex A of ISO DIS 15924) are not aligned.
RFCs are usually written in English, and in US-ASCII text.
Tables in this proposed RFC do not include French language names, and
names of scripts use US-ASCII encoding.
The words "this draft RFC" or similar appear in comparable places where
ISO DIS 15924 text might read "this standard" or similar.
Numeric codes would clash with alphabetic codes, in being non-unique,
and are therefore not included in the table of codes.
Other reference to numeric codes (in section 4.2) has been moved to
an informative part of this proposed RFC. It may be useful to
consider whether for the purposes of this draft RFC the arrangement
of this information would suffice, and that the numeric information
could be removed here too, to ensure absolute uniqueness.
Information in Section 4.2 of ISO DIS 15924 is cross referenced in
Section 4.2 of this proposed RFC, and similar information appears in
Annex A of this proposed RFC.
This proposed RFC has its text in sections 1-4, its main table of
codes in section 5 (whereas ISO DIS 15924 has this table as part of
section 4).
Annex A of ISO DIS 15924 largely deals with ISO procedures and
similar information is not included in this proposed RFC.
Annexes B and C of this proposed RFC are similar to Annexes B and C
of ISO DIS 15924.
This proposed RFC also includes this Annex D, which indicates
similarities to ISO DIS 15924.
[[ ]] Small sections using this device are kept at this stage to
allow comparison between this proposed RFC and ISO DIS 15924, to
maintain continuity of paragraph numbering etc. Some of these may
benefit from removal at an early stage, as they are more relevant
to ISO development than to RFC development.
These either illustrate that this section is not particularly
relevant to the development of this proposed RFC, or alternatively
that the text of that section needs to be updated in ISO 15924,
if ISO 15924 is ever pursued beyond its current stage of progress
within ISO.
This proposed RFC has no equivalents of the following tables in ISO DIS 15924:
Table 2 Numeric list of script codes
Table 3 Alphabetical list of English script names
Table 4 Alphabetical list of French script names
The reasons for this are that
(a) an alphabetical list of English script names can be autogenerated
from the main table;
(b) given the need for a single unique representation of a
script name to be used in Internet services, a numeric
identifier might not be used;
(c) French names are not used in this version of this proposed RFC.
Note: those tables were mistakenly numbered in the contents of an
earlier draft of ISO DIS 15924 as tables 3-5.
------------------------------------------------------------
END OF PROPOSED RFC ON SCRIPT CODES
------------------------------------------------------------
Best regards
John Clews
--
John Clews,
Keytempo Limited (Information Management),
8 Avenue Rd, Harrogate, HG2 7PG
Email: Scripts2@sesame.demon.co.uk
tel: +44 1423 888 432;
Committee Member of ISO/IEC/JTC1/SC22/WG20: Internationalization;
Committee Member of ISO/TC37/SC2/WG1: Language Codes