The role of country codes.

Thu May 29 08:34:05 CEST 2003

Jon Hanna wrote on 05/29/2003 05:08:05 AM:

> I very much doubt that English is the sole exception to this.
>
> In current usage the country codes identify both the orthographic and other
> differences, and it works well because they are pretty much where both of
> these differences should be with respect to the primary subtag.

While we could debate whether en-IE is or isn't the best example of the use
of country codes for things other than spelling, I think Jon's point on
such use of country codes is valid. For instance, country codes are used in
existing data and implementations with "es" for differences within Spanish
that are divided more or less along country lines and that have to do
mainly with vocabulary, not spelling.

> With the
> introduction of script information into language codes the double-duty of
> the country codes no longer works well. The obvious priority is to place the
> differences in vocabulary and syntax before the script information and the
> orthographic differences after, I don't think this translates well to any
> suggested encoding.

It seems to me Jon is concluding failure before considering possibilities.

IIRC, there are country-based differences within German in relation to both
vocabulary and spelling, and the two don't coincide entirely. (In the
discussion of de-1901 etc, I had initially advocated de-1901-CH etc, though
ultimately conceded to de-CH-1901.) That raises a theoretical question of a
possible need to distinguish both dialect and spelling along independent
country lines, and such a distinction could perhaps be expressed by
something like "de-CH-DE", where "de-CH" denotes "Swiss dialect variant of
German" and "-DE" denotes "spelling conventions of Germany". As this hasn't
been registered, though, this example is only hypothetical -- I don't know
how real the potential need for such a distinction within German might be.

In discussion that followed distribution of his paper to Michael and me,
Peter Edberg raised similar possibilities, giving the example of "en-GB-US"
for British English language/dialect ("lorry" vs "truck") in U.S.
orthography ("color" vs "colour"). Offhand, though, I can't think of points
at which vocabulary and spelling differences coincide; that would have to
be something like "color" vs "colour" where this wordform has important
senses -- e.g. "to colo(u)r" meaning "to paint" -- that exist in one
country but not the other.

Whether that example is good or not, Peter clearly suggested in his paper
that vocabulary distinctions precede writing-related distinctions
(orthography, spelling). That way the tag can be cleanly parsed into two
parts: the first half is purely language/dialect, and the second half (when
present) is related to writing. He also had in mind that defaults for
writing could be assumed, so that "en" can be taken to mean "English in the
common Latin orthography", and "en-us", to mean "US dialect variant of
English with US spelling conventions" (which is how these are effectively
used today). Using this basic approach of
Language/DialectVariations-WritingVariations, other possibilities would
include tags of the form "uz-CN-latn-UZ" (the China dialect of Uzbek
written in Latin script using Uzbekistan spelling conventions -- again, a
hypothetical example).
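
To make the two-part parse concrete, here is a rough Python sketch of how a
tag in the Language/DialectVariations-WritingVariations form might be split.
Everything in it is an assumption for illustration only: the names
(SplitTag, split_tag), the rule that a two-letter subtag before the script
is the dialect and one after it is the spelling convention, and the default
that spelling follows the dialect when not given. None of this is a
registered syntax.

```python
from typing import NamedTuple, Optional


class SplitTag(NamedTuple):
    """Hypothetical two-part reading of a tag (illustrative, not a standard)."""
    language: str
    dialect: Optional[str]   # language/dialect half: country of the dialect
    script: Optional[str]    # writing half: None means "default script"
    spelling: Optional[str]  # writing half: country of spelling conventions


def _is_country(subtag: str) -> bool:
    return len(subtag) == 2 and subtag.isalpha()


def _is_script(subtag: str) -> bool:
    return len(subtag) == 4 and subtag.isalpha()


def split_tag(tag: str) -> SplitTag:
    """Split language[-DIALECT][-script][-SPELLING] under the assumed rules."""
    parts = tag.split("-")
    language = parts[0].lower()
    rest = parts[1:]
    dialect = script = spelling = None

    # Language/dialect half: an optional country code right after the language.
    if rest and _is_country(rest[0]):
        dialect = rest.pop(0).upper()
    # Writing half: an optional script, then an optional spelling country.
    if rest and _is_script(rest[0]):
        script = rest.pop(0).title()
    if rest and _is_country(rest[0]):
        spelling = rest.pop(0).upper()
    if rest:
        raise ValueError(f"subtags not covered by this sketch: {rest}")

    # Assumed default: spelling follows the dialect, as "en-us" is used today.
    if spelling is None:
        spelling = dialect
    return SplitTag(language, dialect, script, spelling)
```

Under these assumptions, split_tag("uz-CN-latn-UZ") separates the China
dialect from the Latin script and Uzbekistan spelling, while
split_tag("en-us") fills in US spelling by default. Variant subtags such as
the "1901" in "de-CH-1901" are deliberately left out of the sketch and raise
an error.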

Of course, this idea doesn't coincide with the also-interesting suggestion
that a script ID should be sequenced immediately after the language (but
not dialect) for the sake of left-end hierarchical processes such as
current accept-lang implementations, since differences in script are much
more important in relation to user preference than are differences in
spelling or dialect.
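
The tension is easy to see with a small Python sketch of left-end prefix
matching, which is roughly how accept-lang style matching is usually
described: a range matches a tag if it equals the tag or is a prefix ending
at a "-" boundary. The function name prefix_matches and the sample tags are
assumptions for illustration; the tags are hypothetical, as above.

```python
def prefix_matches(language_range: str, tag: str) -> bool:
    """Return True if language_range equals tag or is a left-end prefix of it.

    The prefix must end at a subtag boundary, so "en" does not match "eng".
    Comparison is case-insensitive.
    """
    r = language_range.lower()
    t = tag.lower()
    return t == r or t.startswith(r + "-")


# A user asking for Uzbek in Latin script, with script placed right after
# the language, matches by simple prefix:
script_first = prefix_matches("uz-latn", "uz-latn-CN")

# With the dialect placed before the script, the same request no longer
# matches, even though the content is in Latin script:
dialect_first = prefix_matches("uz-latn", "uz-CN-latn-UZ")
```

Here script_first is True and dialect_first is False, which is the cost of
putting dialect ahead of script for implementations that only compare
prefixes.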

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485