Region subtags under 3066 and 3066bis

Thu Feb 24 22:55:55 CET 2005

Hi -

> From: "Frank Ellermann" <nobody at xyzzy.claranet.de>
> To: <ietf-languages at alvestrand.no>
> Sent: Wednesday, February 23, 2005 7:19 AM
> Subject: Re: Region subtags under 3066 and 3066bis
...
> I don't like the concept of "country codes" in language tags,
> and these "private use country codes" make a bad idea worse.
...

half-rhetorically:

Why do we think we need the "script" and "region" subtags?
Because we know that for some languages,
   1) there are multiple orthographies,
   2) the differences in these orthographies are significant enough that
       users may need to specify which one they want in order to be able to
       understand a document
   3) "script" is certainly a critical dimension; even a fluent native speaker
       may be unable to read a text in one of the scripts that can be used
       for a language.
   4) the "region" dimension is at least somewhat helpful in identifying
       orthographies because the educational institutions of a given country,
       even if there is no legislated requirement to teach a particular orthography,
       will probably do so simply as a matter of pedagogic and administrative
       sanity.

There are some problems with these.  A few include:
    1) orthographies change over time.  Witness the recent German mess, or,
        more dramatically, the changes in English since the 18th century, or the
        orthographic reforms of Russian.
    2) there may be multiple orthographies in use (effectively in free variation)
        within a region and script.  Modern English spelling may be an example
        of this.  (Though I'd argue that in that case, users are so used to the
        variations that there's little point in making the distinction.  We've all learned
        both the British and American spellings of many words, and the only time
        we really care which one is used is when we're taking a spelling test in school,
        or proof-reading something for publication.  In both those cases, some dictionary
        is cited as the authority, rather than any particular national standard.)
    3) users don't necessarily agree whether a given difference is significant, or
        even real.  For example, in an ESL textbook I saw in Viet Nam, there was
        a table of "American" and "British" word equivalents.  The only problem was
        that in several cases I, a native speaker of American English, would definitely
        have preferred the "British" form, and, in other cases, would not have considered
        the "British" choice "marked".
    4) there is uncertainty about what the referent of a region subtag might be, as in the case
        of YU, or even US if one looks at, say, the last 200 years (though one
        might argue that this is covered by (1) above).

If country codes aren't the right thing to use to identify substantial regional
variations, what would be good alternatives?  It seems a chunk of text may
be characterized as being in some language(s), using some script(s), with
words from some lexicon(s), written using some set of orthographic
convention(s).  Do ALL these dimensions need to be crammed into a language tag?
Is there a better stand-in for all these attributes than a script identifier and a
region subtag?  Do we REALLY need that much precision?  Does the increased
precision make things any more accurate for the purposes for which these
tags are actually used?

Country codes seem to be a shorthand for particular combinations, and
one would generally have to look to a registration to figure out exactly which
lexicon, etc. had been given AS AN EXAMPLE of that language, and whether
there were references that distinguished this particular variant of a language
from others. I'd be surprised if any registration could reasonably be read
as exhaustively defining a language, though some on this list seem to want
to look at things that way.

When we look at the current registration process, we've seen languages
documented using reference material covering all of these dimensions.
Which are essential, and which are accidental?

Randy