Non-ASCII language tags (was: Re: New draft submitted of 3066bis...)

Sun Nov 7 01:36:40 CET 2004

Here's my attempt to summarize what is going on with this talk of
non-ASCII language tags.  (Warning: non-ASCII characters in this message
will probably get clobbered in the digest.)

I don't think anybody is really proposing to use or allow blatantly
non-ASCII values such as "ελ" or "ру" in xml:lang attributes, because
that would be, well, stupid.  I agree with Harald and John that any
deliberate attempt to create a non-ASCII language tag, whether due to a
characterization of RFC 3066 as "advisory" or for any other reason, is
doomed to failure, and deserving of scorn and ridicule as well.

Elliotte's concern seems to be what happens when a user with a Turkish
locale sends or receives a language tag, say "it" for Italian, and it
gets uppercased to "İT", or conversely when the uppercase tag "IT" gets
lowercased to "ıt".  These transformed tags would contain non-ASCII
characters because the case conversion applied to them was
locale-dependent.  Elliotte
proposes to prevent this problem by specifying in RFC 3066bis that only
the "normal," or "traditional," or if you will, "English" casing
conventions should be applied to language tags.
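
To make the failure mode concrete, here is a rough sketch in Python.
Python's built-in str.upper() and str.lower() do not consult the
locale, so the Turkish behavior is only imitated by two hypothetical
helpers, turkish_upper() and turkish_lower(), which handle nothing but
the letter i.  They are an illustration, not anybody's real API.

```python
# Python's own str.upper()/str.lower() ignore the locale, so Turkish
# casing is imitated here by two hypothetical helpers (illustration only).

DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # LATIN SMALL LETTER DOTLESS I


def turkish_upper(text: str) -> str:
    """Uppercase after applying the Turkish rule i -> dotted capital I."""
    return text.replace("i", DOTTED_CAPITAL_I).upper()


def turkish_lower(text: str) -> str:
    """Lowercase after applying the Turkish rule I -> dotless small i."""
    return text.replace("I", DOTLESS_SMALL_I).lower()


mangled_upper = turkish_upper("it")   # tag "it" after Turkish uppercasing
mangled_lower = turkish_lower("IT")   # tag "IT" after Turkish lowercasing

# Neither result is a pure-ASCII string any more.
print(ascii(mangled_upper), mangled_upper.isascii())
print(ascii(mangled_lower), mangled_lower.isascii())
```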

This seems fair, and I propose that after the existing passage in
Section 2.1:

"The tags and their subtags, including private-use and extensions, are
to be treated as case insensitive: there exist conventions for the
capitalization of some of them, but these should not be taken to carry
meaning."

the following sentence (or something like it) should be added:

"'Case insensitivity' for language tags SHALL refer specifically to the
locale-independent relationship between the uppercase US-ASCII letters
'A' through 'Z' and the corresponding lowercase letters 'a' through
'z'."

This should satisfy Elliotte's concern without dragging Unicode case
folding or English ethnocentrism into the mix.
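
As a minimal sketch of what the proposed sentence would mean for an
implementer (the function names ascii_lower() and tags_equal() are my
own, not taken from any draft), a comparison could fold only the 26
letters 'A' through 'Z' and leave every other character alone:

```python
import string

# Explicit table: only 'A'..'Z' map to 'a'..'z'; nothing else is touched.
_ASCII_FOLD = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(tag: str) -> str:
    """Lowercase the US-ASCII letters in tag, independent of any locale."""
    return tag.translate(_ASCII_FOLD)


def tags_equal(a: str, b: str) -> bool:
    """Compare two language tags case-insensitively, ASCII letters only."""
    return ascii_lower(a) == ascii_lower(b)


print(tags_equal("en-US", "EN-us"))    # ordinary case difference
print(tags_equal("\u0130T", "it"))     # dotted capital I is not folded
```

Under this definition a tag that was mangled by Turkish casing simply
fails to match, rather than being silently treated as equivalent.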

Actually specifying the quasi-mathematical relationship 'A'-'a' (as
suggested) would be inappropriate for this definition because it assumes
a "mutually contiguous" allocation scheme that is not necessarily
present in all character sets, although all modern character sets do
exhibit it.  (Even where it is present, the offset is not the same
everywhere: 'A'-'a' is a negative value in ASCII, but a positive value
in EBCDIC.)
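
The sign difference is easy to check.  The sketch below uses Python's
"cp037" codec as a stand-in for EBCDIC; that choice is an assumption on
my part, since EBCDIC comes in many code pages.  The helper name
case_offset() is likewise made up for this example.

```python
import string


def case_offset(letter: str, encoding: str) -> int:
    """Return code(upper) - code(lower) for one letter in an encoding."""
    upper = letter.upper().encode(encoding)[0]
    lower = letter.lower().encode(encoding)[0]
    return upper - lower


for encoding in ("ascii", "cp037"):
    # Collect the offset for every letter; one value means it is constant.
    offsets = {case_offset(c, encoding) for c in string.ascii_lowercase}
    print(encoding, sorted(offsets))
```

In both encodings the offset is the same for all 26 letters, but it is
-32 in ASCII and +64 in code page 037, so a rule written as arithmetic
on 'A'-'a' would not carry over from one to the other.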

I should note that the passage in RFC 3066bis just before this:

"Note that although [RFC 2234][12] refers to octets, the language tags
described in this document are sequences of characters from the US-ASCII
repertoire. Language tags may be used in documents and applications that
use other encodings, so long as these encompass the US-ASCII repertoire.
An example of this would be an XML document that uses the Unicode
UTF-16LE encoding."

was something that I asked Mark and Addison for, several months ago, to
replace the previous wording which stated that language tags must be in
US-ASCII.  There are two scenarios in which they are not.

First, characters in the ASCII repertoire can exist in character sets
other than ASCII and supersets of ASCII.  UTF-16 is a good example,
mentioned in the current wording, but you could also represent a
language tag in EBCDIC or BOCU-1, neither of which uses the same code
points as ASCII (that is, 'A' ≠ 0x41).  Then there are character sets
like FIELDATA and the old Sinclair ZX81 character set, neither of which
supports lowercase letters at all.  You could even construct a perfectly
valid, uppercase language tag in one of those, although of course it
could not be lowercased.
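
Here is a small sketch of the first scenario, again in Python.  It
encodes one tag in ASCII, UTF-16LE, and code page 037 (used here as one
representative EBCDIC code page, which is my assumption).  BOCU-1,
FIELDATA, and the ZX81 set are left out only because Python ships no
codecs for them.  The helper name encode_tag() is hypothetical.

```python
def encode_tag(tag: str, encoding: str) -> bytes:
    """Encode a language tag, refusing anything outside US-ASCII."""
    if not tag.isascii():
        raise ValueError("language tags use the US-ASCII repertoire only")
    return tag.encode(encoding)


TAG = "en-US"

for encoding in ("ascii", "utf-16-le", "cp037"):
    raw = encode_tag(TAG, encoding)
    # The bytes differ from one encoding to the next, but each decodes
    # back to the same sequence of ASCII-repertoire characters.
    print(encoding, raw.hex(), raw.decode(encoding) == TAG)
```

The repertoire is what stays fixed; the byte values are a property of
whichever encoding happens to carry the tag.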

Second, there are those gosh-darned Unicode Plane 14 tag characters.  I
know, they are "strongly discouraged" and "valid only in special
protocols during a lunar eclipse in the middle of Ramadan."  But they
are still defined as representing an RFC 3066 (or successor) language
tag, and they are one-to-one mirrors of the ASCII repertoire.  While RFC
3066bis (and predecessors) state that language tags must be from the
ASCII repertoire, the Plane 14 mechanism -- in those rare cases where it
is permitted -- should be seen as a legitimate special-purpose
application which modifies the target character set using a one-to-one
mapping.  A protocol that specifies a language tag of "������", as part
of an appropriate protocol, should not be considered to violate RFC
3066bis in the same way that <span lang="ελ"> would.
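
The one-to-one mapping is simple enough to sketch.  In Unicode the tag
characters U+E0020 through U+E007E mirror ASCII 0x20 through 0x7E, and
U+E0001 LANGUAGE TAG introduces a language tag.  The function names
to_plane14() and from_plane14() below are mine, and the sketch ignores
everything else a real protocol would need to say about tag scope.

```python
TAG_BASE = 0xE0000            # offset between ASCII and its Plane 14 mirror
LANGUAGE_TAG = "\U000E0001"   # U+E0001 LANGUAGE TAG, introduces the tag


def to_plane14(tag: str) -> str:
    """Map a printable-ASCII language tag onto Plane 14 tag characters."""
    if not all(0x20 <= ord(c) <= 0x7E for c in tag):
        raise ValueError("only printable US-ASCII can be mirrored")
    return LANGUAGE_TAG + "".join(chr(TAG_BASE + ord(c)) for c in tag)


def from_plane14(text: str) -> str:
    """Recover the ASCII language tag from its Plane 14 form."""
    if not text.startswith(LANGUAGE_TAG):
        raise ValueError("missing U+E0001 LANGUAGE TAG")
    body = text[len(LANGUAGE_TAG):]
    if not all(0xE0020 <= ord(c) <= 0xE007E for c in body):
        raise ValueError("not a run of Plane 14 tag characters")
    return "".join(chr(ord(c) - TAG_BASE) for c in body)


encoded = to_plane14("el")
print(ascii(encoded))            # escaped, since these rarely display
print(from_plane14(encoded))
```

Because the mapping is reversible, what comes out the other end is
still an ordinary ASCII-repertoire tag.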

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/