Non-ASCII language tags (was: Re: New draft submitted of 3066bis...)

Mon Nov 8 17:14:56 CET 2004

I agree that this, or something like it, is a reasonable clarification.

Mark

----- Original Message ----- 
From: "Doug Ewell" <dewell at adelphia.net>
To: <ietf-languages at alvestrand.no>
Sent: Saturday, November 06, 2004 16:36
Subject: Non-ASCII language tags (was: Re: New draft submitted of
3066bis...)

> Here's my attempt to summarize what is going on with this talk of
> non-ASCII language tags.  (Warning: non-ASCII characters in this message
> will probably get clobbered in the digest.)
>
> I don't think anybody is really proposing to use or allow blatantly
> non-ASCII values such as "ελ" or "ру" in xml:lang attributes, because
> that would be, well, stupid.  I agree with Harald and John that any
> deliberate attempt to create a non-ASCII language tag, whether due to a
> characterization of RFC 3066 as "advisory" or for any other reason, is
> doomed to failure, and deserving of scorn and ridicule as well.
>
> Elliotte's concern seems to be what would happen if a user with a
> Turkish locale sent or received a language tag, say "it" for Italian,
> and it got uppercased to "İT", or conversely if the uppercase tag "IT"
> got lowercased to "ıt".
> These transformed tags would contain non-ASCII characters because the
> case folding algorithm applied to them was locale-dependent.  Elliotte
> proposes to prevent this problem by specifying in RFC 3066bis that only
> the "normal," or "traditional," or if you will, "English" casing
> conventions should be applied to language tags.
>
> This seems fair, and I propose that after the existing passage in
> Section 2.1:
>
> "The tags and their subtags, including private-use and extensions, are
> to be treated as case insensitive: there exist conventions for the
> capitalization of some of them, but these should not be taken to carry
> meaning."
>
> the following sentence (or something like it) should be added:
>
> "'Case insensitivity' for language tags SHALL refer specifically to the
> locale-independent relationship between the uppercase US-ASCII letters
> 'A' through 'Z' and the corresponding lowercase letters 'a' through
> 'z'."
>
> This should satisfy Elliotte's concern without dragging Unicode case
> folding or English ethnocentrism into the mix.
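
For illustration only, here is a minimal Python sketch of the comparison the proposed sentence describes. The names ascii_lower and tags_equal are hypothetical and come from neither the draft nor any library. The sketch assumes an explicit A-Z to a-z table is an acceptable way to express the relationship, so that no locale or Unicode case-folding rule is ever consulted.

```python
# Hypothetical helpers; not taken from RFC 3066bis or any library.
_UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
_LOWER = "abcdefghijklmnopqrstuvwxyz"

# Explicit table: only the 26 US-ASCII uppercase letters are mapped.
_TO_LOWER = str.maketrans(_UPPER, _LOWER)


def ascii_lower(tag: str) -> str:
    """Lowercase 'A'..'Z' only; every other character is left untouched."""
    return tag.translate(_TO_LOWER)


def tags_equal(a: str, b: str) -> bool:
    """Compare two language tags, ignoring case for US-ASCII letters only."""
    return ascii_lower(a) == ascii_lower(b)
```

With these helpers, tags_equal("IT", "it") holds, while a tag that has already been damaged into "İT" or "ıt" does not compare equal to "it", because the dotted and dotless letters fall outside the table.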
>
> Actually specifying the quasi-mathematical relationship 'A'-'a' (as
> suggested) would be inappropriate for this definition because it assumes
> a "mutually contiguous" allocation scheme that is not necessarily
> present in all character sets, although all modern character sets do
> exhibit it.  (In particular, 'A'-'a' is a negative value in ASCII, but a
> positive value in EBCDIC.)
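
The sign difference is easy to check. The sketch below assumes Python's bundled "cp037" codec as a representative EBCDIC code page (other EBCDIC variants exist), and letter_gap is a hypothetical helper name.

```python
def letter_gap(encoding: str) -> int:
    """Byte value of 'A' minus byte value of 'a' in a single-byte encoding."""
    return "A".encode(encoding)[0] - "a".encode(encoding)[0]


ascii_gap = letter_gap("ascii")    # 0x41 - 0x61
ebcdic_gap = letter_gap("cp037")   # 0xC1 - 0x81

print(ascii_gap, ebcdic_gap)       # prints: -32 64
```

In the same code page 'I' and 'J' are not adjacent either (0xC9 and 0xD1), which is one more reason to define the relationship as a letter-to-letter correspondence rather than as arithmetic on code values.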
>
> I should note that the passage in RFC 3066bis just before this:
>
> "Note that although [RFC 2234][12] refers to octets, the language tags
> described in this document are sequences of characters from the US-ASCII
> repertoire. Language tags may be used in documents and applications that
> use other encodings, so long as these encompass the US-ASCII repertoire.
> An example of this would be an XML document that uses the Unicode
> UTF-16LE encoding."
>
> was something that I asked Mark and Addison for, several months ago, to
> replace the previous wording which stated that language tags must be in
> US-ASCII.  There are two scenarios in which they are not.
>
> First, characters in the ASCII repertoire can exist in character sets
> other than ASCII and supersets of ASCII.  UTF-16 is a good example,
> mentioned in the current wording, but you could also represent a
> language tag in EBCDIC or BOCU-1, neither of which uses the same code
> points as ASCII (that is, 'A' ≠ 0x41).  Then there are character sets
> like FIELDATA and the old Sinclair ZX81 character set, neither of which
> supports lowercase letters at all.  You could even construct a perfectly
> valid, uppercase language tag in one of those, although of course it
> could not be lowercased.
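
To make the repertoire-versus-encoding distinction concrete, here is a small Python sketch. It assumes the codecs "utf-16-le" and "cp037" (one EBCDIC code page) as stand-ins for the encodings mentioned above. in_ascii_repertoire and survives_encoding are hypothetical names, and the check covers only the character repertoire (letters, digits, hyphen), not the full tag syntax.

```python
import string

# Characters a language tag may draw on: US-ASCII letters, digits, hyphen.
ALLOWED = set(string.ascii_letters + string.digits + "-")


def in_ascii_repertoire(tag: str) -> bool:
    """True if every character of the tag is in the US-ASCII tag repertoire."""
    return bool(tag) and all(ch in ALLOWED for ch in tag)


def survives_encoding(tag: str, encoding: str) -> bool:
    """True if the tag round-trips through the given encoding unchanged."""
    return tag.encode(encoding).decode(encoding) == tag


tag = "en-US"
byte_forms = {enc: tag.encode(enc) for enc in ("ascii", "utf-16-le", "cp037")}
```

The three byte sequences in byte_forms all differ, yet each decodes to the same five characters, which is the sense in which the tag stays within the US-ASCII repertoire regardless of the encoding that carries it.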
>
> Second, there are those gosh-darned Unicode Plane 14 tag characters.  I
> know, they are "strongly discouraged" and "valid only in special
> protocols during a lunar eclipse in the middle of Ramadan."  But they
> are still defined as representing an RFC 3066 (or successor) language
> tag, and they are one-to-one mirrors of the ASCII repertoire.  While RFC
> 3066bis (and predecessors) state that language tags must be from the
> ASCII repertoire, the Plane 14 mechanism -- in those rare cases where it
> is permitted -- should be seen as a legitimate special-purpose
> application which modifies the target character set using a one-to-one
> mapping.  A language tag expressed in Plane 14 tag characters (the
> example given here was clobbered in the archive), used as part of an
> appropriate protocol, should not be considered a violation of RFC
> 3066bis in the way that <span lang="ελ"> would be.
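
The one-to-one mapping can be sketched in a few lines of Python. This assumes the Unicode layout in which tag characters U+E0020 through U+E007E mirror ASCII 0x20 through 0x7E at a fixed offset, introduced by U+E0001 LANGUAGE TAG. to_plane14 and from_plane14 are hypothetical names, and the sketch ignores everything a real protocol would have to say about when such tags are permitted.

```python
TAG_OFFSET = 0xE0000          # tag characters sit at ASCII value + 0xE0000
LANGUAGE_TAG = "\U000E0001"   # U+E0001 LANGUAGE TAG introducer


def to_plane14(tag: str) -> str:
    """Map a printable-ASCII language tag onto Plane 14 tag characters."""
    if not all(0x20 <= ord(ch) <= 0x7E for ch in tag):
        raise ValueError("tag must consist of printable ASCII characters")
    return LANGUAGE_TAG + "".join(chr(ord(ch) + TAG_OFFSET) for ch in tag)


def from_plane14(text: str) -> str:
    """Recover the ASCII language tag from its Plane 14 form."""
    if not text.startswith(LANGUAGE_TAG):
        raise ValueError("missing U+E0001 LANGUAGE TAG")
    body = text[1:]
    if not all(0xE0020 <= ord(ch) <= 0xE007E for ch in body):
        raise ValueError("unexpected character after LANGUAGE TAG")
    return "".join(chr(ord(ch) - TAG_OFFSET) for ch in body)
```

Because the mapping is a fixed offset, from_plane14(to_plane14(tag)) returns the original ASCII tag, which is what distinguishes this case from a tag written in Greek or Cyrillic letters.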
>
> -Doug Ewell
>  Fullerton, California
>  http://users.adelphia.net/~dewell/