FW: Proposed Successor to RFC 3066 (language tags)

Thu Nov 20 01:44:07 CET 2003

Philippe posted the message below to the Unicode list in response to an announcement there from Mark and me. I am purposely responding on this list with no cross-post to the Unicode list, since this topic is off-topic for that list.

Thanks again Philippe for your comments. My response follows:

> However the problem with that scheme is its new use of characters "%" and
> "=". There are a lot of applications that were not expecting 

1. We considered the compatibility problem when making our proposal. Mark and I decided to include the two "incompatible" characters as the better choice in design. It remains to be seen whether our choice of escape characters, etc., is a problem.

> I think it's quite strange that these extensions have not used the existing
> restricted encoding set to encode them, instead of relying on "%" and "=".

2. The use of "_" and "." would also be illegal under the existing standard. Pick your favorite ASCII characters... we chose these two, but any two you pick other than ALPHANUM and HYPHEN-MINUS would be additions to the range of characters. I think you're thinking of domain names (URIs have a less restrictive set). Other characters or escaping mechanisms can be substituted if they make more sense.

> Under this new scheme, the following language tag may be valid:
> "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
> which here would mean: {
>     language="sr"; // Serbian
>     script="Latn"; // Latin
>     region="SP"; // Serbia-Montenegro

3. Note that "SP" is unassigned. See the semi-official table: http://www.iso.ch/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/iso_3166-1_decoding_table.html. According to this draft, the code for "Serbia and Montenegro" would be CS1, a specially registered value.

> For GET URLs, these extra "%" and "=" will need to be URL-encoded to get
> through correctly, as the following would become possible and prone to

4. URLs must obey the rules for URLs. Values placed in a "GET" URL that used an extended language tag as a value would be no exception. This may be unattractive, but doesn't invalidate the design. I'm more concerned about direct consumers, such as the Accept-Language header in HTTP, than I am about transport encodings. The characters chosen were picked from the list of characters valid in HTTP headers and URIs.

> language codes should be the shortest ISO-639 codes (is it true for a few
> legacy languages which previously were coded with 3 letters and upgraded to
> 2-letter codes, until there was a policy not to do it anymore in the
> future?)

5. RFC 3066 requires the shortest form of both ISO 639 and ISO 3166 codes to be used (the latter being a moot point currently). The new draft does not change this, but does extend it in the area of stability.

> But at least this draft offers a good starting point to indicate locales
> more precisely.

6. No, these are tags for the identification of languages. A "locales extension" is something Mark and I have in mind, but this isn't it.

> and for documenting the legal values of variant codes (either a year
> with possible era, or a registered tag)

7. Don't forget private use variant tags.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.

-----Original Message-----
From: Philippe Verdy [mailto:verdy_p at wanadoo.fr]
Sent: Wednesday, November 19, 2003 3:51 PM
To: aphillips at webmethods.com
Cc: unicode at unicode.org
Subject: Re: Proposed Successor to RFC 3066 (language tags)

From: Addison Phillips [wM]
> Please note that there is a discussion list for this topic at:
> ietf-languages at iana.org
>
> While Mark and I welcome your comments here or privately, off-list, you
> can best be a part of the discussion by joining that list. Join the list
> by sending a request email to: ietf-languages-request at iana.org

I note that the language tags proposal includes the following EBNF
productions for extensions that may be appended after the language code,
script code, region code, variant code:

extensions  = "-x" 1* ("-" key "=" value)
key  = ALPHA *alphanum
value  = 1* utf8uri
alphanum  = (ALPHA / DIGIT)
utf8uri  = (ALPHA / DIGIT / 1*4 ("%" 2 HEXDIG))

Under this new scheme, the following language tag may be valid:
"sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
which here would mean: {
    language="sr"; // Serbian
    script="Latn"; // Latin
    region="SP"; // Serbia-Montenegro
    variant="2003";
    extensions="-x"; {
        href="http://www.iana.org/"
        version="1.0"
    }
}
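
A minimal sketch, in Python, of how a consumer might split such a tag according to the productions quoted above. The function name parse_extensions and the regular expression are hypothetical, not part of the draft, and the sketch assumes that "-x-" appears only once, as the extension marker.

```python
import re
from urllib.parse import unquote

# key = ALPHA *alphanum ; value = 1*(ALPHA / DIGIT / "%" 2HEXDIG)
# (hypothetical pattern written from the productions quoted above)
_PAIR = re.compile(
    r"([A-Za-z][A-Za-z0-9]*)=((?:[A-Za-z0-9]|%[0-9A-Fa-f]{2})+)"
)


def parse_extensions(tag):
    """Return (subtags, extensions) for a tag that may carry "-x-" extensions."""
    head, marker, tail = tag.partition("-x-")
    subtags = head.split("-")
    extensions = {}
    if not marker:
        return subtags, extensions
    # A value cannot contain a raw "-", so "-" safely separates the pairs.
    for item in tail.split("-"):
        match = _PAIR.fullmatch(item)
        if match is None:
            raise ValueError(f"not a key=value extension: {item!r}")
        extensions[match.group(1)] = unquote(match.group(2))
    return subtags, extensions


subtags, extensions = parse_extensions(
    "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
)
```

For the example tag, subtags holds the language, script, region and variant in order, and extensions maps href and version to their decoded values.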

However the problem with that scheme is its new use of characters "%" and
"=". There are a lot of applications that were not expecting anything else
in this field than just alphanum and "-" or "_" or ".", so that the language
tag could safely be used without specific escaping within URIs (for example
in HTTP GET URLs) or as options of a MIME type (I take a sample here, which
may not correspond to an existing option of the "text/plain" MIME type):

Content-Type: text/plain; charset=UTF-8;
lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0

This may break the compatibility of some parsers if such "extended language
tags" are found there, as there are two "=" signs within the value of the
"lang=" option.

For GET URLs, these extra "%" and "=" will need to be URL-encoded to get
through correctly, as the following would become possible and prone to
generate form data parsing errors:

http://www.anysite.domain/process-form.cgi?lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0
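
Both transport cases described above, the header parameter and the GET query string, can be sketched in Python. The two parameter-splitting helpers are hypothetical stand-ins for existing parsers, not code from any real library; parse_qs and quote are the standard urllib.parse functions.

```python
from urllib.parse import parse_qs, quote

TAG = "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"


def naive_param(param):
    """Hypothetical strict parser: expects exactly one "=" per parameter."""
    fields = param.split("=")
    if len(fields) != 2:
        raise ValueError(f"expected name=value, got {len(fields)} fields")
    return fields[0], fields[1]


def first_equals_param(param):
    """Split only at the first "=", leaving the rest of the value intact."""
    name, _, value = param.partition("=")
    return name, value


# Header parameter case: the extra "=" signs trip the strict parser.
try:
    naive_param("lang=" + TAG)
    naive_failed = False
except ValueError:
    naive_failed = True
name, value = first_equals_param("lang=" + TAG)

# GET case: the form parser decodes the tag's own "%XX" escapes,
# so the receiver no longer sees the tag as it was written.
decoded_once = parse_qs("lang=" + TAG)["lang"][0]

# URL-encoding the whole tag first ("%" -> "%25", "=" -> "%3D")
# lets it survive the round trip unchanged.
safe_query = "lang=" + quote(TAG, safe="")
round_tripped = parse_qs(safe_query)["lang"][0]
```

The design point is that the tag needs one extra layer of encoding whenever the carrier format already gives "%" or "=" a meaning of its own.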

I think it's quite strange that these extensions have not used the existing
restricted encoding set to encode them, instead of relying on "%" and "=".
Why not use "_" instead of "=" and "." instead of "%", like this:
"sr-Latn-SP-2003-x-href_http.3A.2F.2Fwww.2Eiana.2Eorg.2F-version_1.2E0"
(same meaning as the first example above).
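
As a sketch of why the two spellings carry the same information, the substitution is a one-to-one character mapping, assuming keys and values contain only alphanumerics and escapes as in the productions above. The helper names are hypothetical.

```python
# One-to-one character substitutions between the two proposed spellings.
TO_ALTERNATIVE = str.maketrans({"=": "_", "%": "."})
FROM_ALTERNATIVE = str.maketrans({"_": "=", ".": "%"})


def to_alternative(tag):
    """Rewrite "=" as "_" and "%" as "." (hypothetical helper)."""
    return tag.translate(TO_ALTERNATIVE)


def from_alternative(tag):
    """Inverse mapping; valid only if "_" and "." never occur unescaped."""
    return tag.translate(FROM_ALTERNATIVE)


draft_form = "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
alternative_form = to_alternative(draft_form)
```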

But at least this draft offers a good starting point to indicate locales
more precisely.

I fully support the new reference to the ISO-15924 standard for the script
code, and for documenting the legal values of variant codes (either a year
with possible era, or a registered tag), as well as clearly indicating that
language codes should be the shortest ISO-639 codes (is it true for a few
legacy languages which previously were coded with 3 letters and upgraded to
2-letter codes, until there was a policy not to do it anymore in the
future?)

I don't know where this affects Unicode, except possibly in its normative
data tables, which may contain future language code conditions, or in
language tags inserted in Unicode-encoded text. Does Unicode need these
extensions?