FW: Proposed Successor to RFC 3066 (language tags)

Addison Phillips [wM] aphillips at webmethods.com
Thu Nov 20 01:44:07 CET 2003


Philippe posted the message below to the Unicode list in response to an announcement there from Mark and me. I am purposely responding on this list, with no cross-post to the Unicode list, since this topic is off-topic for that list.

Thanks again Philippe for your comments. My response follows:

> However the problem with that scheme is its new use of characters "%" and
> "=". There are a lot of applications that were not expecting 

1. We considered the compatibility problem when making our proposal. Mark and I decided that including the two "incompatible" characters was the better design choice. It remains to be seen whether our choice of escape characters, etc., is a problem. 

> I think it's quite strange that these extensions have not used the existing
> restricted encoding set to encode them, instead of relying on "%" and "=".

2. The use of "_" and "." would also be illegal under the existing standard. Pick your favorite ASCII characters... we chose these two, but any two you pick other than ALPHANUM and HYPHEN-MINUS would be additions to the range of characters. I think you're thinking of domain names (URIs have a less restrictive set). Other characters or escaping mechanisms can be substituted if they make more sense.
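
As a rough illustration of point 2, here is a small Python sketch that checks a tag against the RFC 3066 syntax (a primary subtag of 1 to 8 letters, then hyphen-separated subtags of 1 to 8 letters or digits). The function names are made up for this example and are not part of any draft. Both the "%"/"=" form and the "_"/"." form fall outside that syntax:

```python
import re

# RFC 3066 syntax only: 1*8ALPHA, then any number of "-" 1*8(ALPHA / DIGIT).
# This checks the shape of the tag, not whether the subtags are registered.
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_rfc3066_syntax(tag):
    """Return True if the whole tag matches the RFC 3066 grammar."""
    return RFC3066_TAG.fullmatch(tag) is not None


def extra_characters(tag):
    """List the characters that are neither ASCII alphanumerics nor "-"."""
    return sorted({ch for ch in tag
                   if ch != "-" and not (ch.isascii() and ch.isalnum())})


percent_form = ("sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F"
                "-version=1%2E0")
dotted_form = ("sr-Latn-SP-2003-x-href_http.3A.2F.2Fwww.2Eiana.2Eorg.2F"
               "-version_1.2E0")

for candidate in ("sr-Latn-SP-2003", percent_form, dotted_form):
    print(is_rfc3066_syntax(candidate), extra_characters(candidate))
```

Either pair of characters extends the repertoire, so the choice between them is a matter of which environments have to carry the tag.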

> Under this new scheme, the following language tag may be valid:
> "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
> which here would mean: {
>     language="sr"; // Serbian
>     script="Latn"; // Latin
>     region="SP"; // Serbia-Montenegro

3. Note that "SP" is unassigned. Click here for the semi-official table: http://www.iso.ch/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/iso_3166-1_decoding_table.html. According to this draft, the code for "Serbia and Montenegro" would be CS1, a specially registered value.

> For GET URLs, these extra "%" and "=" will need to be URL-encoded to get
> through correctly, as the following would become possible and prone to

4. URLs must obey the rules for URLs. Values placed in a "GET" URL that used an extended language tag as a value would be no exception. This may be unattractive, but doesn't invalidate the design. I'm more concerned about direct consumers, such as the Accept-Language header in HTTP, than I am about transport encodings. The characters chosen were picked from the list of characters valid in HTTP headers and URIs.

> language codes should be the shortest ISO-639 codes (is it true for a few
> legacy languages which previously were coded with 3 letters and upgraded to
> 2-letter codes, until there was a policy not to do it anymore in the
> future?)

5. RFC 3066 requires the shortest form of both ISO639 and ISO3166 to be used (the latter being a moot point currently). The new draft does not change this, but does extend it in the area of stability.

> But at least this draft offers a good starting point to indicate locales
> more precisely.

6. No, these are tags for the identification of languages. A "locales extension" is something Mark and I have in mind, but this isn't it.

> and for documenting the legal values of variant codes (either a year
> with possible era, or a registered tag)

7. Don't forget private use variant tags.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.

-----Original Message-----
From: Philippe Verdy [mailto:verdy_p at wanadoo.fr]
Sent: Wednesday, November 19, 2003 3:51 PM
To: aphillips at webmethods.com
Cc: unicode at unicode.org
Subject: Re: Proposed Successor to RFC 3066 (language tags)


From: Addison Phillips [wM]
> Please note that there is a discussion list for this topic at:
> ietf-languages at iana.org
>
> While Mark and I welcome your comments here or privately, off-list, you
> can best be a part of the discussion by joining that list. Join the list
> by sending a request email to: ietf-languages-request at iana.org

I note that the language tags proposal includes the following EBNF
productions for extensions that may be appended after the language code,
script code, region code, variant code:

extensions  = "-x" 1* ("-" key "=" value)
key  = ALPHA *alphanum
value  = 1* utf8uri
alphanum  = (ALPHA / DIGIT)
utf8uri  = (ALPHA / DIGIT / 1*4 ("%" 2 HEXDIG))

Under this new scheme, the following language tag may be valid:
"sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
which here would mean: {
    language="sr"; // Serbian
    script="Latn"; // Latin
    region="SP"; // Serbia-Montenegro
    variant="2003";
    extensions="-x"; {
        href="http://www.iana.org/"
        version="1.0"
    }
}
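
To make the reading above concrete, here is a minimal Python sketch of how a consumer might split off and decode the extensions. It follows the productions quoted above literally; the function name and the choice to locate the extensions at the first "-x-" are assumptions of this sketch, not something the draft specifies:

```python
import re
from urllib.parse import unquote

# One '"-" key "=" value' unit from the quoted grammar:
#   key   = ALPHA *alphanum
#   value = 1*(ALPHA / DIGIT / "%" 2HEXDIG)
PAIR = re.compile(
    r"-([A-Za-z][A-Za-z0-9]*)=((?:[A-Za-z0-9]|%[0-9A-Fa-f]{2})+)"
)


def split_extensions(tag):
    """Return (base_tag, {key: decoded_value}) for an extended tag."""
    base, marker, rest = tag.partition("-x-")
    if not marker:
        return tag, {}          # no extensions present
    rest = "-" + rest           # restore the "-" that starts the first pair
    pairs = {}
    pos = 0
    while pos < len(rest):
        match = PAIR.match(rest, pos)
        if match is None:
            raise ValueError("malformed extension at offset %d" % pos)
        # Undo the %XX escapes, interpreting the bytes as UTF-8.
        pairs[match.group(1)] = unquote(match.group(2))
        pos = match.end()
    return base, pairs


tag = "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
base, extensions = split_extensions(tag)
print(base)        # sr-Latn-SP-2003
print(extensions)  # {'href': 'http://www.iana.org/', 'version': '1.0'}
```

Because "-" cannot appear unescaped inside a value, the next "-" always ends the current key/value pair, which is what lets the pairs be read off one after another.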

However, the problem with that scheme is its new use of the characters "%"
and "=". There are a lot of applications that were not expecting anything
in this field other than alphanum and "-" or "_" or ".", so that the language
tag could safely be used without specific escaping within URIs (for example
in HTTP GET URLs) or as options of a MIME type (I take a sample here, which
may not correspond to an existing option of the "text/plain" MIME type):

Content-Type: text/plain; charset=UTF-8;
 lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0

This may break the compatibility of some parsers if such "extended language
tags" are found there, as there are two extra "=" signs within the value of
the "lang=" option.
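
A hedged Python sketch of the kind of breakage meant here. Both functions are hypothetical and written only for this illustration; they are not taken from any real MIME library. The first assumes exactly one "=" per parameter, the second splits only at the first "=":

```python
def naive_params(header_value):
    """Fragile parser: assumes every parameter contains exactly one "="."""
    params = {}
    for part in header_value.split(";")[1:]:
        part = part.strip()
        if part:
            name, value = part.split("=")   # ValueError if more than one "="
            params[name] = value
    return params


def tolerant_params(header_value):
    """Splits each parameter at the first "=" only; the rest is the value."""
    params = {}
    for part in header_value.split(";")[1:]:
        part = part.strip()
        if part:
            name, _, value = part.partition("=")
            params[name] = value
    return params


TAG = "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
HEADER_VALUE = "text/plain; charset=UTF-8; lang=" + TAG

try:
    naive_params(HEADER_VALUE)
    naive_ok = True
except ValueError:
    naive_ok = False

print(naive_ok)                               # False
print(tolerant_params(HEADER_VALUE)["lang"])  # the tag, unchanged
```

Whether deployed parsers behave like the first or the second function is exactly the open compatibility question.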

For GET URLs, these extra "%" and "=" will need to be URL-encoded to get
through correctly, as the following would become possible and prone to
generate form data parsing errors:

http://www.anysite.domain/process-form.cgi?lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0
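
For illustration, this is what Python's standard urllib.parse does with such a query string (a sketch; only the variable names are mine). Placed in the URL unescaped, the tag survives the "=" signs but has its own "%" escapes decoded one level too early; escaping it first round-trips cleanly:

```python
from urllib.parse import parse_qs, urlencode

TAG = "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"

# Tag pasted into the query string as-is: the form decoder treats the
# tag's own %XX sequences as URL escapes and removes them.
raw_query = "lang=" + TAG
decoded = parse_qs(raw_query)["lang"][0]
print(decoded)   # sr-Latn-SP-2003-x-href=http://www.iana.org/-version=1.0

# Tag URL-encoded first: "%" becomes "%25" and "=" becomes "%3D", so the
# receiver gets back exactly the tag that was sent.
safe_query = urlencode({"lang": TAG})
round_tripped = parse_qs(safe_query)["lang"][0]
print(round_tripped == TAG)   # True
```

The decoded value in the first case is no longer a valid tag under the proposed grammar, since ":", "/" and "." are not allowed unescaped in a value.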

I think it's quite strange that these extensions have not used the existing
restricted encoding set to encode them, instead of relying on "%" and "=".
Why not use "_" instead of "=" and "." instead of "%", like this:
"sr-Latn-SP-2003-x-href_http.3A.2F.2Fwww.2Eiana.2Eorg.2F-version_1.2E0"
(same meaning as the first example above).

But at least this draft offers a good starting point to indicate locales
more precisely.

I fully support the new reference to the ISO-15924 standard for the script
code, and for documenting the legal values of variant codes (either a year
with possible era, or a registered tag), as well as clearly indicating that
language codes should be the shortest ISO-639 codes (is it true for a few
legacy languages which previously were coded with 3 letters and upgraded to
2-letter codes, until there was a policy not to do it anymore in the
future?)

How this affects Unicode, I don't know, except possibly through its
normative data tables, which may contain future language code conditions, or
through language tags inserted in Unicode-encoded texts. Does Unicode need
these extensions?


