Preferred Values for Irregular Tags

Phillips, Addison addison at
Wed Jan 20 19:47:19 CET 2010

The grandfathered RFC 3066 tags already have to be handled exceptionally in code. What Mark is trying to do is map all of them to “normal” (what RFC 5646 calls “regular”) language tags. Otherwise a tag such as “en-GB-oed” is treated as if it were a single atomic subtag.

This doesn’t pose a problem for existing users of the tag. They don’t have to change to a new variant (although it would be nice of them to migrate to it). Their existing content will work as well as ever. However, software that sees the tag “en-GB-oed” can transform it to (for example) “en-GB-oxford” and then do useful things, such as noticing that it is related to “en-GB” when doing text-to-speech. And other users can apply more or fewer subtags to suit their needs (“en-oxford”, “en-Brai-GB-oxford”, or absurdly “en-GB-fonipa-oxford”).
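
As a rough sketch of that transformation: the mapping table and function names below are made up for illustration, and “oxford” is only the example variant used above, not a registered subtag. The fallback is plain right-to-left truncation, not a full RFC 4647 lookup.

```python
# Hypothetical preferred-value table. "oxford" is the example variant from
# this message, not a registered subtag.
PREFERRED = {"en-gb-oed": "en-GB-oxford"}


def to_regular(tag: str) -> str:
    """Return the regular form of a grandfathered tag, or the tag unchanged."""
    return PREFERRED.get(tag.lower(), tag)


def fallback_chain(tag: str) -> list[str]:
    """List the tag and its progressively shorter prefixes, most specific first."""
    subtags = to_regular(tag).split("-")
    return ["-".join(subtags[:n]) for n in range(len(subtags), 0, -1)]


print(fallback_chain("en-GB-oed"))
```

With the mapping applied, “en-GB” shows up in the chain, so a text-to-speech engine that only has an “en-GB” voice could still pick it.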

Some of the other cases that Mark raises seem reasonable. I think I’d just ignore the deprecated ones.

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

From: ietf-languages-bounces at [mailto:ietf-languages-bounces at] On Behalf Of Mark Davis ☕
Sent: Wednesday, January 20, 2010 10:37 AM
To: Michael Everson
Cc: ietflang IETF Languages Discussion
Subject: Re: Preferred Values for Irregular Tags

The grandfathered tags behave differently from anything else. All the other tags are productive: you can combine them in different ways with expected results. The grandfathered tags are atomic; you can't combine one of them with, say, a region. Moreover, for regular tags you can write APIs that deal with their structure, returning the base language code, script code, etc. The uniformity of program APIs is of extreme importance when you are dealing with massive amounts of program code.

Of course we could parse en-GB-oed. But it doesn't fit into the regular ABNF production rules, and so doesn't work well in APIs.
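
A rough sketch of the kind of API meant here, assuming a simplified subset of the RFC 5646 langtag grammar (language, script, region, and variants only; no extlang, extensions, or private use). `LangTag` and `parse_regular` are illustrative names, not an existing library.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Simplified subset of the regular langtag grammar (an assumption of this
# sketch): language[-script][-region](-variant)*
_LANGTAG = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:-(?P<script>[a-z]{4}))?"
    r"(?:-(?P<region>[a-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)$",
    re.IGNORECASE,
)


class LangTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]
    variants: Tuple[str, ...]


def parse_regular(tag: str) -> Optional[LangTag]:
    """Split a tag into components, or return None if it does not fit the grammar."""
    m = _LANGTAG.match(tag)
    if m is None:
        return None
    script = m.group("script")
    region = m.group("region")
    variants = tuple(v.lower() for v in m.group("variants").split("-")[1:])
    return LangTag(
        m.group("language").lower(),
        script.capitalize() if script else None,
        region.upper() if region else None,
        variants,
    )


print(parse_regular("en-Brai-GB-oxford"))
print(parse_regular("en-GB-oed"))  # "oed" is too short to be a variant
```

The three-letter “oed” cannot match the variant production, so the irregular tag falls outside the component API and has to be special-cased before it reaches it.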

Out of the billions of possible language tags (without even counting combinations using variants), there are literally only a handful of grandfathered codes that cannot be correctly mapped to regular language tags. If we can fix these few, then there is nothing standing in the way of everyone being able to use all of them effectively.

That is, for existing data, we (and others like us) would convert tags like en-GB-oed on input to regular tags; then the information is still accessible. Otherwise our only choices are to dump the data or map to the 'closest' code.


On Wed, Jan 20, 2010 at 10:23, Michael Everson <everson at> wrote:
On 20 Jan 2010, at 01:58, Mark Davis ☕ wrote:
> Why do this? Well, at Google we convert anything that has an
> irregular format to a regular format.
Which means what? Your programmers aren't able to identify and parse
the string "en-GB-oed"? Guess what, Mark... that has been in use since
2003. There's data out there in it.

Please explain what it is that you are up to.

Michael Everson *