Sample IANA language subtag registry

Doug Ewell dewell at adelphia.net
Wed Jul 7 06:02:23 CEST 2004


Section 3.2 of draft-phillips-langtags-04 describes the format of the
IANA language subtag registry, which would be a normative part of RFC
3066bis.

This registry is to be assembled by the language subtag reviewer, but to
get a jump on implementation and to solidify my understanding of
conformance issues, I've gone ahead and created my own copy of what the
registry might look like:

    http://users.adelphia.net/~dewell/lstreg.txt

This is intended as a sample of the final registry, and at present
contains only the standard codes:

* ISO 639 alpha-2 and alpha-3 language codes
* ISO 15924 alpha-4 script codes
* ISO 3166 alpha-2 and UN M.49 numeric region codes

and no entries yet for registered subtags or grandfathered whole-tags.

This is not intended in any way to supersede or pre-empt the work of the
language subtag reviewer.  If it throws any light on issues surrounding
the registry and its use, or raises any questions that need to be
answered before RFC 3066bis is ratified, so much the better.

Comments and criticisms on this sample registry are STRONGLY solicited.

Some notes regarding this implementation:

1.  All description fields have been normalized to ASCII, in accordance
with Section 3.2.  This was done by simply stripping the diacritics from
a few language and region names; for instance, "Côte d'Ivoire" lost its
circumflex.  I don't know if that approach is correct for "Bokmål."
Some items in parentheses in the ISO 15924 lists were shortened or
deleted because their only role was to note different spellings due to
the variant use of diacritics.
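
The stripping described in note 1 can be sketched roughly as follows. This is only an illustration in Python, assuming canonical decomposition (NFD) followed by removal of combining marks; the function name strip_diacritics is hypothetical and is not part of the draft or of the sample registry.

```python
import unicodedata

def strip_diacritics(name: str) -> str:
    """Decompose each character (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("Côte d'Ivoire"))  # Cote d'Ivoire
print(strip_diacritics("Bokmål"))         # Bokmal
```

Letters that have no canonical decomposition (for example "ø") pass through this function unchanged, so a final check such as str.isascii() would still be needed before writing a description field.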

2.  All ISO 639 and ISO 3166 names with multiple parts separated by
semicolons were truncated at the first semicolon.  This applies to names
like "Norwegian Bokmål; Bokmål, Norwegian" and had to be done because the
registry file itself is semicolon-delimited.  The alternative would be to
quote any field that contains a semicolon, which would make the file more
painful to parse.
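
The truncation in note 2 amounts to keeping everything before the first semicolon. A minimal Python sketch, with the hypothetical helper name first_name_only:

```python
def first_name_only(name: str) -> str:
    """Keep only the part of a multi-part name before the first semicolon."""
    return name.split(";", 1)[0].strip()

print(first_name_only("Norwegian Bokmål; Bokmål, Norwegian"))  # Norwegian Bokmål
print(first_name_only("English"))                              # English
```

Names without a semicolon are returned unchanged, so the helper can be applied to every description field.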

3.  All non-canonical region names were updated to reflect the modern
name, if any.  For example, the region code HV ("Upper Volta") is a true
alias for BF ("Burkina Faso"), and so the name for HV has been changed
to "Burkina Faso" as well.

In keeping with this, please note that I renamed YU from "Yugoslavia" to
"Serbia and Montenegro."  This is contrary to the example in Section
3.2, but I think it is correct because YU should actually serve as an
alias for the canonical 891 "Serbia and Montenegro."  The code YU was
most recently used to refer to the modern country now called Serbia and
Montenegro, not the pre-1990s nation of Yugoslavia that contained six
republics.  This might be controversial.  Discussion of this detail is
especially welcome.
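
The renaming rule in note 3 can be pictured as a lookup through an alias table. The sketch below is Python and uses only the two aliases mentioned above; the table names and the function registry_name are assumptions made for illustration, not the structure of the actual registry.

```python
# Alias code -> canonical code, as described in note 3.
ALIAS_OF = {"HV": "BF", "YU": "891"}

# Canonical code -> modern name.
CANONICAL_NAME = {"BF": "Burkina Faso", "891": "Serbia and Montenegro"}

def registry_name(code: str, listed_name: str) -> str:
    """Return the canonical code's name for an alias, else the listed name."""
    canonical = ALIAS_OF.get(code)
    if canonical is None:
        return listed_name
    return CANONICAL_NAME.get(canonical, listed_name)

print(registry_name("HV", "Upper Volta"))   # Burkina Faso
print(registry_name("YU", "Yugoslavia"))    # Serbia and Montenegro
```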

4.  Codes are sorted by code element (not by description) within their
respective categories, except that all alpha-2 language codes appear
before alpha-3 codes, and alpha-2 region codes appear before numeric
codes.
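
The ordering in note 4 can be expressed as two sort keys. This is a Python sketch under the assumption that codes are compared as plain strings within each group; the function names are hypothetical.

```python
def language_sort_key(code: str):
    """Alpha-2 codes first, then alpha-3; alphabetical within each length."""
    return (len(code), code)

def region_sort_key(code: str):
    """Alpha-2 codes first, then numeric codes; ordered within each group."""
    return (code.isdigit(), code)

print(sorted(["eng", "en", "aa", "fr", "ace"], key=language_sort_key))
# ['aa', 'en', 'fr', 'ace', 'eng']
print(sorted(["US", "419", "BF", "001"], key=region_sort_key))
# ['BF', 'US', '001', '419']
```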

5.  Deprecated and/or alias codes are intermixed with current and/or
canonical codes, rather than being listed in a separate group.

6.  The date field is set to today's date (2004-07-07).  Obviously this
is for illustration and will not be the date of adoption of the subtag
registry.

7.  Deprecated and changed codes include a comment showing when each
code was deprecated or changed within its respective standard.  I used
the ISO 639 Change Notice page and Clive Feather's ISO 3166 summary page
<http://www.davros.org/misc/iso3166.html> to get these dates.  Someone
with a real copy of ISO 3166-3 might be able to do better.  Comments for
the older-but-canonical region codes (TP and ZR) show the code that was
subsequently assigned in ISO 3166 and mapped as an alias in the
registry.

8.  In my Web page "Supplementary codes for RFC 3066bis," I mentioned
the UN numeric code 172 ("Commonwealth of Independent States"), which
was found among the economic-grouping codes that are not suitable for
language tags.  I've included region 172 in this sample registry, but
I'm still not sure whether it should be valid in language tags.  Perhaps
ar-172 needs to be distinguished from ar-SA, in the same way that es-419
needs to be distinguished from es-ES.  This is another question I'd like
to see answered.

In the next few days, I'll try to add entries for the registered subtags
and grandfathered tags that I think will be established, although again
I'm not trying to steal the reviewer's job.

Thanks in advance for any comments,

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/



