Region subtags under 3066 and 3066bis (long)

Sat Feb 19 19:33:13 CET 2005

I apologize in advance for the length of this post (13 KB).

Frank Ellermann <nobody at xyzzy dot claranet dot de> wrote:

> Okay, I found 830 on the page:
> <http://unstats.un.org/unsd/methods/m49/m49alpha.htm>
>
> You probably copy all these codes to your registry, and if the
> UN later removes a code it's still available in the registry.

Not exactly.  The rules for using a UN numeric code are clearly stated
in the draft.

There are 231 UN codes that represent countries, plus another 37 that
represent regions like "Northern Africa" or economic groupings like
"Small island developing States," plus another 10 that have been removed
from use.

Of those 231 country codes, only three are used in the draft registry:
830 for Channel Islands and 833 for Isle of Man, because there is no ISO
3166 alpha code for them, and 200 for Czechoslovakia, because its ISO
3166 alpha code has been reused for Serbia and Montenegro.  Additional
UN numeric codes would be used if ISO 3166/MA pulled another "CS" by
reusing another alpha code.  The alpha code would continue to have its
old meaning, and the new entity would be represented by a UN numeric
code.

Also, the 30 UN numeric codes that refer to geographical regions are in
the registry, but the ones that denote economic groupings are not.  This
is by rule in the draft (Section 2.2.4 in draft-09).

So it is not simply a matter of copying all the codes to the registry.
It is true, however, that once a code is added to the registry, it stays
there permanently.
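
For anyone who finds code easier to follow than prose, here is a rough
Python sketch of the selection rules described above.  It is only an
illustration, not text from the draft: the UNEntry class, its field
names, and the un_code_in_registry function are all invented for this
example, and the sample entries are limited to ones named in this post
plus France as an ordinary country that has its own alpha-2 code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UNEntry:
    """Hypothetical record for one UN M.49 entry (names are assumptions)."""
    description: str
    kind: str                        # "country", "geographic_region" or "economic_grouping"
    has_usable_alpha2: bool = False  # only meaningful for countries


def un_code_in_registry(entry: UNEntry) -> bool:
    """Return True if this UN numeric code would appear as a region subtag."""
    if entry.kind == "geographic_region":
        return True                  # geographical regions are included
    if entry.kind == "economic_grouping":
        return False                 # economic groupings are excluded by rule
    if entry.kind == "country":
        # A country gets its UN numeric code only when no ISO 3166 alpha-2
        # code is available for it (never assigned, or reused elsewhere).
        return not entry.has_usable_alpha2
    raise ValueError(f"unknown kind: {entry.kind!r}")


channel_islands = UNEntry("Channel Islands (830)", "country")
isle_of_man = UNEntry("Isle of Man (833)", "country")
czechoslovakia = UNEntry("Czechoslovakia (200)", "country")
france = UNEntry("France", "country", has_usable_alpha2=True)
northern_africa = UNEntry("Northern Africa", "geographic_region")
sids = UNEntry("Small island developing States", "economic_grouping")

for e in (channel_islands, isle_of_man, czechoslovakia, france,
          northern_africa, sids):
    print(e.description, "->", un_code_in_registry(e))
```

The sketch deliberately ignores the permanence rule: it models which
codes get added, not the fact that nothing is ever removed afterward.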

> When they'll reach 999 they will pray continue with 1000 etc.
> and not try to recycle old codes.  That part should be clear.

A common myth about the numeric codes seems to be that because they are
"already" up to 894, they must be in imminent danger of running out of
codes.  Not true; as mentioned above, only 278 codes have been assigned
since 1982.  The codes are not assigned in sequential order, and there
is no code change when a country merely changes its name (unlike ISO
3166).

>> IM (Isle of Man) is covered by 833.
>
> Let's hope that they find "gv" before experimenting with 833.

There is no requirement, human or technical, that the Manx language be
associated 1-to-1 with the Isle of Man.  Indeed, there is a great deal
more English spoken on IOM than Manx.

> Maybe add a comment in your list, that alpha-3 codes are only
> listed if there was no alpha-2 code when this registry "was"
> (= will be) started.  And therefore I won't find "glv".

Again, the draft is very clear on this point.  If an alpha-2 code is
available for a given language, only that code is valid for language
tags, and *not* the corresponding alpha-3 code.  Thus neither "fra" nor
"deu" (nor "fre" nor "ger") is valid for use in language tags.  This
concept was actually introduced with RFC 3066, back in 2001.
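
As a minimal illustration of that rule (not an implementation of the
draft), the check could be sketched in Python as follows.  The table
names and the function name are my own inventions, and the tables hold
only the handful of codes mentioned in this thread, not the full ISO 639
lists.

```python
# Tiny illustrative excerpt; a real checker would load the full registry.
ALPHA3_TO_ALPHA2 = {
    "fra": "fr", "fre": "fr",   # French
    "deu": "de", "ger": "de",   # German
    "glv": "gv",                # Manx
}
ALPHA3_ONLY = {"tlh"}           # Klingon: alpha-3 code with no alpha-2 code


def is_valid_language_subtag(subtag: str) -> bool:
    """Accept alpha-2 codes, and alpha-3 codes only when no alpha-2 exists."""
    s = subtag.lower()
    if len(s) == 2:
        return s in ALPHA3_TO_ALPHA2.values()
    if len(s) == 3:
        # An alpha-3 code that has an alpha-2 counterpart is rejected.
        return s in ALPHA3_ONLY
    return False


for code in ("fr", "fra", "fre", "gv", "glv", "tlh"):
    print(code, is_valid_language_subtag(code))
```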

>> Having to hunt down a reference to a withdrawn ISO 3166 code
>> element (not available for free from ISO) would be a recipe
>> for trouble.
>
> That's why I'm unhappy with the NH example in the draft and its
> reference to the 3rd edition of ISO 3166:  The "free" list is
> the actual list of the 5th edition.  Your collection of regions
> contains numerous codes I've never before heard of (e.g. PU).

I don't have a copy of ISO 3166-3; it costs 70 Swiss francs, which is
roughly 45 euros or 59 U.S. dollars.  I got my historical data from
Clive Feather's page at http://www.davros.org/misc/iso3166.html.  It
isn't official, but I generally think of Clive's work as being
thoroughly researched, so I doubt he made up any of the codes or got any
of the dates badly wrong.  I welcome any corrections from anyone who has
a "real" copy of ISO 3166-3.

It is true that the draft refers to the 3rd edition of ISO 3166 (August
1988).  This is unchanged from RFC 1766, published in 1995.  Perhaps
that should be updated.

But the rationale for using NH in examples was precisely to demonstrate
the use of no-longer-active ISO 3166 codes in language tags.  NH does
not exist in the official on-line ISO 3166-1 list, but neither do ZR and
YU.  They did exist in previous versions of ISO 3166.  If you are
thinking that those "previous versions," or codes taken from them, are
not mentioned anywhere in the draft, you are correct and we may want to
discuss that as well.

> The deprecated FQ could get a canonical TF...
> The deprecated BQ could be AQ or HM or GS...
> The deprecated PU could get a canonical UM...

The rule on canonical equivalents is that code X is considered the
canonical equivalent of code Y (alternatively, Y is an alias of X) if
they represent exactly the same entity.  To take a familiar example,
YU is an alias for CS because the entity that had the ISO 3166 code YU
is the same as the one that now has CS.  (The "original" Yugoslavia
split apart in the early 1990s; the code wasn't changed until 2003.)
Likewise, ZR is an alias for CD because the two subtags represent the
same plot of land.

Region subtags do not have the canonical/alias relationship if they do
not represent the same plot of land.  FQ is not exactly the same as TF.
PU is not exactly the same as UM.  BQ is... well, you said it yourself:
it's not even clear what code now represents the area formerly coded as
BQ.
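
In code terms, the relationship is just a lookup that leaves a subtag
alone when no exact equivalent exists.  The following Python sketch is
only meant to show the idea; the table and function names are made up
for this example, and the table contains only the codes discussed here.

```python
# Alias -> canonical, only where both codes denote exactly the same entity.
CANONICAL_REGION = {
    "YU": "CS",
    "ZR": "CD",
}

# Withdrawn codes with no exact present-day equivalent: no canonical value.
DEPRECATED_NO_CANONICAL = {"FQ", "PU", "BQ"}


def canonical_region(subtag: str) -> str:
    """Return the canonical region subtag, or the subtag itself if none."""
    code = subtag.upper()
    return CANONICAL_REGION.get(code, code)


for code in ("YU", "ZR", "FQ", "PU", "BQ"):
    print(code, "->", canonical_region(code))
```

Note that FQ, PU and BQ come back unchanged: they are deprecated, but
nothing replaces them one-for-one.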

Some background on this:

It's important that we have explicit rules to determine what codes are
used, and why, rather than just picking whatever strikes our fancy.  The
rule about the canonical/alias relationship, above, is applied
consistently.  The rule about when UN numeric codes are used is applied
consistently.  So is the "cutoff date" rule that determines when an
alpha-2 region subtag keeps its "old" association and when it takes on a
"new" association.  (See the end of Appendix C for this.)

A couple of months ago, we were accused of "cherry-picking" because we
applied the subtag CS to Czechoslovakia, rather than to Serbia and
Montenegro.  In fact, there was a "cutoff date" rule in place: region
subtags were given the meaning they had in ISO 3166 as of 2003-01-01.
Because CS was changed later in 2003, it fell after the cutoff date.
However, in later versions of the draft, the cutoff date was moved to
2005-01-01, which means that CS now means Serbia and Montenegro.  Any
ISO 3166 code changes *after* the cutoff date will be handled
differently (the old code will remain canonical, the new code will be
an alias).
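
To show how mechanical the cutoff rule is, here is a small Python sketch.
It is my own illustration rather than anything in the draft: the function
name is invented, and because I have said only that CS changed "later in
2003," the exact change date used below is a placeholder assumption.

```python
from datetime import date


def meaning_of_code(old_meaning: str, new_meaning: str,
                    change_date: date, cutoff: date) -> str:
    """A reassigned code takes its new meaning only if the change
    happened on or before the cutoff date."""
    return new_meaning if change_date <= cutoff else old_meaning


# Placeholder: any date in 2003 after January 1 gives the same results.
CS_CHANGE = date(2003, 7, 1)

for cutoff in (date(2003, 1, 1), date(2005, 1, 1)):
    print(cutoff, "->",
          meaning_of_code("Czechoslovakia", "Serbia and Montenegro",
                          CS_CHANGE, cutoff))
```

With the earlier cutoff, CS keeps its old meaning; with the later one, it
takes the new meaning.  No judgment about the particular code is
involved.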

The point is that if the semi-arbitrary setting of a cutoff date, to
make CS fall on the desired side of the line, is viewed as
"cherry-picking"... well, what would happen if we actually *did* decide
arbitrarily, case by case, which regions should have codes and which
should not?

The former, withdrawn, deprecated, whatever, codes should really be one
of the least of our worries.  They are there for compatibility with
possible past usage.  They are not intended for tags to be generated
today or in the future.  For consistency -- again, to avoid being
arbitrary cherry-pickers -- we have included codes that were withdrawn
before language tags existed as we know them.  They should not pose any
problems to anyone, as long as people Tag Content Wisely.  If you have
content tagged as "fr-FQ", and don't get the Google hits that you would
have gotten if you had used "fr-TF" instead, the sky probably will not
fall.

> The deprecated NT is useless in a registry about languages.

Lots of region subtags have no bearing on languages.  AQ and BV leap to
mind, and those aren't even deprecated.  But we are trying to avoid
being arbitrary.

> The deprecated DD has a canonical DE.  In theory de-DD could
> make sense, but that would be also the moment where I'd want
> a region code for say Wales.

I'd like to start investigating ways to encode the four "countries" of
the UK, so we can get that issue settled and not have it be an albatross
around our necks.  Certainly there are known differences between English
English and Scottish English.

> You have a "changed RH" for ZW, how old is your source ?  That
> was long before RfC 1766 was published, it's ancient history.

My source for withdrawn ISO 3166 codes was Clive Feather's page,
mentioned above.  He has the change from RH to ZW occurring in 1980 (no
specific date), which is in fact the year that Southern Rhodesia became
Zimbabwe.  (Not just a name change, either.)  I'm sure an official copy
of ISO 3166-3 would say the same.

This is 15 years before RFC 1766 was published, but again, it would be
arbitrary to exclude this code.

> Is the "official" name of Macedonia really still "FYROM" ?

It is according to both ISO 3166 and UN M.49.  But remember, even though
the names in the registry are taken from the relevant standards, they
are not normative.  They are just descriptions to identify the region.
You are free to call it "Macedonia" or "FYROM" or "the Skopje regime"
(I've seen that) or whatever else you feel is appropriate, as long as it
is clear you are not referring to, say, the historical region of modern
Greece also known as "Macedonia."

> Some "gradfathered" codes:
> Is i-default different from "und" (alpha-3) ?

I assume so.  See the registration form for i-default.  I personally
hate the fact that und is allowed in language tags, and Section 2.3
discourages it, but again it should not cause any significant problems
in the real world.

> i-klingon has "tlh" in the comment (6th column), why not in the
> 5th column (canonical) ?  I know that it's not yet in the 3066
> registry, but that's only a bug, or isn't it ?
>
> Same problem with i-lux vs. lb.  In that case it is already in
> the 3066 registry.  Navajo also belongs to this group.

This is a loophole in the draft, sort of, that I would also like to see
resolved.

The draft says that the "canonical" field can only contain subtags of
the same type as the alias itself.  For example, the language subtag iw
has a canonical value of he, another language subtag.  The region subtag
BU has a canonical value of MM, another region subtag.  According to
this model, strictly speaking, a grandfathered tag can only have a
canonical value that is another grandfathered tag.  But that's silly;
there are no duplicate "grandfathered" tags and never will be in the
future.

Meanwhile, the fact is that we do want the grandfathered tag i-klingon
to have a canonical value of tlh.  Specifically, that is the WHOLE TAG
tlh, not the language subtag tlh.  (In other words, you can't write
*i-klingon-GB and have it be canonically equivalent to tlh-GB; the
latter would be valid but the former would not.)  The problem is that
the draft currently does not allow this.  I think this needs to be
written in.
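
Here is a Python sketch of the behavior I have in mind.  To be clear,
this is what I would like the draft to permit, not what it says today;
the table and function names are hypothetical, and only the two pairs
spelled out in this thread are included.

```python
# Whole grandfathered tag -> whole canonical tag (proposed, not in the draft).
GRANDFATHERED_CANONICAL = {
    "i-klingon": "tlh",
    "i-lux": "lb",
}


def canonicalize_whole_tag(tag: str) -> str:
    """Map a grandfathered tag to its canonical tag, matching the whole
    tag only.  There is deliberately no prefix matching."""
    return GRANDFATHERED_CANONICAL.get(tag.lower(), tag)


for tag in ("i-klingon", "i-lux", "i-klingon-GB", "tlh-GB"):
    print(tag, "->", canonicalize_whole_tag(tag))
```

Because the lookup is on the entire tag, i-klingon-GB is not rewritten to
tlh-GB; it simply has no entry, consistent with its not being a valid tag
in the first place.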

>> It extends 3066 in this regard by re-allowing BU, DD, FX, NH,
>> SU, TP, YU, and ZR.
>
> If it was invalid under RfC 1766 in 1996 please get rid of it.

They are allowed for consistency.  Are they likely to cause real-world
problems?

>> the IETF is not in the business of deciding what is a
>> country.
>
> Sure, but FK is a valid ccTLD and a valid 3166 country
> code, and unlike BQ / FQ / NQ there's also a population.
> The UN region number is 238.

FK is indeed a valid ISO 3166-1 country code, and is therefore a valid
region subtag under RFC 3066bis (and 3066 and 1766).  I don't see the
controversy here.

> As far as I'm concerned the queen of Scotland, the queen
> of England, and the duchess of Normandy can fight it out
> if they have a problem with AC / BM / CP / DG / FK / GB
> / GG / GI / GS / IM / IO / JE / KY / MS / PN / SH / TA
> / TC / UK / VG.  If they don't confuse VG and VI and do
> not alienate the about 47 inhabitants of PN, who cares ?

I'm lost here.  Some of these are valid region subtags, some are not.
They are included in, or excluded from, the registry on the basis of the
underlying standards and the rules described above.  Nothing is included
or excluded on the basis that we personally thought it would matter, or
didn't think it would matter, or didn't care.

> Apparently 3066bis doesn't want CP, DG, and TA in this
> collection, that's fine.  It also doesn't want AC, GG,
> JE, and UK.  That's a pain, because users know ccTLDs.
> And adding a useless BQ to GS is utter dubious.

ISO 3166 codes are used, as they have been for 10 years now, because
they identify countries and country-like entities as defined by an
international organization more qualified to do so than you or I.  The
ccTLD mechanism has chosen to encode some additional things, like AC and
GG and JE, and that is up to them.  It has also chosen to use UK for
United Kingdom instead of GB, and that has actually added to the
confusion because now there are TWO current codes for the same entity
(at least in some people's minds).  The draft uses the codes that it
uses, consistently.

I appreciate your feedback on the repertoire of region subtags.  I think
it would be better to argue about the rules that govern the repertoire,
and say "we should (or shouldn't) allow non-ISO-3166 ccTLDs" or "we
should (or shouldn't) allow ISO 3166 codes withdrawn before 1995,"
rather than arguing about specific codes like BQ.  Debating the merits
of specific codes would make sense if we were picking them one-by-one,
but that is not the case.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/