Language Identifier List Comments, updated

Doug Ewell dewell at adelphia.net
Tue Dec 28 07:28:54 CET 2004


JFC (Jefsey) Morfin <jefsey at jefsey dot com> wrote:

> the internet is usually accepted within the IETF as the adherence to
> the documents resulting from the Internet standard process. What is
> discussed in here is a review of the BCP 47 (
> http://www.inter-locale.com/ID/draft-phillips-langtags-08.html ). The
> discussion of Tex's page is not a problem, what is a problem is that
> its discussion (please consider your own comment below) does not make
> difference between what is private suggestion and what is
> authoritative in an Internet RFC context.

The fact that Tex and others are discussing his page on an IETF mailing
list does not make his page an official IETF document, nor does it imply
that Tex or the rest of us consider it to be one.

Tex has requested that we use the subject line "Language Identifier List
Comments" for his list, and other subject lines for other topics.  Until
just now, we had been using the subject line "New Last Call: 'Tags for
Identifying Languages' to BCP" for discussions of the draft, and had
been doing a fairly good job keeping the two topics separate.

> Please reread what you just wrote. "there is no reason to qualify
> [Catalan] with a region subtag": only an acknowledged Catalan language
> authority can say that. etc. You are in the process of defining the
> language of the countries. What IETF can only do is to say "if there
> is a need to qualify Catalan this is the way to do it".

Please reread what I wrote, all of it:

"For example, according to the page, 'ca' for Catalan is enough
information; there is no reason to qualify it with a region subtag, as
'ca-ES', because Catalan is Catalan regardless of where spoken, or
because it is only spoken in Spain."

I didn't make any such judgment myself; I said "according to the page."
And Tex and John are not "defining the language of the countries"
either -- that is, saying that Catalan *should* only be spoken in Spain
or that Spaniards *should* speak Catalan.  This is all merely an attempt
to describe what is, not prescribe what should be.

>> The RFC never defines the *entities* associated with
>> these codes, and it is very clear about where the definitions do come
>> from (generally the U.N.).
>
> The RFC 1591/3490 define that authorities, and IANA acknowledges them,
> for the only Internet governed language related issue which is IDNA
> (this is discussed below).

Language tags have been around a lot longer than IDNA.

> This is a final discussion of the Phillips-08 Draft, you call "RFC
> 3066bis". This discussion only shows that its attempt to define the
> tags of these languages does not clarify enough its confusions. What
> is to be discussed is what is unclear in relation to other issues
> under Internet standard process.

RFC 3066bis and its predecessors define how language tags should be
assembled.  They define the *structure* of language tags:

If you mean Catalan, you write "ca".
If you mean Catalan as used in Spain, you write "ca-ES".
If you mean Catalan written in the Fraktur script variant, you write
"ca-Latf".
And so on.
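
To make the structural point concrete, here is a minimal Python sketch
that splits a tag into its subtags purely by shape.  The function name
and the three patterns are assumptions made for illustration; they are a
simplification, not the draft's ABNF, and they ignore variants,
extensions, private-use subtags and grandfathered tags.

```python
import re

# Simplified subtag shapes, assumed for illustration only.
LANGUAGE = re.compile(r"[A-Za-z]{2,3}")          # e.g. "ca", "ale"
SCRIPT = re.compile(r"[A-Za-z]{4}")              # e.g. "Latf"
REGION = re.compile(r"[A-Za-z]{2}|[0-9]{3}")     # e.g. "ES"


def split_tag(tag):
    """Split a language tag into language, script and region by shape.

    Hypothetical helper: it checks structure only and makes no judgment
    about whether the combination "should" be used.
    """
    subtags = tag.split("-")
    if not LANGUAGE.fullmatch(subtags[0]):
        raise ValueError("first subtag must be a 2- or 3-letter language code")
    result = {"language": subtags[0].lower()}
    rest = subtags[1:]
    # An optional 4-letter script subtag comes right after the language.
    if rest and SCRIPT.fullmatch(rest[0]):
        result["script"] = rest.pop(0).title()
    # An optional region subtag comes next.
    if rest and REGION.fullmatch(rest[0]):
        result["region"] = rest.pop(0).upper()
    if rest:
        raise ValueError("subtags not handled by this sketch: " + "-".join(rest))
    return result


print(split_tag("ca"))
print(split_tag("ca-ES"))
print(split_tag("ca-Latf"))
```

Nothing in the sketch knows where Catalan is spoken; it only checks how
the pieces are put together, which is all the structure rules cover.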

What "confusions" relative to this are in need of clarification?

> http://www.iana.org/assignments/idn/
> http://www.iana.org/assignments/idn/registry-language-template.txt
> http://www.iana.org/assignments/idn/pl-arabic.html
> http://www.iana.org/assignments/idn/jp-japanese.html

These are all interesting, in their own way, but what do they have to do
with the structure of language tags?

> You will note that the Polish entry (much controverted submission due
> to its linguistic and political implication, resulting into a letter
> of apologies by the NASK) is only a character set list.

I did note this.  Indeed, "ar-PL" struck me as an amusing tag.  I would
be surprised if it ever shows up on Tex's page.  But it's perfectly
legal, just like "ale-BE", and RFC 3066bis makes no judgment as to
whether either of these tags "should" be used.

> We have to understand that there are three levels of language usage as
> per the Internet standard process or as introduced by this IANA
> procedure and which are candidate to the BCP47 tagging.

RFC 3066bis mentions a few of the different uses to which language tags
can be put.  That is as far as its scope goes.  Anything that is
specific to a particular protocol's use of language tags falls within
the scope of that protocol.

>> As Tex said, language tags are used for much more than IDNA.  And
>> once again, I fail to see what Unicode has to do with any of this.
>
> The Internet standard process is about end to end interoperability,
> not about brain to brain interintelligibility. However such an
> interintelligibility calls for a character level interoperability.
> These are the same characters for texts, documents, LHS and RHS. In a
> network, consistency is of the essence, so it is likely that what is
> to be done for RHS, LHS, protocols, documents etc. is to be consistent
> or that we will meet extensive conflicts due to the importance and the
> complexity of the issue.

Whoa.  But again, I fail to see what Unicode has to do with any of this.
Language tags are constructed using ASCII-range characters.

>> This is way out of scope for RFC 3066bis or any of its predecessors.
>
> No. This is consistent with and affects current IANA procedures
> regarding IANA tables named after the document you discuss the update.

Issues involving IANA tables belong with standards related to IANA
tables.

I read through all the documents you cited (except the Japanese one).  I
noted that many of them make use of language tags.  Can you tell me
exactly how these documents, and the standards behind them, are affected
if the range of language tags is expanded to allow script subtags,
variant subtags, and private-use subtags?  Does it change the way a
Danish registry will allocate domain names in the ".museum" namespace?
Does it encourage or discourage the creation of another "ar-PL"
scenario?  Where are the complications and confusion?

> I never said that anything should be "forced", but that 2 alpha
> overlaps the ccTLD list creating user's confusion. There is a need for
> a simple formatted contextual cultural definition. It cannot be 2 and
> 3 alphas. It has to be 2 + "*" or 3. It is likely that most of the new
> usages will stabilize using 3 letters (over 7250 3 letters tags, a few
> 2 letters tags will be odd and resource consuming in new
> applications).

Language tags, even under RFC 1766, are of the form <language> or
<language>-<country>.  They can never be of the form <country> or
<country>-<language>.  Where are end users getting confused?

The primary language subtag has been either 2 or 3 letters
(deterministically) since the publication of RFC 3066 almost four years
ago.  No debilitating levels of confusion or instability seem to have
been observed thus far.  There *would* be confusion and instability,
however, if implementations had to be changed to recognize "fr" and
"fra" as equivalent when they did not have to do so before.
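
As an illustration of why that would be a change, here is a minimal
Python sketch of the prefix-style matching described for language ranges
in RFC 3066: a range matches a tag if the two are equal, or if the range
is a prefix of the tag and the next character is "-".  The function name
is hypothetical, and the "*" wildcard range is left out.

```python
def range_matches(language_range, tag):
    """Return True if language_range matches tag by prefix matching.

    Comparison is case-insensitive.  This is a sketch of the matching
    rule only; it does not handle the "*" wildcard.
    """
    r = language_range.lower()
    t = tag.lower()
    return t == r or t.startswith(r + "-")


print(range_matches("fr", "fr-CA"))   # range is a prefix followed by "-"
print(range_matches("fr", "fra"))     # "fra" is a different subtag
```

An implementation built this way treats "fr" and "fra" as unrelated
strings, so making them equivalent would mean adding a lookup table that
such code does not need today.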

> I can only recommend you to read the part 9.6 of the following ETSI
> document.
> http://portal.etsi.org/stfs/documents/STF231/eg_202132v010101p.pdf
>
> I submit that the RFC 3066bis ABNF should be checked against that
> recommendations.

This appears to be a well-written document.  Section 9.6 states the need
to identify languages, and recommends against using little flag icons.
Fair enough; I don't think anyone here would strongly disagree.

It makes mention of the standards ISO 639-2 and ISO 3166-1 (as well as
ES 201 381, with which I am not familiar), but I do not read this as a
recommendation against the established, well-known RFC 3066 approach of
using ISO 639-1 alpha-2 codes when available, and ISO 639-2 alpha-3
codes otherwise.
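
That rule can be sketched in a few lines of Python.  The table below is
a tiny hand-picked excerpt assumed for illustration, not the real ISO
639 data, and the function name is hypothetical; the function also does
no validation of the code it is given.

```python
# Excerpt only: ISO 639-2 alpha-3 code -> ISO 639-1 alpha-2 code,
# for languages that have both.
ALPHA3_TO_ALPHA2 = {
    "cat": "ca",  # Catalan
    "fra": "fr",  # French
    "ara": "ar",  # Arabic
}


def primary_language_subtag(alpha3):
    """Pick the alpha-2 code when one exists, else keep the alpha-3 code."""
    code = alpha3.lower()
    return ALPHA3_TO_ALPHA2.get(code, code)


print(primary_language_subtag("cat"))  # has an alpha-2 code
print(primary_language_subtag("ale"))  # Aleut has no alpha-2 code
```

Each language ends up with exactly one primary subtag, which is why the
mix of 2-letter and 3-letter codes stays deterministic.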

In general, before claiming that RFC 3066bis is inadequate because it
does not meet the requirements of some protocol, I think it would be
good to ask whether the rules in question are the same as in RFC 3066.
For example, mixed alpha-2 and alpha-3 language codes are allowed by RFC
3066.  The ability to choose between "ca" and "ca-ES" exists in RFC
3066.  If these situations do not cause havoc now, why would they cause
havoc under RFC 3066bis?

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




More information about the Ietf-languages mailing list