draft-phillips-langtags-08, process, specifications, and extensions

Thu Dec 30 18:11:10 CET 2004

Bruce Lilly <blilly at erols dot com> wrote:

>> Ultimately, the existence of the RFC 3066 language tag registry
>> trumps all of your arguments about this: all of the tags defined in
>> the generative mechanism of RFC 3066bis could have been registered
>> under 3066 (with loss of functionality for the users of those tags,
>> to be sure). The argument that every complete tag used anywhere is
>> trumped by the existence of the generative mechanism in RFC 3066.
>> Registered variant subtags still must have a recommended range to
>> which they apply. Very little has changed, except that using subtags
>> is a bit more logical.
>
> I've reread that several times and can't make sense of it. Could you
> please rephrase.

1.  All tags valid under the generative RFC 3066bis syntax could have
been registered, and therefore would have been valid, under RFC 3066 as
well.

2.  RFC 3066 did not require every possible combination of language
subtag + country subtag to be registered.  Indeed, Section 2.2 of RFC
3066 specifically says such combinations "do not need to be registered
with IANA before use."  Yet you criticize RFC 3066bis for allowing
"en-Latn-US-boont" to be used without being registered as a unit.

3.  Registered variant subtags must have a recommended range to which
they apply.  Users are permitted to write
"sr-Latn-CS-gaulish-boont-guoyu", but are cautioned that the use of
'gaulish' and 'boont' and 'guoyu' is probably inappropriate.

> RFC 3066 has no review process for subtags. They are what the ISO
> lists say they are. It does have a review process for IANA
> registered tags as part of that registration procedure, which
> (except for private use tags) must be followed before use of a
> tag not based on ISO language as a primary tag, and optional
> ISO country as a secondary tag.

Having to wait for each specific tag to be registered that does not
consist of language + country has proven to be inadequate.  Vendors have
gone outside the spec and created "RFC 3066-like" tags to meet important
needs like script tagging.  A standard that gives people what they need
(and doesn't hurt the rest) is better than one which forces people to
violate it.

> Not so; the ISO language and country codes are certainly subject
> to scrutiny (but not to second-guessing and cherry-picking). Under
> RFC 3066, a tag may be generated from the standard ISO tag, or it
> may be an IANA registered tag (leaving aside private use tags for
> the moment).  A parser can easily determine what such a tag is; if
> the primary subtag has 2 or 3 letters, it is an ISO language code.
> If the second subtag has 2 letters, it is an ISO 3166 country code.
> Anything else is either private use (primary subtag is x) or is
> registered as a complete IANA tag, or is an error.

Is it not the case that RFC 3066bis provides a similar, but expanded,
ability to determine the type of each subtag based on its length and
position within the tag?
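
As a rough illustration of that point, here is a minimal Python sketch
that labels each subtag using nothing but its length and position.  The
rules are a deliberate simplification assumed for this example (they are
not the draft's ABNF, and they ignore extended language subtags and
grandfathered tags), and the function name classify_subtags is
hypothetical.

```python
def classify_subtags(tag):
    """Label each subtag of a language tag by length and position only.

    Simplified, assumed rules (NOT the draft's full ABNF):
      - a single-character subtag is a singleton; 'x' starts private use,
        any other singleton starts an extension
      - the first subtag is otherwise the primary language
      - 4 letters -> script; 2 letters or 3 digits -> region
      - anything else -> variant
    """
    labels = []
    mode = None  # set to 'private-use' or 'extension' after a singleton
    for index, subtag in enumerate(tag.split("-")):
        if mode == "private-use":
            # Everything after 'x' is private use, whatever its shape.
            labels.append((subtag, "private-use"))
        elif len(subtag) == 1:
            mode = "private-use" if subtag.lower() == "x" else "extension"
            labels.append((subtag, "singleton"))
        elif mode == "extension":
            labels.append((subtag, "extension"))
        elif index == 0:
            labels.append((subtag, "language"))
        elif len(subtag) == 4 and subtag.isalpha():
            labels.append((subtag, "script"))
        elif (len(subtag) == 2 and subtag.isalpha()) or (
            len(subtag) == 3 and subtag.isdigit()
        ):
            labels.append((subtag, "region"))
        else:
            labels.append((subtag, "variant"))
    return labels


for subtag, kind in classify_subtags("sr-Latn-CS-gaulish-boont-guoyu-x-foo"):
    print(subtag, kind)
```

Note that the sketch never consults a registry: it can say that "Latn"
sits in the script position without knowing what "Latn" means.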

> [de-AT-1901, incidentally, (as an example) does not meet the RFC 3066
> requirement of 3 to 8 characters in the second subtag for registration
> with IANA...].

Absolutely correct.  The need for RFC 3066 tags that go beyond language
+ country has gotten to the point where they have been registered in
violation of the RFC.  Does that not indicate the need for a revision of
the core specification?

> Under the proposed draft, anybody may legally generate
> a tag such as
>   sr-Latn-CS-gaulish-boont-guoyu-i-enochian
> or
>   sr-Latn-CS-gaulish-boont-guoyu-i-enochian-x-foo
> with *no* specific registration requirements (i.e. all components
> are either registered or require no registration). In the latter
> case, a parser can only determine that it contains a private-use
> subtag after wading through the other subtags.  In either case,
> it is difficult (to say the least) for the recipient or his
> software to determine what the generator of that tag intended to
> convey.

First, remove the "i-enochian" piece from your examples.  That is a
grandfathered whole-tag and cannot be embedded in a tag that contains
other stuff.  Check the ABNF again.

Second, it is true that "sr-Latn-CS-gaulish-boont-guoyu[-x-foo]" can be
legally generated without being registered.  That is intentional.  We
have seen that registering whole tags for things like script and
orthographic variant ('-1901' and '-1996') is tantamount to making
special exceptions.  They do not solve the general problem.  So we had
"yi-latn", and then we got "az-Latn" and "sr-Latn" and "uz-Latn", and
now someone is quite reasonably requesting "be-latn".  These are all
tags with legitimate needs.  Perhaps someone will make the case that
Japanese written in Romaji needs to be specially indicated and will
write a request for "ja-Latn", and they will be right too.  Allowing
script subtags to be used generatively, instead of having to be
individually registered, solves this real problem.

It is true that an RFC 3066bis variant subtag can be used with a "prefix"
(currently equivalent to "primary language subtag") that is not
recommended for that variant.  So you can write not only "cel-gaulish"
but also "sr-gaulish".  Perhaps that should be reconsidered.  But even
under RFC 3066, one could write "sr-IQ", which is also unlikely to
reflect a real-world situation (the issue is not whether someone in Iraq
could be speaking Serbian, but whether "Serbian as spoken in Iraq" is a
discrete concept in need of separate tagging).
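
A recipient that wants to flag such combinations could do something like
the following minimal Python sketch.  The prefix table is illustrative
only: it is inferred from the RFC 3066 registrations "cel-gaulish",
"en-boont" and "zh-guoyu", not copied from the draft's registry, and the
names RECOMMENDED_PREFIX and questionable_variants are hypothetical.  It
also makes no attempt to handle private-use subtags.

```python
# Illustrative table only (assumed for this example): each variant mapped
# to the primary language subtag it was originally registered with.
RECOMMENDED_PREFIX = {
    "gaulish": "cel",
    "boont": "en",
    "guoyu": "zh",
}


def questionable_variants(tag):
    """Return known variants used with a primary language subtag other
    than their recommended prefix.  The tag is still well-formed; this
    only produces a caution."""
    subtags = tag.lower().split("-")
    primary = subtags[0]
    return [
        subtag
        for subtag in subtags[1:]
        if subtag in RECOMMENDED_PREFIX and RECOMMENDED_PREFIX[subtag] != primary
    ]


print(questionable_variants("sr-Latn-CS-gaulish-boont-guoyu"))
print(questionable_variants("en-Latn-US-boont"))
```

The first call reports all three variants as questionable; the second
reports none, because 'boont' is being used with its recommended prefix.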

Even writing "haw-US" could be viewed as inappropriate, if it is
determined that the "United States" variant of Hawaiian is really the
only variant worth tagging and that plain "haw" should therefore be used
instead.  This is what Tex's page attempts to sort out (although Tex's
page is informative only and not connected to the draft, and I hesitate
to mention it because not everyone seems to understand this).

But both RFC 3066 with its choice of "haw" vs. "haw-US", and RFC 3066bis
with its choice of "sr" vs. "sr-Latn" vs. "sr-CS" vs. "sr-Latn-CS",
allow flexibility in tagging.  They imply an unwritten rule, that tag
generators should Tag Content Wisely (perhaps it should not be
unwritten), and they require tag recipients to show flexibility, and to
be "liberal in what they accept."  I believe there was a fellow named
Jon, fairly well respected in the Internet standards community, who said
that.

If a user writes "sr-Latn-CS-gaulish-boont-guoyu", it is supremely easy
to tell what each of the subtags means by looking it up in the registry.
(This is NO DIFFERENT from having to look up "en" and "US" in the
respective ISO standards to tell what they mean, except that there is
one source instead of two.)  "What the generator intends to convey"
may always be difficult to ascertain.  As Peter points out, what does
the generator mean to convey by writing "de-CH" instead of "de"?  Does
she refer to spelling, vocabulary, level of formality?

> Returning to the private use issue; in RFC 3066, as in
> every other case that I know of where x is used as an indicator
> of private use for some name, it is used as a prefix of the name,
> never buried deep inside the name (as provided for by the draft
> proposal).

That is a feature, not a bug.  Generators can write "en-US-x-texas" and
have that tag mean a lot more than "x-en-texas" to recipients who don't
understand the private-use part.
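
To make that concrete, here is a minimal Python sketch of what such a
recipient could do: keep everything before the 'x' singleton and discard
the rest.  The function name strip_private_use is hypothetical, and the
sketch assumes the tag is already well-formed.

```python
def strip_private_use(tag):
    """Return the portion of a tag that precedes any private-use subtags.

    If 'x' is the first subtag, the whole tag is private use and nothing
    interpretable remains, so an empty string is returned.
    """
    subtags = tag.split("-")
    for index, subtag in enumerate(subtags):
        if subtag.lower() == "x":
            return "-".join(subtags[:index])
    return tag


print(repr(strip_private_use("en-US-x-texas")))  # 'en-US'
print(repr(strip_private_use("x-en-texas")))     # ''
```

With "en-US-x-texas" the recipient is still left with "en-US"; with
"x-en-texas" it is left with nothing it can rely on.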

>> The new draft actually provides a framework in which any subtag's
>> type can be discerned from its position and size, even if the subtag
>> itself is unrecognized: this is actually *better* than you could
>> obtain with the existing registry.
>
> Not quite; in the examples above one cannot determine what "enochian"
> is from its size and position alone -- one needs to know that it
> follows a single character subtag and that the single character is
> not an x.

The fact that it follows a single-character subtag is part of its
"position."

> Surely you're not claiming that each individual generator must
> separately register "sr", "Latn", "CS" etc. in order to use them!?!

Of course he is not.

> A recipient using software that interprets RFC 3066
> tags isn't going to be able to do anything useful with any
> hypothetical tag which contains a script subtag that would be
> produced under the draft rules (if the script subtag were to appear
> *after* the region subtag, one could at least match "sr-CS-Latn"[...]
> to "sr-CS", which an RFC 3066 parser could handle.

Of course it can.  "Matching" does not have to consist solely of
stripping subtags from the right.
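
For instance, a matcher could require the same primary language subtag
and then look for the remaining subtags of the range in order, skipping
over anything in between.  The Python sketch below shows one such
scheme.  It is an assumption made for illustration, not the matching
rule of RFC 3066 (which is prefix matching) or of the draft, and the
function name matches is hypothetical.

```python
def matches(language_range, tag):
    """Return True if every subtag of language_range occurs in tag, in the
    same order, with identical primary language subtags.  Subtags in the
    tag that the range does not mention (such as a script) are skipped."""
    wanted = [subtag.lower() for subtag in language_range.split("-")]
    have = [subtag.lower() for subtag in tag.split("-")]
    if wanted[0] != have[0]:
        return False
    position = 1
    for subtag in wanted[1:]:
        try:
            # Search only to the right of the previous match to keep order.
            position = have.index(subtag, position) + 1
        except ValueError:
            return False
    return True


print(matches("sr-CS", "sr-Latn-CS"))  # True
print(matches("sr-CS", "sr-Latn"))     # False
```

Under this scheme a request for "sr-CS" is satisfied by content tagged
"sr-Latn-CS", even though the script subtag sits between language and
region.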

> Again returning to private-use, an RFC 3066 parser can (only)
> determine that a private-use tag is in use if it has x as the primary
> tag. There are provisions in the draft syntax that break backwards
> compatibility.

Where?  Are there existing RFC 3066 tags that have a subtag of 'x'?

What backward compatibility is broken?  (Specifically, not by
stipulation.)

>> Well you can't have it both ways. Either CS means Czechoslovakia or
>> it means Serbia and Montenegro.
>
> Certainly in language tags "CS" is in use to mean Srbija i
> Crna Gora-Srpski.  I haven't seen any documented cases where
> it is used (in language tags) to mean Czechoslovakia (but I
> haven't started any archaeological digs to try to uncover any).
> If there has been no such use, then the brouhaha over the change
> is much ado about nothing.  If there has been such use, then
> it's clear that interpretation is going to have to be linked to
> time of generation of the tag if the semantics are to be
> preserved.

Please look at draft-09.  It does what you want with regard to "CS".

> For the moment, we're discussing draft-phillips-langtags-08,
> on which IESG action is pending (in a week).  There are many
> things that the IESG might do when it makes its decision; in
> prudence, I'll wait to see what they decide.  IMO, discussing
> multiple revisions of a draft through multiple IESG New Last
> Calls isn't the most efficient or effective way to make
> progress.

This is what Addison meant by not trying to achieve consensus.  We are
working to address your concerns, and you spurn them.

>> There would be no RFC 1766 or 3066 if ISO 639 language codes actually
>> captured all of the nuances of language (doh!).
>
> Well, there was a need for separate registered tags and for
> specification of private use tags, so I don't think that's quite
> right. It sounds like 639-3 might provide substantially greater
> coverage.

Private-use whole tags are of no use to recipients who do not understand
the entire tag.  RFC 3066bis tags that include a private-use subtag can
at least be partially understood by such recipients.

ISO DIS 639-3 is not an approved standard yet, so an RFC cannot be based
on it.

>> There is a clear need for script codes for distinguishing certain
>> kinds of Chinese written material...
>
> But none of that applies to an audio file of spoken material,
> where script would be superfluous and, as noted above, would
> lead to loss of backwards compatibility.

Then the generator should not use a script subtag, or the recipient should
ignore it.  Is that obscure in some way?

>> It is only *one* of the things addressed by the draft. But it is and
>> remains important. Doug Ewell suggested to me that even if no RA or
>> MA ever reuses a code again, it is still ISO 3166/MA's job to keep
>> the codes in sync with the current state of the world. Whenever
>> countries split up, join together, or change names, ISO 3166/MA will
>> be there to change the code list. The instability is not all the MA's
>> fault, but we still need to protect against it because of legacy
>> data. The lonely CS example should not become the state of affairs
>> going forwards.
>
> Does the ISO not set ground rules for the 3166/MA?  Could it not
> specify that codes are not to be reused?

Didn't you read any of what Addison wrote?  It is not just about reusing
codes.

>> Matching hasn't actually changed.
>
> I beg to differ. Introduction of a script subtag between language
> and country code changes matters considerably, in a manner which
> breaks backwards compatibility.

Explain where it breaks backwards compatibility, please.

> Please see RFC 2026 sections 7.1, 7.1.1, 7.1.3, and 10.1.
> Note that RFC 3066 strictly complies with those sections, while
> the draft under discussion, by cherry-picking from ISO lists
> for which change control has not been transferred to the IESG,
> does not.

Please read the current work on draft-09, which addresses the "CS"
situation in EXACTLY the way you have been asking for.  Please do not
continue to represent this "CS" issue as a great, intractable, fatal
technical defect in the draft.

>> The current draft REPLACES RFC 3066.
>
> Drafts don't replace RFCs.

Before an RFC is written to replace another RFC, it must first be an
I-D.  It should be obvious that this is what Addison meant.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/