draft-phillips-langtags-08, process, specifications, and extensions

Sat Jan 1 20:06:32 CET 2005

>  Date: 2004-12-30 12:11
>  From: "Doug Ewell" <dewell at adelphia.net>
>  To: ietf-languages at alvestrand.no

> 1.  All tags valid under the generative RFC 3066bis syntax could have
> been registered, and therefore would have been valid, under RFC 3066 as
> well.

Not so. RFC 3066 section 2.2 specifically requires, for IANA-registered
tags, that:
1. the primary subtag be "i"
2. the second subtag consist of 3 to 8 characters.

The generative mechanisms have a primary subtag of 2 or 3 letters
and a second subtag of 2 letters.

The two sets of namespaces do not overlap, in whole or in either
the primary or second subtag.
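
As a rough illustration of that reading, here is a sketch in Python that
classifies a tag by looking only at its first two subtags. It is not taken
from RFC 3066 itself: the function name classify_rfc3066 and the label
strings are made up for the example, and no ISO or IANA list is consulted,
so a "language" result only means the subtag has the right shape.

```python
import re

# RFC 3066 subtags are 1 to 8 characters; the primary subtag is letters only.
SUBTAG = re.compile(r"[a-z0-9]{1,8}")

def classify_rfc3066(tag):
    """Classify a tag from its first two subtags (hypothetical helper)."""
    subtags = tag.lower().split("-")
    if not all(SUBTAG.fullmatch(s) for s in subtags) or not subtags[0].isalpha():
        return "error"
    primary = subtags[0]
    second = subtags[1] if len(subtags) > 1 else None
    if primary == "x":
        return "private-use"
    if primary == "i":
        # Registered form as described above: "i" plus a 3 to 8 letter subtag.
        ok = second is not None and second.isalpha() and 3 <= len(second) <= 8
        return "iana-registered" if ok else "error"
    if len(primary) in (2, 3):
        if second is None:
            return "language"
        if len(second) == 2 and second.isalpha():
            return "language+country"
    # Anything else needs a whole-tag lookup or is an error.
    return "other"

for tag in ("haw", "haw-US", "i-enochian", "x-en-texas", "en-Latn-US-boont"):
    print(tag, "->", classify_rfc3066(tag))
```

Under these assumptions a tag such as "en-Latn-US-boont" falls into the
"other" bucket: its second subtag is neither a 2-letter country code nor
part of an "i-" registration.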

> 2.  RFC 3066 did not require every possible combination of language
> subtag + country subtag to be registered.

None *could* be registered.

> Indeed, Section 2.2 of RFC 
> 3066 specifically says such combinations "do not need to be registered
> with IANA before use."  Yet you criticize RFC 3066bis for allowing
> "en-Latn-US-boont" to be used without being registered as a unit.

Yes, because an RFC 3066 parser cannot make any sense of it.
I.e. the proposed draft lacks "backwards compatibility".

> > RFC 3066 has no review process for subtags. They are what the ISO
> > lists say they are. It does have a review process for IANA
> > registered tags as part of that registration procedure, which
> > (except for private use tags) must be followed before use of a
> > tag not based on ISO language as a primary tag, and optional
> > ISO country as a secondary tag.
> 
> Having to wait for each specific tag to be registered that does not
> consist of language + country has proven to be inadequate.

Inadequate for whom and for what purpose?

Review and registration (in an ideal case) serve the purpose of
ensuring that there is adequate justification for widespread
deployment and that compatibility issues are considered.

> Vendors have 
> gone outside the spec and created "RFC 3066-like" tags to meet important
> needs like script tagging.

A vendor is free to use private-use tags for such purposes.
For that matter, a vendor is free to use whatever he likes so
long as he doesn't claim compliance with relevant Internet
Standards when he goes outside of strict compliance.

> A standard that gives people what they need 
> (and doesn't hurt the rest) is better than one which forces people to
> violate it.

Nobody is forced to do anything.  And the draft as proposed
would cause problems (as noted above w.r.t. lack of
backwards compatibility, and elsewhere).

> > Not so; the ISO language and country codes are certainly subject
> > to scrutiny (but not to second-guessing and cherry-picking). Under
> > RFC 3066, a tag may be generated from the standard ISO tag, or it
> > may be an IANA registered tag (leaving aside private use tags for
> > the moment).  A parser can easily determine what such a tag is; if
> > the primary subtag has 2 or 3 letters, it is an ISO language code.
> > If the second subtag has 2 letters, it is an ISO 3166 country code.
> > Anything else is either private use (primary subtag is x) or is
> > registered as a complete IANA tag, or is an error.
> 
> Is it not the case that RFC 3066bis provides a similar, but expanded,
> ability to determine the type of each subtag based on its length and
> position within the tag?

No, draft-phillips-langtags-08 does not (specific example noted below).

> > [de-AT-1901, incidentally, (as an example) does not meet the RFC 3066
> > requirement of 3 to 8 characters in the second subtag for registration
> > with IANA...].
> 
> Absolutely correct.  The need for RFC 3066 tags that go beyond language
> + country has gotten to the point where they have been registered in
> violation of the RFC.  Does that not indicate the need for a revision of
> the core specification?

No, it indicates that the review/registration procedure has violated
the rules of syntax specified by BCP, and as a result has caused
problems of a nature similar to those being criticized w.r.t. ISO
MA action (pot to kettle: "you're black").

> > Under the proposed draft, anybody may legally generate
> > a tag such as
> >   sr-Latn-CS-gaulish-boont-guoyu-i-enochian
> > or
> >   sr-Latn-CS-gaulish-boont-guoyu-i-enochian-x-foo
> > with *no* specific registration requirements (i.e. all components
> > are either registered or require no registration). In the latter
> > case, a parser can only determine that it contains a private-use
> > subtag after wading through the other subtags.  In either case,
> > it is difficult (to say the least) for the recipient or his
> > software to determine what the generator of that tag intended to
> > convey.
> 
> First, remove the "i-enochian" piece from your examples.  That is a
> grandfathered whole-tag and cannot be embedded in a tag that contains
> other stuff.  Check the ABNF again.

I did. It meets the ABNF production "extension", which is legal in
the place where it appears in the examples. OK, so there's some
verbiage a half-dozen pages away in the draft that excludes it.
Another case where the ABNF conflicts with the text unnecessarily
("singleton" could have mentioned "i" as well as x in the note,
or (preferably) it could have specified a-h,j-w,y-z).  In any
event, determining whether or not one has an instance of "extension"
or "grandfathered" or "private-use" involves examining *multiple*
subtags.
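
To make the point concrete, here is a simplified sketch in Python (not the
draft's full ABNF). It assumes the restricted singleton set suggested above
(a-h, j-w, y-z) and ignores the draft's length rules for extension subtags;
label_subtags and its labels are hypothetical names for the example. The
label given to a subtag depends on the singleton seen before it, not on the
subtag's own size.

```python
import string

# Assumption: singletons other than "i" and "x" introduce an extension.
EXTENSION_SINGLETONS = set(string.ascii_lowercase) - {"i", "x"}

def label_subtags(tag):
    """Label each subtag, tracking the most recent embedded singleton."""
    labels = []
    mode = "other"  # what the most recent singleton implies for later subtags
    for position, subtag in enumerate(tag.lower().split("-")):
        if mode == "private-use":
            # Everything after an embedded "x" is private use.
            labels.append((subtag, "private-use"))
        elif position > 0 and len(subtag) == 1:
            if subtag == "x":
                mode = "private-use"
            elif subtag in EXTENSION_SINGLETONS:
                mode = "extension"
            else:
                mode = "unknown"  # e.g. an embedded "i"
            labels.append((subtag, "singleton"))
        else:
            labels.append((subtag, mode))
    return labels

print(label_subtags("sr-Latn-CS-x-foo"))
print(label_subtags("sr-Latn-CS-a-foo"))  # "a-foo" is a made-up extension
```

The same trailing subtag "foo" is labelled "private-use" in the first tag
and "extension" in the second, purely because of the subtag preceding it.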

> Second, it is true that "sr-Latn-CS-gaulish-boont-guoyu[-x-foo]" can be
> legally generated without being registered.  That is intentional.  We
> have seen that registering whole tags for things like script and
> orthographic variant ('-1901' and '-1996') is tantamount to making
> special exceptions.

You mean like reassignment of "CS" was an exception?  Sauce for
the goose...

> They do not solve the general problem.

Part of the problem is that, lacking a charter, the group working
on the draft has neither a clear statement of what "the problem"
is nor the perspective of dealing with Internet-related issues
(as opposed to some vague "general" problem).

> So we had  
> "yi-latn", and then we got "az-Latn" and "sr-Latn" and "uz-Latn", and
> now someone is quite reasonably requesting "be-latn".  These are all
> tags with legitimate needs.

1. They would be OK (but ill-advised; see next item) if prefixed with
   "i" as the primary subtag and given a second subtag of 3 to 8 letters.
2. As script is an orthogonal issue to language, it would be better
   handled by a separate mechanism providing for specification of script
   where necessary (e.g. a hypothetical Content-Script field).
3. In most cases, it is unnecessary as script is clear from the charset
   or range of codes used from the charset. 

> Perhaps someone will make the case that 
> Japanese written in Romaji needs to be specially indicated and will
> write a request for "ja-Latn", and they will be right too.  Allowing
> script subtags to be used generatively, instead of having to be
> individually registered, solves this real problem.

In an inappropriate way. Without consideration for backwards
compatibility.  In violation of the BCP that specified the syntax
and registration procedure.

> It is true that a RFC 3066bis variant subtag can be used with a "prefix"
> (currently equivalent to "primary language subtag") that is not
> recommended for that variant.  So you can write not only "cel-gaulish"
> but also "sr-gaulish".  Perhaps that should be reconsidered.  But even
> under RFC 3066, one could write "sr-IQ", which is also unlikely to
> reflect a real-world situation (not that someone in Pakistan could be
> speaking Serbian, but that "Serbian as spoken in Pakistan" is a discrete
> concept in need of separate tagging).

At least under RFC 3066, "sr-IQ" is clearly a valid language code
followed by a country code.  With "cel-gaulish" under the proposed
draft, it's not clear whether that combination as a unit is
"grandfathered" or a combination of "lang" and "variant". 

> Even writing "haw-US" could be viewed as inappropriate, if it is
> determined that the "United States" variant of Hawaiian is really the
> only variant worth tagging and that plain "haw" should therefore be used
> instead.

I've stated that adding country codes was, in retrospect, a bad
idea.  It might be prudent to see what is covered by 639-3, then
review what is needed in addition to that, and what could be specified
via an orthogonal mechanism.

> But both RFC 3066 with its choice of "haw" vs. "haw-US", and RFC 3066bis
> with its choice of "sr" vs. "sr-Latn" vs. "sr-CS" vs. "sr-Latn-CS",
> allow flexibility in tagging.

RFC 3066 doesn't require "haw-US", and if encountered provides for
matching it (in an "accept" role) with "haw" (as content to be
provided). "sr-Latn" and "sr-Latn-CS" cannot be matched by an
RFC 3066-compliant process to anything, since they do not fit the
RFC 3066 syntax for well-formed language tags.

> They imply an unwritten rule, that tag 
> generators should Tag Content Wisely (perhaps it should not be
> unwritten), and they require tag recipients to show flexibility, and to
> be "liberal in what they accept."  I believe there was a fellow named
> Jon, fairly well respected in the Internet standards community, who said
> that.

"Liberal acceptance" doesn't mean "anything goes".  One cannot
reasonably expect widely deployed implementations of an existing
standard to change overnight.  Yet that is exactly what would be
required by moving the draft under discussion to BCP.  Because
the draft lacks backwards compatibility and because BCP has no
provision for phased roll-in, that would be a grave error.

> If a user writes "sr-Latn-CS-gaulish-boont-guoyu", it is supremely easy
> to tell what each of the subtags means by looking it up in the registry.

The *existing* registry, as used by RFC 3066 implementations? I
don't think so.  How do you expect that the millions of email
UAs that have to parse Content-Language fields are going to get
updated *overnight* to use a registry that doesn't exist yet?

> (This is NO DIFFERENT from having to look up "en" and "US" in the
> respective ISO standards to tell what they mean, except that there is
> one source instead of two.)

It is VERY different. The ISO lists exist now, and use of 639
and 3166 codes is widely deployed in language tag parsers.

> "What the generator intends to convey" 
> may always be difficult to ascertain.  As Peter points out, what does
> the generator mean to convey by writing "de-CH" instead of "de"?  Does
> she refer to spelling, vocabulary, level of formality?

A fair question; I've commented in a separate message.

> > Returning to the private use issue; in RFC 3066, as in
> > every other case that I know of where x is used as an indicator
> > of private use for some name, it is used as a prefix of the name,
> > never buried deep inside the name (as provided for by the draft
> > proposal).
> 
> That is a feature, not a bug.  Generators can write "en-US-x-texas" and
> have that tag mean a lot more than "x-en-texas" to recipients who don't
> understand the private-use part.

That specific example ("en-US-x-texas") is legal under RFC 3066
and could be reasonably handled by RFC 3066 implementations. "x-en-texas"
would clearly be a private use tag.  However the draft allows many
variations that lack backwards compatibility (interpretation of qaa,
etc., allowance for generation with 'x' as the second subtag,
immediately after a language code primary subtag, etc.).

> >> The new draft actually provides a framework in which any subtag's
> >> type can be discerned from its position and size, even if the subtag
> >> itself is unrecognized: this is actually *better* than you could
> >> obtain with the existing registry.
> >
> > Not quite; in the examples above one cannot determine what "enochian"
> > is from its size and position alone -- one needs to know that it
> > follows a single character subtag and that the single character is
> > not an x.
> 
> The fact that it follows a single-character subtag is part of its
> "position."

That means considering multiple subtags together, not determining
the type from one subtag's size or position.

> > A recipient using software that interprets RFC 3066
> > tags isn't going to be able to do anything useful with any
> > hypothetical tag which contains a script subtag that would be
> > produced under the draft rules (if the script subtag were to appear
> > *after* the region subtag, one could at least match "sr-CS-Latn"[...]
> > to "sr-CS", which an RFC 3066 parser could handle).
> 
> Of course it can.  "Matching" does not have to consist solely of
> stripping subtags from the right.

That is the only form of matching (of a tag to a range) specified
by RFC 3066, and is therefore the only type used by a strictly
RFC 3066-conforming parser.
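
A minimal sketch of that matching rule in Python, for illustration only
(range_matches is a made-up name): a range matches a tag if the two are
equal, or if the range is a prefix of the tag and the next character in
the tag is "-". Comparison is case-insensitive, and "*" matches any tag.

```python
def range_matches(language_range, tag):
    """Prefix matching of a language-range against a tag (illustrative)."""
    r = language_range.lower()
    t = tag.lower()
    if r == "*":
        return True
    # Equal, or a prefix that ends exactly at a subtag boundary.
    return t == r or t.startswith(r + "-")

print(range_matches("haw", "haw-US"))        # range "haw", tag "haw-US"
print(range_matches("sr-CS", "sr-CS-Latn"))  # script after the region
print(range_matches("sr-CS", "sr-Latn-CS"))  # script before the region
```

With the script subtag after the region, the range "sr-CS" still matches;
with the script subtag in between, it does not, since "sr-CS" is no longer
a prefix of the tag.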

> > Again returning to private-use, an RFC 3066 parser can (only)
> > determine that a private-use tag is in use if it has x as the primary
> > tag. There are provisions in the draft syntax that break backwards
> > compatibility.
> 
> Where?  Are there existing RFC 3066 tags that have a subtag of 'x'?
> 
> What backward compatibility is broken?  (Specifically, not by
> stipulation.)

"Backwards compatibility" means that a new generator does not
generate something which would be misinterpreted or which is
uninterpretable by an existing implementation of the previous
standard. "sr-x-foo" is permitted by the proposed draft, and
is not valid per RFC 3066.  Interpretation of "qaa-foo" as
equivalent to "x-foo" is inconsistent with RFC 3066.

> This is what Addison meant by not trying to achieve consensus.  We are
> working to address your concerns, and you spurn them.

No, I'm simply stating that there are so many issues involved, and
an action on a specific draft already in progress under specific
procedures, that it is best not to conflate additional drafts
with the one under discussion at this time.
If the IESG chooses to issue draft-08 as BCP, there's not
much point in discussing draft-09, is there? If the IESG
decides that the activity should be handled by an IETF WG,
then there are procedural matters that take precedence (composing
a charter, and having it reviewed and approved, appointment of a
WG chair, etc.).

> >> There would be no RFC 1766 or 3066 if ISO 639 language codes actually
> >> captured all of the nuances of language (doh!).
> >
> > Well, there was a need for separate registered tags and for
> > specification of private use tags, so I don't think that's quite
> > right. It sounds like 639-3 might provide substantially greater
> > coverage.
> 
> Private-use whole tags are of no use to recipients who do not understand
> the entire tag.

That is the essence of what "private-use" means.

> RFC 3066bis tags that include a private-use subtag can 
> at least be partially understood by such recipients.

Maybe. Some of them will simply appear as garbage (neither
clearly private-use nor partially intelligible).

> ISO DIS 639-3 is not an approved standard yet, so an RFC cannot be based
> on it.

True, but that doesn't preclude:
a) work on a draft which would go into effect after ISO publication
   of the final standard
b) an RFC that refers to the draft document as a work in progress
   (although that would be somewhat unusual)

> >> There is a clear need for script codes for distinguishing certain
> >> kinds of Chinese written material...
> >
> > But none of that applies to an audio file of spoken material,
> > where script would be superfluous and, as noted above, would
> > lead to loss of backwards compatibility.
> 
> Then the generator should not use a script tag, or the recipient should
> ignore it.  Is that obscure in some way?

How specifically does any existing, deployed RFC 3066 language-tag
parser "ignore" something (a script tag such as within "sr-Latn-CS")
which doesn't conform to RFC 3066 language-tag syntax, without simply
discarding the entire tag as garbage?

> > I beg to differ. Introduction of a script subtag between language
> > and country code changes matters considerably, in a manner which
> > breaks backwards compatibility.
> 
> Explain where it breaks backwards compatibility, please.

See above.