draft-phillips-langtags-08, process, specifications, "stability", and extensions

Sun Jan 2 23:55:43 CET 2005

Bruce Lilly <blilly at erols dot com> wrote:

>> So, for instance, when an author uses "de-CH", what does he intend
>> recipients to understand to be the difference between that and
>> "de-DE" or even "de"? Neither RFC 1766 or RFC 3066 shed any light on
>> this, and ultimately only the author knows for sure.
>
> That's a somewhat different take on the issue; certainly the ability
> to use a generative mechanism (i.e. w/o review/registration of an
> entire tag) can lead to a proliferation of incompatible uses by
> independent generators (and possibly loss of interoperability as a
> result). The draft under discussion would expand use of generative
> mechanisms to encompass all but private-use tags, and thereby expands
> the potential for such incompatibilities and loss of interoperability.

That's an interesting viewpoint.  Giving tag generators more flexibility
is harmful because it expands the potential for them to screw up.  I
suppose that's true, if you look at it that way.

>>> sr-Latn-CS-gaulish-boont-guoyu-i-enochian-x-foo
>>> ...
>> I've shown that this is no different in general that what already
>> exists for RFC 3066 or RFC 1766.
>
> It is certainly different; under RFC 3066 rules such a tag (as a
> whole) would be subject to review and registration.

Your objection apparently is that silly people would be capable of
generating silly tags.

You've obviously worked with Internet standards in the past.  Is it
generally your experience that Internet standards are geared toward
solving 100% of all possible problems?  Would you say that Internet
protocols are more often built with an eye toward allowing knowledgeable
users to do what they need to do, or toward preventing stupid or
malicious users from misusing the protocol?

> No, you seem to have missed the point; there exist RFC 3066
> implementations. Such implementations, using the RFC 3066 rules,
> could match something like "sr-CS-Latn" to "sr-CS", but could
> not match "sr-Latn-CS" to "sr-CS".  By changing the definition of
> the interpretation of the second subtag, the proposed draft fails
> to be compatible with existing deployed implementations (which is
> what is meant by "backwards compatibility", which is a prime
> consideration for Internet protocols).

It is apparently your opinion that RFC 3066 should not be expanded at
all.  Existing RFC 3066 parsers will not be able to parse a script or
variant subtag, regardless of where it is placed within the tag.
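
For anyone following along, the RFC 3066 matching rule under discussion
is a plain prefix match on whole subtags.  A minimal Python sketch (the
function name is my own, and this is an illustration rather than any
deployed implementation):

```python
def rfc3066_matches(language_range: str, tag: str) -> bool:
    """Prefix matching in the style of RFC 3066: a range matches a tag
    if it equals the tag, or is a prefix of it ending at a "-" boundary."""
    if language_range == "*":
        return True
    r = language_range.lower()  # tags are compared case-insensitively
    t = tag.lower()
    return t == r or t.startswith(r + "-")


print(rfc3066_matches("sr-CS", "sr-CS-Latn"))  # True
print(rfc3066_matches("sr-CS", "sr-Latn-CS"))  # False
print(rfc3066_matches("sr", "sr-Latn-CS"))     # True
```

Note that only a range that names the region ("sr-CS") is affected by
where the script subtag sits; a language-only range ("sr") matches
either way.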

>> At this point, I feel confident that it is not a problem to combine
>> script IDs into "language" tags, and this is the consensus of the
>> domain experts that have been discussing this proposed revision for
>> the past year and more.
>
> Evidently w/o considering the implications of and for core Internet
> protocols.  If script *can be* specified in a language tag *between*
> the language code and country code, then a parser must be able to
> recognize that case and deal with it appropriately (which, as noted
> above, existing RFC 3066 implementations in deployed use do not and
> cannot do) at *any* time and in any context (context may not be
> available when a Content-Language field is parsed).  I don't have an
> issue with provision for specification of script where appropriate,
> but for crying out loud, at least do so in a compatible manner (e.g.
> a Content-Script field) rather than a) breaking compatibility with
> deployed protocols and b) burdening applications which need not be
> concerned with script from having to parse script information.

RFC 3066 generative tags contain subtags for language and (optionally)
region.  Applications that need not be concerned with region information
still have to parse it.

RFC 3066bis generative tags contain subtags for language and
(optionally) script, region, variant, extension, and private-use.
Applications that need not be concerned with any of those optional
pieces of information still have to parse them.

There is no difference here, except that RFC 3066bis allows more types
of subtags.  It cannot be acceptable for RFC 3066 to require parsers to
parse things they may not actively need, yet unacceptable for RFC 3066bis
to do the same, unless the goal here is to freeze RFC 3066 for all time.
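
To make the parsing burden concrete, here is a rough Python sketch of
how a parser might sort the subtags of a generative tag by length and
position.  The length rules are my simplified assumptions (4 letters =
script, 2 letters or 3 digits = region, anything else before a
singleton = variant); they are not the draft's normative grammar, and
the function names are hypothetical.

```python
def classify_subtags(tag):
    """Label each subtag of a generative tag (simplified, assumed rules)."""
    parts = tag.split("-")
    out = [("language", parts[0])]
    mode = "generative"  # switches after a one-character singleton
    for sub in parts[1:]:
        if mode == "private-use":
            kind = "private-use"          # everything after "x" is private
        elif len(sub) == 1:
            mode = "private-use" if sub.lower() == "x" else "extension"
            kind = mode + " singleton"
        elif mode == "extension":
            kind = "extension"
        elif len(sub) == 4 and sub.isalpha():
            kind = "script"
        elif (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit()):
            kind = "region"
        else:
            kind = "variant"
        out.append((kind, sub))
    return out


def primary_language(tag):
    """An application that only cares about language can stop here."""
    return tag.split("-")[0]


for kind, sub in classify_subtags("sr-Latn-CS-gaulish-boont-guoyu-i-enochian-x-foo"):
    print(kind, sub)
```

An application that has no interest in script, region, or anything else
can skip the classification entirely and use primary_language(), much as
it could already ignore the region under RFC 3066.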

>>>> There is a clear need for script codes...
>>>
>>> But none of that applies to an audio file of spoken material,
>>> where script would be superfluous...
>>
>> Not a problem: the proposed revision *allows* for the use of script
>> IDs but does not require them.
>
> Yes, it's a problem. Having allowed them, each parser must be able
> to handle them.

See above.

>> In the case of audio content, one simply would never include a script
>> ID.
>
> But a Content-Language field parser needs to be able to parse *any*
> Content-Language field, without knowledge of whether the content
> that is referred to by that field is audio, video, image, model,
> application, or text.  Generation is easy; printf("%s", whatever); --
> the problem is in parsing, particularly considering the deployed base
> of RFC 3066-compliant parsers.

See above.  RFC 1766-compliant parsers had to be upgraded to allow RFC
3066's 3-letter language subtags.  Of course that did not happen
overnight.  Why was that situation acceptable for 3066 but not for
3066bis?

> No, given a primary subtag which is a language code (and per RFCs
> 1766 and 3066, that's any primary subtag with 2 or more (RFC 3066
> only, more being limited to 3) characters), the second subtag --
> in either RFC 1766 or RFC 3066 language tags -- is always a country
> code and never a script code.  The proposed draft pulls the rug out
> from under existing parsers by changing that.

Peter and others have addressed this by now.  Under RFC 3066, the second
subtag is only guaranteed to be a region ("country") code if it is 2
letters long.
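
As I read RFC 3066, the rule for the second subtag depends on its
length.  A small Python sketch of that reading (the function name and
return labels are my own, and it assumes the primary subtag is a
language code rather than "i" or "x"):

```python
def rfc3066_second_subtag(tag):
    """Classify the second subtag of an RFC 3066 tag by its length."""
    parts = tag.split("-")
    if len(parts) < 2:
        return None                 # no second subtag at all
    second = parts[1]
    if len(second) == 2:
        return "country"            # interpreted as an ISO 3166 alpha-2 code
    if 3 <= len(second) <= 8:
        return "registered"         # meaning comes from an IANA registration
    return "reserved"               # 1-character second subtags are not assigned


print(rfc3066_second_subtag("de-CH"))     # country
print(rfc3066_second_subtag("zh-guoyu"))  # registered
```

A parser that treats every second subtag as a country code skips the
second branch entirely.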

> The issue at hand is the existing deployed base of RFC 3066
> implementations that depend on the matching algorithm specified
> therein (which doesn't work with a script tag interposed between
> language code and country code).

Implementations that assume this are using an oversimplified and
incomplete version of the RFC 3066 syntax.  They probably don't handle
registered (non-generative) tags correctly either.

> I would think that that's covered by the "difficulties which might
> arise..." part.  In any event, as the ISO seems to be in the process
> of tightening the rules, it would be a more productive and mutually
> beneficial process to convince the ISO to add specific language
> addressing specific issues than to go off in a hissy fit saying (in
> effect) "we're setting up a registry in competition with the ISO lists
> specifically to second-guess the ISO and its MA". [By a process which
> demonstrably doesn't abide by its own rules, I might add.]

You are expressly ignoring the point that Addison and I have already
made.

Countries will ALWAYS come and go, merge and split.  ISO 3166 will
ALWAYS add and delete codes to keep up with this fact of life.

One of the main goals of keeping a registry of region subtags based on,
but independent from, the ISO list of country codes is to prevent
existing tags that contain "obsolete" or "deprecated" or "withdrawn"
codes from suddenly becoming invalid and unrecognizable.

If ISO 3166/MA had replaced YU with some previously unused code, such as
SP, instead of CS, there would STILL have been a problem with existing
data tagged as "something-YU".  Adherence to the ISO 3166 code lists
made these "-YU" tags instantly invalid as of 2003-07-23.  The proposed
RFC 3066bis registry maintains the validity of YU.

This is NOT simply a matter of CS being reused.  It is also a matter of
TP and ZR and YU, and who knows what else in the future, being deleted
from the standard when there are existing langtags that use those codes.
The registry maintains the validity of codes that have been removed from
the ISO standards.
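
The difference between validating against the current ISO list and
validating against a registry that keeps withdrawn codes can be sketched
in a few lines of Python.  The data below is a tiny illustrative subset,
and the names and the "deprecated" label are my assumptions, not the
draft's registry format:

```python
# Illustrative subset of codes currently in ISO 3166 (not the full list).
ISO_3166_CURRENT = {"DE", "CH", "TL", "CD"}

# A registry keeps everything that was ever valid and only marks its status.
REGISTRY = {
    "DE": "current",
    "CH": "current",
    "TL": "current",
    "CD": "current",
    "YU": "deprecated",  # withdrawn from ISO 3166
    "TP": "deprecated",  # withdrawn from ISO 3166
    "ZR": "deprecated",  # withdrawn from ISO 3166
}


def valid_by_iso_list(region):
    return region.upper() in ISO_3166_CURRENT


def valid_by_registry(region):
    return region.upper() in REGISTRY


for code in ("DE", "YU", "TP", "ZR"):
    print(code, valid_by_iso_list(code), valid_by_registry(code))
```

Under the first check, data tagged "something-YU" stops validating the
day the code is withdrawn; under the second it stays valid and is merely
marked as deprecated.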

Characterizing this as "going off in a hissy fit," or "competing with"
or "second-guessing" the ISO maintenance agency, is just twisted.

>> The proposed revision does not create Internet-specific versions of
>> ISO standards; it uses IDs drawn from ISO standards with semantics
>> defined in those source standards at the time they were adopted for
>> use in language tags -- the source for the IDs, the symbols and their
>> meanings all reside in the ISO standards...
>
> By cherry-picking, it effectively seeks to establish such a version.

I'm sure I am not the only one who is getting sick of this deliberately
misleading and snide term "cherry-picking."  ISO 3166 country codes are
not "picked" for the registry, one by one, based on personal taste or
whimsy.

In cases where a single ISO code has two meanings, there is a specific
rule in Appendix C indicating an objective cutoff date for choosing
which meaning is to be used in the registry.  Draft-08 states that this
date is 2003-01-01.  That means that AI is used to mean Anguilla (its
new meaning), not French Afars and Issas (its old meaning), because the
new meaning was in effect before 2003-01-01.  CS, however, does mean
Czechoslovakia and not Serbia and Montenegro, because in this case the
new meaning was not assigned until 2003-07-23, after the cutoff, so the
*old* meaning takes priority.
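
One way to read the cutoff rule is "use the most recent meaning that was
assigned on or before the cutoff date."  The Python sketch below follows
that reading, which is my interpretation rather than the draft's wording.
Apart from 2003-07-23, the assignment dates are placeholders that only
preserve the correct order, and all names are hypothetical:

```python
from datetime import date

# (assignment date, meaning) per code; dates other than 2003-07-23 are
# placeholders chosen only to keep old-before-new ordering.
ASSIGNMENTS = {
    "AI": [(date(1974, 1, 1), "French Afars and Issas"),
           (date(1985, 1, 1), "Anguilla")],
    "CS": [(date(1974, 1, 1), "Czechoslovakia"),
           (date(2003, 7, 23), "Serbia and Montenegro")],
}


def registry_meaning(code, cutoff):
    """Return the latest meaning assigned on or before the cutoff date."""
    candidates = [(d, m) for d, m in ASSIGNMENTS[code] if d <= cutoff]
    if not candidates:
        return None
    return max(candidates)[1]  # latest assignment date wins


print(registry_meaning("AI", date(2003, 1, 1)))  # Anguilla
print(registry_meaning("CS", date(2003, 1, 1)))  # Czechoslovakia
print(registry_meaning("CS", date(2005, 1, 1)))  # Serbia and Montenegro
```

The last line shows why the choice of date matters: with a cutoff of
2005-01-01 the same rule would pick the newer meaning of CS.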

The rule is not "... the priority of which to make canonical SHALL be
based on whatever suits Addison and Mark's little fancy."

Now, you may claim that the date 2003-01-01 was specifically chosen to
make the CS issue come out in the desired way, and that would be fairly
close to the truth.  It was perceived that the older meaning of CS would
be more prevalent than the new meaning in existing data.  But given your
evidence from the IBM and HP Web sites, plus a general sense that this
cutoff date might be considered arbitrary, it appears possible that
draft-09 will move the date to either 2005-01-01 or the date of
acceptance of the draft as an RFC, whenever that may be.  (I know you
are not interested in talking about draft-09, but there it is.)

> It does not give leave to cherry-pick bits and pieces of an external
> specification.  RFC 3066 does not do so. The draft under discussion
> does.

Doesn't the DNS "cherry-pick bits and pieces" from ISO 3166 by using
".uk" instead of ".gb", and ".yu" instead of ".cs"?

> Has ISO transferred change control to the IETF so that it can declare
> some codes invalid?

Has anyone tried to claim that RFC 3066bis has something to do with
change control of an ISO standard?  Where do you GET this from?

> The ISO, as developers of ISO 639 and 3166, have rights. In
> particular, they have the right to determine what those standards
> specify -- in whole -- and they have the right to revise and amend
> those standards, and are the sole arbiters of what is (and what is
> not) "valid".

ISO cannot, and does not attempt to, prevent the creation of
specifications that are *based on* ISO standards but differ from them in
some documented way, as RFC 3066bis and the Domain Name System do.

More later.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/