draft-phillips-langtags-08, process, specifications, "stability", and extensions

Sun Jan 2 19:21:58 CET 2005

>  Date: 2005-01-01 21:27
>  From: "Peter Constable" <petercon at microsoft.com>
>  To: ietf-languages at alvestrand.no, ietf at ietf.org

> > Separating the specification
> > of language via a field from registration procedure was entirely
> > appropriate, as BCP documents are used for procedures and policies
> > and not for technical specifications.
> 
> I'm not as deeply versed in the distinct purposes of distinct kinds of documents used for Internet specifications as you. It's my understanding that there is a general expectation that BCP documents are used for something like a registration procedure but not for a technical specification.

Details are mostly in RFC 2026, though an understanding of
RFCs 2418 and 3160 may help.  IETF Standards exist primarily
to ensure interoperability.  Technical specifications
generally progress on a 3-step phased roll-in to ensure
that that goal is met; that is the Standards Track path
(Proposed, Draft, full Standard).  The specific requirements
for each level are detailed in RFC 2026; among them is
the existence of multiple independent interoperable
implementations (I'll return to that).  Registration
procedures are generally not suitable for such a phased
approach, as there is usually no way to phase in a
registry, and a single global registry obviously can't
have multiple implementations.

> We already anticipate future revisions, and it would be a possibility to consider whether a division of the content into distinct documents of different types would be better. In view of the time already taken, the delays incurred, and the fact that there have been products in development that have been assuming the completion of this current round of revision, I think it makes best sense to allow the mixing that has existed since RFC 1766 to persist in a BCP for this round.

I see no reason not to consider separation now, and
several reasons to do so.

> Again, apart from knowing the preferred divisions of content into different types of IETF documents, I don't think there's a problem in describing one type of matching algorithm in this document provided it is recognized that some applications may require different algorithms.

The distinct types of documents are important; where there
are multiple possible algorithms and there is no one algorithm
with clear technical superiority, multiple Experimental RFCs
are generally issued so that implementors can experiment with
the various algorithms and gain experience which may lead to
selection of one of the algorithms, modifications to one or
more of them, combinations of features, etc.  It would be
inappropriate to specify any one such algorithm as BCP initially,
as BCP does not have provision for generating experience through
phased roll-in or experimentation.

> And, again, I don't think it would be particularly helpful to delay completion of this revision any further to address an issue of mixing different kinds of content in a BCP that has existed through two versions already.

It need not introduce any delay; indeed it may speed the process
of moving the registration process and the tag syntax specification
by removing one objection.

[re. proposal to register only subtags, eliminate review/registration
of tags, potential for proliferation of incompatible, non-interoperable
uses]
> That was an issue I initially voiced when it was first suggested that the registry be a registry of subtags rather than tags. In practice, I'm not sure at this point that there's really a significantly greater problem with the new level of generativity than there was before.

I don't see how one can speak of "In practice" in relation to a
change which has not gone into effect.  OTOH, the possibility
of loss of interoperability is clear.

> The reason is that the new elements have quite specific semantic effects on the whole, whereas the semantic impact of a region ID on the whole is less certain: it may imply dialectal variants, spelling variants; it might actually reflect nothing but simply have been inserted because it could. In contrast, there is little question of what effect a script ID such as "Hans" has on the whole. 
[...]
> You're describing a pathological case that is never going to occur, which I don't think is particularly helpful. If the meaning of "guoyu" and of "boont" are clearly documented, it will be evident that something like "sr-boont-guoyu" is basically as meaningless as "sr-guoyu" or as "pl-Hant-TH". 

You're getting hung up on a specific example and missing the
general principle that the example illustrates.  The meaning
of the various subtags is no more clear (in the sense that
you describe) than now; moreover, there is no way that an
existing RFC 1766/3066 implementation is going to magically
"discover" new semantics or accept new syntax.

> Likewise, it really isn't necessary IMO to insist on registration of complete tags vs. subtags just to avoid tags that are not useful, inconsistent with reality, or logically impossible. 

That is not the sole reason for review and registration.

> > No, you seem to have missed the point; there exist RFC 3066
> > implementations. Such implementations, using the RFC 3066 rules,
> > could match something like "sr-CS-Latn" to "sr-CS", but could
> > not match "sr-Latn-CS" to "sr-CS". By changing the definition of
> > the interpretation of the second subtag, the proposed draft fails
> > to be compatible with existing deployed implementations (which is
> > what is meant by "backwards compatibility", which is a prime
> > consideration for Internet protocols).
> 
> Ah, but RFC 3066 does not sanction use of tags like "sr-CS-Latn" without registration, and no such tags are registered. 

Precisely; an RFC 1766/3066 parser, based on the 1766 and
3066 specifications, can expect four classes of language tags:
1. ISO 639 language code as the primary subtag, optionally
   followed by an ISO 3166 country code as the second subtag
2. i as the primary subtag; complete tag registered
3. x as the primary subtag; private-use
4. some other IANA-registered complete tag

"sr-CS-Latn" fits category 1. "sr-Latn-CS" fits none.

> Because of the prevalence of implementations that use a left-prefix matching algorithm, it is more useful to combine elements in the order "sr-Latn-CS" rather than "sr-CS-Latn". If "sr-CS-Latn" were used, these implementations would fail to match "sr-CS-Latn" with "sr-Latn", which is actually a greater problem than failing to match "sr-Latn-CS" with "sr-CS".

You are ignoring backwards compatibility.

> > > At this point, I feel confident that it is not a problem to combine
> > > script IDs into "language" tags, and this is the consensus of the domain
> > > experts that have been discussing this proposed revision for the past year
> > > and more.
> > 
> > Evidently w/o considering the implications of and for core Internet
> > protocols. 
> 
> You assume this is w/o such consideration. I think otherwise. I can't say that consideration has been given to every single individual protocol. But consideration has been given to many different protocols and usage scenarios. I think it's appropriate for the onus to be on someone to identify particular problems they feel would exist with protocols that concern them (which is precisely the kind of thing we have last-call announcements for).

I know of three and a half Internet protocols that make use
of language tags, and as far as I can tell, none were considered
prior to this discussion:
1. The Internet message format (STD 11, also RFC 2822 as amended
   by RFC 3282) [Content-Language, Accept-Language fields]
2. MIME (RFCs 2045-2049), which uses encoded-words (RFC 2047 as
   amended by RFC 2231 and errata) to indicate charset and
   language of human-readable text in a manner consistent with
   BCP 18
3+. HTTP and SIP, which are similar protocols that also may make
   use of RFC 3282 fields.
There are in addition several protocols that transfer content in
STD 11 format, but which do not specifically process language tags
which might be used within such messages.

> > If script *can be* specified in a language tag *between*
> > the language code and country code, then a parser must be able to
> > recognize that case and deal with it appropriately (which, as noted
> > above, existing RFC 3066 implementations in deployed use do not and
> > cannot do) at *any* time and in any context (context may not be
> > available when a Content-Language field is parsed). 
> 
> As described above, I think this argument is invalid.

I think a detailed review of existing implementations is probably
called for prior to further work on the tag syntax; we need to
know precisely where backwards compatibility issues arise.

> > I don't have an
> > issue with provision for specification of script where appropriate,
> > but for crying out loud, at least do so in a compatible manner (e.g.
> > a Content-Script field) rather than a) breaking compatibility with
> > deployed protocols and b) burdening applications which need not be
> > concerned with script from having to parse script information.
> 
> I've stated that the imputed back-compat problem is a non-issue.

You haven't convinced me of that.  Show me source code of an
existing, deployed, RFC 3066 parser that handles "sr-Latn-CS".

> Lots of consideration was given early on to this.

You haven't convinced me of that either. Indeed, your earlier
comments about 1.8k-octet+ language-tags convince me that
core Internet protocols have not been considered.

> If you want to press this argument, I think you need to show exactly how a problem would result in realistic usage scenarios.  

I have explained the classes of tags described by RFCs
1766 and 3066, and how the proposed changed syntax permits
tags which do not fit in any of those classes.  In the
interest of interoperability, I believe the onus is on the
proposers of the revised format to demonstrate that existing
deployed implementations will be able to handle the revised
syntax with no loss in functionality (meaning, e.g., that
"sr-Latn-CS" must be recognizable by all such deployed
implementations and be interpreted as equivalent to "sr-CS").

> Can you identify for us an Internet protocol that would not be concerned with script distinctions? 

1. An STD 11 Internet text message in English (no script
   distinctions, everything in ANSI X3.4)
2+. A MIME-part with type audio/32kadpcm or any of the
   other 67 registered audio subtypes, transferable
   within a MIME message conforming to STD 11 or via HTTP.
   Likewise for any of the registered video subtypes which
   may contain audio.

> Can you identify an Internet protocol for which matching algorithms imply that "sr-CS-Latn" makes better sense than "sr-Latn-CS"?

The issue is backwards compatibility with RFC 1766/3066
parsers, and cuts across all Internet protocols using
language-tags.

> > > > > There is a clear need for script codes...
> > >
> > > > But none of that applies to an audio file of spoken material,
> > > > where script would be superfluous...
> > >
> > > Not a problem: the proposed revision *allows* for the use of script IDs
> > > but does not require them.
> > 
> > Yes, it's a problem. Having allowed them, each parser must be able
> > to handle them.
> 
> Look, they're already there in registered tags. This draft isn't doing anything new in that regard.

RFC 1766/3066 registered tags are integral tags, and can't
meaningfully (in the context of a parser) be said to
contain a script subtag; the entire tag needs to be recognized
by a 1766/3066 parser and treated as a unit.  The draft
certainly changes that, in a way with which an RFC 1766/3066
parser cannot be expected to cope.

> > > > and, as noted above, would
> > > > lead to loss of backwards compatibility.
> > >
> > > But, as noted above, this is not an issue that is peculiar to the
> > > proposed revision -- it already existed in RFC 3066.
> > 
> > No, given a primary subtag which is a language code (and per RFCs
> > 1766 and 3066, that's any primary subtag with 2 or more (RFC 3066
> > only, more being limited to 3) characters), the second subtag --
> > in either RFC 1766 or RFC 3066 language tags -- is always a country
> > code and never a script code.
> 
> Go back and read RFC 3066 again. It does not impose that constraint:
> 
> <quote>
>    The following rules apply to the second subtag:
> 
>    - All 2-letter subtags are interpreted as ISO 3166 alpha-2 country
>      codes from [ISO 3166], or subsequently assigned by the ISO 3166
>      maintenance agency or governing standardization bodies, denoting
>      the area to which this language variant relates.
> 
>    - Tags with second subtags of 3 to 8 letters may be registered with
>      IANA, according to the rules in chapter 5 of this document.
> </quote>
> 
> It must be a country ID *if* it is two letters, but not otherwise.

But the tag must be registered as a complete tag if the second subtag
is not an ISO country code; and you have yourself noted that
"sr-Latn-CS" is not so registered.

> > The proposed draft pulls the rug out
> > from under existing parsers by changing that.
> 
> You are completely mistaken on this point -- the proposed draft does not change the constraint you assumed as that constraint never existed.

No, see discussion above re. "sr-Latn-CS" vs. classes of tags
provided for in RFCs 1766/3066.

> I don't think it's that uncommon to refer to a specification A that makes use of another specification B as an application of B.

Perhaps, but I think it's best to avoid misunderstanding in
technical discussion by being precise in use of terminology.

> > and ignoring the critical importance of
> > backwards compatibility.
> 
> As stated earlier, I quite disagree that back-compat issues have been ignored.

Convince me by demonstrating that all deployed implementations
handle "sr-Latn-CS" at least no differently than "sr-CS-Latn".

> > > Note that there is nothing that prevents other applications from using
> > > other matching algorithms, including perhaps something that is able to
> > > recognize in "az-AZ" and "az-Latn-AZ" that both involve Azeri and used in
> > > Azerbaijan.
> > 
> > The issue at hand is the existing deployed base of RFC 3066
> > implementations that depend on the matching algorithm specified
> > therein (which doesn't work with a script tag interposed between
> > language code and country code).
> 
> You say that these do not work; these implementations will still work, but they will match "sr-Latn" but not "sr-CS" with "sr-Latn-CS". If that is a problem, please explain why.

No, unregistered "sr-Latn" is not a valid RFC 3066 language-tag. Nor
is "sr-Latn-CS".  "sr-CS-Latn" is likely valid (the first two subtags
are legal and have defined interpretation; RFC 3066 says that there
are no requirements (implicitly including registration) other than
syntax for third and subsequent subtags). "sr-CS" is clearly valid
and in use.  An RFC 1766/3066 parser/matcher has a chance of matching
legal "sr-CS-Latn" containing script designation with legal "sr-CS"
(no script specified). The proposed draft would make "sr-CS-Latn"
illegal and would instead require "sr-Latn-CS", which cannot be
recognized as a valid language tag by an RFC 1766/3066 parser, let
alone matched against "sr-CS".
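
The matching half of this can be sketched with the prefix rule that
RFC 3066 describes for language-ranges (a range matches a tag if it
equals the tag, or equals a prefix of the tag that is followed by "-").
The function name is an assumption, and the sketch deliberately skips
validity checking, which is the separate problem described above.

```python
def range_matches(language_range, tag):
    """Prefix matching in the style of RFC 3066 language-ranges.

    Comparison is case-insensitive; "*" matches any tag.
    Validity of the tag itself is NOT checked here.
    """
    wanted = language_range.lower()
    candidate = tag.lower()
    if wanted == "*":
        return True
    # Exact match, or a prefix that ends exactly at a subtag boundary.
    return candidate == wanted or candidate.startswith(wanted + "-")
```

With this rule the range "sr-CS" matches "sr-CS-Latn" but not
"sr-Latn-CS", while the range "sr-Latn" matches "sr-Latn-CS" but not
"sr-CS-Latn"; whichever order is chosen, one of the two prefixes stops
matching.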

> > As previously noted, that is a danger recognized by RFC 2026 in
> > activity that does not conform to IETF procedures; it is
> > possible to reach good consensus on the wrong approach.
> 
> Well, that potential was created when RFC 1766 was first approved.

True, but the statute of limitations on issues related to RFC 1766
has long expired. Not so for the currently discussed draft.  There
is an opportunity to get the work on the right track by setting up
an official IETF Working Group.

> Tags like az-Latn could have been registered under the terms of that RFC just as readily as RFC 3066. 
> 
> But you are speaking as though it's a problem that these tags are registered. I have no idea why.

Registration of a complete tag is not itself a problem.  Registration
of a complete tag which incorporates script information is not an
ideal solution to the issue of conveying script information; that
would be more appropriately done using an orthogonal mechanism to
convey the orthogonal information (in which case there would be no
discussion about the ordering within a tag, because the information
would be separate).

> > > 7.1 says...
> 
> > > The proposed revision does not create Internet-specific versions of ISO
> > > standards...
> 
> > By cherry-picking, it effectively seeks to establish such a version.
> 
> I would not call what is done "cherry-picking". Any identifier defined in the source standard is valid for use, except in the case that the identifier was previously defined with a different meaning in that ISO standard. That isn't cherry-picking; that is a blindly-applied general principle, created with reasoned motivation: to provide stability.

The effect is certainly cherry-picking, and as noted w.r.t. "sr-CS"
it destabilizes currently deployed usage.

> But speaking of selective usage, have you noticed that RFC 3454 identifies specific characters from ISO/IEC 10646 as prohibited? Various space and control characters are not permitted, INVISIBLE TIMES isn't permitted, END OF AYAH isn't permitted, COMBINING GRAVE TONE MARK isn't permitted... How is what is proposed in this draft any more "cherry-picking" than that?

1. RFC 3454 is not BCP, and isn't being pushed through for immediate
   Standards status without a phased roll-in. The draft under discussion
   has been proposed as BCP which would lack phased roll-in.
2. RFC 3454 does not declare any parts of ISO 10646 as not valid and
   does not call for setting up an IANA registry of code points for the
   purpose of effectively declaring ISO 10646 code points invalid.  The
   draft under discussion explicitly seeks to set up a registry to
   replace use of the ISO standard lists.
3. RFC 3454 does not seek to redefine the meaning of any ISO 10646 code
   points.  The draft under discussion does, as specifically noted in
   the case of the ISO 3166 code "CS".

> > > 10.1 states a general policy regarding IP...
> 
> > The ISO, as developers of ISO 639 and 3166, have rights. In particular,
> > they have the right to determine what those standards specify -- in
> > whole -- and they have the right to revise and amend those standards,
> > and are the sole arbiters of what is (and what is not) "valid".
> 
> They certainly have and retain rights over standards for language, script and country identifiers. They do not, however, determine what is valid for use in Internet protocols. Just as it is appropriate for an IETF document RFC 3454 to specify for particular reasons that certain encoded entities of ISO/IEC 10646 are not valid for Stringprep output, so also it is appropriate for an IETF document to specify for particular reasons that certain encoded entities of an ISO standard are not valid for use in language tags used on the Internet.

So, hypothetically, if some other standards body, say W3C, were to declare
that "CS" used in a language-tag in an application profile of SGML (i.e.
not an Internet protocol) meant something other than what the draft
under discussion would have it mean, while importing the meaning of other
language tag components w/o change, you would have no issue with such
cherry-picking?