draft-phillips-langtags-08, process, specifications, "stability", and extensions

Wed Jan 5 08:42:39 CET 2005

>  Date: 2005-01-03 02:09
>  From: "Peter Constable" <petercon at microsoft.com>

> > Precisely; an RFC 1766/3066 parser, based on the 1766 and
> > 3066 specifications, can expect four classes of language tags:
> > 1. ISO 639 language code as the primary subtag, optionally
> >    followed by an ISO 3166 country code as the second tag
> > 2. i as the primary tag; complete tag registered
> > 3. x as primary tag; private-use
> > 4. some other IANA-registered complete tag
> > 
> > "sr-CS-Latn" fits category 1. "sr-Latn-CS" fits none.
> 
> You are mistaken; "sr-Latn-CS" fits your category 4.

I think not; it is not a registered tag.  There is a possibility
that it could fit through the "no rules apart from the syntactic
ones for the third and subsequent tags" given the registration of
"sr-Latn" (you are correct about that; I missed it).  In that
respect, the choice of examples is poor; consider "en-US-Latn"
(category 1) vs. "en-Latn-US" (no category).
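
To make the four categories concrete, here is a rough sketch of how
a strict RFC 1766/3066 parser might classify a tag.  The three sets
are tiny hypothetical stand-ins for the real ISO 639, ISO 3166 and
IANA lists, and the function name is mine; the strict reading (a
second subtag that is not a country code is acceptable only if the
whole tag is registered) is the one argued above, not the only
possible one.

```python
import re

# Hypothetical, deliberately tiny stand-ins for the real lists.
ISO_639 = {"en", "sr"}
ISO_3166 = {"US", "CS"}
IANA_REGISTERED = {"sr-latn", "i-klingon"}  # complete tags, lowercased

# RFC 3066 syntax: every subtag is 1 to 8 ASCII letters or digits.
SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def classify(tag):
    """Return the category (1-4) of a tag, or None if it fits none."""
    parts = tag.split("-")
    if not all(SUBTAG.fullmatch(p) for p in parts):
        return None  # not even syntactically a language tag
    if not parts[0].isalpha():
        return None  # the primary subtag must be letters only
    lowered = tag.lower()
    primary = parts[0].lower()
    if primary == "i":
        # Category 2: "i" tags count only if registered as a whole.
        return 2 if lowered in IANA_REGISTERED else None
    if primary == "x":
        return 3  # Category 3: private use
    if lowered in IANA_REGISTERED:
        return 4  # Category 4: some other registered complete tag
    if primary in ISO_639:
        # Category 1: language code, optionally followed by a country
        # code; anything after that is opaque to the parser.
        if len(parts) == 1 or parts[1].upper() in ISO_3166:
            return 1
    return None


for t in ("en-US-Latn", "en-Latn-US", "sr-Latn", "sr-Latn-CS"):
    print(t, classify(t))
```

Under these assumptions "en-US-Latn" lands in category 1 and
"sr-Latn" in category 4, while "en-Latn-US" and "sr-Latn-CS" fit
none.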

> >  The draft
> > certainly changes that, in a way which an RFC 1766/3066
> > parser cannot be expected to cope.
> 
> [...] RFC 1766/3066 need to be able to deal with tags that contain
> pieces they don't know about -- the only subtags they can know
> about are initial subtags of "i", "x" or ISO 639 IDs, or a second
> subtag consisting of an ISO 3166 code in case the first subtag is
> an ISO 639 ID.

Right. I.e. they should be able to deal with superfluous stuff
on the right.  But not script tags that suddenly appear between
language code and country code.

> There are lots of other possible subtags permitted by RFC
> 1766/3066, including subtags that happen to be script IDs from ISO
> 15924. This draft does not change that in the slightest.

Only when they appear as the 3rd or a subsequent subtag, or as
part of a tag which is registered in its entirety.  The draft
certainly proposes to change that, in both respects.

> > Convince me by demonstrating that all deployed implementations
> > handle "sr-Latn-CS" at least no differently than "sr-CS-Latn".
> 
> Why? They should not, by design.

Again, poor choice of example. Consider "en-Latn-US" vs. "en-US-Latn".

If one wants something (presumably text) in US English in Latin
script, the latter string is a valid RFC 3066 language tag which
matches the known semantics of "en-US", even if the RFC 3066 parser
has no way of interpreting the 3rd (and any subsequent) subtag(s).
The former is not a *valid* language-tag (it is neither registered
in its entirety, nor does it begin with language code and country
code), nor could an RFC 3066 parser match it to anything more
specific than plain "en", and that's presuming that such a parser
would even attempt to match a known invalid tag against the set of
valid tags.  For the triple of
language/country/script to match usefully in the general case by
RFC 3066 parsers (which are unaware of script in general), the first
and second subtags would have to remain language code and country
code respectively.
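
The matching point can be sketched with the prefix rule from RFC
3066 section 2.5: a language-range matches a tag if it equals the
tag, or equals a prefix of the tag that is immediately followed by
"-".  The function name below is mine, and the sketch deliberately
skips the validity check mentioned above.

```python
def matches(language_range, tag):
    """RFC 3066 style prefix matching, compared case-insensitively."""
    r, t = language_range.lower(), tag.lower()
    # "*" matches any tag; otherwise the range must be the whole tag
    # or a prefix of it that ends exactly at a "-" boundary.
    return r == "*" or t == r or t.startswith(r + "-")


print(matches("en-US", "en-US-Latn"))  # True
print(matches("en-US", "en-Latn-US"))  # False
print(matches("en", "en-Latn-US"))     # True
```

A request for "en-US" therefore finds "en-US-Latn" but not
"en-Latn-US", which is reachable only through plain "en".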

> > The proposed draft would make "sr-CS-Latn"
> > illegal and would instead require "sr-Latn-CS" which cannot be
> > recognized as a valid language tag by an RFC 1766/3066 parser, let
> > alone matching against "sr-CS".
> 
> There is no reason why an RFC 1766/3066 parser should not recognize
> "sr-Latn-CS" as valid since it conforms to the syntax specified.

Perhaps (as mentioned above) due to the registered "sr-Latn", but
"en-Latn-US" is unquestionably not *valid* as far as an RFC 3066
parser is concerned; it might not be rejected as invalid on grounds
of syntax, but it fits none of the 4 valid categories noted earlier.

> > Registration of a complete tag is not itself a problem.  Registration
> > of a complete tag which incorporates script information is not an
> > ideal solution to the issue of conveying script information; that
> > would be more appropriately done using an orthogonal mechanism to
> > convey the orthogonal information...
> 
> That's one opinion; there are many who hold a different opinion.

See my separate message, which discusses a hypothetical search,
for an example of why text-specific considerations (and those
might include collating order as well as script) should be kept
separate.

> > How is what is proposed in this draft any more "cherry-
> > picking" than that?
> > 
> > 1. RFC 3454 is not BCP, and isn't being pushed through for immediate
> >    Standards status without a phased roll-in. The draft under discussion
> >    has been proposed as BCP which would lack phased roll-in.
> 
> So acceptability of selective usage depends upon whether the
> document is a BCP or a proposed standard?

No, but it is a difference between the two situations. The phased
roll-in of Standards Track procedure permits correction of errors
(I am not suggesting that RFC 3454 is erroneous, BTW); BCP does
not.

> > 2. RFC 3454 does not declare any parts of ISO 10646 as not valid and
> >    does not call for setting up an IANA registry of code points for the
> >    purpose of effectively declaring ISO 10646 code points invalid.  The
> >    draft under discussion explicitly seeks to set up a registry to
> >    replace use of ISO standard list.
> 
> RFC 3454 does say that some parts of ISO 10646 are not valid in
> strings output by stringprep implementations. This draft is
> analogous. If new characters are added to ISO 10646, it is certainly
> possible that RFC 3454 could be updated to exclude some of those new
> characters as well; what is proposed in this draft is analogous; the
> only difference is that the values considered invalid for the given
> purpose are documented in the IANA registry rather than in an RFC --
> which is certainly the easier way to maintain things, though perhaps
> it's not considered the preferred means of doing this in the IETF
> context.

No, the RFC 3454 considerations for what is valid are based on
protocol considerations, not on a Quixotic quest for "stability"
of nations.  There are also notable differences in specification
of special cases in an RFC (esp. Standards Track vs. BCP as noted
above) vs. a registry w.r.t. community review and conflict
resolution procedures.

> > 3. RFC 3454 does not seek to redefine the meaning of any ISO 10646 code
> >    points.  The draft under discussion does, as specifically noted in
> >    the case of the ISO 3166 code "CS".
> 
> This draft would not change the meaning of an ISO identifier; it
> simply does not use the latest assigned meaning[...]

The "latest assigned meaning" in this case is in documented use
on the Internet in language-tags; a change is therefore a change
in meaning from said use on the Internet.

> (Note: the draft itself does not entail that CS in particular should be handled one way or another

I quote:

   region| CS| Czechoslovakia| 2004-06-28| |

which is certainly not the meaning assigned by the ISO 3166
list and currently used in language-tags on the Internet:

SERBIA AND MONTENEGRO;CS

> > So, hypothetically, if some other standards body [...] you would have no issue with such
> > cherry-picking?
> 
> Well, it would be a concern [...]

And it's a concern in this case.  All the more so since it
would establish a precedent that might well lead to additional
instances such as illustrated by the hypothetical scenario.