RE: draft-phillips-langtags-08, process, specifications, "stability", and extensions

Mon Jan 3 08:09:31 CET 2005

> From: ietf-languages-bounces at alvestrand.no [mailto:ietf-languages-bounces at alvestrand.no] On Behalf Of Bruce Lilly

> > Ah, but RFC 3066 does not sanction use of tags like "sr-CS-Latn" without
> > registration, and no such tags are registered.
> 
> Precisely; an RFC 1766/3066 parser, based on the 1766 and
> 3066 specifications, can expect four classes of language tags:
> 1. ISO 639 language code as the primary subtag, optionally
>    followed by an ISO 3166 country code as the second tag
> 2. i as the primary tag; complete tag registered
> 3. x as primary tag; private-use
> 4. some other IANA-registered complete tag
> 
> "sr-CS-Latn" fits category 1. "sr-Latn-CS" fits none.

You are mistaken; "sr-Latn-CS" fits your category 4.

> > I've stated that the imputed back-compat problem is a non-issue.
> 
> You haven't convinced me of that.  Show me source code of an
> existing, deployed, RFC 3066 parser that handles "sr-Latn-CS".

It matches the RFC 3066 syntax, and so can be recognized; the notion of language-range is still applicable, and nothing about that tag would prevent language-range handling. In what way could a parser *not* handle it? Even if I had my finger on source code, I couldn't demonstrate that a fault doesn't exist if all you say is "there's a problem in there somewhere".
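
To make the syntax point concrete, here is a minimal sketch in Python of a purely syntactic check. It assumes only the RFC 3066 grammar (a primary subtag of 1 to 8 letters, followed by any number of hyphen-separated subtags of 1 to 8 letters or digits). The names TAG_SYNTAX and is_well_formed are my own inventions for illustration and are not taken from any deployed parser.

```python
import re

# RFC 3066 grammar: a primary subtag of 1-8 letters, then zero or more
# subtags of 1-8 letters or digits, each introduced by a hyphen.
TAG_SYNTAX = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def is_well_formed(tag: str) -> bool:
    """Return True if the tag fits the RFC 3066 syntax (no registry lookup)."""
    return TAG_SYNTAX.fullmatch(tag) is not None


for tag in ("sr-CS", "sr-Latn", "sr-Latn-CS", "sr-CS-Latn", "sr--CS"):
    print(tag, is_well_formed(tag))
```

A check like this says nothing about registration; it only shows that, at the syntax level, the two orderings of subtags are indistinguishable.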

> > If you want to press this argument, I think you need to show exactly how
> > a problem would result in realistic usage scenarios.
> 
> I have explained the classes of tags described by RFCs
> 1766 and 3066, and how the proposed changed syntax permits
> tags which do not fit in any of those classes.

It has been shown that you have described such classes incorrectly.

> In the
> interest of interoperability, I believe the onus is on the
> proposers of the revised format to demonstrate that existing
> deployed implementations will be able to handle the revised
> syntax with no loss in functionality (meaning, e.g., that
> "sr-Latn-CS" must be recognizable by all such deployed
> implementations and be interpreted as equivalent to "sr-CS").

Why is it a requirement that a request for "sr-CS" must match "sr-Latn-CS"? That's quite unreasonable. It's like saying that, when a bunch of new characters are added to Unicode, existing implementations should recognize strings using the new characters as being equivalent to strings using only existing characters. *That tag represents _new_ functionality.*

> > Look, they're already there in registered tags. This draft isn't doing
> > anything new in that regard.
> 
> RFC 1766/3066 registered tags are integral tags, and can't
> meaningfully (in the context of a parser) be said to
> contain a script subtag;

If one is registered with a script subtag, then it contains a script subtag.

> the entire tag needs to be recognized
> by a 1766/3066 parser and treated as a unit.

And nothing prevents that happening with a tag containing a script subtag.

>  The draft
> certainly changes that, in a way which an RFC 1766/3066
> parser cannot be expected to cope.

Not at all. RFC 1766/3066 parsers need to be able to deal with tags that contain pieces they don't know about -- the only subtags they can know about are initial subtags of "i", "x" or ISO 639 IDs, or a second subtag consisting of an ISO 3166 code in case the first subtag is an ISO 639 ID. There are lots of other possible subtags permitted by RFC 1766/3066, including subtags that happen to be script IDs from ISO 15924. This draft does not change that in the slightest.
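
As a rough sketch of what such a parser can and cannot know from position alone, consider the following Python. It is an illustration under stated assumptions, not code from any real implementation: the function name classify_subtags and the label strings are hypothetical, and it treats any 2- or 3-letter primary subtag as an ISO 639 ID without consulting the actual code lists.

```python
def classify_subtags(tag: str) -> list:
    """Label each subtag with what an RFC 1766/3066 parser can infer by position."""
    subtags = tag.split("-")
    first = subtags[0].lower()
    if first == "x":
        labels = ["private-use"]
    elif first == "i":
        labels = ["IANA-registered"]
    elif len(first) in (2, 3) and first.isalpha():
        # Assumption: length alone stands in for a lookup in ISO 639.
        labels = ["ISO 639 language code"]
    else:
        labels = ["unknown primary subtag"]

    for position, subtag in enumerate(subtags[1:], start=2):
        is_country = (
            position == 2
            and labels[0] == "ISO 639 language code"
            and len(subtag) == 2
            and subtag.isalpha()
        )
        # Anything else is a piece the parser does not know about.
        labels.append("ISO 3166 country code" if is_country else "opaque")
    return list(zip(subtags, labels))


print(classify_subtags("sr-CS-Latn"))
print(classify_subtags("sr-Latn-CS"))
```

In both cases the parser ends up carrying at least one subtag it cannot interpret, which is exactly the situation RFC 1766/3066 already requires it to tolerate.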

> Convince me by demonstrating that all deployed implementations
> handle "sr-Latn-CS" at least no differently than "sr-CS-Latn".

Why? They should not, by design.

> > > The issue at hand is the existing deployed base of RFC 3066
> > > implementations that depend on the matching algorithm specified
> > > therein (which doesn't work with a script tag interposed between
> > > language code and country code).
> >
> > You say that these do not work; these implementations will still work,
> > but they will match "sr-Latn" but not "sr-CS" with "sr-Latn-CS". If that
> > is a problem, please explain why.
> 
> No, unregistered "sr-Latn" is not a valid RFC 3066 language-tag. Nor
> is "sr-Latn-CS".  "sr-CS-Latn" is likely valid (the first two subtags
> are legal and have defined interpretation; RFC 3066 says that there
> are no requirements (implicitly including registration) other than
> syntax for third and subsequent subtags). "sr-CS" is clearly valid
> and in use. An RFC 1766/3066 parser/matcher has a chance of matching
> legal "sr-CS-Latn" containing script designation with legal "sr-CS"
> (no script specified). 

In your comments here, you are being rather loose in your assessment of what is or isn't valid. The tag "sr-Latn" is a registered, valid RFC 3066 language tag. The tag "sr-Latn-CS" is not registered, but could be and would be valid if registered. The tag "sr-CS" is certainly valid; I have no idea how widely it is used. The tag "sr-CS-Latn" would be valid if registered, but is not registered (and it is unlikely that, if requested, a consensus could be obtained to register it, given the preference among those involved in reviewing requests for a different ordering of subtags).

*If* "sr-CS-Latn" were registered (it is not), then a language-range matcher *must* match a request of "sr-CS" with content tagged "sr-CS-Latn". In precisely the same way, if "sr-Latn-CS" were registered, a language-range matcher would, and without modification could, match a request of "sr-Latn" with "sr-Latn-CS".

You cannot say that "sr-Latn-CS" has any less or more likelihood of being handled by existing language-range matchers than "sr-CS-Latn". Either the matchers work per the terms of RFC 3066 or they do not, and RFC 3066 does not indicate that either of these is any less valid than the other.
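
The matching rule in question is small enough to sketch. The Python below assumes the RFC 3066 definition (a language-range matches a tag if it equals the tag, or equals a prefix of the tag that is immediately followed by "-", with "*" matching everything and comparison being case-insensitive). The function name range_matches is mine, chosen for illustration.

```python
def range_matches(language_range: str, tag: str) -> bool:
    """RFC 3066 language-range matching: exact match or left-prefix up to a '-'."""
    if language_range == "*":
        return True
    language_range, tag = language_range.lower(), tag.lower()
    return tag == language_range or tag.startswith(language_range + "-")


# The same unmodified matcher handles either ordering of subtags.
print(range_matches("sr-CS", "sr-CS-Latn"))
print(range_matches("sr-Latn", "sr-Latn-CS"))
print(range_matches("sr-CS", "sr-Latn-CS"))
```

Nothing in this rule looks at what kind of subtag sits in which position, so it has no basis for treating one ordering as more valid than the other.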

> The proposed draft would make "sr-CS-Latn"
> illegal and would instead require "sr-Latn-CS" which cannot be
> recognized as a valid language tag by an RFC 1766/3066 parser, let
> alone matching against "sr-CS".

There is no reason why an RFC 1766/3066 parser should not recognize "sr-Latn-CS" as valid since it conforms to the syntax specified.

A language-range matcher should match "sr-Latn-CS" against a request for "sr-Latn", but not "sr-CS". That is by design since a left-prefix matching algorithm is limited in what tags it can match, and it is considered more important to match for script than for regional variations.
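
To show why subtag order decides what a left-prefix algorithm can match, this short Python sketch enumerates every language-range (other than "*") that can match a given tag. The helper name matching_ranges is hypothetical.

```python
def matching_ranges(tag: str) -> list:
    """List every language-range that left-prefix matching can match to the tag."""
    subtags = tag.split("-")
    return ["-".join(subtags[:n]) for n in range(1, len(subtags) + 1)]


print(matching_ranges("sr-Latn-CS"))  # ['sr', 'sr-Latn', 'sr-Latn-CS']
print(matching_ranges("sr-CS-Latn"))  # ['sr', 'sr-CS', 'sr-CS-Latn']
```

Only one of "sr-Latn" and "sr-CS" can ever appear in the list for a given tag, so the ordering of subtags amounts to a choice about which distinction requests can be made on.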

> > But you are speaking as though it's a problem that these tags are
> > registered. I have no idea why.
> 
> Registration of a complete tag is not itself a problem.  Registration
> of a complete tag which incorporates script information is not an
> ideal solution to the issue of conveying script information; that
> would be more appropriately done using an orthogonal mechanism to
> convey the orthogonal information...

That's one opinion; there are many who hold a different opinion.

> > But speaking of selective usage, have you noticed that RFC 3454
> > identifies specific characters from ISO/IEC 10646 as prohibited? Various
> > space and control characters are not permitted, INVISIBLE TIMES isn't
> > permitted, END OF AYAH isn't permitted, COMBINING GRAVE TONE MARK isn't
> > permitted... How is what is proposed in this draft any more
> > "cherry-picking" than that?
> 
> 1. RFC 3454 is not BCP, and isn't being pushed through for immediate
>    Standards status without a phased roll-in. The draft under discussion
>    has been proposed as BCP which would lack phased roll-in.

So acceptability of selective usage depends upon whether the document is a BCP or a proposed standard? I cannot see anything in RFC 2026 that suggests that (and it seems pretty odd).

> 2. RFC 3454 does not declare any parts of ISO 10646 as not valid and
>    does not call for setting up an IANA registry of code points for the
>    purpose of effectively declaring ISO 10646 code points invalid.  The
>    draft under discussion explicitly seeks to set up a registry to
>    replace use of ISO standard list.

RFC 3454 does say that some parts of ISO 10646 are not valid in strings output by stringprep implementations, and this draft is analogous. If new characters are added to ISO 10646, RFC 3454 could certainly be updated to exclude some of those new characters as well. The only difference is that, under this draft, the values considered invalid for the given purpose are documented in the IANA registry rather than in an RFC -- which is certainly the easier way to maintain things, though perhaps it's not considered the preferred means of doing this in the IETF context.
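
For anyone who wants to check the prohibition claim, here is a small sketch using the stringprep module from Python's standard library, which exposes the RFC 3454 tables as lookup functions. I am assuming that the three characters mentioned earlier fall under tables C.2.2 (non-ASCII control characters) and C.8 (change display properties or deprecated); the helper name prohibited_tables is my own.

```python
import stringprep


def prohibited_tables(char: str) -> list:
    """Return which of the two RFC 3454 prohibition tables checked here list char."""
    tables = []
    if stringprep.in_table_c22(char):
        tables.append("C.2.2")
    if stringprep.in_table_c8(char):
        tables.append("C.8")
    return tables


# INVISIBLE TIMES, ARABIC END OF AYAH, COMBINING GRAVE TONE MARK, and "A"
for code_point in (0x2062, 0x06DD, 0x0340, 0x0041):
    print(f"U+{code_point:04X}", prohibited_tables(chr(code_point)))
```

The characters remain perfectly valid ISO 10646 code points; they are simply excluded for one purpose, which is the sense in which the draft's registry is analogous.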

> 3. RFC 3454 does not seek to redefine the meaning of any ISO 10646 code
>    points.  The draft under discussion does, as specifically noted in
>    the case of the ISO 3166 code "CS".

This draft would not change the meaning of an ISO identifier; it simply does not use the latest assigned meaning where a prior ISO-assigned meaning is already in use on the Internet.

(Note: the draft itself does not entail that CS in particular should be handled one way or another, and the question of the best handling of CS to provide stability on the Internet is open to comment as a separate issue from the draft itself.)

> So, hypothetically, if some other standards body, say W3C, were to declare
> that "CS" used in a language-tag in an application profile of SGML (i.e.
> not an Internet protocol) meant something other than what the draft
> under discussion would have it mean while importing the meaning of other
> language tag components w/o change, you would have no issue with such
> cherry-picking?

Well, it would be a concern, though it is their prerogative to do what they want in their specifications. Since W3C has consistently referenced RFC 1766/3066, however, this is no more than a purely hypothetical question -- I have no expectation of such a thing ever happening.

Peter Constable