draft-phillips-langtags-08, process, specifications, and extensions

Mon Jan 3 01:47:21 CET 2005

Hi Bruce,

> Even if by some oversight or lapse of judgment the tag
> "en-US" were to be registered, its interpretation by a
> parser would be as an ISO 639 language code followed by
> an ISO 3166 country code.  Such a registration would
> therefore be pointless.  In practice, therefore, it
> simply wouldn't happen.

I direct you to the sgn-XX registrations. Informative registrations of this sort *have* happened.

> > It would be entirely possible for "en-Latn-US-boont" to be
> > registered under the terms of RFC 3066.
> 
> But it hasn't been. No RFC 3066 parser will therefore find
> that complete tag in its list of IANA registered tags, nor
> will it be able to interpret "Latn" as an ISO 3166 2-letter
> country code.

RFC 3066 parsers already should not interpret "Latn" as an ISO 3166 region code. It isn't two letters long.
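
To make the length rule concrete, here is a minimal sketch of how a parser might classify the second subtag by length alone, following section 2.2 of RFC 3066. The function name and the returned labels are my own, purely for illustration, and the sketch assumes the tag has already been split on hyphens correctly.

```python
def classify_second_subtag(tag: str) -> str:
    """Classify the second subtag of a language tag by its length.

    Hypothetical helper: the name and the returned labels are illustrative only.
    It looks at length alone and does not consult ISO 3166 or the IANA registry.
    """
    subtags = tag.split("-")
    if len(subtags) < 2:
        return "none"              # primary subtag only, e.g. "en"
    second = subtags[1]
    if len(second) == 1:
        return "reserved"          # not assignable without revising the RFC
    if len(second) == 2:
        return "iso3166-country"   # read as an ISO 3166 alpha-2 code
    if 3 <= len(second) <= 8:
        return "iana-registrable"  # only meaningful as part of a registered tag
    raise ValueError(f"subtag {second!r} is not 1 to 8 characters long")


print(classify_second_subtag("en-US"))       # iso3166-country
print(classify_second_subtag("sr-Latn-CS"))  # iana-registrable
```

"Latn" is four letters long, so it lands in the registrable bucket and is never mistaken for a country code.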

As for RFC 3066 parsers being unable to interpret the tag, what do you think happens now? New tags are registered all the time and these don't appear in the putative list of tags inside extant RFC 3066 parsers. The parsers don't know what the tag means, but that doesn't invalidate its use for content in that language or by end users, now does it?

For a concrete example, think about "sl-rozaj", just over a year old. None of the browsers in my browser collection, not even Firefox, knows what that tag means, but all of them accept it and emit it in my Accept-Language header and no web sites have complained about it. Okay, I'm not getting any Resian content back (but then it isn't first in my A-L list either).

> > In what sense would any existing RFC 3066 parser (assumed that
> > it conforms to RFC 3066) not be able to make any more or less
> > sense of that than any other registered tag?
> 
> You're missing the critical factor: it is NOT a registered
> tag -- an RFC 3066 parser has no way of recognizing it.

An RFC 3066 parser has no way of recognizing a tag registered after the parser's list of tags was created. Therefore RFC 3066 parsers do not, as a rule, reject unknown tags. Making sense of a tag is subjective in the case of generative tags today in any event. The level of sense required of an RFC 3066 parser is generally that it be able to use the remove-from-right matching rule on ranges and tags until it finds a value it "knows".
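
The remove-from-right rule can be sketched in a few lines. This is only an illustration, not any particular implementation: the function name `lookup` and the contents of the `KNOWN` set are assumptions of mine, standing in for whatever list of tags a given parser happens to carry.

```python
def lookup(tag, known):
    """Drop subtags from the right until a known tag remains.

    Hypothetical helper: 'known' stands in for the parser's built-in list.
    Returns the matched tag in lowercase, or None if nothing matches.
    """
    subtags = tag.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in known:
            return candidate
        subtags.pop()  # remove the rightmost subtag and try again
    return None


# An assumed, deliberately out-of-date list of tags.
KNOWN = {"en", "en-us", "sl", "sr"}

print(lookup("sl-rozaj", KNOWN))    # sl
print(lookup("sr-Latn-CS", KNOWN))  # sr
print(lookup("xx-unknown", KNOWN))  # None
```

A parser that has never heard of "rozaj" or "Latn" still ends up at a language it does know, which is all the rule asks of it.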

> > There is no reason to create a separate mechanism. When
> > identifying textual content,
> 
> Language is not exclusively associated with text.  It is also a
> characteristic of spoken (sung, etc.) material (but script is
> not).

Yes, I agree. Script is important to textual applications of language tags, though. The fact that it is not applicable to aural or otherwise signed representations of language has nothing to do with whether scripts might need to be indicated on content that is written.

> Note my use of "or" not "and".  I certainly did not state that the
> information could be obtained from charset alone in all cases.

Groping the text is a very poor mechanism for determining the writing system used. Your suggestion is that we *should* be *forced* to grope the text. It also appears to be your position that we should *not* be given a mechanism whereby users can indicate a script preference when selecting language content.

> The analogous way to handle that in Internet protocols would be
> via Content-Script and Accept-Script where relevant (which they
> would not be for audio media).

I think that's an awful idea. Why should users have to set two headers to get one result?

> Sorry -- saying so doesn't make it so.  I have explained in
> detail that an RFC 1766/3066 parser cannot be expected to
> make sense of unregistered "sr-Latn-CS" etc.  I have pointed
> to specific second subtag length requirements in RFC 3066 for
> registration.

Yes, actually it does when the facts fit. Your details are wrong: parsers cannot make sense of any tag they don't have information about, and this does not invalidate the use of said tags. See the sl-rozaj example above. The fact that the parser cannot "make sense" of an unregistered tag doesn't have any implications for end users as a result. The specific subtag length requirements in RFC 3066 you cite are just wrong. Any subtag can be registered, as long as it meets the requisite length and content restrictions, and draft-langtags doesn't violate these.

> No, a strict RFC 3066 parser will not be able to identify "sr-Latn"
> or "sr-Latn-CS" as valid tags.

No, a strict RFC 3066 parser would have to have an up-to-the-second list of registered tags. Unless you've just written an implementation that foolishly does so, no implementations reject unknown tags as long as the tags fit the ABNF requirements of RFC 3066. Draft-langtags utilizes this fact to its advantage and actually tidies things up a bit.
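
For reference, the ABNF check that implementations typically apply amounts to something like the sketch below. The regular expression is my own transcription of the RFC 3066 grammar (a primary subtag of 1 to 8 letters, then any number of subtags of 1 to 8 letters or digits), and the name `is_well_formed` is just an illustrative choice.

```python
import re

# RFC 3066 grammar, transcribed as a regular expression:
#   Language-Tag   = Primary-subtag *( "-" Subtag )
#   Primary-subtag = 1*8ALPHA
#   Subtag         = 1*8(ALPHA / DIGIT)
WELL_FORMED = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def is_well_formed(tag):
    """Return True if the tag fits the RFC 3066 ABNF (hypothetical helper).

    This checks syntax only; it does not consult any list of registered tags.
    """
    return WELL_FORMED.fullmatch(tag) is not None


for tag in ("sr-Latn", "sr-Latn-CS", "en-Latn-US-boont", "sl-rozaj", "en_US"):
    print(tag, is_well_formed(tag))
```

Every tag in this thread passes the syntax check; only the underscore variant "en_US" fails it.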

Look, Bruce, we're not going to agree and Mark and I are not going to change the draft in the manner you appear to be asking for here. We'll see you, I suppose, at the end of Last Call.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
http://www.webMethods.com

Chair, W3C Internationalization Working Group
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.