draft-phillips-langtags-08, process, specifications, and extensions

Peter Constable petercon at microsoft.com
Sun Jan 2 03:58:58 CET 2005

> From: ietf-languages-bounces at alvestrand.no [mailto:ietf-languages-bounces at alvestrand.no] On Behalf Of Bruce Lilly

> > 2.  RFC 3066 did not require every possible combination of language
> > subtag + country subtag to be registered.
> None *could* be registered.

That is not correct; there is nothing in RFC 3066 prohibiting that (you are mistaken in thinking that registered tags can only begin with "i-"). In fact, there are several registered tags that take this form:

sgn-BR, sgn-CO, sgn-DE, sgn-DK, sgn-ES, sgn-FR, sgn-GB, sgn-GR, sgn-IE, sgn-IT, sgn-JP, sgn-MX, sgn-NI, sgn-NL, sgn-NO, sgn-PT, sgn-SE, sgn-US, sgn-ZA  

Indeed, RFC 3066 explicitly indicates that this is possible:

   This procedure MAY also be used to register information with the IANA
   about a tag defined by this document, for instance if one wishes to
   make publicly available a reference to the definition for a language
   such as sgn-US (American Sign Language).

> > Indeed, Section 2.2 of RFC
> > 3066 specifically says such combinations "do not need to be registered
> > with IANA before use."  Yet you criticize RFC 3066bis for allowing
> > "en-Latn-US-boont" to be used without being registered as a unit.
> Yes, because an RFC 3066 parser cannot make any sense of it.
> I.e. the proposed draft lacks "backwards compatibility".

It would be entirely possible for "en-Latn-US-boont" to be registered under the terms of RFC 3066. In what sense would an existing RFC 3066 parser (assuming that it conforms to RFC 3066) be able to make any less sense of that tag than of any other registered tag?

> > > [de-AT-1901, incidentally, (as an example) does not meet the RFC 3066
> > > requirement of 3 to 8 characters in the second subtag for registration
> > > with IANA...].

There is nothing in RFC 3066 that says a registered tag must have 3 to 8 characters in the second subtag. It simply requires that any tag in which the second subtag is 3 to 8 letters must be registered.
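
To make the syntax point concrete, here is a minimal sketch of a well-formedness check written from my reading of the RFC 3066 grammar (a primary subtag of 1 to 8 letters, followed by any number of subtags of 1 to 8 letters or digits). The function names are hypothetical and the classification of the second subtag is only an illustration of the rule described above, not a complete validator.

```python
import re

# RFC 3066 grammar as I read it (an assumption, check against the RFC text):
#   Language-Tag   = Primary-subtag *( "-" Subtag )
#   Primary-subtag = 1*8ALPHA
#   Subtag         = 1*8(ALPHA / DIGIT)
_TAG_RE = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_well_formed(tag: str) -> bool:
    """Return True if the tag fits the RFC 3066 syntax sketched above."""
    return _TAG_RE.fullmatch(tag) is not None


def second_subtag_kind(tag: str) -> str:
    """Classify the second subtag (hypothetical helper, illustrative only)."""
    parts = tag.split("-")
    if len(parts) < 2:
        return "none"
    second = parts[1]
    if re.fullmatch(r"[A-Za-z]{2}", second):
        # Two letters: read as an ISO 3166 country code, no registration needed.
        return "country-code"
    if re.fullmatch(r"[A-Za-z]{3,8}", second):
        # Three to eight letters: the tag as a whole must be registered.
        return "needs-registration"
    return "other"


for tag in ("de-AT-1901", "en-Latn-US-boont", "sr-Latn", "sr-Latn-CS"):
    print(tag, is_well_formed(tag), second_subtag_kind(tag))
```

Under this reading, "de-AT-1901" is well-formed and its second subtag is an ordinary country code, so the 3-to-8-letter rule simply does not apply to it.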

> > Absolutely correct.  The needs for RFC 3066 tags that go beyond language
> > + country has gotten to the point where they have been registered in
> > violation of the RFC.  Does that not indicate the need for a revision of
> > the core specification?
> No, it indicates that the review/registration procedure has violated
> the rules of syntax specified by BCP, and as a result has caused
> problems of a nature similar to those being criticized w.r.t. ISO
> MA action (pot to kettle: "you're black").

Um, this entire sub-thread was based on an invalid premise. No rules of syntax were violated in any review/registration procedure.

> > So we had
> > "yi-latn", and then we got "az-Latn" and "sr-Latn" and "uz-Latn", and
> > now someone is quite reasonably requesting "be-latn".  These are all
> > tags with legitimate needs.
> 1. They would be OK (but ill-advised; see next item) if prefixed with
>    "i" as a primary tag and with a second tag 3 to 8 letters.
> 2. As script is an orthogonal issue to language, it would be better
>    handled by a separate mechanism providing for specification of script
>    where necessary (e.g. a hypothetical Content-Script field).
> 3. In most cases, it is unnecessary as script is clear from the charset
>    or range of codes used from the charset.

There is no reason to create a separate mechanism. When identifying textual content, the identity of the writing system *is* very closely related to the identity of the language variety. Indeed, the writing system is generally going to be of greater importance than distinctions such as dialect or spelling that are reflected by country identifiers. 

It is not adequate to simply say that script can be identified from the charset or range of codes used. In the former regard, a charset of UTF-8 provides no information. In the latter regard, relying on the range of codes used in content does not provide a way to request an HTTP server to return pages that are (say) Azeri in Latin script rather than Cyrillic script. (You have mentioned numerous times the need to respect how language tags are used in Internet protocols; pot to kettle... )
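
As a small illustration of the charset point, the sketch below encodes the name of Azerbaijan in Latin and in Cyrillic script. Both encode as UTF-8, so the charset label is identical; the script only becomes apparent by inspecting the characters themselves, which a client cannot do before it has received the content. The helper name is hypothetical, and taking the first word of each Unicode character name is a rough heuristic I am assuming for illustration, not the Unicode Script property.

```python
import unicodedata

# The same word in two scripts, written with escapes to avoid encoding mishaps.
latin = "Az\u0259rbaycan"  # Azərbaycan
cyrillic = "\u0410\u0437\u04d9\u0440\u0431\u0430\u0458\u04b9\u0430\u043d"  # Азәрбајҹан

CHARSET = "utf-8"  # one charset label covers both texts


def scripts_by_inspection(text: str) -> set:
    """Rough heuristic: first word of each letter's Unicode character name."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}


for text in (latin, cyrillic):
    encoded = text.encode(CHARSET)
    # The charset says nothing about script; only the decoded content does.
    print(CHARSET, len(encoded), scripts_by_inspection(encoded.decode(CHARSET)))
```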

> > Perhaps someone will make the case that
> > Japanese written in Romaji needs to be specially indicated and will
> > write a request for "ja-Latn", and they will be right too.  Allowing
> > script subtags to be used generatively, instead of having to be
> > individually registered, solves this real problem.
> In an inappropriate way. Without consideration for backwards
> compatibility.  In violation of the BCP that specified the syntax
> and registration procedure.

Not inappropriate at all. And all your repeated comments about a lack of consideration for backwards compatibility, and about violation of the syntax and procedures of BCP 47, have been shown to be invalid.

> RFC 3066 doesn't require "haw-US", and if encountered provides for
> matching it (in an "accept" role) with "haw" (as content to be
> provided). "sr-Latn" and "sr-Latn-CS" cannot be matched by an
> RFC 3066-compliant process to anything, since they do not fit the
> RFC 3066 syntax for well-formed language tags.

Certainly they do; and certainly an RFC 3066 parser will match "sr" with "sr-Latn" or "sr-Latn-CS", and "sr-Latn" with "sr-Latn-CS".
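
The matching claim can be sketched as follows, assuming the RFC 3066 language-range rule as I understand it: a range matches a tag if the two are equal, or if the range is a prefix of the tag and the next character in the tag is "-". The function name is hypothetical, and the case-insensitive comparison and the "*" wildcard are my reading of the RFC rather than anything stated in this thread.

```python
def range_matches(language_range: str, tag: str) -> bool:
    """Prefix matching of a language-range against a language tag."""
    r = language_range.lower()
    t = tag.lower()
    if r == "*":
        # The wildcard range matches any tag.
        return True
    # Equal, or a prefix that ends exactly at a subtag boundary.
    return t == r or t.startswith(r + "-")


pairs = [
    ("sr", "sr-Latn"),
    ("sr", "sr-Latn-CS"),
    ("sr-Latn", "sr-Latn-CS"),
    ("sr-Latn", "sr"),       # a longer range does not match a shorter tag
    ("sr-La", "sr-Latn"),    # a prefix must end at a subtag boundary
]
for language_range, tag in pairs:
    print(language_range, tag, range_matches(language_range, tag))
```

Nothing in this rule inspects what the second subtag means, which is why a script subtag poses no problem for a matcher of this kind.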

Peter Constable
