Ietf-languages Digest, Vol 24, Issue 5

Sat Dec 11 01:54:59 CET 2004

> RE: New Last Call: 'Tags for Identifying Languages' to BCP
>  Date: 2004-12-10 16:37
>  From: "Peter Constable" <petercon at microsoft.com>
>  To: ietf-languages at alvestrand.no
>  
> Bruce Lilly's message makes several inaccurate statements against the
> proposed draft, and misrepresents some of the changes being made. My
> main concern is that I don't know where it was circulated. I might be
> wrong, but I get the impression it was written with a different audience
> in mind and then copied here.
> 
> 
> 
> > -----Original Message-----
> 
> > > There are problems with the RFC 3066 definition of generative
> tags,
> > > however. The ISO 639 and ISO 3166 standards are not freely available
> and evolve
> > > over time.
> > 
> > Accessibility has not been a problem for this implementor...
> 
> I agree with Bruce, that accessibility of ISO 639 and ISO 3166 has not
> been the issue. Unfortunately, his comments do not indicate what the
> real issues were.

My comments are in response to the "New Last Call" made on
the ietf-announce list.  They are in response to the text which
accompanied that new last call and the text of
draft-phillips-langtags-08.txt dated November 2004.  The
specific claim that accessibility has been a problem was made in
the text accompanying the new last call (q.v.).  For those not
subscribed to the ietf-announce list, the text of the new last
call can be seen at
http://www1.ietf.org/mail-archive/web/ietf-announce/current/msg00755.html

> > > The largest change in the specification is that it modifies the
> structure of
> > > the language tag registry. Instead of having to obtain lists of
> codes from five
> > > separate external standards...
> 
> > Contrary to the implicit claim, the ISO documents mentioned
> > above comprise two standards (available in two languages each),
> > not "five separate external standards".
> 
> RFC 3066 made reference to ISO 639-1, ISO 639-2 and ISO 3166-1; the
> proposed replacement adds ISO 15924. I would count that as four ISO
> standards. Up-to-date code tables for all four are readily available.

For the purpose of implementation of validation of language-tags,
the ISO 639 list includes both the 2- and 3-character codes in a
single document.  The claim (again from text accompanying the
new last call) states that there is some difference in the draft
proposal from 3066 in that 3066 (the text alleges) requires
"lists of codes from five separate external standards" -- in fact
two lists suffice for 3066 implementations.

> I think this is a serious misrepresentation of the intent of the
> proposal: the draft nowhere suggests, let alone declares, that the
> source ISO standards are irrelevant.

A poor choice of words on my part. The text and draft suggest
that only the proposed new registry should be consulted, and
the draft clearly specifies that the description of all subtags is
to be provided in English (only).

> Rather, the intent of the 
> comprehensive registry is to ensure stability in IETF implementations by
> protecting them from unpredictable changes in ISO standards, such as the
> re-definition of "CS" as a country identifier not long ago. The
> denotation of identifiers listed in the registry is based on their
> definition in the ISO standards, not on an informative descriptor
> provided in the registry;

It's not clear to me that the proposal will provide protection
against the whims of politicians.  If the definition of "CS" as
a country code changes again under the proposed scheme,
how is one to determine specifically what some archived
language-tag referred to at some point in time?  I'm not
particularly concerned about that problem, as I am resigned
to instability associated with anything specified by politicians
(and that includes the UN region codes).

> and as Bruce quite clearly pointed out, those 
> source standards are readily accessible. So the suggestion that
> implementers will no longer have access to French-language names from
> the source ISO standards simply is vacuous.

But if the proposed new registry's description of "CS" says
"foo" and the ISO standard code list says "bar", what's
an implementor supposed to present to a user as *the*
description associated with "CS"?

> As for concerns of Anglo-centricity, I'm sure that the authors had no
> anti-French motive, and would be open to suggestions as to how that
> could be addressed.

One possibility would be two description fields.  But the
registry would need a charset closer to ISO-8859-1 than
to ANSI X3.4 as currently specified.  Or an encoding
scheme.

> Surely, though, this is not a technical argument 
> against the proposal.

Not purely technical, though it presents problems for
existing implementors who provide bilingual support.
Eliminating bilingual descriptions for the language,
country (and UN region) codes leaves implementors
in a quandary.

> > The ABNF in the draft permits all of the following tags which
> > are not legal per the RFC 3066 ABNF:
> >    supercalifragilisticexpialidoceus
> >    y-----
> >    x1234567890abc
> >    a123-xyz
> 
> In fact, none of these is permitted by the ABNF of the draft.

ABNF from the draft:

   Language-Tag = (lang
                   *("-" extlang)
                   ["-" script]
                   ["-" region]
                   *("-" variant)
                   *("-" extension)
                   ["-" privateuse])
                   / privateuse         ; private-use tag
                   / grandfathered      ; grandfathered registrations

   lang            = 2*3ALPHA           ; shortest ISO 639 code
                   / registered-lang
   extlang         = 3ALPHA             ; reserved for future use
   script          = 4ALPHA             ; ISO 15924 code
   region          = 2ALPHA             ; ISO 3166 code
                   / 3DIGIT             ; UN country number
   variant         = ALPHA (4*7alphanum) ; registered variants
                   / DIGIT (3*7alphanum)
   extension       = singleton 1*("-" (2*8alphanum)) ; extension subtag(s)
   privateuse      = "x" 1*("-" (1*8alphanum))       ; private use subtag(s)
   singleton       = ALPHA             ; single letters
                                       ; (except x, which has special meaning)
   registered-lang = 4*8ALPHA           ; registered language subtag
   grandfathered   = ALPHA *(alphanum / "-")  ; grandfathered registration
   alphanum        = (ALPHA / DIGIT)    ; letters and numbers

Note that the RFC 2234 definition of an asterisk in front of
a production (with no adjacent numbers, as is the case in
the "grandfathered" production) means zero or more
repetitions (without upper bound) of the production to the
right of the asterisk. That means that the "grandfathered"
production (which is an alternative in the Language-Tag
production) will match any of the following text tags (comments
to the right separated by a semicolon):
   x  ; ALPHA followed by zero repetitions
   xa ; ALPHA followed by one ALPHA (see alphanum)
   x- ; ALPHA followed by one HYPHEN
   supercalifragilisticexpialidoceus ; ALPHA followed by many ALPHAs
       (see alphanum) (example previously given)
   x1234567890abc ; ALPHA followed by 13 alphanums
       (as previously given)
   a123-xyz ; ALPHA followed by three DIGITs (see alphanum)
       followed by one HYPHEN followed by three ALPHAs
       (example previously given)
   y----- ; ALPHA followed by five HYPHENs (example previously
       given)
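
For anyone who wants to check the reading above mechanically, here is
a minimal sketch in Python. It assumes that the "grandfathered"
production can be transliterated directly into a regular expression,
with ALPHA and DIGIT taken as the ASCII letters and digits of
RFC 2234. The names GRANDFATHERED, matches_grandfathered and EXAMPLES
are invented for this illustration and appear nowhere in the draft.

```python
import re

# Assumed transliteration of the draft -08 production
#   grandfathered = ALPHA *(alphanum / "-")
# ALPHA = A-Z / a-z and DIGIT = 0-9, as in RFC 2234.
GRANDFATHERED = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def matches_grandfathered(tag: str) -> bool:
    """Return True if the whole tag matches the grandfathered production."""
    return GRANDFATHERED.fullmatch(tag) is not None


# The example tags listed above.
EXAMPLES = [
    "x",
    "xa",
    "x-",
    "supercalifragilisticexpialidoceus",
    "x1234567890abc",
    "a123-xyz",
    "y-----",
]

if __name__ == "__main__":
    for tag in EXAMPLES:
        print(f"{tag!r}: {matches_grandfathered(tag)}")
```

Under that assumption every example is accepted. The sketch tests
only the "grandfathered" alternative, which is enough here because
Language-Tag matches whenever any one of its alternatives matches.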

I say the ABNF from draft -08 (quoted above) allows those;
you say no.  Either you're looking at different ABNF or one
or more of us doesn't understand ABNF.  If you wish to
convince me that I don't understand it, you'll have to do
better than simply claiming that I'm wrong with no supporting
reasoning.

> > Specifically, the draft allows, and RFC 3066 disallows:
> >    subtags more than 8 octets in length
> 
> This is incorrect. It was true of an earlier draft, but that was
> changed.

The "new last call" was for version -08; I downloaded it
from the URI in the new last call and copied the ABNF
above from that.  My analysis is above.  I await your
rebuttal or retraction.

> >    hyphens which do not separate subtags
> >    zero-length subtags
> 
> These near-equivalent statements are incorrect. No hyphen may be
> permitted without a non-initial sub-tag, and no sub-tag can be an empty
> string.

See the "y-----" example above, based on the published
ABNF. Again, I await your rebuttal or retraction.

> >    primary tags which are not purely alphabetic
> 
> This is incorrect. A primary sub-tag must be 2*3ALPHA or 4*8ALPHA, or
> "i" or "x".

See the "a123-xyz" example above (in RFC 3066 parlance,
the "a123" part is the primary tag, which clearly contains
DIGITs).  One more time, I await your rebuttal or
retraction.
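
For comparison, a minimal sketch (in Python) of the other half of the
claim, namely that RFC 3066 rejects the same four tags. It assumes
the RFC 3066 syntax (Primary-subtag = 1*8ALPHA, Subtag =
1*8(ALPHA / DIGIT), subtags separated by single hyphens) can be
written as one regular expression. The names RFC3066_TAG,
is_rfc3066_tag and violations are invented for this illustration.

```python
import re

# RFC 3066, section 2.1:
#   Language-Tag   = Primary-subtag *( "-" Subtag )
#   Primary-subtag = 1*8ALPHA
#   Subtag         = 1*8(ALPHA / DIGIT)
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_rfc3066_tag(tag: str) -> bool:
    """Return True if the whole tag matches the RFC 3066 ABNF."""
    return RFC3066_TAG.fullmatch(tag) is not None


def violations(tag: str) -> list:
    """List which of the three constraints discussed above a tag breaks."""
    subtags = tag.split("-")
    problems = []
    if any(len(s) > 8 for s in subtags):
        problems.append("subtag longer than 8 octets")
    if any(len(s) == 0 for s in subtags):
        problems.append("zero-length subtag")
    primary = subtags[0]
    # isascii() guards against non-ASCII letters, which isalpha() accepts.
    if not (primary.isascii() and primary.isalpha()):
        problems.append("primary subtag not purely alphabetic")
    return problems


TAGS = [
    "supercalifragilisticexpialidoceus",
    "y-----",
    "x1234567890abc",
    "a123-xyz",
]

if __name__ == "__main__":
    for tag in TAGS:
        print(f"{tag!r}: valid={is_rfc3066_tag(tag)} {violations(tag)}")
```

The violations helper is only a convenience for reporting which
constraint each tag breaks; the regular expression alone decides
validity in this sketch.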