What's the plan for ISO 639-3 and RFC 3066 ter?

Addison Phillips [wM] aphillips at webmethods.com
Tue Aug 17 00:44:45 CEST 2004


> > John (and others on the list), are you happy with 3066
> > bis? Indications of support or opposition (with reasons) would be
> > useful at this juncture in finishing this work.
>
> I support it in principle.  I haven't had a chance to check the current
> draft, but I suspect that any nits I would find, others will find too.
> Still, I'll try to squeeze in another pass.

What's important right now is ensuring reasonable consensus on moving
forward with the present draft-langtags, nits aside. "Nit-picking" is
important, but we can create a draft-06 for minor tweaks to the details and
non-substantive corrections. Please do send any comments you may have when
you get a chance.

I have been ruminating about the problem Peter raised and your notes below.

I had a real brain cramp over the weekend, so you're right: not only is
there no problem with ISO 639-3/-5 becoming part of the primary language
subtag set, but that isn't even the right question. I'll note that the draft
has rules to deal with any problems that might arise.

Extlangs do present the possibility of some knobbly issues. Ideally, the
changes to incorporate ISO 639-x into a future "RFC 3066ter" would not
present many (or even ANY) changes to the underlying mechanisms and
requirements (so that implementations continue to work). The main changes
should be to define which standards are the source for which portions of the
subtag registry.

Extended languages work within this framework and without major
modifications so long as language tag processors are free to recognize
'zh-nan' as a different language from (that is, not matching) 'nan'. This is
basically what the Default Fallback Pattern says: it is a strict
remove-from-right matching scheme.
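
A minimal Python sketch of remove-from-right matching, for illustration
only. The function names (fallback_chain, matches) are hypothetical and not
taken from draft-langtags, and the sketch ignores any special handling the
draft gives to particular subtag types:

```python
def fallback_chain(tag):
    """Return the tag followed by each shorter form, dropping subtags from the right."""
    subtags = tag.lower().split("-")
    return ["-".join(subtags[:n]) for n in range(len(subtags), 0, -1)]


def matches(language_range, tag):
    """True if the range equals the tag or one of its remove-from-right truncations."""
    return language_range.lower() in fallback_chain(tag)


print(fallback_chain("zh-min-nan"))  # ['zh-min-nan', 'zh-min', 'zh']
print(matches("zh", "zh-nan"))       # True
print(matches("nan", "zh-nan"))      # False: 'nan' is never a left prefix of 'zh-nan'
```

Because only leading runs of subtags are ever compared, a processor written
this way treats 'nan' and 'zh-nan' as unrelated tags.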

Some of the problem here could be mitigated by giving extlangs (or primary
language subtags that can be extlangs) an intended prefix field. That is,
the subtag can be used standalone ('lmn') or with its intended prefix (as in
'oc-lmn'). This makes the information available to a validating processor
and the tag or range could be canonicalized by inserting the prefix as
necessary.
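
As a rough sketch of how a validating processor might use such a field,
assuming a lookup table from subtag to intended prefix (the INTENDED_PREFIX
table, its contents, and the canonicalize function are all hypothetical;
no such field exists in the current draft):

```python
# Hypothetical registry excerpt: subtag -> intended prefix (illustrative values only).
INTENDED_PREFIX = {"lmn": "oc", "gsc": "oc"}


def canonicalize(tag):
    """Insert the intended prefix when a tag starts with a bare extlang-capable subtag."""
    subtags = tag.split("-")
    prefix = INTENDED_PREFIX.get(subtags[0].lower())
    if prefix is None:
        return tag  # no intended prefix recorded; leave the tag unchanged
    return "-".join([prefix] + subtags)


print(canonicalize("lmn"))     # oc-lmn
print(canonicalize("lmn-FR"))  # oc-lmn-FR
print(canonicalize("oc-lmn"))  # oc-lmn (already carries its prefix)
```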

A similar effect can be gained by giving the primary tag an alias (thus, for
example, gsc is an alias of oc-gsc). This makes the collection-sublang tag
canonical (lmn is permitted, but oc-lmn is canonical).

These would make the following cases on a 'ter' processor:

   a. if I request 'lmn', I get content labelled 'oc-lmn' and 'lmn-FR', but
      not those labelled 'oc-gsc' or even 'oc'. ('lmn' becomes 'oc-lmn',
      'lmn-FR' becomes 'oc-lmn-FR')
   b. if I request 'oc', I get content labelled 'oc', 'gsc', 'lmn', and
      'oc-lmn', etc. (gsc and lmn map to oc-gsc and oc-lmn, for example)

On a 'bis' processor:

   c. if I request 'lmn', I get content labelled 'lmn-FR', but not 'oc-lmn'
   d. if I request 'oc', I get content labelled 'oc', and 'oc-lmn', but not
      'gsc' and 'lmn'
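
The four cases above can be simulated with a short Python sketch. It assumes
that a 'ter' processor canonicalizes both the range and the label before
matching, and that a 'bis' processor matches the tags as given. The prefix
table and all function names are hypothetical, not part of draft-langtags:

```python
INTENDED_PREFIX = {"lmn": "oc", "gsc": "oc"}  # hypothetical registry data


def canonicalize(tag):
    """Lowercase the tag and prepend the intended prefix if the first subtag has one."""
    subtags = tag.lower().split("-")
    prefix = INTENDED_PREFIX.get(subtags[0])
    return "-".join([prefix] + subtags) if prefix else "-".join(subtags)


def prefix_match(language_range, tag):
    """Remove-from-right matching: the range must equal a leading run of the tag's subtags."""
    r, t = language_range.lower().split("-"), tag.lower().split("-")
    return t[:len(r)] == r


def select(language_range, labels, ter=False):
    """Return labels matching the range; a 'ter' processor canonicalizes both sides first."""
    fix = canonicalize if ter else (lambda s: s)
    return [label for label in labels if prefix_match(fix(language_range), fix(label))]


labels = ["oc", "gsc", "lmn", "lmn-FR", "oc-lmn", "oc-gsc"]
print(select("lmn", labels, ter=True))  # case a: ['lmn', 'lmn-FR', 'oc-lmn']
print(select("oc", labels, ter=True))   # case b: all six labels
print(select("lmn", labels))            # case c: ['lmn', 'lmn-FR']
print(select("oc", labels))             # case d: ['oc', 'oc-lmn', 'oc-gsc']
```

The only difference between the two processors in this sketch is the
canonicalization step; the matching rule itself is unchanged.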

The latter of these (the alias approach) would require small changes to
draft-langtags to allow a range in an alias.

So now that I'm at this end of the email, I guess things aren't so bad.
There is an approach available that fits within the 3066bis framework and
would work with extlang without breaking 3066bis implementations. The
solution doesn't make all of the current grandfathered tags redundant, but
that isn't strictly necessary, is it? The question is whether these are the
right choices and whether to alter the current draft slightly to deal with
this (or some other) solution.

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: John Cowan [mailto:jcowan at reutershealth.com]
> Sent: 16 August 2004 10:19
> To: Addison Phillips [wM]
> Cc: ietf-languages at alvestrand.no
> Subject: Re: What's the plan for ISO 639-3 and RFC 3066 ter?
>
>
> Addison Phillips [wM] scripsit:
>
> > The question is: does ISO 639-3 supersede ISO 639-2 as the source for
> > three letter codes? Or not?
>
> Mu.
>
> > If 639-3 is a strict superset, then the additional three letter codes
> > could just be admitted as language subtags. In fact, I'm given to
> > understand from Peter's prior explanations that this should be the
> > goal for most of the 639-3 codes.
>
> The situation as I understand it will be as follows:
>
> 1) The ISO 639 standard will draw on a single pool of three-letter codes.
> No three-letter code will be used for more than one purpose.
>
> 2) ISO 639-3 will assign codes to individual languages and
> macro-languages.  These codes will be identical to the existing 639-2
> codes where they exist; where there is no existing 639-2 code, they will
> be identical to Ethnologue 14th edition codes where possible.
>
> 3) ISO 639-5 will assign codes to language collections.  These codes
> will be identical to the existing 639-2 codes where they exist.
>
> 4) ISO 639-3 and ISO 639-5 codes will be disjoint.
>
> 5) ISO 639-2 will specify a subset of (the union of ISO 639-3 and
> ISO 639-5 codes) that specify languages which meet the restrictions of
> ISO 639-2 (basically, that there are at least fifty documents in the
> language, held by at most five organizations).
>
> 6) ISO 639-1 will continue to specify a subset of ISO 639-2, and will
> assign two-letter codes to its members.  Except for a transitional period
> after the promulgation of ISO 639-3 and ISO 639-5, it will effectively
> become a closed collection.
>
> (ISO 639-4 will explain all this, and will not define any codes.)
>
> > The need for extlang subtags would then be muted (and might even be
> > eliminated). Only language codes that had "macro languages" associated
> > with them could be registered as extlangs. In fact, these subtags
> > might be cherry picked on an as-needed basis (rather than having a
> > full-fledged formal source).
>
> ISO 639-3 will provide a mapping between macro-languages and the
> individual languages that are parts of them.  I don't know (and it may
> not have been decided) whether ISO 639-5 will provide a mapping between
> collective codes and the languages covered by them.
>
> > Canonicalizing and matching the tags in this situation would be much
> > more complicated:
> >
> > zh-min-nan // ignore the min problem for a second
> > zh-nan
> > nan
>
> This (and its twin zh-min-bei) are the most complex cases.  The vast
> majority of all macro-languages do not contain other macro-languages
> (as zh contains min).  Indeed, it is doubtful whether ISO 639-3
> will provide nested macro-languages at all.
>
> Let us consider more straightforward cases.  I am assuming throughout
> that Peter's recommendations for changes to 639-2 are accepted.
>
> A) Currently, the macro-language "Occitan" is encoded as oc.  There
> will be four individual languages corresponding to this macro-language
> in 639-3:  Auvergnat (auv), Gascon (gsc), Languedocien (lnc), and
> Limousin (lms).
>
> B) Currently, the collective "Land Dayak languages" is encoded as day.
> (It's not marked as a collective in ISO 639-2, but Peter has proposed
> that it be changed to a collective).  There are 16 languages in this
> collective.
>
> In each case, we have three possibilities:
>
> 1) Allow the individual language codes;
>
> 2) Allow the higher-order code extended by an individual language code;
>
> 3) Allow both 1 and 2 as synonyms.
>
> Accepting 1 means that systems which consume resources labeled oc must
> now also be prepared to consume resources labeled auv, gsc, lnc, and lms.
> Accepting 2 allows normal fallback behavior to work:  oc-auv will be
> recognized as oc automatically.  Accepting 3 means that some normalization
> scheme must be provided.  All three possibilities have drawbacks.
>
> For case B, ISO 639-5 must provide mapping tables (per above) in order
> to make conversion between collective and individual language codes
> practicable.  If this is not done, only possibility 2 will fly.
>
> > John (and others on the list), are you happy with 3066
> > bis? Indications of support or opposition (with reasons) would be
> > useful at this juncture in finishing this work.
>
> I support it in principle.  I haven't had a chance to check the current
> draft, but I suspect that any nits I would find, others will find too.
> Still, I'll try to squeeze in another pass.
>
> Note:  Before leaving on vacation, Peter left us all a present:
> an editor's draft of ISO 639-3, reachable from the last link on
> http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=PCUnicodeDocs
> The draft is 142 pages long, but all except the first 14 and last 6
> pages of the document are just the code to language mapping tables,
> in code order and in language order.
>
> --
> John Cowan
> www.ccil.org/~cowan
> www.reutershealth.com
> jcowan at reutershealth.com
>
> Her he asked if O'Hare Doctor tidings sent from far
> coast and she with grameful sigh him answered that
> O'Hare Doctor in heaven was. Sad was the man that word
> to hear that him so heavied in bowels ruthful. All
> she there told him, ruing death for friend so young,
> algate sore unwilling God's rightwiseness to withsay.   Ulysses, "Oxen"
