New item in ISO 639-2 - Zaza

John Cowan cowan at ccil.org
Thu Aug 24 15:48:48 CEST 2006


Mark Davis scripsit:

>   3. Allow both 1 and 2 as synonyms.
>      - zh-cmn-CN and cmn-CN are both valid and synonymous.

This is completely contrary to the spirit of RFC 3066.  We have
tag synonymy only in a few cases, and only where various RAs and MAs
have made mistakes.  You are talking about providing 350 cases of
gratuitous synonymy.

>      - - means adding more structure (which we have allowed for).
>      - ± automatic fallback (if we canonicalize to the longer form)

Canonicalization is not part of 3066bis, just a recommendation.

>      - - testing for validity is slightly more complicated (need to
>      check that the combination of lang + extlang for the long form is 
>      valid)

Lang+extlang checking is just prefix checking, which is already being
done for variants.
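
As a rough illustration of why this is cheap, here is a minimal Python
sketch.  The table is a hypothetical registry excerpt containing only the
Chinese extlangs mentioned in this thread, and the function name is made
up for the example; neither comes from 3066bis.

```python
# Hypothetical registry excerpt: extlang subtag -> the primary language
# subtag it is required to follow.  Only entries named in this thread.
EXTLANG_PREFIX = {"cmn": "zh", "gan": "zh", "wuu": "zh", "yue": "zh"}


def extlang_prefix_ok(tag):
    """Check that an extlang in second position follows its required prefix.

    Tags with no extlang from the table in second position pass trivially;
    this sketch checks nothing else about the tag.
    """
    subtags = tag.lower().split("-")
    if len(subtags) < 2 or subtags[1] not in EXTLANG_PREFIX:
        return True
    return subtags[0] == EXTLANG_PREFIX[subtags[1]]
```

The whole check is one table lookup and one comparison, the same shape
as checking a variant against its registered prefix.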

> A big question in my mind is the stability of the macro
> language inclusion relationship. If there is the remotest
> chance that they will change, eg that someday de becomes
> a macro language that includes de, sli, sxu, ltz, vmf,
> etc. (http://www.ethnologue.com/show_family.asp?subid=90073)
> then the only choice we have is #1.

I agree with the hypothetical -- but I think it will remain purely
hypothetical, for this reason:

Macrolanguages exist as a shim between the moderate lumper tendencies
of 639-2 and the extreme splitter tendencies of 639-3.  Wherever 639-2
has lumped language varieties that 639-3 considers distinct languages,
a macrolanguage is created.  The chance that such a well-used code as "de"
will be redefined away from meaning "Standard German" is effectively nil.
And 639-3/RA isn't going to gratuitously create macrolanguages otherwise
-- they are a wart on the standard.

A very thorough multi-year analysis has caught all such cases, and we
can be confident that as of when 639-3/RA joins the RA/JAC there should
be no more lurking undiscovered.

We then have to deal with two kinds of retroactive creation of
macrolanguages:  when 639-2/RA registers a lumped language, as they
have just done, and when 639-3/RA decides on the basis of new evidence
to split one of their existing languages.  I would hope that cases of
the first kind will cease, but cases of the second kind are always a
possibility when dealing with little-known languages -- the Ethnologue
pages are full of notes like "XXX dialect may be a separate language".
Luckily, handling them is easy: the existing language subtag remains in
place, and we add two new 639-3-specified extlang subtags, one for the
newly recognized language and one to cover the mainstream dialects.

When we do get a case of the first kind, under option #1 we must decide
either case by case or once and for all what to do: deprecate the
new subtag (as we do with changes to country codes, but without a specific
replacement), or deprecate the existing language subtags and introduce
corresponding extlang tags under the new tag.  Under option #2 we don't
have to do anything special -- but we risk substantial user confusion.

> The more I think about it, the more I like #1. We already have to
> do fallback between language subtags (think no, nb, nn), and this
> recasts the issue into providing additional data so that if I don't
> find language subtag X, I can what is the next best choice Y.

And I still strongly favor #2.  The last thing we want is a situation
where most people continue to use "zh" to tag Mandarin Chinese documents
(the overwhelming majority of all Chinese documents) and some start to use
"cmn".  This isn't a trivial case like the Norwegian one; there are 350
subtags we are talking about here.  We would in effect have to introduce
a major revision to the matching draft in order to make these remappings
part of it -- something I at least had very much hoped to avoid.

No, let most people write "zh", let those who care write "zh-cmn"
(as they can already do, thanks to a grandfathered RFC 3066 tag),
which will fall back to "zh", and let people who use the existing tags
"zh-gan", "zh-wuu", and "zh-yue" continue to have the right results,
but now as part of the standard rather than as a grandfathered exception.
(Some grandfathered tags will have to be deprecated.)

Overall, though, #2 is the conservative choice both in fallback behavior
and for existing language tags.
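
To make the fallback point concrete, here is a minimal Python sketch of
fallback by simple truncation.  The function names are invented for the
example, and this is not the algorithm of the matching draft, just the
plain "drop the last subtag" rule.

```python
def fallback_chain(tag):
    """Return the tag followed by each shorter prefix, longest first."""
    subtags = tag.split("-")
    return ["-".join(subtags[:n]) for n in range(len(subtags), 0, -1)]


def best_match(requested, available):
    """Return the first available tag in the fallback chain, or None.

    Comparison is case-insensitive; the result uses the spelling found
    in `available`.
    """
    lowered = {t.lower(): t for t in available}
    for candidate in fallback_chain(requested):
        if candidate.lower() in lowered:
            return lowered[candidate.lower()]
    return None
```

With this rule "zh-cmn" reaches "zh" with no extra data, whereas a bare
"cmn" (option #1) never does unless a remapping table is added.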

-- 
They do not preach                              John Cowan
  that their God will rouse them                cowan at ccil.org
    A little before the nuts work loose.        http://www.ccil.org/~cowan
They do not teach
  that His Pity allows them                         --Rudyard Kipling,
    to drop their job when they damn-well choose.   "The Sons of Martha"

