Review period; Nepali and Oriya

Sat Aug 4 22:05:10 CEST 2012

Gordon P. Hemsley wrote:

> I am of the late-night-and-not-well-thought-out opinion that further
> use of extlangs should be discouraged.

Though it may seem tempting, the question of whether the Registry should 
treat Nepali and Oriya the same as Arabic and Chinese really isn't the 
right occasion for a referendum on whether extlangs are good or bad, or 
should or should not be discouraged. (This was one of the reasons the 
Latvian discussion dragged on for months.)

The rationale in RFC 5645 for choosing certain languages, designated by 
ISO 639-3 as macrolanguages, to host extlangs in BCP 47 was as follows:

"These macrolanguage subtags [initially Arabic, Chinese, Konkani, Malay, 
Swahili, and Uzbek] were already present in the Language Subtag Registry 
and were chosen because they were determined by the LTRU Working Group 
to have been used to represent a single dominant language as well as the 
macrolanguage as a whole."

There was no statement as to whether existing language subtags, 
converted from individual to macrolanguage status by ISO 639-3, would 
also fall into this category and be added to this list. It was perhaps 
thought this would not be a frequent occurrence.

The questions here are (1) whether 'ne' has been used in BCP 47 contexts 
to represent not only Nepali proper, but also Dotyali, and (2) whether 
'or' has been used in BCP 47 contexts to represent not only Oriya 
proper, but also Sambalpuri. The key factor is whether content in 
Dotyali and Sambalpuri has been tagged as if it were Nepali and Oriya, 
respectively, in the same way that various "Chinese" languages have been 
tagged as 'zh'. It's a judgment call, and again, I'm not making any 
recommendation one way or the other. But the goal is to apply the rules 
equally to all languages that are in the same situation.

The decision to adopt an extlang mechanism into BCP 47 was heavily 
debated, and the LTRU group literally took years to arrive at a 
consensus. Deprecating this mechanism should be discussed as a separate 
topic, not piggybacked onto a different topic. It should involve 
rechartering the LTRU Working Group, and should result in a new RFC.

> AIUI, they are redundant registrations that are automatically
> deprecated (in some sense, if perhaps not in name) upon registration.
> The only purpose they seem to serve is to allow macrolanguage and
> microlanguage information to both be explicit in a single tag, and I'm
> not sure how useful that is.

Extlangs (and macrolanguages) exist because there is precedent and 
current practice, not only in data or coding systems but also in 
people's minds, to identify content as (say) "Chinese" even though it 
may be Mandarin or Cantonese or Wu or Hakka or Min Nan or whatever. Some 
processes need to see "Chinese" and others need to see "Mandarin." 
Extlangs allow both to exist in the same tag.
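
A minimal sketch of that last point, assuming only the RFC 5646 syntax 
rule that an extlang is a three-letter subtag directly following a two- 
or three-letter primary language subtag. The function names are 
hypothetical, and no lookup against the Registry is attempted, so this 
does not check that a subtag is actually registered:

```python
def split_language(tag):
    """Return (primary, extlang) for a BCP 47 tag; extlang may be None.

    Purely syntactic: relies on subtag length and position, not on
    the Language Subtag Registry.
    """
    subtags = tag.lower().split("-")
    primary = subtags[0]
    extlang = None
    # An extlang can only follow a 2- or 3-letter primary language subtag.
    if len(primary) in (2, 3) and len(subtags) > 1:
        candidate = subtags[1]
        # Scripts are 4 letters, regions are 2 letters or 3 digits,
        # so a 3-letter alphabetic subtag in this position is an extlang.
        if len(candidate) == 3 and candidate.isascii() and candidate.isalpha():
            extlang = candidate
    return primary, extlang


def is_chinese(tag):
    # A process that only cares about the macrolanguage.
    primary, _ = split_language(tag)
    return primary == "zh"


def is_mandarin(tag):
    # A process that cares about the specific language, whether it was
    # tagged as 'zh-cmn' (extlang form) or as 'cmn' alone.
    primary, extlang = split_language(tag)
    return extlang == "cmn" or primary == "cmn"
```

With a tag such as "zh-cmn-Hans-CN", both is_chinese() and is_mandarin() 
return True, whereas plain "zh-Hans-CN" satisfies only the first and 
plain "cmn" only the second.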

> Is there any data available for current usecases of extlangs which
> don't involve legacy implementations? (I'm assuming that that was the
> primary motivation for including them in the spec. Correct me if I'm
> wrong.)

I doubt there is much useful data on the use of BCP 47 tags in the wild 
at all. (You sometimes see comments that language tagging is so 
haphazard that heuristic analysis yields better results, a bit of a slap 
to those of us who have worked on language tagging for nearly a decade.)

But to answer your question, no, extlangs are not merely for legacy 
implementations. People continue, and will continue, to regard (and tag) 
Mandarin content sometimes as "Chinese" and sometimes as "Mandarin." 
Indeed, "legacy" implementations (from the RFC 1766 or 3066 days) won't 
be able to parse extlangs anyway.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell