New variant subtags for Serbian language

Sun Nov 17 19:52:23 CET 2013

Milos Rancic <millosh at gmail dot com> wrote:

> Besides different pronunciation (and spelling) of old vowel Jat,
> Serbian language has two standard scripts -- Cyrillic and Latin. As
> that's not a kind of rarely used variant -- on the contrary, it's
> commonly used -- it would be good to have shorter tags for those
> combinations:
>
> * ec => Ekavian Cyrillic (thus, sr-ec instead of sr-ekavn-cyrl)

That would be "Serbian as used in Ecuador," a potentially absurd but 
valid tag.

> * el => Ekavian Latin (...)
> * jc => Iyekavian Cyrillic (...)
> * jl => Iyekavian Latin (...)

Those would be Serbian as spoken in three hypothetical, currently 
undefined regions.
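
To illustrate why a two-letter subtag in that position reads as a region, 
here is a minimal Python sketch that labels subtags purely by their shape. 
The function name classify_subtags is hypothetical, and the rules are a 
deliberate simplification of the BCP 47 syntax: it handles only language, 
script, region, and variant subtags, and ignores extlang, extension, 
private-use, and grandfathered forms.

```python
import re

def classify_subtags(tag):
    """Label each subtag of a simple language[-script][-region][-variant] tag.

    Simplified sketch: anything that is not a script, region, or variant
    by shape (for example an extlang or an extension) is labeled "unknown".
    """
    subtags = tag.split("-")
    # The first subtag is always treated as the primary language subtag.
    labels = [("language", subtags[0].lower())]
    for sub in subtags[1:]:
        if re.fullmatch(r"[A-Za-z]{4}", sub):
            # Four letters: script subtag, conventionally title case.
            labels.append(("script", sub.title()))
        elif re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", sub):
            # Two letters or three digits: region subtag, conventionally upper case.
            labels.append(("region", sub.upper()))
        elif re.fullmatch(r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}", sub):
            # Five to eight characters, or a digit plus three: variant subtag.
            labels.append(("variant", sub.lower()))
        else:
            labels.append(("unknown", sub))
    return labels

print(classify_subtags("sr-ec"))
print(classify_subtags("sr-Cyrl-ekavn"))
```

Under these assumptions "sr-ec" comes back as language "sr" plus region 
"EC", never as a variant, because nothing about a two-letter subtag marks 
it as anything other than a region code.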

> Wikipedia is already using those tags: cf. 
> http://sr.wikipedia.org/sr-el/

That is not a defense for BCP 47 purposes, because Wikipedia does not 
use BCP 47.

> In any case, note that the tags for Ekavian and Iyekavian should stay
> *before* the tags for Cyrillic and Latin. You are speaking Ekavian or
> Iyekavian without writing them.

Ekavian Serbian, spoken: sr-ekavn
Ekavian Serbian, written in Cyrillic: sr-Cyrl-ekavn
Ekavian Serbian, written in Latin: sr-Latn-ekavn

These are the rules of BCP 47: language, then script, then region, then 
variant.
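
A short Python sketch of that ordering follows. The helper make_tag is 
hypothetical and does no validation against the Registry; it only shows 
that the caller supplies the pieces and the order of the output is fixed. 
The subtag "ekavn" is simply the spelling used in this thread.

```python
def make_tag(language, script=None, region=None, variants=()):
    """Join subtags in BCP 47 order: language, script, region, variant(s).

    Sketch only: no check that the subtags are actually registered.
    """
    parts = [language.lower()]
    if script:
        parts.append(script.title())   # script subtags are title case by convention
    if region:
        parts.append(region.upper())   # region subtags are upper case by convention
    parts.extend(v.lower() for v in variants)
    return "-".join(parts)

print(make_tag("sr", variants=["ekavn"]))                 # sr-ekavn
print(make_tag("sr", script="Cyrl", variants=["ekavn"]))  # sr-Cyrl-ekavn
print(make_tag("sr", script="Latn", variants=["ekavn"]))  # sr-Latn-ekavn
```

However the arguments are passed, the variant always lands after the 
script, which is the point of the three examples above.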

As Mark said, variant subtags that indicate script are unnecessary 
because script subtags exist. And no sort of subtag that indicates 
script is appropriate for spoken content anyway.

Variant subtags should indicate a particular variation of the language 
that *cannot be indicated* using any other type of non-private-use 
subtag, whether script or region. ISO 639-3 defines language code 
elements like "Omani Arabic" and "Cypriot Arabic," and the Registry 
incorporates them, not because the additional code elements make for 
shorter and more convenient tagging than "ar-OM" and "ar-CY", but 
because ISO 639-3/RA has determined that those languages are truly 
different from Standard Arabic and not just regional dialects.

So far, we have two variant subtags: one for Ekavian and one for 
Iyekavian.

> * Croatian and Bosnian are Iyekavian and Latin. Bosnian standard
> allows Cyrillic, as well. (Bosnian and Serbian Iyekavian have
> differences in ~50 words, as well as the most of those Serbian words
> are correct in Bosnian, but not vice versa.). From the point of
> computational linguistics, it would be good if there is a place to put
> the information that those structures of those particular languages
> are the same.

The IANA Language Subtag Registry isn't the place, though. The Registry 
identifies languages and other aspects that may influence languages so 
that content can be tagged and searches can be constructed. It has 
macrolanguages like 'sh' only because the underlying ISO standards have 
them.
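
As a rough illustration of how a search can be constructed over tagged 
content, here is a Python sketch of prefix-style matching in the spirit of 
the "basic filtering" scheme in RFC 4647. The function name basic_filter 
and the sample tags are assumptions made for the example, not anything 
taken from the Registry.

```python
def basic_filter(language_range, tags):
    """Return the tags matched by a basic language range.

    A range matches a tag if the two are equal ignoring case, or if the
    range is a prefix of the tag and the next character is a hyphen.
    The wildcard range "*" matches every tag.
    """
    rng = language_range.lower()
    matches = []
    for tag in tags:
        lowered = tag.lower()
        if rng == "*" or lowered == rng or lowered.startswith(rng + "-"):
            matches.append(tag)
    return matches

content_tags = ["sr-Cyrl-ekavn", "sr-Latn-ekavn", "sr-ME", "hr", "en-AU"]
print(basic_filter("sr", content_tags))
print(basic_filter("sr-Cyrl", content_tags))
```

With this kind of matching, a search for "sr" finds all three Serbian 
tags and a search for "sr-Cyrl" finds only the Cyrillic one. No 
information about how Serbian relates to Croatian or Bosnian is needed or 
used.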

> * Montenegrin official language is still in the phase of development.
> If it's about the language used on official pages of Montenegrin
> government institutions, it is Serbian Iyekavian with two different
> words ("sjutra" instead of "sutra" ["tomorrow"] and "medjed" instead
> of "medved" ["bear"]). If it's about the standard proposed by Doclean
> Academy of Sciences and Arts, then it's about the language system the
> most distant of all other standard languages (it has more phonemes, it
> isn't neo-Shtokavian). Thus, I'd leave this issue until Montenegrins
> make their own decisions. In both variants, Montenegrin could be
> written in Cyrillic and Latin, though Latin is preferred.

"Montenegrin" won't be a language subtag in the Registry unless and 
until ISO 639-3/RA assigns it a code element. The opinion of almost 
everyone who does not have nationalistic skin in the game, including 
Ethnologue, is that "Montenegrin" is either a dialect or simply a 
"variety" of Serbian. It can be represented by "sr-ME", just as 
Australian English is represented by "en-AU".

> * Language systems spoken on the territories of Serbia, Croatia,
> Bosnia and Herzegovina and Montenegro (could be called "Serbo-Croatian
> in wider sense"):
> ** Chakavian (should get ISO 639-3 code, has ISO 639-6 code)
> ** Kaykavian (should get ISO 639-3 code, has ISO 639-6 code)
> ** Torlakian (should get ISO 639-3 code, has ISO 639-6 code)
> ** Shtokavian (should get ISO 639-3 code, has ISO 639-6 code)
> *** Old Shtokavian dialects
> **** Zeta-South Sanjak dialect: basis for Doclean Montenegrin.
> **** ...
> *** New Shtokavian dialects or neo-Shtokavian; could be called
> "Serbo-Croatian in narrower sense".
> **** Ikavian dialects of Western Herzegovina
> **** Iyekavian dialects of Eastern Herzegovina. This is the basic
> dialect for all of the standard languages (except Doclean variant of
> Montenegrin).
> **** Ekavian dialects of Northern [proper] Serbia and Vojvodina. Those
> dialects influenced Serbian Ekavian standard, though Serbian Ekavian
> standard is mostly Ekavian variant of Eastern Herzegovina dialect.

Breaking out the dialects in this way would be a question for ISO 
639-3/RA, not this group. But it would basically involve scrapping all 
their existing code elements for Serbian, Bosnian, Croatian, etc. and 
replacing them with these genetic classifications, so I wouldn't expect 
the RA to make that move any time soon.

For identifying the language of content, or specifying a language for 
search or retrieval, or any of the functions of BCP 47 -- not the study 
of the relationships between languages or their history -- it looks like 
we still have two variant subtags.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell