Generic variants and Armenian dialects (long)

Sun Sep 3 09:54:53 CEST 2006

This has become quite an interesting and complex thread.  I'm going to 
try to break it down into sub-topics and see if that makes anything 
clearer, or makes any decisions easier.

1.  Armenian dialects

Mark has identified two main dialects of Armenian, Western and Eastern, 
which need to be tagged differently for localization purposes.  The 
existence of these two dialects appears to be well attested; the 
Ethnologue entry for Armenian lists more than 30 dialects but ultimately 
narrows its discussion to Western and Eastern.  The two are also 
discussed in Wikipedia, which says:

"The two modern literary dialects, Western (originally associated with 
writers in the Ottoman Empire) and Eastern (originally associated with 
writers in the Russian Empire), removed almost all of their Turkish 
lexical influences in the 20th century, primarily following the genocide 
of the Armenians in Anatolia by the Turks in 1915–1920."

and later:

"Armenian can be subdivided in two major dialectal blocks and those 
blocks into individual dialects, though many of the Western Armenian 
dialects have died due to the effects of the Armenian Genocide.  In 
addition, neither dialect is completely homogeneous: any dialect can be 
subdivided into several subdialects.  While Western and Eastern Armenian 
are often described as different dialects of the same language, some 
subdialects are not readily mutually intelligible.  It is true, however, 
that a fluent speaker of two greatly varying subdialects who are exposed 
to the other dialect over even a short period of time will be able to 
understand the other with relative ease."

This thread from a group called the Armenian Club Forum shows Armenian 
speakers discussing the dialectal differences in terms of 
Arevelahyeren (Eastern) and Arevmtahyeren (Western):

http://forum.armenianclub.com/showthread.php?t=1632&page=2

It seems clear that the distinction is real, and even though it is 
possible to break down the dialects further, that does not prevent us 
from creating variant subtags to identify these two major dialects while 
reserving the right to provide finer distinctions in the future if 
necessary.

It was stated that the Eastern/Western distinction is really a matter of 
usage in Armenia proper vs. the diaspora.  Since the latter group is not 
tied to one particular country or region, use of ordinary region subtags 
("hy-AM" vs. "hy-somewhere else") doesn't seem sufficient.  It certainly 
appears to be more complex than "Armenians in France and California." 
In any case, even if this is just an “Armenia vs. diaspora” distinction, 
there should still be some way to tag it if it is linguistically justified, 
which would appear to be true if a basic word like “please” is different 
between the two (Wikipedia says Eastern uses խնդրեմ (khntrem) while 
Western uses յաճիս (hadjis), and other sites list other differences).

2.  Generic variants

The proposals for variants "Western" and "Eastern" came with comments 
strongly implying that they could be used with other languages besides 
Armenian, if those languages have a "Western" and "Eastern" dialect.

One possible danger is that someone will decide to start using them to 
mark ordinary regional distinctions, instead of true dialects or other 
linguistic differences.  For example, it would not make sense to create 
the tags "de-DE-eastern" and "de-DE-western" to distinguish German used 
in the former DDR from that used in pre-unification West Germany, unless 
commonly accepted linguistic varieties called "Western German" and 
"Eastern German" had evolved as a result of the division (which AFAIK 
they have not).

I get the feeling that at least a small part of the motivation for 
proposing these is as a test case to see how variants with multiple 
prefixes will fly.  I understand this curiosity -- there's a part of me 
that wants to see at least one extension RFC, to see what form they 
would take and how the extension registry would be constructed -- but it 
shouldn't really figure into the present proposals.  So far, there is 
only justification to use such subtags for Armenian.

In fact, Section 3.5 of RFC 3066bis implies rather strongly that a 
single variant should NOT have two or more overloaded meanings, 
rendering much of this "multiple prefixes" discussion moot:

“Requests to add a prefix to a variant subtag that imply a different 
semantic meaning will probably be rejected.  For example, a request to 
add the prefix "de" to the subtag 'nedis' so that the tag "de-nedis" 
represented some German dialect would be rejected.  The 'nedis' subtag 
represents a particular Slovenian dialect and the additional 
registration would change the semantic meaning assigned to the subtag. 
A separate subtag SHOULD be proposed instead.”

IIRC, the motivation to allow multiple prefixes was to establish the 
rules for using two or more variants together.  For example, it would be 
senseless to write "sl-nedis-rozaj", because those two variants imply 
different, mutually exclusive dialects.  But if there were a variant 
"splat" whose meaning was orthogonal to "nedis" and "rozaj", then it 
would be appropriate to allow, say, "sl-nedis-splat".  This would be 
achieved by listing "sl-nedis" and "sl-rozaj", as well as "sl", as valid 
prefixes for variant "splat".  This mechanism was NOT intended to 
encourage using the same variant for two or more different languages.
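
As a rough illustration of the prefix mechanism just described, here is a 
small Python sketch.  It is only a sketch under stated assumptions: the 
registry table, the function names, and the exact matching rule are my 
own illustration and are not quoted from the RFC, and "splat" is the 
made-up variant from the paragraph above.

```python
# Hypothetical mini-registry: variant subtag -> list of allowed prefixes.
# "nedis" and "rozaj" are the Slovenian variants discussed above; "splat"
# is the invented orthogonal variant.
VARIANT_PREFIXES = {
    "nedis": ["sl"],
    "rozaj": ["sl"],
    "splat": ["sl", "sl-nedis", "sl-rozaj"],
}


def prefix_matches(preceding, prefix):
    """Return True if `prefix` is appropriate for the subtags in `preceding`.

    Assumed rule (an interpretation, not RFC text): every subtag of the
    prefix must occur in `preceding`, with the language first, and any
    variant already present in `preceding` must be named by the prefix.
    """
    wanted = prefix.lower().split("-")
    if not preceding or preceding[0] != wanted[0]:
        return False
    if any(sub not in preceding for sub in wanted):
        return False
    earlier_variants = [s for s in preceding if s in VARIANT_PREFIXES]
    return all(v in wanted for v in earlier_variants)


def variants_appropriate(tag):
    """Check every registered variant in `tag` against its prefix list."""
    subtags = tag.lower().split("-")
    for i, sub in enumerate(subtags):
        if sub in VARIANT_PREFIXES:
            allowed = VARIANT_PREFIXES[sub]
            if not any(prefix_matches(subtags[:i], p) for p in allowed):
                return False
    return True
```

With this table, variants_appropriate("sl-nedis-splat") is true because 
"sl-nedis" is listed as a prefix for "splat", while 
variants_appropriate("sl-nedis-rozaj") is false because "rozaj" lists 
only "sl".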

3.  Names for the two proposed subtags

Mark has proposed that the Description fields for these two subtags be 
"western" and "eastern" respectively.  Ignoring for the moment the 
question of overloading these variants for other languages, the premise 
is that this is how the dialects are best known.

Michael made a counter-suggestion of "arevemda" for Western Armenian and 
"arevela" for Eastern Armenian.  (Frank may have a point that "arevmda" 
or "arevmta", without the second 'e', is better Armenian.)  It seems to 
me that a major reason for suggesting these alternative names is to 
prevent the subtags from being reused with other languages.  These 
appear to be simply derived from the Armenian words for "west" and 
"east".

If the restriction against overloading a single variant for different 
languages (Section 3.5, above) is honored, these two variants should 
only be used for Armenian regardless of whether they are called 
"western" or "arevemda" or "poiuytre".  Obviously the last is 
undesirable, and intended for effect; the question should be whether the 
first (English) is clearer to potential users of the subtag than the 
second (Armenian).  I suggest asking actual speakers of Armenian.  I 
concede that the words for "west" and "east" are quite similar in 
Armenian, but then that is true for a great many languages.

My personal preference is for "arevmda" and "arevela" (assuming that 
Frank is right about the superfluous 'e'), on the basis that it will 
discourage inappropriate usage of the subtags while still providing 
meaningful strings.  I greatly dislike "hywest" and "hyeast" (or 
"hyewest" and "hyeeast") since they attempt to discourage inappropriate 
usage by introducing significant visual clutter and redundancy to the 
tag.  We do not require users to write "sl-slnedis" or "de-de1996", and 
we will not require "en-enboont" in the future.

4.  Variants with no prefix, or used with the wrong prefix

Mark brought up what he called the "prefix bug": a variant created 
without a prefix could never have a prefix added, because doing so would 
restrict (not "broaden") the set of allowed prefixes.  Whether this is 
strictly true or not (John claimed it is not), I agree with Addison that 
no variant should, in fact, ever be registered without a prefix; doing 
so strongly encourages inappropriate usage.  It's hard to envision a 
variant that would be suitable for all languages (concepts like 
"casual," "business," "legal," "sardonic," and "paternalistic" add 
little or no value to language tags as people tend to use them, and 
don't even exist in all languages).  This should be made much clearer in 
RFC 3066ter, and yes, I know the LTRU list is the right place to fight 
that battle.

John mentioned that "en-1901" would be an invalid tag.  This was of 
special interest to me since I have written a validating parser (part of 
my tag-generating program which will be freely available as soon as the 
RFCs are published).  The way I read RFC 3066bis, the answer is 
inconclusive.  Section 2.2.9 says a validating parser must “check that 
the [variant sub]tag must match at least one prefix,” which implies that 
the tag is not valid if it does not.  But Sections 2.2.5 and 3.1 speak 
only in terms of variants being “not suitable” or “inappropriate” with 
certain prefixes.  So to me, the normative aspects of this are not 
clear.  My parser, which identifies tags as valid ("green light") or 
invalid ("red light"), also has a third, “yellow light” status for tags 
that are technically valid but ill-advised, such as "en-1901".  This 
also covers cases like using a deprecated subtag ("iw") or explicitly 
specifying a Suppress-Script ("fr-Latn").
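
To make the three-way status concrete, here is a minimal Python sketch. 
It is not the parser mentioned above; the names (Light, classify), the 
tiny lookup tables, and the loose well-formedness test are all 
assumptions made for illustration, and the prefix comparison is 
deliberately naive.

```python
import re
from enum import Enum


class Light(Enum):
    GREEN = "valid"
    YELLOW = "valid but ill-advised"
    RED = "invalid"


# Tiny illustrative tables; a real tool would load the full registry.
DEPRECATED = {"iw"}
SUPPRESS_SCRIPT = {"fr": "latn", "en": "latn"}
VARIANT_PREFIXES = {"1901": ["de"], "1996": ["de"]}

# Loose well-formedness test: 1 to 8 alphanumerics per subtag.
# (Assumption: far simpler than the RFC's actual grammar.)
SUBTAG = re.compile(r"[a-z0-9]{1,8}")


def classify(tag):
    """Return a Light for `tag` using the illustrative tables above."""
    subtags = tag.lower().split("-")
    if not all(SUBTAG.fullmatch(s) for s in subtags):
        return Light.RED
    language = subtags[0]
    if language in DEPRECATED:
        return Light.YELLOW
    # Script subtag repeats the language's Suppress-Script.
    if len(subtags) > 1 and SUPPRESS_SCRIPT.get(language) == subtags[1]:
        return Light.YELLOW
    # Variant used with a prefix it is not registered for.
    for i, sub in enumerate(subtags[1:], start=1):
        prefixes = VARIANT_PREFIXES.get(sub)
        if prefixes is not None:
            head = "-".join(subtags[:i])
            if not any(head == p or head.startswith(p + "-") for p in prefixes):
                return Light.YELLOW
    return Light.GREEN
```

Under these assumptions, classify("en-1901"), classify("iw") and 
classify("fr-Latn") all return Light.YELLOW, while classify("de-1901") 
returns Light.GREEN and a malformed tag such as "en--US" returns 
Light.RED.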

Addison asked:

“Also: what happens if we have "tlh-western" and a new subdialect 
"fooish" is registered. Do we do "tlh-western-fooish" or "tlh-fooish"?”

That depends on how the prefixes are defined.  As I wrote above under 
"Generic variants," with Slovenian it was clear that “nedis” and “rozaj” 
were inappropriate together, so each was defined with only “sl” as its 
prefix.  This is specified in Section 2.2.5.  (My parser flags 
“sl-nedis-rozaj” with a yellow light, on the basis of “rozaj” having an 
inappropriate prefix “sl-nedis”, although combinations like 
“sl-Tibt-rozaj” or “sl-JM-rozaj” are fine.)

5.  Comments

At the risk of beating this to death:  I believe the Comments field 
should contain enough information to allow users to select the 
appropriate subtag(s) for their tagging needs, and understand why they 
are appropriate, IN CONJUNCTION WITH THE RFC.  I don't believe it's 
necessary to add a tutorial on how variants are to be used -- that 
information is available in the RFC -- and especially not one that 
contradicts the RFC.

I also would prefer not to see Registry entries burdened with a Comments 
field that explains the obvious:

Type: variant
Subtag: western
Description: Western
Prefix: hy
Comments: Prefix ‘hy’, Western Armenian

In the above case, the Comments field adds no information that was not 
evident from the rest of the entry.  If a variant is given two or more 
prefixes that represent different languages (which should never happen; 
see my comments about Section 3.5 above), making the usage potentially 
confusing, the comments can be added for all languages at that time.

--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/
Editor, draft-ietf-ltru-initial