Request: Language Code "de-DE-1996"

A. Vine andrea.vine@Sun.COM
Tue, 23 Apr 2002 16:34:12 -0700


Peter_Constable@sil.org wrote:
> 

> >one pair is clearly
> >different in vocabulary but ambiguous as to orthography, the other
> >pair is clearly different in orthography but ambiguous as to vocabulary.
> >The second pair is fairly trivial, because the vocabulary differences
> >are not that large; the first pair might require more effort to
> >construct.
> 
> And my suggestion was that tags distinguishing the first pair are what
> probably are not needed: indicate a vocabulary distinction while remaining
> ambiguous regarding orthography. It seems to me in most reasonably likely
> scenarios, if people are creating a data set that follows certain criteria
> with regard to vocabulary, then they will also be assuming certain criteria
> with regard to orthography.
> 

Sorry, I just don't follow this logic.  The vocabulary of a region/country and
the orthography rules seem pretty orthogonal to me.

My prior point was this:

In the near future we won't have separate tags: one for language (however it may
be defined), a second for orthography (however it may be defined), or for that
matter a third for script/writing system.  That is, nothing like this:

Content-language: de
Content-orthography: 1996      {maybe someday}
Content-script: Latin

So everything is contained in one tag, packed with as little or as much info as
the tagger may have.  The tagger might be the originator of the data, or might
not be.  Often a language tag is added after the fact; there is, after all, a
large body of untagged text out there which may subsequently be tagged.  Hence
the assumptions.  I doubt anyone will move backend subtags to the front, so a
simple tag of "scouse" is unlikely.  The convention among taggers and tag
readers is to fall back by truncating the language tag, so de-DE falls back to
de, and de-DE-1996 is likely to become de-1996 and then de, or de-DE and then
de.
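
To make the two fallback paths concrete, here is a rough sketch in Python.  The
function names are made up for illustration, and nothing here is prescribed by
RFC 3066; it only shows the two orders in which a reader might shorten
de-DE-1996.

```python
def truncate_from_right(tag):
    """Drop the last subtag repeatedly: de-DE-1996 -> de-DE -> de."""
    parts = tag.split("-")
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]


def drop_middle_first(tag):
    """For a three-part tag, drop the middle subtag first:
    de-DE-1996 -> de-1996 -> de.  Any other tag is truncated from the right.
    (Hypothetical alternative order, not a standard algorithm.)"""
    parts = tag.split("-")
    if len(parts) != 3:
        return truncate_from_right(tag)
    return [tag, parts[0] + "-" + parts[2], parts[0]]


print(truncate_from_right("de-DE-1996"))
print(drop_middle_first("de-DE-1996"))
```

A reader that only truncates from the right never sees de-1996, so the
orthography info is the first thing lost; the second order keeps it one step
longer at the cost of the region.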

So, for example, there is the tag "de", which means all that has been determined
by anyone is that the text is in German, with no info regarding which German or
which orthography.  Then there's de-DE, which is German as used in Germany, with
no info on the orthography.  Then there is de-1996, which is German with no info
on which regional variety, but with a definitive determination that it is in the
1996 orthography.  As for the de-DE-1996, de-AT-1996, and de-CH-1996 tags, I
assume (and those familiar with the orthography rules for all of the German
varieties can confirm or deny) that there is a good amount of commonality, to
the point where one could create German text which is clearly in the new
orthography but is still of indeterminate regional variety.
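
Put another way, each tag asserts some things and leaves others unknown.  A
small sketch, assuming (my assumption, purely for illustration) that a
two-letter subtag is a region and a four-digit subtag is an orthography year;
the function name "describe" is hypothetical:

```python
def describe(tag):
    """Report what a tag asserts; None means 'not determined by the tagger'."""
    parts = tag.split("-")
    info = {"language": parts[0].lower(), "region": None, "orthography": None}
    for sub in parts[1:]:
        if len(sub) == 2 and sub.isalpha():
            info["region"] = sub.upper()      # e.g. DE, AT, CH
        elif len(sub) == 4 and sub.isdigit():
            info["orthography"] = sub         # e.g. 1901, 1996
    return info


for tag in ("de", "de-DE", "de-1996", "de-DE-1996"):
    print(tag, describe(tag))
```

Here de-1996 comes back with an orthography but no region, which is exactly the
"new orthography, indeterminate German" case.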

Now, I'm assuming you were trying to make the distinction between the 1996
orthography for Germany's German vs. that of Austria's German.  The question is,
does de-AT-1996 point to a different orthography from de-DE-1996?  And if so,
shouldn't it be de-1996-AT?  But RFC 3066 doesn't allow that order.  Even given
the de-AT-1996 order, are 1901 and 1996 static subtags for all languages?  In
other words, every time "de" appears it stands for German, and "DE" for Germany,
so will "1996" stand for the German orthography rules from 1996 no matter where
it is used?  But I don't think RFC 3066 allows for mix and match of the
third-level subtags, and they all have to be approved.

It all could be clearer, it's true, but with the mechanism and the pre-defined
"languages" and "territories", I think this might be the best we can do.  It's
probably insufficient for linguists and scholars, but plenty sufficient for the
average user.

Andrea