What to do with Gaulish ?

Thu Nov 16 16:04:00 CET 2006

(Replying only to ietf-languages, since I'm not subscribed to the other 
lists that received this mail.)

CE Whitehead <cewcathar at hotmail dot com> wrote:

> Hi, I am troubled by tags like frc, fro, and frm because I am 
> wondering what happens when a person using a search engine asks for 
> pages in French?  Will the frc, fro, frm pages turn up too?  It's 
> quite possible that a person interested in French will be interested 
> in moyen Francais/Middle French (frm) and in Old French (fro) if the 
> search is for someone studying French.

Neither the Language Subtag Registry nor, as far as I can tell, any of 
the ISO 639 family of standards includes this type of time-hierarchy 
information.

It's just as possible that an ordinary user looking for "French" text, 
say for business or shopping, may not understand Old French and Middle 
French, and will not want scholarly material in those languages.  It is 
probably best to require the student to indicate them explicitly.

> Also, as I noted, some of the 17th Century new world documents were in 
> Middle French although you all have set the dates as 1400-1600 (those 
> dates can vary a bit; you'd be surprised also at the amount of 
> variation you can get in any given language at any given time before 
> literacy was so widespread).

This comment should be directed to the ISO 639 folks, since RFC 4646 
(and its predecessors), and thus the W3C, take the language descriptions 
and dates directly from ISO 639.

It's well known that the dates aren't exact, and indeed cannot be, since 
in almost all cases linguistic change occurs gradually rather than being 
legislated into existence.

> It's also conceivable that a person might want documents that are 
> written in either a Creole of French and Standard French.
>
> One could of course list all of the languages related to a particular 
> page using the meta content tags; for example for my "Moyen francais" 
> document I could list:
> lang=en, fr, frm

Language tags are defined as representing a single language (unless the 
subtag "mul" is used, which probably provides less information than any 
alternative).  The application-specific structure that *uses* language 
tags -- in this case, the "lang" attribute -- is the way to indicate 
multiple languages.

> Why not also have optional variant tags indicating the century in 
> which a dialect/language was used, for example
>
> 12c (12th century, 1100-1199 A.D.)
> 13c (13th century, 1200-1299 A.D.)
> 14c
> 15c
> 16c
> 17c
>
> and so forth.

From a mechanical standpoint, these variants would need to adhere to 
the RFC 4646 syntax for variant subtags: either 5 to 8 letters and 
digits, or 4 if the subtag begins with a digit.  You could propose 
"12cent", "13cent", etc. and these would be syntactically acceptable.

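The syntax rule above can be checked mechanically.  What follows is only 
a minimal Python sketch of the variant shape as described in this 
message; the helper name is_wellformed_variant is made up for 
illustration, and passing the check says nothing about whether a subtag 
would actually be registered.

```python
import re

# Variant shape as described above: 5 to 8 letters and digits,
# or exactly 4 characters when the first one is a digit.
# The character classes are ASCII-only on purpose.
VARIANT_RE = re.compile(r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}")


def is_wellformed_variant(subtag: str) -> bool:
    """Hypothetical helper: True if subtag has the shape of a variant."""
    return VARIANT_RE.fullmatch(subtag) is not None


# The proposed "12c"-style subtags are too short; "12cent" fits the rule.
for candidate in ("12c", "17c", "12cent", "17cent"):
    print(candidate, is_wellformed_variant(candidate))
```

Under this check "12c" and "17c" are rejected because they are only three 
characters long, while "12cent" and "17cent" are accepted.
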
From a linguistic standpoint, you would need to convince the Language 
Subtag Reviewer that such variants are justified, and not an 
overspecification.  Not all languages changed on a tidy 
century-by-century basis, and of course the base-10 dates "1100-1199" 
are just as arbitrary as "1400-1600" mentioned above.

> These become quite relevant for 17th century European languages which 
> are 'modern' sort of but sometimes vary quite a bit from the modern 
> version of the language (I found this to be the case when dealing with 
> 17th century French in a report coming from the U.S.; some of the 
> features I noted in the 1683 report were reminiscent of Old French, 
> many of Middle French, spellings were sometimes irregular and 
> phonetic; it might be understandable to a speaker of Modern French but 
> so might 16th century French which does get the Middle French tag; 
> elsewhere, in some texts from France, 17th century French appears more 
> like the modern variety; likewise Shakespeare's 16th-17th century 
> English is modern, in fact, as I understand things, his use of English 
> based on Scots dialect made Modern English what it is; but it does 
> vary a bit from English used today).

It would seem difficult (not to say "impossible") for a system of short 
alphabetic codes to accurately reflect subtle historical nuances like 
this.  It's understood that such nuances exist.

> On the same issue, what is going to happen with Arabic, when you get 
> the new subtags, will people still be able to use ar with a country 
> code to indicate the language?  Or is the new subtag to be the only 
> option?  I am not sure which should be the case myself as the dialects 
> are quite different, though most written Arabic is not spoken but 
> standard so these new codes should probably only apply for spoken 
> materials or phonetic transcriptions for the most part.

In RFC 4646bis, "ar" will continue to mean Arabic generally (not 
necessarily restricted to "standard Arabic"), while primary-extended 
pairs like "ar-arz" or "ar-abh" will reflect the dialects coded in ISO 
639-3.

> Of course, having a century tag would not solve everything for 
> languages that vary over time:

Certainly not; language is more complex than that.

> On this note, I'd like to know how to apply for a variant (not a 
> language) subtag, 17c, if I may do so.
> Hope I may.

See RFC 4646, section 3.5, "Registration Procedure for Subtags."  Make 
sure you understand the syntactical restrictions, which are there for a 
reason -- to allow the different types of subtags to be identified based 
on structure and position within the tag.  Also bear in mind that not 
all requests are automatically approved.

http://www.rfc-editor.org/rfc/rfc4646.txt
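
To illustrate what "identified based on structure and position" means in 
practice, here is a rough Python sketch that labels the subtags of a tag 
by shape and order alone, without consulting the registry.  It is a 
simplification under stated assumptions: the function name 
classify_subtags is hypothetical, the primary language subtag is taken on 
faith, and extensions, private-use subtags, and grandfathered tags are 
ignored.

```python
import re

# Shapes of the subtags that may follow the primary language subtag,
# listed in the order in which they are allowed to appear.
# Simplified: extensions, private use, and grandfathered tags are not handled.
SHAPES = [
    ("extlang", re.compile(r"[A-Za-z]{3}")),
    ("script", re.compile(r"[A-Za-z]{4}")),
    ("region", re.compile(r"[A-Za-z]{2}|[0-9]{3}")),
    ("variant", re.compile(r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}")),
]


def classify_subtags(tag: str) -> list:
    """Hypothetical helper: label each subtag using shape and position only."""
    parts = tag.split("-")
    result = [("language", parts[0])]  # first subtag is not validated here
    stage = 0  # index into SHAPES; it never moves backwards
    for part in parts[1:]:
        for i in range(stage, len(SHAPES)):
            name, pattern = SHAPES[i]
            if pattern.fullmatch(part):
                result.append((name, part))
                # extlang and variant may repeat; script and region may not
                stage = i if name in ("extlang", "variant") else i + 1
                break
        else:
            result.append(("unrecognized", part))
    return result


print(classify_subtags("fr-FR-17cent"))
print(classify_subtags("frm-12c"))
```

In this sketch "17cent" can only be a variant, whereas "12c" matches no 
shape at all.  A four-character variant that begins with a digit also 
cannot be mistaken for a four-letter script subtag, which is presumably 
part of the reason for that restriction.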

--
Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages