language tag structure

Mon Jan 17 02:15:08 CET 2005

The debate over the revision of RFC 3066 led to confusion because of the
rigidity imposed by the Internet standard process and its BCP status. This
is because RFC 3066 covered various issues which today have widened too much
to fit into one single document written by specialists of only one single
application. I am not interested in addressing this IESG problem.

It also showed that the Internet, the IT industry, applications and the
application area WGs need a stable and consensual tagging system. This is a
specific need which should be addressed independently, in a way that serves
all the "customers" needs (XML, IRI, DNS, OPES, etc.). My interest as a CRC
(common reference center) project developer is a tagging structure that is
consistent, flexible and open enough to consistently document the largest
number of human related aspects.

I am interested in criticism of the following step by step approach.
Attention: I do not want to write a Draft here, just to work out the
basis of a common approach to a common consistent solution to different needs.

1. the language tag concatenates 5 sub-tags about:

     - the language
     - the scripting
     - the geographical area of use
     - the style
     - the authoritative source/reference

    example : the Microsoft Word for French of France I have on my PC right now.
     - the language is French
     - the scripting is Latin
     - the geographical area is France or Belgium or Luxembourg or Canada
or Monaco or Switzerland (many other countries should show up)
     - the authoritative source/reference is Microsoft (and they miss a 
_lot_ of words)
     - the style options can be personal, official, etc.

2. only the language information is mandatory in a tag; every other piece
of information is optional or required depending on the application.
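
To make points 1 and 2 concrete, here is a minimal sketch in Python. It is
only an illustration of the structure, not a proposed implementation: the
class name LangTag, the field names and the style code "A" are assumptions
of mine and appear in no standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LangTag:
    """Five sub-tags; only `language` is required (points 1 and 2)."""
    language: str                     # e.g. "fr"
    script: Optional[str] = None      # e.g. "Latn"
    area: Optional[str] = None        # e.g. "FR"
    style: Optional[str] = None       # e.g. "A" (hypothetical style code)
    authority: Optional[str] = None   # e.g. "Microsoft"

    def __post_init__(self):
        # Point 2: the language information is the only mandatory part.
        if not self.language:
            raise ValueError("the language sub-tag is mandatory")

    def present(self):
        """Return the names of the sub-tags that are actually filled in."""
        return [name for name, value in vars(self).items() if value is not None]


# The Word example above, without a style, and the smallest legal tag.
word_fr = LangTag("fr", script="Latn", area="FR", authority="Microsoft")
minimal = LangTag("fr")
```

An application that needs, say, the area would simply check that "area" is
in the list returned by present() before accepting the tag.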

3. the langtag information MUST be independent from the matching algorithm. 
Its role is to support a complete definition of the language by its 
authoritative source. There may be an unlimited number of authoritative 
sources.

3. there can be different presentation formats of the langtags.

     fr-FR         (RFC 3066)
     fr-Latn-FR  (RFC 3066 bis)
     fr-Latn-FR-A-Microsoft
     latn.fra.fr.a.microsoft
     For multilingual considerations, characters used in tags will be
considered as "0-.z" numerals (lower or upper case).
     A "0-.9" numeric version should also be supported.
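
The presentation formats listed above can be sketched as two small Python
functions working on the same five pieces of information. This is a sketch
under assumptions: I read "A" in the examples as the style and "Microsoft"
as the authority, the function names are hypothetical, and the 2 letter to
3 letter mapping is a tiny excerpt of ISO 639 given only for illustration.

```python
# Tiny excerpt of ISO 639 (2 letter code -> 3 letter code), for illustration.
ISO639_3LETTER = {"fr": "fra", "en": "eng"}


def hyphen_form(language, script=None, area=None, style=None, authority=None):
    """RFC 3066-like presentation: language-script-area-style-authority."""
    parts = [language, script, area, style, authority]
    return "-".join(p for p in parts if p)


def dotted_form(language, script=None, area=None, style=None, authority=None):
    """Dotted lowercase presentation: script.language.area.style.authority."""
    lang3 = ISO639_3LETTER.get(language, language)
    parts = [script, lang3, area, style, authority]
    return ".".join(p.lower() for p in parts if p)


print(hyphen_form("fr", area="FR"))                        # fr-FR
print(hyphen_form("fr", "Latn", "FR"))                     # fr-Latn-FR
print(hyphen_form("fr", "Latn", "FR", "A", "Microsoft"))   # fr-Latn-FR-A-Microsoft
print(dotted_form("fr", "Latn", "FR", "A", "Microsoft"))   # latn.fra.fr.a.microsoft
```

Both functions only change the presentation; the information carried is the
same. Note that when optional sub-tags are skipped the position alone no
longer tells which sub-tag is which, so a real format would need a rule for
that.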

4. default code tables used in the tag would be:

     ISO 639     for languages
     ISO 15924  for scriptings
     RFC 1591   for countries (ISO 3166 2 letter code as approved by
ICANN/ITU as a reflection of the real world through the GAC. This approach
only gives a better response time in adapting to the real world, since
through their ISO, ITU and GAC membership the interested governments will
find the best solution). Region can be UN or an ad hoc table
     A style list should be created. It might be a list of 1 character
codes (with optional sub entries)?
     Authority should be registered through a mailbox name on the mailing
list of a dedicated Multilingual Task Force. Authority may include a
year/publication number subfield when the same authority has
published several references. Authority may also be identified by its
registration number.

     Extensions to these tables will be accepted to address an extended and
homogeneous vision of languages and to support additional related tables
forming a homogeneous cultural ontology (CulturaMundi) that also supports
machine languages.

     Subfields should be supported for dialect, particular forms of 
scriptings (I can think of several French scriptings and phonetics), areas, 
styles and vocal accents.

     Other codings could be supported with a prefix. The target is to 
support the largest number of declared or private ID tables.
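
As an illustration of how the default tables of point 4 could be used, here
is a small Python sketch that checks sub-tags against them. The table
contents are tiny excerpts chosen for the example, and the names TABLES and
unknown_subtags are hypothetical. Style and authority are left unchecked
because their lists do not exist yet.

```python
# Tiny illustrative excerpts; the real tables are far larger.
LANGUAGES = {"fr", "en", "de"}                     # ISO 639
SCRIPTS = {"Latn", "Cyrl", "Arab"}                 # ISO 15924
COUNTRIES = {"FR", "BE", "LU", "CA", "MC", "CH"}   # ISO 3166 2 letter codes

TABLES = {"language": LANGUAGES, "script": SCRIPTS, "area": COUNTRIES}


def unknown_subtags(**subtags):
    """Return the names of sub-tags whose value is absent from its default table."""
    return sorted(
        name for name, value in subtags.items()
        if name in TABLES and value is not None and value not in TABLES[name]
    )


print(unknown_subtags(language="fr", script="Latn", area="FR"))   # []
print(unknown_subtags(language="fr", script="Xxxx", area="ZZ"))   # ['area', 'script']
```

Extensions or private tables would then amount to adding entries to TABLES,
without touching the tag structure itself.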

5. ISO 7000 oriented ICONs

     I have not been able to find today a free ISO 7000 document part which 
would give enough information. I suppose that the best solution would be an
icon of special shape (easily identified as a language tag: a book?) with a
color code to indicate the style, encapsulating the location flag and 
marked with the language ISO 639 code in the corresponding scripting. A 
face could indicate the type of voice when the document is vocal 
(identified by a loud-speaker instead of a book?)

6. Registrations and Publication

     Some applications call for a registration, while real life lives with
descriptions. Nothing prevents a CRC (IANA or other(s) - like
manufacturer reference centers) from registering in toto or in parte its
cultural matrix, to serve as an application reference or to support special
requirements. The registration made at the address of a given tag is only
for the benefit of the application users and not a structural reference
for developers. This means that a particular registration by IANA or
another CRC at fr-Latn-FR SHOULD have no influence on the way an
application software could be designed.

    I intend to use that tagging standard in the INTFILE reporting on the 
DNS top level status. This file is to support information on a per ccTLD 
basis. For example the style "D" could be for IDN tables.

7. IRI

     Before presenting an informational Draft that would make the
lang/culturetags consistent with other taggings (DNS, keywords, access
engines, OPES triggering, search engines, etc.) I suppose the best place
for the core consistency of the whole set of documents under way is Martin
Duerst's IRI Draft. And all these documents refer to RFC 3066.

     Martin, I have carefully read your IRI draft (10.txt ?) several times. 
I am not sure I understand everything. This is certainly due to my low IQ. 
But also because some definitions seem to be missing. In particular I have
not been able to understand exactly what you name a "name-reg" and
therefore to determine if the proposed IRI format fully supports ML.ML
domain names (unicode.unicode or xn--.wn-- that some countries and private
networks may want or already have implemented)? I am not sure either if
Upper/Lower case differentiation can be fully supported if the name-reg
always had to support it (Punycode is supposed to be able to support
it - discussed for the LHS support - Upper case should be supported
everywhere in the IRI even when not used by the DNS).

I thank you for your comments.
jfc