language tag structure
JFC (Jefsey) Morfin
jefsey at jefsey.com
Mon Jan 17 02:15:08 CET 2005
The debate over the revision of RFC 3066 led to confusion because of the
rigidity imposed by the Internet standard process and its BCP status. This
is because RFC 3066 covered various issues which have today widened too much
to fit into a single document written by specialists of a single
application. I am not interested in addressing this IESG problem.
It also showed that the Internet, the IT industry, applications and application
area WGs need a stable and consensual tagging system. This is a specific
need which should be addressed independently, taking into account all the
"customers" needs (XML, IRI, DNS, OPES, etc.). My interest as a CRC (common
reference center) project developer is a tagging structure that is consistent,
flexible and open enough to document consistently the largest
number of human-related aspects.
I am interested in criticism of the following step-by-step approach.
Note: I do not want to write a Draft here, just to work out the
basis of a common approach to a common, consistent solution to different needs.
1. the language tag is to concatenate 5 sub-tags about:
- the language
- the scripting
- the geographical area of use
- the style
- the authoritative source/reference
example: the Microsoft Word for French of France I have on my PC right now.
- the language is French
- the scripting is Latin
- the geographical area is France or Belgium or Luxembourg or Canada
or Monaco or Switzerland (many other countries should show up)
- the authoritative source/reference is Microsoft (and they miss a
_lot_ of words)
- the style options can be personal, official, etc.
2. only the language information is mandatory in a tag; all the other
information is optional or not, depending on the application.
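As a minimal sketch of points 1 and 2, the five sub-tags could be held in a small record where only the language is required. The class name `LangTag` and its field names are hypothetical, chosen only for illustration; the values come from the Word example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LangTag:
    """Hypothetical record for the five sub-tags of point 1."""
    language: str                     # mandatory (point 2)
    script: Optional[str] = None      # optional
    area: Optional[str] = None        # optional
    style: Optional[str] = None       # optional
    authority: Optional[str] = None   # optional

    def __post_init__(self):
        # Only the language information is mandatory in a tag.
        if not self.language:
            raise ValueError("the language sub-tag is mandatory")

    def present(self):
        """Return the names of the sub-tags that are actually set."""
        return [name for name, value in vars(self).items() if value is not None]


# The Word example, with one area picked and no style given.
word_fr = LangTag("fr", script="Latn", area="FR", authority="Microsoft")
minimal = LangTag("fr")
```

Whether a given application requires more than the language is left to that application; the record itself enforces nothing beyond point 2.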
3. the langtag information MUST be independent from the matching algorithm.
Its role is to support a complete definition of the language by its
authoritative source. There may be an unlimited number of authoritative
sources.
3. there can be different presentation formats of the langtags.
fr-FR (RFC 3066)
fr-Latn-FR (RFC 3066 bis)
fr-Latn-FR-A-Microsoft
latn.fra.fr.a.microsoft
For multilingual considerations, characters used in tags will be
treated as "0-.z" numerics (lower or upper case).
A "0-.9" numeric version should also be supported.
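The four presentations listed above can be produced from the same five values. The following is a sketch only: the function names are hypothetical, the ordering of the dotted form is inferred from the single example `latn.fra.fr.a.microsoft`, and the two-letter to three-letter language mapping is a one-entry excerpt for illustration.

```python
# Hypothetical one-entry excerpt mapping two-letter to three-letter ISO 639 codes.
ALPHA3 = {"fr": "fra"}


def hyphen_form(language, script=None, area=None, style=None, authority=None):
    """Join the sub-tags that are present with hyphens, language first."""
    parts = [language, script, area, style, authority]
    return "-".join(p for p in parts if p)


def dotted_form(language, script, area, style, authority):
    """Lower-case, dot-separated form with the script first (inferred order)."""
    parts = [script, ALPHA3[language], area, style, authority]
    return ".".join(p.lower() for p in parts)


print(hyphen_form("fr", area="FR"))                       # fr-FR
print(hyphen_form("fr", "Latn", "FR"))                    # fr-Latn-FR
print(hyphen_form("fr", "Latn", "FR", "A", "Microsoft"))  # fr-Latn-FR-A-Microsoft
print(dotted_form("fr", "Latn", "FR", "A", "Microsoft"))  # latn.fra.fr.a.microsoft
```

Dropping absent sub-tags keeps the hyphen form compatible with the RFC 3066 example, but it leaves open how a reader tells a style from an authority when the sub-tags before them are missing; that question is not settled here.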
4. default code tables used in the tag would be:
ISO 639 for languages
ISO 15924 for scriptings
RFC 1591 for countries (ISO 3166 2-letter codes as approved by
ICANN/ITU as a reflection of the real world through the GAC. This approach
only gives a better response time to real-world adaptation, since through
their ISO, ITU and GAC membership interested governments will find the best
solution). Region can be UN or an ad hoc table
A style list should be created. It might be a one-character list (with
optional sub-entries)?
Authority should be registered through a mailbox name on the mailing
list of a dedicated Multilingual Task Force. Authority may include a
year/publication number subfield when the same authority has
published several references. Authority may also be identified by its
registration number.
Extensions to these tables will be accepted to address an extended and
homogeneous vision of languages and the support of additional related tables
to form a homogeneous cultural ontology (CulturaMundi) also supporting
machine languages.
Subfields should be supported for dialect, particular forms of
scriptings (I can think of several French scriptings and phonetics), areas,
styles and vocal accents.
Other codings could be supported with a prefix. The target is to
support the largest number of declared or private ID tables.
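To illustrate how the default tables of point 4 might be used, here is a small sketch that checks the sub-tags of a tag against table excerpts. The function name, the dictionary layout and the idea of leaving style and authority unchecked are assumptions made for the example; the tables hold only a few real codes, not the full lists.

```python
# Small illustrative excerpts only; the real tables are far larger.
DEFAULT_TABLES = {
    "language": {"fr", "en"},                      # ISO 639
    "script": {"Latn", "Cyrl"},                    # ISO 15924
    "area": {"FR", "BE", "LU", "CA", "MC", "CH"},  # ISO 3166 two-letter codes
}


def unknown_subtags(tag, tables=DEFAULT_TABLES):
    """Return the (kind, code) pairs of `tag` not found in `tables`.

    `tag` is a dict such as {"language": "fr", "script": "Latn"}.
    Kinds without a table (here style and authority) are not checked.
    """
    return [(kind, code) for kind, code in tag.items()
            if kind in tables and code not in tables[kind]]
```

Passing a different `tables` argument is one way an extended or private table could be substituted for a default one; how such tables would be announced with a prefix is left open.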
5. ISO 7000 oriented ICONs
I have not been able to find today a free part of the ISO 7000 document which
would give enough information. I suppose that the best solution would be an
icon of a special shape (easily identified as a language tag: a book?) with a
color code to indicate the style, encapsulating the location flag and
marked with the language's ISO 639 code in the corresponding scripting. A
face could indicate the type of voice when the document is vocal
(identified by a loud-speaker instead of a book?)
6. Registrations and Publication
Some applications call for registration; real life lives with
descriptions. Nothing prevents a CRC (IANA or other(s), such as
manufacturer reference centers) from registering in toto or in parte its cultural
matrix, to serve as an application reference or to support special
requirements. The registration made at the address of a given tag is only
for the benefit of the application users and not a structural reference
for developers. This means that a particular registration by IANA or another CRC
at fr-Latn-FR SHOULD have no influence on the way
application software is designed.
I intend to use that tagging standard in the INTFILE reporting on the
DNS top level status. This file is to support information on a per ccTLD
basis. For example the style "D" could be for IDN tables.
7. IRI
Before presenting an informational Draft that would make the
lang/culturetags consistent with other taggings (DNS, keywords, access
engines, OPES triggering, search engines, etc.), I suppose the best place
for the core consistency of the whole set of documents under way is Martin
Duerst's IRI Draft. And all these documents refer to RFC 3066.
Martin, I have carefully read your IRI draft (10.txt ?) several times.
I am not sure I understand everything. This is certainly due to my low IQ,
but also because some definitions seem to be missing. In particular I have
not been able to understand exactly what you call a "name-reg", and
therefore to determine whether the proposed IRI format fully supports ML.ML
domain names (unicode.unicode or xn--.wn--, which some countries and private
networks may want or have already implemented). I am not sure either whether
upper/lower case differentiation can be fully supported if the name-reg
always has to support it (Punycode is supposed to be able to support
it, as discussed for the LHS support; upper case should be supported
everywhere in the IRI even when not used by the DNS).
I thank you for your comments.
jfc
More information about the Ietf-languages
mailing list