[Fwd]: Response to Mark's message

Thu Apr 10 13:26:56 CEST 2003

> >
> > Another means that you actually want to e.g. have exactly the same
> > date format on different platforms.
>
> This is a good example of why this is hopeless. A Java locale contains no
> fewer than 24 possible built-into-the-system patterns (in the DateFormat
> class) per locale. C# is somewhat richer (in part due to the fact that it
> supports non-Gregorian calendars and associates a collection of possible
> calendars with each CultureInfo). A language tag cannot possibly tell you
> which of these you will get back from the system (that's an implementation
> decision). It merely implies that the date format you see when you specify
> 'fr-FR' should be recognizably "French" (and maybe it knows about national
> holidays and the like).

I don't know how locales work with C# or Java. In C++ locale objects contain
"facet" objects that separately decide on such matters as date
serialisation, language, charset, and so on.

Two things from that are applicable to this discussion.

The first is that it allows information about scripts to be completely
orthogonal to information about languages. It's easy to create unusual
combinations (English in Cyrillic etc.). Unusual combinations aren't
actually that unusual: if I were to write some Russian, Hebrew or Japanese
inline with English text, I would generally transliterate it into Latin
script (especially since I don't actually know any of those languages).

The second is that this orthogonal quality doesn't preclude "educated
guesses". It's perfectly reasonable IMHO to assume Latin script for en-GB
*as long as you remember that you are making an assumption*.
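As a rough illustration of keeping script separate from language while still
allowing a flagged guess, here is a minimal Python sketch. The TextInfo type,
the with_guessed_script function and the DEFAULT_SCRIPT table are all
hypothetical names invented for this example; the table is not taken from any
registry or standard.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical defaults for illustration only, not data from any registry.
DEFAULT_SCRIPT = {"en": "Latn", "ru": "Cyrl", "he": "Hebr"}


@dataclass(frozen=True)
class TextInfo:
    language: str                  # e.g. "en-GB"
    script: Optional[str] = None   # e.g. "Latn", or None if not stated
    script_is_guess: bool = False  # True only when the script was assumed


def with_guessed_script(info: TextInfo) -> TextInfo:
    """Fill in a missing script from the language, marking it as a guess."""
    if info.script is not None:
        return info  # explicitly stated script always wins
    primary = info.language.split("-")[0].lower()
    guess = DEFAULT_SCRIPT.get(primary)
    if guess is None:
        return info  # no basis for a guess, so leave the script unknown
    return TextInfo(info.language, guess, script_is_guess=True)
```

With this, TextInfo("en", "Cyrl") passes through untouched, while
TextInfo("en-GB") comes back with script "Latn" and script_is_guess set, so
later code can still tell an assumption from a stated fact.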

And back to RFC 3066.

Currently the only methods for deducing script are heuristics (look at the
characters used and deduce that the script is whichever one uses those
characters) and guessing from the language, as in the second point above.
While we all agree that this is not ideal, we have to recognise that
software doing so will continue to exist for some time after a better
solution is available.
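The character-based heuristic can be sketched in Python as follows. This is
only an approximation under a stated assumption: it takes the first word of
each letter's Unicode character name (LATIN, CYRILLIC, HEBREW, ...) as a
stand-in for the script, which is not the same thing as the Unicode Script
property. The guess_script name is hypothetical.

```python
import unicodedata
from collections import Counter


def guess_script(text):
    """Return the most common script-like name prefix among the letters."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # digits, punctuation and spaces say nothing about script
        try:
            name = unicodedata.name(ch)
        except ValueError:
            continue  # character has no name in this Unicode database
        # Assumption: the first word of the name approximates the script.
        counts[name.split(" ")[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Note that this says nothing about language: Russian transliterated as
"privet" is reported as LATIN, which is exactly the kind of combination
discussed above.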

Further, a solution that places script codes into language codes has some
strangeness. The hierarchy behind tags is imperfect, as has been noted (I
think we all agree it's imperfect, though we disagree sharply about how
imperfect), but many of us feel it is of some value.

The use of a script subtag makes this a multiple-inheritance hierarchy.
en-Latn-IE can be considered a "child" of en-IE, en-Latn, en or Latn. Like
someone porting C++ to Java, the tag en-Latn-IE squashes this multiple
inheritance flat. In particular, the connection between en-Latn-IE and en-IE
is no longer as clear as it was before.
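To make the flattening concrete, here is a small Python sketch comparing two
ways of finding the "parents" of a tag. It assumes matching is done by
dropping subtags from the right (prefix truncation), which is how RFC 3066
language-range matching behaves; the function names truncation_chain and
drop_one_subtag are hypothetical.

```python
def truncation_chain(tag):
    """Parents reachable by repeatedly dropping the last subtag."""
    parts = tag.split("-")
    return ["-".join(parts[:n]) for n in range(len(parts) - 1, 0, -1)]


def drop_one_subtag(tag):
    """Parents reachable by dropping any single non-primary subtag."""
    parts = tag.split("-")
    return ["-".join(parts[:i] + parts[i + 1:]) for i in range(1, len(parts))]


print(truncation_chain("en-Latn-IE"))  # ['en-Latn', 'en']
print(drop_one_subtag("en-Latn-IE"))   # ['en-IE', 'en-Latn']
```

Prefix truncation never produces en-IE, so software that only walks the
prefix chain will not see en-Latn-IE as related to en-IE at all.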

Moreover, while spoken language has been treated as some kind of bogey of
late, we do need to handle it correctly. The connection between
en-Latn-IE and spoken en-IE is a lot stronger IMHO than that between
en-Latn-IE and en-Latn-US.

Whatever way I look at this, I cannot be satisfied by anything that
attempts to push script information into language tags. The connection
between script and languages just isn't all that clear in a lot of cases.
The importance of script in an application doesn't necessarily correspond to
the importance of language in the same app.

I urge that script information be encoded and transmitted separately to
language information. I object to the registrations suggested in
<http://oss.software.ibm.com/icu/dropbox/language_code_mess.html>.