Request: Language Code "de-DE-1996"

Peter_Constable@sil.org
Fri, 26 Apr 2002 00:33:30 -0500


On 04/25/2002 09:53:07 PM Torsten Bronger wrote:

>I didn't think of de-1901-DE because I was influenced by XML and
>associated things, which is certainly not the worst to be influenced
>by.  I must object to de-1901-DE etc. for purely practical issues
>(although principally there may be very good reasons for them).
>
>The XML specification and RFC3066 suggest that the language code may
>be immediately followed by a country code.

This is true, but it does not require this order. Also, I am inclined to
think that we are going to be needing a revision to RFC 3066 before too
long to deal with some additional requirements. Even when we were
discussing drafts we considered the need to incorporate ISO 15924 script
tags, but it was decided to hold off until we understood the wider range of
needs better. I think progress is being made in that direction.


>  The very important
>"en-GB"--"en-US" thing supports this assumption.  Most implementators
>(including myself) realise that by some sort of "longest match".
>Being afraid that subtags may follow that our programs can't cope
>with, we try to match the *beginning* of the tag, i.e. we try to match
>"de-DE", if it fails "de".  So we get as much information as we can.
>
>Consequently a "de-1901-AT" would be interpreted by most applications
>as "de", even if they recognised "de-AT".  And matching an "-AT" at an
>arbitrary position is dangerous...
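
[Illustration: the "longest match" on the beginning of the tag described
above might look roughly like the following. This is only a sketch; the
function name and the set of known tags are hypothetical and not taken
from any implementation mentioned in this thread.]

```python
def longest_prefix_match(tag, known):
    """Return the longest run of leading subtags of `tag` found in `known`.

    Comparison is case-insensitive; returns None if nothing matches.
    """
    known_lower = {k.lower(): k for k in known}
    parts = tag.lower().split("-")
    # Drop trailing subtags one at a time until a known tag is found.
    for end in range(len(parts), 0, -1):
        candidate = "-".join(parts[:end])
        if candidate in known_lower:
            return known_lower[candidate]
    return None


# Hypothetical set of tags an application recognises.
KNOWN = {"de", "de-DE", "de-AT"}

print(longest_prefix_match("de-DE-1901", KNOWN))  # de-DE
print(longest_prefix_match("de-1901-AT", KNOWN))  # de
```

[With this scheme "de-1901-AT" falls back to plain "de" even though
"de-AT" is known, because "-AT" is no longer part of the initial run.]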

My concern is that, when we extend mechanisms to gain new functionality, if
we make individual decisions in terms of the limitations of existing
implementations (that are limited in number and weren't designed with that
extended functionality in mind), then we may saddle ourselves with
complications that we regret in the long term. If you read my IUC 21 paper,
you'll see that I'm suggesting that it might be possible to have
constraints on ordering of sub-tags that can facilitate parsing and
interpreting of those components. But if we don't have those constraints,
then no such benefits would be possible. This is very preliminary, and I
haven't explored how significant potential benefits in this regard might
be. It's possible that they may not be very significant, in which case it
may not matter. But on the other hand, if there were some significant
benefits that might be gained, I would hate to see that lost because we got
too eager to complete one handful of registrations before further analysis
could be done, and because we wanted to smooth out interactions with what
might soon be obsolete implementations.

If most of us feel that maintaining the fullest compatibility in this
regard with current software is the more important concern, though, then
perhaps we must do what we must do. But perhaps we can consider this
question: since the 1996 change already creates issues that require (human
or software) processes to be revised, isn't the behaviour of existing
software implementations, when faced with subtags that are probably better
in a theoretical sense, just one more symptom that needs to be addressed?

Keep in mind that *either way*, processes that match on initial substrings
are going to face problems: if we adopt "de-1901-DE", then existing
processes would fail to match for "de-DE", which we have said is needed.
But if we adopt "de-DE-1901", then processes that operate only in terms of
initial substrings will fail to match on "de-1901", which we have also said
is important (and may before long be the more important). In other words,
such processes are eventually going to fail us, either way.
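
[Illustration: a minimal sketch of the point above, assuming a process
that only tests whether a wanted tag is a leading run of whole subtags.
The function name is hypothetical.]

```python
def matches_initial_substring(tag, wanted):
    """True if `wanted` is a leading run of whole subtags of `tag`."""
    t = tag.lower().split("-")
    w = wanted.lower().split("-")
    return t[:len(w)] == w


# Try both candidate orderings against the matches we have said we need.
for tag in ("de-1901-DE", "de-DE-1901"):
    for wanted in ("de", "de-DE", "de-1901"):
        print(tag, wanted, matches_initial_substring(tag, wanted))
```

[Under this assumption each ordering matches "de" and exactly one of
"de-DE" and "de-1901", never both.]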



>Another example: Mozilla, the alpha version of Netscape (well, sort
>of), has a function to see the language of an HTML document with
>right-mouse-click -- "Properties".  "de" yields "German", "de-DE"
>yields "German (Germany)" and "de-DE-1901" yields "German (Germany,
>1901)" which I found very delighting and shows how intuitive agreement
>can work.  :-)

But where you see something good, I see a concern: it has interpreted
"de-DE-1901", which hasn't even been registered yet. What is it actually
*inferring* from that tag? (E.g. is it affecting any process behaviours
based on that?) Hopefully nothing, since there is not yet any valid
inference for it to make. (While it's obviously highly unlikely, it is in
principle possible that a new, usurping registration request could appear
tomorrow that, given adequately convincing argumentation, results in
"de-DE-1901" denoting a form of written German reflecting in spelling the
pandemic disordered speech within the community of a monastic order founded
in 1901. Again, completely unlikely, but the point is that the tag should
not be assumed to mean anything until it is registered -- it might never be
registered. If the result of this dialog were that we came to a consensus
to adopt "de-1901-DE" instead, then "de-DE-1901" would remain undefined. Or
if we discovered tomorrow that there had been a clerical error and that the
year was actually 1902, so we used that in whatever tags we registered... )
And if there is not yet any valid inference for Mozilla to be making, then
the fact that it interprets "de-DE-1901" as "German (Germany, 1901)" is
actually void of significance.


>However, "de-1901-AT" is a plain "German (1901-AT)" and thus
>misinterpreted.  I strongly think that not only this program behaves
>this way.

But I think this really isn't that significant. Eventually, it is probably
going to be necessary for software to correctly interpret tags along the
lines of "az-Cyrl-GE", or perhaps "sgn-ES-Sgwr-AD". Indeed, even now, it
would be incorrect for something like Mozilla to interpret "sgn-US" as
"Sign Language (USA)", which the pattern in your test suggests, or for
any process (using initial substrings) to interpret "sgn-BE-fr" as "Sign
Language (Belgium)".
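
[Illustration: one way a process could avoid this is to look up the whole
tag before falling back to a language-plus-country reading. The sketch
below is an assumption-laden example only: the function names are
hypothetical, and the names in the tables are illustrative placeholders,
not quoted registry text.]

```python
# Illustrative tables only; names are placeholders, not registry text.
WHOLE_TAG_NAMES = {
    "sgn-us": "American Sign Language",
    "sgn-be-fr": "Belgian-French Sign Language",
}
LANGUAGE_NAMES = {"de": "German", "sgn": "Sign Language"}
REGION_NAMES = {"de": "Germany", "at": "Austria", "us": "USA", "be": "Belgium"}


def naive_display(tag):
    """Read the tag as language plus optional country, ignoring the rest."""
    parts = tag.lower().split("-")
    name = LANGUAGE_NAMES.get(parts[0], parts[0])
    if len(parts) > 1:
        return f"{name} ({REGION_NAMES.get(parts[1], parts[1])})"
    return name


def display(tag):
    """Prefer a whole-tag entry; fall back to the naive reading otherwise."""
    key = tag.lower()
    if key in WHOLE_TAG_NAMES:
        return WHOLE_TAG_NAMES[key]
    return naive_display(tag)


print(naive_display("sgn-BE-fr"))  # Sign Language (Belgium)
print(display("sgn-BE-fr"))        # Belgian-French Sign Language
print(display("de-DE"))            # German (Germany)
```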


>To sum it up, I don't think that the advantages of de-1901-DE are big
>enough to justify not to follow common practice.

And to sum up in response, it seems to me that arguments either in terms of
existing processes being able to match "de-DE-1901" with "de-DE" or in
terms of existing processes appearing to be able to make sense of
"de-DE-1901" really don't stand since, in the first case, those same
processes will fail to make other matches that we have also identified as
important, and in the latter case because (a) the apparent success in
interpretation is presumptuous when tags are not yet registered, (b) the
claim to interpret the potential tags isn't that significant unless it
effects some appropriate, distinct behaviours, but these implementations
cannot possibly do so when the meaning of the tag (and thus the appropriate
behaviours) has not yet been registered, and (c) again these processes are
inevitably going to fail to correctly interpret other tags that will be
needed.



>Yet another thing: de-1901 and de-DE-1901 initiate different behaviour
>in my program, but this point may be cleared up already ...

But I have a hunch that, if you (and industry as a whole) are serious about
i18n / L10n / etc., then eventually you'll be revising your implementation
in ways that could allow de-1901 and de-1901-DE to do likewise.


I hope you understand I'm not meaning to pick on your proposals in
particular. They just happened to be what came along first to stretch the
envelope in relation to some of the issues I've been exploring with
long-term concerns in mind. I realise my comments and the resulting
discussion are probably resulting in some delay in registration of tags for
the distinctions you want, but hopefully that delay won't be protracted,
and hopefully it will


- Peter


---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>