RFC 3066bis: Philosophical objection (harsh)

Wed Dec 17 01:28:28 CET 2003

Hi Harald,

I'm glad to see you've carefully examined our proposal. Your message brings
up a number of points which require lengthy explanation.

A new draft (draft-phillips-langtags-02.txt) will be sent in later today or
tomorrow. This one should be substantially easier on the eyes, as it was
made using Marshall Rose's XML DTD. It includes corrections to the
non-substantive problems, such as the filename, ABNF, etc.

Let's start with whole tag vs. subtag registration. Whole tag registration
works well when there are only a very few exceptional tags expected or when
atomic tags completely cover the needs of the users. My objections to whole
tag registration, which I think justify going to subtags, are:

1. It is hard to implement the registry as it sits because each tag is a
'holistic' value in an exceptions table. The most common implementations
don't support registered values because it is very hard to maintain such a
table.

2. The pattern of registrations follows a subtag structure which could
benefit from the generative mechanism, given suitable guidelines (which we
have sought to provide).

3. Whole tag registration requires many tags to be registered when the
'subtag' being added must cover a wide range of situations. Consider German
orthographic variation, which has just two subtags and eight registrations.
If we were to build out the Chinese zh-hans/zh-hant set of registrations,
we would need more like 18 registrations. Then there was Prof.
Steenwijk's set of subtags for Resian. Here is one small language that might
have 20 or 30 registrations (and only five subtags). If he can document
these (and I suspect he can), then I don't see how we avoid a huge flock of
exceptions like this. The registration regime practically requires that only
a small number of tags be registered, leaving certain obvious tags
"illegal".

4. "Silly subtag generation" should not be an issue. It has always been
possible to create 'silly' tags, or at least tags with dubious meaning, with
the generative mechanism: 'es-AQ', 'sv-CO', et cetera. The description of
the registry in the draft is designed to capture the meaningful uses that a
subtag can be put to, without limiting the subtag's use in the generative
mechanism. Implementations might limit registered subtags to their
informative uses.
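
To make the arithmetic in point 3 concrete, here is a rough sketch (not part
of the draft) of how two variant subtags turn into eight whole-tag
registrations for German. The region list and the helper name whole_tags are
assumptions chosen for illustration only.

```python
from itertools import product

# Assumed for illustration: two orthography variants, and a region that is
# either absent or one of three countries.
variants = ["1901", "1996"]
regions = [None, "AT", "CH", "DE"]


def whole_tags(language, regions, variants):
    """Enumerate every complete tag that whole-tag registration would need."""
    tags = []
    for region, variant in product(regions, variants):
        parts = [language] + ([region] if region else []) + [variant]
        tags.append("-".join(parts))
    return tags


german = whole_tags("de", regions, variants)
print(len(variants), "subtags ->", len(german), "whole tags")
print(german)
```

Under subtag registration only the two variant values would need entries; the
combinations with regions come from the generative mechanism.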

The draft does limit registered subtags in a significant way: you can't
register a script or region code, only a variant or base language code (and
it discourages base language codes). This effectively limits where and how
registered subtags can appear in a tag and prevents random sequences from
being generated. Users must still choose appropriate tags, but then they
must do so even today.

The results are very easy to parse/match/process, even without a current
table of registered values. This should free implementations to provide
better support for the registered values, while simplifying the number and
type of registered values that must be handled.
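
As a minimal sketch of what parsing without a registry table could look like,
the function below labels each subtag from its shape alone. The shape rules
are assumptions for illustration, not the draft's ABNF: four letters for a
script, two letters for a region, five or more characters starting with a
letter for a variant (loosely following the variant rule mentioned under
"Year" later in this message), and everything after "x" treated as private
use. The name classify_subtags and the variant "abcde" are made up.

```python
def classify_subtags(tag):
    """Label each subtag by shape alone, with no registry lookup."""
    subtags = tag.split("-")
    labels = [("language", subtags[0])]
    private = False
    for sub in subtags[1:]:
        if private:
            # Everything after the "x" separator is private use.
            labels.append(("private", sub))
        elif sub.lower() == "x":
            private = True
        elif len(sub) == 4 and sub.isalpha():
            labels.append(("script", sub))
        elif len(sub) == 2 and sub.isalpha():
            labels.append(("region", sub))
        elif len(sub) >= 5 and sub[0].isalpha():
            labels.append(("variant", sub))
        else:
            labels.append(("unknown", sub))
    return labels


print(classify_subtags("en-Latn-US"))
print(classify_subtags("de-CH-abcde-x-local"))
```

The sketch does not check subtag order or validity; it only shows that the
role of each subtag can be recovered without consulting a list of registered
values.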

So I feel that going to subtags is actually a minor change (a policy change
on the use of registered values, which are subtags in structure, if not in
name) that provides for a simpler, more powerful way to use the registry to
everyone's benefit.

---

With regard to your other comments:

Matching and Script. The text is careful not to match en-Latn-US to en-US.
I'm not sure that's a good thing. If there is a valid use for script tags
beyond the very narrow group of current registrations then the script codes
must be put into the infrastructure. Mark and I preserved the strict
right-to-left matching of RFC3066 and kept matching compatibility over
semantic compatibility. This has some consequences, such as en-Latn-US and
en-US not matching. At the same time, we have added guidance to the matching
rules, which basically says: "Use the most exact tag that you can, but no
more exact than is strictly necessary", which effectively says "use en-US,
not en-Latn-US". More guidance here might be provided...
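
The consequence described above can be seen in a small sketch of RFC 3066
style matching, where a language range matches a tag if the two are equal or
if the range is a prefix of the tag ending at a "-" boundary. The function
name range_matches is an assumption; this is an illustration, not text from
the draft.

```python
def range_matches(language_range, tag):
    """RFC 3066-style match: exact, or a prefix that ends at a "-" boundary."""
    rng = language_range.lower()
    candidate = tag.lower()
    if rng == "*":
        # The special range "*" matches any tag.
        return True
    return candidate == rng or candidate.startswith(rng + "-")


for rng, tag in [("en", "en-US"), ("en", "en-Latn-US"),
                 ("en-US", "en-US"), ("en-US", "en-Latn-US")]:
    print(f"{rng!r} vs {tag!r}: {range_matches(rng, tag)}")
```

Only the last pair fails to match: a recipient asking for en-US does not get
content tagged en-Latn-US, which is why the guidance favors the shorter tag.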

Year. The productive and/or non-productive use of years was experimental and
was based on the German example, plus past proposals to register other
values. We have removed this feature from draft-02 altogether. Note that
this effectively prohibits the use of year subtags with the '####' pattern
(since a registered variant must be five characters long and start with an
alpha value).

Key-Value Pairs. With regard to key-value pairs, the separator characters
like equals were chosen for symmetry with various other protocols. We were
aware of the potential collision of equals with that character's use in
Accept-Language, and based on your and others' objections, draft-02 replaces
EQUAL SIGN with FULL STOP (dot).
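
A short sketch of the collision, assuming a consumer that splits a
"name=value" attribute naively on the equals sign. The parser
parse_attribute is hypothetical, and the dotted tag is only a guess at how
the same extension might be rendered with FULL STOP; the exact draft-02
grammar is not reproduced here.

```python
def parse_attribute(text):
    """Naive 'name=value' parser of the kind a lang=... consumer might use."""
    # Unpacking raises ValueError unless there is exactly one "=".
    name, value = text.split("=")
    return name, value


print(parse_attribute("lang=en"))

# Tag taken from the quoted objection: the extension's own "=" breaks the split.
try:
    parse_attribute("lang=en-Latn-US-x-undefined=even%20more%20undefined")
except ValueError as err:
    print("parse failed:", err)

# Hypothetical rendering of the same extension with "." as the separator.
print(parse_attribute("lang=en-Latn-US-x-undefined.even%20more%20undefined"))
```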

Extensions in general. We have contemplated adding rules to make extensions
default ignorable, but that seems overly limiting, at least for a first
pass. The extension mechanism we propose provides a way to pass
language-related metadata in a more structured manner, and even in a
combinatorial manner (using two extension regimes together). Yes, this is
more complex than the current system and we could just stick with "value"
subtags for extensions. But we felt that key/value provided a powerful
mechanism that could address some of the additional needs of specialized
communities without disturbing the base tags at all.

Undefined Extensions. I envision that external groups with interest in using
the extension mechanism will define the keys and values. It just didn't seem
to make sense to me to saddle IANA with registering those values. A separate
registry for extensions or extension namespaces could be created. I suppose
we could add one...

In particular, if we add the -x- separator, then users could presumably
create private use variants after that separator with whatever value they
desired. It seemed to me like a good idea to provide for some form of
structure in the extensions and that 'keys' might at least define some form
of namespace and reduce the likelihood of collision (as well as following
good practice in labelling data).

I look forward to submitting draft-02 and to your comments on that version.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: ietf-languages-bounces at alvestrand.no
> [mailto:ietf-languages-bounces at alvestrand.no]On Behalf Of Harald Tveit
> Alvestrand
> Sent: Monday, 8 December 2003 22:21
> To: ietf-languages at alvestrand.no
> Subject: RFC 3066bis: Philosophical objection (harsh)
>
>
> Summary:
>
> I do NOT agree with using a liberal generative syntax for generating
> language tags. I believe we should stick with whole-tag registration, and
> stick to simple rules and guidelines for them, aimed at having, as far as
> possible, only useful tags.
>
> Details:
>
> I think the use of language tags where the sender is free to choose from
> multiple rules and generate subtags at will is harmful to interoperability
> and harmful to the end-user.
>
> I believe that the job of the language tags is to register all variants
> for which there is a known need for making the distinction between the
> various forms in the form of a language tag, and where there is a real
> reason why more powerful means of expressing the user's preferences or the
> properties of data are not appropriate.
> Therefore, a system with fewer language tags is better than one with more
> language tags.
>
> I think, in particular, that:
>
> - productive use of script codes hurts the current use of language tags,
> creates potential for harmful confusion for the users, and is therefore a
> Bad Idea.
> Requiring recipients to match en-Latn-US to en-US is wrong.
>
> - the productive use of years is a dangerous source of confusion, and that
> year markings without an IANA registration to point out what they are
> supposed to mean is making things easy for a sender at the expense of the
> recipient - something that is not a reasonable tradeoff.
> Requiring recipients to know whether de-1900 and de-1905 can be considered
> equal or not, with no further publicly available information, is wrong.
>
> - the use of unregistered, undefined name-value pairs in the extension
> subtag is a dangerously complex and noninteroperable solution to a still
> unidentified problem, and further harms interoperability with systems that
> depend on the non-occurrence of the = character inside language tags.
> Requiring users who have written code to parse "lang=en" to also parse
> "lang=en-Latn-US-x-undefined=even%20more%20undefined" is wrong.
>
> Having thus harshly denounced 90% of the ideas in this document, I'll end
> this note with a saving grace:
>
> I think the idea of the -x- subtag for separating a registered tag from
> unregistered variants makes sense under the current rules for Accept-Range,
> and should be adopted.
>
>                      Harald