draft-phillips-langtags-08 script subtags and matching

Addison Phillips [wM] aphillips at webmethods.com
Sun Jan 2 06:53:39 CET 2005


Tex,

I see and recognize your concerns, but we did have this debate and have
considered the arguments on both sides. Below I try to reconstruct some of
this (given that I have two little boys bouncing all over the house and
can't quite concentrate).

Addison
>
> Although the format of the tags is consistent, and the matching rules
> unchanged, the behavior that users will see is indeed different.

Yes: we have introduced subtags that represent some additional facets of
language tagging. It would be surprising if users didn't experience some
benefits from it :-)
>
> As a user that specifies tags for languages that are acceptable, I do not
> know how you tag your contents, and in particular whether you tag using
> 3066 or 3066bis. A page may be tagged as any of sr, sr-CS, sr-Latn, or
> sr-Latn-CS.

Or sr-YU or sr-.... I recognize that content tagging and negotiation
require some additional work with this (or any) design.
>
> Under RFC 3066, if I specify sr-CS I will have returned an sr-CS page if
> it exists.
> Under RFC 3066bis, if I continue to specify sr-CS, not knowing that the
> server has begun using sr-Latn-CS, I will not have that page returned.

No, Tex. If content is tagged "sr-CS" you'll get the "sr-CS" content if you
request it. But I take it you mean that if you specify "sr-CS" you won't get
any "sr-Latn-CS" and this is also true.
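
To make the matching behaviour concrete, here is a minimal sketch in Python
of the RFC 3066 prefix-matching rule: a language-range matches a tag if it
equals the tag, or if it is a prefix of the tag and the next character is
"-". The function name range_matches is an assumption for illustration; it
is not taken from the draft or from any particular implementation.

```python
def range_matches(language_range: str, tag: str) -> bool:
    """Return True if language_range matches tag under RFC 3066 rules.

    Comparison is case-insensitive. The special range "*" matches any tag.
    """
    r = language_range.lower()
    t = tag.lower()
    if r == "*":
        return True
    # Exact match, or a prefix match that ends on a subtag boundary.
    return t == r or t.startswith(r + "-")


# A request for "sr-CS" still finds content tagged "sr-CS" ...
print(range_matches("sr-CS", "sr-CS"))        # True
# ... but not content tagged with the script subtag in the middle.
print(range_matches("sr-CS", "sr-Latn-CS"))   # False
# With the script subtag last, the old range would still match.
print(range_matches("sr-CS", "sr-CS-Latn"))   # True
# A bare "sr" range matches all of them.
print(range_matches("sr", "sr-Latn-CS"))      # True
```

The rule itself is the same in both cases; only the position of the script
subtag decides which existing ranges keep matching.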

The design is a tradeoff. In the threads on this list that led to putting
the script subtag between primary language and region, the consensus was
that script was more closely related to the primary language than region is.
An inspection of the six or seven languages known to have multiple scripts
in common usage (and for which script would be an important component) tends
to support this conclusion. Creating tags such as "zh-CN-Hans" makes less
sense than "zh-Hans-CN" in the medium-to-long term.

The problem with the script-last design is that one generally will want to
request contents in a particular script more often than for a particular
country or region. Since we're using Serbian as an example, is there some
reason to believe that this language's use as a minority language in nearby
Balkan countries (Bosnia, Croatia) or regions will not result in content
tagged for these areas? If I search for Serbian content, am I more likely to
care about the country of origin or the script in that case? The fallback
and matching mechanism favors the draft's solution to that problem.
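
As a rough illustration of that point, the sketch below applies the RFC 3066
prefix-matching rule to two hypothetical sets of content tags, one with the
script subtag in the middle and one with it last. The tag lists (including
the use of BA for Bosnia) and the function name select are assumptions made
for illustration only.

```python
from typing import List


def select(language_range: str, tags: List[str]) -> List[str]:
    """Return the tags matched by language_range (RFC 3066 prefix matching)."""
    r = language_range.lower()
    return [t for t in tags
            if t.lower() == r or t.lower().startswith(r + "-")]


# Hypothetical content tags: the same four pages under each ordering.
script_middle = ["sr-Latn-CS", "sr-Latn-BA", "sr-Cyrl-CS", "sr-Cyrl-BA"]
script_last = ["sr-CS-Latn", "sr-BA-Latn", "sr-CS-Cyrl", "sr-BA-Cyrl"]

# Script in the middle: one range selects all Latin-script Serbian content.
print(select("sr-Latn", script_middle))  # ['sr-Latn-CS', 'sr-Latn-BA']

# Script last: no single range selects by script ...
print(select("sr-Latn", script_last))    # []

# ... but a single range does select by region.
print(select("sr-CS", script_last))      # ['sr-CS-Latn', 'sr-CS-Cyrl']
```

Whichever subtag comes first after the primary language is the one a short
range can select on, which is why the ordering matters for matching.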

There will be a transition period and this period may create some confusion
because of the selection mismatch. We can have a valid discussion about
where to put the script subtag in the language tag, but not, I think, about
the validity of including it. And the argument that it is incompatible to
put the script subtag in the middle is incorrect. It *is* a consideration,
but not one that breaks existing implementations.

Thus your "incompatibility" (in what the user might wish to request) is not
the same as Bruce's "incompatibility" (in which implementations will fail).

>
> Because of this, taggers will be discouraged from using the 3066bis
> scripted tags. Their users will not get correct pages until a) most
> browsers support the new format and b) users specify scripts in tags.

And (c) content uses the tags, assuming that this is the main use case of
3066bis language tags.
>
> Given the length of time this will take to propagate thru the industry,
> there is a strong disincentive to using rfc 3066bis scripted tags.

Eighteen months after IE 4.0 was introduced, it was the majority browser.
Propagation can be rapid too. Scripted tags will have a strong *incentive*
for use in that they solve an existing, knotty problem. Wanna see it? Go to
IE, bring up "Internet Options" on the "Tools" menu. Click the "Languages"
button. Click "Add" and scroll down to where you find "Serbian (Cyrillic)"
and "Serbian (Latin)" *already present* and using the same tag for
both: "sr".

There's 85% of your installed base, Tex. It's not incompatible and is crying
out for a script subtag.
>
> If upward compatibility is a goal (and it should be) then script subtags
> should come after the country subtag.
> i.e. sr-CS-Latn.

Only if it is the right choice (and, again, depending on the definition of
"upward compatibility").
>
> Existing users declaring they can accept sr-CS will continue to get the
> same pages, even as the pages are upgraded to the new format. This is
> because sr-CS matches the first two subtags of sr-CS-Latn.

Yes, that would be correct.
>
> We would then have compatibility with the tags that can be generated under
> 3066bis, but not for the few tags already registered with script as a
> secondary tag.

That's not true, though. The currently registered values are indeterminate
about where the script goes in relation to a region tag (that is, we have
zh-Hans but not zh-Hans-CN or zh-CN-Hans currently).

> Since all tags in the registry are already treated as a special case by
> virtue of their being in the registry, I don't see that as a problem.

It isn't a problem no matter what is decided.
>
> Given the prevalence of language-country format tags, it does not make
> sense to insert script in the middle.

But it does make sense, for the reason you then proceed to give:

> I do understand that script is more closely related to language and
> therefore having country in the middle seems to be an incorrect
> prioritization, but given the legacy, the script subtag should be appended
> and the esthetics abandoned.

Aesthetics has little to do with it. draft-langtags is for the long term and
has implications for applications other than Accept-Language on Web pages
(notwithstanding the fact that alternate matching is also possible). With
the exception of Chinese, the languages affected follow the example I gave
above of Serbian: region subtags generally haven't been applied to them in
the browser (other applications, of course, also exist).
>
> tex
>
> "Addison Phillips [wM]" wrote:
> >
> > Bruce wrote:
> > ---
> > No, you seem to have missed the point; there exist RFC 3066
> > implementations. Such implementations, using the RFC 3066 rules,
> > could match something like "sr-CS-Latn" to "sr-CS", but could
> > not match "sr-Latn-CS" to "sr-CS".  By changing the definition of
> > the interpretation of the second subtag, the proposed draft fails
> > to be compatible with existing deployed implementations (which is
> > what is meant by "backwards compatibility", which is a prime
> > consideration for Internet protocols).
> > ---
> >
> > No, your argument is flawed and wrong.
> >
> > The draft does not change the "interpretation of the second subtag".
> > The second subtag was never defined to be simply region subtags--although
> > they sometimes are.
> >
> > I quote the definition from RFC 3066:
> > ---
> >    The following rules apply to the second subtag:
> >
> >    - All 2-letter subtags are interpreted as ISO 3166 alpha-2 country
> >      codes from [ISO 3166], or subsequently assigned by the ISO 3166
> >      maintenance agency or governing standardization bodies, denoting
> >      the area to which this language variant relates.
> >
> >    - Tags with second subtags of 3 to 8 letters may be registered with
> >      IANA, according to the rules in chapter 5 of this document.
> >
> >    - Tags with 1-letter second subtags may not be assigned except after
> >      revision of this standard.
> >
> >    There are no rules apart from the syntactic ones for the third and
> >    subsequent subtags.
> > ---
> >
> > The second subtag *could* be anything, but tags created under the
> > generative mechanism defined two letter subtags following the primary
> > language subtag to be region subtags based on ISO 3166. This doesn't
> > change with the draft: two-letter subtags are still region tags from
> > ISO 3166. We merely define four letter subtags to be the script subtag
> > also and prescribe an order that the subtags must follow. This doesn't
> > break ANY existing implementations, because while it is the case that
> > "sr-Latn-CS" is not matched to "sr-CS" in existing implementations,
> > neither is it matched by those based on the draft.
> >
> > The draft does define some new sources and an order for subtags that
> > existing implementations will not recognize, but this hardly breaks
> > anything. Matching hasn't changed, so existing implementations won't be
> > hurt by the insertion of script subtags between the two subtags (unless
> > the matching was not compliant with RFC 3066 in the first place).
> >
> > Regards,
> >
> > Addison
> >
> > Addison P. Phillips
> > Director, Globalization Architecture
> > http://www.webMethods.com
> >
> > Chair, W3C Internationalization Working Group
> > http://www.w3.org/International
> >
> > Internationalization is an architecture.
> > It is not a feature.
> >
> > _______________________________________________
> > Ietf-languages mailing list
> > Ietf-languages at alvestrand.no
> > http://www.alvestrand.no/mailman/listinfo/ietf-languages
>
> --
> -------------------------------------------------------------
> Tex Texin   cell: +1 781 789 1898   mailto:Tex at XenCraft.com
> Xen Master                          http://www.i18nGuy.com
>
> XenCraft		            http://www.XenCraft.com
> Making e-Business Work Around the World
> -------------------------------------------------------------


