draft-phillips-langtags-04 /2.4.2 Matching language tags

Thu Jul 1 20:49:06 CEST 2004

Peter Constable wrote:
> > until the match value is truncated to zh. Truncated(zh-TW) matches both
> > zh-hans-CN and zh-hant-TW, and so perhaps returns a less-than-optimum
> > document.
> 
> This general problem is one I pointed out when we were discussing the de-1996
> stuff. It is a more serious concern in the example you've just raised.
> The Chinese case may not be typical, though: there have been strong
> correlations between countries and script variants for Chinese, but I'm
> not sure there are other such cases. On the other hand, your suggestion
> (treat missing in-between subtags as "don't care") has some sense to it.

Well, the optimization of guessing script from region is less important than
the fact that asserting a script in the language range breaks matching on
language AND region when the labels don't specify a script
(or, conversely, when the documents' labels specify script subtags and the
range doesn't).
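
To make the breakage concrete, here is a minimal sketch. It assumes plain
RFC 3066-style prefix matching (the range must equal the tag or be a prefix
of it ending at a subtag boundary); the function name prefix_match is made up
for illustration and is not taken from the draft.

```python
def prefix_match(language_range: str, tag: str) -> bool:
    """True if the range equals the tag or is a prefix of it at a '-' boundary."""
    r = language_range.lower()
    t = tag.lower()
    return t == r or t.startswith(r + "-")


# Range asserts a script, label does not: language and region agree, no match.
print(prefix_match("zh-Hant-TW", "zh-TW"))   # False
# Label asserts a script, range does not: still no match.
print(prefix_match("zh-TW", "zh-Hant-TW"))   # False
# Only when both sides agree on which subtags are present does it work.
print(prefix_match("zh-TW", "zh-TW"))        # True
```

In both failing cases the language and the region are identical; only the
presence or absence of the script subtag differs.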

> 
> > Using the boont example, a search for en-boont would match
> > en-Latn-US-boont, en-Latn-boont, en-US-boont, and en-boont.
> 
> Are all four of those considered valid?

Syntactically? If "boont" is to be registered as a generative subtag, then yes.
(If I understand the proposal correctly.)

Valid according to 2.2.3? I believe so.

All of these examples except en-Latn-boont are used in the document.

> Well, note that the way the language-range works is that you don't
> truncate the match value; you only truncate the tags in the repository
> metadata: generic results don't conform to a specific request; you try
> to return something as specific or possibly more specific. Of course,
> after you fail to return something that conforms to the request, then
> you may start looking for least-offending options, in which case
> truncation of the match value may certainly apply.

Hmmm. I guess 2.4.2 on matching should clarify the algorithm then.
At first I thought it worked as you describe, since the spec says the "tag" is
truncated, not the "range",
but it seemed underspecified.

But bullet "1." says that a match for the whole tag is sought first, and that
the tag is then repeatedly truncated and searched for again.
That reads as if it is the range that is truncated.
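
A minimal sketch of that reading, in which the range (not the labels) is
shortened one subtag at a time until something matches. This is only my
interpretation of bullet 1, not the draft's wording; the function names and
the use of prefix matching on each pass are assumptions.

```python
def truncations(language_range):
    """Yield the range, then the range with trailing subtags removed one by one."""
    subtags = language_range.split("-")
    for n in range(len(subtags), 0, -1):
        yield "-".join(subtags[:n])


def lookup_by_truncating_range(language_range, labels):
    """Return (range_used, matching_labels) for the first pass that finds a match."""
    for candidate in truncations(language_range):
        c = candidate.lower()
        hits = [lab for lab in labels
                if lab.lower() == c or lab.lower().startswith(c + "-")]
        if hits:
            return candidate, hits
    return None, []


labels = ["zh-hans-CN", "zh-hant-TW"]
print(lookup_by_truncating_range("zh-TW", labels))
# ('zh', ['zh-hans-CN', 'zh-hant-TW'])
```

With these labels, zh-TW finds nothing on the first pass and is truncated to
zh, which then matches both documents, including the less suitable
zh-hans-CN.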

If as you say, the documents' labels are to be truncated, and different labels
have specified different numbers of subtags, how much and exactly what is
truncated on each pass?
Do I truncate one subtag on each document? That seems wrong.
Do I truncate the same type of subtag on each pass? That seems better, and can
perhaps be optimized so that documents lacking the subtag type being truncated
need not be looked at again during that pass.
But the algorithm should be clarified.

E.g., searching for xx-yy, with documents labeled:
1-2-3-4
5-6-7
8-9
10
If there are no matches to xx-yy, do I strip off 4, 7, 9 and 10 and compare?
Do I strip off 4, compare, (fail), then strip off 3 and 7 and compare, etc.?
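
The two schedules can be written out as below. This is only a sketch of the
two readings asked about above; neither is claimed to be what the draft
intends, and the function names are invented for illustration. Each generator
yields the set of truncated labels that would be compared on each pass.

```python
def strip_last_from_all(labels):
    """Reading A: every label loses its last subtag on every pass."""
    current = [label.split("-") for label in labels]
    while True:
        current = [subtags[:-1] for subtags in current]
        remaining = ["-".join(s) for s in current if s]
        if not remaining:
            return
        yield remaining


def strip_longest_first(labels):
    """Reading B: only labels longer than the current length are shortened."""
    current = [label.split("-") for label in labels]
    longest = max((len(s) for s in current), default=0)
    for length in range(longest - 1, 0, -1):
        current = [s[:length] for s in current]
        yield ["-".join(s) for s in current]


labels = ["1-2-3-4", "5-6-7", "8-9", "10"]
for number, tags in enumerate(strip_last_from_all(labels), start=1):
    print("A pass", number, tags)
for number, tags in enumerate(strip_longest_first(labels), start=1):
    print("B pass", number, tags)
```

Under reading A the single-subtag label 10 disappears after the first pass;
under reading B it stays in play on every pass, and shorter labels are only
touched once the longer ones have been cut down to their length.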

All in all, I think I prefer to return the document(s) with the greatest number
of matches by subtag type, even if there are types positioned in-between which
are not specified.
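
A rough sketch of what I mean, under loud assumptions: subtag types are
guessed from length alone (4 letters = script, 2 letters or 3 digits =
region, anything else = variant), which is a simplification and not the
draft's grammar; the primary language must match; and a type missing on
either side is treated as "don't care". Whether a conflicting subtag should
disqualify a label is left open here; the sketch simply does not count it.
All names are hypothetical.

```python
def classify(tag):
    """Map a tag to {type: subtag}, guessing each type from subtag length."""
    subtags = tag.lower().split("-")
    parts = {"language": subtags[0]}
    for s in subtags[1:]:
        if len(s) == 4 and s.isalpha():
            parts.setdefault("script", s)
        elif (len(s) == 2 and s.isalpha()) or (len(s) == 3 and s.isdigit()):
            parts.setdefault("region", s)
        else:
            parts.setdefault("variant", s)
    return parts


def score(language_range, label):
    """Count subtag types on which range and label agree; -1 if languages differ."""
    wanted = classify(language_range)
    have = classify(label)
    if wanted["language"] != have["language"]:
        return -1
    return sum(1 for kind, value in wanted.items() if have.get(kind) == value)


def best_matches(language_range, labels):
    """Return the labels sharing the highest non-negative score."""
    scored = [(score(language_range, lab), lab) for lab in labels]
    top = max((s for s, _ in scored), default=-1)
    if top < 0:
        return []
    return [lab for s, lab in scored if s == top]


print(best_matches("zh-TW", ["zh-hans-CN", "zh-hant-TW"]))
# ['zh-hant-TW']
print(best_matches("en-boont",
                   ["en-Latn-US-boont", "en-Latn-boont", "en-US-boont",
                    "en-boont", "en-US"]))
# ['en-Latn-US-boont', 'en-Latn-boont', 'en-US-boont', 'en-boont']
```

Here zh-TW prefers zh-hant-TW over zh-hans-CN even though the script subtag
sits in between, and en-boont matches all four boont labels from the earlier
example.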

hth
tex

> 
> Peter Constable

-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex at XenCraft.com
Xen Master                          http://www.i18nGuy.com

XenCraft		            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------