draft-phillips-langtags-04 /2.4.2 Matching language tags

Tex Texin tex at xencraft.com
Thu Jul 1 11:40:11 CEST 2004


With respect to the default pattern for fallback matching, I am a little
surprised that some accommodation for script subtags has not been made.

If I understand correctly, a language range to match is specified.
It is compared with the language tags that label the available documents
(or resources, or whatever).
If no match is found, the match value is truncated and the comparison is
repeated.
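
To make the discussion concrete, here is a minimal Python sketch of the
fallback as described above. It is only an illustration, not text from the
draft: the function name fallback_match is hypothetical, and it assumes
RFC 3066-style prefix matching (a tag matches if it equals the range or
begins with the range followed by "-").

```python
def fallback_match(language_range, available_tags):
    """Return the tags matching the range, truncating the range on failure.

    Assumption: prefix matching in the style of RFC 3066, case-insensitive.
    """
    subtags = language_range.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        matches = [
            tag for tag in available_tags
            if tag.lower() == candidate or tag.lower().startswith(candidate + "-")
        ]
        if matches:
            return matches
        # No match at this length: drop the last subtag and try again.
        subtags.pop()
    return []


# Nothing matches until the range has been truncated to "zh".
print(fallback_match("zh-TW", ["zh-hans-CN", "zh-hant-TW"]))
# ['zh-hans-CN', 'zh-hant-TW']
```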

1) If the match value is zh-TW, and all of the documents are labeled with
tags that describe the script used, such as zh-hant-TW, I will not get a match
until the match value is truncated to zh. It then matches both zh-hans-CN and
zh-hant-TW, and so may return a less than optimal document. I think it is
desirable that if the language and region subtags match, and the script is
unspecified in the match value, then matches with any script should be allowed.

To state this as a rule: Any subtag types that are not specified in a match
value, and are positioned between specified subtag types, should be considered
"don't cares" or wildcards.

So zh-TW would be considered as zh-*-TW (where I use * as a script wildcard
indicator).

Documents labeled zh-hant-TW or zh-TW would match a language range of zh-TW.
Documents labeled zh-hant, zh, zh-CN, or zh-hans-CN would be considered a match
only after it is determined that there are no documents with region subtag=TW.

Using the boont example, a search for en-boont would match en-Latn-US-boont,
en-Latn-boont, en-US-boont, and en-boont.
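
The wildcard rule could be sketched in Python roughly as follows. This is an
illustration under stated assumptions, not a proposal for the exact text: the
helpers parse and wildcard_match are hypothetical, subtag types are guessed
from length alone (4 letters = script, 2 letters or 3 digits = region,
anything else = variant), and extended language subtags, extensions, and
private use are ignored. Subtag types left unspecified after the last
specified one are also treated as "don't cares", as prefix matching would.

```python
def parse(tag):
    """Split a tag into typed subtags (simplified, hypothetical helper)."""
    parts = tag.lower().split("-")
    fields = {"language": parts[0], "script": None, "region": None, "variant": None}
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha() and fields["script"] is None:
            fields["script"] = part
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            fields["region"] = part
        else:
            fields["variant"] = part
    return fields


def wildcard_match(language_range, tag):
    """True if every subtag type specified in the range agrees with the tag.

    Subtag types the range leaves unspecified are wildcards.
    """
    wanted = parse(language_range)
    have = parse(tag)
    return all(have[kind] == value
               for kind, value in wanted.items() if value is not None)


print(wildcard_match("zh-TW", "zh-hant-TW"))           # True
print(wildcard_match("zh-TW", "zh-hans-CN"))           # False
print(wildcard_match("en-boont", "en-Latn-US-boont"))  # True
```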

2) The other side of the script issue is when the script is specified in the
match value but not specified in the documents.
A search for zh-hant-TW will not match documents labeled zh-TW.
When the match value is truncated to zh-hant there will still be no match.
When it is truncated to zh, both zh-TW and zh-CN are considered matches. Even
though zh-TW might be a better choice, the zh-CN document might be returned.

It seems to me that the match algorithm might treat a match on more subtags
as a better match.
So zh-TW is a better match for the language range zh-hant-TW than zh-CN is,
since it matches in 2 subtags vs. 1.
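
One way such a ranking might look, as a rough Python sketch: count the subtag
types on which the range and the tag agree, and prefer the tag with the
highest count. The names parse, match_score, and best_match are hypothetical,
the subtag typing is guessed from length alone, and the choice to require the
language subtag to match is an assumption. How to treat subtags that
conflict, rather than being merely absent, is left open here.

```python
def parse(tag):
    """Split a tag into typed subtags (simplified, hypothetical helper)."""
    parts = tag.lower().split("-")
    fields = {"language": parts[0], "script": None, "region": None, "variant": None}
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha() and fields["script"] is None:
            fields["script"] = part
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            fields["region"] = part
        else:
            fields["variant"] = part
    return fields


def match_score(language_range, tag):
    """Number of subtag types specified in the range that the tag agrees with.

    Assumption: a different language subtag means no match at all (score 0).
    """
    wanted = parse(language_range)
    have = parse(tag)
    if wanted["language"] != have["language"]:
        return 0
    return sum(1 for kind, value in wanted.items()
               if value is not None and have[kind] == value)


def best_match(language_range, available_tags):
    """Return the highest-scoring tag, or None if nothing scores above 0."""
    best = max(available_tags, key=lambda t: match_score(language_range, t),
               default=None)
    if best is None or match_score(language_range, best) == 0:
        return None
    return best


print(match_score("zh-hant-TW", "zh-TW"))            # 2
print(match_score("zh-hant-TW", "zh-CN"))            # 1
print(best_match("zh-hant-TW", ["zh-CN", "zh-TW"]))  # zh-TW
```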


Are these realistic concerns? I think so. In this highly integrated world my
applications exchange language identifiers with other applications, and
depending on the vintage of the apps and how knowledgeable their developers
were, we will see all types of match values and labels on documents. That is,
I can't presume to control these values.

The match algorithm should be more sophisticated to better take into account
the greater information that 3066bis offers.
tex


