Summary of discussion so far on script tags in language tags
Harald Tveit Alvestrand
harald at alvestrand.no
Mon Apr 14 12:58:44 CEST 2003
Having spent several hours trying to catch up with the message flow on this
list, I'll try to summarize the discussion so far...
THE ISSUE: Script information as part of a language tag
QUESTION 1: Should language tags that differ only/mainly in script be
allowed?
YES: Mark Davis and others
- The script is often very important for distinguishing an acceptable
variant of a text from an unacceptable one, and making that distinction
is a common current use of language tags
- The script sometimes aligns more closely with other distinctions than
other distinguishing features (such as "country") do
- Hierarchic left-hand substring match fallback will often give sensible
results when user preferences are stated in terms of language without
script, so the introduction "does no harm"
- Other systems use script as part of their internal identifiers, so
it must also be present in RFC 3066 if equivalence is to be maintained
NO: Jon Hanna and others
- Script is not language. It is an orthogonal feature, and should be
independently represented.
- When the text itself is present, the script can easily be identified by
looking at the characters used, so tagging it is not needed
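
Two of the arguments above are mechanical enough to sketch. The following
is only a rough Python illustration, not anything specified by RFC 3066:
the function names are hypothetical, the tags "sr-Cyrl" and "sr-Latn" are
example values, and the script guess leans on Unicode character names as a
crude heuristic rather than on a real script property.

```python
import unicodedata
from collections import Counter


def prefix_match(user_range: str, tag: str) -> bool:
    """Left-hand match: the preference equals the tag, or is a prefix of it
    that ends exactly at a hyphen (hypothetical helper)."""
    r, t = user_range.lower(), tag.lower()
    return t == r or t.startswith(r + "-")


def dominant_script(text: str) -> str:
    """Guess the script from the characters alone.

    Assumption: the first word of a letter's Unicode name ("LATIN",
    "CYRILLIC", ...) is a good enough stand-in for its script.
    """
    counts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    )
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


# A user who asks for plain "sr" still matches both script-tagged variants.
print(prefix_match("sr", "sr-Cyrl"))  # True
print(prefix_match("sr", "sr-Latn"))  # True

# With the text present, the script can be read off the characters.
print(dominant_script("Србија"))  # CYRILLIC
print(dominant_script("Srbija"))  # LATIN
```

Note that the second function only helps when the text is at hand; a
request for a text that has not been fetched yet has no characters to
inspect.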
QUESTION 2: If script differences are allowed, how should they be tagged?
Everyone seems to agree that ISO 15924 is the right source of tags for
scripts, although there is some debate arising from the fact that there
has been no agreement to encode "traditional" and "simplified" Chinese as
separate scripts.
Everyone also seems to agree that "lang-script" makes sense in many cases -
ISO 639 code + ISO 15924 code. The more interesting question is what to do
when both country info and script info are needed.
PROPOSAL 1: Lang-Country-Script
- This is a natural extension of RFC 3066
- This provides the right fallback if country variant is more important
than script variant
PROPOSAL 2: Lang-Script-Country
- This provides the right fallback if script variant is more important
than country variant
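
The difference between the two proposals shows up in what is dropped first
when a tag is truncated from the right. A minimal sketch, assuming simple
hyphen-by-hyphen truncation; the function name is hypothetical and the
Azeri tags are illustrative values, not registered ones:

```python
from typing import List


def fallback_chain(tag: str) -> List[str]:
    """List the tag followed by each shorter left-hand prefix, obtained by
    dropping one hyphen-separated subtag at a time from the right."""
    parts = tag.split("-")
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]


# Proposal 1 (Lang-Country-Script): script is dropped first, country is kept.
print(fallback_chain("az-AZ-Latn"))  # ['az-AZ-Latn', 'az-AZ', 'az']

# Proposal 2 (Lang-Script-Country): country is dropped first, script is kept.
print(fallback_chain("az-Latn-AZ"))  # ['az-Latn-AZ', 'az-Latn', 'az']
```

Whichever subtag sits in the second position survives the first fallback
step, which is why the ordering encodes a judgment about which distinction
matters more.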
QUESTION 3: If script differences are allowed, and the choice in Question 2
is settled, should the tags be generative or registration-only?
GENERATIVE:
- No need for pre-registration
- All combinations of lang + script that can be generated by other
systems such as MS-Windows have a natural mapping
- Follows the pattern of the lang + country generative mechanism of
ISO 639 / RFC 3066
REGISTRATION-ONLY:
- Generative needs a revision or addendum to RFC 3066 to come into existence
- There are only about 24 interesting combinations anyway
- Lots of the combinations would be meaningless, and lots would be
effective duplicates. Duplicates make the recipient's task harder.
- The lang+country generative mess only shows that we should not do
this again.
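
To make the contrast concrete, here is a rough sketch of how a recipient
might validate a "lang-script" tag under each regime. Everything in it is
an assumption for illustration: the three sets are tiny made-up samples,
not the contents of ISO 639, ISO 15924, or any actual registry, and the
function names are hypothetical.

```python
# Illustrative samples only; not real code lists or registry contents.
LANGS = {"az", "sr", "zh"}
SCRIPTS = {"Latn", "Cyrl", "Arab"}
REGISTERED = {"az-Latn", "az-Cyrl", "sr-Latn", "sr-Cyrl"}


def valid_generative(tag: str) -> bool:
    """Generative: any known language code plus any known script code."""
    parts = tag.split("-")
    return len(parts) == 2 and parts[0] in LANGS and parts[1] in SCRIPTS


def valid_registered(tag: str) -> bool:
    """Registration-only: the whole tag must have been registered."""
    return tag in REGISTERED


# A registered combination passes both checks.
print(valid_generative("sr-Cyrl"), valid_registered("sr-Cyrl"))  # True True

# The generative rule accepts a combination whether or not anyone uses it;
# the registry rejects it until somebody registers it.
print(valid_generative("sr-Arab"), valid_registered("sr-Arab"))  # True False
```

The generative check needs no per-tag bookkeeping but admits every
combination, while the registry check admits only what was explicitly
entered.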
QUESTION 4:
If the mechanism picked by this list can't be used to let language tags
distinguish between Traditional and Simplified Han, have we solved the
problem?
YES - the problem needs solving for Azeri, Serbian and so on
NO - unless we fix the Chinese problem, the solution is unacceptable
A number of other questions, including the need for databases of language
information for all the (non)hierarchical relationships that language tags
do NOT capture, the actual status of TC/SC, whether "locale" is a useless
concept, and the writing traditions of Azerbaijan, have been touched on in
the debate. But I believe those 4 are the essential ones for this list to
decide.
Suggestion for next steps:
- If you think my summary needs refinement, please reply to this message
suggesting a change to the text.
- If you want to continue the debate, reply to this or another message, but
CHANGE THE SUBJECT LINE.
- Once we're pretty certain we have the right set of questions, I'll send
out a request for a poll on the possible answers. The result will likely be
a list of NAMES for each alternative, not a count - we're looking for a
simple way to survey people's opinions, not for anonymous voting!
Your comments?
Harald