script code defaults and 3066:bis

Jeremy Carroll jjc at hplb.hpl.hp.com
Fri Mar 19 12:48:45 CET 2004


Addison:
> (Aside to list: hello, world: if you have comments,
> now would be a very good time to make them.)

So here is a comment. (Not fully baked yet - and I am happy to be told this
is out of scope for RFC 3066 bis, and I need to make a case for the version
after).


I am looking at the script codes and thinking about mutual intelligibility
...


Summary:

Two additional registries (both fairly straightforward) would add
significantly to the ability to predict the level of mutual intelligibility
between two language tags and hence make the mechanisms of language range
and language tag fallback useful.


Detail:


RFC 3066 and its predecessors have comments to the effect that language tag
fallback doesn't work very well, and language ranges might surprise.

I may be misunderstanding the situation, but the non-intelligibility
problems typically seem to come down to scripts.

So the example given is 'az-Latn' versus 'az-Cyrl'

Scanning through the registry of script codes, I thought that the number of
potential partially-mutually-intelligible pairs of script codes was small.
http://www.unicode.org/iso15924/iso15924-codes.html

e.g.
hira hrkt
and
kana hrkt

not sure about
latf latg latn

and I would guess most of the hans, hant, hani and kana share enough to give
people something as a fallback.

But in general it seems that language tag fallback should not fall back when
the script codes are different.

It seems that a relatively small additional registry of pairs of script
codes that are exceptions to that rule would make language tag fallback with
script codes significantly more useful.
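
A minimal sketch of how such an exception registry might be consulted, in
Python. Everything here is hypothetical: the names SCRIPT_EXCEPTIONS,
script_of and script_allows_fallback are invented for illustration, the
registry holds only the two pairs listed above, and the script code is
assumed to be the first four-letter alphabetic subtag after the language
code. Tags with no script code are not blocked by this check.

```python
# Hypothetical exception registry: unordered pairs of ISO 15924 codes that
# are treated as partially mutually intelligible. Only the two pairs from
# the examples above are included; a real registry would need expert review.
SCRIPT_EXCEPTIONS = {
    frozenset({"hira", "hrkt"}),
    frozenset({"kana", "hrkt"}),
}


def script_of(tag):
    """Return the lowercased script subtag of a tag, or None if absent.

    Assumption: the script code is the first four-letter alphabetic
    subtag after the primary language subtag.
    """
    for subtag in tag.split("-")[1:]:
        if len(subtag) == 4 and subtag.isalpha():
            return subtag.lower()
    return None


def script_allows_fallback(requested, available):
    """Return True unless the two tags carry incompatible script codes."""
    a, b = script_of(requested), script_of(available)
    if a is None or b is None:
        # No script information on one side: this sketch does not decide.
        return True
    return a == b or frozenset({a, b}) in SCRIPT_EXCEPTIONS


print(script_allows_fallback("az-Latn", "az-Cyrl"))
print(script_allows_fallback("ja-Hira", "ja-Hrkt"))
```

With this table the first call rejects the az-Latn / az-Cyrl fallback and the
second accepts ja-Hira / ja-Hrkt, because that pair is in the registry.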

Similarly the advice:
"Avoid using subtags that add no distinguishing information about the
content. For example, the script subtag in 'en-Latn-US' is generally
unnecessary, since nearly all English texts are written in the Latin
script." could be augmented by a registry of each ISO 639 code and its
default script, with a rule that tags SHOULD NOT make the default script
explicit and SHOULD make any other script explicit. That would make the
advice more useful, and these defaults could then be used in language tag
fallback.
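
As a rough illustration of the default-script idea, here is a small Python
sketch. It is an assumption-laden example, not a proposal for the actual
registry format: DEFAULT_SCRIPT, drop_default_script and effective_script are
invented names, the table holds a single entry taken from the quoted advice,
and the script code is assumed to be the second subtag.

```python
# Hypothetical default-script table keyed by ISO 639 code. The 'en' entry
# follows the quoted advice; a real registry would cover every code.
DEFAULT_SCRIPT = {"en": "Latn"}


def _has_script(subtags):
    """Assumption: a script code, if present, is the second subtag."""
    return len(subtags) > 1 and len(subtags[1]) == 4 and subtags[1].isalpha()


def drop_default_script(tag):
    """Remove the script subtag when it merely restates the default."""
    subtags = tag.split("-")
    if _has_script(subtags):
        default = DEFAULT_SCRIPT.get(subtags[0].lower())
        if default is not None and subtags[1].lower() == default.lower():
            del subtags[1]
    return "-".join(subtags)


def effective_script(tag):
    """Script to assume during fallback: explicit subtag, else the default."""
    subtags = tag.split("-")
    if _has_script(subtags):
        return subtags[1].title()
    return DEFAULT_SCRIPT.get(subtags[0].lower())


print(drop_default_script("en-Latn-US"))
print(effective_script("en-US"))
```

The first function applies the SHOULD NOT rule to an existing tag; the second
recovers the implied script so that a fallback check can compare scripts even
when a tag omits one.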

Note that this suggested registry is roughly the opposite of the RFC 3066
registry: it lists combinations that should not be used because the script
code is already the default, rather than combinations that should be used.

If I have understood correctly some legacy data is marked up as zh-TW to
mean zh-hant-TW etc etc. (i.e. the geography code has been used as a proxy
for a script code). The suggested default script code table could then have
an entry zh-hant-TW, showing that this default is understood for zh-TW; this would
help migrate legacy tags without script information, to a world with script
information.
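
The migration step could look something like the following Python sketch.
The table layout (keys of language plus optional region), the name DEFAULTS
and the function add_implied_script are all assumptions made for
illustration. The zh/TW entry is the example above and the en entry comes
from the quoted advice on 'en-Latn-US'; nothing else is claimed about real
data. The output uses the title-case spelling Hant, which is the same
subtag as hant since tags are case-insensitive.

```python
# Hypothetical default-script entries. Keys are (language, region); a region
# of None means "any region".
DEFAULTS = {
    ("zh", "TW"): "Hant",
    ("en", None): "Latn",
}


def add_implied_script(tag):
    """Insert the implied script into a legacy tag that lacks one."""
    subtags = tag.split("-")
    lang, rest = subtags[0].lower(), subtags[1:]
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        return tag  # script already explicit, leave the tag alone
    region = None
    if rest and len(rest[0]) == 2 and rest[0].isalpha():
        region = rest[0].upper()
    # A region-specific entry wins over a language-wide default.
    script = DEFAULTS.get((lang, region)) or DEFAULTS.get((lang, None))
    if script is None:
        return tag  # no registered default, nothing to add
    return "-".join([subtags[0], script] + rest)


print(add_implied_script("zh-TW"))
```

A bare zh is left unchanged here because the table deliberately has no
region-independent default for it.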

I think that, when combined, the techniques in this message would make
language tag fallback a significantly more useful mechanism.


===

It feels like what in W3C speak would be a "postponed issue"
(Unless it's totally broken due to some misunderstanding on my part)

Jeremy





More information about the Ietf-languages mailing list