Script codes in RFC 3066
Martin Duerst
duerst at w3.org
Wed Apr 9 14:47:21 CEST 2003
Hello Mark,
I have been looking at your document at
http://oss.software.ibm.com/icu/dropbox/language_code_mess.html.
Here are my comments/questions:
- A title like this spreads more confusion than necessary. I don't
think that any technical topic is helped by this kind of title.
- For 'Current RFC 3066 Formats', your numbered list ignores the
ability to extend any of the four formats by registration. You
say 'but these are the only important ones in practice'. But
for what you want to do in your paper, the possibility of extension
by registration is extremely important.
- What is the current status of 15924, and expected date of completion?
Does it allow for registrations? Why doesn't it include Hans and
Hant from the start, and how quickly can they be added?
I have searched for 15924 on www.iso.ch, but didn't find
anything. From http://www.evertype.com/standards/iso15924/policies.html,
it seems that it's still in committee draft status. Maybe that
page needs to be updated.
http://www.evertype.com/standards/iso15924/document/fdis15924.pdf
suggests that it is in FDIS stage.
- I'm somewhat confused by the combination of the 'principle in ISO 639
that only spoken language matters' and the requirement of having 50
(written, I guess) documents to be able to register a language.
- 'both Cyrillic and Latin are used in Serbia, Azerbaijan, and
Uzbekistan': Having two scripts used in one country isn't an
issue. I guess you wanted to speak about languages, rather than
countries, here.
- Productive use of script codes: Here I think I very much disagree
with you. There are about 100 script codes. There are about
200 country/region codes, and about 500 (and increasing) language
codes. Creating 10,000,000 codes for a currently documented need
of 12 or 25 codes seems like complete overkill.
One particular concern I have is that once there is a productive
pattern, the assumption that all the slots have to be filled in
seems to spread in an uncontrolled way. I have seen numerous examples
of tags such as 'ja-jp', which in particular as far as language goes,
doesn't give more information than simply 'ja'. I have also seen
software that insisted on always having a country/region code
in a language tag. When I tested it, I would e.g. set the language
to 'he', and then look at the HTML generated and see 'he-us', because
the software was set with 'us' as the default, and I didn't change that.
(Needless to say, I didn't try out that software for more than
five minutes.)
The recent rejection of a special tag for indicating 'Yiddish written
in Hebrew' is another good datapoint. There is really no need at all
to say that Yiddish is written in Hebrew, because that's obvious
unless there is information to the contrary.
Another point is that while something like az-Latn/az-Cyrl is very
good for language negotiation (e.g. HTTP Accept-Language/
Content-Language headers), it is really enough to mark up the
actual text (e.g. with xml:lang) with 'az' only, because the
script is self-evident from the characters used.
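
To illustrate the last point, here is a minimal Python sketch of how
software could recover the script from the text itself. It assumes that
the first word of a letter's Unicode character name ('LATIN',
'CYRILLIC', ...) is a good enough proxy for its script; the function
name dominant_script is hypothetical, and a real implementation would
use the Unicode script property instead.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Guess the script of `text` from Unicode character names.

    Assumption for this sketch: the first word of a letter's Unicode
    name (e.g. 'LATIN', 'CYRILLIC') stands in for its script.
    Returns None if the text contains no letters.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


# Text tagged only as 'az' still reveals which script it uses.
print(dominant_script("Salam"))   # LATIN
print(dominant_script("Салам"))   # CYRILLIC
```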
- You say '[RFC 3066 registration] can take quite a while'. If you have
a perfect proposal, the minimum time it takes is two weeks. The main
thing to know for a successful registration at the IETF is that this
isn't your average bureaucracy where you submit an application, and
then everything else is taken care of (although slowly). Basically,
it is up to the registrant to follow through the process. If there
is no discussion, maybe send another message asking for comments.
If there are proposals for change or requests for more information,
submit a revised proposal. If you don't hear from the reviewer
within two weeks, send a (gentle!) reminder message. If you still
don't hear from him, send another (again gentle) reminder. If the
registration is approved by the reviewer, but doesn't turn up at
IANA after some time, send an inquiry/reminder (again gentle) to
IANA.
- Also, you say 'each case has to be separately registered'. I don't
think there is any problem with sending e.g. 10 registrations in
the same email, if they are related.
- Sequence of tags: 'top script' is obviously a bad idea. As for
middle script vs. bottom script, I think that you have shown
well that for the Chinese case, middle script seems to work
better. But I don't think that this should lead to the conclusion
that this will always be the case; there are just not enough actual
cases around.
- Plan of Action: I think point 1 and point 2 (maybe with the
exception of attached country codes where there is only one
country) make a lot of sense. Please go ahead. As I said above,
I'm quite sceptical about 3.
- For registering only zh-Hant (and not zh-Hans), although it is
clear that in terms of numbers, more Chinese are using simplified,
I don't think we should let zh just stand for simplified, because
in that case, we would not have a code for expressing Chinese
independent of scripts. I'm not familiar enough with az, uz, and
sr to judge whether there is enough of a dominance for one script
to just register the other.
- "IUC already supports RFC 3066 codes...": I'm not happy with such
statements. What exactly does IUC do with such codes? What does
it mean that a code is supported?
- The section with the zh-TW-HK and similar examples is very unclear
to me. If you support RFC 3066, then zh-TW-HK is undefined, because
not registered, and zh-TW is Chinese as used in Taiwan. If zh-TW-HK
ever gets registered, it will be a variant of Chinese as used in
Taiwan. Claiming now that HK is a region code doesn't make sense
at all.
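
The reading of RFC 3066 I am applying here can be sketched in a few
lines of Python. This is only a rough illustration under stated
assumptions: the function classify and its `registered` parameter are
hypothetical, `registered` merely stands in for the IANA registry, and
the sketch checks only the shape of the subtags, not whether they are
actual ISO 639 or ISO 3166 codes.

```python
import re

# RFC 3066 syntax: a primary subtag of 1-8 letters, followed by any
# number of subtags of 1-8 letters or digits, separated by hyphens.
TAG_RE = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def classify(tag, registered=frozenset()):
    """Classify a language tag by shape only.

    `registered` is a hypothetical stand-in for the IANA registry and
    is expected to hold lowercase tags.
    """
    if not TAG_RE.fullmatch(tag):
        return "malformed"
    if tag.lower() in registered:
        return "registered"
    subtags = tag.split("-")
    # Assumption: a 2- or 3-letter primary subtag is an ISO 639 code.
    if len(subtags[0]) in (2, 3):
        if len(subtags) == 1:
            return "language"
        if len(subtags) == 2 and len(subtags[1]) == 2 and subtags[1].isalpha():
            return "language-country"
    return "undefined unless registered"


print(classify("zh-TW"))                    # language-country
print(classify("zh-TW-HK"))                 # undefined unless registered
print(classify("zh-TW-HK", {"zh-tw-hk"}))   # registered
```

Nothing in this classification lets a third subtag such as HK be read
as a region code; it only becomes meaningful through registration.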
- Re. hierarchies and locales: I think the Java locale model
(language-region-variant) and the hierarchical inheritance of
data based on the tag structure is just too inflexible. It may
be a good model for default behavior, but there is definitely
a need for more flexibility, for example for being able to
inherit data from a completely different place if necessary.
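
As a rough sketch of what I mean by more flexibility, the Python below
keeps truncation of the tag as the default and adds an explicit
override table. The names lookup_chain and `parents` are hypothetical,
and the zh-HK to zh-TW link is purely an illustrative assumption, not a
claim about how any existing library resolves locale data.

```python
def lookup_chain(tag, parents=None):
    """Return the tags to consult for data, most specific first.

    Default: the parent of a tag is the tag with its last subtag
    removed (the rigid, structure-based model). `parents` is a
    hypothetical override table that lets a tag inherit from a
    completely different place instead.
    """
    parents = parents or {}
    chain, seen = [], set()
    current = tag
    while current and current not in seen:   # `seen` guards against cycles
        chain.append(current)
        seen.add(current)
        if current in parents:
            current = parents[current]
        elif "-" in current:
            current = current.rsplit("-", 1)[0]
        else:
            current = None
    return chain


# Purely structural inheritance.
print(lookup_chain("zh-HK"))                      # ['zh-HK', 'zh']
# With an explicit (illustrative) override.
print(lookup_chain("zh-HK", {"zh-HK": "zh-TW"}))  # ['zh-HK', 'zh-TW', 'zh']
```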
- "For example, when typical web software receives an RFC 3066
code, it use[s] it as a locale code." What is 'typical web software'?
Neither browsers nor servers use RFC 3066 as locale codes, and
they are the two most typical pieces of Web software I know.
Regards, Martin.