Script codes in RFC 3066

Wed Apr 9 14:47:21 CEST 2003

Hello Mark,

I have been looking at your document at
http://oss.software.ibm.com/icu/dropbox/language_code_mess.html.

Here are my comments/questions:

- A title like this spreads more confusion than necessary. I don't
   think that any technical topic is helped by this kind of title.

- For 'Current RFC 3066 Formats', your numbered list ignores the
   ability to extend any of these formats by registration. You
   say 'but these are the only important ones in practice'. But
   for what you want to do in your paper, the possibility of extension
   by registration is extremely important.

- What is the current status of 15924, and expected date of completion?
   Does it allow for registrations? Why doesn't it include Hans and
   Hant from the start, and how quickly can they be added?
   I have searched for 15924 on www.iso.ch, but didn't find
   anything. From http://www.evertype.com/standards/iso15924/policies.html,
   it seems that it's still in committee draft status. Maybe that
   page needs to be updated.
   http://www.evertype.com/standards/iso15924/document/fdis15924.pdf
   suggests that it is in FDIS stage.

- I'm somewhat confused by the combination of the 'principle in ISO 639
   that only spoken language matters' and the requirement of having 50
   (written, I guess) documents to be able to register a language.

- 'both Cyrillic and Latin are used in Serbia, Azerbaijan, and
   Uzbekistan': Having two scripts used in one country isn't an
   issue. I guess you wanted to speak about languages, rather than
   countries, here.

- Productive use of script codes: Here I think I very much disagree
   with you. There are about 100 script codes. There are about
   200 country/region codes, and about 500 (and increasing) language
   codes. Creating 100 x 200 x 500 = 10,000,000 codes for a currently
   documented need of 12 or 25 codes seems like complete overkill.

   One particular concern I have is that once there is a productive
   pattern, the assumption that all the slots have to be filled in
   seems to spread in an uncontrolled way. I have seen numerous examples
   of tags such as 'ja-jp', which, at least as far as language goes,
   gives no more information than simply 'ja'. I have also seen
   software that insisted on always having a country/region code
   in a language tag. When I tested it, I would e.g. set the language
   to 'he', and then look at the HTML generated and see 'he-us', because
   the software was set with 'us' as the default, and I didn't change that.
   (Needless to say, I didn't try out that software for more than
   five minutes.)

   The recent rejection of a special tag for indicating 'Yiddish written
   in Hebrew' is another good datapoint. There is really no need at all
   to say that Yiddish is written in Hebrew, because that's obvious
   unless there is information to the contrary.

   Another point is that while something like az-Latn/az-Cyrl is very
   good for language negotiation (e.g. HTTP Accept-Language/
   Content-Language headers), it is really enough to mark up the
   actual text (e.g. with xml:lang) with 'az' only, because the
   script is self-evident from the characters used.
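
A minimal sketch of why the script need not be tagged on the text
itself, assuming that the first word of a character's Unicode name
('LATIN', 'CYRILLIC', ...) is a good enough stand-in for its script.
The function name scripts_used and the sample strings are made up for
illustration; a real implementation would use the Unicode script
property instead.

```python
import unicodedata

def scripts_used(text):
    """Return rough script names for the letters found in `text`.

    Assumption: the first word of the Unicode character name is used
    as an approximation of the script.
    """
    found = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found

# Both samples could be tagged simply as 'az'; the characters
# themselves already tell Latin from Cyrillic.
print(scripts_used("salam"))
print(scripts_used("\u0441\u0430\u043b\u0430\u043c"))
```

For negotiation there is no text to inspect yet, which is why a tag
such as az-Latn is still useful there.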

- You say '[RFC 3066 registration] can take quite a while'. If you have
   a perfect proposal, the minimum time it takes is two weeks. The main
   thing to know for a successful registration at the IETF is that this
   isn't your average bureaucracy where you submit an application, and
   then everything else is taken care of (although slowly). Basically,
   it is up to the registrant to follow through the process. If there
   is no discussion, maybe send another message asking for comments.
   If there are proposals for change or requests for more information,
   submit a revised proposal. If you don't hear from the reviewer
   within two weeks, send a (gentle!) reminder message. If you still
   don't hear from him, send another (again gentle) reminder. If the
   registration is approved by the reviewer, but doesn't turn up at
   IANA after some time, send an inquiry/reminder (again gentle) to
   IANA.

- Also, you say 'each case has to be separately registered'. I don't
   think there is any problem with sending e.g. 10 registrations in
   the same email, if they are related.

- Sequence of tags: 'top script' is obviously a bad idea. As for
   middle script vs. bottom script, I think that you have shown
   well that for the Chinese case, middle script seems to work
   better. But I don't think that this should lead to the conclusion
   that this will always be the case; there are just not enough actual
   cases around.

- Plan of Action: I think point 1 and point 2 (maybe with the
   exception of attached country codes where there is only one
   country) make a lot of sense. Please go ahead. As I said above,
   I'm quite sceptical about 3.

- On registering only zh-Hant (and not zh-Hans): although it is
   clear that, in terms of numbers, more Chinese are using simplified,
   I don't think we should let zh just stand for simplified, because
   in that case we would not have a code for expressing Chinese
   independent of script. I'm not familiar enough with az, uz, and
   sr to judge whether there is enough of a dominance for one script
   to just register the other.

- "ICU already supports RFC 3066 codes...": I'm not happy with such
   statements. What exactly does ICU do with such codes? What does
   it mean that a code is supported?

- The section with the zh-TW-HK and similar examples is very unclear
   to me. If you support RFC 3066, then zh-TW-HK is undefined, because
   it is not registered, and zh-TW is Chinese as used in Taiwan. If zh-TW-HK
   ever gets registered, it will be a variant of Chinese as used in
   Taiwan. Claiming now that HK is a region code doesn't make sense
   at all.
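
The reading of zh-TW and zh-TW-HK above can be sketched roughly as
follows. This is an illustration only: the function describe and the
registered parameter (a stand-in for the IANA registry) are
hypothetical, the special primary subtags 'i' and 'x' are ignored, and
no check is made that the codes actually exist in ISO 639 or ISO 3166.

```python
import re

# RFC 3066 syntax: a primary subtag of 1-8 letters, followed by
# hyphen-separated subtags of 1-8 letters or digits.
TAG_RE = re.compile(r"^[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*$")

def describe(tag, registered=frozenset()):
    """Classify a tag roughly along RFC 3066 lines.

    `registered` is a hypothetical set of lowercased registered tags.
    """
    if not TAG_RE.match(tag):
        return "syntactically invalid"
    subtags = tag.lower().split("-")
    if tag.lower() in registered:
        return "registered tag"
    if len(subtags) == 1:
        return "language only"
    if len(subtags) == 2 and len(subtags[1]) == 2 and subtags[1].isalpha():
        return "language plus country code"
    # Anything longer has no meaning until somebody registers it.
    return "undefined unless registered"

print(describe("zh-TW"))
print(describe("zh-TW-HK"))
```

Nothing in this reading lets a third subtag such as HK be interpreted
as a region on its own.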

- Re. hierarchies and locales: I think the Java locale model
   (language-region-variant) and the hierarchical inheritance of
   data based on the tag structure is just too inflexible. It may
   be a good model for default behavior, but there is definitely
   a need for more flexibility, for example for being able to
   inherit data from a completely different place if necessary.
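
To make the difference concrete, here is a small sketch contrasting
inheritance by truncating the tag with inheritance that allows
overrides. Both function names and the parents table are hypothetical,
and the zh-HK/zh-TW pairing is only an example of 'a completely
different place', not a recommendation.

```python
def truncation_chain(tag):
    """Fixed inheritance: drop the last subtag until none is left."""
    parts = tag.split("-")
    return ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]

def flexible_chain(tag, parents):
    """Inheritance with overrides.

    `parents` is a hypothetical table mapping a tag to the tag it
    should inherit from; unlisted tags fall back to truncation.
    """
    chain = []
    seen = set()  # guards against cycles in the override table
    current = tag
    while current and current not in seen:
        chain.append(current)
        seen.add(current)
        if current in parents:
            current = parents[current]
        else:
            current = current.rpartition("-")[0]
    return chain

print(truncation_chain("zh-HK"))
print(flexible_chain("zh-HK", {"zh-HK": "zh-TW"}))
```

With an empty parents table the second function behaves like the
first, so truncation can remain the default behavior.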

- "For example, when typical web software receives an RFC 3066
   code, it use[s] it as a locale code." What is 'typical web software'?
   Neither browsers nor servers use RFC 3066 as locale codes, and
   they are the two most typical pieces of Web software I know.

Regards,     Martin.