confusing labels

Mon Apr 13 05:29:26 CEST 2009

Dear Mr. Klensin,
Here is another input I find interesting, from Don Osborn. It calls for
organised versatility in headers and algorithms, and for foreseeing
transitional and parallel solutions.

Sincerely,
Xavier Legoff

---

A quick review of coding on BBC World Service pages in diverse
languages at http://www.bbc.co.uk/worldservice/languages/  reveals … a
diversity of charset codes used, with most pages *not* in utf-8.  I
suspect that BBC is anticipating the kinds of systems that users in
each language population will rely on, trying to accommodate the least
sophisticated systems and font repertoires.  Assuming that their read
is accurate (and that they're not just being conservative about
making the change to utf-8), this would seem to be an interesting
window on how widespread the use of Unicode is or is not at the
present time.  On the other hand, it is worth noting that no
Latin-based orthography is displayed on bbc.co.uk in utf-8, even when
characters beyond Latin-1 are used (Turkish) or should be used
(Hausa). If one had the time, it would be interesting to look also at
other international radio sites - VOA, RFI, Deutsche Welle, Radio
China, etc.
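
The point about Turkish and Hausa can be checked mechanically. What follows is only a minimal sketch in Python, not anything BBC uses: the helper name can_encode and the sample letters are my own illustrative choices (the hooked letters of the formal Hausa orthography, and three Turkish letters that fall outside Latin-1). It asks whether each sample can be represented in a given charset.

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character in text is representable in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Illustrative samples (assumptions, not taken from the BBC pages):
hausa_letters = "ɓɗƙ"    # hooked letters of the formal Hausa orthography
turkish_letters = "şğı"  # Turkish letters beyond Latin-1

for charset in ("iso-8859-1", "windows-1254", "utf-8"):
    print(
        charset,
        "Hausa:", can_encode(hausa_letters, charset),
        "Turkish:", can_encode(turkish_letters, charset),
    )
```

Under these assumptions, iso-8859-1 can hold neither sample, windows-1254 can hold the Turkish letters but not the Hausa ones, and utf-8 can hold both, which is consistent with the Hausa page falling back on an ASCIIfied transcription.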

Among the questions I have is whether we can expect that all web
content (at least on high-profile international sites) will eventually
move to utf-8 or another Unicode encoding, or whether various non-Unicode
8-bit standards will continue to hold sway in selected areas for some time
to come.  I think that in the "ecology" of localization in a region
such as West Africa, the use or non-use of utf-8 by international
websites for a language like Hausa (which basically is the difference
between being able to use the formal orthography or resorting to an
ASCIIfied transcription as they currently do) certainly has an effect
on the way that that language and others are used in text offline. At
what point does the argument that too many local systems in a region
do not have Unicode fonts lose its validity, and at what point should
organizations like BBC take the leadership in use of utf-8 (as it did
a while back with a Unicode font for Urdu)?

BBC lists 32 languages, but two of them - Kinyarwanda and Kirundi -
lead to the same "Great Lakes" page (the two languages are
mutually intelligible).  Also, for the sake of this list, I count
Portuguese only once, even though BBC keeps its Brazilian and African
varieties separate. Hence the total below comes to 30.

Albanian  charset=windows-1250
Arabic  charset=windows-1256
Azeri  charset=utf-8
Bangla  charset=utf-8
Burmese  charset=utf-8
Chinese  charset=gb2312
English (Caribbean)  charset=iso-8859-1
French  charset=iso-8859-1
Hausa  charset=iso-8859-1
Hindi  charset=utf-8
Indonesian  charset=iso-8859-1
Kinyarwanda (& Kirundi)  charset=iso-8859-1
Kyrgyz  charset=utf-8
Macedonian  charset=windows-1251
Nepali  charset=utf-8
Pashto  charset=utf-8
Persian  charset=utf-8
Portuguese  (both Brazilian and African)  charset=iso-8859-1
Russian  charset=windows-1251
Serbian  charset=windows-1250
Sinhala  charset=utf-8
Somali  charset=iso-8859-1
Spanish  charset=iso-8859-1
Swahili  charset=iso-8859-1
Tamil  charset=utf-8
Turkish  charset=windows-1254
Ukrainian  charset=windows-1251
Urdu  charset=utf-8
Uzbek  charset=utf-8
Vietnamese  charset=utf-8

Totals:
13 utf-8
9 iso-8859-1
3 windows-1251
2 windows-1250
1 windows-1254
1 windows-1256
1 gb2312
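
For anyone who wants to repeat the tally, or extend it to other broadcasters, here is a small Python sketch that recomputes the totals from the list above. The data is copied from that list; the names CHARSETS and tally are just illustrative.

```python
from collections import Counter

# Charset declared per BBC World Service language page, as listed above.
CHARSETS = {
    "Albanian": "windows-1250",
    "Arabic": "windows-1256",
    "Azeri": "utf-8",
    "Bangla": "utf-8",
    "Burmese": "utf-8",
    "Chinese": "gb2312",
    "English (Caribbean)": "iso-8859-1",
    "French": "iso-8859-1",
    "Hausa": "iso-8859-1",
    "Hindi": "utf-8",
    "Indonesian": "iso-8859-1",
    "Kinyarwanda (& Kirundi)": "iso-8859-1",
    "Kyrgyz": "utf-8",
    "Macedonian": "windows-1251",
    "Nepali": "utf-8",
    "Pashto": "utf-8",
    "Persian": "utf-8",
    "Portuguese": "iso-8859-1",
    "Russian": "windows-1251",
    "Serbian": "windows-1250",
    "Sinhala": "utf-8",
    "Somali": "iso-8859-1",
    "Spanish": "iso-8859-1",
    "Swahili": "iso-8859-1",
    "Tamil": "utf-8",
    "Turkish": "windows-1254",
    "Ukrainian": "windows-1251",
    "Urdu": "utf-8",
    "Uzbek": "utf-8",
    "Vietnamese": "utf-8",
}

# Count how many language pages declare each charset.
tally = Counter(CHARSETS.values())

for charset, count in tally.most_common():
    print(count, charset)

print("total pages:", sum(tally.values()))
```

Run against this list, the counts agree with the totals given above, and utf-8 accounts for 13 of the 30 pages.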