Solving the UTF-8 problem
dewell at roadrunner.com
Sat Jul 14 23:38:28 CEST 2007
CE Whitehead <cewcathar at hotmail dot com> wrote:
> Here is the view:
> "<p>The title in Chinese is <span lang="zh-Hans"
> (Maybe the characters will come out in your email but they look like
> ??? in my browser; I need to take an image; no decent camera here)
This completely changes the subject, from displaying Latin and Cyrillic
and punctuation characters encoded in UTF-8 in a text file, to
displaying Chinese characters encoded with decimal NCRs in a Web page.
Western versions of Windows 95 are well known for their lack of support
for East Asian characters. That is irrelevant to the Registry.
Almost every character in the current Registry, and in the draft-4645bis
Registry as it currently stands, is in Windows code page 1252 (Western
European) and in MacRoman. There is a small handful of non-Latin-1
characters that should still be present in a WGL4-compliant font.
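The code-page claim is easy to check mechanically. Here is a minimal
Python sketch, assuming Python's built-in "cp1252" and "mac_roman"
codecs are acceptable stand-ins for the two code pages. The function
name and the sample string are illustrative only, not Registry
contents. Note that this tests code-page membership, which is a
separate question from what a given font can display.

```python
def unencodable(text, encoding):
    """Return the distinct characters in text that the codec cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


# Illustrative sample: one accented Latin letter plus three code points
# that are outside both code pages.
sample = "Aru\u00e1 \u02bb \u04d8\u04d9"

for codec in ("cp1252", "mac_roman"):
    report = [f"U+{ord(ch):04X}" for ch in unencodable(sample, codec)]
    print(codec, report)
```

Running the same function over the full text of a file would list
every character that falls outside a given code page.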
It is true that we may include comments in non-Latin scripts such as
Chinese in the future, but as I have said before, I do support
transcribing these into Latin (in addition to keeping the original, if
the requestor so desires). What I do not support is "transcribing"
accented Latin into unaccented Latin. That is remedial, and it is
unnecessary in 2007, and it will only add to the confusion regarding
Arua and Aruá and other pairs.
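To illustrate the confusion, here is a small Python sketch of what
accent-stripping does to such a pair. The strip_accents helper is
hypothetical and shows only one common way of doing it (NFD
decomposition, then dropping combining marks); it is not anything the
Registry does.

```python
import unicodedata


def strip_accents(text):
    """Decompose to NFD and drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


plain = "Arua"
accented = "Aru\u00e1"  # Aruá

# The two names are distinct as written ...
print(plain == accented)
# ... but indistinguishable once the accent is stripped.
print(strip_accents(plain) == strip_accents(accented))
```

Once the accent is gone, nothing in the text tells the two names apart.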
Here are the steps I followed at work to open a local UTF-8 file using a
stock Windows 95 Rev B machine with Internet Explorer 5:
* Click File, Open
* Click Browse
* Change the "Files of type" setting to "Text Files"
* Select the file
* Click "Open as Web folder"
* Click OK
* Click Yes when asked if you want to see the default view
Using either Courier New or Lucida Console, the two monospace fonts
available in Windows 95 for this purpose, the only characters in the
current Registry which would not display were U+02BB, in the description
for script Ethi, and U+04D8 and U+04D9, in the description for variant
"baku1926". In the 4645bis Registry we will probably also add U+02BC in
the description for language "gwi", to match ISO 639-3. Three or four
undisplayable characters are not evidence of wholesale unreadability for
Windows 95 users.
I will note that I couldn't open the UTF-8 Registry at langtag.net in
this way, because Stéphane named it language-subtag-registry.utf8. Many
operating systems including Windows, and many browsers including IE,
know what to do with files with a .txt extension but have no idea what
to do with a .utf8 file. I suggest changing the naming convention for
UTF-8 files on langtag.net from "something.utf8" or "something.txt.utf8"
to "something.utf8.txt" to take advantage of this.
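The suggested renaming can be sketched in a few lines of Python. The
function name is hypothetical, and the sketch only computes the new
name; it does not touch any files.

```python
def txt_friendly_name(name):
    """Map 'x.utf8' or 'x.txt.utf8' to 'x.utf8.txt'; leave other names alone."""
    if not name.endswith(".utf8"):
        return name
    base = name[: -len(".utf8")]
    # Avoid producing 'x.txt.utf8.txt' for names that already carry .txt.
    if base.endswith(".txt"):
        base = base[: -len(".txt")]
    return base + ".utf8.txt"


for old in ("language-subtag-registry.utf8", "something.txt.utf8", "notes.txt"):
    print(old, "->", txt_friendly_name(old))
```

Names that do not end in .utf8 pass through unchanged, so the function
is safe to run over a whole directory listing.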
> So just make sure to link to the fonts, viewers, etc. when you put up
> the utf-8 version and keep an ansi or ascii version unofficially.
One more time: We will not be the only organization out there with a
UTF-8 file, and we are not responsible for tutoring the rest of the
world that doesn't know how to work with UTF-8. There are many
useful "Introduction to UTF-8" sites that can help. I can see posting a
"reduced value" hex-NCR version of the Registry, and I can see posting
an informative link to a tutorial site, but I cannot see making this a
primary activity of the language-tagging group.
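For what it is worth, producing a hex-NCR version from the UTF-8 text
is mechanical. Here is a rough Python sketch, assuming the "&#xHHHH;"
notation with uppercase hex digits and no zero padding. Both function
names are hypothetical.

```python
import re


def to_hex_ncr(text):
    """Replace every non-ASCII character with a hex NCR such as &#xE1;."""
    return "".join(
        ch if ord(ch) < 0x80 else f"&#x{ord(ch):X};" for ch in text
    )


def from_hex_ncr(text):
    """Expand hex NCRs back into the characters they stand for."""
    return re.sub(
        r"&#x([0-9A-Fa-f]+);",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )


original = "Aru\u00e1"  # Aruá
escaped = to_hex_ncr(original)
print(escaped)
print(from_hex_ncr(escaped) == original)
```

The escaped form is pure ASCII and round-trips without loss, but it is
harder to read, which is why I call it "reduced value."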
Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14