Solving the UTF-8 problem

Sat Jul 14 23:38:28 CEST 2007

CE Whitehead <cewcathar at hotmail dot com> wrote:

> Here is the view:
>
> "<p>The title in Chinese is <span lang="zh-Hans"
> xml:lang="zh-Hans">&#20013;&#22269;&#31185;&#23398;&#38498;&#25991;&#29486;&#24773;&#25253;&#20013;&#24515;</span>.</p>"
>
> (Maybe the characters will come out in your email but they look like 
> ??? in my browser; I need to take an image; no decent camera here)

This completely changes the subject, from displaying Latin and Cyrillic 
and punctuation characters encoded in UTF-8 in a text file, to 
displaying Chinese characters encoded with decimal NCRs in a Web page. 
Western versions of Windows 95 are well known for their lack of support 
for East Asian characters.  This is irrelevant to the Registry.
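
For readers unfamiliar with decimal NCRs, here is a minimal sketch in 
Python 3 (standard library only) that turns the references from the 
quoted markup back into characters.  The function name decode_ncrs is 
made up for illustration; html.unescape is the standard-library call 
doing the work.

```python
import html

# The <span> content from the quoted markup, written as decimal NCRs.
ncr_text = (
    "&#20013;&#22269;&#31185;&#23398;&#38498;&#25991;"
    "&#29486;&#24773;&#25253;&#20013;&#24515;"
)

def decode_ncrs(text):
    """Replace numeric (and named) character references with characters."""
    return html.unescape(text)

decoded = decode_ncrs(ncr_text)

# List the code points so the result can be checked without a CJK font.
code_points = [f"U+{ord(ch):04X}" for ch in decoded]
print(len(decoded), code_points)
```

The NCR form is pure ASCII, so the file encoding is not what fails here; 
whether the decoded characters are visible depends on the fonts 
installed on the machine.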

Almost every character in the current Registry, and in the draft-4645bis 
Registry as it currently stands, is in Windows code page 1252 (Western 
European) and in MacRoman.  The exceptions are a small handful of 
non-Latin-1 characters that should still be present in a WGL4-compliant 
font.
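
A claim like this can be checked mechanically.  The following is a 
rough sketch, assuming Python 3 and its built-in "cp1252" and 
"mac_roman" codecs; the helper name unencodable and the sample string 
are made up for illustration and are not taken from the Registry.

```python
def unencodable(text, codec):
    """Return the sorted list of characters in text that codec lacks."""
    missing = set()
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.add(ch)
    return sorted(missing)

# Hypothetical sample: an accented Latin name plus U+02BB.
sample = "Aru\u00e1 \u02bb"

for codec in ("cp1252", "mac_roman"):
    missing = unencodable(sample, codec)
    print(codec, [f"U+{ord(ch):04X}" for ch in missing])
```

Running the same function over the full text of a UTF-8 file, read with 
open(path, encoding="utf-8"), would list every character that falls 
outside the code page.  Note that this tests code page coverage only, 
not what a particular font can draw.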

It is true that we may include comments in non-Latin scripts such as 
Chinese in the future, but as I have said before, I do support 
transcribing these into Latin (in addition to keeping the original, if 
the requestor so desires).  What I do not support is "transcribing" 
accented Latin into unaccented Latin.  That is remedial, and it is 
unnecessary in 2007, and it will only add to the confusion regarding 
Arua and Aruá and other pairs.
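
To make the Arua/Aruá point concrete, here is a small sketch of what 
accent-stripping does to such a pair.  It assumes Python 3 and one 
common stripping technique (NFD decomposition followed by dropping 
combining marks); nobody on this list has proposed this exact procedure, 
and strip_accents is a name invented for the example.

```python
import unicodedata

def strip_accents(text):
    """Decompose to NFD, then drop combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

plain = "Arua"
accented = "Aru\u00e1"   # "Aruá"

# Two distinct names before stripping ...
print(plain == accented)
# ... and indistinguishable afterwards.
print(strip_accents(plain) == strip_accents(accented))
```

Once the accent is gone, nothing in the text itself tells the two names 
apart.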

Here are the steps I followed at work to open a local UTF-8 file using a 
stock Windows 95 Rev B machine with Internet Explorer 5:

* Click File, Open
* Click Browse
* Change the "Files of type" setting to "Text Files"
* Select the file
* Click "Open as Web folder"
* Click OK
* Click Yes when asked if you want to see the default view

Using either Courier New or Lucida Console, the two monospace fonts 
available in Windows 95 for this purpose, the only characters in the 
current Registry which would not display were U+02BB, in the description 
for script Ethi, and U+04D8 and U+04D9, in the description for variant 
"baku1926".  In the 4645bis Registry we will probably also add U+02BC in 
the description for language "gwi", to match ISO 639-3.  That is not 
evidence of wholesale unreadability to Windows 95 users.
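
For anyone who wants to see which characters these are, the following 
sketch looks up their names, assuming Python 3 and the Unicode data 
shipped in its unicodedata module.  The helper name describe is made up.

```python
import unicodedata

# The three code points that failed to display, plus the one that may
# be added for "gwi".
code_points = [0x02BB, 0x04D8, 0x04D9, 0x02BC]

def describe(cp):
    """Return 'U+XXXX NAME' for a code point."""
    return f"U+{cp:04X} {unicodedata.name(chr(cp))}"

for cp in code_points:
    print(describe(cp))
```

This identifies the characters only; it says nothing about whether a 
given font contains glyphs for them.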

I will note that I couldn't open the UTF-8 Registry at langtag.net in 
this way, because Stéphane named it language-subtag-registry.utf8.  Many 
operating systems including Windows, and many browsers including IE, 
know what to do with files with a .txt extension but have no idea what 
to do with a .utf8 file.  I suggest changing the naming convention for 
UTF-8 files on langtag.net from "something.utf8" or "something.txt.utf8" 
to "something.utf8.txt" to take advantage of this.
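
The suggested renaming is simple enough to script.  This is only a 
sketch, assuming Python 3; to_txt_name is a hypothetical helper, and how 
files are actually named on langtag.net is up to its maintainer.

```python
def to_txt_name(name):
    """Map 'x.utf8' or 'x.txt.utf8' to 'x.utf8.txt'; leave others alone."""
    suffix = ".utf8"
    if not name.endswith(suffix):
        return name
    stem = name[: -len(suffix)]
    # Drop an inner ".txt" so "x.txt.utf8" does not become "x.txt.utf8.txt".
    if stem.endswith(".txt"):
        stem = stem[: -len(".txt")]
    return stem + ".utf8.txt"

for old in ("language-subtag-registry.utf8", "something.txt.utf8"):
    print(old, "->", to_txt_name(old))
```

A name that already ends in .utf8.txt is returned unchanged, so the 
function can be run over a directory more than once without harm.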

> So just make sure to link to the fonts, viewers, etc. when you put up 
> the utf-8 version and keep an ansi or ascii version unofficially.

One more time:  We will not be the only organization out there with a 
UTF-8 file, and we are not responsible for tutoring the remaining part 
of the world that doesn't know how to work with UTF-8.  There are many 
useful "Introduction to UTF-8" sites that can help.  I can see posting a 
"reduced value" hex-NCR version of the Registry, and I can see posting 
an informative link to a tutorial site, but I cannot see making this a 
primary activity of the language-tagging group.
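
As an illustration of what a hex-NCR version could look like, here is a 
minimal sketch assuming Python 3.  The function name to_hex_ncr is made 
up, and the exact NCR format for any such Registry copy has not been 
decided; this simply writes every non-ASCII character as &#xHHHH;.

```python
import html

def to_hex_ncr(text):
    """Keep ASCII as-is; write every other character as a hex NCR."""
    return "".join(
        ch if ord(ch) < 0x80 else f"&#x{ord(ch):X};" for ch in text
    )

original = "Aru\u00e1"          # "Aruá"
escaped = to_hex_ncr(original)
print(escaped)

# A reader's tool can recover the original from the ASCII-only form.
print(html.unescape(escaped) == original)
```

The result is plain ASCII, which is why it would open anywhere, and also 
why it is "reduced value": the reader sees escape sequences rather than 
the characters themselves.  The round trip through html.unescape is 
exact only when the input contains no literal "&" sequences of its own.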

--
Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages