Solving the UTF-8 problem

Doug Ewell dewell at roadrunner.com
Mon Jul 16 16:36:13 CEST 2007


Stephane Bortzmeyer <bortzmeyer at nic dot fr> wrote:

>> I suggest changing the naming convention for UTF-8 files on 
>> langtag.net from "something.utf8" or "something.txt.utf8" to 
>> "something.utf8.txt" to take advantage of this.
>
> I hesitate here because the www.langtag.net Web site sends the proper 
> file type:
>
>  Content-Type: text/plain; charset=utf-8

The HTTP header wasn't at issue here; the filename extension was.

> If IE does not understand that the file is plain text encoded in 
> UTF-8, it is broken.
>
> Using the file extension to find its type is both non-standard (why 
> ".txt" instead of ".text"? Where is the registry of file extensions?) 
> and quite old-fashioned.

Broken or not, archaic and quaint or not, this is one of the ways 
operating systems have to tell what a file is.  Some systems store this 
information in a separate meta-database, which wouldn't be of much use 
for a hitherto-unknown file pulled off the Internet.  Classic Unix 
systems might read the first few bytes and try to figure it out from 
there.  Windows systems happen to use the extension.
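
To make the contrast concrete, here is a rough Python sketch of the two 
approaches.  The extension table and both function names are made up 
for illustration; no real system works from a list this small, and the 
content check is deliberately naive.

```python
import os

# Hypothetical extension table, standing in for an OS-level association list.
KNOWN_EXTENSIONS = {
    ".txt": "text/plain",
    ".htm": "text/html",
    ".html": "text/html",
}


def type_from_extension(filename):
    """Look only at the final extension, as an extension-based system would."""
    ext = os.path.splitext(filename)[1].lower()
    return KNOWN_EXTENSIONS.get(ext)  # None means "unknown file type"


def type_from_content(data):
    """Guess from the bytes themselves, ignoring the file name entirely."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "text/plain; charset=utf-8"  # UTF-8 byte order mark
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None  # not valid UTF-8, so make no claim
    return "text/plain; charset=utf-8"


if __name__ == "__main__":
    for name in ("something.utf8", "something.txt.utf8", "something.utf8.txt"):
        print(name, "->", type_from_extension(name))
```

Under these assumptions only the name ending in ".txt" is recognized by 
the extension lookup, while the content check gives the same answer no 
matter what the file is called.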

It's easy enough for a knowledgeable user to tell Windows that "utf8" is 
an extension for plain-text files by editing the Windows registry 
(answering your question about where the registry of extensions is).  As 
for "txt", this is indeed the de-facto standard for text files under 
Windows (and not "text"), going back to the MS-DOS days of 3-letter 
extensions.  We can try to change this convention, but I submit that is 
not our job here on ietf-languages.
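
For the curious, the registry edit amounts to pointing the extension at 
the file class that ".txt" already uses.  The Python sketch below only 
builds the text of a .reg file and touches nothing on the machine.  The 
function name is hypothetical, and the "txtfile" class name and the 
"Content Type" value are my assumptions about a typical Windows 
installation, so check them against your own registry before importing 
anything.

```python
def reg_fragment(extension, file_class="txtfile"):
    """Build REGEDIT4-format text that points an extension at a file class.

    "txtfile" is assumed to be the class that ".txt" maps to; this is
    typical but not guaranteed on every installation.
    """
    if not extension.startswith("."):
        extension = "." + extension
    lines = [
        "REGEDIT4",
        "",
        "[HKEY_CLASSES_ROOT\\" + extension + "]",
        '@="' + file_class + '"',
        '"Content Type"="text/plain"',
        "",
    ]
    # .reg files conventionally use CRLF line endings.
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(reg_fragment("utf8"))
```

This is exactly the sort of step a knowledgeable user can take and a 
casual one cannot, which is why the file name matters.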

Jeremy Carroll <jjc at hpl dot hp dot com> wrote:

> My copy of IE seems happy enough
>
> Version 7.0.5730.11CO
>
> on
>
> OS Name Microsoft Windows XP Professional
> Version 5.1.2600 Service Pack 2 Build 2600

Yes, my platform at home is basically the same, and it works fine there.

> I would suggest listing a few known-to-work configurations, somewhere, 
> ... and as long as the behaviour is standards conformant, and does 
> work on enough platforms, then that's fit for purpose.

"txt" would work on both old and new systems.

> This is not a site intended for the general public requiring some 
> solution for old, old browsers. If it were, they would be a server 
> side solution by looking at the user agent information passed with the 
> http request, and returning say the ascii only version or redirecting 
> to a URL with a .txt extension or whatever. But that is not what this 
> page is for!

I invite Stephane and Jeremy to hash this out with CE Whitehead, who 
says the Registry needs to be compatible with any system that might be 
used "anywhere on earth."  I've suggested a minor file naming change 
that would allow him to read UTF-8 on a Windows 95 machine, and would 
not break on newer equipment.  Come on, guys, this really does not need 
to become an OS war.
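
In case it helps whoever maintains the files, the renaming I have in 
mind is mechanical.  Here is a small Python sketch; the function name 
is invented, and it assumes the only two patterns in use are the ones 
mentioned above.

```python
def proposed_name(filename):
    """Map "x.utf8" or "x.txt.utf8" to "x.utf8.txt"; leave other names alone."""
    suffix = ".utf8"
    if not filename.endswith(suffix):
        return filename  # already renamed, or not a UTF-8 file at all
    stem = filename[: -len(suffix)]
    if stem.endswith(".txt"):
        stem = stem[: -len(".txt")]  # drop the inner ".txt" so it is not doubled
    return stem + ".utf8.txt"


if __name__ == "__main__":
    for name in ("something.utf8", "something.txt.utf8", "something.utf8.txt"):
        print(name, "->", proposed_name(name))
```

Names that already end in ".utf8.txt" pass through unchanged, so 
running it twice does no harm.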

--
Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages
