Solving the UTF-8 problem
Doug Ewell
dewell at roadrunner.com
Mon Jul 2 00:58:48 CEST 2007
This is intentionally cross-posted to LTRU and ietf-languages, since it
deals with both implementation policy and proposed changes to RFC 4646bis
and 4645bis.
CE Whitehead <cewcathar at hotmail dot com> wrote on ietf-languages:
> I want to update the 1694acad comments field to include a transliteration
> into Basic Latin (also--perhaps???--to fix the inconsistency as 4eme is
> missing the accent grave on the e!! :
>
> Comments: 17th century French, as catalogued in the "Dictionnaire de
> l'académie françoise" ("l'academie francoise"), 4ème (4eme)
> ed. 1694; frequently includes elements of Middle French, as this is a
> transitional period.
I really, really don't like the direction this is headed. Ultimately we
will find ourselves having to provide duplicate Description and Comments
content for every non-ASCII character in the Language Subtag Registry,
removing most of the advantage of being able to represent non-ASCII in the
first place.
What are we going to do when the ISO 639-3 code list is finalized and we
have to deal with adding the following pairs of languages, whose names
differ only by diacritical marks?
aru Arua
arx Aruá
bfa Bari
mot Barí
kgm Karipúna
kuq Karipuná
sbe Saliba
slc Sáliba
wbf Wara
tci Wára
Are we going to include an ASCII version of every name that contains an
accented letter? There are several hundred in ISO 639-3. As CE has shown
above, we are already strongly considering adding duplicative Comments
information to get around our own technical limitations. Will we also have
to include elaborate Comments fields distinguishing the "real" Arua from the
one we invented by lopping off the accent from Aruá?
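To make the collision concrete, here is a short Python sketch (my own illustration, not part of any registry tooling) showing that mechanically stripping diacritics from the names above collapses distinct languages onto the same ASCII string:

```python
# Illustration only: decompose each name to NFD and drop the combining
# marks, which is what "lopping off the accent" amounts to.
import unicodedata

def strip_diacritics(name: str) -> str:
    """Return the name with all combining marks removed."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

pairs = {
    "aru": "Arua",   "arx": "Aru\u00e1",
    "bfa": "Bari",   "mot": "Bar\u00ed",
    "sbe": "Saliba", "slc": "S\u00e1liba",
}

for code, name in pairs.items():
    print(code, name, "->", strip_diacritics(name))
# 'Aruá' (arx) comes out as plain 'Arua', identical to aru
```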
Section 3.1 mentions transcription of non-Latin Description fields into the
Latin script. It does not talk about providing a pure-ASCII equivalent for
every non-ASCII French- or Spanish-language string, and I don't believe that
was the WG's intention. Transcriptions are useful when the content is in
Arabic or Cyrillic or Han, to make the material available to
Latin-script-only readers. Providing "transcriptions" like "4&#xE8;me
(4eme)" merely announces to the world that we can't solve our own technical
character-encoding problems without resorting to unwieldy kludges.
We really need to take another hard, serious look at maintaining the
Registry in UTF-8. The current scheme is one of the biggest sources of
misunderstanding that newcomers have about the Registry, and one of the
biggest bones of contention among regular list participants. We need to
consider this soon, before the release of the "final" ISO 639-3 data moves
RFC 4645bis onto the LTRU critical path, because the way it is decided will
have huge implications for RFC 4645bis. And we need to consider it WITHOUT
dragging in side discussions like XML.
As far as I can tell, the objections to moving the Registry to UTF-8 are as
follows:
1. UTF-8 doesn't play well with e-mail, which is invaluable for discussing
changes on the ietf-languages list and sending the changes to IANA (stated
by several).
Clearly there are problems with the diversity of character encodings as
e-mail travels from a sender's machine across the Internet and onto each
recipient's client, or gets processed into Web archive pages. Even on the
Unicode mailing list and mail archive pages, where you might think the
problem would be solved, you can see numerous examples of this. This isn't
limited to UTF-8; it is not uncommon to see Latin-1 or (especially)
Windows-1252 get corrupted or displayed incorrectly.
As Ira McDonald suggested, it would make sense to conduct all preliminary
discussion using escape sequences or some other mechanism:
"(Note: the name 'Aruá' is actually spelled 'A, r, u, a-with-acute'.)"
but then send the final change to IANA in UTF-8 so they could simply drop it
into the Registry with little or no editing, as they do now. We would have
to figure out some way for the list to confirm that the changes seen on the
list are those that are sent to IANA, with no alteration other than the
recoding to UTF-8. This should not be difficult. I am willing to do whatever
is deemed necessary on my part to make this work. We need to figure out how
to make it work, instead of using it as a reason not to adopt UTF-8.
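The recoding step itself is trivial; as a sketch (mine, and assuming the list traffic uses hex NCRs of the "&#xE8;" form), the final change could be expanded and encoded like this before it goes to IANA:

```python
# Illustration only: expand hex NCRs to real characters, then encode
# the record as UTF-8 so IANA can drop it into the Registry unchanged.
import re

NCR = re.compile(r"&#x([0-9A-Fa-f]+);")

def ncr_to_utf8(text: str) -> bytes:
    """Replace each hex NCR with its character and encode as UTF-8."""
    expanded = NCR.sub(lambda m: chr(int(m.group(1), 16)), text)
    return expanded.encode("utf-8")

record = 'Comments: ... "Dictionnaire de l\'acad&#xE9;mie fran&#xE7;oise", 4&#xE8;me ed. 1694'
print(ncr_to_utf8(record))
```

The same function run in reverse order (decode, then re-escape) would let anyone confirm that what IANA published matches what the list approved.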
2. Converting the Registry to UTF-8 would break existing implementations
that expect hex NCRs (Addison Phillips).
Addison is correct; any structural change to the Registry will break RFC
4646-conformant processors. This is true not only for UTF-8, but also for
new fields such as "Macrolanguage" or "Modified." (Section 3.1 says the
Type "MUST" be one of the seven currently defined values.)
If there are implementations that read and interpret the Registry and will
choke on non-ASCII input, whose authors are not on one of these mailing
lists, then we need to get the word out that the format may change, just as
we would if we decide to add new field values. Personally I doubt there are
large numbers of such implementations.
3. UTF-8 can't be read on some, especially older, computer systems (Frank
Ellermann, months ago, and CE Whitehead).
With the continuing adoption of Unicode by OS and software vendors, I really
can't get behind this argument. It simply isn't appropriate to "dumb down"
all computerized text to match the least capable systems that might be
running somewhere. This is especially true considering the language names
listed above. We don't restrict text to uppercase to maintain compatibility
with BCDIC and Sinclair ZX81 systems.
A Windows system running Internet Explorer 4.0 or above can display a local
text file as UTF-8. According to Wikipedia, of the 78% of Windows machines
that run IE, fewer than 1% are running a version lower than 6.0. Support
for UTF-8 is probably the same, if not better, for non-Microsoft browsers;
see http://www.alanwood.net/unicode/browsers.html for more information.
In thinking about "display," we should step back and remember why we have a
Language Subtag Registry at all. It exists to support BCP 47 language
tagging by providing a complete list of all subtags that can be used to form
a language tag, plus all grandfathered tags that can be used on their own.
It provides some additional information, such as comments, to help tag
producers and consumers make tagging decisions, but it is not intended to be
a general compendium of language information, meant for casual browsing.
Another possibility is to have IANA post an official version of the Registry
in one encoding, such as UTF-8, and additional, unofficial versions in other
encodings, such as Latin-1 or hex NCRs. This is the approach chosen by the
ISO 639-3 Registration Authority. Potential problems with this approach are
unintentional mismatches between the versions (I caught one of these
problems for the ISO 639-3 people recently) and a perception that the
"simplified" version is actually the official one.
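The mismatch risk is easy to guard against if the unofficial versions are derived mechanically rather than maintained by hand. A sketch (my own; the record strings are hypothetical) of such a consistency check:

```python
# Illustration only: derive the ASCII/NCR view from the UTF-8 master
# and compare it with a separately published copy, so mismatches like
# the one mentioned above are caught automatically.
def to_hex_ncrs(text: str) -> str:
    """Escape every non-ASCII character as a hex NCR, e.g. è -> &#xE8;."""
    return "".join(c if ord(c) < 128 else f"&#x{ord(c):X};" for c in text)

def versions_match(utf8_master: str, ascii_copy: str) -> bool:
    """True if the published ASCII copy is exactly the derived view."""
    return to_hex_ncrs(utf8_master) == ascii_copy

master = "Description: S\u00e1liba"
print(versions_match(master, "Description: S&#xE1;liba"))  # consistent
print(versions_match(master, "Description: Saliba"))       # mismatch
```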
My suggestions, very simply, are:
* For LTRU, to amend RFC 4646bis to change the format of the Registry to
UTF-8, and to work out the details such as compatibility with existing RFC
4646 processors and avoidance of UTF-8 in e-mails.
* For ietf-languages, to impose a moratorium on changes to Description and
Comments fields whose only purpose is to transcribe hex NCRs to ASCII, until
the matter is resolved within LTRU.
--
Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages