Solving the UTF-8 problem
Doug Ewell
dewell at roadrunner.com
Mon Jul 2 00:58:48 CEST 2007
This is intentionally cross-posted to LTRU and ietf-languages, since it
deals with both implementation policy and proposed changes to RFC 4646bis
and 4645bis.
CE Whitehead <cewcathar at hotmail dot com> wrote on ietf-languages:
> I want to update the 1694acad comments field to include a transliteration
> into Basic Latin (also--perhaps???--to fix the inconsistency as 4eme is
> missing the accent grave on the e!! :
>
> Comments: 17th century French, as catalogued in the "Dictionnaire de
> l'académie françoise" ("l'academie francoise"), 4ème (4eme)
> ed. 1694; frequently includes elements of Middle French, as this is a
> transitional period.
I really, really don't like the direction this is headed. Ultimately we
will find ourselves having to provide duplicate Description and Comments
content for every non-ASCII character in the Language Subtag Registry,
removing most of the advantage of being able to represent non-ASCII in the
first place.
What are we going to do when the ISO 639-3 code list is finalized and we
have to deal with adding the following pairs of languages, whose names
differ only by diacritical marks?
aru Arua
arx Aruá
bfa Bari
mot Barí
kgm Karipúna
kuq Karipuná
sbe Saliba
slc Sáliba
wbf Wara
tci Wára
Are we going to include an ASCII version of every name that contains an
accented letter? There are several hundred in ISO 639-3. As CE has shown
above, we are already strongly considering adding duplicative Comments
information to get around our own technical limitations. Will we also have
to include elaborate Comments fields distinguishing the "real" Arua from the
one we invented by lopping off the accent from Aruá?
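To make the collision concrete, here is a short Python sketch (my own illustration, not part of any registry tooling) showing that mechanically stripping diacritics from the names above collapses distinct languages onto the same ASCII string:

```python
# Illustration only: decompose each name to NFD and drop the combining
# marks, which is what "lopping off the accent" amounts to.
import unicodedata

def strip_diacritics(name: str) -> str:
    """Return the name with all combining marks removed."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

pairs = {
    "aru": "Arua",   "arx": "Aru\u00e1",
    "bfa": "Bari",   "mot": "Bar\u00ed",
    "sbe": "Saliba", "slc": "S\u00e1liba",
}

for code, name in pairs.items():
    print(code, name, "->", strip_diacritics(name))
# 'Aruá' (arx) comes out as plain 'Arua', identical to aru
```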
Section 3.1 mentions transcription of non-Latin Description fields into the
Latin script. It does not talk about providing a pure-ASCII equivalent for
every non-ASCII French- or Spanish-language string, and I don't believe that
was the WG's intention. Transcriptions are useful when the content is in
Arabic or Cyrillic or Han, to make the material available to
Latin-script-only readers. Providing "transcriptions" like "4&#xE8;me
(4eme)" merely announces to the world that we can't solve our own technical
character-encoding problems without resorting to unwieldy kludges.
We really need to take another hard, serious look at maintaining the
Registry in UTF-8. The current scheme is one of the biggest sources of
misunderstanding that newcomers have about the Registry, and one of the
biggest bones of contention among regular list participants. We need to
consider this soon, before the release of the "final" ISO 639-3 data moves
RFC 4645bis onto the LTRU critical path, because the way it is decided will
have huge implications for RFC 4645bis. And we need to consider it WITHOUT
dragging in side discussions like XML.
As far as I can tell, the objections to moving the Registry to UTF-8 are as
follows:
1. UTF-8 doesn't play well with e-mail, which is invaluable for discussing
changes on the ietf-languages list and sending the changes to IANA (stated
by several).
Clearly there are problems with the diversity of character encodings as
e-mail travels from a sender's machine across the Internet and onto each
recipient's client, or gets processed into Web archive pages. Even on the
Unicode mailing list and mail archive pages, where you might think the
problem would be solved, you can see numerous examples of this. This isn't
limited to UTF-8; it is not uncommon to see Latin-1 or (especially)
Windows-1252 get corrupted or displayed incorrectly.
As Ira McDonald suggested, it would make sense to conduct all preliminary
discussion using escape sequences or some other mechanism:
"(Note: the name 'Aruá' is actually spelled 'A, r, u, a-with-acute'.)"
but then send the final change to IANA in UTF-8 so they could simply drop it
into the Registry with little or no editing, as they do now. We would have
to figure out some way for the list to confirm that the changes seen on the
list are those that are sent to IANA, with no alteration other than the
recoding to UTF-8. This should not be difficult. I am willing to do whatever
is deemed necessary on my part to make this work. We need to figure out how
to make it work, instead of using it as a reason not to adopt UTF-8.
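The recoding step itself is trivial; as a sketch (mine, and assuming the list traffic uses hex NCRs of the "&#xE8;" form), the final change could be expanded and encoded like this before it goes to IANA:

```python
# Illustration only: expand hex NCRs to real characters, then encode
# the record as UTF-8 so IANA can drop it into the Registry unchanged.
import re

NCR = re.compile(r"&#x([0-9A-Fa-f]+);")

def ncr_to_utf8(text: str) -> bytes:
    """Replace each hex NCR with its character and encode as UTF-8."""
    expanded = NCR.sub(lambda m: chr(int(m.group(1), 16)), text)
    return expanded.encode("utf-8")

record = 'Comments: ... "Dictionnaire de l\'acad&#xE9;mie fran&#xE7;oise", 4&#xE8;me ed. 1694'
print(ncr_to_utf8(record))
```

The same function run in reverse order (decode, then re-escape) would let anyone confirm that what IANA published matches what the list approved.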
2. Converting the Registry to UTF-8 would break existing implementations
that expect hex NCRs (Addison Phillips).
Addison is correct; any structural change to the Registry will break RFC
4646-conformant processors. This is true not only for UTF-8, but also for
new fields such as "Macrolanguage" or "Modified." (Section 3.1 says the
Type "MUST" be one of the seven currently defined values.)
If there are implementations that read and interpret the Registry and will
choke on non-ASCII input, whose authors are not on one of these mailing
lists, then we need to get the word out that the format may change, just as
we would if we decide to add new field values. Personally I doubt there are
large numbers of such implementations.
3. UTF-8 can't be read on some, especially older, computer systems (Frank
Ellermann, months ago, and CE Whitehead).
With the continuing adoption of Unicode by OS and software vendors, I really
can't get behind this argument. It simply isn't appropriate to "dumb down"
all computerized text to match the least capable systems that might be
running somewhere. This is especially true considering the language names
listed above. We don't restrict text to uppercase to maintain compatibility
with BCDIC and Sinclair ZX81 systems.
A Windows system running Internet Explorer 4.0 or above can display a local
text file as UTF-8. According to Wikipedia, of the 78% of Windows machines
that run IE, fewer than 1% are running a version lower than 6.0. Support
for UTF-8 is probably the same, if not better, for non-Microsoft browsers;
see http://www.alanwood.net/unicode/browsers.html for more information.
In thinking about "display," we should step back and remember why we have a
Language Subtag Registry at all. It exists to support BCP 47 language
tagging by providing a complete list of all subtags that can be used to form
a language tag, plus all grandfathered tags that can be used on their own.
It provides some additional information, such as comments, to help tag
producers and consumers make tagging decisions, but it is not intended to be
a general compendium of language information, meant for casual browsing.
Another possibility is to have IANA post an official version of the Registry
in one encoding, such as UTF-8, and additional, unofficial versions in other
encodings, such as Latin-1 or hex NCRs. This is the approach chosen by the
ISO 639-3 Registration Authority. Potential problems with this approach are
unintentional mismatches between the versions (I caught one of these
problems for the ISO 639-3 people recently) and a perception that the
"simplified" version is actually the official one.
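The mismatch risk is easy to guard against if the unofficial versions are derived mechanically rather than maintained by hand. A sketch (my own; the record strings are hypothetical) of such a consistency check:

```python
# Illustration only: derive the ASCII/NCR view from the UTF-8 master
# and compare it with a separately published copy, so mismatches like
# the one mentioned above are caught automatically.
def to_hex_ncrs(text: str) -> str:
    """Escape every non-ASCII character as a hex NCR, e.g. è -> &#xE8;."""
    return "".join(c if ord(c) < 128 else f"&#x{ord(c):X};" for c in text)

def versions_match(utf8_master: str, ascii_copy: str) -> bool:
    """True if the published ASCII copy is exactly the derived view."""
    return to_hex_ncrs(utf8_master) == ascii_copy

master = "Description: S\u00e1liba"
print(versions_match(master, "Description: S&#xE1;liba"))  # consistent
print(versions_match(master, "Description: Saliba"))       # mismatch
```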
My suggestions, very simply, are:
* For LTRU, to amend RFC 4646bis to change the format of the Registry to
UTF-8, and to work out the details such as compatibility with existing RFC
4646 processors and avoidance of UTF-8 in e-mails.
* For ietf-languages, to impose a moratorium on changes to Description and
Comments fields whose only purpose is to transcribe hex NCRs to ASCII, until
the matter is resolved within LTRU.
--
Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages