A proposed solution for descriptions
Doug Ewell
dewell at adelphia.net
Sun Jun 18 06:39:52 CEST 2006
Here's my summary of the discussion so far concerning ASCII fallbacks
for Description fields, and splitting compound Description fields such
as "Foo (Bar)" into two separate fields "Foo" and "Bar". I'm delighted
that so many people have become involved in this discussion. Naturally
I will inject my opinion where it is strong.
Please let me know (gently) if I misrepresent anyone's position.
1. Whether and how to represent non-ASCII
Vidar Larsen would prefer not to have any "dumbed-down" ASCII-only
variants, but only if the Registry were in UTF-8 and/or XML or a similar
format.
I would have preferred for the Registry to be encoded directly in UTF-8,
instead of using these escape sequences, but this was not negotiable as
part of having an IANA registry. We did discuss XML in the LTRU Working
Group and decided the overhead necessary to parse XML was not justified,
compared to record-jar or other text formats.
Ciarán Ó Duibhín pointed out that we need the "dumbed-down" descriptions
for searching, since no search tools are capable of searching the hex
NCRs. This was also Richard Ishida's point.
Mark Crispin said that the entire premise of using hex NCRs in the
Registry was wrongly conceived, and if IANA (or IETF) limits us to ASCII
then we should have stayed with ASCII, and not attempted to represent
non-ASCII characters using the hex-NCR kludge or any other kludge.
Several, most notably Michael Everson, disagree and feel it is important
to include non-ASCII, even in the case of apostrophes where little or no
confusion will result (John Cowan disagreed about the apostrophes).
We already have non-ASCII, represented by hex NCRs, in the Registry. I
don't support making any structural changes in this regard at present.
We can talk about it when the time comes to draft an ISO 639-3-enabled
successor, although changing the format would cause compatibility
problems. In the meantime, anyone is free to convert the Registry into
any format they like, so long as the content remains intact and the
converted version is not represented as "official."
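For anyone who wants to try such an unofficial conversion, or simply to make the Description fields searchable, here is a minimal sketch in Python. It assumes the escapes take the form `&#xHHHH;` (hexadecimal digits between `&#x` and `;`); the function name `decode_hex_ncrs` is hypothetical and chosen only for this illustration.

```python
import re

# Assumption: escapes look like "&#xE5;" (hex digits between "&#x" and ";").
_HEX_NCR = re.compile(r"&#x([0-9A-Fa-f]+);")


def decode_hex_ncrs(text: str) -> str:
    """Replace each hex NCR in text with the character it denotes."""
    return _HEX_NCR.sub(lambda m: chr(int(m.group(1), 16)), text)


if __name__ == "__main__":
    print(decode_hex_ncrs("Norwegian Bokm&#xE5;l"))  # Norwegian Bokmål
    print(decode_hex_ncrs("Volap&#xFC;k"))           # Volapük
```

Text without any escapes passes through unchanged, so the same function can be applied to every field of a record.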
2. Translation of descriptions, and alternative names
Debbie Garside suggested the inclusion of "all 'known names'/alternative
names" as descriptions, but only in the sense that draft standards such
as ISO 639-3 and 639-6 define a "known name" which serves as an ISO
11179 Unique Identifier, plus zero or more "alternative names." This is
not a call for free registration of any imaginable name or nickname,
such as "Down Under" for Australia.
Kent Karlsson expressed a preference for "Book Norwegian" instead of the
ASCII-folded "Norwegian Bokmal", and "New Norwegian" as an additional
alias for "Norwegian Nynorsk". Several people disagreed that the ASCII
version of the Description should be an English translation, especially
in a case like this where no English speaker would use the translated
name to refer to the entity.
Kent also suggested "Ivory Coast" as the ASCII fallback for "Côte d’Ivoire".
While this is, once again, not an ASCII fallback but an English
translation, Peter Constable pointed out that "Ivory Coast" has both
currency and historical usage. (For example, The Times of London and
the New York Times both use "Ivory Coast", which I did not know.) This
might make a reasonable alternative description, and it can be proposed
as such at any time (even now), but we should try to avoid confusing
this with the issue of ASCII fallbacks.
3. Transcription between ASCII and non-ASCII
Keld Jørn Simonsen preferred to transcribe "Bokmål" as "Bokmaal" instead
of "Bokmal", and Kent preferred to transcribe "Volapük" as "Volapyk"
instead of "Volapuk".
While these may be reasonable spellings to speakers of particular
languages, I do not think we want to get into the business of doing the
in-depth research to provide language-dependent transliterations.
(Think of the numerous romanizations of Чайковский.) These are not
intended to be linguistically correct spellings, merely ASCII fallbacks
that make life easier for typists. To that end, we would do well to
follow the practice of the UN Economic Commission for Europe in
providing ASCII fallback names for UN/LOCODE:
"Place names are given, whenever possible, in their national language
versions as expressed in the Roman alphabet using the 26 characters of
the character set adopted for international trade data interchange, with
diacritic signs, when practicable. Diacritic signs may be ignored, and
should not be converted into additional characters (e.g., Göteborg may
be read as Goteborg, rather than Goeteborg, Gothenburg, Gotembourg,
etc.), in order to facilitate reproduction in the national language."
In other words, just drop the diacritic, please. Bokmål becomes Bokmal,
Volapük becomes Volapuk. If we do not do this, we will never reach
agreement on how to derive fallbacks.
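The "just drop the diacritic" rule can be sketched mechanically. The following is only an illustration, not a proposed official procedure: it decomposes each character with Unicode NFD and discards the combining marks. The function name `ascii_fallback` is hypothetical. Note that characters with no canonical decomposition (for example "ø", or the typographic apostrophe in "Côte d’Ivoire") are left untouched by this approach and would still need a manual decision.

```python
import unicodedata


def ascii_fallback(name: str) -> str:
    """Strip diacritics by decomposing (NFD) and dropping combining marks.

    Characters without a canonical decomposition are returned unchanged,
    so the result is not guaranteed to be pure ASCII.
    """
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for name in ("Bokmål", "Volapük", "Göteborg", "Åland", "Provençal", "Réunion"):
        print(name, "->", ascii_fallback(name))
```

This yields Bokmal, Volapuk, Goteborg, Aland, Provencal and Reunion, with no language-dependent choices such as "aa" or "oe".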
Kent objected strongly to "Aland Islands", saying that "Oland" would be a
better phonetic match for "Åland" than "Aland", and that "Islands" is
not part of the name. Obviously I disagree on both counts; "Aland" is
the product of diacritic-stripping, which I consider preferable to
language-dependent transliteration, and "Islands" is indeed part of the
ISO 3166 name.
Kent also objected to "Provencal", "Provencal, Old (to 1500)", and
"Reunion", but provided no alternative ASCII fallbacks for any of these.
It's important to keep in mind that when we start talking about ISO
639-3, there are some pairs of language names that differ only in
diacritical marks. For example, Arua and Aruá are two different
languages. In a case like this, we will not want to provide an ASCII
fallback of any sort for Aruá, because that would give us two languages
with the same name.
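A check for this kind of collision could look like the sketch below. It is an assumption-laden illustration: the helper names `strip_diacritics` and `safe_fallbacks` are hypothetical, the input is assumed to be a list of distinct names, and a fallback is suppressed (reported as None) whenever the stripped form coincides with another name in the list or with another name's stripped form.

```python
import unicodedata
from collections import Counter


def strip_diacritics(name: str) -> str:
    """Decompose with NFD and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def safe_fallbacks(names):
    """Map each name that changes under stripping to its fallback.

    The value is None when the stripped form is shared with any other
    name in the list, since adding it would create a duplicate name.
    """
    stripped = {name: strip_diacritics(name) for name in names}
    counts = Counter(stripped.values())
    return {
        name: (fallback if counts[fallback] == 1 else None)
        for name, fallback in stripped.items()
        if fallback != name
    }


if __name__ == "__main__":
    print(safe_fallbacks(["Arua", "Aruá", "Volapük"]))
    # {'Aruá': None, 'Volapük': 'Volapuk'}
```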
4. Splitting compound names
Kent objected to splitting "Slave (Athapascan)" into two separate
Descriptions, and as I stated earlier, I agree entirely and withdraw
this suggestion; it was my mistake. We will have many more instances of
this usage in ISO 639-3 and will need to be careful. In general, ISO
639 indicates alternative names with a semicolon ("Spanish; Castilian"),
not with parentheses as ISO 3166 and 15924 do.
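To make the distinction concrete, here is a rough sketch of a splitter. It is hypothetical throughout: the function name `split_description` and the `keep_whole` parameter are inventions for this illustration. It treats a semicolon as separating alternative names, splits a trailing parenthesized part into a second name, and leaves alone any description listed in `keep_whole`. That last set has to be maintained by hand, because nothing in the text itself says whether the parentheses mark an alternative name or a qualifier.

```python
import re

# Matches "Main (Alt)" where the parenthesized part ends the string.
_PAREN = re.compile(r"^(?P<main>.+?)\s*\((?P<alt>[^()]+)\)$")


def split_description(description, keep_whole=frozenset()):
    """Split a compound description into a list of separate names.

    keep_whole: descriptions whose parentheses are qualifiers, not
    alternative names, and which must therefore not be split.
    """
    if description in keep_whole:
        return [description]
    if ";" in description:
        return [part.strip() for part in description.split(";") if part.strip()]
    match = _PAREN.match(description)
    if match:
        return [match.group("main"), match.group("alt")]
    return [description]


if __name__ == "__main__":
    print(split_description("Spanish; Castilian"))
    print(split_description("Falkland Islands (Malvinas)"))
    print(split_description("Slave (Athapascan)",
                            keep_whole={"Slave (Athapascan)"}))
```

Without an entry in `keep_whole`, "Slave (Athapascan)" would be split incorrectly, which is exactly the mistake described above.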
Michael Everson asked not to split "Falkland Islands (Malvinas)" and
"Holy See (Vatican City State)" since they have not been shown to cause
confusion as is. While I would prefer to treat this multiple-name
situation consistently, regardless of the type of subtag, I don't plan
to fight hard over it; other issues are more important to me.
Nobody seemed to have any objection to the other splits (e.g.
Han/Hanzi/Kanji/Hanja).
I'll wait a few days for responses, then post a revised set of proposed
modification forms.
--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/