A proposed solution for descriptions

Sun Jun 18 06:39:52 CEST 2006

Here's my summary of the discussion so far concerning ASCII fallbacks 
for Description fields, and splitting compound Description fields such 
as "Foo (Bar)" into two separate fields "Foo" and "Bar".  I'm delighted 
that so many people have become involved in this discussion.  Naturally 
I will inject my opinion where it is strong.

Please let me know (gently) if I misrepresent anyone's position.

1.  Whether and how to represent non-ASCII

Vidar Larsen would prefer not to have any "dumbed-down" ASCII-only 
variants, but only if the Registry were in UTF-8 and/or XML or a similar 
format.

I would have preferred for the Registry to be encoded directly in UTF-8, 
instead of using these escape sequences, but this was not negotiable as 
part of having an IANA registry.  We did discuss XML in the LTRU Working 
Group and decided the overhead necessary to parse XML was not justified, 
compared to record-jar or other text formats.

Ciarán Ó Duibhín pointed out that we need the "dumbed-down" descriptions 
for searching, since no search tools are capable of searching the hex 
NCRs.  This was also Richard Ishida's point.
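
To illustrate the searching problem, here is a minimal sketch of how a 
tool could decode the hex NCRs before matching.  It assumes the escapes 
take the form "&#xHHHH;"; the function names decode_hex_ncrs and matches 
are hypothetical and are not part of any existing Registry tool.

```python
import re

# Matches hexadecimal numeric character references such as &#xE5;
# (assumed escape form; adjust the pattern if the data differs).
HEX_NCR = re.compile(r"&#x([0-9A-Fa-f]+);")


def decode_hex_ncrs(text):
    """Replace each hex NCR with the character it denotes."""
    return HEX_NCR.sub(lambda m: chr(int(m.group(1), 16)), text)


def matches(description, query):
    """Case-insensitive substring search against the decoded description."""
    return query.casefold() in decode_hex_ncrs(description).casefold()


if __name__ == "__main__":
    print(decode_hex_ncrs("Norwegian Bokm&#xE5;l"))
```

A search tool that does not perform this decoding step only ever sees 
the escaped text, which is why the ASCII descriptions remain useful.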

Mark Crispin said that the entire premise of using hex NCRs in the 
Registry was wrongly conceived, and if IANA (or IETF) limits us to ASCII 
then we should have stayed with ASCII, and not attempted to represent 
non-ASCII characters using the hex-NCR kludge or any other kludge. 
Several people, most notably Michael Everson, disagreed and felt it was 
important to include non-ASCII, even in the case of apostrophes where 
little or no confusion would result (John Cowan disagreed about the 
apostrophes).

We already have non-ASCII, represented by hex NCRs, in the Registry.  I 
don't support making any structural changes in this regard at present. 
We can talk about it when the time comes to draft an ISO 639-3-enabled 
successor, although changing the format would cause compatibility 
problems.  In the meantime, anyone is free to convert the Registry into 
any format they like, so long as the content remains intact and the 
converted version is not represented as "official."

2.  Translation of descriptions, and alternative names

Debbie Garside suggested the inclusion of "all 'known names'/alternative 
names" as descriptions, but only in the sense that draft standards such 
as ISO 639-3 and 639-6 define a "known name" which serves as an ISO 
11179 Unique Identifier, plus zero or more "alternative names."  This is 
not a call for free registration of any imaginable name or nickname, 
such as "Down Under" for Australia.

Kent Karlsson expressed a preference for "Book Norwegian" instead of the 
ASCII-folded "Norwegian Bokmal", and "New Norwegian" as an additional 
alias for "Norwegian Nynorsk".  Several people disagreed that the ASCII 
version of the Description should be an English translation, especially 
in a case like this where no English speaker would use the translated 
name to refer to the entity.

Kent also suggested "Ivory Coast" as the ASCII fallback for "Côte d’Ivoire". 
While this is, once again, not an ASCII fallback but an English 
translation, Peter Constable pointed out that "Ivory Coast" has both 
currency and historical usage.  (For example, The Times of London and 
the New York Times both use "Ivory Coast", which I did not know.)  This 
might make a reasonable alternative description, and it can be proposed 
as such at any time (even now), but we should try to avoid confusing 
this with the issue of ASCII fallbacks.

3.  Transcription between ASCII and non-ASCII

Keld Jørn Simonsen preferred to transcribe "Bokmål" as "Bokmaal" instead 
of "Bokmal", and Kent preferred to transcribe "Volapük" as "Volapyk" 
instead of "Volapuk".

While these may be reasonable spellings to speakers of particular 
languages, I do not think we want to get into the business of doing the 
in-depth research to provide language-dependent transliterations. 
(Think of the numerous romanizations of Чайковский.)  These are not 
intended to be linguistically correct spellings, merely ASCII fallbacks 
that make life easier for typists.  To that end, we would do well to 
follow the practice of the UN Economic Commission for Europe in 
providing ASCII fallback names for UN/LOCODE:

"Place names are given, whenever possible, in their national language 
versions as expressed in the Roman alphabet using the 26 characters of 
the character set adopted for international trade data interchange, with 
diacritic signs, when practicable. Diacritic signs may be ignored, and 
should not be converted into additional characters (e.g., Göteborg may 
be read as Goteborg, rather than Goeteborg, Gothenburg, Gotembourg, 
etc.), in order to facilitate reproduction in the national language."

In other words, just drop the diacritic, please.  Bokmål becomes Bokmal, 
Volapük becomes Volapuk.  If we do not do this, we will never reach 
agreement on how to derive fallbacks.
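
As a rough sketch of what "just drop the diacritic" could look like 
mechanically, the following uses Unicode canonical decomposition (NFD) 
and discards the combining marks.  The function name strip_diacritics is 
hypothetical, and this is only one possible way to do it.  Characters 
with no canonical decomposition (for example "ø" or a typographic 
apostrophe) pass through unchanged and would still need manual handling.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose to NFD, then drop combining marks (the diacritics)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for name in ("Bokmål", "Volapük", "Göteborg"):
        print(name, "->", strip_diacritics(name))
```

No language-specific knowledge is involved: "å" never becomes "aa" and 
"ü" never becomes "y" or "ue".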

Kent objected strongly to "Aland Islands", saying that "Oland" would be a 
better phonetic match for "Åland" than "Aland", and that "Islands" is 
not part of the name.  Obviously I disagree on both counts; "Aland" is 
the product of diacritic-stripping, which I consider preferable to 
language-dependent transliteration, and "Islands" is indeed part of the 
ISO 3166 name.

Kent also objected to "Provencal", "Provencal, Old (to 1500)", and 
"Reunion", but provided no alternative ASCII fallbacks for any of these.

It's important to keep in mind that when we start talking about ISO 
639-3, there are some pairs of language names that differ only in 
diacritical marks.  For example, Arua and Aruá are two different 
languages.  In a case like this, we will not want to provide an ASCII 
fallback of any sort for Aruá, because that would give us two languages 
with the same name.
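
A check for this kind of collision could be sketched as follows.  The 
helper names strip_diacritics and safe_fallbacks are hypothetical, and 
the name list is just the example above plus one unambiguous name; this 
is an illustration under those assumptions, not a proposed procedure.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose to NFD, then drop combining marks (the diacritics)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def safe_fallbacks(names):
    """Map each name that stripping changes to its fallback, or to None
    when that fallback would coincide with another name in the list."""
    producers = {}  # stripped form -> set of names that yield it
    for name in names:
        producers.setdefault(strip_diacritics(name), set()).add(name)

    result = {}
    for name in names:
        fallback = strip_diacritics(name)
        if fallback == name:
            continue  # stripping changes nothing, so no fallback to add
        result[name] = fallback if len(producers[fallback]) == 1 else None
    return result


if __name__ == "__main__":
    print(safe_fallbacks(["Arua", "Aruá", "Volapük"]))
```

Here Aruá gets no fallback because "Arua" is already the name of another 
language, while Volapük safely falls back to "Volapuk".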

4.  Splitting compound names

Kent objected to splitting "Slave (Athapascan)" into two separate 
Descriptions, and as I stated earlier, I agree entirely and withdraw 
this suggestion; it was my mistake.  We will have many more instances of 
this usage in ISO 639-3 and will need to be careful.  In general, ISO 
639 indicates alternative names with a semicolon ("Spanish; Castilian"), 
not with parentheses as ISO 3166 and 15924 do.
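
For ISO 639 names, a cautious splitter might therefore act only on the 
semicolon and leave parenthesized qualifiers alone.  The sketch below 
assumes exactly that convention; split_alternatives is a hypothetical 
name, and it deliberately says nothing about ISO 3166 or 15924 names.

```python
def split_alternatives(description):
    """Split on semicolons only; parenthesized text stays attached."""
    return [part.strip() for part in description.split(";") if part.strip()]


if __name__ == "__main__":
    print(split_alternatives("Spanish; Castilian"))
    print(split_alternatives("Slave (Athapascan)"))
```

Under this rule "Spanish; Castilian" yields two names, while 
"Slave (Athapascan)" is returned as a single, unsplit description.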

Michael Everson asked not to split "Falkland Islands (Malvinas)" and 
"Holy See (Vatican City State)" since they have not been shown to cause 
confusion as is.  While I would prefer to treat this multiple-name 
situation consistently, regardless of the type of subtag, I don't plan 
to fight hard over it; other issues are more important to me.

Nobody seemed to have any objection to the other splits (e.g. 
Han/Hanzi/Kanji/Hanja).

I'll wait a few days for responses, then post a revised set of proposed 
modification forms.

--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/