A proposed solution for descriptions

Sun Jun 18 20:29:55 CEST 2006

Doug Ewell wrote:

> Please let me know (gently) if I misrepresent anyone's position.

I would just like to say (gently) that my position is still slightly misrepresented.  Doug wrote:

> Debbie Garside suggested the inclusion of "all 'known 
> names'/alternative names" as descriptions, but only in the 
> sense that draft standards such as ISO 639-3 and 639-6 define 
> a "known name" which serves as an ISO
> 11179 Unique Identifier, plus zero or more "alternative 
> names."  This is not a call for free registration of any 
> imaginable name or nickname, such as "Down Under" for Australia.

What I actually said, or indeed meant, is that all names as represented within the underlying standards should be included in the registry in the EXACT format that they take within the standard.  This means that if a name is presented as Foo (Bar) in the underlying standard then it remains as Foo (Bar).  There are three reasons for this: one, consistency; two, very often the bracketed information acts as an additional qualifier; three, the name can be used by other systems as a Unique Identifier (ISO 11179).  From what I can see, additional names (within ISO 639-1/2) are delimited by ";" and these should be added as further descriptions.
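To illustrate, here is a minimal sketch in Python of that splitting rule, assuming the ";" delimiter described above.  The function name descriptions_from_iso_name is hypothetical and is not part of any standard or registry tooling.  It splits only on semicolons and leaves parenthesised text exactly as it stands.

```python
def descriptions_from_iso_name(iso_name):
    """Split an ISO 639-1/2 style name on ';' into separate descriptions.

    Parenthesised text is never touched, so "Foo (Bar)" stays "Foo (Bar)".
    """
    parts = [part.strip() for part in iso_name.split(";")]
    # Drop empty pieces that a stray trailing ';' would otherwise produce.
    return [part for part in parts if part]


print(descriptions_from_iso_name("Spanish; Castilian"))
print(descriptions_from_iso_name("Slave (Athapascan)"))
```

Under this sketch "Spanish; Castilian" yields two descriptions, while "Slave (Athapascan)" stays as a single one.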

> 1.  Whether and how to represent non-ASCII
> I would have preferred for the Registry to be encoded 
> directly in UTF-8, instead of using these escape sequences, 
> but this was not negotiable as part of having an IANA 
> registry.  

Where a known name includes a diacritic mark or other character that cannot be represented in ASCII, there should be an ADDITIONAL description field giving the code point in whatever format is agreed.  However, there must always be an ASCII equivalent for human readability.  Please remember that we are not all working for multi-nationals and we are not all programmers/software developers.  The whole purpose of standardisation is to make it accessible to all so that it stands a chance of being adopted by all, thus creating a standard.

> Ciarán Ó Duibhín pointed out that we need the "dumbed-down" 
> descriptions for searching, since no search tools are capable 
> of searching the hex NCRs.  This was also Richard Ishida's point.

I think it has been proven that search engines are perfectly capable of dealing with "dumbed down" ASCII versions. Try entering Provencal into Google for instance.  An ASCII description should be part of EVERY record within the registry.  

I object to this format:

-----

Type: language
Subtag: nqo
Description: N&#x2019;Ko
Suppress-Script: Nkoo
Added: 2006-xx-xx

-----

I approve of this format:

-----

Type: language
Subtag: nqo
Description: N'Ko
Description: N&#x2019;Ko
Suppress-Script: Nkoo
Added: 2006-xx-xx

-----
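As a rough illustration of how the two Description lines relate, the following Python sketch produces both from an ASCII name and the exact name.  It is only a sketch under my own assumptions: the helper names to_hex_ncr and description_lines are hypothetical, and the uppercase hex digits simply follow the example record above.

```python
def to_hex_ncr(text):
    """Replace every non-ASCII character with a hex NCR such as &#x2019;."""
    return "".join(
        ch if ord(ch) < 128 else "&#x{:X};".format(ord(ch)) for ch in text
    )


def description_lines(ascii_name, exact_name):
    """Return Description lines: the ASCII form first, then the NCR form if it differs."""
    # Raises UnicodeEncodeError if the supposed ASCII form is not pure ASCII.
    ascii_name.encode("ascii")
    lines = ["Description: " + ascii_name]
    escaped = to_hex_ncr(exact_name)
    if escaped != ascii_name:
        lines.append("Description: " + escaped)
    return lines


print("\n".join(description_lines("N'Ko", "N\u2019Ko")))
```

A name that is already plain ASCII gets a single Description line, so nothing changes for the majority of records.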

> Mark Crispin said that the entire premise of using hex NCRs 
> in the Registry was wrongly conceived, and if IANA (or IETF) 
> limits us to ASCII then we should have stayed with ASCII, and 
> not attempted to represent non-ASCII characters using the 
> hex-NCR kludge or any other kludge. 
> Several, most notably Michael Everson, disagree and feel it 
> is important to include non-ASCII, even in the case of 
> apostrophes where little or no confusion will result (John 
> Cowan disagreed about the apostrophes).

I agree with Michael, but I think it is important to have both.  Hex NCRs may be ill-conceived, but I think they carry necessary information given the limitations of ASCII.  I don't know enough about alternative formats for displaying this information to comment further.

> 
> 2.  Translation of descriptions, and alternative names
> 
> Kent Karlsson expressed a preference for "Book Norwegian" 
> instead of the ASCII-folded "Norwegian Bokmal", and "New 
> Norwegian" as an additional alias for "Norwegian Nynorsk".  
> Several people disagreed that the ASCII version of the 
> Description should be an English translation, especially in a 
> case like this where no English speaker would use the 
> translated name to refer to the entity.

I think we need to get back to the ISO standards as mentioned previously.

> Kent also suggested "Ivory Coast" as the ASCII fallback for 
> "Côte d’Ivoire". 
> While this is, once again, not an ASCII fallback but an 
> English translation, Peter Constable pointed out that "Ivory 
> Coast" has both currency and historical usage.  (For example, 
> The Times of London and the New York Times both use "Ivory 
> Coast", which I did not know.)  This might make a reasonable 
> alternative description, and it can be proposed as such at 
> any time (even now), but we should try to avoid confusing 
> this with the issue of ASCII fallbacks.

This would open the "flood gates" in my opinion.  I am sure there is both "currency and historical usage" for translations of most of the names in any number of languages.  It would be a bad move to add it just because it is an English translation.

> 3.  Transcription between ASCII and non-ASCII
> 
> Keld Jørn Simonsen preferred to transcribe "Bokmål" as 
> "Bokmaal" instead of "Bokmal", and Kent preferred to 
> transcribe "Volapük" as "Volapyk" 
> instead of "Volapuk".

In order to introduce a consistent methodology for dealing with these issues, I would suggest just dropping the diacritic mark.  This may not work for some non-ASCII characters but certainly where diacritics are involved it is the best solution for a standard approach IMHO.  Otherwise we are looking at reviewing all instances as opposed to applying a simple set "diacritic rule".
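By way of illustration only, a simple "diacritic rule" could be sketched in Python as below.  This is my own assumption about how such a rule might be mechanised (Unicode canonical decomposition followed by removal of combining marks); the function name strip_diacritics is hypothetical and no such procedure has been agreed.

```python
import unicodedata


def strip_diacritics(name):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The last name contains U+00F8, which has no canonical decomposition.
for name in ["Bokm\u00e5l", "Volap\u00fck", "Proven\u00e7al", "J\u00f8rn"]:
    print(name, "->", strip_diacritics(name))
```

The first three come out as Bokmal, Volapuk and Provencal.  The "ø" in the last name is left untouched because it is a letter in its own right rather than a base letter plus a mark, which is the kind of case where the rule may not work and a written exception would be needed.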

> While these may be reasonable spellings to speakers of 
> particular languages, I do not think we want to get into the 
> business of doing the in-depth research to provide 
> language-dependent transliterations. 

Agreed

> (Think of the numerous romanizations of Чайковский.)  These 
> are not intended to be linguistically correct spellings, 
> merely ASCII fallbacks that make life easier for typists.  To 
> that end, we would do well to follow the practice of the UN 
> Economic Commission for Europe in providing ASCII fallback 
> names for UN/LOCODE:
> 
> "Place names are given, whenever possible, in their national 
> language versions as expressed in the Roman alphabet using 
> the 26 characters of the character set adopted for 
> international trade data interchange, with diacritic signs, 
> when practicable. Diacritic signs may be ignored, and should 
> not be converted into additional characters (e.g., Göteborg 
> may be read as Goteborg, rather than Goeteborg, Gothenburg, 
> Gotembourg, etc.), in order to facilitate reproduction in the 
> national language."

Absolutely agree.

> In other words, just drop the diacritic, please.  Bokmål 
> becomes Bokmal, Volapük becomes Volapuk.  If we do not do 
> this, we will never reach agreement on how to derive fallbacks.

Exactly

> Kent objected strongly to "Aland Islands", saying that "Oland" 
> would be a better phonetic match for "Åland" than "Aland", 
> and that "Islands" is not part of the name.  Obviously I 
> disagree on both counts; "Aland" is the product of 
> diacritic-stripping, which I consider preferable to 
> language-dependent transliteration, and "Islands" is indeed 
> part of the ISO 3166 name.

Indeed

> Kent also objected to "Provencal", "Provencal, Old (to 
> 1500)", and "Reunion", but provided no alternative ASCII 
> fallbacks for any of these.

Set a rule for diacritics and follow it... Then there is no need to proffer alternatives and spend days discussing them.

> It's important to keep in mind that when we start talking 
> about ISO 639-3, there are some pairs of language names that 
> differ only in diacritical marks.  For example, Arua and Aruá 
> are two different languages.  In a case like this, we will 
> not want to provide an ASCII fallback of any sort for Aruá, 
> because that would give us two languages with the same name.

WRONG.  There will be one description for the first instance (Arua) and two for the second (the ASCII form plus the non-ASCII form).  This is perfectly understandable to a human, or to a parser, so long as a written methodology is included within the standard.  Remember what the alpha2/3 code is for.  I would suggest coming to some sort of tentative agreement on the records proposed by Doug and then tackling this with written rules in RFC3066ter.
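A small Python sketch of why the shared ASCII description does no harm: the code, not the description, is the unique key.  The subtags "qaa" and "qab" are placeholders taken from the ISO 639-2 private-use range and are NOT the real codes for these two languages; the function name subtags_for is likewise hypothetical.

```python
# Placeholder subtags from the private-use range (qaa-qtz), for illustration only.
records = {
    "qaa": ["Arua"],                # first language: one description
    "qab": ["Arua", "Aru&#xE1;"],   # second language: ASCII form plus hex-NCR form
}


def subtags_for(description, registry):
    """Return every subtag whose record carries the given description."""
    return sorted(tag for tag, names in registry.items() if description in names)


print(subtags_for("Arua", records))
print(subtags_for("Aru&#xE1;", records))
```

A search on the ASCII form finds both records and the reader (or parser) then picks by code, while the non-ASCII form identifies the second record alone.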

> 4.  Splitting compound names
> 
> Kent objected to splitting "Slave (Athapascan)" into two 
> separate Descriptions, and as I stated earlier, I agree 
> entirely and withdraw this suggestion; it was my mistake.  We 
> will have many more instances of this usage in ISO 639-3 and 
> will need to be careful.  In general, ISO
> 639 indicates alternative names with a semicolon ("Spanish; 
> Castilian"), not with parentheses as ISO 3166 and 15924 do.

I think I am right in saying that within 639-3 (and certainly within 639-6) information contained within parentheses is ALWAYS used as a qualifier, NOT as an alternative name (see www.sil.org/iso639-3/codes.asp).  I am sure Peter will correct me if I am wrong.

> Michael Everson asked not to split "Falkland Islands 
> (Malvinas)" and "Holy See (Vatican City State)" since they 
> have not been shown to cause confusion as is.  While I would 
> prefer to treat this multiple-name situation consistently, 
> regardless of the type of subtag, I don't plan to fight hard 
> over it; other issues are more important to me.

As these names are presented within the underlying ISO standards as above, I am with Michael on this.  However, I would not strenuously object to adding two additional names, splitting the descriptions, provided the original stays intact.  Thus a record such as Holy See (Vatican City State) would have 3 descriptions.  
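For illustration, a Python sketch of how the three descriptions could be derived while keeping the original intact.  The function name descriptions_keeping_original is hypothetical, and the sketch assumes it is only applied to names where the parenthesised part really is an alternative name.

```python
import re


def descriptions_keeping_original(name):
    """Return the original name first, then the two halves of 'Outer (Inner)'."""
    match = re.fullmatch(r"(.+?)\s*\((.+)\)", name)
    if match is None:
        # No trailing parenthesised part: the name stays as a single description.
        return [name]
    return [name, match.group(1), match.group(2)]


print(descriptions_keeping_original("Holy See (Vatican City State)"))
print(descriptions_keeping_original("Falkland Islands (Malvinas)"))
```

It should not be run blindly over names where the parentheses hold a qualifier, since the qualifier would then be wrongly listed as a name in its own right.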

> Nobody seemed to have any objection to the other splits (e.g. 
> Han/Hanzi/Kanji/Hanja).

I object if the name as represented in the underlying ISO standard is not retained.  I have no real objection to additional descriptions but I think if you are going to do this there needs to be a written rule as to when additional descriptions can/should be added - flood gates and all that!

> I'll wait a few days for responses, then post a revised set 
> of proposed modification forms.

That's my response, for what it's worth :-)

Debbie Garside
> 
> --
> Doug Ewell
> Fullerton, California, USA
> http://users.adelphia.net/~dewell/