A proposed solution for descriptions

Sun Jun 18 22:37:13 CEST 2006

Debbie Garside <debbie at ictmarketing dot co dot uk> wrote:

>> Debbie Garside suggested the inclusion of "all 'known 
>> names'/alternative names" as descriptions, but only in the sense that 
>> draft standards such as ISO 639-3 and 639-6 define a "known name" 
>> which serves as an ISO 11179 Unique Identifier, plus zero or more 
>> "alternative names."  This is not a call for free registration of any 
>> imaginable name or nickname, such as "Down Under" for Australia.
>
> What I actually said or indeed meant is all names as represented 
> within the underlying standards should be included in the registry in 
> the EXACT format that they take within the standard.  This means that 
> if a name is presented as Foo (Bar) in the underlying standard then it 
> remains as Foo (Bar).  Three reasons for this, one: consistency, two: 
> very often the bracketed information acts as an additional qualifier, 
> three: the name can be used by other systems as a Unique Identifier 
> (ISO 11179).  From what I can see additional names (within ISO 
> 639-1/2) are delimited by ";" and these should be added as further 
> descriptions.

Taking the second reason first: If the bracketed information acts as an 
additional qualifier, then I agree 100% that it should be included as 
part of the description.  This is true in the case of "Slave 
(Athapascan)" and I made a mistake by splitting those out (and have 
withdrawn that suggestion).  It does not seem true in the case of 
"Deseret (Mormon)" or "Falkland Islands (Malvinas)".

Debbie is right that ISO 639 is consistent about using parentheses to 
indicate qualifiers, and semicolons to indicate alternative names.  But 
they are not always consistent about the order of the names, so it 
cannot always be determined which is the "original" name and which are 
"additional."  ISO 3166 uses parentheses for alternative names (there 
being no "qualifiers" per se) and ISO 15924 uses parentheses for both, 
so some human judgement must still be applied there.
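
A minimal sketch of the ISO 639 delimiter convention described above, 
assuming semicolons separate alternative names and parenthesized text is 
a qualifier that stays attached.  The function name is hypothetical, and 
"Catalan; Valencian" is used only as a familiar ISO 639-2 example of the 
semicolon form.  The rule is deliberately not applied to ISO 3166 or 
ISO 15924 names, where parentheses can mean something else.

```python
def split_iso639_names(raw):
    """Split an ISO 639-style name on ';' into separate descriptions.

    Parenthesized qualifiers are left untouched, so "Foo (Bar)" stays
    "Foo (Bar)".  This is an illustrative helper, not a registry rule.
    """
    return [part.strip() for part in raw.split(";") if part.strip()]


if __name__ == "__main__":
    # Qualifier in parentheses: kept as a single description.
    print(split_iso639_names("Slave (Athapascan)"))
    # Semicolon-delimited alternatives: one description each.
    print(split_iso639_names("Catalan; Valencian"))
    # An ISO 15924 name passes through unchanged; deciding whether its
    # parenthesized names are aliases still takes human judgement.
    print(split_iso639_names("Han (Hanzi, Kanji, Hanja)"))
```

The sketch makes no attempt to decide which of several semicolon-separated 
names is the "original" one, since the ordering is not reliable.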

Consistency with the ISO standards was my reason for sticking with the 
exact apostrophe style used in those standards.  Thus we ended up with 
"Gwich´in" from ISO 639 (acute accent used as apostrophe), "N’Ko" from 
ISO 15924 (curly apostrophe), and "Côte d'Ivoire" from ISO 3166 
(straight apostrophe but non-ASCII "o with circumflex").

I still need to spend some additional time studying ISO 11179.  My 
knee-jerk reaction, with regard to using one of the Description fields 
as a Unique Identifier, is that I would hate to be in the situation that 
Unicode and ISO 10646 have found themselves with character names.  The 
names are normative and guaranteed to be stable and immutable, so the 
several wrong or misleading names in the standard can never be 
corrected, which causes much misunderstanding and flamage.

It would probably help if the Description field that is intended to be 
the Unique Identifier could be distinguished from alternative 
descriptions that are included as ASCII fallbacks, typographical 
improvements, historic names, or commonly accepted aliases (like "North 
Korea").  This is not provided for in the approved draft (all 
Description fields are equal regardless of position) and would have to 
wait until the document is revised.

> Where a known name includes a diacritic mark or other character that 
> cannot be represented in ASCII, there should be an ADDITIONAL 
> description field giving the code point in whatever format is agreed. 
> However, there must always be an ASCII equivalent for human 
> readability.

I agree with this, except that we cannot currently distinguish 
"additional" descriptions from the "main" description, as mentioned 
above.

> Please remember that we are not all working for multi-nationals, we 
> are not all programmers/software developers and the whole purpose of 
> standardisation is to make it accessible to all in order that it may 
> stand a chance of being adopted by all; thus creating a standard.

As I stated last week, Mark Crispin's and Richard Ishida's observations 
about text searching were what caused me to change my mind and support 
ASCII fallback descriptions.

> I object to this format:
> ...
> Description: N&#x2019;Ko
> -----
>
> I approve of this format:
> ...
> Description: N'Ko
> Description: N&#x2019;Ko

+1
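
A minimal sketch of how the approved two-line form could be generated, 
assuming non-ASCII characters are written as hexadecimal character 
references (as in the "N&#x2019;Ko" example) and that the ASCII fallback 
is supplied by hand.  Both function names are hypothetical.

```python
def ncr_escape(text):
    """Replace each non-ASCII character with a hex reference (&#xHHHH;)."""
    return "".join(
        ch if ord(ch) < 128 else "&#x{:X};".format(ord(ch)) for ch in text
    )


def description_lines(name, ascii_fallback=None):
    """Build Description lines: ASCII fallback first, then the escaped name.

    If the name is already pure ASCII, a single line is produced.
    """
    escaped = ncr_escape(name)
    lines = []
    if ascii_fallback is not None and ascii_fallback != escaped:
        lines.append("Description: " + ascii_fallback)
    lines.append("Description: " + escaped)
    return lines


if __name__ == "__main__":
    for line in description_lines("N\u2019Ko", "N'Ko"):
        print(line)
```

Run as a script, this prints the two Description lines in the order shown 
in the approved format above.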

>> Kent Karlsson expressed a preference for "Book Norwegian" instead of 
>> the ASCII-folded "Norwegian Bokmal", and "New Norwegian" as an 
>> additional alias for "Norwegian Nynorsk".  Several people disagreed 
>> that the ASCII version of the Description should be an English 
>> translation, especially in a case like this where no English speaker 
>> would use the translated name to refer to the entity.
>
> I think we need to get back to the ISO standards as mentioned 
> previously.

+1

>> Kent also suggested "Ivory Coast" as the ASCII fallback for 
>> "Côte d’Ivoire". 
>> While this is, once again, not an ASCII fallback but an English 
>> translation, Peter Constable pointed out that "Ivory Coast" has both 
>> currency and historical usage.  (For example, The Times of London and 
>> the New York Times both use "Ivory Coast", which I did not know.) 
>> This might make a reasonable alternative description, and it can be 
>> proposed as such at any time (even now), but we should try to avoid 
>> confusing this with the issue of ASCII fallbacks.
>
> This would open the "flood gates" in my opinion.  I am sure there is 
> both "currency and historical usage" for translation of most of the 
> names in many a number of languages.  Bad move to add it just because 
> it is an English translation.

I am becoming quite worried about the floodgates.  We are taking the 
Description field(s) to be much more prescriptive than Section 3.1 
indicates.

> In order to introduce a consistent methodology for dealing with these 
> issues, I would suggest just dropping the diacritic mark.  This may 
> not work for some non-ASCII characters but certainly where diacritics 
> are involved it is the best solution for a standard approach IMHO. 
> Otherwise we are looking at reviewing all instances as opposed to 
> applying a simple set "diacritic rule".
> ...
> Set a rule for diacritics and follow it... Then there is no need to 
> proffer alternatives and spend days discussing them.

+1
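
One way such a "diacritic rule" might be expressed, as a sketch only: 
decompose the name with Unicode normalization form NFD and drop the 
combining marks.  The function name is hypothetical.  As noted in the 
quoted text, this does not cover every non-ASCII character; the curly 
apostrophe in "N’Ko" has no decomposition and survives unchanged.

```python
import unicodedata


def drop_diacritics(name):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for name in ("Côte d'Ivoire", "Norwegian Bokmål", "Aruá", "N’Ko"):
        folded = drop_diacritics(name)
        # str.isascii() reports whether the rule alone was sufficient.
        print(name, "->", folded, "(ASCII)" if folded.isascii() else "(still non-ASCII)")
```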

>> It's important to keep in mind that when we start talking about ISO 
>> 639-3, there are some pairs of language names that differ only in 
>> diacritical marks.  For example, Arua and Aruá are two different 
>> languages.  In a case like this, we will not want to provide an ASCII 
>> fallback of any sort for Aruá, because that would give us two 
>> languages with the same name.
>
> WRONG.  There will be one description for the first instance and two 
> for the second. This is perfectly understood as a human or when being 
> parsed so long as a written methodology is included within the 
> standard.

So we would have the following?

Type: language
Subtag: aru
Description: Arua
Added: 200x-xx-xx
...
Type: language
Subtag: arx
Description: Aru&#xE1;
Description: Arua
Added: 200x-xx-xx

That worries me.
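
The concern can be made concrete with a small check for Description 
values shared by more than one subtag.  This is a sketch under the 
assumption that records are held as a simple mapping from subtag to its 
list of descriptions; the function name and data layout are hypothetical.

```python
from collections import defaultdict


def shared_descriptions(records):
    """Return descriptions that appear under more than one subtag."""
    owners = defaultdict(set)
    for subtag, descriptions in records.items():
        for description in descriptions:
            owners[description].add(subtag)
    return {d: sorted(tags) for d, tags in owners.items() if len(tags) > 1}


if __name__ == "__main__":
    # The two records from the example above.
    records = {
        "aru": ["Arua"],
        "arx": ["Aru&#xE1;", "Arua"],
    }
    print(shared_descriptions(records))
```

With the two records above, "Arua" is reported as belonging to both aru 
and arx, so a search on the description alone cannot tell them apart.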

> Remember what the alpha2/3 code is for.  I would suggest coming to 
> some sort of tentative agreement on the records proposed by Doug and 
> then tackling this with written rules in RFC3066ter.

Absolutely agree.  We don't have to worry about it now, but we will 
definitely have to worry about it before adding the 639-3-based subtags.

> I think I am right in saying that within 639-3 (and certainly within 
> 639-6) information contained within parentheses is ALWAYS used as 
> qualifiers NOT alternative names. www.sil.org/iso639-3/codes.asp  I am 
> sure Peter will correct me if I am wrong.

I think you are right too.

>> Michael Everson asked not to split "Falkland Islands (Malvinas)" and 
>> "Holy See (Vatican City State)" since they have not been shown to 
>> cause confusion as is.  While I would prefer to treat this 
>> multiple-name situation consistently, regardless of the type of 
>> subtag, I don't plan to fight hard over it; other issues are more 
>> important to me.
>
> As these names are presented within the underlying ISO standards as 
> above, I am with Michael on this.  However, I would not strenuously 
> object to adding two additional names, splitting the descriptions, 
> provided the original stays intact.  Thus a record such as Holy See 
> (Vatican City State) would have 3 descriptions.

Let's put it this way: Is there anyone who strongly *supports* adding 
the names individually?  We could just let this one go.  I'm not 
attached to it.

>> Nobody seemed to have any objection to the other splits (e.g. 
>> Han/Hanzi/Kanji/Hanja).
>
> I object if the name as represented in the underlying ISO standard is 
> not retained.  I have no real objection to additional descriptions but 
> I think if you are going to do this there needs to be a written rule 
> as to when additional descriptions can/should be added - flood gates 
> and all that!

The name as stated in ISO 15924 is "Han (Hanzi, Kanji, Hanja)".  I would 
not have suggested any additional names such as "Chinese writing" that 
didn't appear in the standard.

> That's my response for what its worth :-)

It is certainly worth a bundle -- much more than an opinion that goes 
unspoken.

--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/