A proposed solution for descriptions (was: Re: ISO 639 - New item approved - N'Ko)

Vidar Larsen vi_larsen at yahoo.no
Sun Jun 11 10:02:16 CEST 2006


Hi,

I'm a longtime lurker on this list. I'm not a linguist, but I have  
intimate experience with internet search engines, text matching and  
Unicode. I've been working on web-search indexing and query parsing  
for 5+ years, as an employee of FAST (www.alltheweb.com), now Yahoo!.

During indexing, both Unicode normalization forms and more ad hoc  
normalization may be applied to resolve issues with accents on  
individual characters. Case normalization is also done.
The same happens during query parsing, though the original form may  
be given more weight in matching.
Unicode normalization forms standardize the character sequences  
involving accents. There is also information in the Unicode  
Character Database to case-convert and decompose characters.
Example word: VOLAPÜK
The character Ü is U+00DC, which has a lowercase mapping to U+00FC,  
and that one has a decomposition entry mapping to U+0075 U+0308.  
U+0075 is "plain ASCII" u. So odds are that searching with any of  
the characters along this path will give a match.
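The path just described can be sketched in a few lines of Python, using the standard unicodedata module (this is only an illustration of the mappings above, not how any particular engine implements them):

```python
import unicodedata

word = "VOLAP\u00dcK"   # VOLAPÜK, with U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS
lowered = word.lower()  # U+00DC has a lowercase mapping to U+00FC
# NFD decomposition turns U+00FC into U+0075 ('u') + U+0308 (combining diaeresis)
decomposed = unicodedata.normalize("NFD", lowered)
# Dropping the combining marks leaves the "plain ASCII" form
stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
print(lowered)   # volapük
print(stripped)  # volapuk
```

Indexing all of these forms is what makes a query for "volapuk" match "VOLAPÜK".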

Non-word characters, such as apostrophes, quotes, dashes etc., are  
in general just used to separate words. So, if the input has N’Ko  
(such that the actual character is recognized), it will be indexed  
as N <nonword> Ko, or even just N Ko, and you should be able to get  
a match searching for anything from the original form (using the  
correct character) to queries like
N-Ko
N@Ko
N'Ko  (using the ASCII character)
even "N Ko"
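A minimal sketch of this kind of tokenization, assuming the simplest possible rule (split on any run of non-word characters; real tokenizers are more elaborate):

```python
import re

def tokenize(text):
    # Treat any run of non-word characters (apostrophes, dashes,
    # quotes, spaces, ...) as a token separator.
    return [t for t in re.split(r"\W+", text) if t]

for query in ["N\u2019Ko", "N-Ko", "N'Ko", "N Ko"]:
    print(tokenize(query))  # every form reduces to ['N', 'Ko']
```

Since all the query forms reduce to the same token sequence, they all match the same indexed document.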


Note that these mappings in no way imply that "volapuk" is the same  
as "Volapük", or that "n'ko" is the same as "N’Ko". They are only  
matching mechanisms for the user's convenience. Adding them to the  
registry, though, would, in my opinion, indicate that they have at  
least a higher degree of equivalence than desired.

One example of an undesirable effect would be "Norwegian Bokmål"  
mapping to "Norwegian Bokmal". The first translates roughly to  
"book speech", while the latter says "book template". Adding this to  
the registry would actually give web-search users wanting a "book  
template" a non-relevant result, as there is no way for the web- 
search indexer to know that the instance of the word in this place  
was a normalized version of the previous entry, and not what it  
actually said. I would imagine that similar examples of words  
changing meaning with accent removal can be found in other languages.

The fundamental problem, as I see it, is that the registry uses XML  
entities in a non-XML context, thus relying on convention to  
represent Unicode characters. To any outside party, such as a web  
crawler/indexer, these will look like embedded words (containing  
only the characters representing hex values), not the desired  
characters. If the characters had been represented in a standard  
form, the search engine would have resolved most of the issues  
raised. (Of course, this does not apply to "local" searching using  
your editor's or OS's search/text-match capabilities.)
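To make the problem concrete, here is what a plain-text indexer sees when it meets the NCR form, again assuming the simple split-on-non-word tokenizer for illustration:

```python
import re
import html

raw = "N&#x2019;Ko"  # how the name appears in the plain-text registry
# A plain-text indexer has no reason to decode the reference; it just tokenizes:
tokens = [t for t in re.split(r"\W+", raw) if t]
print(tokens)  # ['N', 'x2019', 'Ko'] -- 'x2019' is indexed as a bogus "word"
# Only in an (X)HTML/XML context would the reference be resolved:
print(html.unescape(raw))  # N’Ko, with the real U+2019 apostrophe
```

A query for N'Ko will never match the token sequence ['N', 'x2019', 'Ko'], which is exactly the situation Richard described.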

So, to give the best possible search experience (using web-search  
engines), we should either author the registry in (X)HTML, XML or  
another standardized markup in which the entities are correctly  
interpreted, or use a text encoding that allows the correct  
characters to be represented directly, such as UTF-8.
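The UTF-8 route needs no convention at all; the character simply travels as bytes. A one-line illustration:

```python
name = "N\u2019Ko"          # the real RIGHT SINGLE QUOTATION MARK, U+2019
data = name.encode("utf-8")
print(data)                 # b'N\xe2\x80\x99Ko' -- the apostrophe as three bytes
assert data.decode("utf-8") == name  # round-trips with no entity convention
```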

The conclusion of this is what web-search engines have been  
preaching to content producers for years: represent your content in  
the best possible semantic form, and let the search engine worry  
about matching user intent with content. This process is only made  
worse by well-meaning content producers trying to be helpful.

Now, on to your concrete suggestions. I support expanding  
parenthesized descriptions containing true variants, while leaving  
parentheses that add qualifiers. I also naturally support fixing the  
problem/error/typo with 'Amis.
I support replacing semantically wrong characters with the more  
correct alternatives.
I do not support adding "dumbed down" descriptions in an effort to  
normalize the original description, for the reasons given above.

But this is of course only relevant if we are able to use a text  
encoding like UTF-8, or move to a markup language that supports the  
entities in use. If neither of those options is possible, I will  
just have to live with the ugliness and impurity of the registry.

Sorry for the long post,
--
Vidar Larsen
Search developer
Yahoo! Inc.

On 11 Jun 2006, at 06:13, Doug Ewell wrote:

> Mark Crispin <mrc at CAC dot Washington dot EDU> wrote:
>
>> The problem is that you guys are trying to resolve conflicting  
>> desires into a single name.  Long experience tells me that this  
>> doesn't work, and ultimately forces the registry into wretched  
>> compromises that displease everybody.
>
> Richard Ishida <ishida at w3 dot org> wrote:
>
>> In the case of the actual registry, there currently is no N'Ko  
>> ASCII text, and one would have to type N&#x2019;Ko to get a match,  
>> knowing the right code point to use, and how to represent that as  
>> an NCR. You cannot google that by typing in N'Ko. I don't think  
>> that situation is very helpful to the average user.
>
> Originally I was opposed to adding new Description values to solve  
> this problem, but Mark's and Richard's arguments have thoroughly  
> convinced me that this is necessary, and isn't a slippery slope  
> that would lead to dozens of Description strings for every subtag.   
> I stand corrected, and no, I don't mind being called a flip-flopper.
>
> I hereby propose some changes to the Description fields of 28  
> existing records, based on the following issues that presented  
> themselves more or less in this order.
>
> 1.  With the addition of N'Ko the language, the Registry now has 14  
> subtag records with Description fields that include a non-ASCII  
> character (and therefore a hex NCR).  I propose that for each of  
> these, a corresponding ASCII-only Description be added.  Example:  
> "N&#x2019;Ko" will be joined by "N'Ko".  This applies not only to  
> apostrophes, but to all non-ASCII characters such as accented  
> letters: "Volapük" will be joined by "Volapuk".  This solves most  
> of the problem described by Richard.
>
> 2.  Conversely, those subtags that have a Description with an ASCII  
> apostrophe should have a corresponding Description added with the  
> appropriate non-ASCII directional apostrophe or modifier letter.  
> Example: "Mi'kmaq" will be joined by "Mi&#x2BC;kmaq".  This should  
> answer the concerns of Michael and others that a Description in  
> "the correct characters" be available for all subtags.
>
> 3.  A few names (Gwich'in, Ge'ez) currently have the *wrong* non- 
> ASCII apostrophe.  I propose that these be changed to a more  
> appropriate character, as well as adding the pure-ASCII  
> equivalent.  Example: "Gwich´in" will be deleted and two new  
> Description fields, "Gwich&#x2BC;in" and "Gwich'in", will be  
> added.  This also answers a concern raised by Michael.
>
> 4.  Some subtags were found to have a Description with a second  
> name in parentheses, which is really an alternate name rather than  
> a qualifier of the first name.  In the case of script subtag  
> "Hano", the Description "Hanunoo (Hanun&#xF3;o)" already does what  
> we are trying to achieve: it provides ASCII and non-ASCII  
> equivalents for the same name.  This should be replaced by two new  
> Description fields, "Hanunoo" and "Hanun&#xF3;o".
>
> 5.  Likewise for a Description like "Lepcha (R&#xF3;ng)", it  
> doesn't make sense to repeat the "Lepcha" part simply to provide an  
> ASCII and non-ASCII version of "Róng".  What would make sense would  
> be to split this into three Descriptions: "Lepcha", "R&#xF3;ng",  
> and "Rong".
>
> 6.  For that matter, any Description fields with an alternate name  
> in parentheses (not a qualifier) should really be split into  
> multiple Descriptions, regardless of whether non-ASCII characters  
> are present. Example: "Falkland Islands (Malvinas)" should be split  
> into "Falkland Islands" and "Malvinas".  This is what we did with  
> language subtags, which are separated by semicolons in ISO 639: we  
> converted them to multiple Description fields.  What I propose is  
> that we do this consistently with scripts and regions as well.
>
> Note that items 4 through 6 have no effect on Description fields  
> where the parenthesized portion acts as a qualifier to the  
> unparenthesized portion.  For example, "Cyrillic (Old Church  
> Slavonic variant)" would NOT be split into "Cyrillic" and "Old  
> Church Slavonic variant" since this would make no sense, and would  
> give "Cyrl" and "Cyrs" the same Description.
>
> 7.  Finally, getting back to the apostrophe issue, it appears that  
> the language Amis, represented by the grandfathered tag "i-ami",  
> should not have an apostrophe at all.  This was listed as 'Amis in  
> the RFC 1766 registration form dating back to 1999, and so it was  
> copied that way to the initial RFC 3066bis Registry, but apparently  
> this was a typo or editing error.  I propose changing this to "Amis".
>
> In a separate mail I will present proposed registration forms for  
> all 28 subtags that are affected in one way or another by these  
> issues.  They are severable; each should be considered and  
> discussed by the group on its own merits.  We aren't really  
> constrained by time on this, but we should keep the discussion  
> moving so that the appropriate changes (as agreed by the list) can  
> be made to the Registry.
>
> --
> Doug Ewell
> Fullerton, California, USA
> http://users.adelphia.net/~dewell/
>
>
