Another update to registry

Harald Tveit Alvestrand harald at alvestrand.no
Tue Oct 26 13:01:45 CEST 2004


This is a generic problem with registries that (a) use delimiter 
characters, and (b) import data from sources outside their control or 
without strict character set limitations.

I personally think that the ; convention is ugly, and prefer to use
vertical bar (|) or colon (:) as a field separator - but in all cases, the
Right Solution is an escape mechanism.
If you want a simple Perl "split" to work, something ugly like replacing ;
with (s) and replacing ( with (l) is probably necessary. (This is slightly
less ugly than using U+XXXX codes.)
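
A minimal sketch of that escape scheme, written in Python rather than Perl.
The function names are made up for illustration, and the sketch assumes
that leading and trailing whitespace in a field is not significant. The
"(" must be escaped before the ";", otherwise the "(" of a freshly written
"(s)" would be escaped a second time.

```python
import re


def escape_field(text):
    """Escape '(' as '(l)' first, then ';' as '(s)'."""
    return text.replace("(", "(l)").replace(";", "(s)")


def unescape_field(text):
    """Reverse escape_field: '(l)' becomes '(' and '(s)' becomes ';'."""
    return re.sub(
        r"\((l|s)\)",
        lambda m: "(" if m.group(1) == "l" else ";",
        text,
    )


def format_record(fields):
    """Join escaped fields with '; ' so a plain split on ';' stays safe."""
    return "; ".join(escape_field(f) for f in fields)


def parse_record(line):
    """Plain split on ';', then strip whitespace and unescape each field."""
    return [unescape_field(f.strip()) for f in line.split(";")]
```

With this, the description "Sinhala; Sinhalese" is stored as
"Sinhala(s) Sinhalese", and a record still splits into the expected number
of fields.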

                Harald

--On 22 October 2004 13:13 -0700 Doug Ewell <dewell at adelphia.net> wrote:

> This week ISO 639-2/RA announced a change to the name of the language
> represented by the alpha-2 code "si", from just plain "Sinhalese" to
> "Sinhala; Sinhalese".  Accordingly, I've updated the proposed IANA
> language subtag registry, replacing "Sinhalese" with "Sinhala".
>
> One of the unfortunate aspects of the registry being specified in RFC
> 3066bis as a semicolon-delimited text file is that there is no provision
> for descriptions that contain a semicolon, and ISO 639-2/RA seems to be
> doing this more and more often.  Many of the recently added language
> names consist of two or more names separated by a semicolon:
>
> Filipino; Pilipino
> Classical Newari; Old Newari
> Klingon; tlhIngan-Hol
> Blin; Bilin
> Crimean Tatar; Crimean Turkish
> Limburgish; Limburger; Limburgan
> Low German; Low Saxon; German, Low; Saxon, Low
> Church Slavic; Old Slavonic; Old Church Slavonic; Church Slavonic; Old Bulgarian
> etc.
>
> IMHO, the latter two border on the ridiculous; it's probably not
> necessary to offer every possible permutation of a multi-word name.
>
> Nevertheless, even though we know the names of languages in ISO 639 are
> not normative (only the codes are), it would still be nice for the full
> ISO 639 name (including multiple parts) to be used in the registry.  But
> because of the semicolon-delimited format, only one of the multiple
> names can be chosen.  I've chosen the first name in each case rather
> than being arbitrary about it.  The semicolons can't be simply replaced
> by commas; that would wreak havoc on the "Low German" example above.
>
> The only alternative I can think of would be to use quotation marks to
> enclose multi-part names that contain semicolons:
>
> language; si; "Sinhala; Sinhalese"; 2004-07-06; ;
>
> but of course this would require extra processing.
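
As a rough illustration of that extra processing, here is a sketch that
reads the quoted form with Python's standard csv module. The function name
is hypothetical, and nothing here is specified by RFC 3066bis; it only
assumes that fields are separated by ";" and that a field containing a
semicolon is wrapped in double quotes.

```python
import csv


def parse_quoted_record(line):
    """Split one registry line on ';', honoring double-quoted fields."""
    reader = csv.reader(
        [line],
        delimiter=";",
        quotechar='"',
        skipinitialspace=True,  # ignore the space that follows each ';'
    )
    return [field.strip() for field in next(reader)]


record = 'language; si; "Sinhala; Sinhalese"; 2004-07-06; ;'
fields = parse_quoted_record(record)
```

Here the quoted description comes back as the single field
"Sinhala; Sinhalese", while a naive split on ";" would cut it in two.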
>
> Somewhat related to this, I've also replaced semicolons with commas
> whenever they appear within comments.  This prevents lines like:
>
> region; BU; Myanmar; 2004-07-06; MM; # changed 1989-12-05; formerly Burma
>
> from being parsed as seven fields instead of six.
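
The comment substitution described above could be sketched as follows. The
function name is invented for illustration, and the sketch assumes that
the first "#" on a line always starts the comment, i.e. that "#" never
appears in an earlier field.

```python
def sanitize_comment(line):
    """Replace ';' with ',' in the part of the line after the first '#'."""
    head, marker, comment = line.partition("#")
    return head + marker + comment.replace(";", ",")


raw = "region; BU; Myanmar; 2004-07-06; MM; # changed 1989-12-05; formerly Burma"
clean = sanitize_comment(raw)

# The raw line splits into seven pieces, the sanitized one into six.
raw_count = len(raw.split(";"))
clean_count = len(clean.split(";"))
```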
>
> -Doug Ewell
>  Fullerton, California
>  http://users.adelphia.net/~dewell/
>
>
> _______________________________________________
> Ietf-languages mailing list
> Ietf-languages at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/ietf-languages