Another update to registry
Harald Tveit Alvestrand
harald at alvestrand.no
Tue Oct 26 13:01:45 CEST 2004
This is a generic problem with registries that (a) use delimiter
characters, and (b) import data from sources outside their control or
without strict character set limitations.
I personally think that the ; convention is ugly, and prefer to use
a vertical bar (|) or colon (:) as a field separator, but in all cases the
Right Solution is an escape mechanism.
If you want a simple Perl "split" to work, something ugly like replacing ;
with (s) and replacing ( with (l) is probably necessary. (This is slightly
less ugly than using U+XXXX codes.)
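A minimal sketch of that escape convention, written in Python rather than Perl purely for illustration. The function names (escape_field, unescape_field, format_record, parse_record) are hypothetical and not part of any registry specification, and the sketch assumes that leading and trailing whitespace in a field is not significant:

```python
import re

def escape_field(text):
    # Escape "(" first, so the "(" introduced by "(s)" is not escaped again.
    return text.replace("(", "(l)").replace(";", "(s)")

# After escaping, every "(" in a field starts either "(l)" or "(s)".
_TOKEN = re.compile(r"\((l|s)\)")

def unescape_field(text):
    # Decode both tokens in a single left-to-right pass.
    return _TOKEN.sub(lambda m: "(" if m.group(1) == "l" else ";", text)

def format_record(fields):
    # Join escaped fields with the registry-style "; " separator.
    return "; ".join(escape_field(f) for f in fields)

def parse_record(line):
    # A plain split on ";" is now safe: no field contains a bare semicolon.
    return [unescape_field(part.strip()) for part in line.split(";")]

record = ["language", "si", "Sinhala; Sinhalese", "2004-07-06", "", ""]
line = format_record(record)
assert parse_record(line) == record
```

The order of the two replacements in escape_field matters: doing ; first would turn it into (s), and the second replacement would then mangle that into (l)s).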
--On 22 October 2004 13:13 -0700, Doug Ewell <dewell at adelphia.net> wrote:
> This week ISO 639-2/RA announced a change to the name of the language
> represented by the alpha-2 code "si", from just plain "Sinhalese" to
> "Sinhala; Sinhalese". Accordingly, I've updated the proposed IANA
> language subtag registry, replacing "Sinhalese" with "Sinhala".
> One of the unfortunate aspects of the registry being specified in RFC
> 3066bis as a semicolon-delimited text file is that there is no provision
> for descriptions that contain a semicolon, and ISO 639-2/RA seems to be
> doing this more and more often. Many of the recently added language
> names consist of two or more names separated by a semicolon:
> Filipino; Pilipino
> Classical Newari; Old Newari
> Klingon; tlhIngan-Hol
> Blin; Bilin
> Crimean Tatar; Crimean Turkish
> Limburgish; Limburger; Limburgan
> Low German; Low Saxon; German, Low; Saxon, Low
> Church Slavic; Old Slavonic; Old Church Slavonic; Church Slavonic; Old
> IMHO, the latter two border on the ridiculous; it's probably not
> necessary to offer every possible permutation of a multi-word name.
> Nevertheless, even though we know the names of languages in ISO 639 are
> not normative (only the codes are), it would still be nice for the full
> ISO 639 name (including multiple parts) to be used in the registry. But
> because of the semicolon-delimited format, only one of the multiple
> names can be chosen. I've chosen the first name in each case rather
> than being arbitrary about it. The semicolons can't be simply replaced
> by commas; that would wreak havoc on the "Low German" example above.
> The only alternative I can think of would be to use quotation marks to
> enclose multi-part names that contain semicolons:
> language; si; "Sinhala; Sinhalese"; 2004-07-06; ;
> but of course this would require extra processing.
> Somewhat related to this, I've also replaced semicolons with commas
> whenever they appear within comments. This prevents lines like:
> region; BU; Myanmar; 2004-07-06; MM; # changed 1989-12-05; formerly
> from being parsed as seven fields instead of six.
> -Doug Ewell
> Fullerton, California