ID for language-invariant strings

Fri Mar 14 19:58:30 CET 2008

Peter Constable scripsit:

> Consider for a moment the large number of Unicode character names. Now,
> Unicode treats these as English-language character names, but just
> suppose they were considered language-neutral reference names. 

Okay.

> Now, there are scenarios in which these get presented to users, and as
> a result there are various kinds of linguistic processing that may
> be applicable (stemming, hyphenation...). 

I think you kick the ball between your own goalposts here.  If hyphenation
were applied to Unicode character names, the distinction, obnoxious but
real, between the names TIBETAN LETTER A and TIBETAN LETTER -A might
well be obliterated.  It is precisely because Unicode character names
are fixed and invariable, including the alphabetic case used, that they
ought not to be treated as English.
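
A minimal sketch of the point, assuming Python's standard unicodedata
module.  The helper naive_dehyphenate is hypothetical: it stands in for
whatever hyphen-insensitive linguistic processing might be applied, and
is not anything Unicode itself defines.

```python
import unicodedata

# The two Tibetan letters whose names differ only by a hyphen.
LETTER_A = "\u0f68"
LETTER_DASH_A = "\u0f60"

name_a = unicodedata.name(LETTER_A)            # TIBETAN LETTER A
name_dash_a = unicodedata.name(LETTER_DASH_A)  # TIBETAN LETTER -A


def naive_dehyphenate(name: str) -> str:
    """Hypothetical cleanup: treat hyphens as spaces, then collapse runs of whitespace."""
    return " ".join(name.replace("-", " ").split())


# Treated as fixed identifiers, the two names are distinct.
print(name_a != name_dash_a)

# Treated as English text with hyphens normalized away, they collide.
print(naive_dehyphenate(name_a) == naive_dehyphenate(name_dash_a))
```

Both comparisons come out true: the names are distinct as exact
strings, and the distinction is lost once the hyphen is handled as
ordinary English punctuation.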

-- 
John Cowan   http://ccil.org/~cowan  cowan at ccil.org
[P]olice in many lands are now complaining that local arrestees are insisting
on having their Miranda rights read to them, just like perps in American TV
cop shows.  When it's explained to them that they are in a different country,
where those rights do not exist, they become outraged.  --Neal Stephenson