ID for language-invariant strings

Fri Mar 14 23:01:14 CET 2008

On Fri, Mar 14, 2008 at 2:41 PM, Peter Constable <petercon at microsoft.com> wrote:
> If these were tagged with an ID such as "zxn" or "und", then there's no
> particular obstacle in developing an application with throwing those at
> linguistic processes, perhaps primed with some language-detection
> processing. But if they are tagged as "zxx", then you have to go out of
> your way to make sure that "zxx" gets ignored when these are thrown
> at those processes -- which may be in a linked library or in a service off
> in the cloud -- lest they simply return N/A.

I don't understand the problem here. Something that has no interest in
doing anything when fed text in Cobol probably shouldn't do anything
with these names either. Hyphenation should leave them alone; word
wrapping should do something basic, etc. (Converting a space to a
newline is theoretically an error in your names, as it might be in
program text, but in both cases, if it's an error the author was
worried about, they should have marked the text as non-wrapping.)
Language detection and "smart" handling are wrong here: changing a font
named "Coöperate!" to "Co-<NL>operate!" is entirely correct English,
but wrong for a font name, and certain other languages are far more
prone to such changes than English.
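
To make that concrete, here is a minimal sketch in Python of a wrapper
that does "something basic" for "zxx" text and allows a break after a
hyphen only for text tagged with a real language. The function name
wrap_tagged and the choice of Python's textwrap module are my own
assumptions for illustration; real hyphenation would need
language-specific rules that are not shown here.

```python
import textwrap

# "zxx" is the code for "no linguistic content".
NON_LINGUISTIC = {"zxx"}


def wrap_tagged(text, lang_tag, width=20):
    """Wrap text, skipping language-aware breaks for non-linguistic tags.

    Hypothetical helper: only the primary subtag is inspected, so
    "zxx-Latn" is treated the same as "zxx".
    """
    primary = lang_tag.split("-")[0].lower()
    linguistic = primary not in NON_LINGUISTIC
    return textwrap.wrap(
        text,
        width=width,
        # Never split inside a word just because it is too long.
        break_long_words=False,
        # Breaking after a hyphen is a linguistic decision; skip it for "zxx".
        break_on_hyphens=linguistic,
    )


if __name__ == "__main__":
    print(wrap_tagged("Co-operate!", "en", width=8))
    print(wrap_tagged("Co-operate!", "zxx", width=8))
```

The point of the design is that the "zxx" check lives in one place, next
to the linguistic decision itself, so callers never have to filter such
strings out beforehand.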

Níall wrote:
> Would you tag the personal name "Pierre Desjardins" as French?
> While it is clearly French in origin, you're not likely to translate
> it to "Peter Gardener" if localising a document to English.

But you would tag "Dostoevsky" as English, since the Esperantists call
him Dostojevskij, the Estonians Dostojevski, the Germans Dostojewski,
and of course the Russians Достоевский.