Really OT: internationalized email addresses (Was: french orthography (Was: BCP47 Appeals process))

Wed Sep 24 21:03:24 CEST 2008

Mark Crispin scripsit:

> >> The attempt to "internationalize" these tokens will severely damage
> >> their utility as global tokens.  I challenge anyone here to visually
> >> inspect a short text string in Unicode and enter the identical string
> >> on a keyboard.  Nobody, not even the "Unicode experts" can reliably
> >> do that.  In an attempt to work around that, we talk about such things
> >> as "stringprep" and "canonicalization" utterly ignoring the fact that
> >> these are feeble attempts to lock the barn door while the horse is out.
> > True enough: the problem turned out to be bigger than anyone thought.
> 
> WRONG!

	"The tactful way," Rod said quietly, "the polite way to disagree
	with the Senator would be to say, 'That turns out not to be
	the case.'"

> The magnitude of the problem was obvious to anyone who
> understood the issues.  

Back in 1988, no one *did* understand the issues, and so Unicode
had to grow by accretion and a fair amount of trial and error.
That's produced a lot of difficulties, some of which have been
patched up, some of which have to be lived with.  There's always
a tradeoff in such situations between stability and correctness.

> Uh, not quite.  Even in the face of Han unification, there remains an
> enormous duplication of Han characters within Unicode, and it's only
> gotten worse with the SIP.

Normalization Form C deals nicely with that particular problem.
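
A minimal sketch of what I mean, using Python's standard unicodedata
module.  The sample code point is my own pick for illustration: U+F900
is a CJK compatibility ideograph whose canonical decomposition is the
unified ideograph U+8C48, so NFC folds the duplicate away.  This covers
only duplicates that Unicode declares canonically equivalent, not
everything a reader might regard as "the same character".

```python
import unicodedata

# Illustrative sample only: U+F900 is a CJK compatibility ideograph whose
# canonical decomposition is the unified ideograph U+8C48.
compat = "\uf900"
unified = "\u8c48"

# As raw code points the two strings differ...
differ_before = compat != unified

# ...but NFC maps the compatibility duplicate onto the unified character.
folded = unicodedata.normalize("NFC", compat)
same_after = folded == unified

print(differ_before, same_after)
```

Comparing identifiers only after normalizing both sides to NFC makes
this class of duplicate invisible to the comparison.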

> That's what it is all about.  These tokens are written on paper, spoken
> on the telephone, and broadcast on radio and TV.  In all cases, to be
> useful, someone has to enter it.

They are also made up on the fly.

> It's happening already with criminal organizations (spam, phish, etc.)
> and I've had word leaked to me of governments planning such.

No evidence, in other words.

> You may be able to type it, but you can not, given a visual representation
> of the character, reliably enter the correct character for all too many
> characters.  You can't even do that for Latin.

There is no possible solution to that problem: nobody can tell in
isolation, even in Perfect Cleanicode, what stream of characters caused
this:

		ENGLISH TEXT = txet cibara
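
To make that concrete, here is a rough sketch of two different character
streams that a bidi-aware display would show identically.  The
naive_visual() helper is a toy of my own that handles exactly one
override pair; it is an assumption for illustration, not an
implementation of the Unicode Bidirectional Algorithm.

```python
import unicodedata

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

# Stream 1: the letters stored in the order they appear on screen.
plain = "txet cibara"

# Stream 2: "arabic text" stored in logical order inside an RTL override.
overridden = RLO + "arabic text" + PDF


def naive_visual(s):
    """Toy renderer (hypothetical helper): reverse the text inside a single
    RLO...PDF pair wrapping the whole string; leave anything else alone."""
    if s.startswith(RLO) and s.endswith(PDF):
        return s[1:-1][::-1]
    return s


# The stored strings differ, and NFC does not reconcile them...
stored_differ = plain != overridden
nfc_differ = (unicodedata.normalize("NFC", plain)
              != unicodedata.normalize("NFC", overridden))

# ...yet the toy renderer shows the same thing for both.
look_same = naive_visual(plain) == naive_visual(overridden)

print(stored_differ, nfc_differ, look_same)
```

Someone looking only at the rendered line has no way to choose between
the two streams, which is the point of the example above.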

-- 
John Cowan   cowan at ccil.org
    "Mr. Lane, if you ever wish anything that I can do, all you will have
        to do will be to send me a telegram asking and it will be done."
    "Mr. Hearst, if you ever get a telegram from me asking you to do
        anything, you can put the telegram down as a forgery."