Really OT: internationalized email addresses (Was: French orthography (Was: BCP47 Appeals process))

Wed Sep 24 19:46:43 CEST 2008

>> Let me be clear on this: DNS names, email addresses, etc. are machine
>> tokens and NOT natural language.  They are properly seen as the
>> equivalent of telephone numbers.
> That used to be true.  But now people expect things like ibm.com and
> hypercalvinismthemovie.com to Just Work, and this is an advantage denied
> to people who don't know the romanization system of their country,
> and it's even worse where there is more than one.

That's a bad argument.  My best friend's daughter has her name as her
mobile phone number.  That does not mean that we should abandon the
use of numbers for the telephone network, not even if this is an advantage
denied to people who don't have that benefit of the Japanese language.

The fact that greed allowed a landrush on DNS names in the 1990s does
not change the technical issues.  It does, however, teach about the effect
of the Law of Unintended Consequences.

It's fine to get into a group hug and sing Kumbaya, but that won't make
everybody like you and not want to hurt you.  Similarly, it's fine to produce
a design that says "these are the rules that we will all follow", but if there
are advantages to breaking the rules please be assured that the rules will
be broken.

>> The attempt to "internationalize" these tokens will severely damage
>> their utility as global tokens.  I challenge anyone here to visually
>> inspect a short text string in Unicode and enter the identical string
>> on a keyboard.  Nobody, not even the "Unicode experts" can reliably
>> do that.  In an attempt to work around that, we talk about such things
>> as "stringprep" and "canonicalization" utterly ignoring the fact that
>> these are feeble attempts to lock the barn door while the horse is out.
> True enough: the problem turned out to be bigger than anyone thought.

WRONG!  The magnitude of the problem was obvious to anyone who
understood the issues.  And while there is a general understanding now
that it isn't as simple as "just send UTF-8", the fact is that the advocates
of "internationalizing" these tokens still fail to grasp the true complexity
of the problem.

>> That's not important to my argument.  We're not talking about good or
>> unique fit or even accurate fit.
> Sure you are, when you are trying to guess the correct romanization of
> a local name.  DNS is unforgiving about such things as taipei.gov.tw
> versus taibei.gov.tw, but 臺北.政府.臺灣 is unambiguous.  (I may
> have got that wrong; I don't speak Chinese.)

Uh, not quite.  Even in the face of Han unification, there remains an
enormous duplication of Han characters within Unicode, and it's only
gotten worse with the SIP.
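
To make the duplication concrete, here is a minimal Python sketch.  It is
only an illustration under stated assumptions: it uses nothing beyond the
standard unicodedata module, the sample strings are two common written
forms of "Taipei" chosen for the example, and the compatibility ideographs
shown are just one kind of duplication, not a full survey.

```python
import unicodedata

# Two written forms of "Taipei" (sample strings chosen for illustration):
# traditional 臺 (U+81FA) versus the common variant 台 (U+53F0).
trad = "\u81fa\u5317"     # 臺北
variant = "\u53f0\u5317"  # 台北

def nfkc(s):
    """Apply Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", s)

# Normalization does not unify these variants: they remain distinct strings.
variants_differ = nfkc(trad) != nfkc(variant)

# Compatibility ideographs are outright duplicates: each canonically
# decomposes to a unified ideograph.  U+F900 is in the BMP; U+2F800 is in
# the SIP (Supplementary Ideographic Plane).
bmp_dup = unicodedata.normalize("NFC", "\uf900")
sip_dup = unicodedata.normalize("NFC", "\U0002f800")

print(variants_differ)
print(f"U+F900  -> U+{ord(bmp_dup):04X}")
print(f"U+2F800 -> U+{ord(sip_dup):04X}")
```

Normalization folds the compatibility duplicates together, but it does
nothing for variant forms such as the two spellings above.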

>> Similarly, a person, upon receipt of a printed email address or DNS
>> name, 
> "In receipt of" being the critical bit.

That's what it is all about.  These tokens are written on paper, spoken
on the telephone, and broadcast on radio and TV.  In all cases, to be
useful, someone has to enter it.

>> There is a far more sinister agenda at work; to make it impossible
>> for these tokens to be used outside the country.  There will be the
>> "haves", who have both their "internationalized" email address and
>> a global email address using Latin script, and the "have nots" who
>> have only an "internationalized" (translation: domestic only) email
>> address that nobody outside can access.  Don't think for a moment that
>> the stringprep and canonicalization kludges will actually be obeyed.
> Is there any actual evidence of this, or is it just your general belief
> in total depravity?

It's happening already with criminal organizations (spam, phish, etc.)
and I've had word leaked to me of governments planning such.

Fortunately, since the criminals are doing this first, this is ultimately
going to cause the collapse of the entire effort, as more security filters
block IDN and IEA syntax.  And good riddance.

> In any case, I can type any Unicode character I want.

You may be able to type it, but you cannot, given a visual representation
of the character, reliably enter the correct character for all too many
characters.  You can't even do that for Latin.
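
A short Python sketch of what that looks like in practice.  Treat it as an
illustration, not a definitive test: the strings "paypal" and "café" are
made-up samples, and only the standard unicodedata module is assumed.

```python
import unicodedata

def codepoints(s):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

def norm(form, s):
    return unicodedata.normalize(form, s)

# Sample 1: Latin "a" (U+0061) replaced by Cyrillic "а" (U+0430).
latin = "paypal"
mixed = "p\u0430yp\u0430l"

# Normalization does not help here: the two strings stay different.
still_differ = norm("NFKC", latin) != norm("NFKC", mixed)

# Sample 2: pure Latin.  Precomposed é (U+00E9) versus e + combining
# acute accent (U+0301).  They differ as typed and match only after
# normalization.
composed = "caf\u00e9"
decomposed = "cafe\u0301"
raw_equal = composed == decomposed
nfc_equal = norm("NFC", composed) == norm("NFC", decomposed)

print(codepoints(latin))
print(codepoints(mixed))
print(still_differ, raw_equal, nfc_equal)
```

The second case is the one normalization was designed for; the first is
not covered by it at all, which is why printed or spoken tokens cannot be
re-entered reliably.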

The way we are going, we will end up having to type hex codepoints
(U+xxxx) forever.

-- Mark --
