Mixing scripts (Re: Unicode versions (Re: Criteria for exceptional characters))

Soobok Lee lsb at lsb.org
Sun Dec 24 16:43:02 CET 2006


On Sun, Dec 24, 2006 at 10:29:36PM +0900, Martin Duerst wrote:
> At 21:39 06/12/21, Soobok Lee wrote:
> 
> >Local character sets might provide  some clues.
> 
> Clues, yes, but not much more.

Yes.

> 
> >They often contain
> >multiple scripts in single local charset in order to serve the need
> >of everyday language life of local language communities.
> >
> >I think this statement is very reasonable:
> > "If it is possible to "localize" an IDN label in any single local charset,
> >   the label should be allowed, however many scripts it spans across.
> >   Even for some of those, UA can display in punycode form."

I am speaking from the perspective of local users' demand, in reply to
Mr. Markham, who says "If we are certain we can forsee all possible needs
for script mixing now".  My point is that there may be legitimate
demand from end users for mixed-script labels if their local charset
allows such a mixture.

"should be allowed" does not mean "is safe".  Such a label may not be safe,
which is why I added the last condition above: "can display in punycode form".
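
The rule above can be sketched as a simple encodability test. This is only a minimal illustration, assuming Python's built-in codecs stand in for the local charsets (euc_kr for KSC5601); the candidate list and the names localizable_in and punycode_form are hypothetical, and nameprep is skipped in the punycode step.

```python
# Hypothetical candidate list: the thread names these charsets,
# but does not define a fixed set.
LOCAL_CHARSETS = ["euc_kr", "gb2312", "iso8859_7", "iso8859_5"]


def localizable_in(label, charsets=LOCAL_CHARSETS):
    """Return the charsets that can encode every character of the label."""
    hits = []
    for name in charsets:
        try:
            label.encode(name)
        except UnicodeEncodeError:
            continue
        hits.append(name)
    return hits


def punycode_form(label):
    """ASCII form a user agent could show instead of the native form.

    Sketch only: a real implementation would apply nameprep first.
    """
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


label = "\ud55c\uae00"  # Hangul, illustrative only
if localizable_in(label):
    print("allowed; fallback display:", punycode_form(label))
```

A non-empty result from localizable_in only says that demand is plausible. It says nothing about whether the native display is safe.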


> 
> See more below for why this is a bad idea.
> 
> >Labels of Simplified Han Ideo + Hangul Syllables  cannot be
> >typed in or displayed in either of KSC5601(Korea) and  GB2312(China).
> >So they should be disallowed somewhere between IDNA,UA and registries.
> 
> The quoted statement above only says what should be allowed,
> so I don't see how it follows that combinations of simplified
> Han and Hangul should be disallowed. And there are quite a few
> simplified Han that are indistinguishable from traditional Han
> (and use just one codepoint),

You need not go that far.
Even the Hangul Jamo vowel EU looks like the CJK TC character One.
Both look like a long hyphen, and both are in KSC5601.
Even in this case, such labels should be allowed in IDNA in principle, but
had better be displayed in punycode form, since the native form is not safe.
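
As a quick check of the look-alike pair, the sketch below assumes "Jamo vowel EU" means the compatibility letter U+3161 and "One" means U+4E00, and uses Python's euc_kr codec as a stand-in for KSC5601.

```python
import unicodedata

EU = "\u3161"   # HANGUL LETTER EU (compatibility jamo), assumed code point
ONE = "\u4e00"  # CJK ideograph meaning "one", assumed code point

for ch in (EU, ONE):
    # Each encode() call succeeds only if the character exists in the charset.
    encoded = ch.encode("euc_kr")
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), encoded.hex())
```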

But most CJK SC characters cannot be represented in KSC5601, and Hangul
cannot be represented in GB2312. So there may currently be NO demand for
such a mixture.
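
The same point as a small sketch, again assuming Python's euc_kr and gb2312 codecs approximate the two charsets; the helper name encodable and the sample label are hypothetical.

```python
def encodable(text, charset):
    """True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


han_sc = "\u6c49\u8bed"   # two simplified Han characters, illustrative only
hangul = "\ud55c\uae00"   # two Hangul syllables, illustrative only
mixed = han_sc + hangul

for charset in ("euc_kr", "gb2312"):
    print(charset, encodable(han_sc, charset),
          encodable(hangul, charset), encodable(mixed, charset))
```

Neither charset can encode the mixed label, so under the rule above it shows no sign of user demand.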


> 
> 
> >Greek local charset(iso-8859-7) does not contain any cyrillic char,
> >Cyrillic local charset(iso-8859-5) does not contain any greek char.
> 
> Please do your homework and have another close look at your local
> charset, KSC 5601.
> 
> Similar to JIS X 0208 and GB 2312, it contains not only (full-width
> copies of ASCII) Latin, but also Greek and Cyrillic. Greek is
> handy in math and physics, and Russia is close to all three
> countries, and a few small alphabets didn't really take up
> too much space besides the large number of Hanzi/Kanji/Hanja/Hangul.

As you pointed out, KSC5601 includes Greek and Cyrillic characters,
but in its ***special character*** sections (not as main scripts).
So it is true that, under my previous suggestion, a Greek+Cyrillic mixture
would count as having user demand because of KSC5601.
I think special characters in a local charset should be excluded in
this context.

So I have modified my suggestion:
"If it is possible to localize an IDN label in any
  single local charset that contains the label's scripts as main scripts,
  not as special characters,
  the label should be regarded as having potential user demand and
  should be allowed,
  however many scripts it spans across.
  Even for some of those, UA can display in punycode form if their
  native form of display is not safe."
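
One way to picture the "main scripts, not special characters" condition for KSC5601 is sketched below. It rests on an assumption about the charset layout that is not stated in this thread: in EUC-KR, lead bytes from 0xB0 upward are the Hangul and Hanja rows, and lower rows hold symbols, Greek, Cyrillic, kana and jamo. The function name is hypothetical.

```python
# Assumption: EUC-KR lead bytes >= 0xB0 are the Hangul and Hanja rows;
# rows below that are treated here as "special characters".
MAIN_ROWS_FIRST_LEAD = 0xB0


def localizable_as_main_script(label):
    """True if the label encodes in euc_kr using only ASCII or main rows."""
    try:
        data = label.encode("euc_kr")
    except UnicodeEncodeError:
        return False
    i = 0
    while i < len(data):
        if data[i] < 0x80:                     # ASCII, one byte
            i += 1
        elif data[i] >= MAIN_ROWS_FIRST_LEAD:  # Hangul or Hanja, two bytes
            i += 2
        else:                                  # symbol, Greek, Cyrillic, jamo rows
            return False
    return True


greek_cyrillic = "\u03b1\u0430"  # Greek alpha + Cyrillic a, illustrative only
print(localizable_as_main_script("\ud55c\uae00"))   # Hangul label
print(localizable_as_main_script(greek_cyrillic))   # special-character rows
```

The Greek+Cyrillic label still encodes in euc_kr, but only through the special-character rows, so the modified rule does not count it as demand. Note that this cut also rejects standalone jamo such as EU, which is a choice of the sketch rather than something settled above.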

Verisign adopted similar local-charset-based registration filtering
for IDN.com around 1999/2000/2001 (confirmed), but I don't know when
Verisign lifted that restriction. Now they accept UTF-8 strings directly
as input for IDN.com.

> 
> So the idea of saying "if these appear in the same local charset,
> they must be safe" is a very dangerous one. None of these charsets
> have been tested with something as exposed to serious criminals
> as domain names, and none of these charsets has been designed
> with spoofing issues anywhere in mind.

You may have read my statement in a security context.  It is clear that
local-charset-based filtering is NOT sufficient for anti-spoofing
purposes. So I agree with you on this issue.
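
A small sketch of why charset filtering alone does not stop spoofing: a Latin label with one Cyrillic letter swapped in still encodes in euc_kr. The script guess here is deliberately crude (the first word of each Unicode character name), and the function name scripts_of is hypothetical.

```python
import unicodedata


def scripts_of(label):
    """Crude script guess: first word of each character's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in label}


spoof = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A, not Latin "a"

print(scripts_of(spoof))             # more than one script in the label
print(len(spoof.encode("euc_kr")))   # yet a single local charset encodes it
print(spoof.encode("idna"))          # ASCII form via Python's IDNA 2003 codec
```

A mixed-script check of this kind, plus display in punycode form, would have to sit on top of any charset-based filter.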

You may remember that, in the old IDN WG around 2001/2002, I repeatedly
requested that labels like "p(cyrillic a)ypal" be prohibited.

Best regards,

Soobok

> 
> Regards,    Martin.
> 
> 
> 
> #-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
> #-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst at it.aoyama.ac.jp     



More information about the Idna-update mailing list