KATS (Korean Agency for Technology and Standards)'s Comments on the Unicode Codepoints and IDNA Internet-Draft

Fri Oct 31 17:43:35 CET 2008

On 31 Oct 2008, at 16:01, Andrew Sullivan wrote:

> I believe I understand the example in this case, and I believe I  
> understand the KATS statement as well.  But none of these appear to  
> me to have answered the fundamental question I had before, which is  
> why these exceptions should be in the _protocol_.  They sound to me  
> like policy.

The answer is that the use of Jamos is unsafe. There are problems with  
normalization -- problems which cannot be fixed, as normalization is  
stabilized. There may be ways around this, via pre-processing before  
normalization or by post-processing the result of normalization, but  
no matter how you look at it, permitting BOTH jamos AND precomposed  
syllables in the protocol is just asking for trouble. That trouble is  
entirely avoidable.

> Even if only the 11k syllables "alone should be permitted in Korean  
> IDN", it does not follow that any of the codepoints that are the  
> subject of this discussion should be excluded at the protocol level.

I see no argument -- beyond devil's advocacy -- here FOR the inclusion  
of jamos at the protocol level. They are dangerous and problematic.  
They are unnecessary, since the 11K syllables are all that is needed.

> I am particularly uneasy with arguments that depend either on  
> confusability or on the way that one could have encoded these  
> characters in Unicode, if one ran the circus.

My argument does not depend on what I said about what I would do if  
the clock could be turned back. Korean is a nightmare for ordinary  
text processing because of the multiple ways Korean text can be  
represented. We all have to live with that. But we do NOT have to  
burden IDN with this hornet's nest. IDN can, and should, be simple.
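
A minimal sketch of the "multiple ways" point, assuming Python's standard
unicodedata module. The sample syllable and the names used here are
illustrative choices, not taken from the discussion, and the sketch shows
only that one visible syllable has several encodings; it does not
reproduce the specific normalization problems referred to earlier.

```python
import unicodedata

# One visible syllable, three encodings (sample chosen for illustration).
precomposed = "\uD55C"                # HANGUL SYLLABLE HAN
conjoining = "\u1112\u1161\u11AB"     # choseong hieuh + jungseong a + jongseong nieun
compatibility = "\u314E\u314F\u3134"  # compatibility jamo hieuh, a, nieun


def code_points(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# NFC composes the conjoining sequence into the precomposed syllable ...
nfc_conjoining = unicodedata.normalize("NFC", conjoining)
# ... but leaves the compatibility jamo untouched.
nfc_compat = unicodedata.normalize("NFC", compatibility)
# NFD takes the precomposed syllable back apart into conjoining jamo.
nfd_precomposed = unicodedata.normalize("NFD", precomposed)

print(code_points(nfc_conjoining))
print(code_points(nfc_compat))
print(code_points(nfd_precomposed))
```

A label restricted to precomposed syllables has exactly one of these
forms to begin with, which is the simplification being argued for.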

> This working group explicitly ruled the first of those premises out  
> in its charter.

Phishing is not the only reason for wanting to avoid jamo. The  
spoofability of Old Hangul jamo with modern jamo is, however, a valid  
concern, and not unlike some decisions which were taken regarding  
Arabic IDN.

> The working group's dependence on properties explicitly requires  
> that we accept the Unicode definitions.  Even if we think things  
> should be another way, we're not here to specify The Right Way to  
> encode the writing system of a language.

Nor did I suggest it.

> We're here to "internationalize LDH".

That doesn't mean we're here to accept into the protocol problematic  
and unnecessary things that the user community (Nota bene: The  
Government of the Republic of Korea) doesn't want to have in the  
protocol.

> That one can do nasty and unpleasant things with "iLDH" is plain.  
> One can also do such nasty things with plain LDH, although not as  
> many. This is why operators of zones need to have clear policies for  
> their zones, and why (in my opinion) we ought to be encouraging a  
> default of "disallow".

Do let's disallow Hangul jamo from the protocol, then.

> What I believe we should _not_ do is try selectively to include  
> policy in the protocol.

I don't think this is "selective". It's reactive to real and genuine  
problems with the encoding model(s) for Korean. It would be  
irresponsible for us to put our heads in the sand on principle --  
particularly when the Koreans are asking us not to. Purism in policy  
may taste nice, but does not serve well in the real world.

> For the most part, I believe the current documents have done a good  
> job in that direction.  I believe that if we begin now to DISALLOW  
> characters on the grounds of confusability or likely utility, we're  
> confusing protocol and policy.  I think if we're going to do that,  
> we should look again at the foundation principles of the current  
> work, and perhaps revisit some decisions that I, at least, had hoped  
> were closed.

I don't think you've understood it. There are reasons to disallow Old  
Hangul jamo characters because they are not used in the modern  
language AND they cause phishing problems. However, there are reasons  
to disallow ALL jamo characters because there are problems with  
various processing models. All of this can be avoided simply: allow  
only the 11K pre-composed syllables in Korean IDN. The rest are for  
historical use (like some Arabic characters which that community,  
rightly, does not want to see in IDN).
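
As a sketch of how small that rule is, assuming the "11K pre-composed
syllables" are the Hangul Syllables block U+AC00..U+D7A3 (11,172 code
points): the function name below is hypothetical, and the check covers
only the Hangul part of a label, not digits, hyphens, or any other IDNA
rule.

```python
# Hangul Syllables block: 11,172 precomposed syllables.
SYLLABLE_FIRST = 0xAC00
SYLLABLE_LAST = 0xD7A3


def is_precomposed_hangul_only(label: str) -> bool:
    """Return True if every character is a precomposed Hangul syllable.

    Hypothetical helper for illustration; it rejects conjoining jamo
    (U+1100..U+11FF) and compatibility jamo (U+3130..U+318F) simply
    because they fall outside the syllable block.
    """
    return bool(label) and all(
        SYLLABLE_FIRST <= ord(ch) <= SYLLABLE_LAST for ch in label
    )


syllable_count = SYLLABLE_LAST - SYLLABLE_FIRST + 1

print(is_precomposed_hangul_only("\uD55C\uAD6D"))        # two precomposed syllables
print(is_precomposed_hangul_only("\u1112\u1161\u11AB"))  # conjoining jamo sequence
```

A range check of this kind needs no pre-processing or post-processing
around normalization, since nothing it accepts is a jamo.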

I can't think of any reasons to allow jamo characters in the protocol.  
I don't favour trying to be blindly algorithmic about what's in it.

Best regards,
Michael Everson * http://www.evertype.com