Thai unicode code points

Kenneth Whistler kenw at sybase.com
Wed Dec 13 18:45:47 CET 2006


In response to the input from:

> On 10 dec 2006, at 02.25, Domain Guru wrote:
> 
> > I have just read:
> >
> > http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-01.txt
...
> > If these were excluded for IDNs, it would destroy the point of Thai  
> > IDNs i.e. half of all thai words wouldn't be "legal" any more.

Patrik responded:

> There are suggestions on how to make changes, but so far it seems the  
> conclusion is that the tools the Unicode Consortium have made  
> available (the triple {script, block, class}) plus bidi properties  
> are not enough for classifying codepoints in what is allowed and not.
> 
> At least this is my personal thinking.

Patrik may feel this, but I certainly would not characterize it
as a "conclusion" on this topic, or as a consensus of this list.

First of all, the tools made available by the Unicode Consortium
are not the "triple {script, block, class}" plus bidi properties,
but rather, in principle, the entire range of information about
character properties and behavior available in the Unicode
Character Database:

http://www.unicode.org/ucd/

What subset of that information is most usefully brought to bear
on the issue of which characters are appropriate in IDNA (and
StringPrep more generally) is precisely what is at issue here.

Second, IMO, the Block property should simply be dropped henceforth
from this discussion. It is not pertinent or useful, and we should
no longer be advertising for information about triples
involving {script, block, class} at all.

Third, the properties which *do* pertain to the discussion and
are relevant are "class" (= the Unicode property General_Category)
and "script" (= the Unicode property Script). Based on those two
properties we can quite easily define a meaningful and useful
repertoire for inclusion.

Adding to that the two principles for exclusion: unstable under
NFKC(cp) and unstable under casefold(cp), you end up with a
very clear and very defensible set of candidate characters for
IDNA, with the clearest road forward for consensus, with the
minimum amount of item-by-item arguing and politicization based
on potential omission of either important scripts or characters
important for scripts.
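
As a rough illustration of the two exclusion principles, here is a
minimal Python sketch. It assumes the standard library's unicodedata
module and str.casefold() as stand-ins for NFKC(cp) and casefold(cp);
the function names are hypothetical, and the results depend on the
Unicode version bundled with the Python interpreter.

```python
import unicodedata


def unstable_under_nfkc(cp: int) -> bool:
    """True if NFKC normalization changes the single code point cp."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) != ch


def unstable_under_casefold(cp: int) -> bool:
    """True if case folding changes the single code point cp."""
    ch = chr(cp)
    return ch.casefold() != ch


# A few sample code points: a Thai consonant, Thai SARA AM (which has a
# compatibility decomposition), an uppercase Latin letter, and sharp s.
for cp in (0x0E01, 0x0E33, 0x0041, 0x00DF):
    print(
        f"{cp:05X} {unicodedata.name(chr(cp))}: "
        f"NFKC-unstable={unstable_under_nfkc(cp)} "
        f"casefold-unstable={unstable_under_casefold(cp)}"
    )
```

A code point that is unstable under either test is excluded, whatever
its General_Category or Script value.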

> 
> Every time I (or someone else) present a new permutation of the  
> triple, someone else find one or more cases which does not make sense.

I consider this a reaction that is not moving us quickly toward
the right answer.

> 
> In this case of Thai, look at the latest table and come with  
> suggestions on values of the triple that should be allowed and not  
> allowed, and I will adjust the latest tables.

Mark's formulation:

0. Start with the empty set.
1. If generalCategory(cp) is [Ll, Lo, Lm, Mn, Mc], add cp
2. If NFKC(cp) != cp, remove cp
3. If casefold(cp) != cp, remove cp
4. If cp is in [-A-Z0-9], add cp

When applied under the assumption that script=Thai is not one
of the historic scripts that are candidates for removal from
the repertoire, this results in *precisely* the list that I
posted two weeks ago in:

http://www.unicode.org/~whistler/SPLlLoLmMnMcNdStableCaseNFKC.txt

And to save everyone the trouble of going there to find the
Thai entries, I will excerpt them explicitly here:

00E01 gc=Lo sc=Thai THAI CHARACTER KO KAI
00E02 gc=Lo sc=Thai THAI CHARACTER KHO KHAI
00E03 gc=Lo sc=Thai THAI CHARACTER KHO KHUAT
00E04 gc=Lo sc=Thai THAI CHARACTER KHO KHWAI
00E05 gc=Lo sc=Thai THAI CHARACTER KHO KHON
00E06 gc=Lo sc=Thai THAI CHARACTER KHO RAKHANG
00E07 gc=Lo sc=Thai THAI CHARACTER NGO NGU
00E08 gc=Lo sc=Thai THAI CHARACTER CHO CHAN
00E09 gc=Lo sc=Thai THAI CHARACTER CHO CHING
00E0A gc=Lo sc=Thai THAI CHARACTER CHO CHANG
00E0B gc=Lo sc=Thai THAI CHARACTER SO SO
00E0C gc=Lo sc=Thai THAI CHARACTER CHO CHOE
00E0D gc=Lo sc=Thai THAI CHARACTER YO YING
00E0E gc=Lo sc=Thai THAI CHARACTER DO CHADA
00E0F gc=Lo sc=Thai THAI CHARACTER TO PATAK
00E10 gc=Lo sc=Thai THAI CHARACTER THO THAN
00E11 gc=Lo sc=Thai THAI CHARACTER THO NANGMONTHO
00E12 gc=Lo sc=Thai THAI CHARACTER THO PHUTHAO
00E13 gc=Lo sc=Thai THAI CHARACTER NO NEN
00E14 gc=Lo sc=Thai THAI CHARACTER DO DEK
00E15 gc=Lo sc=Thai THAI CHARACTER TO TAO
00E16 gc=Lo sc=Thai THAI CHARACTER THO THUNG
00E17 gc=Lo sc=Thai THAI CHARACTER THO THAHAN
00E18 gc=Lo sc=Thai THAI CHARACTER THO THONG
00E19 gc=Lo sc=Thai THAI CHARACTER NO NU
00E1A gc=Lo sc=Thai THAI CHARACTER BO BAIMAI
00E1B gc=Lo sc=Thai THAI CHARACTER PO PLA
00E1C gc=Lo sc=Thai THAI CHARACTER PHO PHUNG
00E1D gc=Lo sc=Thai THAI CHARACTER FO FA
00E1E gc=Lo sc=Thai THAI CHARACTER PHO PHAN
00E1F gc=Lo sc=Thai THAI CHARACTER FO FAN
00E20 gc=Lo sc=Thai THAI CHARACTER PHO SAMPHAO
00E21 gc=Lo sc=Thai THAI CHARACTER MO MA
00E22 gc=Lo sc=Thai THAI CHARACTER YO YAK
00E23 gc=Lo sc=Thai THAI CHARACTER RO RUA
00E24 gc=Lo sc=Thai THAI CHARACTER RU
00E25 gc=Lo sc=Thai THAI CHARACTER LO LING
00E26 gc=Lo sc=Thai THAI CHARACTER LU
00E27 gc=Lo sc=Thai THAI CHARACTER WO WAEN
00E28 gc=Lo sc=Thai THAI CHARACTER SO SALA
00E29 gc=Lo sc=Thai THAI CHARACTER SO RUSI
00E2A gc=Lo sc=Thai THAI CHARACTER SO SUA
00E2B gc=Lo sc=Thai THAI CHARACTER HO HIP
00E2C gc=Lo sc=Thai THAI CHARACTER LO CHULA
00E2D gc=Lo sc=Thai THAI CHARACTER O ANG
00E2E gc=Lo sc=Thai THAI CHARACTER HO NOKHUK
00E2F gc=Lo sc=Thai THAI CHARACTER PAIYANNOI
00E30 gc=Lo sc=Thai THAI CHARACTER SARA A
00E31 gc=Mn sc=Thai THAI CHARACTER MAI HAN-AKAT
00E32 gc=Lo sc=Thai THAI CHARACTER SARA AA
00E34 gc=Mn sc=Thai THAI CHARACTER SARA I
00E35 gc=Mn sc=Thai THAI CHARACTER SARA II
00E36 gc=Mn sc=Thai THAI CHARACTER SARA UE
00E37 gc=Mn sc=Thai THAI CHARACTER SARA UEE
00E38 gc=Mn sc=Thai THAI CHARACTER SARA U
00E39 gc=Mn sc=Thai THAI CHARACTER SARA UU
00E3A gc=Mn sc=Thai THAI CHARACTER PHINTHU
00E40 gc=Lo sc=Thai THAI CHARACTER SARA E
00E41 gc=Lo sc=Thai THAI CHARACTER SARA AE
00E42 gc=Lo sc=Thai THAI CHARACTER SARA O
00E43 gc=Lo sc=Thai THAI CHARACTER SARA AI MAIMUAN
00E44 gc=Lo sc=Thai THAI CHARACTER SARA AI MAIMALAI
00E45 gc=Lo sc=Thai THAI CHARACTER LAKKHANGYAO
00E46 gc=Lm sc=Thai THAI CHARACTER MAIYAMOK
00E47 gc=Mn sc=Thai THAI CHARACTER MAITAIKHU
00E48 gc=Mn sc=Thai THAI CHARACTER MAI EK
00E49 gc=Mn sc=Thai THAI CHARACTER MAI THO
00E4A gc=Mn sc=Thai THAI CHARACTER MAI TRI
00E4B gc=Mn sc=Thai THAI CHARACTER MAI CHATTAWA
00E4C gc=Mn sc=Thai THAI CHARACTER THANTHAKHAT
00E4D gc=Mn sc=Thai THAI CHARACTER NIKHAHIT
00E4E gc=Mn sc=Thai THAI CHARACTER YAMAKKAN
00E50 gc=Nd sc=Thai THAI DIGIT ZERO
00E51 gc=Nd sc=Thai THAI DIGIT ONE
00E52 gc=Nd sc=Thai THAI DIGIT TWO
00E53 gc=Nd sc=Thai THAI DIGIT THREE
00E54 gc=Nd sc=Thai THAI DIGIT FOUR
00E55 gc=Nd sc=Thai THAI DIGIT FIVE
00E56 gc=Nd sc=Thai THAI DIGIT SIX
00E57 gc=Nd sc=Thai THAI DIGIT SEVEN
00E58 gc=Nd sc=Thai THAI DIGIT EIGHT
00E59 gc=Nd sc=Thai THAI DIGIT NINE
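
The four-step formulation can be sketched in Python roughly as
follows. Several things here are assumptions rather than part of the
original proposal: the standard library exposes General_Category but
not the Script property, so the code point range U+0E01..U+0E5B is
used as a stand-in for sc=Thai; the function and constant names are
hypothetical; and because the excerpt lists the Thai digits (gc=Nd)
while step 1 as written does not name Nd, the category set is a
parameter so that both variants can be compared.

```python
import unicodedata

# Step 1 categories exactly as written in the formulation.
STEP1_CATEGORIES = frozenset({"Ll", "Lo", "Lm", "Mn", "Mc"})
# Step 4: hyphen, A-Z and 0-9 are added back unconditionally.
STEP4_ADDITIONS = frozenset("-ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
# Assumption: this range stands in for Script=Thai, which the
# standard library cannot query directly.
THAI_RANGE = range(0x0E01, 0x0E5C)


def is_candidate(cp, categories=STEP1_CATEGORIES):
    """Apply steps 1-4 to one code point; True if it ends up in the set."""
    ch = chr(cp)
    included = unicodedata.category(ch) in categories   # step 1
    if unicodedata.normalize("NFKC", ch) != ch:          # step 2
        included = False
    if ch.casefold() != ch:                              # step 3
        included = False
    if ch in STEP4_ADDITIONS:                            # step 4
        included = True
    return included


def thai_candidates(categories=STEP1_CATEGORIES):
    return [cp for cp in THAI_RANGE if is_candidate(cp, categories)]


as_written = thai_candidates()
with_digits = thai_candidates(STEP1_CATEGORIES | {"Nd"})

# Print in the same layout as the excerpt (minus the sc= field).
for cp in with_digits:
    ch = chr(cp)
    print(f"{cp:05X} gc={unicodedata.category(ch)} {unicodedata.name(ch)}")
```

With Nd included, the Thai range yields the same code points as the
excerpt above. U+0E33 THAI CHARACTER SARA AM drops out at step 2
because it has a compatibility decomposition, and the punctuation
and currency signs never pass step 1.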

Now we can iterate this exercise for every modern script,
seeking somebody to come join this list and contribute
to:

for ( sc = firstScript; sc <= lastScript; sc++ )
{
   /* sc is substituted twice: once for the IDNs, once for the words */
   say (
 "If these were excluded for IDNs, it would destroy the point of %s "
 "IDNs i.e. half of all %s words wouldn't be \"legal\" any more.", sc, sc);
}

or we can simply delete from the repertoire the list of
historic scripts (which I also provided), finish this exercise,
and get on to the more urgent business of updating
the StringPrep protocol specification.

--Ken


