Another exception candidate: U+0F0B Tibetan tsek

Kenneth Whistler kenw at sybase.com
Thu Apr 3 01:13:09 CEST 2008


We've been discussing the two Sindhi word abbreviations,
which probably need to be added to the exception list,
so they are PVALID (instead of DISALLOWED as a result
of their Unicode General_Category value).

There is another candidate which has been discussed
a little, offline, as a result of the PAN L10n project
review of the IDNA tables, and reported by Sarmad Hussain.

That candidate is:

U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG

I'll just call it "tsek".

That is the pervasive little triangular mark seen in
Dzongkha and Tibetan text between Tibetan "stacks".
The details of how Tibetan character stacks are constructed
are not really relevant here. The important point for the
IDNAbis table definition is that the Dzongkha reviewers
(in Bhutan) have reported that tsek should be considered
PVALID for IDNs. (And this would be true for Tibetan,
as well.)

The tsek marks a boundary between stacks in the Tibetan
script, but what people need to know is that this boundary
is roughly equivalent to a syllable boundary -- and is
definitely not a regular word boundary.
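
To make the syllable-boundary point concrete, here is a minimal
Python sketch. The sample string is my own illustration (the two
syllables usually romanized "bod skad"), not something taken from the
PAN L10n review, and splitting on U+0F0B is only a rough way to show
where the tsek falls inside a single word:

```python
TSEK = "\u0f0b"  # TIBETAN MARK INTERSYLLABIC TSHEG

# Illustrative sample only (an assumption, not from the review):
# BA + vowel sign O + DA, tsek, SA + subjoined KA + DA
word = "\u0f56\u0f7c\u0f51" + TSEK + "\u0f66\u0f90\u0f51"

# The tsek sits inside the word, between its syllables.
syllables = word.split(TSEK)

print(len(syllables))    # number of syllables in the sample
print(word.count(TSEK))  # number of tsek marks inside the word
```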

If you take a look at the Tibetan (u-chen) and Dzongkha
example texts at:

http://www.omniglot.com/writing/tibetan.htm

you can see that they are littered with tseks -- and
those are considered parts of the words.

The Unicode General_Category for tsek is gc=Po (Punctuation, other),
but that is partly to reflect its syllable boundary function
and to note that it is *not* a regular Tibetan letter.
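
The property value can be checked with Python's standard unicodedata
module. This is just a quick illustration, and the result reflects
whatever Unicode version the interpreter bundles:

```python
import unicodedata

TSEK = "\u0f0b"

# Character name and General_Category from the bundled Unicode database.
name = unicodedata.name(TSEK)
category = unicodedata.category(TSEK)

print(name)      # TIBETAN MARK INTERSYLLABIC TSHEG
print(category)  # Po
```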

But for IDNs, it seems pretty clear that the expressed desire
of the Dzongkha reviewers is to have the tsek as PVALID --
because Tibetan text written without tseks is simply not
very legible to native readers. The situation is in some sense
equivalent to where we would be for English text if
English orthography *required* the use of hy-phens be-tween
syl-la-bles like that.

So I would suggest that, before the next draft of the IDN table
document is posted, Patrik consider adding U+0F0B
to the exception list, along with the two Sindhi characters
we've been discussing.
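
For anyone who wants to see the effect of an exception entry, here is
a rough Python sketch. It is not the actual table derivation: the
ALLOWED_CATEGORIES set is a simplified stand-in I made up for the
category-based rules, the names EXCEPTIONS and derived_property are
hypothetical, and only the two property values discussed here are
modeled. The point is only that the exception list is consulted
before the General_Category rule:

```python
import unicodedata

# Hypothetical exception table: code point -> derived property.
# Only U+0F0B is listed here; other exceptions are omitted.
EXCEPTIONS = {0x0F0B: "PVALID"}

# Simplified stand-in (an assumption) for the category-based rules:
# letters, marks, and decimal digits are treated as PVALID.
ALLOWED_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}


def derived_property(cp: int) -> str:
    """Return a derived property value, checking exceptions first."""
    if cp in EXCEPTIONS:
        return EXCEPTIONS[cp]
    if unicodedata.category(chr(cp)) in ALLOWED_CATEGORIES:
        return "PVALID"
    return "DISALLOWED"


print(derived_property(0x0F0B))  # tsek: taken from the exception table
print(derived_property(0x0F0D))  # TIBETAN MARK SHAD, also gc=Po
print(derived_property(0x0F40))  # TIBETAN LETTER KA, gc=Lo
```

Without the EXCEPTIONS entry, U+0F0B would fall through to the
category check and come out DISALLOWED, like the other gc=Po
characters.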

--Ken


