Label separators in Dzongkha (Re: Feedback of PAN L10n project)

Kenneth Whistler kenw at sybase.com
Wed Mar 5 18:34:15 CET 2008


Harald,

This is the result of a confusion regarding the identity
of particular characters in the Tibetan script (and a
variant style of the script used in Bhutan for writing
the Dzongkha language).

U+0F7F TIBETAN SIGN RNAM BCAD is the Tibetan form of the visarga,
used in Tibetan transliteration of Sanskrit words. It is
a combining mark with alphabetic properties, as are its
counterparts in various Indic scripts:

U+0903 DEVANAGARI SIGN VISARGA
U+0983 BENGALI SIGN VISARGA
etc., etc.

The feedback here is based on a *visual* confusion between
this visarga and a Tibetan delimiter punctuation,
U+0F14 TIBETAN MARK GTER TSHEG, which *is* a comma-like
text delimiter. The exact shape of the GTER TSHEG, as
well as other delimiting punctuation (usually with "SHAD"
or "TSHEG" in their names) may vary by style and font
for Tibetan -- and it may well be the case that glyphs
without the little horizontal bar appear in use for Dzongkha.

In any case this is a different *character* from U+0F7F.
The PVALID classification of U+0F7F is correct, and the
use of U+0F14 (or U+0F0D) as a label separator
for the Tibetan script is perfectly consistent with the
existing table: both are already category DISALLOWED.
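
The distinction between the visarga and the delimiter punctuation
can be checked against the Unicode Character Database. The following
is a minimal sketch, assuming Python's standard unicodedata module;
the helper names describe() and is_mark() are made up for this
illustration, and the reported categories depend on the Unicode
version bundled with the interpreter.

```python
import unicodedata

# Code points discussed in this message.
CODE_POINTS = [0x0F7F, 0x0903, 0x0983, 0x0F14, 0x0F0D, 0x0F0B]


def describe(cp):
    """Return (U+XXXX, character name, General_Category) for a code point."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.name(ch), unicodedata.category(ch))


def is_mark(cp):
    """True if the General_Category is one of the mark classes (Mn, Mc, Me)."""
    return unicodedata.category(chr(cp)).startswith("M")


if __name__ == "__main__":
    for cp in CODE_POINTS:
        code, name, category = describe(cp)
        kind = "mark" if is_mark(cp) else "not a mark"
        print(code, name, category, kind)
```

The visargas should come out as marks, and the TSHEG and SHAD
characters as non-marks, which is what drives their different
treatment in the table.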

I think those category determinations are correct and should not
be changed in the table.

Note, however, that this is distinct from a determination
that U+0F7F should not be used in Dzongkha domain
names. I think such a determination is perfectly
consistent with other decisions by registries to disallow
certain PVALID characters in their language tables.
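
The two layers can be pictured as two separate checks. What follows
is only a sketch under stated assumptions: the status table is a toy
subset (U+0F40 TIBETAN LETTER KA is assumed PVALID for the example),
and DZONGKHA_EXCLUDED and check_label() are hypothetical names, not
anything defined by the draft or by a real registry.

```python
# Illustrative protocol-level statuses for a few code points. The values for
# U+0F7F, U+0F14, U+0F0D and U+0F0B follow this message; U+0F40 (TIBETAN
# LETTER KA) is assumed PVALID for the sake of the example.
PROTOCOL_STATUS = {
    0x0F40: "PVALID",
    0x0F7F: "PVALID",
    0x0F0B: "DISALLOWED",
    0x0F0D: "DISALLOWED",
    0x0F14: "DISALLOWED",
}

# Hypothetical registry policy: PVALID code points that a Dzongkha
# language table chooses not to accept.
DZONGKHA_EXCLUDED = {0x0F7F}


def check_label(label):
    """Return 'ok', or a short reason naming the layer that rejected the label."""
    for ch in label:
        cp = ord(ch)
        # Code points missing from the toy table are treated as not PVALID.
        if PROTOCOL_STATUS.get(cp) != "PVALID":
            return f"rejected by protocol: U+{cp:04X}"
        # The registry may be stricter than the protocol, never looser.
        if cp in DZONGKHA_EXCLUDED:
            return f"rejected by registry table: U+{cp:04X}"
    return "ok"
```

In this sketch a label containing U+0F7F passes the protocol check
and is refused only by the registry table, so the protocol-level
table does not need to change.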

The real issue I see here (once the misidentification
of U+0F7F as delimiter punctuation is cleared up) is
that Dzongkha (and Tibetan in general) requires the use
of the TSHEG characters (U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG,
in particular). Those are mandatory marks that occur
between syllables, but within words. As such, they are
functionally similar to U+002D HYPHEN-MINUS, but far
more ubiquitous in Tibetan script than "-" is in Latin text.

The IDNFeedbackofPANL10nproject.pdf indicates that U+0F0B
should be allowed in Dzongkha domain names (and the
situation would be no different for Tibetan in general).

Currently U+0F0B is DISALLOWED. Changing that would require
an exception added to Section 2.2.2, Category F.
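
To illustrate what such an exception amounts to, here is a heavily
simplified sketch in which an exception list is consulted before
any category-based rule. The rule used here (letters, marks and
decimal digits are PVALID, everything else DISALLOWED) is a
stand-in for illustration and is not the draft's actual algorithm;
EXCEPTIONS and derived_status() are hypothetical names.

```python
import unicodedata

# Hypothetical exception list: code point -> status forced by the exception.
EXCEPTIONS = {0x0F0B: "PVALID"}


def derived_status(cp, exceptions=EXCEPTIONS):
    """Return 'PVALID' or 'DISALLOWED' for a code point (toy rules only)."""
    # Exceptions take precedence over the category-based rule.
    if cp in exceptions:
        return exceptions[cp]
    category = unicodedata.category(chr(cp))
    # Simplified stand-in rule: letters, marks and decimal digits are valid.
    if category[0] in ("L", "M") or category == "Nd":
        return "PVALID"
    return "DISALLOWED"


if __name__ == "__main__":
    for cp in (0x0F0B, 0x0F7F, 0x0F14):
        print(f"U+{cp:04X}", derived_status(cp, {}), "->", derived_status(cp))
```

Without the exception, U+0F0B falls under the punctuation rule and
is DISALLOWED; with it, U+0F0B becomes PVALID while U+0F14 is
unaffected.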

--Ken

> Thank you very much for this wide-ranging input.
> 
> There are many questions one could ask, but I'll pick one...
> you say that in Dzongkha, the character U+0F7F, which is TIBETAN SIGN 
> RNAM BCAD, should be regarded as a label separator.
> 
> This character is of Unicode class Mc (Spacing_Mark), which class 
> includes such signs as the DEVANAGARI VOWEL SIGN AA. In 
> draft-faltstrom-idnabis-tables-05, this is marked as "PVALID", which is 
> of course incompatible with its use as a separator.
> 
> Do you recommend that TIBETAN SIGN RNAM BCAD be added to the exception 
> list in section 2.2.2 of that draft, with category DISALLOWED?
> 
> This is a very serious and non-reversible step to take - if we get code 
> into browsers that checks for U+0F7F as a disallowed character, it is 
> very hard to get back to using it as a character in labels if we change 
> our minds later.


