Deprecated characters?

Thu Jul 17 21:54:18 CEST 2008

Patrik asked:

> Question: Some of the codepoints that either are, or are suggested to  
> be, deprecated are PVALID according to the tables document.
> 
> Does that create any problems?

Short answer: No.

Long answer:

This list has a history of anxiousness about Unicode instability,
free-floating from consideration of the actual characters
involved.

In this particular case, all of the characters listed in
PRI #122 (many of which will not end up deprecated, in any
case, I predict -- since this is just a list of possibles
for consideration) are already DISALLOWED for IDNA by the
latest table spec, with the exception of the following:

U+17A3 KHMER INDEPENDENT VOWEL QAQ
U+17A4 KHMER INDEPENDENT VOWEL QAA
U+17D3 KHMER SIGN BATHAMASAT

U+0953 DEVANAGARI GRAVE ACCENT
U+0954 DEVANAGARI ACUTE ACCENT

That's it -- the whole list.
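
For readers who want to check these five code points against the Unicode data shipped with their own tools, here is a minimal sketch in Python using the standard unicodedata module. The helper name describe and the EXCEPTIONS list are hypothetical, introduced only for illustration. Note that unicodedata does not expose the Deprecated property, so the sketch reports only the character name and general category for whatever Unicode version the interpreter bundles.

```python
import unicodedata

# The five code points discussed above: the PRI #122 candidates that are
# not already DISALLOWED by the table spec (list taken from this message).
EXCEPTIONS = [0x17A3, 0x17A4, 0x17D3, 0x0953, 0x0954]

def describe(cp):
    """Return (U+XXXX label, character name, general category) for a code point."""
    ch = chr(cp)
    return ("U+%04X" % cp, unicodedata.name(ch), unicodedata.category(ch))

# Print one line per character, e.g. label, name, and category.
for row in map(describe, EXCEPTIONS):
    print("%s %-32s %s" % row)
```

The Deprecated property itself would have to be read from PropList.txt in the Unicode Character Database, which is outside what this sketch covers.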

Of those, two of the Khmer characters (17A3, 17D3) are
*already* Deprecated=True in the standard, and 17A4 is
already annotated as discouraged from use in both the
code charts and in the text of the standard. I consider
it quite likely to also be designated as Deprecated=True
as a result of this PRI discussion, for consistency with
the already-deprecated U+17A3.

The two Devanagari accents were tossed into the pot by
the comment of one person on October 12, 2006. There is
nothing formally wrong with them, and the text of the
standard has never discouraged their use. They are allowed
in NFC. Given all this, even though they are of little
actual utility, I consider it unlikely that they will
end up being designated as Deprecated=True, when the UTC
makes its final decision on this.

Given that that is the likely outcome, the entire issue
boils down to 3 Khmer characters.

Of those, two (U+17A3, U+17A4) were encoded for Pali/Sanskrit
*transliteration* into Khmer, so were of historic, marginal
use anyway, even as intended. But the Cambodian feedback has
been that even in that marginal use, other characters are
preferred. U+17D3 was intended as a combining mark used in
the representation of some *very* rare historic lunar date symbols,
and even that usage has been supplanted by simply encoding
a complete set of the pre-formed symbols (none of which could
be used in IDNs, anyway) -- hence the deprecation of U+17D3.

And I guarantee you that no Khmer ccTLD registrar would ever
allow any of these three characters in a valid Khmer domain name
registration.

But that situation is basically no different from the
fact that no Iraqi ccTLD registrar is ever going
to allow Sumero-Akkadian domain name registrations, even
though by the current IDNAbis tables, all the Sumero-Akkadian
cuneiform characters are also PVALID.

Nor is it substantially different from the fact that there are a
number of obsolete Latin and Cyrillic characters that
are also PVALID but which would have preferred spellings
with other Latin or Cyrillic characters of more current
usage. None of those preferred spellings impact decisions
taken about the IDNA protocol definition or table content,
unless people here want to reopen the whole approach to
table definition in terms of generic Unicode properties and
start once again down the road of evaluating Unicode characters
for "appropriateness" in IDN usage, one-by-one based on
their historic status, degree of obsolescence, and detailed
semantics.

I trust there really is no stomach for dropping all the
progress on a table definition which can automatically be
updated for future Unicode versions based on the generic
properties already identified in idnabis-tables.txt.

If so, then the current status of U+17A3 and U+17D3 as
Deprecated=True, the likely outcome that U+17A4 will also
end up Deprecated=True, and the outside chance that
U+0953 and U+0954 will, too, have no real impact on how
the current documents for the protocol are worded,
no potential impact on the future maintenance of the table,
and no practical impact on registrar policies.

--Ken