Another exception candidate: U+0F0B Tibetan tsek

Kenneth Whistler kenw at sybase.com
Thu Apr 3 02:23:01 CEST 2008


Paul noted:

> At 4:13 PM -0700 4/2/08, Kenneth Whistler wrote:
> >So I would suggest that before the next draft of the IDN table
> >document be posted, that Patrik consider adding U+0F0B
> >to the exception list, along with the two Sindhi characters
> >we've been discussing.
> 
> Wearing my broken-record-shaped hat:
> 
> At 8:38 AM -0700 4/1/08, Paul Hoffman wrote:
> >However, we need to hear, formally or informally, from The Unicode 
> >Consortium, before we do so.

And while I can't speak formally for the Unicode Consortium,
y'all are depending on me and Mark in this discussion to let
you know what the UTC would *likely* decide in cases like this.

And what I can tell you is that it is *very* unlikely that the
UTC would change the General_Category for U+0F0B from what
it currently is, because there are many *other* applications
that have been using it the way it is for years now.

And as I stated in my note, I, at least, believe that gc=Po
is a correct value for the General_Category for tsek, anyway.
It isn't a Tibetan letter, but it does form part of Tibetan
words.

But as Mark pointed out, we have other examples like that,
even for the English orthography. "'" (U+0027 APOSTROPHE)
and "-" (U+002D HYPHEN-MINUS) are punctuation characters
(and get appropriate Unicode properties), but they form
part of English words nonetheless.
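The property values mentioned above are easy to confirm with Python's standard `unicodedata` module (a quick check, not part of the original discussion):

```python
import unicodedata

# U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG: General_Category is Po
# ("Punctuation, other"), even though it occurs inside Tibetan words.
print(unicodedata.name("\u0F0B"))      # TIBETAN MARK INTERSYLLABIC TSHEG
print(unicodedata.category("\u0F0B"))  # Po

# The English analogues: punctuation characters that nonetheless
# appear word-internally.
print(unicodedata.category("\u0027"))  # Po  (APOSTROPHE)
print(unicodedata.category("\u002D"))  # Pd  (HYPHEN-MINUS)
```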

The error here is in expecting a single partition property,
General_Category, however defined, to automagically also
provide exactly correct results for all *other*
application purposes.

The UTC has defined an entire specification, UAX #29,
"Unicode Text Segmentation", to deal with issues of how
to establish boundaries between units like words and
sentences, and even getting roughly correct results
requires extensive augmentation of General_Category
values. We have defined WordBreakProperty.txt and
SentenceBreakProperty.txt just to assist in that
algorithm, for example -- and even then, to get expected
behavior for end users, you need to override and tailor
the default behavior.
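A toy sketch of why General_Category alone is not enough for segmentation: a splitter that breaks on every punctuation or separator character mangles even English words. (This is a deliberately crude stand-in, not the UAX #29 algorithm, which uses the dedicated Word_Break property plus tailoring.)

```python
import unicodedata

def naive_words(text):
    """Split on any character whose General_Category is P* or Z*.

    A crude gc-only segmenter, for illustration; real word
    segmentation follows UAX #29.
    """
    words, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in ("P", "Z"):
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words

# "don't" is one English word, but a gc-only splitter cuts it in two,
# exactly because APOSTROPHE is (correctly) classified as punctuation:
print(naive_words("don't stop"))  # ['don', 't', 'stop']
```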

Identifiers -- and IDN labels are just a special case of that
general area of functionality -- have similar issues.
Depending on what you want the identifier to do, and
what syntax it is interacting with, you simply start
with the Unicode derived properties for identifiers 
(see UAX #31, "Unicode Identifier and Pattern Syntax"),
and *then* start establishing your rules for additions of
punctuation (like adding "_" for C or "@" for SQL) and
your constraints and exceptions.
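Python's own identifier syntax is a live instance of this pattern: the UAX #31 XID_Start/XID_Continue defaults, plus "_" as a language-specific addition. `str.isidentifier()` exposes the result:

```python
# TIBETAN LETTER KA is a letter, so it can start an identifier;
# the tsek is punctuation and is not XID_Continue, so it cannot
# continue one.
print("\u0F40".isidentifier())      # True  (U+0F40 TIBETAN LETTER KA)
print("ka\u0F0B".isidentifier())    # False (U+0F0B tsek is gc=Po)

# "_" is Python's added punctuation; "-" was not added.
print("snake_case".isidentifier())  # True
print("a-b".isidentifier())         # False
```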

> It's fine if the WG decides that the default is to put all 
> exceptional characters into the exceptions list as we discover them, 

I think that is exactly what the WG needs to do -- and is
why we are bringing up these exceptions now.

> but that's a procedural decision that needs to be made by the WG and 
> probably codified in one of the documents. A different decision is to 
> not put any of these types of exceptions into the exceptions list and 
> get the Unicode Consortium to make a change to the underlying tables 
> so that the characters don't have to be treated as exceptions.

That isn't likely to happen. As Kent indicated for the Sindhi
case, and I have just indicated for the Tibetan tsek, these
are not "errors" that the UTC is likely to change. They are
not at all like the discovery of a flat-out typo in some table
entry, for example. They are simply a few of the many, many
edge cases that result from attempting to specify a digital
encoding for all of the writing systems of the world, current
and historic.

> And 
> there are clearly points between these two poles that the WG might 
> adopt.
> 
> In case it isn't clear, I'm not in favor of the IDNA200x exceptions 
> table being used when the change could more logically be made in the 
> Unicode Standard.

I don't think that is the case for the Tibetan tsek. And it won't
generally be the case for the kinds of exceptions that people
may still dig out.

> Doing the latter means that other protocols that 
> need to make the same decisions we do can already have a cleaner base 
> to work from.

This is cart-before-horsing, IMO.

What this WG needs to decide is what the specification for IDNs
will be -- and that includes deciding what characters to
allow and not to allow.

Mark and I have been suggesting all along that you can get
most of the way to where you want to be by using the Unicode
identifier rules -- and that is roughly what Patrik is doing
now in the draft-faltstrom-idnabis-tables-05.txt document.
But the people concerned about IDNs need to decide the
uncomfy edge cases: the sharp-s, the Sindhi word abbreviations,
the Tibetan tsek, the middle dot, the Hebrew geresh and gershayim --
all the ones people like to argue about. And then, precisely
because those are *not* accounted for by generic identifier
rules, you need to write down your agreed-upon exception
list for IDNs.

What Mark and I have also been suggesting for some time is that
*if* the IDN table specification gets all nailed down, so that
it does what the WG wants it to do, including all the exceptional
cases, *then* the UTC would be happy to define and maintain
*another* Unicode character property that would precisely
define that table (including all the exceptional cases) and
would continue to derive that character property in perpetuity,
as part of each additional release of the Unicode Standard
that adds more characters.
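The derivation being described can be sketched as a generic base rule overlaid with a hand-maintained exception list. Everything here is an illustrative stand-in (the names, the exception sets, and the base rule are hypothetical), not the actual algorithm of the tables document:

```python
import unicodedata

# Characters re-admitted or removed by fiat, per WG decision.
EXCEPTIONS_ALLOW = {"\u0F0B"}  # e.g. Tibetan tsek
EXCEPTIONS_DENY = set()

def idn_allowed(ch):
    """Hypothetical derived property: base identifier rule + exceptions."""
    if ch in EXCEPTIONS_DENY:
        return False
    if ch in EXCEPTIONS_ALLOW:
        return True
    # Rough stand-in for the derived identifier properties of UAX #31:
    # letters, marks, and numbers.
    return unicodedata.category(ch)[0] in ("L", "M", "N")

print(idn_allowed("\u0F0B"))  # True, but only via the exception list
print(idn_allowed("!"))       # False
```

Once such a table is frozen, re-deriving it for each new Unicode release amounts to re-running the base rule over the new repertoire while the exception sets stay stable.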

That is what you could then rely upon in the future so that:

> other protocols that 
> need to make the same decisions we do can already have a cleaner base 
> to work from.

--Ken





