Tatweel (and Lm and Tables Section 2.1 generally)

Fri Mar 20 23:04:32 CET 2009

Ken et al,

This has been a useful discussion, if only to illustrate the difficulty of adapting the set of property assignments to serve as inputs to the rule-based form of IDNA2008.

If more tatweel-like examples arise in future script additions, we will no doubt have to find ways to accommodate them. We do need some method to ensure that DISALLOWED can be calculated by rule, so we don't have to manually fix every new table computation when Unicode updates happen.

V 

----- Original Message -----
From: idna-update-bounces at alvestrand.no <idna-update-bounces at alvestrand.no>
To: mark at macchiato.com <mark at macchiato.com>
Cc: idna-update at alvestrand.no <idna-update at alvestrand.no>; kenw at sybase.com <kenw at sybase.com>
Sent: Fri Mar 20 14:50:11 2009
Subject: Re: Tatweel (and Lm and Tables Section 2.1 generally)

Mark said:

> 4. The only reason I proposed Tatweel is that there is a fundamental
> difference: we encoded Tatweel specifically for its display effect, no other
> reason.

Let me elaborate somewhat on that.

Tatweel is *not* a letter. It is a stroke extension used in
a cursive script as part of the mechanism for line justification.
Essentially, it is a calligraphic convention. More properly,
the justification technique is "kashida", and "tatweel" is
a glyph used in this justification technique.

The reason tatweel is encoded as a character *at all*, rather than
being left entirely to the world of fonts and rendering
engines to deal with (as with most of the rest of the details
of cursive script display), is that the Arabic script has
a long history of encoding hacks to get it to work on
computers -- going back decades now. The tatweel itself
got encoded as a character very early on, because all of
those early encoding hacks depended on character cell graphics
to emulate cursive Arabic.

And if for no other reason, tatweel had to be encoded in
Unicode to interoperate with those preexisting character
sets and code pages. Even later Arabic character encodings that
assumed font responsibility for cursive joining have included
a tatweel character -- and the prime examples of that are
Windows 1256, which has 0xDC tatweel (mapped to U+0640 ARABIC
TATWEEL) and ISO 8859-6, with 0xE0 ARABIC TATWEEL.
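
Those two legacy mappings can be checked with a minimal sketch like the one below. It assumes Python's standard codecs, where `cp1256` and `iso8859_6` are Python's own names for Windows 1256 and ISO 8859-6; the variable names are just for illustration.

```python
# Decode the single legacy byte used for tatweel in each code page.
TATWEEL = "\u0640"  # U+0640 ARABIC TATWEEL

win1256_char = bytes([0xDC]).decode("cp1256")       # Windows 1256, byte 0xDC
iso8859_6_char = bytes([0xE0]).decode("iso8859_6")  # ISO 8859-6, byte 0xE0

# Print the code point each legacy byte maps to.
print(f"U+{ord(win1256_char):04X}", f"U+{ord(iso8859_6_char):04X}")
```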

So if tatweel is not a letter, but a compatibility and
interoperability character for an encoding hack, why
is it General_Category=Lm in Unicode?

Well, General_Category is a procrustean category
in the first place -- it is only a first-level approximation
to the basic category of a character, and could not even
conceivably deal with the fine-grained categorization needed
to cover every mark (and encoding hack) in every writing
system in the world throughout all recorded history, and still
do them justice. For funky edge case characters, we are stuck
with giving each one *some* category, and for a variety of stability
reasons the UTC can*not* just keep extending General_Category
with new values for every new kind of character animal that
gets added to the zoo.

So why General_Category=Lm for tatweel? Two specific reasons: first,
nothing else is a better fit. Second, it gives best behavior
for tatweel in the context of a number of algorithms and
identifiers. For example, if a tatweel occurs in the middle
of an Arabic word, you want word selection to automatically
include it, and not break around its edges. Or if someone
uses a tatweel in an identifier context where a tatweel
might conceivably make sense (e.g., as a column header id
for a database query, where someone might want justified
elements), then again, you don't want the identifier to
break at a tatweel. General_Category=Lm provides those
behaviors in the relevant algorithms and definitions, without
requiring further elaborations to make exceptions for tatweel.
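
A rough illustration of those behaviors, assuming Python's `unicodedata`, `re`, and `str.isidentifier` as stand-ins for "the relevant algorithms". Python's `\w` is only an approximation of word selection, not the Unicode text segmentation rules, and the sample word is arbitrary.

```python
import re
import unicodedata

TATWEEL = "\u0640"  # U+0640 ARABIC TATWEEL

# An Arabic word (kaf, teh, alef, beh) with a tatweel inserted mid-word.
word = "\u0643\u062A" + TATWEEL + "\u0627\u0628"

# General_Category as reported by the Unicode data bundled with Python.
category = unicodedata.category(TATWEEL)

# \w+ treats the tatweel as a word character, so the word is not split.
tokens = re.findall(r"\w+", "x " + word + " y")

# Identifier syntax also accepts a tatweel as a continuing character.
ident_ok = ("a" + TATWEEL).isidentifier()

print(category, len(tokens), ident_ok)
```

If tatweel carried a punctuation or symbol category instead, both the tokenizer and the identifier check would need a special case to get the same result.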

But to come back around to the topic at hand, does a
tatweel make sense in a DNS label? The answer is clearly no,
and Mark has made that argument. A tatweel carries no
text content per se. Having a tatweel (or two, or six)
in an Arabic word doesn't result in a different word or
a different "spelling". In a DNS context, for a domain
name, the only conceivable use for a tatweel would be for
a black hat to attempt to fool somebody else about the
identity of the label.
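
To make the "no text content" point concrete, here is a small sketch. The helper `strip_tatweel` is hypothetical and purely for illustration; it is not something the protocol defines.

```python
TATWEEL = "\u0640"  # U+0640 ARABIC TATWEEL

def strip_tatweel(label: str) -> str:
    """Hypothetical helper: remove every tatweel from a label."""
    return label.replace(TATWEEL, "")

# The same Arabic word, without and with three tatweels after the second letter.
plain = "\u0643\u062A\u0627\u0628"
stretched = "\u0643\u062A" + TATWEEL * 3 + "\u0627\u0628"

# Distinct as code point sequences, identical once the tatweels are removed.
print(plain != stretched, strip_tatweel(stretched) == plain)
```

The two strings would be different labels to the DNS, while a reader sees the same word, just stretched.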

So then the question devolves to:

Should the protocol make U+0640 ARABIC TATWEEL DISALLOWED to
begin with?

or

Should the protocol leave this up to zone administrators
to deal with by policy, as for hundreds of other similar
kinds of problems involving spoofing, etc.?

Mark has made the case for the first choice.

Frankly, in the overall context of all the potential
problems, this particular one is small potatoes, and
it will neither make nor break the protocol to do one
or the other.

But...

John Klensin responded:

> I'm more comfortable doing that if
> we can find a rule or if it is clear that the granularity (or
> some other aspect) of Unicode properties are not sufficient for
> the combination of the particular character and IDN
> applicability.

If you want a "rule" here, the rule would be:

Don't allow non-content, cursive script line justification
extending marks which happen to be encoded as characters
to be PVALID for the protocol.

(Currently that consists of exactly two characters: 
U+0640 ARABIC TATWEEL and U+07FA NKO LAJANYALAN. But there
are other historic scripts known to use kashida justification
techniques, and there are ongoing arguments about whether
to encode script-specific tatweel-analogue characters for
at least two of them, so you *might* see more in the future.)

Is that going to turn into another formal Unicode character
property in the Unicode Character Database that you could
use in the table derivation by rule, instead of having to
plug in {0640, 07FA} as a list in one of the exception
clauses? Well, no. Or rather, I surmise the likelihood
at less than 1%, so the answer might as well be no.
So you either get the right answer by adding these two
to the exception list (along with the others that the IDNA
WG, for one reason or another, has added to that list), or
you blow off the issue -- which isn't all that significant,
anyway, in the big picture -- and get on with finishing
the protocol definition.
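
For what the exception-list option amounts to, here is a deliberately simplified sketch. It is *not* the derivation in the Tables document: the category set and the function `derive_status` are assumptions made up for illustration, and the real rules include stability, case-folding and other checks that are omitted here.

```python
import unicodedata

# The two justification-extender characters named above, listed by hand.
JUSTIFICATION_EXTENDERS = {0x0640, 0x07FA}

# Assumption for this sketch only: categories treated as letters, marks, digits.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}

def derive_status(cp: int) -> str:
    """Hypothetical, simplified derivation: exceptions first, then category."""
    if cp in JUSTIFICATION_EXTENDERS:
        return "DISALLOWED"  # overrides what General_Category=Lm would give
    if unicodedata.category(chr(cp)) in LETTER_DIGIT_CATEGORIES:
        return "PVALID"
    return "DISALLOWED"

for cp in (0x0640, 0x07FA, 0x0628, 0x3005):
    print(f"U+{cp:04X}", derive_status(cp))
```

The point of checking the exception set before the category rule is that the other Lm characters keep whatever the rules give them, and only the listed code points change.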

And John Klensin further mused:

> (1) Do we need to revisit the Lm decision, either placing all of
> those characters in DISALLOWED and making exceptions where
> needed or placing all of them into CONTEXTO and assigning rules
> to specify relevant context if and when that is shown to be
> necessary? 

You *really* do not want to go there. Having this working
group trying to second-guess General_Category=Lm assignments
on a case-by-case basis will simply multiply the tatweel
discussion a dozen times over, and you will end up with
no better end result than if you just left things as they
are right now. And trying to supply CONTEXTO rules for the
use of such characters is the Achilles heel of this entire
specification right now -- adding *anything* else to that
ill-defined part of the spec can do nothing but bog it down
and *decrease* its quality, IMO.

--Ken
