Tatweel (and Lm and Tables Section 2.1 generally)

John C Klensin klensin at jck.com
Fri Mar 20 18:26:37 CET 2009



--On Thursday, March 19, 2009 13:44 -0700 Mark Davis
<mark at macchiato.com> wrote:

> I propose that we make U+0640 ( ‎ـ‎ ) ARABIC TATWEEL (aka
> kashida) be DISALLOWED, adding it to
> http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.6.
> Currently it is PVALID, but it does not carry semantics
> by any Arabic-Script orthography, and its only value is for
> spoofing.
> 
> For example: جوجل can be written with extra kashidas as
> جـوجل or as جوجـل by inserting a kashida after the
> first or third character. This is very hard for users to
> detect. We added it to Unicode for use in manual
> justification, but has no place in IDNA.
> 
> (http://en.wikipedia.org/wiki/Kashida,
> http://unicode.org/cldr/utility/character.jsp?a=0640)

Mark,

First of all, I'm sympathetic to this recommendation and note
that the ASIWG recommended to us last fall that the character be
DISALLOWED.  But I also don't want to go any further down the
path toward picking out particular characters for special
treatment than we have to.  I'm more comfortable doing that if
we can find a rule or if it is clear that the granularity (or
some other aspect) of Unicode properties is not sufficient for
the combination of the particular character and IDN
applicability.
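
To make the concern in the quoted example concrete, here is a minimal
Python sketch. The helper names (insert_tatweel, strip_tatweel) are
mine and purely illustrative, and the Punycode output is shown only to
make the point that the variants are distinct labels on the wire, not
to model any particular IDNA implementation.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida)

# The four-letter label from the quoted example: jeem, waw, jeem, lam.
base = "\u062C\u0648\u062C\u0644"


def insert_tatweel(label: str, position: int) -> str:
    """Return label with one tatweel inserted after `position` characters."""
    return label[:position] + TATWEEL + label[position:]


def strip_tatweel(label: str) -> str:
    """Return label with every tatweel removed."""
    return label.replace(TATWEEL, "")


# Tatweel after the first and after the third character, as in the quote.
variants = [insert_tatweel(base, 1), insert_tatweel(base, 3)]

for label in [base] + variants:
    code_points = " ".join(f"U+{ord(c):04X}" for c in label)
    # The built-in "punycode" codec is raw Punycode, not a full IDNA ToASCII.
    print(code_points, label.encode("punycode"))
```

All three strings render almost identically but compare unequal, and
stripping the tatweel from either variant gives back the base label.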

I also have a vague recollection of a discussion about putting
all of Lm into "contextual rule required", assigning characters
with that property value rules only if there was a substantial
reason for permitting them.   We didn't do that for, if I
recall, several reasons, but we did classify some of those
characters that way by exception.  I no longer understand some
of those exceptions or why they did not lead to others, which, of
course, is one of the reasons why rules are better than lists.
For example, we made

   U+02B9 MODIFIER LETTER PRIME

CONTEXTO (because of its effective use in historical Greek
numerals according to Tables, Appendix A.6), but treat

   U+02BA MODIFIER LETTER DOUBLE PRIME
   U+02BB MODIFIER LETTER TURNED COMMA
   U+02BC MODIFIER LETTER APOSTROPHE
   U+02BD MODIFIER LETTER REVERSED COMMA
   U+02BE MODIFIER LETTER RIGHT HALF RING
   U+02BF MODIFIER LETTER LEFT HALF RING
   U+02C0 MODIFIER LETTER GLOTTAL STOP
   U+02C1 MODIFIER LETTER REVERSED GLOTTAL STOP

as PVALID because Lm is generally acceptable, despite the
observation that some of those characters are quite
problematic in terms of confusion with punctuation, etc.
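
For anyone who wants to check the property values, a short Python
sketch follows. It relies on the standard unicodedata module, which
reflects whatever Unicode version the local Python was built with
(not necessarily the version Tables was derived from), so treat the
output as a sanity check rather than as authoritative.

```python
import unicodedata

# Tatweel plus the U+02B9..U+02C1 block discussed above.
CODE_POINTS = [0x0640] + list(range(0x02B9, 0x02C2))


def describe(cp: int) -> tuple[str, str, str]:
    """Return (U+XXXX label, General Category, character name) for cp."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.category(ch), unicodedata.name(ch))


rows = [describe(cp) for cp in CODE_POINTS]

for label, category, name in rows:
    print(label, category, name)
```

Every one of these code points should report General Category Lm,
which is why a category-based rule alone cannot tell the Greek
numeral use of U+02B9 apart from Tatweel or the apostrophe-like
characters.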

So, your suggestion about Tatweel raises two other questions for
me:

(1) Do we need to revisit the Lm decision, either placing all of
those characters in DISALLOWED and making exceptions where
needed or placing all of them into CONTEXTO and assigning rules
to specify relevant context if and when that is shown to be
necessary?  Note that the latter would permit some migration
later, just by adding rules, while the former would ban any
character for which we do not identify the need for an
exception.  On the other hand, because CONTEXTO does not require
the same lookup-time check as DISALLOWED, it is a much weaker
check than what you have suggested above (and the ASIWG
recommended, at least if I recall that recommendation).

FWIW, after skimming through the list of characters with General
Category Lm that are not mapped out by NFKC (i.e., those that
would not appear in IDNA input or that would be DISALLOWED for
other reasons), removing Lm from the LetterDigits rule of Tables
Section 2.1 and then making exceptions as needed for particular
characters seems to me to be more consistent with our supposed
"inclusion list" model than including all of them and then
excluding some on a character-by-character basis.
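
A rough Python sketch of the two alternatives, for illustration only.
It is a deliberate simplification: it models only the LetterDigits
category set from Tables Section 2.1, a single-character NFKC
stability test, and an exception list, and it ignores every other rule
in Tables. The EXCEPTIONS table and the classify function are
hypothetical names of mine, and the local Python's Unicode data may
differ from the version Tables uses.

```python
import sys
import unicodedata

# General Category values in the LetterDigits rule (Tables Section 2.1).
LETTER_DIGITS = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}

# Hypothetical exception list; only the entry discussed in this message.
EXCEPTIONS = {0x02B9: "CONTEXTO"}


def nfkc_stable(cp: int) -> bool:
    """True if the single character is unchanged by NFKC."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) == ch


def classify(cp: int, include_lm: bool) -> str:
    """Simplified classification: exceptions first, then LetterDigits."""
    if cp in EXCEPTIONS:
        return EXCEPTIONS[cp]
    categories = LETTER_DIGITS if include_lm else LETTER_DIGITS - {"Lm"}
    if nfkc_stable(cp) and unicodedata.category(chr(cp)) in categories:
        return "PVALID"
    return "DISALLOWED"  # stands in for "not admitted by this rule"


# Lm characters that survive NFKC, i.e. the ones the decision affects.
lm_stable = [
    cp
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Lm" and nfkc_stable(cp)
]

print(len(lm_stable), "NFKC-stable Lm code points in this Unicode version")
for cp in (0x0640, 0x02B9, 0x02BC):
    print(f"U+{cp:04X}", classify(cp, True), "->", classify(cp, False))
```

With Lm in the rule, Tatweel and U+02BC come out PVALID unless someone
remembers to list them; with Lm removed, they are excluded by default
and only listed exceptions such as U+02B9 get through.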

(2) If the conclusion from the above is that we should leave
Lm's treatment in Tables Section 2.1 alone, would you recommend
that we exceptionally make some or all of 02BA-02C1 DISALLOWED
along with Tatweel (and why or why not)?

   john









