Tatweel (and Lm and Tables Section 2.1 generally)
mark at macchiato.com
Fri Mar 20 20:42:03 CET 2009
I'll try to be brief.
1. Just some points of information (you may be aware of this, but for
1a. Letter Modifiers are not the same as characters having "MODIFIER LETTER"
in their name. They overlap by 143 code points, but are different by 108
code points. See:
(clicking on the Only in A, Only in B, and In both A and B takes you to the
1b. Characters that are really not used in orthographies are already
separated by property into the Sk group, which is already excluded from
IDNA2008.
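For anyone who wants to check the distinction in 1a and 1b locally, here is a rough Python sketch using the standard unicodedata module. The helper name modifier_letter_sets is made up for illustration. The counts it prints depend on the Unicode version bundled with the interpreter, so they need not match the 143/108 figures above.

```python
import sys
import unicodedata
from collections import Counter

def modifier_letter_sets():
    """Return (code points with General_Category Lm,
    code points whose character name contains 'MODIFIER LETTER')."""
    lm, named = set(), set()
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Lm":
            lm.add(cp)
        # name() takes a default for code points that have no name
        if "MODIFIER LETTER" in unicodedata.name(ch, ""):
            named.add(cp)
    return lm, named

lm, named = modifier_letter_sets()
print("in both:           ", len(lm & named))
print("only gc=Lm:        ", len(lm - named))
print("only named that way:", len(named - lm))

# General categories of the "MODIFIER LETTER" characters that are not Lm
print(Counter(unicodedata.category(chr(cp)) for cp in named - lm))
```

Tatweel itself (U+0640) lands in the "only gc=Lm" set, since its name is ARABIC TATWEEL.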
2. One could go through all of the Lm characters and do a detailed analysis
of whether each would be needed or not, but that would take sizable time and
effort, and be (of course) contentious. Unless we bring in experts it is
quite difficult to tell what the usage of particular characters is.
Moreover, Lm characters are not that much different in kind than normal
letters, of which there are some 100,000 to go through. It is quite tricky
to determine modern usage: cf
3. We've always known that IDNA2008 simply cannot solve the visual
confusability problem. Checking the Lm characters would be only nibbling at
the edges; it really doesn't touch 99.99% of the cases. And trying to do this
in the protocol is a very heavy hammer - browsers and other clients need to
do far more sophisticated contextual analysis than what the protocol can (or
should) supply; they also have far more information available, such as the
user's display language. Moreover, they can take action that the protocol
can't, like warning about suspicious-looking URLs while still allowing
users who want to continue to do so.
4. The only reason I proposed Tatweel is that there is a fundamental
difference: we encoded Tatweel specifically for its display effect, no other
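To make the Tatweel case concrete, a small Python sketch follows. It uses the example word from the original proposal quoted below. The function name label_has_tatweel is hypothetical, and a plain reject-on-sight check is only an illustration of what DISALLOWED would amount to for this one character, not the IDNA2008 algorithm itself.

```python
import unicodedata

TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida)

def label_has_tatweel(label: str) -> bool:
    """Hypothetical check: True if the label contains U+0640."""
    return TATWEEL in label

# The word from the proposal, spelled with escapes so the extra
# character is visible in the source: jeem, waw, jeem, lam.
plain = "\u062C\u0648\u062C\u0644"
after_first = "\u062C" + TATWEEL + "\u0648\u062C\u0644"
after_third = "\u062C\u0648\u062C" + TATWEEL + "\u0644"

for label in (plain, after_first, after_third):
    # Normalize first to see whether NFKC changes anything for Tatweel.
    nfkc = unicodedata.normalize("NFKC", label)
    print([f"U+{ord(c):04X}" for c in nfkc], label_has_tatweel(nfkc))

# General category of Tatweel as reported by the local Unicode data
print(unicodedata.category(TATWEEL))
```

The three labels are distinct code point sequences even though they render almost identically, and normalization alone does not collapse them.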
On Fri, Mar 20, 2009 at 10:26, John C Klensin <klensin at jck.com> wrote:
> --On Thursday, March 19, 2009 13:44 -0700 Mark Davis
> <mark at macchiato.com> wrote:
> > I propose that we make U+0640 ( ـ ) ARABIC TATWEEL (aka
> > kashida) be DISALLOWED, adding it to
> > http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.6.
> > Currently it is PVALID, but it does not carry semantics
> > by any Arabic-Script orthography, and its only value is for
> > spoofing.
> > For example: جوجل can be written with extra kashidas as
> > جـوجل or as جوجـل by inserting a kashida after the
> > first or third character. This is very hard for users to
> > detect. We added it to Unicode for use in manual
> > justification, but it has no place in IDNA.
> > (http://en.wikipedia.org/wiki/Kashida,
> > http://unicode.org/cldr/utility/character.jsp?a=0640)
> First of all, I'm sympathetic to this recommendation and note
> that the ASIWG recommended to us last fall that the character be
> DISALLOWED. But I also don't want to go any further down the
> path toward picking out particular characters for special
> treatment than we have to. I'm more comfortable doing that if
> we can find a rule or if it is clear that the granularity (or
> some other aspect) of Unicode properties is not sufficient for
> the combination of the particular character and IDN
> I also have a vague recollection of a discussion about putting
> all of Lm into "contextual rule required", assigning characters
> with that property value rules only if there was a substantial
> reason for permitting them. We didn't do that for, if I
> recall, several reasons, but we did classify some of those
> characters that way by exception. I no longer understand some
> of those exceptions and why they did not cause others, which, of
> course, is one of the reasons why rules are better than lists.
> For example, we made
> U+02B9 MODIFIER LETTER PRIME
> CONTEXTO (because of its effective use in historical Greek
> numerals according to Tables, Appendix A.6), but treat
> 02BA;MODIFIER LETTER DOUBLE PRIME
> 02BB;MODIFIER LETTER TURNED COMMA
> 02BC;MODIFIER LETTER APOSTROPHE
> 02BD;MODIFIER LETTER REVERSED COMMA
> 02BE;MODIFIER LETTER RIGHT HALF RING
> 02BF;MODIFIER LETTER LEFT HALF RING
> 02C0;MODIFIER LETTER GLOTTAL STOP
> 02C1;MODIFIER LETTER REVERSED GLOTTAL STOP
> as PVALID because Lm is generally acceptable, despite the
> observation that some of those characters are quite
> problematic in terms of confusion with punctuation, etc.
> So, your suggestion about Tatweel raises two other questions for
> (1) Do we need to revisit the Lm decision, either placing all of
> those characters in DISALLOWED and making exceptions where
> needed or placing all of them into CONTEXTO and assigning rules
> to specify relevant context if and when that is shown to be
> necessary? Note that the latter would permit some migration
> later, just by adding rules, while the former would ban any
> character for which we do not identify the need for an
> exception. On the other hand, because CONTEXTO does not require
> the same lookup-time check as DISALLOWED, it is a much weaker
> check than what you have suggested above (and the ASIWG
> recommended, at least if I recall that recommendation).
> FWIW, after skimming through the list of characters with General
> Category Lm that are not mapped out by NFKC (i.e., those that
> would not appear in IDNA input or that would be DISALLOWED for
> other reasons), removing Lm from the LetterDigits rule of Tables
> Section 2.1 and then making exceptions as needed for particular
> characters seems to me to be more consistent with our supposed "inclusion
> list" model than including all of them and then excluding some
> on a character-by-character basis.
> (2) If the conclusion from the above is that we should leave
> Lm's treatment in Tables Section 2.1 alone, would you recommend
> that we exceptionally make some or all of 02BA-02C1 DISALLOWED
> along with Tatweel (and why or why not)?