Tatweel

Mark Davis mark at macchiato.com
Fri Mar 20 18:19:56 CET 2009


Mark


On Fri, Mar 20, 2009 at 05:19, Vint Cerf <vint at google.com> wrote:

> ...
>
> Mark,
>
> One of the many concerns I have heard raised on this list relates to
> character-by-character assessment of Unicode as it applies to IDNs. I think
> few people wish to produce IDNA tables that way. I don't dispute your
> reasoning to exclude (I don't know enough about Arabic to do so) but I am
> wondering whether there is a way to do this that is rule-based or context
> based or something that exercises the mechanisms of IDNA2008?


Note that we already have a number of such exceptional characters: those
singled out in (F),
http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.6.
Categories D, H, and I are also really exceptional. They happen to be
describable with properties, but their inclusion rests on reasons that have
nothing to do with the meaning of those properties.

However, this was a very good suggestion. Using properties often reveals
other cases, and it does so here. The intersection of two properties
(Extender and Joining_Type=Join_Causing) picks out Tatweel plus one other
character that is likely to behave the same way (I need to get confirmation
of this).

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[:Extender=True:]%26[:Joining_Type=Join_Causing:]]

Here is the result, for those whose emailers don't support links:
Arabic - *Based on ISO 8859-6*
   U+0640 ( ‎ـ‎ ) ARABIC TATWEEL
   <http://unicode.org/cldr/utility/character.jsp?a=0640>
NKo - *Letter extender*
   U+07FA ( ‎ߺ‎ ) NKO LAJANYALAN
   <http://unicode.org/cldr/utility/character.jsp?a=07FA>
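
For readers who want to see the intersection mechanically, here is a minimal Python sketch. The two sets are small hand-picked excerpts of the Extender and Join_Causing properties, chosen for illustration only (an assumption; Python's standard library does not expose either property, and the full values are in the Unicode Character Database). It shows the shape of the query, not a replacement for it.

```python
import unicodedata

# Illustrative excerpts only, NOT the complete property values.
# Extender=True excerpt: MIDDLE DOT, TATWEEL, NKO LAJANYALAN,
# IDEOGRAPHIC ITERATION MARK, KATAKANA-HIRAGANA PROLONGED SOUND MARK.
EXTENDER_EXCERPT = {0x00B7, 0x0640, 0x07FA, 0x3005, 0x30FC}
# Joining_Type=Join_Causing excerpt: TATWEEL, NKO LAJANYALAN, ZERO WIDTH JOINER.
JOIN_CAUSING_EXCERPT = {0x0640, 0x07FA, 0x200D}


def intersect(a, b):
    """Return the code points present in both sets, in code point order."""
    return sorted(a & b)


for cp in intersect(EXTENDER_EXCERPT, JOIN_CAUSING_EXCERPT):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```

With the real property data in place of the excerpts, the same intersection would be the rule-based description of the characters to exclude.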

I'll also propose these for Table 4 in
http://www.unicode.org/reports/tr31/#Specific_Character_Adjustments

FYI: the following characters that are added by Tables (F)
http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.6 are not
in the candidates in Table 3 of UAX #31. As I recall, all of them are
somewhat dubious.

Greek And Coptic - *Numeral signs*
   U+0375 ( ͵ ) GREEK LOWER NUMERAL SIGN
   <http://unicode.org/cldr/utility/character.jsp?a=0375>
Arabic - *Signs for Sindhi*
   U+06FD ( ‎۽‎ ) ARABIC SIGN SINDHI AMPERSAND
   <http://unicode.org/cldr/utility/character.jsp?a=06FD>
   U+06FE ( ‎۾‎ ) ARABIC SIGN SINDHI POSTPOSITION MEN
   <http://unicode.org/cldr/utility/character.jsp?a=06FE>
Tibetan - *Marks and signs*
   U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG
   <http://unicode.org/cldr/utility/character.jsp?a=0F0B>
Katakana - *Conjunction and length marks*
   U+30FB ( ・ ) KATAKANA MIDDLE DOT
   <http://unicode.org/cldr/utility/character.jsp?a=30FB>

Patrik,

The listing of characters in
http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.6 could be
improved to make it clear what is going on. As it is, it maps characters to
PVALID, CONTEXTO, and/or DISALLOWED in a single list. It would be handier to
have separate sections in F, grouped by the assigned value and by the value
each character would otherwise have received. This is no substantive change,
but it makes the table easier to understand:

PVALID: // would otherwise have been DISALLOWED

   00DF; PVALID     # LATIN SMALL LETTER SHARP S
   03C2; PVALID     # GREEK SMALL LETTER FINAL SIGMA
   06FD; PVALID     # ARABIC SIGN SINDHI AMPERSAND
   06FE; PVALID     # ARABIC SIGN SINDHI POSTPOSITION MEN
   0F0B; PVALID     # TIBETAN MARK INTERSYLLABIC TSHEG
   3007; PVALID     # IDEOGRAPHIC NUMBER ZERO

CONTEXTO: // would otherwise have been DISALLOWED
   00B7; CONTEXTO   # MIDDLE DOT
   0375; CONTEXTO   # GREEK LOWER NUMERAL SIGN (KERAIA)
   05F3; CONTEXTO   # HEBREW PUNCTUATION GERESH
   05F4; CONTEXTO   # HEBREW PUNCTUATION GERSHAYIM
   30FB; CONTEXTO   # KATAKANA MIDDLE DOT
CONTEXTO: // would otherwise have been PVALID
   002D; CONTEXTO   # HYPHEN-MINUS
   02B9; CONTEXTO   # MODIFIER LETTER PRIME
   0660; CONTEXTO   # ARABIC-INDIC DIGIT ZERO
   0661; CONTEXTO   # ARABIC-INDIC DIGIT ONE
   0662; CONTEXTO   # ARABIC-INDIC DIGIT TWO
   0663; CONTEXTO   # ARABIC-INDIC DIGIT THREE
   0664; CONTEXTO   # ARABIC-INDIC DIGIT FOUR
   0665; CONTEXTO   # ARABIC-INDIC DIGIT FIVE
   0666; CONTEXTO   # ARABIC-INDIC DIGIT SIX
   0667; CONTEXTO   # ARABIC-INDIC DIGIT SEVEN
   0668; CONTEXTO   # ARABIC-INDIC DIGIT EIGHT
   0669; CONTEXTO   # ARABIC-INDIC DIGIT NINE
   06F0; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT ZERO
   06F1; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT ONE
   06F2; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT TWO
   06F3; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT THREE
   06F4; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT FOUR
   06F5; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT FIVE
   06F6; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT SIX
   06F7; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT SEVEN
   06F8; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT EIGHT
   06F9; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT NINE
   0483; CONTEXTO   # COMBINING CYRILLIC TITLO
   3005; CONTEXTO   # IDEOGRAPHIC ITERATION MARK
   303B; CONTEXTO   # VERTICAL IDEOGRAPHIC ITERATION MARK

DISALLOWED: // would otherwise have been PVALID
   302E; DISALLOWED # HANGUL SINGLE DOT TONE MARK
   302F; DISALLOWED # HANGUL DOUBLE DOT TONE MARK
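
The regrouping suggested above could be generated rather than maintained by hand. The following is only a sketch under stated assumptions: the row layout (code point, exception value, value it would otherwise have, name) and the helper names `group_rows` and `render` are hypothetical, and the "would otherwise" values are copied from the listing above rather than computed from the derivation rules.

```python
from collections import defaultdict

# (code point, exception value, value it would otherwise get, name).
# A few rows copied from the listing above; not the full table.
ROWS = [
    (0x00DF, "PVALID", "DISALLOWED", "LATIN SMALL LETTER SHARP S"),
    (0x00B7, "CONTEXTO", "DISALLOWED", "MIDDLE DOT"),
    (0x002D, "CONTEXTO", "PVALID", "HYPHEN-MINUS"),
    (0x0660, "CONTEXTO", "PVALID", "ARABIC-INDIC DIGIT ZERO"),
    (0x302E, "DISALLOWED", "PVALID", "HANGUL SINGLE DOT TONE MARK"),
]


def group_rows(rows):
    """Group rows by (exception value, value they would otherwise have)."""
    groups = defaultdict(list)
    for cp, value, default, name in rows:
        groups[(value, default)].append((cp, name))
    return dict(groups)


def render(groups):
    """Format each group as a headed section in the table's own layout."""
    lines = []
    for (value, default), entries in groups.items():
        lines.append(f"{value}: // would otherwise have been {default}")
        for cp, name in entries:
            lines.append(f"   {cp:04X}; {value:<10} # {name}")
        lines.append("")
    return "\n".join(lines)


print(render(group_rows(ROWS)))
```

Keying on the pair rather than on the value alone is what separates the two CONTEXTO sections.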




>
> vint
>
>
>
> Vint Cerf
> Google
> 1818 Library Street, Suite 400
> Reston, VA 20190
> 202-370-5637
> vint at google.com
>
>
>
>
>
> On Mar 20, 2009, at 7:00 AM, Alireza Saleh wrote:
>
>> I don't see why we should not just let the registry have the authority
>> to do this. If you want to disallow this at the protocol level, you
>> should also consider disallowing LOW LINE (U+005F) and
>> HYPHEN-MINUS (U+002D), because these also have the same shape as Tatweel,
>> especially when they come between non-joining characters. My
>> opinion is to limit protocol prohibitions to absolutely necessary cases.
>>
>> Alireza
>>
>> Mark Davis wrote:
>>
>>> I propose that we make U+0640 ( ‎ـ‎ ) ARABIC TATWEEL (aka kashida) be
>>> DISALLOWED, adding it to
>>> http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.6.
>>> Currently it is PVALID, but it does not carry semantics in any
>>> Arabic-script orthography, and its only value is for spoofing.
>>>
>>> For example: جوجل can be written with extra kashidas as جـوجل or as
>>> جوجـل by inserting a kashida after the first or third character. This
>>> is very hard for users to detect. We added it to Unicode for use in
>>> manual justification, but it has no place in IDNA.
>>>
>>> (http://en.wikipedia.org/wiki/Kashida,
>>> http://unicode.org/cldr/utility/character.jsp?a=0640)
>>>
>>> Mark
>>> _______________________________________________
>>> Idna-update mailing list
>>> Idna-update at alvestrand.no
>>> http://www.alvestrand.no/mailman/listinfo/idna-update
>>>
>>>
>
>
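
The spoofing concern in the quoted proposal can be made concrete with a short Python sketch. The helper names `strip_tatweel` and `is_tatweel_variant` are hypothetical and are not part of any IDNA implementation; the check only shows that the two spellings of the example label differ by nothing but U+0640.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida)


def strip_tatweel(label):
    """Return the label with every tatweel removed."""
    return label.replace(TATWEEL, "")


def is_tatweel_variant(a, b):
    """True if two different labels become equal once tatweel is removed."""
    return a != b and strip_tatweel(a) == strip_tatweel(b)


# The example from the proposal, written with escapes to avoid bidi issues.
plain = "\u062C\u0648\u062C\u0644"          # jeem, waw, jeem, lam
padded = "\u062C\u0640\u0648\u062C\u0644"   # tatweel after the first letter

print(is_tatweel_variant(plain, padded))
```

A registry could run a check like this at registration time, which is the alternative raised in the quoted thread; making the character DISALLOWED removes the need for it at the protocol level.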