Tatweel
Mark Davis
mark at macchiato.com
Fri Mar 20 18:19:56 CET 2009
Mark
On Fri, Mar 20, 2009 at 05:19, Vint Cerf <vint at google.com> wrote:
> ...
>
> Mark,
>
> One of the many concerns I have heard raised on this list relates to
> character-by-character assessment of Unicode as it applies to IDNs. I think
> few people wish to produce IDNA tables that way. I don't dispute your
> reasoning to exclude (I don't know enough about Arabic to do so) but I am
> wondering whether there is a way to do this that is rule-based or context
> based or something that exercises the mechanisms of IDNA2008?
Note that we already have a number of such exceptional characters: those
singled out in
(F) http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.6.
Categories D, H, and I are also really exceptional. They happen to be
describable with properties, but their inclusion rests on reasons other
than the meaning of those properties.
However, this was a very good suggestion. Using properties often reveals
other cases, and it does so in this case. The intersection of two
properties picks out that character plus another that is likely to behave
the same way (I need to get confirmation of this).
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[:Extender=True:]%26[:Joining_Type=Join_Causing:]]
Here is the result, for those whose emailers don't support links:
Arabic - *Based on ISO 8859-6*
U+0640 <http://unicode.org/cldr/utility/character.jsp?a=0640> ( ـ ) ARABIC TATWEEL
NKo - *Letter extender*
U+07FA <http://unicode.org/cldr/utility/character.jsp?a=07FA> ( ߺ ) NKO LAJANYALAN
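For anyone who wants to reproduce this kind of intersection offline, here is a minimal Python sketch. It assumes the usual UCD file syntax (`codepoint-or-range ; value # comment`). The two sample strings are small hand-copied excerpts for illustration, not the full data files, and the names `PROPLIST_SAMPLE`, `JOINING_SAMPLE`, and `parse_property` are hypothetical.

```python
import unicodedata

# Illustrative excerpts in UCD syntax (assumption: hand-picked lines,
# not the complete PropList.txt / DerivedJoiningType.txt files).
PROPLIST_SAMPLE = """\
00B7          ; Extender # Po       MIDDLE DOT
0640          ; Extender # Lm       ARABIC TATWEEL
07FA          ; Extender # Lm       NKO LAJANYALAN
3005          ; Extender # Lm       IDEOGRAPHIC ITERATION MARK
30FC..30FE    ; Extender # Lm   [3] KATAKANA-HIRAGANA PROLONGED SOUND MARK..KATAKANA VOICED ITERATION MARK
"""

JOINING_SAMPLE = """\
0628          ; D # Lo       ARABIC LETTER BEH
0640          ; C # Lm       ARABIC TATWEEL
07FA          ; C # Lm       NKO LAJANYALAN
200D          ; C # Cf       ZERO WIDTH JOINER
"""


def parse_property(text, wanted):
    """Return the set of code points whose property value equals `wanted`."""
    result = set()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        cp_range, value = (field.strip() for field in data.split(";"))
        if value != wanted:
            continue
        first, _, last = cp_range.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        result.update(range(start, end + 1))
    return result


extenders = parse_property(PROPLIST_SAMPLE, "Extender")
join_causing = parse_property(JOINING_SAMPLE, "C")  # C = Join_Causing

# Set intersection plays the role of [[:Extender=True:]&[:Joining_Type=Join_Causing:]]
both = sorted(extenders & join_causing)
for cp in both:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```

Run against the complete data files for a given Unicode version, the same intersection may pick up other characters, so the sample output should not be read as the definitive list.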
I'll also propose these for Table 4 in
http://www.unicode.org/reports/tr31/#Specific_Character_Adjustments
FYI: the following characters that are added by Tables (F)
http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.6 are not
in the candidates in Table 3 of UAX #31. As I recall, all of them are
somewhat dubious.
Greek And Coptic - *Numeral signs*
U+0375 <http://unicode.org/cldr/utility/character.jsp?a=0375> ( ͵ ) GREEK LOWER NUMERAL SIGN
Arabic - *Signs for Sindhi*
U+06FD <http://unicode.org/cldr/utility/character.jsp?a=06FD> ( ۽ ) ARABIC SIGN SINDHI AMPERSAND
U+06FE <http://unicode.org/cldr/utility/character.jsp?a=06FE> ( ۾ ) ARABIC SIGN SINDHI POSTPOSITION MEN
Tibetan - *Marks and signs*
U+0F0B <http://unicode.org/cldr/utility/character.jsp?a=0F0B> ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG
Katakana - *Conjunction and length marks*
U+30FB <http://unicode.org/cldr/utility/character.jsp?a=30FB> ( ・ ) KATAKANA MIDDLE DOT
Patrik,
The listing of characters in
http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.6 could be
improved to make it clear what is going on. As it is, it maps characters to
PVALID, CONTEXTO, and/or DISALLOWED. It would be handier to have different
sections in F. This is no substantive change, but it makes the table easier
to understand:
*PVALID: // would otherwise have been DISALLOWED*
00DF; PVALID # LATIN SMALL LETTER SHARP S
03C2; PVALID # GREEK SMALL LETTER FINAL SIGMA
06FD; PVALID # ARABIC SIGN SINDHI AMPERSAND
06FE; PVALID # ARABIC SIGN SINDHI POSTPOSITION MEN
0F0B; PVALID # TIBETAN MARK INTERSYLLABIC TSHEG
3007; PVALID # IDEOGRAPHIC NUMBER ZERO
*CONTEXTO: // would otherwise have been DISALLOWED*
00B7; CONTEXTO # MIDDLE DOT
0375; CONTEXTO # GREEK LOWER NUMERAL SIGN (KERAIA)
05F3; CONTEXTO # HEBREW PUNCTUATION GERESH
05F4; CONTEXTO # HEBREW PUNCTUATION GERSHAYIM
30FB; CONTEXTO # KATAKANA MIDDLE DOT
*CONTEXTO: // would otherwise have been PVALID*
002D; CONTEXTO # HYPHEN-MINUS
02B9; CONTEXTO # MODIFIER LETTER PRIME
0660; CONTEXTO # ARABIC-INDIC DIGIT ZERO
0661; CONTEXTO # ARABIC-INDIC DIGIT ONE
0662; CONTEXTO # ARABIC-INDIC DIGIT TWO
0663; CONTEXTO # ARABIC-INDIC DIGIT THREE
0664; CONTEXTO # ARABIC-INDIC DIGIT FOUR
0665; CONTEXTO # ARABIC-INDIC DIGIT FIVE
0666; CONTEXTO # ARABIC-INDIC DIGIT SIX
0667; CONTEXTO # ARABIC-INDIC DIGIT SEVEN
0668; CONTEXTO # ARABIC-INDIC DIGIT EIGHT
0669; CONTEXTO # ARABIC-INDIC DIGIT NINE
06F0; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT ZERO
06F1; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT ONE
06F2; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT TWO
06F3; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT THREE
06F4; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT FOUR
06F5; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT FIVE
06F6; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT SIX
06F7; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT SEVEN
06F8; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT EIGHT
06F9; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT NINE
0483; CONTEXTO # COMBINING CYRILLIC TITLO
3005; CONTEXTO # IDEOGRAPHIC ITERATION MARK
303B; CONTEXTO # VERTICAL IDEOGRAPHIC ITERATION MARK
*DISALLOWED: // would otherwise have been PVALID*
302E; DISALLOWED # HANGUL SINGLE DOT TONE MARK
302F; DISALLOWED # HANGUL DOUBLE DOT TONE MARK
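To illustrate that the regrouping is purely presentational, here is a small Python sketch that reads lines in the `code point; VALUE # NAME` form used above and groups them by value. The sample is a short excerpt of the entries listed above, and the names `EXCEPTIONS_SAMPLE`, `parse_exceptions`, and `group_by_value` are hypothetical. Splitting by what the value "would otherwise have been" is not attempted here, since that needs the full derivation rules from the tables draft.

```python
from collections import defaultdict

# Excerpt of the exceptions listed above (assumption: a subset, for illustration).
EXCEPTIONS_SAMPLE = """\
00DF; PVALID     # LATIN SMALL LETTER SHARP S
03C2; PVALID     # GREEK SMALL LETTER FINAL SIGMA
00B7; CONTEXTO   # MIDDLE DOT
002D; CONTEXTO   # HYPHEN-MINUS
302E; DISALLOWED # HANGUL SINGLE DOT TONE MARK
302F; DISALLOWED # HANGUL DOUBLE DOT TONE MARK
"""


def parse_exceptions(text):
    """Map each code point to a (value, name) pair."""
    table = {}
    for line in text.splitlines():
        data, _, comment = line.partition("#")
        if not data.strip():
            continue
        cp_hex, value = (field.strip() for field in data.split(";"))
        table[int(cp_hex, 16)] = (value, comment.strip())
    return table


def group_by_value(table):
    """Group code points by derived value, sorted within each group."""
    groups = defaultdict(list)
    for cp in sorted(table):
        groups[table[cp][0]].append(cp)
    return dict(groups)


exceptions = parse_exceptions(EXCEPTIONS_SAMPLE)
groups = group_by_value(exceptions)

# Print one section per value, in the same line format as the input.
for value, cps in groups.items():
    print(f"{value}:")
    for cp in cps:
        print(f"  {cp:04X}; {value} # {exceptions[cp][1]}")
```

The per-code-point mapping in `exceptions` is the same whichever way the sections are printed, which is the sense in which the change is not substantive.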
>
> vint
>
>
>
> Vint Cerf
> Google
> 1818 Library Street, Suite 400
> Reston, VA 20190
> 202-370-5637
> vint at google.com
>
>
>
>
>
> On Mar 20, 2009, at 7:00 AM, Alireza Saleh wrote:
>
>> I don't see why we should not just let the registry have the authority
>> to do this? If you want to disallow this at the protocol level, you
>> should also consider disallowing LOW LINE (U+005F) and
>> HYPHEN-MINUS (U+002D), because these also have the same shape as Tatweel,
>> especially when they come between non-joining characters. My
>> opinion is to limit protocol prohibitions to absolutely necessary cases.
>>
>> Alireza
>>
>> Mark Davis wrote:
>>
>>> I propose that we make U+0640 ( ـ ) ARABIC TATWEEL (aka kashida) be
>>> DISALLOWED, adding it to
>>> http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#section-2.6.
>>> Currently it is PVALID, but it does not carry semantics in any
>>> Arabic-script orthography, and its only value is for spoofing.
>>>
>>> For example: جوجل can be written with extra kashidas as جـوجل or as
>>> جوجـل by inserting a kashida after the first or third character. This
>>> is very hard for users to detect. We added it to Unicode for use in
>>> manual justification, but it has no place in IDNA.
>>>
>>> (http://en.wikipedia.org/wiki/Kashida,
>>> http://unicode.org/cldr/utility/character.jsp?a=0640)
>>>
>>> Mark
>>> _______________________________________________
>>> Idna-update mailing list
>>> Idna-update at alvestrand.no
>>> http://www.alvestrand.no/mailman/listinfo/idna-update