consensus Call: TATWEEL

Vint Cerf vint at google.com
Thu Mar 26 12:54:00 CET 2009


Alireza,

I am pretty sure that the IDNABIS WG does not think that all visual  
confusion problems can be solved at the protocol level. However, it  
has taken the decision that characters identified as symbols not used  
in written language should be excluded, for instance. The IDNA2008  
stance so far has been to eliminate characters that do not appear to  
serve a useful purpose in the formation of domain names (a restricted  
class of "writing"). As nearly as I can tell, Tatweel provokes a  
fairly strong, if rough, consensus that it serves no useful purpose  
for domain names and should be disallowed.

There is still a great deal of dependence on the discretion of  
registries (in the general sense of the word, at all levels of domain  
name labels) for protective restrictions on the use of some (many?)  
allowed characters. I think the sense of the working group is that
protocol-level prohibition serves the interests of the Internet's
domain name users where the ban has consensus, which we appear to
have here, possibly with a few dissenting views, yours among them, it
would seem.

vint



Vint Cerf
Google
1818 Library Street, Suite 400
Reston, VA 20190
202-370-5637
vint at google.com




On Mar 26, 2009, at 6:47 AM, Alireza Saleh wrote:

> Dear Kenneth,
>
> Thanks for the description, but I still don't understand why this
> character has been coded as an Arabic letter. As long as it is coded
> that way, it should be in the protocol. If IDNA2008 wants to be
> independent of any specific version of Unicode, it shouldn't make
> decisions on a character-by-character basis. Maybe Unicode 5.2 will
> add a character with the same characteristics as Tatweel, and then we
> would have to update the protocol documents again.
>
> Besides, I still haven't heard any arguments other than visual
> confusion. As you know, many such visual confusions still remain
> within the Arabic script and are expected to be handled at the
> registry. I'm not arguing for permitting Tatweel as such; what I'm
> arguing is that the way of making decisions should be changed.
> If the IDNAbis working group thinks that all these problems
> should be handled at the protocol level, then please go ahead and
> resolve them ALL; a long list of confusions more dangerous than
> Tatweel can be sent to the group. If not, better to leave them
> all to be resolved at the registry. The registry rules may only be
> effective for labels that are registered within the registry, but
> please note that the domain owner can create a confusing sub-label as
> well as a confusing URI, with which IDNAbis has nothing to do if it
> comes after the slash.
>
> Best
> Alireza
>
>
>
>
> Kenneth Whistler wrote:
>> Alireza asked:
>>
>>
>>> Why do you think they are very unlike, other than that it has been
>>> used for many years in DNS?
>>>
>>> /\lireza
>>>
>>> Kenneth Whistler wrote:
>>>
>>>>> Hyphen and tatweel are very unlike.
>>>>>
>>>>>
>>>> I agree. Which is why hyphen (U+002D) does (and must
>>>> continue to) occur in domain names, and why U+0640
>>>> ARABIC TATWEEL shouldn't.
>>>>
>>
>>
>> Well, let me count the ways. ;-)
>>
>> U+002D HYPHEN-MINUS
>>
>> 1. Is in the ASCII subset, which has all kinds of implications
>>   for grandfathered usage in protocols, syntax, etc.
>>
>> 2. Is ambiguous between usage as a punctuation mark (hyphen)
>>   and a mathematical unary operator (minus sign).
>>
>> 3. May make content distinctions in some orthographies, both
>>   lexically and/or syntactically.
>>
>> 4. Has the Line_Break property lb=HY, with implications for
>>   hyphenation and line breaking behavior.
>>
>> 5. Has the Word_Break property wb=Other, so by default will
>>   mark a word break boundary.
>>
>> 6. Is Common script, used with many scripts besides Latin.
>>
>> 7. Is General_Category, gc=Pd, i.e. a punctuation dash.
>>
>> 8. Has the Bidi_Class, bc=ES, with implications for numeric layout.
>>
>> U+0640 ARABIC TATWEEL
>>
>> 1. Is not in the ASCII subset.
>>
>> 2. Is neither a punctuation mark nor a mathematical operator.
>>
>> 3. Makes no content distinctions in text, but is used only
>>   to justify text for display.
>>
>> 4. Has the Line_Break property lb=AL, i.e. is treated like
>>   any letter for the purposes of line breaking, and does
>>   not mark special opportunities for line breaking.
>>
>> 5. Has the Word_Break property wb=ALetter, so by default will
>>   never mark a word break boundary.
>>
>> 6. Is Arabic and Syriac script only, and requires specific font
>>   design to harmonize with an Arabic (or Syriac) font baseline.
>>
>> 7. Is General_Category, gc=Lm, i.e. a modifier letter.
>>
>> 8. Has the Bidi_Class, bc=AL, i.e. behaves for bidi like true
>>   Arabic letters.
>>
>>
>> There are other distinctions, but I think continuing in this
>> vein would probably be more than is required.
>>
>> What do the two characters share?: vaguely similar appearances
>> (in some fonts only, when the glyphs are viewed in isolation
>> and not in context).
>>
>> Now what we might be missing is that some Arabic system users
>> *may* have repurposed U+0640 (or more likely its analogue
>> in 8-bit systems: Windows 1256 0xDC, and ISO 8859-6 0xE0) as
>> another kind of dash character, using it as an Arabic equivalent
>> of HYPHEN-MINUS, even though because of its semantics it
>> wouldn't work well that way on either Windows or other
>> Unicode-based systems now.
>>
>> Is that what you are talking about, Alireza?
>>
>> --Ken
>>
>>
>
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
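
Some of the properties Ken lists in the quoted message above can be
checked mechanically. The following is a minimal sketch, assuming
Python 3 and only its standard library; the helper name describe is
hypothetical and not part of any IDNA tooling. It covers only the
ASCII membership, General_Category, and Bidi_Class items, because the
standard unicodedata module does not expose Line_Break, Word_Break,
or Script. It also checks the two legacy 8-bit code points Ken
mentions.

```python
import unicodedata


def describe(ch):
    """Return a few Unicode properties of a single character.

    Hypothetical helper for illustration only.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "is_ascii": ord(ch) < 0x80,
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
    }


HYPHEN_MINUS = "\u002D"
TATWEEL = "\u0640"

# Print the properties side by side for comparison.
for ch in (HYPHEN_MINUS, TATWEEL):
    print(describe(ch))

# Decode the legacy 8-bit code points cited in the quoted message.
from_cp1256 = b"\xDC".decode("cp1256")
from_iso8859_6 = b"\xE0".decode("iso8859_6")
print(from_cp1256 == TATWEEL, from_iso8859_6 == TATWEEL)
```

The reported values come from the Unicode data bundled with the
Python interpreter in use, not from any particular Unicode version
discussed in this thread.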


