consensus Call: TATWEEL

Kenneth Whistler kenw at sybase.com
Wed Mar 25 23:34:07 CET 2009


Alireza asked:

> Why you think they are very unlike ? other than it has been using for 
> many years in DNS
> 
> /\lireza
> 
> Kenneth Whistler wrote:
> >> Hyphen and tatweel are very unlike.
> >>     
> >
> > I agree. Which is why hyphen (U+002D) does (and must
> > continue to) occur in domain names, and why U+0640
> > ARABIC TATWEEL shouldn't.


Well, let me count the ways. ;-)

U+002D HYPHEN-MINUS

1. Is in the ASCII subset, which has all kinds of implications
   for grandfathered usage in protocols, syntax, etc.
   
2. Is ambiguous between usage as a punctuation mark (hyphen)
   and a mathematical unary operator (minus sign).
   
3. May make content distinctions in some orthographies,
   lexically and/or syntactically.
   
4. Has the Line_Break property lb=HY, with implications for
   hyphenation and line breaking behavior.
   
5. Has the Word_Break property wb=Other, so by default will
   mark a word break boundary.
   
6. Is Common script, used with many scripts besides Latin.

7. Is General_Category, gc=Pd, i.e. a punctuation dash.

8. Has the Bidi_Class, bc=ES, with implications for numeric layout.

U+0640 ARABIC TATWEEL

1. Is not in the ASCII subset.

2. Is neither a punctuation mark nor a mathematical operator.

3. Makes no content distinctions in text, but is used only
   to justify text for display.

4. Has the Line_Break property lb=AL, i.e. is treated like
   any letter for the purposes of line breaking, and does
   not mark special opportunities for line breaking.
   
5. Has the Word_Break property wb=ALetter, so by default will
   never mark a word break boundary.
  
6. Is Arabic and Syriac script only, and requires specific font design
   to harmonize with an Arabic (or Syriac) font baseline.
   
7. Is General_Category, gc=Lm, i.e. a modifier letter.

8. Has the Bidi_Class, bc=AL, i.e. behaves for bidi like true
   Arabic letters.
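
For anyone who wants to check some of these values directly, here is
a minimal sketch using Python's standard unicodedata module. The
helper name summarize is just an illustration, not part of any
library. The standard library exposes General_Category and Bidi_Class
but not Line_Break or Word_Break, so items 4 and 5 above are not
covered here; those would need the UCD data files or a third-party
package.

```python
import unicodedata


def summarize(ch):
    """Return a few Unicode properties of a single character.

    Hypothetical helper for illustration; only properties available
    in the standard library are reported.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "is_ascii": ord(ch) < 0x80,                       # item 1
        "general_category": unicodedata.category(ch),     # item 7
        "bidi_class": unicodedata.bidirectional(ch),      # item 8
    }


if __name__ == "__main__":
    for ch in ("\u002D", "\u0640"):
        print(summarize(ch))
```

The reported values reflect the Unicode version bundled with the
Python interpreter in use.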


There are other distinctions, but I think continuing in this
vein would probably be more than is required.

What do the two characters share? Vaguely similar appearances
(in some fonts only, when the glyphs are viewed in isolation
and not in context).

Now what we might be missing is that some Arabic system users
*may* have repurposed U+0640 (or more likely its analogue
in 8-bit systems: Windows 1256 0xDC, and ISO 8859-6 0xE0) as
another kind of dash character, using it as an Arabic equivalent
of HYPHEN-MINUS, even though, because of its semantics, it
wouldn't work well that way on Windows or other
Unicode-based systems now.
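
The 8-bit code positions mentioned above can be checked against the
codecs shipped with Python. This is only a sketch: the codec names
cp1256 and iso8859_6 are Python's names for Windows 1256 and
ISO 8859-6, and the helper decodes_to_tatweel is a made-up name for
illustration.

```python
TATWEEL = "\u0640"

# Legacy byte values for tatweel, as cited in the text above.
LEGACY_BYTES = {
    "cp1256": 0xDC,      # Windows 1256
    "iso8859_6": 0xE0,   # ISO 8859-6
}


def decodes_to_tatweel(codec, byte_value):
    """True if the single byte decodes to U+0640 under the given codec."""
    return bytes([byte_value]).decode(codec) == TATWEEL


if __name__ == "__main__":
    for codec, byte_value in LEGACY_BYTES.items():
        print(codec, hex(byte_value), decodes_to_tatweel(codec, byte_value))
```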

Is that what you are talking about, Alireza?

--Ken


