consensus Call: TATWEEL
Kenneth Whistler
kenw at sybase.com
Wed Mar 25 23:34:07 CET 2009
Alireza asked:
> Why you think they are very unlike ? other than it has been using for
> many years in DNS
>
> /\lireza
>
> Kenneth Whistler wrote:
> >> Hyphen and tatweel are very unlike.
> >>
> >
> > I agree. Which is why hyphen (U+002D) does (and must
> > continue to) occur in domain names, and why U+0640
> > ARABIC TATWEEL shouldn't.
Well, let me count the ways. ;-)
U+002D HYPHEN-MINUS
1. Is in the ASCII subset, which has all kinds of implications
for grandfathered usage in protocols, syntax, etc.
2. Is ambiguous between usage as a punctuation mark (hyphen)
and a mathematical unary operator (minus sign).
3. May make content distinctions in some orthographies,
lexically, syntactically, or both.
4. Has the Line_Break property lb=HY, with implications for
hyphenation and line breaking behavior.
5. Has the Word_Break property wb=Other, so by default will
mark a word break boundary.
6. Is Common script, used with many scripts besides Latin.
7. Is General_Category, gc=Pd, i.e. a punctuation dash.
8. Has the Bidi_Class, bc=ES, with implications for numeric layout.
U+0640 ARABIC TATWEEL
1. Is not in the ASCII subset.
2. Is neither a punctuation mark nor a mathematical operator.
3. Makes no content distinctions in text, but is used only
to justify text for display.
4. Has the Line_Break property lb=AL, i.e. is treated like
any letter for the purposes of line breaking, and does
not mark special opportunities for line breaking.
5. Has the Word_Break property wb=ALetter, so by default will
never mark a word break boundary.
6. Is Arabic and Syriac script only, and requires specific font design
to harmonize with an Arabic (or Syriac) font baseline.
7. Is General_Category, gc=Lm, i.e. a modifier letter.
8. Has the Bidi_Class, bc=AL, i.e. behaves for bidi like true
Arabic letters.
There are other distinctions, but I think continuing in this
vein would probably be more than is required.
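For anyone who wants to check a few of these values rather than
take my word for it, here is a minimal sketch using Python's
standard unicodedata module. The describe() helper and the
constant names are just illustrative choices, not part of any
specification. Note that the standard module reports only name,
General_Category and Bidi_Class; it does not expose Line_Break or
Word_Break, so items 4 and 5 in the lists above are not covered
here.

```python
import unicodedata

HYPHEN_MINUS = "\u002D"
TATWEEL = "\u0640"

def describe(ch):
    """Collect a few of the properties listed above for one character."""
    return {
        "name": unicodedata.name(ch),
        "ascii": ch.isascii(),                # item 1: in the ASCII subset?
        "gc": unicodedata.category(ch),       # item 7: General_Category
        "bc": unicodedata.bidirectional(ch),  # item 8: Bidi_Class
    }

for ch in (HYPHEN_MINUS, TATWEEL):
    print(f"U+{ord(ch):04X}", describe(ch))
```

The two characters should come out differing on every one of the
reported properties.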
What do the two characters share? Vaguely similar appearances
(in some fonts only, when the glyphs are viewed in isolation
and not in context).
Now what we might be missing is that some Arabic system users
*may* have repurposed U+0640 (or more likely its analogue
in 8-bit systems: Windows 1256 0xDC, and ISO 8859-6 0xE0) as
another kind of dash character, using it as an Arabic equivalent
of HYPHEN-MINUS, even though, because of its semantics, it
would not work well that way now on either Windows or other
Unicode-based systems.
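The 8-bit code points mentioned above can be confirmed with a
short sketch, assuming Python's built-in "cp1256" and "iso8859_6"
codecs faithfully reflect the Windows 1256 and ISO 8859-6
mappings. The variable names are only illustrative.

```python
TATWEEL = "\u0640"
HYPHEN_MINUS = "\u002D"

# Encode TATWEEL in the two 8-bit Arabic code pages named above.
tatweel_cp1256 = TATWEEL.encode("cp1256")
tatweel_iso8859_6 = TATWEEL.encode("iso8859_6")

# HYPHEN-MINUS is ASCII, so it keeps its own byte in both code pages.
hyphen_cp1256 = HYPHEN_MINUS.encode("cp1256")
hyphen_iso8859_6 = HYPHEN_MINUS.encode("iso8859_6")

print(tatweel_cp1256.hex(), tatweel_iso8859_6.hex())
print(hyphen_cp1256.hex(), hyphen_iso8859_6.hex())
```

So even in the 8-bit systems the two characters were always
separately encoded; any repurposing would have been a choice made
by users, not something forced by a missing hyphen.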
Is that what you are talking about, Alireza?
--Ken