Urdu and SPACE, FULL STOP (Re: comments on IDNAbis: draft-faltstrom-idnabis-tables-04.txt Arabic

Sarmad Hussain sarmad.hussain at nu.edu.pk
Thu Feb 21 05:13:02 CET 2008


> 
> On Feb 20, 2008 1:29 PM, Harald Alvestrand 
> <harald at alvestrand.no> wrote:

...
> > >
> > >
> > > In addition, in Urdu we also would have a problem for not allowing
> > > space as we do not have use of ZWNJ in Pakistan.  Urdu users in
> > > Pakistan type space whether it is required to shape letter within a
> > > word or at the end of it.  It is not possible to train all users to
> > > distinguish between space and ZWNJ (especially as the latter is not a
> > > linguistic entity in the language and users are never taught its
> > > concept, but a computational engineering solution from the perspective
> > > of Urdu).  As the domain name standard has to deal with applications
> > > with which users will be directly interacting, it may also be included
> > > as a recommendation (at least for Urdu) that the users may be allowed
> > > to type it and it may be automatically be converted to ZWNJ (and could
> > > follow same rules as ZWNJ after such conversion).
> > >
> > >
> > I am curious about this ... can you tell me more about how Urdu speakers
> > regard the words that the Unicode Consortium's experts insist can only
> > be written by use of ZWNJ - are they regarded as "two words, set closely
> > together", or are they regarded as "one word that we have to type in a
> > weird way"?
> >
> > In the Latin-script languages, we have forced all users, if they want to
> > have multiple words in their labels, to use the unnatural and
> > strange-looking hyphen, or write each word in a separate label and have
> > a domain name with multiple strings.
> > If we could avoid the use of ZWNJ entirely without causing too much pain
> > to users (no more than is currently suffered by users of English and
> > Norwegian), that would simplify our rules a great deal.
> >
> > Again, thank you very much for your information!
> >
> >                      Harald Alvestrand
> >

To understand this, we have to start from a very basic fact:

	Urdu handwriting and calligraphy DO NOT have the concept of space


It is like Lao, Thai, Khmer and many other Asian languages in that respect.
I would not be surprised if this is true of all other languages using Arabic
script, but cannot say for sure.

On computers, users have used the "space" character NOT to put space
between two words but to correct the glyph shaping as needed.  Thus, it
has been used for the same purpose as ZWNJ.  For Urdu (at least) the two
are synonymous at the presentation layer (except that space can also be
used for justification, a role which ZWNJ cannot have as it has zero
width).  Functionally, then, space can be considered a stylistic variant
of ZWNJ.

However, "space" is needed separately from ZWNJ for computational purposes
(internally within applications).  If glyph shaping is done within a word,
ZWNJ is to be used.  If it is done across words, "space" is to be used.
This helps applications like tokenizers and spell checkers to work
properly.  But this is not a distinction that end users know of, realize or
type.  They consistently type space for glyph shaping in both contexts
(within and across words).  Thus, automatic word segmentation techniques,
which are also used for some of the other languages I have mentioned (see
the extensive literature on Chinese and Thai), are also employed for Urdu.
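
To make the distinction concrete, here is a minimal Python sketch.  It is
only an illustration under stated assumptions, not part of any IDNA draft:
the helper names spaces_to_zwnj and whitespace_tokens are hypothetical, and
the sample string uses arbitrary Arabic-script letters (two BEH characters),
not a real Urdu word.  It shows that a naive whitespace tokenizer splits on
SPACE (U+0020) but not on ZWNJ (U+200C), and how typed spaces inside a
single label could be mapped to ZWNJ along the lines of the recommendation
quoted above.

```python
ZWNJ = "\u200c"   # ZERO WIDTH NON-JOINER
SPACE = "\u0020"  # SPACE


def spaces_to_zwnj(label: str) -> str:
    """Hypothetical helper: map typed spaces inside one label to ZWNJ.

    Each run of spaces between letters becomes a single ZWNJ; leading
    and trailing spaces are dropped.  Assumes the whole input is meant
    to be one label, so every space in it is a glyph-shaping space.
    """
    parts = label.split(SPACE)
    return ZWNJ.join(part for part in parts if part)


def whitespace_tokens(text: str) -> list[str]:
    """Naive tokenizer: split on whitespace only (ZWNJ is not whitespace)."""
    return text.split()


if __name__ == "__main__":
    typed = "\u0628 \u0628"          # BEH, SPACE, BEH as a user would type it
    converted = spaces_to_zwnj(typed)

    print(len(whitespace_tokens(typed)))      # the space splits the string
    print(len(whitespace_tokens(converted)))  # the ZWNJ does not
    print([f"U+{ord(ch):04X}" for ch in converted])
```

Such a blanket conversion is only reasonable where the input is known to be
a single label; in running text, deciding which spaces are word boundaries
is exactly the segmentation problem described above.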

For scientific data to support this, please read
http://www.crulp.org/Publication/papers/2007/spelling_error_trends_in_urdu.pdf.
Note that Table 3 indicates that, from an error corpus of 975 words,
75.5 percent of errors were related to irregular use of space: people
entering space within words, or not entering space across words.  In the
latter case glyph shaping was not desired (as the word-ending characters
were non-joiners) and thus typists skipped the space altogether, indicating
that for them space is not a word boundary "tool" in Urdu but a glyph
shaping "tool".

In summary, Urdu does not have space between words.  "Space" is a glyph
shaping and justification tool for end users of Urdu. ZWNJ is a requirement
for computational processing of Urdu (internally within applications; not by
users). 

Best regards,
Sarmad




