[Fwd: Response to Mark's message]

Thu Apr 10 10:25:36 CEST 2003

Jon Hanna wrote on 04/10/2003 06:26:56 AM:

> The second is that this orthogonal quality doesn't preclude "educated
> guesses". It's perfectly reasonable IMHO to assume Latin script for en-GB
> *as long as you remember that you are making an assumption*.

It has been suggested (perhaps before this thread was moved to this list)
that this should be taken beyond assumptions: that we should construct a
list of implicit relationships so that we know en can be universally
assumed to imply en-Latn (for contexts in which written form is relevant),
ar can be universally assumed to imply ar-Arab, etc.
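
As a rough illustration of what such a list of implicit relationships might look like in software, here is a minimal sketch. The table name IMPLIED_SCRIPT and the function with_implied_script are hypothetical, only the en and ar entries come from the message above, and placing the script code directly after the language code (en-Latn-GB) is an assumption rather than an agreed convention.

```python
# Hypothetical table of implied scripts. Only the "en" and "ar" entries
# are taken from the discussion; a real list would need to be agreed on.
IMPLIED_SCRIPT = {
    "en": "Latn",
    "ar": "Arab",
}


def with_implied_script(tag):
    """Return the tag with its implied script code added, when one is known.

    Assumption: the script code is placed right after the primary
    language subtag, and a four-letter alphabetic second subtag is
    treated as an explicit script code that must not be overridden.
    """
    subtags = tag.split("-")
    # Leave the tag alone if it already names a script explicitly.
    if len(subtags) > 1 and len(subtags[1]) == 4 and subtags[1].isalpha():
        return tag
    script = IMPLIED_SCRIPT.get(subtags[0].lower())
    if script is None:
        # No implicit relationship recorded for this language.
        return tag
    return "-".join([subtags[0], script] + subtags[1:])
```

Under these assumptions, with_implied_script("en") gives "en-Latn" and with_implied_script("ar") gives "ar-Arab", while a tag whose language is not in the table is returned unchanged.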

> Currently the only method for deducing scripts is either heuristically
> (look at the characters used and then deduce that the script used is
> whatever script uses those characters) or guessing from the language as in
> the second point above. While we all agree that this is not ideal, we have
> to recognise that software doing so will continue to exist for some time
> after a better solution is available.

The need goes beyond deducing the script used in the content: users need to
be able to specify constraints on content they are searching for.
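
To make the search case concrete, here is a small sketch of a user constraining results by script. It uses simple prefix matching in the style of RFC 3066 language ranges; the functions matches and search, the sample documents, and the use of a script code as the second subtag are all assumptions made for illustration.

```python
def matches(language_range, tag):
    """Prefix matching in the style of RFC 3066 language ranges.

    A range matches a tag if it is "*", equals the tag, or is a prefix
    of the tag that is immediately followed by "-". Comparison is
    case-insensitive.
    """
    rng = language_range.lower()
    t = tag.lower()
    return rng == "*" or t == rng or t.startswith(rng + "-")


def search(documents, language_range):
    """Return the titles of documents whose tag satisfies the range."""
    return [title for title, tag in documents if matches(language_range, tag)]


# Hypothetical content collection: (title, language tag) pairs.
documents = [
    ("doc1", "en-Latn"),
    ("doc2", "ar-Arab"),
    ("doc3", "en"),
]

# The user asks specifically for English written in Latin script.
latin_english = search(documents, "en-Latn")
```

Here latin_english contains only "doc1". The document tagged plain "en" is not returned, because nothing in the tag itself states its script; a constraint like this only works if the script is either tagged explicitly or supplied by an agreed implicit relationship.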

> Further a solution that places script codes into language codes has some
> strangeness. The hierarchy behind tags is imperfect...
> Whatever way I look at this I cannot find myself satisfied by anything
> that attempts to push script information into language tags.

Three+ years ago when we were drafting RFC3066, I had a lot of reservations
about going in this direction, but after simmering over it for a couple of
years and writing a few papers related to the topic, I am convinced it is a
move we should make.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485