Mixing scripts (Re: Unicode versions (Re: Criteria for exceptional characters))

Mark Davis mark.davis at icu-project.org
Wed Dec 20 00:23:50 CET 2006


Having looked this over, it's clear that the first line was inappropriate,
for which I'm sorry. I should avoid trying to be lighthearted, since it is
too easy to get wrong.

Unicode is big and complex, and I wouldn't expect anyone who is not deeply
immersed in the topic to know all the details. Many of the issues are not
obvious to someone who isn't a specialist in the subject, and there is a lot
of history behind the structure and documentation that makes it sometimes
difficult to approach. I meant no disparagement at all by my phrasing, which
was just meant to indicate that it is sometimes a difficult area to get a
handle on.

Mark

On 12/19/06, Mark Davis <mark.davis at icu-project.org> wrote:
>
>
> > I take it this means the answer to my question is "no", since the script
> > names in Scripts.txt and the ISO 15924 codes don't match up.
>
>
> We need to drag you, kicking and screaming, into ever deeper understanding
> of how Unicode works.
>
> Each Unicode property name and each property value name may have aliases.
> These aliases, as you would expect, are encapsulated in a machine-readable
> file, such as
> http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt
>
> So, for example, you see there:
>
> sc ; Arab      ; Arabic
> sc ; Armn      ; Armenian
> sc ; Bali      ; Balinese
> sc ; Beng      ; Bengali
> ...
>
> The first field, sc, is the short name for the "script" property; Armn is
> the short name for one of its values (which corresponds to the 15924 code),
> and Armenian is the long name used in the data file Scripts.txt. If you
> look at the site for the 15924 Registration Authority (
> http://www.unicode.org/iso15924/), you'll find also in the tables such as
> http://www.unicode.org/iso15924/iso15924-codes.html a listing of both the
> long and short value names.
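
To illustrate the three-field layout described above, here is a minimal Python sketch that reads "sc" lines and builds a map from short script value names (the ISO 15924 codes) to the long names used in Scripts.txt. The function name parse_script_aliases is hypothetical, and the sample input is just the four lines quoted above; a real run would feed it the lines of PropertyValueAliases.txt, which may carry additional fields or comments that this sketch simply ignores.

```python
def parse_script_aliases(lines):
    """Map short script value names (e.g. 'Armn') to long names (e.g. 'Armenian').

    Assumes semicolon-separated fields and '#' comments, as in the
    excerpt quoted above. Only lines for the 'sc' (script) property are kept.
    """
    aliases = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop any trailing comment
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) >= 3 and fields[0] == "sc":
            short_name, long_name = fields[1], fields[2]
            aliases[short_name] = long_name
    return aliases


# The four sample lines from the message above.
sample = [
    "sc ; Arab      ; Arabic",
    "sc ; Armn      ; Armenian",
    "sc ; Bali      ; Balinese",
    "sc ; Beng      ; Bengali",
]
aliases = parse_script_aliases(sample)
print(aliases["Armn"])
```

Inverting the dictionary gives the opposite lookup, from a Scripts.txt long name back to its short code.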
>
> The Unicode script property (2001-02-06) actually predated the first
> publication of ISO 15924 (2004-01-09); however, it was done in the knowledge
> that 15924 was coming, and they have been kept in sync since.
>
> Mark
>
> On 12/19/06, Harald Alvestrand <harald at alvestrand.no> wrote:
> >
> > Thanks for pointing out the relevant TR for the use of script codes, and
> > the special status of "Common" and "Inherited". The algorithm grows....
> >
> > --On 19 December 2006 12:45 -0800 Kenneth Whistler <kenw at sybase.com>
> > wrote:
> >
> > >> Is there a list of the Unicode codepoints known to be used in each of
> > >> the ISO 15924 script codes?
> > >
> > > That is an ill-formed question. ISO 15924 defines script codes.
> > > It does not define repertoires or associate code points with
> > > those script codes. So you can't have sets of Unicode code points
> > > "in each ISO 15924 script code".
> > >
> > > > The closest you are going to get to a repertoire partitioning
> > > of Unicode into scripts is Scripts.txt, the very file we have
> > > been talking about and using for the development of the
> > > inclusions file.
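
As a rough illustration of how Scripts.txt partitions code points by script, the following Python sketch groups code point ranges under each script's long name. The function name parse_scripts is hypothetical, and the sample lines are written in the style of Scripts.txt ("range ; Script # comment") for illustration rather than copied verbatim from a particular version of the file.

```python
def parse_scripts(lines):
    """Group code point ranges by script long name.

    Assumes each data line looks like 'XXXX..YYYY ; Script # comment'
    or 'XXXX ; Script # comment', with hexadecimal code points.
    """
    ranges = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code_points, script = [field.strip() for field in line.split(";")]
        first, _, last = code_points.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point line
        ranges.setdefault(script, []).append((start, end))
    return ranges


# Illustrative sample lines in the Scripts.txt style.
sample = [
    "0531..0556    ; Armenian # L&  [38] ARMENIAN CAPITAL LETTER AYB..ARMENIAN CAPITAL LETTER FEH",
    "0559          ; Armenian # Lm       ARMENIAN MODIFIER LETTER LEFT HALF RING",
    "0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
]
script_ranges = parse_scripts(sample)
for script, spans in sorted(script_ranges.items()):
    print(script, [(hex(a), hex(b)) for a, b in spans])
```

The keys here are the long names from Scripts.txt, not ISO 15924 codes, which is the distinction drawn above.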
> >
> > I take it this means the answer to my question is "no", since the script
> > names in Scripts.txt and the ISO 15924 codes don't match up.
> >
> >             Harald
> >
> >
> >
> >
> > _______________________________________________
> > Idna-update mailing list
> > Idna-update at alvestrand.no
> > http://www.alvestrand.no/mailman/listinfo/idna-update
> >
>
>