final sigma and tonos

Fri Feb 1 02:41:24 CET 2008

------------- Begin Forwarded Message -------------

Date: Thu, 31 Jan 2008 17:32:34 -0800 (PST)
From: Kenneth Whistler <kenw at sybase.com>
Subject: Re: final sigma and tonos
To: erikv at google.com
Cc: idna-update at alvestrand.no

Erik asked:

> On Jan 31, 2008 4:25 PM, Kenneth Whistler <kenw at sybase.com> wrote:
> > Greek is not a cursive script, and ZWNJ has never had anything
> > to do with selection of final or non-final forms for Greek
> > sigma. ZWNJ is *not* a final form variation selector.
> 
> Is there any invisible character in Unicode that *could* be used by an
> application for whatever purpose it wishes?

Sure. Any *non*character, e.g. U+FDD0..U+FDEF, U+FFFF, etc.

But the very thing that makes those usable "for whatever purpose
[an application] wishes" is also what makes them
non-interoperable. You can't conformantly interchange
noncharacters outside your own privately controlled
context, where you understand what you are using them for.

An application can also use private-use characters for
whatever it pleases (U+E000..U+F8FF, etc.), and those
*can* be openly interchanged. But again, somebody else
is only going to know what you intend by them if you
actually share a private agreement about their meaning.
Furthermore, for rendering, any implementation that
doesn't share your private agreement would display
private-use characters as a visible blort (i.e. square
box, etc.), so those don't even meet the criteria of
being an "invisible character".

And of course, because of those interoperability
concerns, noncharacters and private-use characters
are also forbidden for IDNs, anyway.
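
As a rough illustration of the two classes just described, here is a
minimal Python sketch that tests a code point against the fixed
noncharacter and private-use ranges. The function names are
hypothetical, chosen only for this example; the ranges themselves are
the ones defined by the Unicode Standard.

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacters: U+FDD0..U+FDEF plus the last
    two code points of every plane (U+xxFFFE and U+xxFFFF)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_private_use(cp: int) -> bool:
    """True for the BMP private-use area and the two private-use planes."""
    return (
        0xE000 <= cp <= 0xF8FF
        or 0xF0000 <= cp <= 0xFFFFD
        or 0x100000 <= cp <= 0x10FFFD
    )


def usable_without_private_agreement(cp: int) -> bool:
    """Hypothetical helper: rejects both classes, since neither has a
    meaning that an unrelated implementation could rely on."""
    return not (is_noncharacter(cp) or is_private_use(cp))
```

A filter like this only says which code points are unsuitable; it is
not a statement of what any particular IDN specification permits.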

The recommended way, in the Unicode encoding context,
to represent a visually required arbitrary glyphic
distinction when a separate character encoding is
not pertinent would be to standardize a variation
sequence, using a variation selector. But even that
wouldn't be relevant to the Greek case, because
in Greek, the final sigma and the non-final sigma
are *already* encoded as separate characters -- a
situation inherited from ISO 8859-7 = ELOT 928
(and IBM CP 423, IBM CP 851, Windows 1253, MacGreek, etc.).

So the Greek answer has been around for decades.
If you want to represent a final sigma, you use
U+03C2, and if you want to represent a non-final
sigma, you use U+03C3. And then you don't need
context rules, nor do you need invisible format
control characters.
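
To make the two code points concrete, the following small Python
sketch looks up their character names with the standard `unicodedata`
module. The constant names are just labels for this example.

```python
import unicodedata

FINAL_SIGMA = "\u03c2"  # used at the end of a word
SIGMA = "\u03c3"        # used everywhere else

# Print each code point next to its formal Unicode character name.
for ch in (FINAL_SIGMA, SIGMA):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```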

The problem, of course, is that for most *matching*
purposes, which aren't concerned with display per se,
both of the sigmas are *sigmas*, after all, and
have to fold together. And the other problem is
that casing for sigmas is asymmetrical, because there
is only one coded character for the uppercase, but
two for the lowercase.
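
The casing asymmetry can be seen in a few lines of Python, using only
the built-in `str.upper`. This is a sketch of the effect, not of how
any particular Greek implementation handles it; `uppercase_images` is
a hypothetical helper name.

```python
FINAL_SIGMA = "\u03c2"
SIGMA = "\u03c3"
CAPITAL_SIGMA = "\u03a3"


def uppercase_images(chars):
    """Map each character to its uppercase form."""
    return {ch: ch.upper() for ch in chars}


images = uppercase_images([FINAL_SIGMA, SIGMA])

# Both lowercase sigmas land on the single uppercase sigma, so
# uppercasing discards which of the two was in the original text.
collapsed = images[FINAL_SIGMA] == images[SIGMA] == CAPITAL_SIGMA
```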

So the answer for that is that Greek implementations
simply need to be aware of the sigmas and know they
need special treatment for folding for matching,
and special treatment for casing. Adding an
invisible character into the mix wouldn't actually
help anything, I think.
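
As one possible illustration of folding for matching, Python's
built-in `str.casefold` applies Unicode case folding, which maps both
lowercase sigmas and the uppercase sigma to U+03C3. The function name
`match_key` is hypothetical, the sample words are written without
accents to keep tonos out of the picture, and real matching would
likely also need normalization, which this sketch leaves out.

```python
def match_key(s: str) -> str:
    """Build a comparison key in which all sigma forms are equal."""
    return s.casefold()


with_final_sigma = "\u03ba\u03bf\u03c3\u03bc\u03bf\u03c2"  # ends in U+03C2
with_plain_sigma = "\u03ba\u03bf\u03c3\u03bc\u03bf\u03c3"  # ends in U+03C3

# The raw strings differ, but their match keys compare equal.
same_for_matching = match_key(with_final_sigma) == match_key(with_plain_sigma)
```

The stored text keeps whichever sigma the user typed; only the key
used for comparison is folded.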

>
> ... Nevertheless, I
> would also be quite interested to hear how the Greeks and French on
> this mailing list feel.

Of course. I just want to make sure that we have a common
understanding of "the issue" we are trying to get
clarity on.

--Ken

------------- End Forwarded Message -------------