Casefolding Sigma (was: Re: IDNAbis Preprocessing Draft)

John C Klensin klensin at jck.com
Tue Jan 22 19:59:17 CET 2008



--On Tuesday, 22 January, 2008 11:47 +0900 Martin Duerst
<duerst at it.aoyama.ac.jp> wrote:

> I'm sure this has already been discussed, probably in several
> places, but thinking from a simple user perspective, why should
> final small sigma be disallowed? After all, writing a word
> ending in sigma with a non-final sigma would look really
> strange, or wouldn't it? And likewise writing a word
> containing a sigma in the middle with a final sigma would
> look really strange, or wouldn't it? So in my view, it would
> be better to address this e.g. at the registry level rather
> than to produce bad typography.

Martin,

When we can avoid it, I find it helpful to avoid thinking about
and debating individual characters.  Instead, let's focus on
principles, both because they permit us to generalize and
because, if we do our job well, they make it easier for users to
understand what does and does not work in IDNs.   The
unpredictability of what does and does not work is one of the
big criticisms of IDNA2003: the non-expert user, and even her
registrar, can't predict which characters are permitted, which
ones are prohibited, and which ones map into something else (or
whether the "something else" can be predicted by those who don't
use the script).

Two of those principles apply to everything we may reasonably
try to do with IDNs.  As Vint points out, DNS labels (not just
IDN ones) are identifiers.  They need to be as clear,
unambiguous, and predictable as possible to make them maximally
usable.  And, no matter how often we fall into the trap of
thinking about them that way, they are not "words" and don't
need to obey the constraints of "words".  Wanting to write the
Great Slobbovian Novel in domain name labels is just not a
consideration.  Picking up an example from one of Stephane's
notes, the decision that GOOGLE.com and google.com should be
equivalent was made before there was a Google and indeed long
before there was an Internet.   Long experience with identifiers
and what we now call Basic Latin characters persuaded us that,
while specialists could get used to it, trying to treat
identifier strings as different when they differed only by case
was a bad idea.  There are two ways to deal with the problems
that bad idea can cause.   One is to permit one form only and
the other is to map them (or cause them to match in any
comparison processes, even if they are not mapped).   And,
because they are identifiers and "no ambiguity" is a major
consideration for identifiers --perhaps _the_ major
consideration-- if one tries to make the adjustments at
comparison time, it had better be 100% clear what should happen
in every case, without complex rules about the processing or
comparison model.
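
For concreteness, here is a minimal Python sketch of those two
strategies for plain ASCII labels (the function names are mine,
not anything from a specification):

    def canonical(label):
        # Strategy 1: permit one form only; store the folded form.
        return label.lower()

    def labels_match(a, b):
        # Strategy 2: leave both forms alone, fold at comparison time.
        return a.lower() == b.lower()

    assert canonical('GOOGLE.com') == 'google.com'
    assert labels_match('GOOGLE.com', 'google.com')

Either way, lower() is not reversible, which is why the folding
rule has to be unambiguous: once folded, the original form cannot
be recovered.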

Now let's take a few steps back from both final-form Sigma and a
number of coding decisions that Unicode inherited from other
standards.  Probably as the result of very old calligraphic or
other decorative conventions, we have a number of scripts in the
world that make visual distinctions about how characters are
written depending on where they appear in a word.   In some
cases, those distinctions are made with leading characters
(sentence capitalization and proper names in English, Nouns in
German, page and paragraph illumination in some medieval Latin
texts, and so on).  In others they are distinctions made at the
ends of words (a quick scan of the UnicodeData table shows up
characters identified as "final" in Greek, Hebrew, Syriac,
Canadian Syllabics, New Tai Lue, Bopomofo, Arabic, and some
symbols we presumably don't care about).   Some of these may
actually not be final forms in the same sense as the
Greek-Hebrew-Arabic ones, and there may be others out there, but
please let's understand that final-position presentation form
changes are not uncommon... they may occur in more scripts than
those we usually think of as having case.
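
That "quick scan" is easy to reproduce; here is a rough Python
sketch (note that it walks whatever Unicode version the
interpreter ships, not the Unicode 3.2 snapshot that IDNA2003
uses):

    import sys
    import unicodedata

    # List assigned code points whose Unicode names identify them
    # as final forms.
    for cp in range(sys.maxunicode + 1):
        try:
            name = unicodedata.name(chr(cp))
        except ValueError:
            continue  # unassigned or unnamed code point
        if 'FINAL' in name.split():
            print(f'U+{cp:04X}  {name}')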

Given that, we need to face some very general issues _for IDNs_.
I stress that because the issues, and the answers, may be very
different for sentences, running text, novels, typesetting, and
even programming languages.  Indeed, they probably are
different.  One is that, because DNS labels are identifiers,
standard orthographic rules for languages that use the script
may not apply.  We have extensive experience with constructions
like FirstBank.tld, the established expectation that they will
match firstbank.tld, _and_ the knowledge that such
middle-of-the-string capitalization cannot possibly occur if the
string is a single English word.  An obvious analogy to this
common practice of constructing a label by cramming two words
together could easily lead to final-form characters in the
middle of analogous labels.  Whether that is "bad typography" or
not is in the mind of the beholder.  Again because these are
identifiers, formulations that work "almost all the time" are
not good enough.  We either have to have enough motivation to
call something out as an exception case, or everything has to be
treated in the same way.
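
A hypothetical Greek illustration of how that cramming produces
mid-label final forms (the compound label is made up; Python
3.3+ applies the Unicode final-sigma casing context in lower()):

    # Two words lowercased separately, then crammed together
    # FirstBank-style.
    first = 'ΟΔΟΣ'.lower()    # 'οδος': trailing sigma becomes final ς
    second = 'ΒΗΤΑ'.lower()   # 'βητα'
    label = first + second    # 'οδοςβητα': final-form ς mid-label
    assert '\u03c2' in label[:-1]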

If we now return to how these things are coded, I note that, in
a perfect world and with a focus on identifiers, one might have
avoided assigning separate code points to these typographically
variant case-sensitive and final forms but instead used the base
character and, when needed, a form-modifier of some sort.  That
would make all sorts of comparison operations much easier and
would probably have no ill effects on typography.  It might have
made some other things more difficult, including requiring more
script-specific or language-specific characteristics in
rendering engines, so I'm not suggesting it would be universally
a good idea.  Unicode didn't do it that way (except for Arabic)
and, because almost all of what it did was inherited from other
standards, it gets neither credit nor blame for the decision
(insofar as either is relevant).  However, even within Unicode,
treatment of normal and special-form characters is not
completely consistent.  Lower and upper case Greek, Cyrillic,
and Latin characters are considered distinct although there is a
table and algorithm for case-folding.  Most normal and
final-form characters in Hebrew, Syriac and others are treated
as distinct, but final-form Arabic characters (and a few
precomposed Hebrew ones) are treated as compatibility characters
and mapped out by NFKC.
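
That inconsistency is easy to demonstrate with Python's
unicodedata module (code points used: U+FE8E ARABIC LETTER ALEF
FINAL FORM, U+05DD HEBREW LETTER FINAL MEM, U+03C2 GREEK SMALL
LETTER FINAL SIGMA):

    import unicodedata

    # Arabic final forms are compatibility characters: NFKC maps
    # them to the base letter (here U+0627 ARABIC LETTER ALEF).
    assert unicodedata.normalize('NFKC', '\ufe8e') == '\u0627'

    # Hebrew final mem and Greek final sigma are ordinary
    # characters: NFKC leaves them alone.
    assert unicodedata.normalize('NFKC', '\u05dd') == '\u05dd'
    assert unicodedata.normalize('NFKC', '\u03c2') == '\u03c2'

    # Final sigma disappears only under case folding, which the
    # Stringprep/Nameprep profile also applies.
    assert '\u03c2'.casefold() == '\u03c3'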

IDNA2003 and Stringprep tried to resolve these issues and ended
up with, IMO, a bit of a mess due to the inconsistencies
described above.   Upper-case characters are mapped to
lower-case, creating a mapping that is not reversible and is
obvious for almost all, but not quite all, cases.  That case
mapping eliminates final form sigma as a separate character
(those who are arguing for as much compatibility with IDNA2003
as possible should take note of this -- final form sigma cannot
appear in an actual label in IDNA2003).  The Hebrew final forms
are not mapped and can occur in IDN labels as themselves.   We
don't know what would have happened to the Arabic final forms,
since they were apparently defined after Unicode 3.2, but the
NFKC mapping plus comments in the Unicode text let us deduce
that they would have been mapped out.
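
Python's built-in 'idna' codec implements IDNA2003 (RFC
3490/3491), so the behavior described above can be checked
directly; a sketch, assuming a reasonably recent CPython:

    # Nameprep case-folds final sigma to ordinary sigma, so both
    # produce the same ACE label...
    assert 'ς'.encode('idna') == 'σ'.encode('idna')
    # ...and ToUnicode() of a valid ACE form never yields ς.
    assert 'ς'.encode('idna').decode('idna') == 'σ'

    # Hebrew final mem is not mapped: it round-trips as itself.
    assert 'ם'.encode('idna').decode('idna') == 'ם'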

I think where this all leads is to the conclusions that:

	* Neither "bad typography" nor observations about normal
	word formation are particularly helpful in making
	decisions about alternate forms of the same character
	for IDN purposes.
	
	* Ideally, we should treat all case and positional
	presentation variations the same way.  Since there are
	no perfect rules for the mappings, that way is probably
	to explicitly ban the variant forms from actual
	domain labels.  In IDNA2003, that means that they cannot
	be represented directly in the ACE form and that
	ToUnicode() of a valid ACE form will never generate
	them.  In IDNA200X, that means they are banned as input
	to IDNA (even if mappings are applied as a UI or
	preprocessing matter).
	
	* Our general principle that any codepoint that has a
	compatibility mapping to another codepoint (or set of
	codepoints) is banned at the IDNA level continues to
	apply to these cases as well as all the others.
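
A minimal sketch of that last rule in Python (the function name
is mine; the actual protocol would express this through derived
property tables, not code):

    import unicodedata

    def compat_banned(ch):
        # A code point that NFKC maps to something else is banned
        # at the IDNA level under the rule above.
        return unicodedata.normalize('NFKC', ch) != ch

    assert compat_banned('\ufe8e')      # Arabic final alef
    assert not compat_banned('\u05dd')  # Hebrew final mem
    assert not compat_banned('\u03c2')  # Greek final sigma

Note that this rule catches the Arabic presentation forms but not
the Hebrew final letters or final sigma; those fall only under
the explicit ban of the second point above.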

That leaves us with five final-form characters in Hebrew, one in
Syriac, and around 18 in scripts I don't understand well enough
to make guesses about, all of which are permitted, as themselves,
under IDNA2003 but which application of the principles above
would probably ban.  A discussion about what to do with them is
ultimately a discussion of whether consistency of principles is
more or less important than compatibility with IDNA2003.  That
discussion would seem to me to be much more helpful at this
point than a discussion of final sigma, which gets banned under
either criterion.

And I note that all of these characters except final sigma are
headed for the categories we call "MAYBE", which really should
be interpreted, for these sorts of cases, as "specific advice
from the community of users of the script about conditions for
appropriateness for IDN use is needed".

regards,
    john


