AW: AW: sharp s (Eszett)

Tue Mar 18 20:31:54 CET 2008

As several of you have noticed, I needed to largely stop dealing
with in-depth comments on this list shortly before IETF.  Now
digging back in...

--On Monday, 17 March, 2008 13:49 +0900 Martin Duerst
<duerst at it.aoyama.ac.jp> wrote:

> At 00:05 08/03/12, John C Klensin wrote:
> 
>> Our general rule should be to avoid information loss.  While
>> we assume both for historical reasons and due to the symmetry
>> of case folding them, that there is no information loss in
>> transforming ordinary Latin upper case characters into lower
>> case ones, when we start talking about non-reversible mappings
>> and spelling rules, we should keep the characters separate
>> and registerable, rather than taking excursions into the
>> peculiarities of casefolding (which the Unicode book
>> acknowledges loses information and recommends against
>> precisely the way we are using it if that can be avoided).  
> 
> I agree that the concept of "information loss" is an important
> and useful criterion. But I'm a bit wary of making it a
> (potentially close to absolute) rule. 

We agree.  But I'm trying to see if we can find a useful
principle or two here, rather than having interminable arguments
about single characters.

>> Now, if we can agree on that principle, we need to examine the
>> set of characters that are transformed in non-obvious ways by
>> case folding.   For those that are like this, we have to
>> figure out how to implement the principle, which may require
>> some additions to the exception list.
> 
> The next case in which to test this approach would be the issue
> of the (Turkish,...) dotless i. My guess is that things would
> work out fine (i.e. the concept of information loss would show
> the desirability for having both dot-ful and dot-less 'i').

Yes, but then we run into several different aspects of the
casefolding issue (or maybe they are separate issues that don't
lead to the same conclusion):

	* The Unicode standard is fairly clear that case folding
	is an information-losing operation and suggests that the
	original string be retained along with the folded one
	(something IDNA doesn't do ... see the last paragraph on
	page 187 of TUS 5.0).  The "upper then lower"
	transformations that are typical of the definition of
	the case folding operation are what get us from
	Eszett -> SS -> ss, regardless of the information loss.

	* We are all agreed that keeping character
	transformations stable (in the normal meaning of that
	term) is A Good Thing.  But that can create some
	snapshot inconsistencies.   For example, since Unicode
	5.1 contains "Capital Eszett", if case folding were
	being defined for the first time with 5.1, or if "Small
	Eszett" had not been assigned a code point earlier,
	there would at least be a very strong case for Small
	Eszett -> Capital Eszett -> Small Eszett, with no "ss"
	transformations involved.

	* If the proposed change (from IDNA2003) is preserved in
	IDNA200X and we move mappings out of the protocol, then
	there is no _protocol_ requirement to invoke toCaseFold
	(or any similar operation).   Characters in the protocol
	are either different or prohibited.  It is here that the
	"information loss" criterion becomes most important and
	it is here that it becomes fairly clear that Eszett is a
	"real" character, not, e.g., an alternate way to
	write/type "ss".
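
For anyone who wants to see the transformations in the first two
bullets concretely, here is a minimal sketch.  It assumes Python 3,
whose built-in str methods stand in for the Unicode case operations;
it is not IDNA code, and the helper names are made up for the
illustration.

```python
# Illustration only: Python's str.upper(), str.lower() and
# str.casefold() stand in for the Unicode case operations discussed
# above.  The helper names below are hypothetical.

SMALL_ESZETT = "\u00df"    # LATIN SMALL LETTER SHARP S
CAPITAL_ESZETT = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S (Unicode 5.1)
DOTLESS_I = "\u0131"       # LATIN SMALL LETTER DOTLESS I


def upper_then_lower(text: str) -> str:
    """Uppercase the string, then lowercase the result."""
    return text.upper().lower()


def survives_folding(text: str) -> bool:
    """Return True if case folding leaves the string unchanged."""
    return text.casefold() == text


# Eszett -> SS -> ss: the original character cannot be recovered.
print(repr(upper_then_lower(SMALL_ESZETT)))

# Lowercasing Capital Eszett yields Small Eszett, no "ss" involved.
print(CAPITAL_ESZETT.lower() == SMALL_ESZETT)

# The same round trip turns dotless i into an ordinary dotted i.
print(repr(upper_then_lower(DOTLESS_I)))

# Two different German words become indistinguishable once folded.
print("maße".casefold() == "masse".casefold())
```

The last comparison is the information loss in miniature: once
folded, nothing records which of the two spellings was entered.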

Of course, while those three ways of looking at the problem
might help us if we were making decisions for the first time
today, they do not help very much when we address the question
of backward compatibility.    In the extreme cases, that
question has only two possible answers:  (i) We must preserve
compatibility always, even if it means carrying a mistake
forward and (ii) we need to figure out what is right, construct
rules on that basis and then, if necessary, figure out how to
get from IDNA2003 registrations to IDNA200X registrations and
lookup procedures. 
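
To make the compatibility question concrete, the sketch below shows
what an IDNA2003 implementation does with a label containing Eszett.
It assumes Python 3, whose built-in "idna" codec is documented as
implementing RFC 3490 (IDNA2003); the function name and the sample
labels are invented for the illustration.

```python
# Illustration only: Python's built-in "idna" codec implements
# IDNA2003 (RFC 3490), including the Nameprep mapping step.
# The function name and the sample labels are hypothetical.


def to_ascii_2003(label: str) -> bytes:
    """Convert a label to its IDNA2003 ASCII form."""
    return label.encode("idna")


# Nameprep maps Eszett to "ss" before any Punycode encoding is
# attempted, so the result is plain ASCII.
print(to_ascii_2003("straße"))

# Under IDNA2003 the two spellings therefore name the same label.
print(to_ascii_2003("straße") == to_ascii_2003("strasse"))
```

Any existing registration made under these rules is a "ss" label, so
treating Eszett as a distinct character would mean the same typed
string looks up a different name than it did before.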

It has probably been obvious that I generally favor the latter
when decisions made for the IDNA2003 documents cause hardships
or orthographic contradictions, but I don't see it as an
absolute rule.  And, obviously, YMMV.

  regards,
    john