New draft submitted of 3066bis...

Tue Nov 2 17:45:03 CET 2004

On Tuesday, November 2, 2004, 4:43:40 PM, Elliotte wrote:

EH> Chris Lilley wrote:

>> a) all the data is in US-ASCII, so case folding is on US-ASCII only

EH> This is not correct. The logical assertion being made here is that 
EH> because all the data is US-ASCII, therefore the case folded version is
EH> US-ASCII too.

EH> This is not true in either theory or practice. Case folding the letter
EH> "i" to upper case produces non-ASCII data some of the time.

As we discussed before, that depends on your case folding algorithm. The
one I suggested:

a) folds to lower case, so i folds to i
b) is not locale dependent, so I folds to i, not
Turkish-dotless-lowercase-i
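
A minimal sketch of points a) and b) in Python, assuming that str.casefold()
is an acceptable stand-in for default, locale-independent Unicode case
folding. The TURKIC_LOWER table is hand-written for contrast only and is
hypothetical; it merely mimics the Turkish special casing of "I".

```python
# Default Unicode case folding via str.casefold(); no locale is consulted.
ascii_chars = [chr(cp) for cp in range(128)]

# Check that every ASCII character folds to an ASCII character.
folded_is_ascii = all(ch.casefold().isascii() for ch in ascii_chars)

# "I" folds to plain "i" (U+0069), not to dotless "ı" (U+0131).
fold_of_I = "I".casefold()

# Hypothetical hand-written table imitating a Turkish-tailored lowercase
# mapping, shown only to contrast with the default folding above.
TURKIC_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})
turkic_of_I = "I".translate(TURKIC_LOWER)

print(folded_is_ascii)        # ASCII is closed under default folding
print(fold_of_I)              # result of locale-independent folding
print(turkic_of_I.isascii())  # the tailored mapping leaves ASCII
```

Under these assumptions the ASCII range is closed under folding, and it is
only the locale-tailored mapping that produces non-ASCII output.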

EH> Case folding
EH>   the letter "I" to lower case produces non-ASCII data some of the time.

EH> In other words, the set of ASCII characters is not reliably closed over
EH> the operation of case-folding.

Only if you allow case folding to be locale dependent. As I argued, the
problem is with the introduction of locale differences into XML
syntactic processing.

EH> What the data allows is not relevant. Case folding can produce unallowed
EH> data. The spec should define that it maps the ASCII characters a-z onto
EH> the ASCII characters A-Z and/or vice versa.

My contention was that it already did so, due to the definition of the
document character set for XML. Since the document character set is
Unicode, the Unicode case folding operation applies. I considered this
to be simple and obvious. Since it is not obvious to you, it needs to be
stated explicitly in the spec.

EH> You could even specify it
EH> numerically by adding or subtracting 'A'-'a' from each character. That's
EH> all fine.

I think re-inventing Unicode case folding is a mistake.
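
For comparison, here is a sketch in Python of the numeric mapping described
above next to Unicode case folding. The helper name ascii_lower and the
sample strings are illustrative assumptions, and str.casefold() is again
assumed to stand in for the Unicode operation. On ASCII input the two should
agree, so the hand-rolled version adds nothing.

```python
# Distance between the lower-case and upper-case ASCII letters.
OFFSET = ord("a") - ord("A")

def ascii_lower(text: str) -> str:
    """Map A-Z onto a-z by arithmetic; leave every other character alone."""
    return "".join(
        chr(ord(ch) + OFFSET) if "A" <= ch <= "Z" else ch
        for ch in text
    )

# Illustrative ASCII-only tokens (assumed examples, not from any spec).
samples = ["EN-us", "I", "Example.ORG", "x-KLINGON"]

# Compare the arithmetic mapping with locale-independent Unicode folding.
agree_on_samples = all(ascii_lower(s) == s.casefold() for s in samples)
agree_on_ascii = all(
    ascii_lower(chr(cp)) == chr(cp).casefold() for cp in range(128)
)

print(agree_on_samples, agree_on_ascii)
```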

EH> But case folding is not a well-defined operation across platforms, 
EH> natural languages, and computer languages, even when applied to the 
EH> ASCII character set. :-(

Unless you point to the algorithm.

By the way, I should point out that the comments above are strictly to
be applied to syntactic processing of XML. I am not unaware of the
complexities of natural language processing, locales, preferred
languages, and so on. However, I believe that interoperable XML
processing worldwide clearly requires that locale is not taken into
account when doing case folding on such syntactic items as language
codes, domain names, and other such things which have (unfortunately)
been defined to be 'case insensitive'.
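
As a sketch of what that means in practice, a comparison of such tokens
might look like the following in Python. The function name same_token and
the sample values are hypothetical, and str.casefold() is assumed as the
locale-independent folding.

```python
def same_token(a: str, b: str) -> bool:
    """Compare two case-insensitive syntactic tokens without any locale."""
    return a.casefold() == b.casefold()

# Language codes and domain names that differ only in case.
lang_match = same_token("en-US", "EN-us")
host_match = same_token("WWW.W3.ORG", "www.w3.org")

# Tokens containing "I" compare equal regardless of the user's locale,
# because the folding never produces the dotless lowercase i.
i_match = same_token("TITLE", "title")

print(lang_match, host_match, i_match)
```

The comparison never asks which locale the user is running in, so the same
document gives the same answer on every system.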

-- 
 Chris Lilley                    mailto:chris at w3.org
 Chair, W3C SVG Working Group
 Member, W3C Technical Architecture Group