Greek Casefolding sigma
mark.davis at icu-project.org
Sat Mar 29 19:49:38 CET 2008
I'm afraid I was a bit too terse (a failing of mine that other Unicoders will
attest to). Let me try to flesh it out.
When you case-fold in Unicode, you remove certain distinctions. That is, you
have cases where, say, 3 characters map to one. Because there are a very
small number of source characters that could map to any target character, as
you do the folding you can also extract a set of bits that tell you which of
the original characters was mapped. That vector of bits, plus the target
string, can get you back to the original casing. You need no bits for
uncased characters (e.g. Chinese), typically need 1 bit per cased character,
and in exceptional cases need more than one bit. (For σ, Σ, ς you'd have 00,
01, 10, for example.)
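
A rough sketch of that bookkeeping in Python, to make it concrete. The names
(SIGMA_FORMS, fold_with_bits, restore) are made up for illustration, and
str.lower() is only a stand-in for real Unicode case folding, so characters
with multi-character or otherwise unusual mappings are not handled properly.

```python
# Assumption: the three sigma forms are numbered 00, 01, 10 as in the text.
SIGMA_FORMS = ["σ", "Σ", "ς"]

def fold_with_bits(text):
    """Fold text and record, per cased character, which original was seen."""
    folded, codes = [], []
    for ch in text:
        if ch in SIGMA_FORMS:
            # Exceptional case: three sources, so two bits.
            folded.append("σ")
            codes.append(format(SIGMA_FORMS.index(ch), "02b"))
        elif ch.lower() != ch.upper():
            # Ordinary cased character: one bit, 1 meaning "was not lowercase".
            low = ch.lower()
            folded.append(low)
            codes.append("1" if ch != low else "0")
        else:
            # Uncased character (e.g. Chinese, digits): no bits at all.
            folded.append(ch)
    return "".join(folded), codes

def restore(folded, codes):
    """Rebuild the original casing from the folded string plus the codes."""
    remaining = iter(codes)
    out = []
    for ch in folded:
        if ch == "σ":
            out.append(SIGMA_FORMS[int(next(remaining), 2)])
        elif ch.lower() != ch.upper():
            out.append(ch.upper() if next(remaining) == "1" else ch)
        else:
            out.append(ch)
    return "".join(out)

folded, codes = fold_with_bits("Σωτήρης")
print(folded, codes)           # the codes start with 01 and end with 10
print(restore(folded, codes))  # round-trips to the input
```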
The simplest mechanism would then be to take that vector of bits and walk
through the Punycode, consuming one bit per ASCII letter: change the letter
to uppercase to represent a 1 bit, and leave it lowercase to represent a
0 bit.
You then end up with an A-label that contains enough information that it can
be converted back to the cased version of the U-label, yet a
case-insensitive comparison of A-labels yields the same result as a
case-insensitive comparison of the original U-labels.
Thus Σωτήρης would convert to xn--jxas3agdc8a in Punycode. The bit vector
for conversion back would be <01,0,0,0,0,0,10>. Applying the casing bits
we'd get xN--jxas3aGdc8a.
As another example, Oréal would convert to xn--oral-cpa, with a bit vector
of <1,0...>, thus yielding Xn--oral-cpa.
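
To check the two examples mechanically, here is a minimal sketch in Python.
The function names are hypothetical, the bit vector is passed as a flat
string of 0s and 1s, and (as in the examples) the letters of the xn-- prefix
are used as carriers too. The A-labels are simply taken from the text above;
nothing here computes Punycode.

```python
def apply_case_bits(a_label, bits):
    """Uppercase the k-th ASCII letter of a_label when the k-th bit is 1."""
    remaining = iter(bits)
    out = []
    for ch in a_label.lower():
        if "a" <= ch <= "z":
            # Missing bits at the end are treated as 0 (leave lowercase).
            bit = next(remaining, "0")
            out.append(ch.upper() if bit == "1" else ch)
        else:
            # Digits and hyphens cannot carry a bit.
            out.append(ch)
    if "1" in "".join(remaining):
        raise ValueError("ran out of ASCII letters before running out of 1 bits")
    return "".join(out)

def read_case_bits(a_label):
    """Read one bit back from every ASCII letter of a mixed-case A-label."""
    return "".join(
        "1" if ch.isupper() else "0"
        for ch in a_label
        if ch.isascii() and ch.isalpha()
    )

print(apply_case_bits("xn--jxas3agdc8a", "010000010"))  # xN--jxas3aGdc8a
print(apply_case_bits("xn--oral-cpa", "10000"))         # Xn--oral-cpa
print(read_case_bits("xN--jxas3aGdc8a"))                # 01000001000
```

Lowercasing the result gives back the plain A-label, which is why a caseless
compare is unaffected. The bits read back are padded with trailing zeros, one
per unused letter, so a decoder would consume only as many as the folded
U-label calls for.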
This would work for any label where you don't run out of ASCII letters
before you run out of one bits. I haven't done any comprehensive tests, but
I would suspect that we could preserve case in the vast majority of cases
just with this simple mechanism. And as I think about it, I strongly suspect
that we could apply certain compression techniques to cover essentially all
cases, and end up with a mechanism that would actually be backwards
compatible and would still preserve the input Unicode. (Cc'ing Markus, who might be
interested in the bit-twiddling aspect.)
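
The "don't run out of ASCII letters before you run out of one bits" condition
is easy to test for a given label. A small sketch, again in Python with a
made-up function name and the bit vector as a flat string; "xn--4" below is
an invented label used only to show the failing case, not real Punycode
output.

```python
def can_carry(a_label, bits):
    """True if every 1 bit in bits lands on an ASCII letter of a_label."""
    letters = sum(1 for ch in a_label if ch.isascii() and ch.isalpha())
    # Trailing 0 bits need no letter, so only the last 1 bit matters.
    needed = bits.rfind("1") + 1
    return needed <= letters

print(can_carry("xn--jxas3agdc8a", "010000010"))  # True
print(can_carry("xn--oral-cpa", "10000"))         # True
print(can_carry("xn--4", "0001"))                 # False
```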
I hope that's a bit clearer.
On Sat, Mar 29, 2008 at 9:03 AM, John C Klensin <klensin at jck.com> wrote:
> --On Saturday, March 29, 2008 8:47 AM -0700 Mark Davis
> <mark.davis at icu-project.org> wrote:
> > Patrik, you misunderstand. I'm not saying that this should be
> > part of the protocol. What I'm saying is that the protocol,
> > *combined with a postprocessing step for UIs*, would handle
> > the situation.
> > (In retrospect -- and water long under the bridge -- we would
> > have been better off to use one of the variants of Punycode,
> > which has the ability to encode case and other distinguishing
> > information in the original Unicode using case in the ASCII
> > form. Had we gone that route, we could have maintained the
> > visual distinctions on output of DNS for sigma and similar
> > cases, because the DNS does a caseless compare for A-Z.)
> Unless I misunderstand what you are suggesting, that punycode
> variation would not have helped. Because the code points are
> different, punycode(raw-upper-case-string) is not going to
> contain the same characters as
> punycode(equivalent-lower-case-string). One could use punycode
> case to encode things the way you suggest, but only by case
> folding first and then using the punycode case to indicate "used
> to be upper case". But that wouldn't help for the sigma
> situation because the case folding operation itself is what
> loses the information (about final form, not really case), and
> that isn't subject to a binary "upper/lower" switch.
> Or have I missed something in what you are suggesting?
> But, one way or the other, certainly water under the bridge.