Greek Casefolding sigma

Martin Duerst duerst at it.aoyama.ac.jp
Tue Apr 1 08:46:37 CEST 2008


At 03:49 08/03/30, Mark Davis wrote:

[some Unicode examples mangled, sorry]

>Thus $B%[!W%^2Q(P%g%^"P%-%^#(Bwould convert to xn--jxas3agdc8a in Punycode. The bit vector for conversion back would be <01,0,0,0,0,0,10>. Applying the casing bits we'd get xN--jxas3aGdc8a.
>
>As another example, Or$B%F%%(Bal would convert to xn--oral-cpa, with a bit vector of <1,0...>, thus yielding Xn--oral-cpa.

Ah, yes, we have two spare bits.

I think things could work out. There is at least one letter/digit
in the Punycode output per letter in the input. If we assume that
final sigma, sz (sharp s), and Turkish i are the characters that
need a second bit (in addition to a case bit), then we might be
fine, at least on average. We may run into problems where the
Punycode output contains many digits rather than letters, but a
cursory reading of the encoding algorithm seems to indicate that
Punycode is biased towards generating letters rather than digits.
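
To make the capacity argument concrete, here is a minimal sketch of
carrying bits in the case of the ASCII letters of a Punycode label.
It is only my reading of the examples quoted above: the function names
are made up for illustration, and I assume that a bit vector such as
<01,0,0,0,0,0,10> is simply flattened to 0,1,0,0,0,0,0,1,0 and applied
to the letters from left to right (digits and hyphens carry nothing).

```python
def letter_capacity(ace_label):
    """Number of ASCII letters in the label, i.e. available one-bit slots."""
    return sum(1 for ch in ace_label if ch.isascii() and ch.isalpha())


def apply_bits(ace_label, bits):
    """Uppercase the i-th ASCII letter if bits[i] is 1, lowercase it otherwise.

    Assumption: bits beyond the end of the list are treated as 0, and bits
    beyond the letter capacity are silently dropped.
    """
    remaining = iter(bits)
    out = []
    for ch in ace_label:
        if ch.isascii() and ch.isalpha():
            bit = next(remaining, 0)
            out.append(ch.upper() if bit else ch.lower())
        else:
            # Digits and hyphens cannot carry a bit.
            out.append(ch)
    return "".join(out)


def read_bits(ace_label):
    """Recover the bit list from the case of the ASCII letters."""
    return [1 if ch.isupper() else 0
            for ch in ace_label if ch.isascii() and ch.isalpha()]


print(apply_bits("xn--jxas3agdc8a", [0, 1, 0, 0, 0, 0, 0, 1, 0]))  # xN--jxas3aGdc8a
print(apply_bits("xn--oral-cpa", [1]))                             # Xn--oral-cpa
print(letter_capacity("xn--jxas3agdc8a"))                          # 11
```

An old implementation would just lowercase the label and ignore the
bits, which is where the backwards compatibility comes from.
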

We could also use an additional trick, namely to move the 'special'
bits to the front, making sure they never get dropped, and have
the case bits follow. We could also reorder the case bits according
to the order in the word (rather than according to the order in which
their letters are encoded, which is codepoint order), because case
towards the end of a name or word in general carries less and
less information (the first case bit can distinguish between
all-upper and all-lower case, and the second bit can also
distinguish titlecase, and these are the main variants).
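
A rough sketch of that ordering, just to illustrate the idea: the
'special' bits go first, the case bits follow in word order, and
whatever does not fit into the available letters is cut off at the
end. The function names, the representation of the special bits as a
plain list, and the capacity parameter are all assumptions made for
this example, not part of any proposal.

```python
def case_bits_in_word_order(original):
    """One bit per letter of the original word, in reading order (1 = uppercase)."""
    return [1 if ch.isupper() else 0 for ch in original if ch.isalpha()]


def build_payload(original, special_bits, capacity):
    """Special bits first, then case bits; truncate to the letter capacity.

    Truncation only ever removes bits from the tail, so the special bits
    and the first few case bits survive as long as capacity allows.
    """
    bits = list(special_bits) + case_bits_in_word_order(original)
    return bits[:capacity]


# Even with room for only two case bits, the main variants stay distinct.
print(build_payload("example", [], 2))  # [0, 0]  all lowercase
print(build_payload("Example", [], 2))  # [1, 0]  titlecase
print(build_payload("EXAMPLE", [], 2))  # [1, 1]  all uppercase
```

Mixed case further into the word (as in "exAmple") is what gets lost
first when the label runs short of letters.
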

The most basic advantage of all this is of course that it is
very much backwards compatible, just losing the 'nicer presentation'
aspect in older versions.

Regards,    Martin.


>This would work for any label where you don't run out of ASCII letters before you run out of one bits. I haven't done any comprehensive tests, but I would suspect that we could preserve case in the vast majority of cases just with this simple mechanism. And as I think about it, I strongly suspect that we could apply certain compression techniques to cover essentially all cases, and end up with a mechanism that would actually be backwards compatible that preserved input Unicode. (Cc'ing Markus, who might be interested in the bit-twiddling aspect.)
>
>I hope that's a bit clearer.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst at it.aoyama.ac.jp     


