Casefolding Sigma (was: Re: IDNAbis PreprocessingDraft)

Fri Jan 25 11:08:51 CET 2008

Hello John,

At 03:59 08/01/23, John C Klensin wrote:
>
>
>--On Tuesday, 22 January, 2008 11:47 +0900 Martin Duerst
><duerst at it.aoyama.ac.jp> wrote:
>
>> I'm sure this has already been discussed, probably in several
>> places, but thinking from a simple user perspective, why should
>> final small sigma be disallowed? After all, writing a word
>> ending in sigma with a non-final sigma would look really
>> strange, or wouldn't it? And likewise writing a word
>> containing a sigma in the middle with a final sigma would
>> look really strange, or wouldn't it? So in my view, it would
>> be better to address this e.g. at the registry level rather
>> than to produce bad typography.
>
>Martin,
>
>When we can avoid it, I find it helpful to avoid thinking about
>and debating individual characters.  Instead, let's focus on
>principles, both because they permit us to generalize and
>because, if we do our job well, it makes it easier for users to
>understand what does and does not work in IDNs.

Making users understand what does and what doesn't work is
definitely important. But except for a few specialists like
us, the general user won't be interested in principles, they
will be interested in their script and the characters used
in their language. To Greeks, it will be completely irrelevant
whether their script is treated according to some general
principles or not, or whether their script is treated the
same way as Hebrew or whatever. Same thing the other way
round.

Also, I agree that the more we can work with general principles,
the better, but we have to look at individual cases to make sure
we choose the right principles. Even the best principles aren't
helpful if they are not validated in concrete cases.

And what's REALLY, REALLY important is to understand that we are
dealing with culture and traditions grown over centuries and
millennia, from a wide variety of human activity, so any attempt
at solving problems with general rules can be very partial at
best, and therefore should be done extremely carefully.

>The question
>of what does and does not work is one of the big criticisms of
>IDNA2003: the non-expert user, and even her registrar, can't
>predict which characters are permitted, which ones are
>prohibited, which ones map into something else (and whether the
>"something else" can be predicted by those who don't use the
>script).

If you think that the new proposals will make this easier, I think
you are mistaken. The 'localization layer' will easily confuse
people more than the current IDNA2003. With IDNA2003, at least,
all implementations are supposed to behave the same. That won't
be the case anymore if we introduce a 'localization layer'.

>Two of those principles apply to everything we may reasonably
>try to do with IDNs.  As Vint points out, DNS labels (not just
>IDN ones) are identifiers.  They need to be as clear,
>unambiguous, and predictable as possible to make them maximally
>usable.  And, no matter how often we fall into the trap of
>thinking about them that way, they are not "words" and don't
>need to obey the constraints of "words".  Wanting to write the
>Great Slobbovian Novel in domain name labels is just not a
>consideration.  Picking up an example from one of Stephane's
>notes, the decision that GOOGLE.com and google.com should be
>equivalent was made before there was a Google and indeed long
>before there was an Internet.

I agree that with respect to GOOGLE and google, they should
be equivalent. Even if nothing else, people have gotten used
to this for 20 or more years.

>Long experience with identifiers
>and what we now call Basic Latin characters persuaded us that,
>while specialists could get used to it, trying to treat
>identifier strings as different when they differed only by case
>was a bad idea.  There are two ways to deal with the problems
>that bad idea can cause.   One is to permit one form only and
>the other is to map them (or cause them to match in any
>comparison processes, even if they are not mapped).   And,
>because they are identifiers and "no ambiguity" is a major
>consideration for identifiers --perhaps _the_ major
>consideration-- if one tries to make the adjustments at
>comparison time, it better be 100% clear what should happen in
>any case, without complex rules about the processing or
>comparison model.

Yes. But what about solving the Greek Sigma problem with
bundling? That works for Chinese characters with much more
complex relationships.

>Now let's take a few steps back from both final-form Sigma and a
>number of coding decisions that Unicode inherited from other
>standards.  Probably as the result of very old calligraphic or
>other decorative conventions, we have a number of scripts in the
>world that make visual distinctions about how characters are
>written depending on where they appear in a word.

Well, yes, except that the "calligraphic and other traditions"
sounds as if that's somehow a special case, where it isn't,
at least not for the scripts involved.

>In some
>cases, those distinctions are made with leading characters
>(sentence capitalization and proper names in English, Nouns in
>German, page and paragraph illumination in some medieval Latin
>texts, and so on).  In others they are distinctions made at the
>ends of words (a quick scan of the UnicodeData table shows up
>characters identified as "final" in Greek, Hebrew, Syriac,
>Canadian Syllabics, New Tai Lue, Bopomofo, Arabic, and some
>symbols we presumably don't care about).   Some of these may
>actually not be final forms in the same sense as the
>Greek-Hebrew-Arabic ones  and there may be others out there, but
>please let's understand that final-position presentation form
>changes are not uncommon... they may occur in more scripts than
>those we usually think of as having case.

Canadian Syllabics "Final" seem to be more like some modifier
characters. Two of them don't even have any glyph printed in
Unicode 5.0, and there is no explanation in the text, so this
is difficult to say with certainty. Probably Ken knows more.

For New Tai Lue, these are in some sense composed forms of
a base consonant and a virama. If they are dealt with in a
similar way to other Indic scripts, I guess these variants
should be separate.

>Given that, we need to face some very general issues _for IDNs_.
>I stress that because the issues, and the answers, may be very
>different for sentences, running text, novels, typesetting, and
>even programming languages.  Indeed, they probably are
>different.

If we come to the conclusion that something has to
be different for IDNs, that's fine. However, just saying that
these are identifiers and that therefore we don't have to care,
because identifiers are different anyway, isn't what should happen.

>One is that, because DNS labels are identifiers,
>standard orthographic rules for languages that use the script
>may not apply.  We have extensive experience with constructions
>like FirstBank.tld, the established expectation that they will
>match firstbank.tld, _and_ the knowledge that such
>middle-of-the-string capitalization cannot possibly occur if the
>string is a single English word.

Yes, we have extensive experience for English, and to a large
extent also for other Latin-based languages. But we don't have
this experience for Greek. And we should try to avoid telling
the Greeks what experience they are allowed to have, and what
not.

>An obvious analogy to this
>common practice of constructing a label by cramming two words
>together could easily lead to final-form characters in the
>middle of analogous labels.

Yes it could indeed. We should leave it to the Greeks whether they
want to use a "final sigma only in final position" policy or a "sigma
and final sigma bundling policy", or a "no final sigma" policy (because
they e.g. think that upper-case will be frequent) or maybe even a
"both allowed" policy. I very much doubt that the latter is necessary
or a good policy, but Greeks may well decide that:
a) Every schoolkid knows the difference between sigma and final
   sigma, so no big chance for spoofers, and
b) There are cases where both a domain name composed of two words
   and a domain name in a single word take the same "base spelling",
   with only the sigma/final sigma difference.
I repeat that I don't think this is a good idea or that it will happen,
but I don't know enough Greek, and even if I did, I wouldn't want to
dictate what the Greeks should do.

>Whether that is "bad typography" or
>not is in the mind of the beholder.  Again because these are
>identifiers, formulations that work "almost all the time" are
>not good enough.  We either have to have enough motivation to
>call something out as an exception case, or everything has to be
>treated in the same way.

What would Greeks say if they could choose between allowing
final sigma (with registries restricting it to final positions,
or before '-') and allowing words without spaces between them?
I don't know. I don't want to decide for them.
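
To make two of these registry-level options a bit more concrete, here
is a rough sketch in Python. The function names are made up for this
message and are not taken from any IDNA draft or registry policy; the
rules are only my reading of "final sigma at the end or before '-'"
and of sigma/final sigma bundling.

```python
FINAL_SIGMA = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA
SIGMA = "\u03c3"        # GREEK SMALL LETTER SIGMA


def final_sigma_positions_ok(label: str) -> bool:
    """Hypothetical registry check: final sigma may appear only at the
    end of the label or directly before a hyphen."""
    for i, ch in enumerate(label):
        if ch != FINAL_SIGMA:
            continue
        at_end = i == len(label) - 1
        before_hyphen = not at_end and label[i + 1] == "-"
        if not (at_end or before_hyphen):
            return False
    return True


def bundle_key(label: str) -> str:
    """Hypothetical bundling key: labels that differ only in
    sigma / final sigma end up in the same bundle."""
    return label.replace(FINAL_SIGMA, SIGMA)
```

With the first function a registry would reject a label that has a
final sigma in the middle of a word; with the second it would accept
both spellings but hand them to the same registrant.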

>If we now return to how these things are coded, I note that, in
>a perfect world and with a focus on identifiers, one might have
>avoided assigning separate code points to these typographically
>variant case-sensitive and final forms but instead used the base
>character and, when needed, a form-modifier of some sort.  That
>would make all sorts of comparison operations much easier and
>would probably have no ill effects on typography.  It might have
>made some other things more difficult, including requiring more
>script-specific or language-specific characteristics in
>rendering engines, so I'm not suggesting it would be universally
>a good idea.  Unicode didn't do it that way (except for Arabic)
>and, because almost all of what they did do was inherited from
>other standards, gets neither credit nor blame for the decision
>(insofar as either is relevant).  However, even within Unicode,
>treatment of normal and special-form characters is not
>completely consistent.  Lower and upper case Greek, Cyrillic,
>and Latin characters are considered distinct although there is a
>table and algorithm for case-folding.  Most normal and
>final-form characters in Hebrew, Syriac and others are treated
>as distinct, but final-form Arabic characters (and a few
>precomposed Hebrew ones) are treated as compatibility characters
>and mapped out by NFKC.

Grep can be very helpful, but it has its limitations.
It doesn't look consistent, but it actually is in many ways.
First, observe that for Greek and Hebrew (and probably the
explicitly encoded final Syriac letter), these are word-final,
whereas for Arabic, these are cursive-run finals, which may
or may not be word-final. Second, observe that for Greek and
Hebrew and the explicitly encoded final Syriac letter, these
are just a handful of special cases. One could call them
orthographic finals.

For Arabic, every letter has a final form, and all of them have
a nominal/independent form, and many of them also have a fully
connected and an initial form, but according to experts, that's
actually just a very, very simplified model of real Arabic
typography. The Unicode standard contains these forms in
the compatibility block because they have been used in some
legacy encodings that have long fallen out of favor. For domain
names, these are irrelevant; any general display algorithm that
includes Arabic can handle context. It turned out that cost-
performance of developing an automatic rendering system was
worth it for Arabic, but not for Greek or Hebrew.

Overall, we can show the situation as follows:

Script             Count  Type          Occurs at  Impl.      Code points
                                        end of
Arabic             many   typographic   run        automatic  (compat. only)
Syriac (except 1)  many   typographic   run        automatic  no

Greek              few    orthographic  word       by hand    yes
Hebrew             few    orthographic  word       by hand    yes
Syriac (1 letter)  few    orthographic  word       by hand    yes

I think the pattern here should be obvious: There are two kinds
of final forms. Those that are handled automatically don't
have to have separate characters in IDNs, because they don't
have separate characters in running text. Those that are
handled by hand are already to some extent separated; the
Greek sigma is the exception, which we may be able to fix.

>IDNA2003 and Stringprep tried to resolve these issues and ended
>up with, IMO, a bit of a mess due to the inconsistencies
>described above.   Upper-case characters are mapped to
>lower-case,

Wrong. All case variants that are mapped together for applications
such as searching are given a single representative, leading to
mappings from one lower-case character to other(s), not just
Upper->lower mappings.
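
A small sketch of what I mean, using Python's str.casefold() as a
stand-in. It applies Unicode full case folding, which is close to but
not identical with the Stringprep mapping table, so take it as an
illustration only:

```python
# Characters to fold, with their Unicode names for the printout.
examples = {
    "\u03a3": "GREEK CAPITAL LETTER SIGMA",
    "\u03c3": "GREEK SMALL LETTER SIGMA",
    "\u03c2": "GREEK SMALL LETTER FINAL SIGMA",
    "\u00df": "LATIN SMALL LETTER SHARP S",
}

# str.casefold() maps every case variant to one representative.
folded = {ch: ch.casefold() for ch in examples}

for ch, name in examples.items():
    targets = " ".join("U+%04X" % ord(c) for c in folded[ch])
    print("U+%04X %s -> %s" % (ord(ch), name, targets))
```

Final sigma and sharp s are both lower-case already, yet both are
mapped to something else. This is why "upper-case is mapped to
lower-case" is too short a description.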

>creating a mapping that is not reversible and is
>obvious for almost all, but not quite all, cases.  That case
>mapping eliminates final form sigma as a separate character
>(those who are arguing for as much compatibility with IDNA2003
>as possible should take note of this -- final form sigma cannot
>appear in an actual label in IDNA2003).

Yes. That would be an argument I could accept, if we came to the
conclusion that for whatever reason it's unfortunately too late
to fix this. Given that just about anything about IDNA2003 seems
to be open, at least for discussion, it doesn't yet look like
a very strong argument.

>The Hebrew final forms
>are not mapped and can occur in IDN labels as themselves.

Yes, overall, the issue of final forms was indeed not considered,
leading to inconsistencies when looking at it from that viewpoint.

>We
>don't know what would have happened to the Arabic final forms,
>since they were apparently defined after Unicode 3.2,

No. They were around before Unicode 3.2, see
http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt
Actually, they are there since 1.1, see
http://www.unicode.org/Public/UNIDATA/DerivedAge.txt:
FE76..FEFC    ; 1.1 # [135] ARABIC FATHA ISOLATED FORM..ARABIC LIGATURE LAM WITH ALEF FINAL FORM

They were always compatibility equivalents, so NFKC takes care
of them, which is exactly the right thing to do. But the chance
that somebody types one of these nowadays is very low, which
is definitely not the case for Greek final sigma, which has its
own key. Same for the five Hebrew ones.
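
This is easy to check with the unicodedata module that ships with
Python. The sketch below is mine; the helper name nfkc_report is
invented, and the three sample characters are one Arabic presentation
form, Greek final sigma, and Hebrew final mem:

```python
import unicodedata


def nfkc_report(ch: str) -> str:
    """Describe what NFKC does to a single character."""
    normalized = unicodedata.normalize("NFKC", ch)
    if normalized == ch:
        status = "unchanged"
    else:
        status = "mapped to " + " ".join("U+%04X" % ord(c) for c in normalized)
    return "U+%04X %s: %s" % (ord(ch), unicodedata.name(ch), status)


# U+FEEA ARABIC LETTER HEH FINAL FORM, U+03C2 final sigma, U+05DD final mem
for ch in ("\ufeea", "\u03c2", "\u05dd"):
    print(nfkc_report(ch))
```

Only the Arabic presentation form is touched by NFKC; the Greek and
Hebrew finals pass through as themselves.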

>but the
>NFKC mapping plus comments in the Unicode text lets us deduce
>that they would have been mapped out.
>
>I think where this all leads is to the conclusions that:
>
>       * Neither "bad typography" nor observations about normal
>       word formation are particularly helpful in making
>       decisions about alternate forms of the same character
>       for IDN purposes.

I disagree. Both "bad typography" and normal word formation
can be very helpful. For sure it would be wrong to use them
as the only guidelines, but simply throwing them out is definitely
the wrong thing to do.

>       * Ideally, we should treat all case and positional
>       presentation variations the same way.

No, as shown above, we should rise above the level of
"use grep, if it contains the same word, treat it the same"
and "if there is a Unicode table for it, just go use it"
and do a minimum of analysis that might show some underlying
principles that will then allow us to produce better solutions.

>Since there are
>       no perfect rules for the mappings,

I'm afraid this may turn into "the absence of the perfect
is the enemy of the good". But I hope that we realize that
even if there are no perfect rules, we can still try to
get the best possible rules.

>that way is probably
>       to explicitly ban the variant forms from actual
>       domain labels.  In IDNA2003, that means that they cannot
>       be represented directly in the ACE form and that
>       ToUnicode() of a  valid ACE form will never generate
>       them.

That was the case for final sigma (unfortunately) and for
final Arabic letters (the right decision).

>In IDNA200X, that means they are banned as input
>       from IDNA (even if mappings are applied as a UI or
>       preprocessor matter).
>       
>       * Our general principle that any codepoint that has a
>       compatibility mapping to another codepoint (or set of
>       codepoints) is banned at the IDNA level continues to
>       apply to these cases as well as all the others.

Some final characters have compatibility mappings. Some others
don't. I think the distinction coincides with the distinction
I worked out above. Definitely, neither Hebrew nor Greek finals
have compatibility mappings, for good reasons.
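
For anybody who wants to go one step beyond grep, here is a sketch
that sorts every character with FINAL in its name by the kind of
decomposition it has. It relies on whatever Unicode version the local
Python was built with, so the counts will differ from a Unicode 5.0
scan; the function name classify_finals is my own.

```python
import sys
import unicodedata


def classify_finals() -> dict:
    """Group characters whose name contains the word FINAL by
    decomposition type: compatibility, canonical, or none."""
    groups = {"compat": [], "canonical": [], "none": []}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, "")
        if "FINAL" not in name.split():
            continue
        decomposition = unicodedata.decomposition(ch)
        if decomposition.startswith("<"):
            # Tagged decomposition such as "<final> 0647": removed by NFKC.
            groups["compat"].append(ch)
        elif decomposition:
            groups["canonical"].append(ch)
        else:
            groups["none"].append(ch)
    return groups


groups = classify_finals()
for kind, chars in groups.items():
    print(kind, len(chars))
```

The Arabic presentation forms land in the "compat" group, while Greek
final sigma and the five Hebrew finals land in "none".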

>That leaves us with five final-form characters in Hebrew, one in
>Syriac, and around 18 in scripts I don't understand well enough
>to make guesses about that are permitted, as themselves, under
>IDNA2003 but that application of the principles above would
>probably ban.  A discussion about what to do with them is
>ultimately a discussion of whether consistency of principles is
>more or less important than compatibility with IDNA2003.  That
>discussion would seem to me to be much more helpful at this
>point than a discussion of final sigma, which gets banned under
>either criterion.

I guess all of what I wrote above pretty much is part of this
discussion.

>And I note that all of these characters except final sigma are
>headed for the categories we call "MAYBE", which really should
>be interpreted, for these sorts of cases, as "specific advice
>from the community of users of the script about conditions for
>appropriateness for IDN use is needed".

In my opinion, that also would apply for the final sigma!

Regards,   Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst at it.aoyama.ac.jp