Casefolding Sigma (was: Re: IDNAbis Preprocessing Draft)

Wed Jan 23 00:22:30 CET 2008

--On Tuesday, 22 January, 2008 23:36 +0100 JFC Morfin
<jefsey at jefsey.com> wrote:

> At 19:59 22/01/2008, John C Klensin wrote:
>> When we can avoid it, I find it helpful to avoid thinking
>> about and debating individual characters.  Instead, let's
>> focus on principles,
> 
> Dear John,
> your analysis seems to be correct but on one point that
> Michael pointed out. You talk of "characters" but do not
> define what a "character" is.

That was quite deliberate, since the definition of what is, and
is not, a "character" also depends on context and local
definitions.  

> It seems it can be:
> - either a visual item (Michael)

At least in the vicinity of Unicode, that is usually called a
"glyph".  I don't think Michael thinks that defines a
"character" either, although he can speak for himself.

> - either a registered DNS item (you)
> - or a Unicode point (IDNA).

Actually, since IDNA is defined in terms of Unicode, I can
accept "Unicode code point", although with some qualifications
about combining forms, etc.
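
The qualification about combining forms can be seen in a minimal
Python sketch. It assumes only the standard unicodedata module; the
sample strings are illustrative choices, not taken from this thread.
The same visible letter can be one code point or two, and the two
spellings compare unequal until they are normalized:

```python
import unicodedata

# "é" as a single precomposed code point (U+00E9)
composed = "\u00e9"
# "e" followed by COMBINING ACUTE ACCENT (U+0065 U+0301)
decomposed = "e\u0301"

# The two strings differ in length and compare unequal as code point sequences.
print(len(composed), len(decomposed))
print(composed == decomposed)

# After normalization to NFC both spellings are the same sequence.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)
```

So "one code point" and "one character" coincide only after some
normalization step has been agreed on.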

> If you do not say:
> - what a character is,
> - at what layer language (and therefore semantic) issues are
> dealt with,
> we will stay with confusion, and different forms of layer
> violation depending on who speaks.

From my point of view, language issues (with or without
semantics) need to be dealt with at some layer above the DNS.
There is really no practical alternative, since DNS labels are
very short strings and no language-coding information is present
in the DNS.

> As far as I am concerned:
> 
> 1. "characters" are a set of visual graphics that are
> registered in the same DNS way.

Then we disagree, because font and artistic variations in glyphs
make it impossible to identify a character precisely that way,
to develop unambiguous matching rules, etc.

>      - The way they are displayed as initial, middle, last
> character, in upper, small upper or lower case is irrelevant.
>      - The script they belong to is irrelevant.

Interestingly, I tend to agree with these statements, at least
as I understand the term "irrelevant".  But, if you think that
initial or final forms are different characters because they
look different ("visual graphics"), then we are quite seriously
not in agreement.
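
The Greek sigma named in the subject line is a convenient case of
positional forms. The sketch below is only an illustration: it uses
Python's str.casefold(), which applies Unicode case folding and is
not the IDNA mapping itself, and the helper name same_after_folding
is hypothetical. Final and medial sigma are distinct code points
that look different, yet folding treats them as one letter:

```python
def same_after_folding(a: str, b: str) -> bool:
    """Compare two strings after Unicode case folding (illustrative helper)."""
    return a.casefold() == b.casefold()

FINAL_SIGMA = "\u03c2"    # GREEK SMALL LETTER FINAL SIGMA
MEDIAL_SIGMA = "\u03c3"   # GREEK SMALL LETTER SIGMA
CAPITAL_SIGMA = "\u03a3"  # GREEK CAPITAL LETTER SIGMA

# As raw code points the two lowercase forms are different.
print(FINAL_SIGMA == MEDIAL_SIGMA)

# After case folding all three forms match.
print(same_after_folding(FINAL_SIGMA, MEDIAL_SIGMA))
print(same_after_folding(CAPITAL_SIGMA, FINAL_SIGMA))

# A whole word: uppercase spelling versus lowercase with a final sigma.
print(same_after_folding("\u039f\u0394\u039f\u03a3", "\u03bf\u03b4\u03bf\u03c2"))
```

Matching on appearance alone would keep these forms apart; matching
after folding treats them as the same letter.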

> 2. language related issues are semantic and do not belong to
> the layer of IETF responsibility. However, nothing must
> prevent them to be restored at application layer, so the
> differences made by Michael can be respected (Words is able to
> restore upper case at the beginning of a sentence, etc.). IETF
> does not deal with artists, graphists, lawyers, etc. but with
> computers which in turn deal with them.

But the DNS is a specialized, hierarchical, distributed,
identifier namespace which has no notion of "language" or how to
code it.  Labels are not words.  So, while the paragraph above
might well be true and appropriate for some system that encodes
and transmits bodies of text, it is not relevant for the DNS.
That means there is nothing from which any semantic or stylistic
information can be restored.   

> 3. because ccTLD tables can include characters using the same
> sign as others tables, they are a working basis, but the
> semiologic sign code is not their concatenation (we would meet
> the same problem as with Unicode).

I have no idea what you mean by the above.

> 4. there is a possibility to retain most of Unicode at the price
> of complexity. It is to use classes (which can be IDNA
> classes), to be identified in a way or another, with class
> local rules. This makes IDNA more complex, but possibly faster
> to implement.

If I understand this as you intend, which I probably do not,
implementing this would require an entirely different model for
IDNA, one that would store metadata (class, language, or
something else) along with each label.   There are many reasons
why that would not work well in a DNS context but perhaps the
most important is that the user or system would need to know the
language associated with a given label in order to initiate a
lookup.  If that information was not available, we would rapidly
discover that DNS names had become ambiguous.  And, unlike
larger bodies of text that actually consisted of words, there is
little linguistic/semantic information in DNS labels.   For
example, consider the perfectly valid label "cd23xy".   One can
say with certainty that it is not a French word.  One can say
with equal certainty that it is not an English word, or a Latin
word, or a Spanish word.   But, if one needed to identify its
language (or class) from that information alone, one would be in
very deep trouble.  In the general case, such strings are all
the information the user has available, all the information
contained in an HTML reference, all the information in the
domain-part of an email address, etc.
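
To make the "cd23xy" point concrete, here is a small Python sketch.
It assumes Python's built-in "idna" codec (which implements the
original IDNA of RFC 3490, not the IDNAbis work under discussion);
the helper name to_dns_label and the second sample label are
hypothetical choices for illustration. What goes into a lookup is a
plain string of ASCII octets, with no field for language or class:

```python
def to_dns_label(label: str) -> bytes:
    """Encode one label with Python's built-in IDNA (RFC 3490) codec."""
    return label.encode("idna")

# An all-ASCII label passes through unchanged.
ascii_label = to_dns_label("cd23xy")
print(ascii_label)

# A non-ASCII label becomes an ASCII "xn--" string; still no language tag.
encoded_label = to_dns_label("b\u00fccher")
print(encoded_label)
```

In both cases the result is all a resolver gets to see, so any
language or class would have to be supplied from somewhere outside
the label.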

> 5. every solution must be fool/phishing proof at every DNS
> level.

This is an impossible goal/condition.  It requires either a
heretofore unheard-of level of international cooperation, with
no bad guys administering zones at any level of the DNS, or that
we ban fools (and damn fools) from the Earth.  I don't know
which one is harder or less likely.

> This means that the way people/word processors
> write/print/display the characters is orthogonal to domain
> name labels.

I don't see any basis for making that inference from the
comments above, even though I agree with at least one
interpretation of what you are saying.  That may, however, not
be your interpretation.

    john