Casefolding Sigma (was: Re: IDNAbis Preprocessing Draft)

Wed Jan 23 03:50:59 CET 2008

John,
I think we are quite in agreement. The problem is the lack of clear 
layering (nothing to do with IP vs. OSI, it has the same problem), 
and of appropriate terminology for each layer. As a result it takes 
time to be sure we say the same thing.

At 00:22 23/01/2008, John C Klensin wrote:
>--On Tuesday, 22 January, 2008 23:36 +0100 JFC Morfin
><jefsey at jefsey.com> wrote:
>
> > At 19:59 22/01/2008, John C Klensin wrote:
> >> When we can avoid it, I find it helpful to avoid thinking
> >> about and debating individual characters.  Instead, let's
> >> focus on principles,
> >
> > Dear John,
> > your analysis seems to be correct but on one point that
> > Michael pointed out. You talk of "characters" but  do not
> > define what a "character" is.
>
>That was quite deliberate, since the definition of what is, and
>is not, a "character" also depends on context and local
>definitions.

Then you cannot work on something of which the "character" is the atom?

> > It seems it can be:
> > - either a visual item (Michael)
>
>At least in the vicinity of Unicode, that is usually called a
>"glyph".  I don't think Michael thinks that defines a
>"character" either, although he can speak for himself.

As you know, I do not put myself in the vicinity of Unicode, which is 
script oriented, but in a semiologic context, which is much clearer. 
So I refer to signs. Your thinking of a sign as a glyph creates a 
difficulty for you hereafter. My awkward use of "visual graphic" also 
creates confusion. Let us stick to sign as the materialisation of a 
concept - here a grapheme concept? (I see no difference with any 
other concept).

- Unicode thinks of "o", "omicron", "Cyrillic o" - 3 code points.
- you think of their glyphs - 1 or 3 code points.
- I think of a half-height circle on top of the line - 1 code point -
  unambiguous.
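
To make the "3 code points" case concrete, here is a minimal Python sketch using the standard unicodedata module. The names `lookalikes` and `describe` are illustrative only and come from nowhere in this thread.

```python
import unicodedata

# Three letters that usually render as the same small circle,
# yet are three distinct Unicode code points.
lookalikes = ["o", "\u03bf", "\u043e"]

def describe(ch):
    """Return (code point written as U+XXXX, Unicode character name)."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch))

for ch in lookalikes:
    print(describe(ch))
```

A "sign"-based code as described above would have to map all three onto one value; Unicode itself does not do that.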

> > - either a registered DNS item (you)
> > - or a Unicode point (IDNA).
>
>Actually, since IDNA is defined in terms of Unicode, I can
>accept "Unicode code point", although with some qualifications
>about combining forms, etc.

The problem is the qualifications, the duplication of the same 
"form" (philosophical meaning), etc., which the DNS does not know about.

> > If you do not say:
> > - what a character is,
> > - at what layer language (and therefore semantic) issues are
> > dealt with,
> > we will stay with confusion, and different forms of layer
> > violation depending on who speaks.
>
>From my point of view, language issues (with or without
>semantics) need to be dealt with at some layer above the DNS.
>There is really no practical alternative, since DNS labels are
>very short strings and no language-coding information is present
>in the DNS.

Yes. (However one can conventionally induce some indications through 
classes, domains, zones)

> > As far I am concerned:
> >
> > 1. "characters" are a set of visual graphics that are
> > registered in the same DNS way.
>
>Then we disagree, because font and artistic variations in glyphs

You talk of glyphs. I don't. Fonts and arts are totally foreign to my 
thinking of signs.

>makes it impossible to identify a character precisely that way,
>develop unambiguous matching rules, etc.

see above.
1. You are using the term "glyph", which relates to a fount. I have no 
artistic variation. I am just graphically expressing a concept 
concatenating sub-concepts: the A and a signs (final and initial, 
etc.) have the same code point, because the users said so. This is 
the difference between Michael and me: he considers the way men 
output a character. I consider the way Turing machines input it.
2. There is no ambiguity. What is considered is the code point of the 
concept. This is what you do in ASCII. You put all the glyphs of A in 
the universe into a single bag and you number it "A". Then you can 
work on it. And then Michael can work at the ends to present them the 
way he wants, using all the metadata he wants - transmitted aside from 
the DNS flow.

> >      - The way they are displayed as initial, middle, last
> > character, in upper, small upper or lower case is irrelevant.
> >      - The script they belong to is irrelevant.
>
>Interestingly, I tend to agree with these statements, at least
>as I understand the term "irrelevant".  But, if you think that
>initial or final forms are different characters because they
>look different ("visual graphics"), then we are quite seriously
>not in agreement.

We are in agreement. Let us say: the DNS spells labels, it does not fax glyphs.
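
As an illustration of treating case and positional form as irrelevant for matching, here is a small Python sketch built on str.casefold(), which applies Unicode full case folding. It is only an analogy: `same_label` is a hypothetical helper, and this is not the IDNA mapping itself.

```python
# Code points for the three Greek sigmas discussed in this thread's subject.
CAPITAL_SIGMA = "\u03a3"  # capital sigma
SMALL_SIGMA = "\u03c3"    # small (non-final) sigma
FINAL_SIGMA = "\u03c2"    # small final sigma

def same_label(a, b):
    """Compare two strings after Unicode case folding (illustrative only)."""
    return a.casefold() == b.casefold()

print(same_label(CAPITAL_SIGMA, FINAL_SIGMA))  # capital vs. final form
print(same_label("A", "a"))                    # the ASCII case
print(same_label("o", "\u03bf"))               # Latin o vs. Greek omicron
```

Case folding merges the sigma forms and the two cases of A, but it leaves Latin o and omicron distinct, which is the gap between this kind of folding and the "sign" view.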

> > 2. language related issues are semantic and do not belong to
> > the layer of IETF responsibility. However, nothing must
> > prevent them to be restored at application layer, so the
> > differences made by Michael can be respected (Words is able to
> > restore upper case at the beginning of a sentence, etc.). IETF
> > does not deal with artists, graphists, lawyers, etc. but with
> > computers which in turn deal with them.
>
>But the DNS is a specialized, hierarchical, distributed,
>identifier namespace which has no notion of "language" or how to
>code it.

This is what I say. Languages do not belong to the IETF.

>Labels are not words.

Full agreement.

>So, while the paragraph above
>might well be true and appropriate for some system that encodes
>and transmits bodies of text, it is not relevant for the DNS.

What I say is that, just as Words knows how to massage the 
presentation of your text, you can imagine an IDNScript program 
massaging the DN presentation the way languages demand it, making 
differences between initial, middle and final glyphs, colours, sizes, 
in various founts. That added intelligence has no place in IDNA, 
but can complement it at an end-to-end application layer. This has 
nothing to do with the network layers where the DNS is embedded to 
provide destination resolution.

>That means there is nothing from which any semantic or stylistic
>information can be restored.

We are in full agreement.
What I say in addition is only that IDNX MUST NOT prevent semantic 
and esthetic information (separately transported) from being applied.
Nothing prevents Michael from sending a metadata format for the way he 
wants the label to be displayed.
1. that format has nothing to do with the DNS
2. but we have to make sure that the way IDNX works does not confuse 
that format.

> > 3. because ccTLD tables can include characters using the same
> > sign as others tables, they are a working basis, but the
> > semiologic sign code is not their concatenation (we would meet
> > the same problem as with Unicode).
>
>I have no idea what you mean by the above.

I suppose you might understand it better on this second reading?

> > 4. there is a possibility to retain most of Unicode at the price
> > of complexity. It is to use classes (which can be IDNA
> > classes), to be identified in a way or another, with class
> > local rules. This makes IDNA more complex, but possibly faster
> > to implement.
>
>If I understand this as you intend, which I probably do not,
>implementing this would require an entirely different model for
>IDNA, one that would store metadata (class, language, or
>something else) along with each label.

No. Why do you want to store it along with each label? Michael only 
needs it to be matched at the other end. This does not mean that it 
has to travel with it, so as not to make the resolution more hazardous.

The problem is just that you keep doing the same layer violation, 
thinking of glyphs. The DNS data and the different added metadata do 
not belong to the same protocol.

>There are many reasons
>why that would not work well in a DNS context but perhaps the
>most important is that the user or system would need to know the
>language associated with a given label in order to initiate a
>lookup.  If that information was not available, we would rapidly
>discover that DNS names had become ambiguous.  And, unlike
>larger bodies of text that actually consisted of words, there is
>little linguistic/ semantic information in DNS labels.   For
>example, consider the perfectly valid label "cd23xy".   One can
>say with certainty that it is not a French word.  One can say
>with equal certainty that it is not an English word, or a Latin
>word, or a Spanish word.   But, if one needed to identify its
>language (or class) from that information alone, one would be in
>very deep trouble.  In the general case, such strings are all
>the information the user has available, all the information
>contained in an HTML reference, all the information in the
>domain-part of an email address, etc.

Correct (I will not discuss what smartly using aliases or cnames 
could permit :-)).
Now, if you consider what IDNA is:
- people exchange "xn--cd23xy"-like labels.
- they share a format which permits them to enter and retrieve it in a 
different form.
That format is punycode.
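
A short Python sketch of that round trip, using the interpreter's built-in "punycode" and "idna" codecs (the latter implements the IDNA 2003 rules). The label chosen is an arbitrary example, not one taken from this thread.

```python
# An arbitrary non-ASCII label: "bucher" with u-umlaut as second letter.
label = "b\u00fccher"

# Raw Punycode (RFC 3492) encoding of the label.
raw = label.encode("punycode")

# ASCII-compatible form with the "xn--" prefix, as exchanged in the DNS.
ace = label.encode("idna")

# Decoding the ASCII form gives back the label in its other form.
back = ace.decode("idna")

print(raw, ace, back == label)
```

Only the "xn--" form travels in the DNS; the other form exists at the two ends, which is the layering point made above.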

> > 5. every solution must be fool/phishing proof at every DNS
> > level.
>
>This is an impossible goal/ condition.  It either requires a
>heretofore unheard-of level of international cooperation, with
>no bad guys administering zones at any level of the DNS or that
>we ban  fools (and damn fools) from the Earth.  I don't know
>which one is harder or less likely.

No. It only means that there must be no way to get something out which 
is different from what went in, according to the universally 
accepted rules of equivalence, for the labels at every level. This is 
what we have today in ASCII.

> > This means that the way people/word processors
> > write/print/display the characters is orthogonal to domain
> > name labels.
>
>I don't see any basis for making that inference from the
>comments above, even though I agree with at least one
>interpretation of what you are saying.  That may, however, not
>be your interpretation.

I do not know where you lost the logic, so I cannot help right now. I 
suppose that if you reread it with my comments, you might understand 
better. It would be a big thing if you forgot about glyphs and if we 
could stick to signs, which are much more general. Glyphs are related 
to scripts. A sign can be anything: script, music, alarm, colour, a 
gesture, the sunset, etc. This has much more powerful uses.
jfc