Alternate coding environments (was: Re: [Json] Json and U+08A1 and related cases)

J-F C. Morfin jfc at morfin.org
Thu Jan 29 19:56:31 CET 2015


Dear John,
I am sorry for the delay in responding to this. Things are changing, 
with some ambitious cooperative projects demanding a lot of 
French/local involvement. This helps keep one in good health.


At 14:49 25/01/2015, John C Klensin wrote:
>I either don't understand this or don't see why it is either 
>necessary or desirable:

I understand you do not understand: we do not want to address the same need.

Let me digress
===========

I want a robust, basic response to the "if it looks the same, it should 
be treated as the same" need. You want the best DNS system that can use Unicode.

Your limit is perfect logic. My limit is reality. This is a common 
scientific difference between effective (me) and fundamental (you) 
thinking. Both are correct. The only thing is that yours introduces 
at least two spaces [rules and data] within the same considerations 
(so it is often more complex and more demanding).

This is why I prefer to split them differently, considering the 
fundamental as architectonics and the effective as operationalism. This 
way I avoid (as a general consideration of the Universe) having to 
answer initial questions about the reality of time, which Einstein and 
many others before and after him have made disappear. In a nutshell, I 
live in a single universe rather than discussing a multiverse: this 
allows me to consider a rough multi-internet in a single universe, 
rather than a single perfect internet in each universe of a multiverse.

I certainly think that networking brings basic metaphors to cosmology, 
but there is no need for that now :-)

Specifics of this case
===============

The difference is who is the master: the machine (theoretical and 
logical, in a local context) or the person (practical and "agoric", 
i.e. in a holistic global context).

- UNICODE is a codepoint based system, i.e. the reference is what 
UNICODE has decided to be a character.

- UNISIGN is a semiotic signpoint based system, i.e. what a common 
reader decides he/she sees/feels. We try to reduce it to UNIGRAPH, 
i.e. what the machine AND all the readers (statistics) are supposed to accept.

Your system makes sure that A -> A-label -> B = A when there are 
computers with the same software at both ends.

Reality does not ensure that A =H= B =H= A, if =H= means "equal" as 
read by multiple/different humans.
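
To make the difference concrete, here is a rough Python sketch of the 
machine-level round trip (it assumes the third-party "idna" package, 
which implements IDNA2008; the label is only an example):

import idna

u_label = "élysée"                  # String A, a U-label
a_label = idna.encode(u_label)      # String B, the A-label (b"xn--...")
round_trip = idna.decode(a_label)   # back to a U-label

assert round_trip == u_label        # A == A' between conformant machines
# Nothing here asserts that a human will not read a visually similar
# label, built from another script, as "the same" (the =H= relation).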


The story of this work
===========

Remember that when we started this WG I said I would support its work 
as long as there was no MUST on the end-to-end path, and that I would 
document a middle-layer, IDNA2008-based ML-DNS solution later on.

There was some layer confusion, but the RFC 5895 proposition addressed 
it, permitting a consensus and introducing the subsidiarity principle 
in order to embed IDNA2008 into real-life diversities (except for 
orthotypography and French-majuscule-like issues).

IMHO, this was a major advance in the overall Internet architectural 
design. But it was permitted by the RFC 5895 unlocking, which the IESG 
did not accept as part of the WG work; and the generic French 
majuscules issue was not addressed (we would not face a problem today 
if it had been).

As with IDNA2003, time was necessary to observe the field results of 
IDNA2008 as a Unicode end-to-end "IDNS". What I observe (my diagnosis) 
is that fringe-to-fringe ML-DNS is necessary and calls either for a 
non-confusability algorithm or for a Unicode replacement (either a 
complex one such as UNISIGN (the whole semiotics), or a more rustic 
one such as UNIGRAPH (only the character sets)).


The new context of this debate
=======================

Now, your points are made with CLASS="IN" in mind and in a 
pre-20150108 context 
(http://www.ietf.org/blog/2015/01/taking-a-step-towards-iana-transition/).

Mine are those of a member of:
- an RFC 6852 global - non-NTIA-dependent - community
- which is in informal standard technical innovation coopetition with the IETF
- which respects the ICP-3 recommendations and only considers 
CLASS="FL" (Free/Libre) [for the time being as a private CLASS].

So, I am no longer discussing IETF considerations for a global 
standard to be documented by http://iana.arpa. IETF leaders decided 
they did not want to assume that responsibility. I am just 
contributing on behalf of a young and loose Libre Community WG about 
what we consider in a multioperable context (shared ISP switchers - 
compatibility with our own UNINAME need, i.e. for a member directory).


>If the operation of (1), which I will restate for clarity as
>conversion of String A to an A-label as specified by IDNA2008,
>produces a String B that can be found in the public DNS (which
>means QCLASS=IN for all known Internet applications these days),
>then the above is either unnecessary or identical to what
>IDNA2008 requires.  Note, in particular, that the dual
>relationship between A-labels and U-labels guarantees that A==A'.
>
>Conversely, if you want String A to be something that cannot be
>represented directly in the DNS, at least under IDNA rules (or,
>given some of your observations about Majuscules, etc., perhaps
>even in Unicode), then it seems to me that you should be doing
>something that we've discussed on and off for well over a
>decade, specifically:
>
>(i) Look A up in some non-DNS database whose matching algorithms
>conform to your needs, yielding a string C.  One important
>property of that database should be that actual lookups of name
>by value (as well as value by name) should be feasible.  FWIW,
>note that such lookups were anticipated by the original DNS
>design but dropped (and replaced by, e.g., a separate reverse
>mapping tree for addresses) when they turned out to be
>incompatible with other aspects of that design.

If you consider strings, we are back to the preceding problem: strings 
are chains of characters. The difficulty is the possible human/machine 
disagreement over character identification.

>(ii) For maximum interoperability with the rest of the Internet,
>look up C normally in the DNS, CLASS-IN, yielding addresses of
>other records important to you.

At this time, CLASS-IN would establish too many habits. This is 
testing, and ICP-3 is reasonably clear and prudent.

>(iii) When and as needed, go back to your database to map C to A
>(A' if you like but, if there is only one table and it is used
>for both forward and reverse mappings, the identity relationship
>is guaranteed).

This is precisely what we do not want. This means that I would need 
***three*** separate non-ASCII DNS systems to be maintained in parallel!

>Interestingly, in the model above, there is no need for C to be
>in the encoding produced by the Punycode algorithm, to use the
>IDNA "xn--" prefix, or to obey any conventions other than your
>own.  You would actually gain some simplicity and resistance
>against attacks and the sort of possibly-incorrect matching
>issues that have been the topic of this thread (and many others
>over the years) by making C a pseudo-random, all-ASCII, value
>rather than relying on a trick encoding.   The only requirement
>on C is that it be unique, and there are lots of ways to
>accomplish that.

Correct. But this is no longer human-oriented. So why not just use an 
IPv6 address?

>Also see your Case (2) below.
>
> > For your information.
> >
> > As part of the ICP-3 conformant IUser community testing for the
> > Catenet we want to experiment the following
> > multi-ledger/multi-technology
> > [...] (ML-DNS)::
> >
> > 1. for the CLASS "IN" ledger of registries of
> > ICANN/NTIA/Verisign:
> >      the step 1/step 3 back and forth conversion algorithm is:
> >      - either " A=B/B=A when A and B are restricted  to  the
> > ASCII list".
> >      - or punycode otherwise.
>
>See above.  If A and B are both ASCII, those relationships are
>guaranteed by the basic DNS design modulo case-independent
>matching and non-preservation of case in a variety of contexts
>where compression is involved.

Yes. But remember I want to protect majuscules/other metafancies, so I 
must differentiate them (the usual way is upper case, but 
orthotypography has others).

>If you need case preservation in
>the ASCII range, you either cannot use the DNS or have to use a
>trick encoding for everything, including all-ASCII labels.

Right. Actually, I must consider ASCII labels as possible IDNs (as in Elysée.fr).

>Whatever that encoding might be, the Punycode algorithm, which
>makes some special assumptions about all-ASCII labels, won't be
>appropriate.

No. See above. For example, I can decide that ^e = E, so Elysee.fr is 
^elysee.fr.
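
As a rough illustration only (the caret marker and the function names 
are my assumptions, not a defined scheme, and "^" falls outside the LDH 
convention, so this only works where LDH restrictions are not 
enforced), such a reversible majuscule marking could look like this:

def encode_majuscules(label: str) -> str:
    out = []
    for ch in label:
        if ch.isupper():
            out.append("^" + ch.lower())   # E -> ^e
        else:
            out.append(ch)
    return "".join(out)

def decode_majuscules(encoded: str) -> str:
    out, i = [], 0
    while i < len(encoded):
        if encoded[i] == "^" and i + 1 < len(encoded):
            out.append(encoded[i + 1].upper())   # ^e -> E
            i += 2
        else:
            out.append(encoded[i])
            i += 1
    return "".join(out)

assert encode_majuscules("Elysee") == "^elysee"
assert decode_majuscules("^elysee") == "Elysee"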

> > 2. for the CLASS "FL" (Free/Libre) ledger of
> > registries by DNSLIB:
> >
> >      the step 1/step 3 back and forth conversion algorithm is:
> >      - either "A=B/B=A when A and B are restricted to the
> > ASCII list",
> >      - or Punycode applied to the UNISIGN semiotic
> > (Free/Libre) sign set.
>    - or no registration permitted/filtering out otherwise.
>
>First, note that the Punycode algorithm was designed
>specifically for the Unicode repertoire and block-layout system.
>It is an optimization for that system that, in particular, has
>better (shorter-string) properties than UTF-8 for strings of
>Unicode code points with particular characteristics, notably
>when there are short distances among the numerical values of the
>code points in the label string and many or all of those code
>points have sufficiently high numeric values to require
>three-octet encodings.

My priority is something that effectively works. If a solution works 
somewhere, RFC 1958 advises using it elsewhere. The idea is to use 
xu-- domain names where Punycode handles pseudo-Unicode that is 
encoded/decoded at both fringes.
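
A minimal sketch, assuming the two fringes share the same private 
mapping of signs to code points: Python's built-in "punycode" codec 
implements the RFC 3492 algorithm, and a distinct prefix keeps these 
labels from being mistaken for IDNA "xn--" A-labels (the prefix 
handling below is illustrative, not a specification):

PREFIX = "xu--"

def to_wire(label: str) -> str:
    """Fringe encoder: pseudo-Unicode label -> ASCII label."""
    if label.isascii():
        return label                  # A=B/B=A for plain ASCII labels
    return PREFIX + label.encode("punycode").decode("ascii")

def from_wire(label: str) -> str:
    """Fringe decoder: ASCII label -> pseudo-Unicode label."""
    if label.startswith(PREFIX):
        return label[len(PREFIX):].encode("ascii").decode("punycode")
    return label

wire = to_wire("élysée")
assert from_wire(wire) == "élysée"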

> > UNISIGN only includes non-confusable character and non-character
> > signs. Confusable characters share a unique common sign-point.
>
>If you are going to use a repertoire and/or coding system
>different from Unicode, then the Punycode algorithm is very
>unlikely to make any sense.  If you can use the DNS for your own
>purposes and encoding, intend to use a separate CLASS,

Yes. This is the premise.

>and can
>control the coding so that it meets your needs, then there are
>probably no advantages of using anything other than your native
>coding.

No, because I do not plan on replacing, but on adding on top. This is 
a practical issue (at least as long as we are smaller than Google+USG 
:-)). Remember: we have a need, and the need for the solution to be 
stable. This need is shared with banks, police, corporations, etc. We 
have no problem transcoding later on to any "INTERNATIONALCODE".

Our problem is simply that people do not discriminate at the machine 
level as finely as Unicode supports. Unicode is too good for us.

>Remember that the DNS is perfectly happy with labels
>consisting of octets with arbitrary values and compares them
>perfectly and consistently.  The difficulties that led to IDNA
>arise from three things:

>(a) What now seems like a peculiarity
>that octets whose values are consistent with letters in the
>ASCII range are compared case-independently but nothing else is

Yes. We need to support that, since you made IDNA reduce upper case 
to lower case during the IDNA/DNS process.

>(b) Within CLASS=IN, ASCII is basically assumed for octets in
>the range 0x00 to 0x7F; for octets outside that range, the
>character repertoire and encoding are unspecified and those
>octets are not assumed to be character data at all.  For other
>CLASSes, you could easily specify a CLASS-wide character set and
>encoding and design that encoding to not trigger any DNS
>idiosyncrasies.

This is the very basis of the "FL" CLASS. Again, this is the ICP-3 
recommendation. I do not want to innovate and take risks; I just want 
to get out of the 1986 status quo, fully respecting the stabilized 
RFCs (including the now-completed RFC 6852). This is the new 
"permissionless innovation" motto.

>(c) Many applications limit their identifiers and associated DNS
>labels to ASCII-only.  IDNA was designed to avoid having to
>upgrade all of those applications as a condition for deploying
>IDNs.    Because introducing a new coding system or lookups in a
>different database or different DNS CLASS requires upgrading all
>relevant applications anyway, you don't have that constraint.

No. It only requires people like Firefox to accept that users are 
grown up enough to decide to send to the DNS resolver what they 
actually entered, and DNS resolution to be handled at the "IUI" or 
"MYCANN plug-in" level, where entries are read and massaged 
appropriately (this is RFC 5895).
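
A minimal sketch of such a local mapping step (the function name is 
mine; the lower-casing and NFC normalisation are among the mappings 
RFC 5895 suggests, applied locally before the lookup rather than 
imposed on the wire):

import unicodedata

def local_map(user_input: str) -> str:
    s = user_input.lower()                  # fold majuscules locally
    return unicodedata.normalize("NFC", s)  # compose accents

print(local_map("Élysée"))   # -> "élysée", ready for IDNA2008 lookup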

> > The UNISIGN table results from a non-confusability algorithm
> > under financing/work. It should be based upon raster comparison
> > in a UNISIGN fount of reference.
>
>Good luck with that and do let us know, ideally in the
>peer-reviewed literature, how it works out.  It may work if you
>can select (and perhaps design) a single type family and
>eliminate all others from your environment, ideally eliminating
>all stylistic variations (like italics or bold for many
>Latin-based type families) within that family as well.

I do not need that. The fount is virtual (i.e. it only serves to 
establish the average graphic confusability; depending on the 
environment it can be Arial, usage, or the law). Please remember it is 
just a matter of deciding that Roman o and Cyrillic o are one graphic 
o, and of giving the receiving end the semiotic metadata to use it 
(language, script, orthotypography). Why do you think I made sure 
langtags would not be confusable?
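
As a toy illustration of that idea (all names and fields here are my 
assumptions, not part of any defined UNISIGN format):

from dataclasses import dataclass

@dataclass(frozen=True)
class SignPoint:
    graph: str      # the shared graphic form ("o" for Latin o and Cyrillic o)
    language: str   # e.g. a BCP 47 langtag
    script: str     # e.g. "Latn", "Cyrl"

latin_o = SignPoint(graph="o", language="fr", script="Latn")
cyrillic_o = SignPoint(graph="o", language="ru", script="Cyrl")

# Both collapse to one graph at the sign level, while the metadata
# still tells the receiving end which language/script to use.
assert latin_o.graph == cyrillic_o.graph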

>That
>would leave you in a situation in which characters that were
>confusing in your reference font might not be in more normal
>ones and vice versa.  It would, of course, work even better if
>you restrict the scripts of interest.  Otherwise, for example,
>the visual (and likely raster and even vector) similarity
>between U+03BB and U+5165, a resemblance that I think we can be
>quite confident is entirely coincidental and one where even a
>few characters of context would likely be adequate to make a
>distinction, would result in a single code point in your system.

Yes. No phishing.

>I think I've mentioned this to you before, but anyone
>contemplating automatic confusability detection based on raster
>or vector properties of printed/written characters should
>examine the pre-Kurzweil  (mid-to-late 1960s) literature in
>optical character recognition as a starting point in
>understanding how difficult it is to get from abstractions of
>characters to character identification.

There is no automatic confusability detection. There is only a 
table-based transcoding algorithm between UNICODE and UNIGRAPH within 
the UNISIGN framework (there are not only character signs). It can be 
established manually, but it will probably be easier/cheaper and more 
operationally oriented to prepare it from published/legal existing 
rasters, with the advantage of producing raster printing as well for 
easy international identification (e.g. customs, forms, signings, 
etc.). The point is not to know what Chinese people usually confuse, 
but what people from anywhere may confuse in Chinese.
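
A minimal sketch of such a table-based transcoding, with a hand-made 
toy table (the real one would come from the raster work described 
above; all names and entries are illustrative only):

UNIGRAPH_TABLE = {
    "o": "o",   # Latin small o
    "о": "o",   # Cyrillic small o (U+043E)
    "ο": "o",   # Greek small omicron (U+03BF)
    "a": "a",   # Latin small a
    "а": "a",   # Cyrillic small a (U+0430)
}

def to_unigraph(label: str) -> str:
    """Collapse a Unicode label onto its UNIGRAPH sign-points."""
    return "".join(UNIGRAPH_TABLE.get(ch, ch) for ch in label)

# Two labels a reader could take for the same word become identical:
assert to_unigraph("pаypal") == to_unigraph("paypal")   # Cyrillic vs Latin "a"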

What pre/post-Kurzweil authors may have written on raster 
confusability is certainly of interest. We will leave that to experts. 
Our need is just to start a Libre momentum and to protect the UNISIGN 
and UNIGRAPH terms, so that all this does not land in a closed 
commercial consortium but in a DECLIC approach (development for 
Catenet Libre, Institutional and Commercial use, or whatever better 
English wording you can find for the French "DEstiné au Catenet/Libre 
Institutionnel Commercial" (DECLIC) license we are working on).

jfc 


