Alternate coding environments (was: Re: [Json] Json and U+08A1 and related cases)
J-F C. Morfin
jfc at morfin.org
Thu Jan 29 19:56:31 CET 2015
Dear John,
I am sorry for the delay in responding to this. Things are changing
with some ambitious cooperative project demanding a lot of
French/Local involvement. This helps maintain good health.
At 14:49 25/01/2015, John C Klensin wrote:
>I either don't understand this or don't see why it is either
>necessary or desirable:
I understand you do not understand: we do not want to address the same need.
Let me digress
===========
I want a robust basic response to the "if it looks the same, it is to
be the same" need. You want the best DNS system that can use Unicode.
Your limit is perfect logic. My limit is reality. This is a common
scientific difference between effective (me) and fundamental (you)
thinking. Both are correct. The only thing is that yours introduces
at least two spaces [rules and data] within the same considerations
(so it is often more complex and more demanding).
This is why I prefer to split them differently, considering
fundamental as architectonics and effective as operationalism. This
way I avoid (this is a general consideration of the Universe) having to
answer initial questions on the reality of time, which Einstein and
many others before and after him have made disappear. In a
nutshell, I live in a single universe, rather than discussing a
multiverse: this allows me to consider a rough multi-internet in a
single universe, rather than a single perfect internet in each
universe of a multiverse.
I certainly think that networking brings basic metaphors to
cosmology, but there is no need for that now :-)
Specifics of this case
================
The difference is who is the master: the machine (theoretical and
logic, in a local context) or the person (practical and "agoric",
i.e. in a holistic global context).
- UNICODE is a codepoint based system, i.e. the reference is what
UNICODE has decided to be a character.
- UNISIGN is a semiotic signpoint based system, i.e. what a common
reader decides he/she sees/feels. We try to reduce it to UNIGRAPH,
i.e. what the machine AND all the readers (statistics) are supposed to accept.
Your system makes sure that A -> A-label -> B = A when there are
computers with the same software at both ends.
Reality does not ensure that A =H= B =H= A if =H= means "equals" by
multiple/different-humans reading.
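To make the contrast concrete, here is a minimal Python sketch. It is an illustration only: it assumes Python's built-in "punycode" codec as a stand-in for the IDNA2008 conversion (no IDNA2008 validity checks are performed), and the helper names to_a_label and to_u_label are hypothetical. It shows that the machine round trip is exact, while two labels that a human reads as the same word yield different A-labels.

```python
# Illustrative sketch only: the "punycode" codec stands in for the IDNA2008
# U-label -> A-label step; no IDNA2008 validity rules are applied here.

def to_a_label(u_label: str) -> str:
    """Hypothetical helper: Punycode-encode and add the "xn--" prefix."""
    return "xn--" + u_label.encode("punycode").decode("ascii")

def to_u_label(a_label: str) -> str:
    """Hypothetical helper: strip the "xn--" prefix and Punycode-decode."""
    return a_label[len("xn--"):].encode("ascii").decode("punycode")

# Two labels that look alike to a reader ("école"):
latin = "\u00e9cole"           # all Latin letters
mixed = "\u00e9c\u043ele"      # CYRILLIC SMALL LETTER O in place of Latin o

# Machine view: A -> A-label -> B gives B == A, exactly.
round_trip_ok = to_u_label(to_a_label(latin)) == latin

# Human view: the two labels look the same, yet their A-labels differ.
same_for_machine = to_a_label(latin) == to_a_label(mixed)

print(round_trip_ok, same_for_machine)
```

The first value reflects the A -> A-label -> B = A guarantee; the second shows why that guarantee says nothing about A =H= B.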
The story of this work
================
Remember when we started this WG I said I would support its work as
long as there would be no MUST on the end to end path, and I would
document a middle layer IDNA2008 based ML-DNS solution further on.
There was some layer confusion but the RFC 5895 proposition addressed
them, permitting a consensus, introducing the subsidiarity principle
in order to embed IDNA2008 into real-life diversities (except for
orthotypography and French Majuscule like issues).
IMHO, this was a major advancement in the overall Internet
architectural design. But it was permitted by the RFC 5895 unlocking,
which the IESG did not accept as a part of the WG work; and the French
majuscules generic issue was not addressed (we would not face a
problem today if it had been).
As for IDNA2003, time was necessary to observe the field results of
IDNA2008 as a Unicode end to end "IDNS". What I observe (my
diagnosis) is that fringe to fringe ML-DNS is necessary and either
calls for a non-confusability algorithm or for a Unicode replacement
(either a complex one as UNISIGN (the whole semiotic), or more rustic
one as UNIGRAPH (only the character sets)).
The new context of this debate
=======================
Now, your points are made with CLASS="IN" in mind and in a
pre-20150108 context
(http://www.ietf.org/blog/2015/01/taking-a-step-towards-iana-transition/).
Mine are those of a member of:
- an RFC 6852 global - non-NTIA dependent - community
- which is in informal standard technical innovation coopetition with the IETF
- which respects the ICP-3 recommendations and only considers
CLASS="FL" (Free/Libre), [for the time being as a private CLASS].
So, I am no longer discussing IETF considerations for a global
standard to be documented by http://iana.arpa. IETF leaders decided
they did not want to assume this responsibility. I am just
contributing on behalf of a young and loose Libre Community WG about
what we consider in a multioperable context (shared ISP switchers -
compatibility with our own UNINAME need, i.e. for a member directory).
>If the operation of (1), which I will restate for clarity as
>conversion of String A to an A-label as specified by IDNA2008,
>produces a String B that can be found in the public DNS (which
>means QCLASS=IN for all known Internet applications these days),
>then the above is either unnecessary or identical to what
>IDNA2008 requires. Note, in particular, that the dual
>relationship between A-labels and U-labels guarantees that A==A'.
>
>Conversely, if you want String A to be something that cannot be
>represented directly in the DNS, at least under IDNA rules (or,
>given some of your observations about Majuscules, etc., perhaps
>even in Unicode), then it seems to me that you should be doing
>something that we've discussed on and off for well over a
>decade, specifically:
>
>(i) Look A up in some non-DNS database whose matching algorithms
>conform to your needs, yielding a string C. One important
>property of that database should be that actual lookups of name
>by value (as well as value by name) should be feasible. FWIW,
>note that such lookups were anticipated by the original DNS
>design but dropped (and replaced by, e.g., a separate reverse
>mapping tree for addresses) when they turned out to be
>incompatible with other aspects of that design.
If you consider strings we are back to the preceding problem: strings
are chains of characters. The difficulty is with the man/machine
possible disagreement over characters identification.
>(ii) For maximum interoperability with the rest of the Internet,
>look up C normally in the DNS, CLASS-IN, yielding addresses of
>other records important to you.
At this time CLASS-IN would be establishing too many habits. This is
testing, and ICP-3 is reasonably clear and prudent.
>(iii) When and as needed, go back to your database to map C to A
>(A' if you like but, if there is only one table and it is used
>for both forward and reverse mappings, the identity relationship
>is guaranteed).
This is precisely what we do not want. This means that I would need
***three*** separate non-ASCII DNS systems to be maintained in parallel!
>Interestingly, in the model above, there is no need for C to be
>in the encoding produced by the Punycode algorithm, to use the
>IDNA "xn--" prefix, or to obey any conventions other than your
>own. You would actually gain some simplicity and resistance
>against attacks and the sort of possibly-incorrect matching
>issues that have been the topic of this thread (and many others
>over the years) by making C a pseudo-random, all-ASCII, value
>rather than relying on a trick encoding. The only requirement
>on C is that it be unique, and there are lots of ways to
>accomplish that.
Correct. But this is not human oriented anymore. So why not just
use an IPv6 address?
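For reference, the three-step model quoted above (look A up in a non-DNS database to get C, resolve C in the DNS, map C back to A) can be sketched as follows. This is only an illustration under stated assumptions: the Directory class, its method names and the "k" prefix are hypothetical, and the DNS lookup of C is left out.

```python
import secrets

class Directory:
    """Hypothetical non-DNS database: one table used for both directions."""

    def __init__(self):
        self._key_by_name = {}
        self._name_by_key = {}

    def register(self, name: str) -> str:
        """Assign a unique, pseudo-random, all-ASCII key C to name A."""
        if name in self._key_by_name:
            return self._key_by_name[name]
        key = "k" + secrets.token_hex(8)
        while key in self._name_by_key:      # uniqueness is the only requirement
            key = "k" + secrets.token_hex(8)
        self._key_by_name[name] = key
        self._name_by_key[key] = name
        return key

    def key_for(self, name: str) -> str:
        """Step (i): A -> C."""
        return self._key_by_name[name]

    def name_for(self, key: str) -> str:
        """Step (iii): C -> A; identity holds because there is one table."""
        return self._name_by_key[key]

directory = Directory()
c = directory.register("\u00c9lys\u00e9e")   # "Élysée", case preserved
# Step (ii) would resolve c (e.g. as a label under some zone) in the DNS.
print(c.isascii(), directory.name_for(c) == "\u00c9lys\u00e9e")
```

As the reply notes, the key C carries no meaning for a human reader; only the directory makes it usable.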
>Also see your Case (2) below.
>
> > For your information.
> >
> > As part of the ICP-3 conformant IUser community testing for the
> > Catenet we want to experiment the following
> > multi-ledger/multi-technology [...] (ML-DNS):
> >
> > 1. for the CLASS "IN" ledger of registries of
> > ICANN/NTIA/Verisign:
> > the step 1/step 3 back and forth conversion algorithm is:
> > - either " A=B/B=A when A and B are restricted to the
> > ASCII list".
> > - or punycode otherwise.
>
>See above. If A and B are both ASCII, those relationships are
>guaranteed by the basic DNS design modulo case-independent
>matching and non-preservation of case in a variety of contexts
>where compression is involved.
Yes. But remember I want to protect majuscules/other metafancies so I
must differentiate them (usual way is upper cases, but
orthotypography has others).
>If you need case preservation in
>the ASCII range, you either cannot use the DNS or have to use a
>trick encoding for everything, including all-ASCII labels.
Right. Actually I must consider ASCII as possible IDNs (as in Elysée.fr)
>Whatever that encoding might be, the Punycode algorithm, which
>makes some special assumptions about all-ASCII labels, won't be
>appropriate.
No. See above. For example I can decide that ^e = E, so Elysee.fr is
^elysee.fr.
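A minimal sketch of such a case-protecting convention, in Python. Only the "^e = E" example comes from the text above; the function names are hypothetical, and the sketch assumes that "^" never occurs literally in a label (a real convention would need an escape for that).

```python
import string

ESCAPE = "^"  # assumed marker, taken from the "^e = E" example

def protect_case(label: str) -> str:
    """Rewrite each ASCII upper-case letter as ESCAPE + lower-case letter."""
    out = []
    for ch in label:
        if ch in string.ascii_uppercase:
            out.append(ESCAPE + ch.lower())
        else:
            out.append(ch)
    return "".join(out)

def restore_case(encoded: str) -> str:
    """Inverse of protect_case: ESCAPE + letter becomes the upper-case letter."""
    out = []
    chars = iter(encoded)
    for ch in chars:
        if ch == ESCAPE:
            out.append(next(chars, "").upper())
        else:
            out.append(ch)
    return "".join(out)

print(protect_case("Elysee.fr"))     # ^elysee.fr
print(restore_case("^elysee.fr"))    # Elysee.fr
```

The protected form is all lower case, so it survives case-insensitive matching and case folding along the path, and the fringe can restore the majuscule.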
> > 2. for the CLASS "FL" (Free/Libre) ledger of
> > registries by DNSLIB:
> >
> > the step 1/step 3 back and forth conversion algorithm is:
> > - either " A=B/B=A when A and B are restricted to the
> > ASCII list".
> > - or punycode applied to the UNISIGN semiotic
> > (Free/Libre) sign set.
> > - or no registration permitted/filtering out otherwise.
>
>First, note that the Punycode algorithm was designed
>specifically for the Unicode repertoire and block-layout system.
>It is an optimization for that system that, in particular, has
>better (shorter-string) properties than UTF-8 for strings of
>Unicode code points with particular characteristics, notably
>when there are short distances among the numerical values of the
>code points in the label string and many or all of those code
>points have sufficiently high numeric values to require
>three-octet encodings.
My priority is something that effectively works. If a solution works
somewhere, RFC 1958 advises using it elsewhere. The idea is to use
xu-- domain names where punycode handles pseudo-Unicode labels that are
encoded/decoded at both fringes.
> > UNISIGN only includes non-confusable character and non
> > character
> > signs. Confusable characters share a unique common sign-point.
>
>If you are going to use a repertoire and/or coding system
>different from Unicode, then the Punycode algorithm is very
>unlikely to make any sense. If you can use the DNS for your own
>purposes and encoding, intend to use a separate CLASS,
Yes. This is the premise.
>and can
>control the coding so that it meets your needs, then there are
>probably no advantages of using anything other than your native
>coding.
No, because I do not plan replacing, but adding on top. This is a
practical issue (at least as long as we are smaller than Google+USG
:-)) Remember: we have a need, and we need the solution to be
stable. This need is common with Banks, Police, Corporations, etc. We
have no problem in transcoding further on to any "INTERNATIONALCODE".
Our problem is simply that people do not discriminate at the machine
level as Unicode supports it. Unicode is too good for us.
>Remember that the DNS is perfectly happy with labels
>consisting of octets with arbitrary values and compares them
>perfectly and consistently. The difficulties that led to IDNA
>arise from three things:
>(a) What now seems like a peculiarity
>that octets whose values are consistent with letters in the
>ASCII range are compared case-independently but nothing else is
Yes. We need to support that since you made IDNA reduce upper-cases
to lower-cases during the IDNA DNS process.
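The matching rule described in (a) can be sketched in Python as follows. This is an illustration of the behaviour as stated in the quoted text, not DNS server code; the function names are hypothetical.

```python
def fold_ascii(label: bytes) -> bytes:
    """Lower-case only octets 0x41-0x5A (ASCII A-Z); leave all others as-is."""
    return bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in label)

def dns_labels_match(a: bytes, b: bytes) -> bool:
    """Compare two labels octet by octet after ASCII-only case folding."""
    return fold_ascii(a) == fold_ascii(b)

# ASCII letters: case is ignored.
print(dns_labels_match(b"Elysee", b"elysee"))                  # True

# Non-ASCII octets (here UTF-8 for "É" and "é"): compared exactly.
print(dns_labels_match("\u00c9".encode("utf-8"),
                       "\u00e9".encode("utf-8")))              # False
```

So an upper-case distinction in the ASCII range cannot be carried by the label octets themselves, while any distinction outside that range is preserved.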
>(b) Within CLASS=IN, ASCII is basically assumed for octets in
>the range 0x00 to 0x7F; for octets outside that range, the
>character repertoire and encoding are unspecified and those
>octets are not assumed to be character data at all. For other
>CLASSes, you could easily specify a CLASS-wide character set and
>encoding and design that encoding to not trigger any DNS
>idiosyncrasies.
This is the very basis of the "FL" CLASS. Again, this is ICP-3
recommendation. I do not want to innovate and take risks. Just to get
out of the 1986 status-quo, fully respecting the stabilized RFC
(including the now completed RFC 6852). This is the new
"permissionless innovation" motto.
>(c) Many applications limit their identifiers and associated DNS
>labels to ASCII-only. IDNA was designed to avoid having to
>upgrade all of those applications as a condition for deploying
>IDNs. Because introducing a new coding system or lookups in a
>different database or different DNS CLASS requires upgrading all
>relevant applications anyway, you don't have that constraint.
No. It only requires people like Firefox to accept that users are
grown-up enough and can decide to send to the DNS resolver what they
actually entered, and the DNS resolution to be handled at the "IUI"
or "MYCANN Plug-in" level, where entries are read and massaged
appropriately (this is RFC 5895).
> > The UNISIGN table results from a non-confusability algorithm
> > under financing/work. It should be based upon rastering comparison
> > in a UNISIGN fount of reference.
>
>Good luck with that and do let us know, ideally in the
>peer-reviewed literature, how it works out. It may work if you
>can select (and perhaps design) a single type family and
>eliminate all others from your environment, ideally eliminating
>all stylistic variations (like italics or bold for many
>Latin-based type families) within that family as well.
I do not need that. The fount is virtual (i.e. it only serves to
establish the average graphic confusability; depending on the
environment it can be Arial, common usage, or the law). Please
remember it is just deciding that roman o and cyrillic o are a
graphic o and giving the receiving
end the semiotics metadata to use it (language, script,
orthotypography). Why do you think I made sure langtags would not be
confusable?
>That
>would leave you in a situation in which characters that were
>confusing in your reference font might not be in more normal
>ones and vice versa. It would, of course, work even better if
>you restrict the scripts of interest. Otherwise, for example,
>the visual (and likely raster and even vector) similarity
>between U+03BB and U+5165, a resemblance that I think we can be
>quite confident is entirely coincidental and one where even a
>few characters of context would likely be adequate to make a
>distinction, would result in a single code point in your system.
Yes. No phishing.
>I think I've mentioned this to you before, but anyone
>contemplating automatic confusability detection based on raster
>or vector properties of printed/written characters should
>examine the pre-Kurzweil (mid-to-late 1960s) literature in
>optical character recognition as a starting point in
>understanding how difficult it is to get from abstractions of
>characters to character identification.
There is no automatic confusability detection. There is only a table
based transcoding algorithm, between UNICODE and UNIGRAPH in the
UNISIGN framework (there are not only character signs). It can be
established manually, but it will probably be easier/cheaper and more
operationally oriented to prepare it with published/legal existing
rasters with the advantage to produce raster printing as well for
easy international identifications (ex. Customs, forms, signings,
etc.). The point is not to know what Chinese readers tend to confuse but what
people from anywhere may confuse in Chinese.
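A table-based transcoding of this kind can be sketched in Python as follows. The three-entry SIGN_TABLE is a hypothetical, hand-made excerpt and not the UNISIGN/UNIGRAPH table, the function name is an assumption, and the script hint taken from the first word of the Unicode character name is only a rough stand-in for real semiotic metadata.

```python
import unicodedata

# Hypothetical excerpt: confusable characters share one common sign.
SIGN_TABLE = {
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
}

def to_signs(text: str):
    """Return (sign string, per-character script hints) for a Unicode string."""
    signs = "".join(SIGN_TABLE.get(ch, ch) for ch in text)
    # Rough metadata: first word of the Unicode name, e.g. "LATIN", "CYRILLIC".
    hints = [unicodedata.name(ch, "UNKNOWN").split()[0] for ch in text]
    return signs, hints

latin_signs, latin_hints = to_signs("poste")
mixed_signs, mixed_hints = to_signs("p\u043este")   # Cyrillic o inside

print(latin_signs == mixed_signs)   # True: one sign string for both inputs
print(mixed_hints[1])               # CYRILLIC: metadata kept for the receiver
```

The comparison is done on the sign string, while the hints travel alongside so the receiving end can still tell which script was entered.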
What pre/post-Kurzweil authors may have written on raster confusability
is certainly of interest. We will leave that to experts. Our need is
just to start a Libre momentum and protect the UNISIGN and UNIGRAPH
terms so all this does not land in a closed commercial consortium
but in a DECLIC approach (development for Catenet Libre,
Institutional and Commercial use, or whatever better English wording
you can find for the French "DEstiné au Catenet/Libre Institutionnel
Commercial" (DECLIC) license we work on).
jfc