Alternate coding environments (was: Re: [Json] Json and U+08A1 and related cases)

John C Klensin john-ietf at jck.com
Sun Jan 25 14:49:52 CET 2015


(address list trimmed to IDNA and IUCG lists, with apologies to
the former, but this one seems to need a response)

--On Sunday, January 25, 2015 03:50 +0100 JFC Morfin
<jefsey at jefsey.com> wrote:

> We are discussing a three-step multi-layer digital name system
> (ML-DNS) to be used by multiple applications and network
> technologies on the catenet:
> 
> 1. I enter a string A in my local query system which
> punycodes it
>      into string B so it may be used by the next step. (RFC
> 4895)
> 2. The string B is entered in the Domain Name Service of the
> concerned
>      CLASS to get an IP address.
> 3. the string B is restored as A' in the destination system.
> 
> The target is that A'=A.

I either don't understand this or don't see why it is
necessary or desirable:

If the operation of (1), which I will restate for clarity as
conversion of String A to an A-label as specified by IDNA2008,
produces a String B that can be found in the public DNS (which
means QCLASS=IN for all known Internet applications these days),
then the above is either unnecessary or identical to what
IDNA2008 requires.  Note, in particular, that the dual
relationship between A-labels and U-labels guarantees that A==A'.
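To make that round trip concrete, here is a minimal sketch in
Python (assuming the third-party "idna" package, which
implements IDNA2008; the label is just an example):

    # U-label <-> A-label round trip under IDNA2008.
    # Requires the third-party "idna" package (pip install idna).
    import idna

    u_label = "bücher"               # String A (a U-label)
    a_label = idna.encode(u_label)   # String B (the A-label), as bytes
    restored = idna.decode(a_label)  # String A'

    assert restored == u_label       # the dual relationship: A == A'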

Conversely, if you want String A to be something that cannot be
represented directly in the DNS, at least under IDNA rules (or,
given some of your observations about Majuscules, etc., perhaps
even in Unicode), then it seems to me that you should be doing
something that we've discussed on and off for well over a
decade, specifically:

(i) Look A up in some non-DNS database whose matching algorithms
conform to your needs, yielding a string C.  One important
property of that database should be that actual lookups of name
by value (as well as value by name) should be feasible.  FWIW,
note that such lookups were anticipated by the original DNS
design but dropped (and replaced by, e.g., a separate reverse
mapping tree for addresses) when they turned out to be
incompatible with other aspects of that design.

(ii) For maximum interoperability with the rest of the Internet,
look up C normally in the DNS, CLASS-IN, yielding addresses of
other records important to you.

(iii) When and as needed, go back to your database to map C to A
(A' if you like but, if there is only one table and it is used
for both forward and reverse mappings, the identity relationship
is guaranteed).

Interestingly, in the model above, there is no need for C to be
in the encoding produced by the Punycode algorithm, to use the
IDNA "xn--" prefix, or to obey any conventions other than you
own.  You would actually gain some simplicity and resistance
against attacks and the sort of possibly-incorrect matching
issues that have been the topic of this thread (and many others
over the years) by making C a pseudo-random, all-ASCII value
rather than relying on a trick encoding.  The only requirement
on C is that it be unique, and there are lots of ways to
accomplish that.
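Here is a sketch of steps (i)-(iii) with such an opaque key,
using only Python's standard library (the table layout, the zone
name, and the use of getaddrinfo are illustrative assumptions,
not a specification):

    import secrets
    import socket

    forward = {}   # A -> C ("value by name")
    reverse = {}   # C -> A ("name by value")

    def register(name_a: str) -> str:
        # Step (i): map an arbitrary string A to a unique, all-ASCII,
        # otherwise meaningless key C.
        key_c = "x" + secrets.token_hex(8)
        forward[name_a] = key_c
        reverse[key_c] = name_a
        return key_c

    def resolve(name_a: str, zone: str = "example.org"):
        # Step (ii): ordinary CLASS=IN lookup of C in the DNS.
        key_c = forward[name_a]
        addresses = socket.getaddrinfo(key_c + "." + zone, None)
        # Step (iii): map C back; with one table pair, A' == A
        # by construction.
        return addresses, reverse[key_c]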

Also see your Case (2) below.

> For your information.
> 
> As part of the ICP-3 conformant IUser community testing for the
> Catenet we want to experiment with the following
> multi-ledger/multi-technology
> approach (ML-DNS):
> 
> 1. for the CLASS "IN" ledger of registries of
> ICANN/NTIA/Verisign:
> 
>      the step 1/step 3 back and forth conversion algorithm is:
>      - either " A=B/B=A when A and B are restricted  to  the
> ASCII list".
>      - or punycode otherwise.

See above.  If A and B are both ASCII, those relationships are
guaranteed by the basic DNS design modulo case-independent
matching and non-preservation of case in a variety of contexts
where compression is involved.  If you need case preservation in
the ASCII range, you either cannot use the DNS or have to use a
trick encoding for everything, including all-ASCII labels.
Whatever that encoding might be, the Punycode algorithm, which
makes some special assumptions about all-ASCII labels, won't be
appropriate.
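For reference, the comparison the DNS actually performs on
labels (per RFC 1035 as clarified by RFC 4343) is easy to write
down; a sketch in Python:

    # Octets are compared exactly, except that ASCII letters
    # (0x41-0x5A / 0x61-0x7A) match case-independently.
    def dns_labels_equal(a: bytes, b: bytes) -> bool:
        def fold(o: int) -> int:
            return o + 0x20 if 0x41 <= o <= 0x5A else o
        return len(a) == len(b) and all(
            fold(x) == fold(y) for x, y in zip(a, b))

    assert dns_labels_equal(b"Example", b"eXaMpLe")        # letters fold
    assert not dns_labels_equal(b"\xc3\xa9", b"\xc3\x89")  # other octets do not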

> 2. for the private CLASS "FL" (Free/Libre) ledger of
> registries by DNSLIB:
> 
>      the step 1/step 3 back and forth conversion algorithm is:
>      - either " A=B/B=A when A and B are restricted  to  the
> ASCII list".
>      - or punycode applied to the UNISIGN semiotic
> (Free/Libre) sign set.
>      - or no registration permitted/filtering out otherwise.

First, note that the Punycode algorithm was designed
specifically for the Unicode repertoire and block-layout system.
It is an optimization for that system that, in particular, has
better (shorter-string) properties than UTF-8 for strings of
Unicode code points with particular characteristics, notably
when there are short distances among the numerical values of the
code points in the label string and many or all of those code
points have sufficiently high numeric values to require
three-octet encodings.
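A quick way to see the effect is Python's built-in "punycode"
codec (the bare RFC 3492 algorithm, without the IDNA "xn--"
prefix); the Devanagari label below is just an illustrative
example of such a string:

    label = "\u0915\u0949\u092e"          # three code points above U+0800,
                                          # numerically close together
    print(len(label.encode("utf-8")))     # 9 octets (three per code point)
    print(len(label.encode("punycode")))  # shorter, delta-based encoding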

> UNISIGN only includes non-confusable character and
> non-character
> signs. Confusable characters share a unique common sign-point.

If you are going to use a repertoire and/or coding system
different from Unicode, then the Punycode algorithm is very
unlikely to make any sense.  If you can use the DNS for your own
purposes and encoding, intend to use a separate CLASS, and can
control the coding so that it meets your needs, then there are
probably no advantage to using anything other than your native
coding.  Remember that the DNS is perfectly happy with labels
consisting of octets with arbitrary values and compares them
perfectly and consistently.  The difficulties that led to IDNA
arise from three things:

(a) What now seems like a peculiarity: octets whose values are
consistent with letters in the ASCII range are compared
case-independently, but nothing else is.

(b) Within CLASS=IN, ASCII is basically assumed for octets in
the range 0x00 to 0x7F; for octets outside that range, the
character repertoire and encoding are unspecified and those
octets are not assumed to be character data at all.  For other
CLASSes, you could easily specify a CLASS-wide character set and
encoding and design that encoding not to trigger any DNS
idiosyncrasies.

(c) Many applications limit their identifiers and associated DNS
labels to ASCII-only.  IDNA was designed to avoid having to
upgrade all of those applications as a condition for deploying
IDNs.  Because introducing a new coding system, or lookups in a
different database or a different DNS CLASS, requires upgrading
all relevant applications anyway, you don't have that constraint.
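Seen at the wire level, none of this is exotic: a label is just
a length-prefixed run of octets, and the QCLASS is just a 16-bit
number.  A sketch (the private-use class number and the label
octets here are arbitrary choices for illustration):

    import struct

    def encode_name(labels: list[bytes]) -> bytes:
        # RFC 1035 wire format: each label is a length octet followed
        # by that many arbitrary octets, ending with a zero-length label.
        return b"".join(bytes([len(l)]) + l for l in labels) + b"\x00"

    qname = encode_name([b"\xc3\xa9xample", b"test"])  # non-ASCII octets are legal
    question = qname + struct.pack("!HH", 1, 0xFF00)   # QTYPE=A, private-use QCLASS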

> The UNISIGN table results from a non-confusability algorithm
> under
> financing/work.  It should be based upon rastering comparison
> in a UNISIGN fount of reference.

Good luck with that and do let us know, ideally in the
peer-reviewed literature, how it works out.  It may work if you
can select (and perhaps design) a single type family and
eliminate all others from your environment, ideally eliminating
all stylistic variations (like italics or bold for many
Latin-based type families) within that family as well.  Even
then, characters that are confusable in your reference font
might not be confusable in more common fonts, and vice versa.
It would, of course, work even better if
you restrict the scripts of interest.  Otherwise, for example,
the visual (and likely raster and even vector) similarity
between U+03BB and U+5165, a resemblance that I think we can be
quite confident is entirely coincidental and one where even a
few characters of context would likely be adequate to make a
distinction, would result in a single code point in your system.
I think I've mentioned this to you before, but anyone
contemplating automatic confusability detection based on raster
or vector properties of printed/written characters should
examine the pre-Kurzweil  (mid-to-late 1960s) literature in
optical character recognition as a starting point in
understanding how difficult it is to get from abstractions of
characters to character identification.
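For what it is worth, the kind of comparison being proposed is
easy to prototype and just as easy to mislead; a sketch (the
font path, rendering details, and any threshold are assumptions,
and whether a given reference font even contains both code
points is itself part of the problem):

    # Render two code points in one reference font and measure
    # pixel-level agreement.  Requires Pillow and a TrueType font.
    from PIL import Image, ImageDraw, ImageFont

    def glyph_bitmap(ch, font, size=64):
        img = Image.new("1", (size, size), 0)
        ImageDraw.Draw(img).text((0, 0), ch, fill=1, font=font)
        return list(img.getdata())

    def raster_similarity(a, b, font_path="DejaVuSans.ttf"):
        font = ImageFont.truetype(font_path, 56)
        pa, pb = glyph_bitmap(a, font), glyph_bitmap(b, font)
        return sum(x == y for x, y in zip(pa, pb)) / len(pa)

    # U+03BB vs U+5165: whatever single-font threshold merges these
    # will merge or separate other pairs differently in other fonts.
    print(raster_similarity("\u03bb", "\u5165"))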

More information about the Idna-update mailing list