FW: Your statement on Identifiers and Unicode 7.0.0

John C Klensin klensin at jck.com
Mon Feb 9 15:37:21 CET 2015

--On Sunday, February 08, 2015 23:46 +0100 JFC Morfin
<jefsey at jefsey.com> wrote:

>>> then even a wider selection of precomposed characters is
>>> insufficient.
>> This is something I do not understand.
> The best is to ask him.
> John, is it because you think this would result in too many
> Unigraph code points, or for another reason we do not
> see/understand?

Jefsey (and others),

The notion of a "no combining marks or other display guidance"
unified character set has been explored several times, most
notably (at least in my opinion) in the initial design for ISO
10646, a design that evolved and was then ultimately replaced by
a specification that is code point-identical with Unicode.  The
explosion in the size of the code space is immense, especially
since trying to do that probably requires avoiding the
"unification" of the CJK scripts and of the Arabic and
Perso-Arabic scripts.

If one also wants to make that hypothetical coding system
glyph-sensitive without invisible or non-spacing characters that
provide formatting clues and to eliminate the rather complex
(and sometimes language-sensitive, rather than script-sensitive)
rendering rules associated with several scripts in Unicode, the
code space gets even larger, almost certainly beyond the 32 bits
originally anticipated for 10646, much less the 16 planned for
Unicode 1.0 and the circa 21 bits of present-day Unicode.
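The code-space figures above can be checked with a few lines of
arithmetic (a sketch; the current Unicode ceiling of U+10FFFF,
i.e. 1,114,112 code points, is the only figure assumed beyond
those in the text):

```python
import math

def bits_needed(code_points: int) -> int:
    """Smallest number of bits that can address this many code points."""
    return math.ceil(math.log2(code_points))

# Unicode 1.0: a 16-bit design, 65,536 code points.
print(bits_needed(2**16))       # 16

# Present-day Unicode: U+0000..U+10FFFF = 1,114,112 code points,
# which needs 21 bits of code space.
print(bits_needed(0x10FFFF + 1))  # 21

# The original ISO 10646 design anticipated a 32-bit (4-octet) form.
print(bits_needed(2**32))       # 32
```

A 40- or 48-bit space, by the same arithmetic, would allow on the
order of 10^12 to 10^14 code points, which gives a sense of why
the estimate below is treated as implausible.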

I don't consider a code space that would require 40 or even 48
bits per character (my rough guess) to really be plausible.  In
addition to sheer size, the encoding principles that would be
needed to let people find characters and determine how they are
coded would become quite daunting.   YMMD.

Asmus and I discussed this in our exchange circa two weeks ago;
I would encourage you to go back and review those notes if you
are interested in pursuing this issue.


More information about the Idna-update mailing list