Unicode & IETF

John C Klensin klensin at jck.com
Tue Aug 12 00:02:30 CEST 2014


I still owe you a response to an earlier note.  I will try to
get back to that soon but, in the interim, let me add a bit to
Patrik's response below.

Completely consistent behavior would be wonderful.  Beyond
joining Patrik in saying "I'd like a pony too", I would note
that there was an idea in the programming language standards
committee many years ago that the right way to do a universal
character set would involve one code point per character, no
combining sequences, and, ideally, no script divisions.  The
questions of whether "o" (U+006F (Latin)), "o" (U+03BF (Greek)),
and "o" (U+043E (Cyrillic)) were the same, or even confusingly
similar, would have vanished because they would be assigned the
same code point.  The question of the relationship between
U+00F6 ("ö") and U+006F U+0308 would be a non-issue because the
"combining character" would not be coded and only the former
would exist.  This "HAMZA ABOVE" issue wouldn't exist for the
same reason -- we'd have a character equivalent to what is now
U+08A1 or we'd have nothing.  What made this idea popular with
the programming language community was precisely the issue you
are trying to address -- one character, one form, no problems
with string comparison between different ways to form "the
same" character.
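To make the contrast concrete, here is a minimal Python sketch (my illustration, not part of the original discussion) showing that the three visually similar "o" letters really are distinct code points in Unicode as it exists, which is exactly what the one-code-point-per-character idea would have avoided:

```python
import unicodedata

# The three "o" letters mentioned above are separate code points.
latin_o = "\u006F"     # LATIN SMALL LETTER O
greek_o = "\u03BF"     # GREEK SMALL LETTER OMICRON
cyrillic_o = "\u043E"  # CYRILLIC SMALL LETTER O

print(latin_o == greek_o)      # False
print(latin_o == cyrillic_o)   # False

# Normalization does not unify them either: they are distinct
# characters by design, not alternative encodings of one character.
print(unicodedata.normalize("NFC", greek_o) == latin_o)  # False
```

Under the one-code-point-per-character ideal, all three comparisons above would have been trivially True because there would have been only one "o".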

The idea was naïve and, for multiple reasons, unworkable (even
though one of the main reasons given early on turned out to be
irrelevant).   The folks who have been active with Unicode for
the last 20 years or so can explain the reasons why to you much
better than I can.    Instead of that naïve and unworkable idea
(no matter how attractive it was as a programming language and
programmer's dream), we got Unicode.  Unicode, and its code
point assignment rules, represent a large series of tradeoffs
and an attempt to reach a balance among criteria.  Reading
Section 2.2 of the standard and comparing it to this discussion
is really illuminating.  Those differences in criteria and
tradeoffs result in inconsistencies and oddities in behavior,
especially when coding and repertoire were inherited from
pre-existing national standards that were, themselves, not all
developed using the same criteria and that may reflect
constraints that are irrelevant to Unicode.  The result is a
little messy.  As Asmus points out, that is partially because
"text" is messy.  But, unlike the above naïve idea (and the
several ponies it would have delivered), it "works".  

The difficulty is that all of those different constraints and
criteria inevitably create rough edges in which different
criteria would produce different results and different coding
decisions.  That isn't "someone was wrong" or "someone wasn't
expert enough", it is just a difference in requirements that
require adjustments.  Those adjustments are guaranteed to make
the programmer's life harder, whether "harder" means the need
to normalize before comparing, the need to develop and run more
complicated keyboard mapping routines than might be needed in a
more perfect world, or something else.
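The "normalize before comparing" burden can be shown in a few lines of Python (a sketch using the standard unicodedata module):

```python
import unicodedata

# Precomposed form and combining sequence render identically
# but are different code point sequences.
precomposed = "\u00F6"   # LATIN SMALL LETTER O WITH DIAERESIS
combining = "o\u0308"    # o + COMBINING DIAERESIS

print(precomposed == combining)  # False: raw comparison fails

# Canonical normalization (NFC here; NFD works too) is the extra
# step the programmer must remember before every comparison.
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
```

Forget that extra step anywhere in a code path, and two strings the user sees as identical silently compare unequal.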

It makes a big difference in that respect whether you (or I or
Patrik) need to get it right once or to keep juggling.  IDNA2008
is based on two hypotheses: that, looking backwards, we
understood where the peculiarities are and have dealt
effectively with them; and that things will be stable (by
criteria consistent with IDNA, which include mapping consistency
for characters that are graphically the same, even within a
script, if they are linguistically different).   This particular
example,
particularly with the addition of lists from Roozbeh and others
about presumably-similar or identical cases we didn't know
about, is calling both hypotheses into question.   That is
pretty scary regardless of what else is going on.

Even within the IETF's scope, we have a problem.  As several
people on this list know, I've been pushing to have the rules
that PRECIS and its various dependent protocols use either be
identical to the IDNA ones or come with sufficient explanations
of the differences so that implementers and users can understand
why the differences are worth the trouble.   That consistency is
completely in line with what you are asking for, but everyone
has a reason why they or their protocol should be an exception.

The other irony about this is that, if you want consistent and
easily predictable behavior, you should be asking for exactly
what we thought we were promised -- no new precomposed
characters under normal circumstances unless there were
characters missing from the combining sequences that could
otherwise be used to form them and, if any exception was made to
add one anyway, it should decompose back to the relevant
combining sequence.  The subtle issues about linguistic and
phonetic forms that may be very important for other uses of
Unicode should have as little relevance to your needs as they
do to ours.
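The broken promise is directly observable in Python (a sketch assuming a Python build whose unicodedata tables cover Unicode 7.0 or later, where U+08A1 exists):

```python
import unicodedata

# The precomposed o-with-diaeresis keeps the promise: it
# canonically decomposes back to its combining sequence.
print(unicodedata.decomposition("\u00F6"))  # '006F 0308'

# U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE does not: it has no
# canonical decomposition, so it never normalizes to the
# otherwise-equivalent sequence BEH + HAMZA ABOVE.
beh_hamza = "\u08A1"
seq = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE
print(unicodedata.decomposition(beh_hamza))              # '' (none)
print(unicodedata.normalize("NFD", beh_hamza) == seq)    # False
```

So normalization, the one tool the programmer has, cannot make these two spellings compare equal; the distinction survives every normalization form.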


--On Monday, August 11, 2014 22:06 +0200 Patrik Fältström
<paf at frobbit.se> wrote:

> On 11 aug 2014, at 21:52, Shawn Steele
> <Shawn.Steele at microsoft.com> wrote:
>>> That the two standards organizations do reach different
>>> results when applying whatever algorithms they use when
>>> calculating what the best solution is to reach whatever goal
>>> is to be reached is for me completely understood.
>> As an implementer that's really problematic. I'd like Unicode
>> to behave consistently.  If there's an environment/context
>> that Unicode 'doesn't work for', then I'd like a
>> well-designed set of rules that considers both sides and
>> figures out how to make those rules (which is what I think
>> the IETF is attempting).  Different sets of rules for every
>> application or contexts quickly become unwieldy.
>> It doesn't help users if one application has characters that
>> are typed a certain way for certain languages or whatever,
>> and then another application says "I don't understand the
>> word you typed, spell it differently".  I can't spell it
>> differently; my keyboard only lets me type it one way.
> Completely understood. Given that I am also an implementer of
> "these kinds of things", I completely agree with you that it
> is an interesting goal.
> In IETF context we joke and say "I also want a pony" in these
> kind of situations.
> Already in the world of IDNA2003 with stringprep it was
> recognized that different applications need different
> stringprep profiles.
> That is also the situation for IDNA2008.
> IDNA2008 is for domain names, and possibly only host names
> (but let's skip that discussion). It leaves (compared with
> IDNA2003) case folding and various transformations out of the
> standard itself just because various keyboards and input
> mechanisms should be given the ability to transform input to
> whatever is to be used as a domain name in the best possible
> way for the context the transformation is happening in. That
> cannot be dictated, but instead should be left for innovation.
> Because of this, IETF has for example the PRECIS working group
> that is working on other applications than domain names. Where
> different rules will be used. Sure, a majority is most
> certainly the same, and most certainly based on basic Unicode
> constructs.
> But there will be differences, as it seems.
>    Patrik
