Unicode & IETF
John C Klensin
klensin at jck.com
Tue Aug 12 12:17:02 CEST 2014
I can't speak for others, but I found this posting immensely
helpful, especially in combination with a couple of off-list
conversations. I'm still probably confused (at least one reason
is identified inline below), but better able to pin down the
sources. As Patrik has said several times, the important issue
for us in all of this isn't how IDNA treats U+08A1 but rather
how we explain the situation and either what action we take or
what advice we give more generally. Part of that problem, in
turn, is whether we can continue with a system based on rules
and derived properties, rather than one of normative tables --
i.e., whether we can point to and/or think about what the
principle is rather than looking at the current version of a
table and saying "whatever is there is there, it don't matter
why it got there, but the table is normative, and the principle
is just an explanation". I don't want or intend
to be as binary as that sounds -- if we can have rules generally
but there are a few places where we can't, I imagine we can
figure out a way to deal with it. But it is not a wonderful
prospect.
Those off-list conversations have helped me understand that my
trying to find a vocabulary to talk about these things with
relative precision has been a failure and has probably added to
the confusion, so I'm going to try to get away from that.
--On Monday, August 11, 2014 22:48 +0000 "Whistler, Ken"
<ken.whistler at sap.com> wrote:
>> The other irony about this is that, if you want consistent and
>> easily predictable behavior, you should be asking for exactly
>> what we thought we were promised -- no new precomposed
>> characters under normal circumstances unless there were
> And the *other* other irony about this is that that is exactly
> what you have gotten!
> The relevant data file to watch for this is:
> which carries all the information about the "funny cases" for
Ok, but see above (and several notes by others) about the
contrast between "normative rules that generate tables" (the
IDNA2008 model, among other things) and "normative tables".
That data file and references to it seem to require that we go
back to normative tables. Is that understanding correct?
> The last time an exception of the type you are talking about --
> a new precomposed character being added, for which a
> canonical combining decomposition had to be added
> simultaneously to "undo" the encoding as a single code point,
> so that the NFC form was decomposed, rather than composed, was
> U+2ADC FORKING. That went into Unicode 3.2, in March, *2002*,
> 12+ years ago.
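Ken's U+2ADC example is easy to check directly; a quick illustration using Python's standard unicodedata module (not part of either standard's text, just a demonstration):

```python
import unicodedata

# U+2ADC FORKING carries the canonical decomposition <U+2ADD, U+0338>
# and is composition-excluded, so even NFC leaves it decomposed.
forking = "\u2ADC"
nfc = unicodedata.normalize("NFC", forking)
print([hex(ord(c)) for c in nfc])  # ['0x2add', '0x338']
```

That is, the "undo" Ken describes is visible in the normalization data: the single code point never survives normalization.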
Part of the source of my confusion is that, at least through
6.2/6.3, Section 2.2 ("Design Principles"), especially the
subsections on Unification and Dynamic Composition, appear to
still say that language differences are not considered and that
assembling character forms based on appearance is an accepted
and common method even when the combinations are not explicitly
recognized by the Standard (or, presumably, the precomposed
character table). In addition, Sections 3 and 5 of the June
version of UAX 15 appear to specify that such characters (back
to that in a moment) will be decomposed. If none of that has
actually applied for the last dozen years or (I gather more
likely) there are nuances that those sections don't capture, why
have we gone a dozen years without fixes to the text... even if
those fixes just say "there are other circumstances that may
override these seemingly-clear rules"?
> The claim in the Unicode Standard for the cases like U+08A1
> beh-with-hamza and U+A794 c-with-palatal-hook, is that these
> are *ATOMIC* characters, and not actually precomposed
> characters. Therefore they do not fall afoul of the stability
> issue for the formal definition of Unicode normalization. They
> are encoded because they are *not* equivalent to what might
> naturally be taken as a pre-existing combining sequence.
> This is an old, old, old, old issue for the Unicode Standard.
> The line has to be drawn somewhere, and it is not at all
> self-evident, contrary to what some people seem to think.
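For what it's worth, the atomic status Ken describes is directly observable: no normalization form relates U+08A1 to the BEH + HAMZA ABOVE combining sequence. A small Python check (using the stock unicodedata module; it needs a Python build whose Unicode data is 7.0 or later, so that U+08A1 is assigned):

```python
import unicodedata

atomic = "\u08A1"          # ARABIC LETTER BEH WITH HAMZA ABOVE (atomic)
sequence = "\u0628\u0654"  # BEH followed by combining HAMZA ABOVE

# U+08A1 has no canonical decomposition, so no normalization form
# maps the atomic character and the sequence onto each other.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, atomic) != \
           unicodedata.normalize(form, sequence)
```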
I am not one of those who believe it is self-evident, nor have I
believed that for a very long time. I knew about some of the
other cases you and Asmus have cited and, especially when they
are added to the "consistency with other standards" principle,
the issue is clear as is the observation that edge and/or
complex cases are inevitable.
> Cases like adding acute accents to letters clearly fall on one
> side of the line.
> Cases like making new letters by erasing parts of existing
> letters or turning pieces around clearly fall on the other
> side of the line. (Meditate, perhaps, on U+1D09 LATIN SMALL
> LETTER TURNED I.)
> Cases which make new letters by adding ostensibly systematic
> diacritics which then fuse unpredictably into the forms of the
> letters are the middle ground, where the UTC had to place
> a stake in the ground and decide whether to treat them all
> as atomic units, despite their historic derivative complexity,
> or to try to deal with them as ligated display units based on
> encoded sequences.
Again, that explanation is very helpful.
> And just to make things more complicated, the encoding for
> Arabic was/is systematically distinct from that for Latin.
> For Latin, the *default* is that if the diacritic is visually
> separated from the base letter, then a precomposition is
> presumed. However, for Arabic, because of the early history
> of how Arabic was encoded on computers, the line is drawn
> differently: any skeleton + ijam (diacritic) combination is
> encoded atomically, and is not treated as a sequence. The
> known exceptions to that principle are well documented and
> historically motivated.
> This is noticeably messy, of course, both because writing
> systems are messy, and because the history of computer
> character encoding itself is messy and full of hacks. However,
> two points should stand out for the takeaway here:
> 1. It is not possible to have a completely internally
> consistent diacritic encoding story for Latin (and scripts
> like it). Quit trying to have that pony!
> 2. It is not possible to have a completely consistent model
> for how diacritics are handled between Latin and Arabic.
> Quit trying to have that pony, too!
Ok. Now I have two issues with which I would appreciate your
advice and help. (I'm going to try to avoid words like
"appearance" or "precombined" in any normative-sounding way.)
(i) I can write into a document that the expectation for the
effectiveness of normalization in bringing together forms that
might be construed as the same character, particularly forms
that a reasonable person might assume fall under the "Dynamic
Composition" statement "Because the process of character
composition is open-ended, new forms with modifying marks may be
created by a combination of base characters followed by
combining characters" is simply different for European scripts
(where the expectation is strong) and, e.g., Arabic and
Devanagari (where it is much weaker). Is that what I/we should
say? To be clear, I'd rather not write that text at all -- I'd
rather you and your colleagues write it, put it in that section
of the Standard, and let us reference it. However, if the
latter has not happened for a dozen years or more, we are going
to need to say something unless the Unicode Standard text will
be updated Really Soon Now. Please suggest text.
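To make the asymmetry in (i) concrete, here is a small Python sketch (illustrative only, using the standard unicodedata module) contrasting the Latin case, where dynamic composition round-trips through NFC, with the Arabic case, where it does not:

```python
import unicodedata

# Latin: a base plus combining acute composes under NFC onto the
# precomposed character, so the two spellings compare equal.
assert unicodedata.normalize("NFC", "a\u0301") == "\u00E1"  # á

# Arabic: the visually analogous BEH + HAMZA ABOVE sequence does not
# compose onto the atomic U+08A1, so the spellings stay distinct.
assert unicodedata.normalize("NFC", "\u0628\u0654") == "\u0628\u0654"
assert unicodedata.normalize("NFC", "\u0628\u0654") != "\u08A1"
```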
(ii) Independent of issues about similarity or identity of
representative glyphs, perhaps a better way to explain what I/we
are trying to get at, especially in the context of my exchange
with Shawn, involves a scenario at the other end of the process,
i.e., what people type and what they expect to have happen. Let
me pose it as a question. There are clearly speakers of Fula
who prefer (or would prefer) to write it in Arabic script. What
have they been doing for the last half-dozen years (and will do
between now and whatever time it takes for Unicode 7.0.0 and
corresponding display mechanisms to propagate sufficiently to be
relied upon)? One possibility is waiting, probably impatiently,
for you to encode this atomic character because the language
cannot reasonably be written without it, presumably sticking to
Latin script in the interim. The other, possibly guided by that
"Dynamic Composition" text, is making up (or, if you prefer,
simulating or approximating) the character from the combining
sequence.
The latter is a non-issue for any of the atomic forms that
include Hamza Above because they have existed with code points
for the atomic forms for a long time and are presumably just
used that way. If there is a standard Fula-in-Arabic keyboard,
it presumably has those atomic forms on it. But BEH WITH HAMZA
ABOVE (I gather phonetically implosive /b/) is a little more
complicated: either the character is not on the keyboard or,
prior to Unicode 7.0.0, it generates something that involves
more than one code point. For purposes of this discussion, I
note that sequence need not be U+0628 U+0654; it could be any
other convention that people accepted and got used to.
If people haven't been waiting but instead have used some other
convention, and Fula text in Arabic is processed in ways that
would make comparisons relevant (not, e.g., just printed), then
it seems to me that they either have to have a flag day in which
old keyboards and typing conventions disappear and old encoded
texts are upgraded or there has to be a mapping procedure or
comparison rule that joins the old convention with the new,
atomic, U+08A1 code point. If there is such a mapping
procedure, do we need to incorporate it into IDNA (possibly
violating the duality of U-labels and A-labels, which would be a
rather major step)? And what would you call it given that
"normalization", "canonicalization", etc., seem to be used up?
Am I missing something or is the above logic reasonable? Have
people really been waiting?
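For concreteness, the kind of mapping procedure I have in mind could be sketched roughly as below. To be clear, this is purely hypothetical: the LEGACY_MAP table and fold_legacy function are invented for illustration and are not part of IDNA2008, Unicode, or any deployed convention.

```python
# Hypothetical pre-comparison folding step. The table would have to be
# defined (and maintained) by somebody; one entry is shown, but any
# accepted legacy typing convention could appear here.
LEGACY_MAP = {
    "\u0628\u0654": "\u08A1",  # old BEH + HAMZA ABOVE sequence -> atomic
}

def fold_legacy(label: str) -> str:
    """Fold a legacy typing convention onto the new atomic code
    point before labels are compared."""
    for old, new in LEGACY_MAP.items():
        label = label.replace(old, new)
    return label
```

The question in the text stands: if such a step existed, applying it on the registration or lookup path would break the exact U-label/A-label duality, since two distinct U-labels would map to one A-label.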
> What we have in Unicode instead is a perfectly serviceable
> donkey that can move your cart to market.
For IDNA purposes, the donkey seems to have either gotten
stubborn on the road or gone lame. I continue to hope we can
work together to come up with a solution or explanation that
does not require either discarding the IDNA cart or taking some
equally drastic step.