Unicode 7.0.0, (combining) Hamza Above, and normalization

John C Klensin klensin at jck.com
Fri Aug 8 14:10:26 CEST 2014

--On Friday, August 08, 2014 07:06 -0400 Andrew Sullivan
<ajs at anvilwalrusden.com> wrote:

>> I think it's dangerous to assume that fixing this lessens any
>> risk of any attacks. 
> In my opinion, this conversation would go better if we each
> attended to making the most modest claims possible.  I don't
> think anyone is arguing that addressing this particular issue
> is going to solve all problems. 

>> It was mentioned in another mail that if Unicode
>> had picked a different name this may not have even been
>> noticed.
> Yes; and frankly, that is why we are having a discussion about
> the topic.  We developed IDNA2008 with a particular
> understanding of the consequences of the normalization and
> stability rules.  It would appear that at least some of us had
> the wrong understanding, and the implications of the actual
> rules are different to what we'd believed. That raises the
> question of whether the fundamental cross-versioning
> assumption was right.  In other words, with this new bit of
> information, it might be that the entire "inclusion" approach
> is riskier than previously thought, and that we need to
> recalibrate our risk understanding (and then decide whether
> the risk is worth the reward).

Andrew, in the same spirit as your comment about modest claims,
let me suggest a narrower version of that last sentence.  This
is not a proposal -- I don't think we are nearly there yet --
but an observation that is motivated both by trying to be a
little more narrow (or modest) and by not knowing what
abandoning the inclusion approach would even mean.  However, not
only did some of us have a particular understanding of what
normalization meant, but that understanding was, I believe,
continuous from early in the pre-IDNA2003 discussions through
IDNA2008 and into PRECIS and elsewhere.  Stated simplistically,
that understanding has been that normalization would deal
effectively with the issue of equality comparisons between
"characters" within the same script that had the same
appearance.  As Vint and others have suggested, that was very
much an equality definition based on "same appearance" or "same
character form", not, e.g., either vague ideas of visual
similarity (or confusion) or linguistic or phonetic criteria of
various sorts.  Many people in the IDNA community wished that
those criteria would work across scripts, e.g., that
identical-looking (and even commonly-derived) Greek, Latin, and
Cyrillic characters be normalized together, but most of us
accepted a very long time ago that this was implausible (and
accepted the reasons why it was implausible).

From that perspective, the difference between NFKC (used as an
important basis for table formation in IDNA2003) and NFC (used
as a screening step for IDNA2008) was simply one of what counted
as equality within a script, but with the same assumption about
what "equality" was about.
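
A minimal sketch of that difference, using Python's standard
unicodedata module.  The sample strings (an accented "e" and the
"fi" ligature) are illustrative choices, not cases taken from the
IDNA tables, and the helper name same_under is hypothetical.

```python
import unicodedata

def same_under(form, a, b):
    """Compare two strings after applying the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Canonical equivalence: precomposed e-acute vs. "e" + COMBINING ACUTE ACCENT.
composed = "\u00e9"
decomposed = "e\u0301"

# Compatibility equivalence only: LATIN SMALL LIGATURE FI vs. the letters "fi".
ligature = "\ufb01"
plain = "fi"

print(same_under("NFC", composed, decomposed))   # True
print(same_under("NFKC", composed, decomposed))  # True
print(same_under("NFC", ligature, plain))        # False
print(same_under("NFKC", ligature, plain))       # True
```

Both forms use the same comparison; NFKC simply folds a larger set
of strings together than NFC does.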

If we now find that intra-script normalization is insufficient
to give us a consistent identity comparison among the different
ways a character (shape) could be formed within the same script,
then it seems to me that it is not inclusion that is at risk but
simply that assumption of normalization sufficiency.  While I
gather that the idea of a specialized normalization form would
remind some people of very early discussions (even
disagreements) within the Unicode Consortium process, we might
have to contemplate an IETF-specific, or IDN-specific,
normalization form, one built on the strict visual form model
that we understood rather than incorporating per-character
language, linguistic, or phonetic or other usage considerations
for some cases.  
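
The kind of gap described above can be sketched in Python.  This
assumes the case named in the subject line is U+08A1 (ARABIC LETTER
BEH WITH HAMZA ABOVE, added in Unicode 7.0.0); the code points and
the helper name nfc_equal are chosen for illustration and do not
appear in the message itself.  U+0623 is included as a contrast
because it does have a canonical decomposition.

```python
import unicodedata

def nfc_equal(a, b):
    """True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# ALEF WITH HAMZA ABOVE: the precomposed form canonically decomposes to
# ALEF + COMBINING HAMZA ABOVE, so NFC treats the two spellings as equal.
alef_precomposed = "\u0623"
alef_sequence = "\u0627\u0654"

# BEH WITH HAMZA ABOVE (assumed to be the case under discussion): the
# precomposed code point has no decomposition mapping, so NFC leaves both
# spellings unchanged and they compare as different.
beh_precomposed = "\u08a1"
beh_sequence = "\u0628\u0654"

print(nfc_equal(alef_precomposed, alef_sequence))  # True
print(nfc_equal(beh_precomposed, beh_sequence))    # False
```

An IDN-specific normalization form of the sort contemplated here
would have to add rules for pairs like the second one, which is
what leads toward character-by-character evaluation.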

A decision to move in the direction of a different,
non-Unicode-standard, normalization form would probably take us
down the path toward character-by-character evaluations that
many of us have dreaded (again, since early in the pre-IDNA2003
discussions).  But that brings us back to your observation about
recalibrating risk understanding and deciding whether the risk
-- or the mechanisms needed to mitigate it -- are worth the
effort and reward.  But I've seen no evidence, or even strong
hints, that the issues this case has turned up bring the
inclusion model, or even the existing IDNA2008 rule and category
sets, into doubt, only the reliance on NFC to do a job that,
for some cases, it doesn't actually do and wasn't
intended to do.


