Unicode 7.0.0, (combining) Hamza Above, and normalization

Fri Aug 8 14:49:41 CEST 2014

John,

I think this is an important insight and it may indeed be the case that
normalization for Domain Name purposes and normalization for other purposes
are not as aligned as we supposed. Most users of Unicode-encoded scripts are
unaware of most, if not all, of the mechanisms associated with Unicode
and will likely be guided more by the principle of least astonishment than
anything else. I wonder whether a domain-name-specific normalization would
improve the likelihood of achieving the aim of that principle?

v

On Fri, Aug 8, 2014 at 8:10 AM, John C Klensin <klensin at jck.com> wrote:

>
>
> --On Friday, August 08, 2014 07:06 -0400 Andrew Sullivan
> <ajs at anvilwalrusden.com> wrote:
>
> >> I think it's dangerous to assume that fixing this lessens any
> >> risk of any attacks.
> >
> > In my opinion, this conversation would go better if we each
> > attended to making the most modest claims possible.  I don't
> > think anyone is arguing that addressing this particular issue
> > is going to solve all problems.
> >...
>
> >> It was mentioned in another mail that if Unicode
> >> had picked a different name this may not have even been
> >> noticed.
> >
> > Yes; and frankly, that is why we are having a discussion about
> > the topic.  We developed IDNA2008 with a particular
> > understanding of the consequences of the normalization and
> > stability rules.  It would appear that at least some of us had
> > the wrong understanding, and the implications of the actual
> > rules are different to what we'd believed. That raises the
> > question of whether the fundamental cross-versioning
> > assumption was right.  In other words, with this new bit of
> > information, it might be that the entire "inclusion" approach
> > is riskier than previously thought, and that we need to
> > recalibrate our risk understanding (and then decide whether
> > the risk is worth the reward).
>
> Andrew, in the same spirit as your comment about modest claims,
> let me suggest a narrower version of that last sentence.  This
> is not a proposal -- I don't think we are nearly there yet --
> but an observation that is motivated both by trying to be a
> little more narrow (or modest) and by not knowing what
> abandoning the inclusion approach would even mean.  However, not
> only did some of us have a particular understanding of what
> normalization meant, but that understanding was, I believe,
> continuous from early in the pre-IDNA2003 discussions through
> IDNA2008 and into PRECIS and elsewhere.  Stated simplistically,
> that understanding has been that normalization would deal
> effectively with the issue of equality comparisons between
> "characters" within the same script that had the same
> appearance.  As Vint and others have suggested, that was very
> much an equality definition based on "same appearance" or "same
> character form", not, e.g., either vague ideas of visual
> similarity (or confusion) or linguistic or phonetic criteria of
> various sorts.  Many people in the IDNA community wished that
> those criteria would work across scripts, e.g., that
> identical-looking (and even commonly-derived) Greek, Latin, or
> Cyrillic be normalized together, but most of us accepted that as
> being implausible (and the reasons it was implausible) a very
> long time ago.
>
> From that perspective, the difference between NFKC (used as an
> important basis for table formation in IDNA2003) and NFC (used
> as a screening step for IDNA2008) was simply one of what counted
> as equality within a script, but with the same assumption about
> what "equality" was about.
>
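
To make the NFC/NFKC distinction above concrete, here is a minimal
sketch using Python's standard unicodedata module. The helper name
equal_under and the sample characters are my own choices for
illustration, not anything taken from the IDNA specifications.

```python
import unicodedata


def equal_under(form, a, b):
    """Return True if a and b are identical after normalizing both to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Canonical equivalence: precomposed e-acute vs. "e" + combining acute accent.
composed, decomposed = "\u00e9", "e\u0301"

# Compatibility equivalence: the "fi" ligature vs. the two letters "f", "i".
ligature, letters = "\ufb01", "fi"

for form in ("NFC", "NFKC"):
    print(form,
          equal_under(form, composed, decomposed),
          equal_under(form, ligature, letters))
```

Both forms treat the two spellings of e-acute as equal; only NFKC
also folds the ligature into its letters. That is the sense in which
the two forms differ in what counts as "equal", not in what equality
is supposed to be about.
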
> If we now find that intra-script normalization is insufficient
> to give us a consistent identity comparison among the different
> ways a character (shape) could be formed within the same script,
> then it seems to me that it is not inclusion that is at risk but
> simply that assumption of normalization sufficiency.  While I
> gather that the idea of a specialized normalization form would
> remind some people of very early discussions (even
> disagreements) within the Unicode Consortium process, we might
> have to contemplate an IETF-specific, or IDN-specific,
> normalization form, one built on the strict visual form model
> that we understood rather than incorporating per-character
> language, linguistic, or phonetic or other usage considerations
> for some cases.
>
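
A small sketch of the kind of case at issue, again with Python's
standard unicodedata module. I am assuming the code point behind the
subject line is U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE, which was
added in Unicode 7.0.0 with no canonical decomposition; the thread
itself names only (combining) Hamza Above. The helper nfc_equal and
the constant names are illustrative only.

```python
import unicodedata


def nfc_equal(a, b):
    """Return True if a and b are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# U+0623 ALEF WITH HAMZA ABOVE has a canonical decomposition to
# U+0627 ALEF + U+0654 COMBINING HAMZA ABOVE, so NFC unifies the two.
ALEF_HAMZA = "\u0623"
ALEF_PLUS_HAMZA = "\u0627\u0654"

# U+08A1 BEH WITH HAMZA ABOVE (assumed to be the case under discussion)
# has no decomposition, so NFC keeps it distinct from
# U+0628 BEH + U+0654 COMBINING HAMZA ABOVE.
BEH_HAMZA = "\u08a1"
BEH_PLUS_HAMZA = "\u0628\u0654"

print(nfc_equal(ALEF_HAMZA, ALEF_PLUS_HAMZA))  # True
print(nfc_equal(BEH_HAMZA, BEH_PLUS_HAMZA))    # False
```

Two ways of writing what a reader would expect to be the same
character, within one script, stay distinct after NFC in the second
pair. That is the gap a strictly visual-form normalization would have
to close.
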
> A decision to move in the direction of a different,
> non-Unicode-standard, normalization form would probably take us
> down the path toward character-by-character evaluations that
> many of us have dreaded (again, since early in the pre-IDNA2003
> discussions).  But that brings us back to your observation about
> recalibrating risk understanding and deciding whether the risk
> -- or the mechanisms needed to mitigate it -- are worth the
> effort and reward.  But I've seen no evidence, or even strong
> hints, that the issues this case has turned up bring the
> inclusion model, or even the existing IDNA2008 rule and category
> sets, into doubt, only the reliance on NFC to do a job that,
> for some cases, it apparently doesn't actually do and wasn't
> intended to do.
>
> best,
>     john