Visually confusable characters (was: Re: Unicode 7.0.0, (combining) Hamza Above, and normalization for comparison)
jefsey at jefsey.com
Fri Aug 8 20:32:34 CEST 2014
>At 22:29 07/08/2014, John C Klensin wrote:
>And I suspect this list has now had enough of this discussion.
>At 19:37 07/08/2014, John C Klensin wrote:
>Personally, I don't believe an objective standard and categorization
>is possible unless one constrains the problem to the point of making
>it uninteresting (e.g., by believing that the world can, in
>practice, be forced into a single universal type style of type family).
What is uninteresting and cumbersome, and what I have certainly had
enough of discussing, are exceptions, whatever they may be. Unicode is
a typographic encoding we chose to use for standard end-to-end
transmissions and related services (with some restrictions). Either
it comes with a built-in common protocol to support what we consider
exceptions, or we do not accept them.
>At 01:07 08/08/2014, Andrew Sullivan wrote:
>Sorry for the iPhoney reply. I wasn't trying to say the result is
>wrong as such. I think it _may_ be wrong for IDNA and therefore
>possibly an indication that our approach in IDNA2008 (and therefore
>alas in precis) is inadequate.
Certainly it is. But this is the best compromise we found, given the
way Unicode is.
>At 00:56 08/08/2014, Whistler, Ken wrote:
>All of this discussion seems to be boiling down to IETF
>second-guessing of Unicode character encoding decisions and
>complaints about Unicode normalization not satisfying expectations
>based on rather simplistic notions of which things that look the
>same should *be* the same.
"things that look the same should *be* the same" is very confusing, as
you do not specify to whom they look the same. Unicode deals with
people; we deal with interconnected people and the systems that
interconnect them. This makes three different strata, in which there
are different cognition/processing layers.
>At 01:23 08/08/2014, John C Klensin wrote:
>But I'm pretty sure that assertions that this is a different
>character despite the same name as the combining sequence and, as
>far as we can tell, an identical appearance, do not help move us forward.
I appreciate that you say in detail what I say in principle in that
mail. Your conclusion confirms that we are in the same case as the
French majuscules (once they have been lowercased by IDNA2008, which
loses that metadata). As long as Unicode does not support a
metadata protocol for exception encoding ...
>At 02:27 08/08/2014, Andrew Sullivan wrote:
>I don't think that's a fair characterization. Nobody is
>"second-guessing" anything. It's rather that we -- John, actually
>-- discovered that there's a consequence of this case that we did
>not previously understand, and it has uncomfortable consequences for
>the way we had previously relied on Unicode, because it didn't work
>the way we thought.
Maybe it is time to reconsider the idea of an IETF Unicode that
includes our exception management through an additional protocol,
rather than only through Patrik's tables?
>Presumably, implementers have a greater reason to become familiar
>with the picky exceptional cases.
This is not possible unless picky exceptional cases (which include
French majuscules :-) and many others) are supported by a
standardized protocol. The question is simply where this protocol is
to be located (at Unicode, at RFC 5895, at Layer six).
>This has, note, not just implications for IDNA2008. We have a whole
>working group (PRECIS) that is busy attempting to use the same
>strategy in a generalized way for other protocols. It hasn't
>shipped yet, but it's gone to the IESG. So we can't just shrug our shoulders.
> > There are likely many similar-looking things that fit in a
> > similar bucket and have escaped notice.
>All the more reason to concern ourselves with it, no?
>At 14:10 08/08/2014, John C Klensin wrote:
>Stated simplistically, that understanding has been that
>normalization would deal effectively with the issue of equality
>comparisons between "characters" within the same script that had the
Your further comment, as well as my suggestion, is that we may have to
refine what is then meant by "comparison" and "same appearance"
("visually" is not precise enough).
>If we now find that intra-script normalization is insufficient to
>give us a consistent identity comparison among the different ways a
>character (shape) could be formed within the same script, then it
>seems to me that it is not inclusion that is at risk but simply that
>assumption of normalization sufficiency.
>While I gather that the idea of a specialized normalization form
>would remind some people of very early discussions (even
>disagreements) within the Unicode Consortium process, we might have
>to contemplate an IETF-specific, or IDN-specific, normalization
>form, one built on the strict visual form model that we understood
>rather than incorporating per-character language, linguistic, or
>phonetic or other usage considerations for some cases.
>A decision to move in the direction of a different,
>non-Unicode-standard, normalization form would probably take us down
>the path toward character-by-character evaluations that
>many of us have dreaded (again, since early in the pre-IDNA2003 discussions).
This would be too black-and-white a decision. I would rather add a
discrimination algorithm to IDNA than quit Unicode.
>But that brings us back to your observation about recalibrating risk
>understanding and deciding whether the risk --or the mechanisms
>needed to mitigate it -- are worth the
>effort and reward. But I've seen no evidence, or even strong
>hints, that the issues this case have turned up brings the inclusion
>model, or even the existing IDNA2008 rule and category sets, into
>doubt, only the reliance on NFC to do a job that it appears that,
>for some cases, it doesn't actually do and wasn't intended to do.
This calls for a market study (cf. RFC 6852): what is the market
demand for the 10-year-old http://unisign.org? Those interested are welcome.
>At 14:49 08/08/2014, Vint Cerf wrote:
>I think this is an important insight and it may indeed be the case
>that normalization for Domain Name purposes and normalization for
>other purposes are not as aligned as we supposed. Most users of the
>Unicoded scripts are unaware of most or any of the various
>mechanisms associated with Unicode and will likely be guided more by
>the principle of least astonishment than anything else. I wonder
>whether a domain-name-specific normalization would improve the
>likelihood of achieving the aim of that principle?
The issue is the principle of least astonishment for the human reader
and non-confusability for the computer reader. The response is "sign +
equivalence table" based on linguistic use tags. So far the best
system I have found (cf. my initial exchange with John) is a raster
geometric grid based upon a legally accepted uniform script. The
question is not the diversity of fonts but the sign/symbol code
memorized by the computer. This standardization should be made
compatible (for many uses) with the various ISO TCs. This Wikipedia
page can be used as a starting point:
http://en.wikipedia.org/wiki/List_of_symbols
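
As a rough illustration of what "sign + equivalence table" could mean
for the computer reader, here is a small Python sketch. Everything in
it is hypothetical: the table entries, the sign identifiers and the
function names are invented for the example and come from no Unicode,
IDNA or ISO table. It also leaves out the linguistic use tags and the
raster grid entirely.

```python
# Hypothetical equivalence table: each code point sequence maps to the
# identifier of the sign it draws. All identifiers are invented.
SIGN_TABLE = {
    "\u08A1": "SIGN_BEH_HAMZA",        # precomposed letter (assumed example)
    "\u0628\u0654": "SIGN_BEH_HAMZA",  # base letter + combining mark
    "l": "SIGN_VERTICAL_STROKE",       # classic l/1 look-alikes
    "1": "SIGN_VERTICAL_STROKE",
    "O": "SIGN_RING",                  # classic O/0 look-alikes
    "0": "SIGN_RING",
}

# Try longer sequences first so that base + mark wins over the base alone.
_KEYS = sorted(SIGN_TABLE, key=len, reverse=True)


def to_signs(label: str) -> list:
    """Translate a label into sign identifiers by greedy longest match."""
    signs = []
    i = 0
    while i < len(label):
        for key in _KEYS:
            if label.startswith(key, i):
                signs.append(SIGN_TABLE[key])
                i += len(key)
                break
        else:
            # No table entry: the character stands for itself.
            signs.append(label[i])
            i += 1
    return signs


def same_signs(a: str, b: str) -> bool:
    """Two labels are confusable here if their sign sequences are equal."""
    return to_signs(a) == to_signs(b)


print(same_signs("\u08A1", "\u0628\u0654"))  # True
print(same_signs("l0", "1O"))                # True
print(same_signs("ab", "ac"))                # False
```

The point of the sketch is only that the comparison is made on sign
codes, not on code points, so it does not depend on what a
normalization form happens to unify.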
>At 18:01 08/08/2014, John Levine wrote:
>If I may stick my semi-informed oar in, it seems to me that for
>linguistic purposes, homographs are generally not an
>issue. Remember all those manual typewriters that didn't have digit
>1 or 0 keys, so you used letters l and O instead.
>In our case, homographs are a big deal. So can we just say that,
>and decide to do whatever minimizes homograph issues even though
>it's not the same as what would reflect linguistic usage?
IMHO a request for a common effort should go to the various trade and
government SDOs in order to secure printed/computerized systems
(banks, police, passports, etc.) and should be associated with RFID