Visually confusable characters (was: Re: Unicode 7.0.0, (combining) Hamza Above, and normalization for comparison)

Jefsey jefsey at
Fri Aug 8 20:32:34 CEST 2014

>At 22:29 07/08/2014, John C Klensin wrote:
>And I suspect this list has now had enough of this discussion.

>At 19:37 07/08/2014, John C Klensin wrote:
>Personally, I don't believe an objective standard and categorization 
>is possible unless one constrains the problem to the point of making 
>it uninteresting (e.g., by believing that the world can, in 
>practice, be forced into a single universal type style or type family).

Dear John,
What is uninteresting and cumbersome, and what I have certainly had enough 
of discussing, are exceptions, whatever they may be. Unicode is a 
typographic encoding we chose to use for standard end to end 
transmissions and related services (with some restrictions). Either 
it comes with a built-in common protocol to support what we consider 
as exceptions or we do not accept them.

>At 01:07 08/08/2014, Andrew Sullivan wrote:
>Sorry for the iPhoney reply.  I wasn't trying to say the result is 
>wrong as such.  I think it _may_ be wrong for IDNA and therefore 
>possibly an indication that our approach in IDNA2008 (and therefore 
>alas in precis) is inadequate.

Dear Andrew,
certainly it is. But this is the best compromise we found due to the 
way Unicode is.

>At 00:56 08/08/2014, Whistler, Ken wrote:
>All of this discussion seems to be boiling down to IETF 
>second-guessing of Unicode character encoding decisions and 
>complaints about Unicode normalization not satisfying expectations 
>based on rather simplistic notions of which things that look the 
>same should *be* the same.

Dear Ken,
"things that look the same should *be* the same" is very confusing as 
you do not specify to whom they look the same. Unicode deals with 
people, we deal with interconnected people and the systems that 
interconnect them. This makes three different strata, in which there 
are different cognition/processing layers.

>At 01:23 08/08/2014, John C Klensin wrote:
>But I'm pretty sure that assertions that this is a different 
>character despite the same name as the combining sequence and, as 
>far as we can tell, an identical appearance, do not help move us forward.

I appreciate that you say in detail what I say in principle in that 
mail. Your conclusion confirms that we are in the same case as the 
French majuscules (once they have been lowercased by IDNA2008 - a 
piece of metadata that IDNA2008 loses). As long as Unicode does not 
support a metadata protocol for exception encoding ...

>At 02:27 08/08/2014, Andrew Sullivan wrote:
>I don't think that's a fair characterization.  Nobody is 
>"second-guessing" anything.  It's rather that we -- John, actually 
>-- discovered that there's a consequence of this case that we did 
>not previously understand, and it has uncomfortable consequences for 
>the way we had previously relied on Unicode, because it didn't work 
>the way we thought.

Dear Andrew,
Maybe it is time to reconsider the idea of an IETF Unicode that includes 
our exception management through an additional protocol rather than only 
through Patrik's tables?

>Presumably, implementers have a greater reason to become familiar 
>with the picky exceptional cases.

This is not possible unless picky exceptional cases (which include 
French majuscules :-) and many others) are supported by a 
standardized protocol. The question is simply where this protocol 
is to be located (at Unicode, in RFC 5895, at layer six).

>This has, note, not just implications for IDNA2008.  We have a whole 
>working group (PRECIS) that is busy attempting to use the same 
>strategy in a generalized way for other protocols.  It hasn't 
>shipped yet, but it's gone to the IESG.  So we can't just shrug our shoulders.


> > There are likely many similar-looking things that fit in a similar
> > bucket and have escaped notice.
>All the more reason to concern ourselves with it, no?


>At 14:10 08/08/2014, John C Klensin wrote:
>Stated simplistically, that understanding has been that 
>normalization would deal effectively with the issue of equality 
>comparisons between "characters" within the same script that had the 
>same appearance.

Your further comment, as well as my suggestion, is that we may have to 
refine what is then meant by "comparison" and "same appearance" 
("visually" is not precise enough).
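
To make the gap between "same appearance" and "equal after normalization" 
concrete, here is a minimal sketch in Python. It assumes the case under 
discussion is BEH with HAMZA ABOVE; the code points (U+0628, U+0654, U+08A1, 
and the ALEF pair used for contrast) and the helper name nfc_equal are 
supplied for illustration and are not taken from the messages quoted here.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Return True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# ALEF + combining HAMZA ABOVE: the precomposed letter U+0623 has a canonical
# decomposition, so NFC treats the two spellings as the same string.
alef_sequence = "\u0627\u0654"
alef_precomposed = "\u0623"

# BEH + combining HAMZA ABOVE: U+08A1 (added in Unicode 7.0) carries no
# canonical decomposition, so NFC leaves the two spellings distinct.
beh_sequence = "\u0628\u0654"
beh_precomposed = "\u08A1"

print("ALEF pair equal under NFC:", nfc_equal(alef_sequence, alef_precomposed))
print("BEH pair equal under NFC: ", nfc_equal(beh_sequence, beh_precomposed))
```

If the assumption holds, the first comparison succeeds and the second does 
not, although both pairs may render alike in a given font. A comparison 
based on NFC alone therefore does not answer the question "to whom do they 
look the same".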

>If we now find that intra-script normalization is insufficient to 
>give us a consistent identity comparison among the different ways a 
>character (shape) could be formed within the same script, then it 
>seems to me that it is not inclusion that is at risk but simply that 
>assumption of normalization sufficiency.


>While I gather that the idea of a specialized normalization form 
>would remind some people of very early discussions (even 
>disagreements) within the Unicode Consortium process, we might have 
>to contemplate an IETF-specific, or IDN-specific, normalization 
>form, one built on the strict visual form model that we understood 
>rather than incorporating per-character language, linguistic, or 
>phonetic or other usage considerations for some cases.
>A decision to move in the direction of a different, 
>non-Unicode-standard, normalization form would probably take us down 
>the path toward character-by-character evaluations that
>many of us have dreaded (again, since early in the pre-IDNA2003 discussions).

This would be too black-and-white a decision. I would rather add a 
discrimination algorithm to IDNA than quit Unicode.

>But that brings us back to your observation about recalibrating risk 
>understanding and deciding whether the risk --or the mechanisms 
>needed to mitigate it -- are worth the
>effort and reward.   But I've seen no evidence, or even strong 
>hints, that the issues this case has turned up bring the inclusion 
>model, or even the existing IDNA2008 rule and category sets, into 
>doubt, only the reliance on NFC to do a job that it appears that, 
>for some cases, it doesn't actually do and wasn't intended to do.

This calls for a market study (cf. RFC 6852): what is the market 
demand for 10 years old - those interested are welcome.

>At 14:49 08/08/2014, Vint Cerf wrote:
>I think this is an important insight and it may indeed be the case 
>that normalization for Domain Name purposes and normalization for 
>other purposes are not as aligned as we supposed. Most users of the 
>Unicoded scripts are unaware of most or any of the various 
>mechanisms associated with Unicode and will likely be guided more by 
>the principle of least astonishment than anything else. I wonder 
>whether a domain-name-specific normalization would improve the 
>likelihood of achieving the aim of that principle?

Dear Vint,
the issue is the principle of least astonishment for the human reader 
and non-confusability for the computer reader. The response is a "sign + 
equivalence table" based on linguistic use tags. So far the best 
system I have found (cf. my initial exchange with John) is a raster 
geometric grid based upon a legally accepted uniform script. The 
question is not font diversity but the sign/symbol code memorized by 
the computer. This standardization should be made compatible 
(for many uses) with the various ISO TCs. This Wikipedia page can be 
used as a starting point:

>At 18:01 08/08/2014, John Levine wrote:
>If I may stick my semi-informed oar in, it seems to me that for 
>linguistic purposes, homographs are generally not an 
>issue.  Remember all those manual typewriters that didn't have digit 
>1 or 0 keys, so you used letters l and O instead.
>In our case, homographs are a big deal.  So can we just say that, 
>and decide to do whatever minimizes homograph issues even though 
>it's not the same as what would reflect linguistic usage?

IMHO a request for a common effort should go to the various trade and 
government SDOs in order to secure printed/computerized systems 
(banks, police, passports, etc.) and should be associated with 
RFID-oriented work.

More information about the Idna-update mailing list