Unicode & IETF

Tue Aug 12 16:53:21 CEST 2014

I just had a Princess Bride moment 😮.
Often when I've had disagreements with intelligent people, the
disagreement turns out to be a difference in the use of terms.

Bear with me a bit while I set out how Unicode uses the terms, because it
appears to differ from how the IETF uses them.
(I'll simplify a bit, but nothing material for the purpose of this
discussion.)

First, the term "character" has so many different meanings that it is best
to avoid it completely where clarity is needed. So let's just talk about
assigned Unicode code points and glyphs.

*Glyphs. *An assigned Unicode code point has a set of glyphs (shapes) that
can normally represent it. Think of the letter 'a', for example. Not only
has it different glyphs based on font-family, such as the following:
a, a, a, a, a, a, a

but also variations within a font (regular vs italic), weights (not only
bold and light, but arbitrary weights in between), width, size, etc. The
set is theoretically unbounded, although there are of course physical
limits. For more, see: http://www.w3.org/TR/css3-fonts/

*Homoglyphs. *When two assigned Unicode code points have intersecting sets
of glyphs, they are called homoglyphs. Examples:

   1. U+0430 <http://unicode.org/cldr/utility/character.jsp?a=0430> ( а )
   CYRILLIC SMALL LETTER A
   2. U+0061 <http://unicode.org/cldr/utility/character.jsp?a=0061> ( a )
   LATIN SMALL LETTER A.

But this is not confined to single code points; it may include sequences of
one or more code points, such as the following homoglyphs:

   1. U+00E5 <http://unicode.org/cldr/utility/character.jsp?a=00E5> ( å )
   LATIN SMALL LETTER A WITH RING ABOVE
   2. U+0061 <http://unicode.org/cldr/utility/character.jsp?a=0061>, U+030A
   <http://unicode.org/cldr/utility/character.jsp?a=030A> ( å ) LATIN
   SMALL LETTER A, COMBINING RING ABOVE
   3. U+0430 <http://unicode.org/cldr/utility/character.jsp?a=0430>, U+030A
   <http://unicode.org/cldr/utility/character.jsp?a=030A> ( а̊ ) CYRILLIC
   SMALL LETTER A, COMBINING RING ABOVE

Note that in some cases the overlap among glyph sets is very large; they
are essentially the same. That is the case for #1, #2, and #3 above.
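Since the three sequences above may render identically in a mail client, it can help to look at the code points rather than the glyphs. The following is only a small sketch using Python's standard unicodedata module; the names SEQUENCES and describe are my own, chosen for illustration.

```python
import unicodedata

# The three sequences from the list above, written as escapes so the
# code points are unambiguous regardless of how this message is rendered.
SEQUENCES = {
    "#1": "\u00E5",        # precomposed Latin a with ring above
    "#2": "\u0061\u030A",  # Latin a + combining ring above
    "#3": "\u0430\u030A",  # Cyrillic a + combining ring above
}

def describe(s):
    """Return a (U+XXXX, name) pair for each code point in s."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in s]

for label, seq in SEQUENCES.items():
    print(label, describe(seq))
```

All three strings are distinct as code point sequences, even where a font gives them the same shape.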

In other cases, the overlap is much smaller: an italic glyph for U+0438
<http://unicode.org/cldr/utility/character.jsp?a=0438> ( и ) CYRILLIC SMALL
LETTER I normally looks identical to an italic glyph for U+0075
<http://unicode.org/cldr/utility/character.jsp?a=0075> ( u ) LATIN SMALL
LETTER U, but non-italic glyphs are normally different. And there is a
whole range between "essentially the same" glyph set and just a narrow
overlap.

*Confusables.* All homoglyphs are confusables. Confusables are just a bit
broader. The glyph sets don't have to intersect: it is enough that some
glyphs in each set are confusably similar. (More on that in TR36.)

*Canonical Equivalence. *This is a specification for when Unicode considers
that two sequences of code points are to be regarded as "meaning the same
thing". Of course, there are other environments where "meaning the same
thing" can be differently and more broadly interpreted, such as "has the
same case folding", or "is a homoglyph". But canonical equivalence is the
core Unicode definition.
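One common way to test canonical equivalence in practice is to compare normalized forms. Here is a minimal sketch, assuming Python's unicodedata module; the helper name canonically_equivalent is hypothetical. It also contrasts canonical equivalence with the broader "same case folding" notion mentioned above.

```python
import unicodedata

def canonically_equivalent(a, b):
    """True if a and b have the same NFD form, i.e. are canonically equivalent."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00E5"       # LATIN SMALL LETTER A WITH RING ABOVE
decomposed = "\u0061\u030A"  # LATIN SMALL LETTER A, COMBINING RING ABOVE

print(canonically_equivalent(precomposed, decomposed))  # True

# "A" and "a" are not canonically equivalent, although they do match
# under the broader "same case folding" interpretation.
print(canonically_equivalent("A", "a"))   # False
print("A".casefold() == "a".casefold())   # True
```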

So the following passage from Vint causes some head-scratching.

> This is not about "confusables" in the sense that some characters look like
> others.
>
> It is about the fact that the same glyph has multiple encodings that do not
> collapse to an unambiguous canonical form.

Let's walk through the example above.

   1. The same glyph (for #1, #2, and #3 above) has multiple encodings
   (namely #1, #2, and #3 above).
   2. They are homoglyphs, and thus confusables.
   3. #1 and #2 collapse to the same canonical form (#1).
   4. However, #3 does *not* collapse to that form, despite "having the
   same glyph". In terms of Unicode, it "doesn't mean the same thing".
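The walk-through above can be checked mechanically. This sketch assumes NFC as the "canonical form" in question and groups the three sequences by their NFC result; the variable and function names are illustrative only.

```python
import unicodedata

seqs = {
    "#1": "\u00E5",        # precomposed Latin a with ring above
    "#2": "\u0061\u030A",  # Latin a + combining ring above
    "#3": "\u0430\u030A",  # Cyrillic a + combining ring above
}

def code_points(s):
    """Format a string as a tuple of U+XXXX labels."""
    return tuple(f"U+{ord(c):04X}" for c in s)

# Group the labels by the NFC form of each sequence.
groups = {}
for label, s in seqs.items():
    nfc = unicodedata.normalize("NFC", s)
    groups.setdefault(code_points(nfc), []).append(label)

for form, labels in groups.items():
    print(form, labels)
# ('U+00E5',) ['#1', '#2']
# ('U+0430', 'U+030A') ['#3']
```

#1 and #2 land in the same group; #3 stays on its own, because normalization never maps a Cyrillic letter to a Latin one.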

And #3 is not alone. Because of the combinatorics, there are probably more
cases like #3 than there are Unicode characters! So U+08A1 is not at all an
isolated case: aside from the other Arabic characters cited, there are
indefinitely many other cases of sequences that are not canonically
equivalent, but have "the same glyph".
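To make the U+08A1 case concrete, here is a sketch comparing it with an Arabic letter that does have a canonical decomposition. Two things here are my assumptions rather than anything stated above: that the sequence at issue is U+0628 followed by U+0654, and the choice of U+0623 as the contrasting case. The result also depends on the Unicode version of the Python build, although the assertions hold either way since an unassigned code point has no decomposition.

```python
import unicodedata

BEH_WITH_HAMZA = "\u08A1"         # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_PLUS_HAMZA = "\u0628\u0654"   # BEH + combining HAMZA ABOVE (assumed sequence)
ALEF_WITH_HAMZA = "\u0623"        # ARABIC LETTER ALEF WITH HAMZA ABOVE
ALEF_PLUS_HAMZA = "\u0627\u0654"  # ALEF + combining HAMZA ABOVE

def nfc(s):
    """Shorthand for NFC normalization."""
    return unicodedata.normalize("NFC", s)

# U+0623 has a canonical decomposition, so the sequence composes to it.
print(unicodedata.decomposition(ALEF_WITH_HAMZA))  # 0627 0654
print(nfc(ALEF_PLUS_HAMZA) == ALEF_WITH_HAMZA)     # True

# U+08A1 has none, so the look-alike sequence stays a separate string.
print(repr(unicodedata.decomposition(BEH_WITH_HAMZA)))  # ''
print(nfc(BEH_PLUS_HAMZA) == BEH_WITH_HAMZA)            # False
```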

The purpose of this is *not* to show that Vint is wrong; it is instead to
show that miscommunication is causing some fundamental misunderstandings.

Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*