Unicode & IETF

Vint Cerf vint at google.com
Tue Aug 12 17:09:19 CEST 2014


mark,

the problem is poverty of vocabulary then. I said nothing about "meaning"
only about encoding and the side effects of having two ways to represent
the same <character? glyph? thing?>. Unless canonicalization produces only
one representation, comparisons can fail and create unintended results.
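
To make the comparison problem concrete, here is a minimal sketch, assuming Python 3 and its standard unicodedata module. The sample label and the names registered, query and nfc are hypothetical, chosen only for illustration. The lookup succeeds after normalization only because these two spellings happen to be canonically equivalent.

```python
import unicodedata

def nfc(text):
    """Return text in Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)

# A label stored with precomposed letters (U+00E5, U+00F6).
registered = {"\u00E5ngstr\u00F6m"}

# The same text entered with base letters plus combining marks.
query = "a\u030Angstro\u0308m"

# A raw code point comparison treats the two spellings as different.
print(query in registered)            # False

# Normalizing both sides gives a single representation to compare.
registered_nfc = {nfc(label) for label in registered}
print(nfc(query) in registered_nfc)   # True
```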

v



On Tue, Aug 12, 2014 at 10:53 AM, Mark Davis ☕️ <mark at macchiato.com> wrote:

> I just had a Princess Bride moment 😮.
> Often when I've had disagreements with intelligent people, the
> disagreement turns out to be a difference in the use of terms.
>
> Bear with me a bit while I set out how Unicode uses the terms, because it
> appears to differ from how the IETF uses them.
> (I'll simplify a bit, but nothing material for the purpose of this
> discussion.)
>
> First, the term "character" has so many different meanings that it is best
> to avoid it completely where clarity is needed. So let's just talk about
> assigned Unicode code points and glyphs.
>
> *Glyphs.* An assigned Unicode code point has a set of glyphs (shapes)
> that can normally represent it. Think of the letter 'a', for example. Not
> only does it have different glyphs based on font family, such as the following:
> a, a, a, a, a, a, a
>
> but also variations within a font (regular vs italic), weights (not only
> bold and light, but arbitrary weights in between), width, size, etc. The
> set is theoretically unbounded, although there are of course physical
> limits. For more, see: http://www.w3.org/TR/css3-fonts/
>
> *Homoglyphs.* When two assigned Unicode code points have intersecting
> sets of glyphs, they are called homoglyphs. Examples:
>
>    1. U+0430 <http://unicode.org/cldr/utility/character.jsp?a=0430> ( а )
>    CYRILLIC SMALL LETTER A
>    2. U+0061 <http://unicode.org/cldr/utility/character.jsp?a=0061> ( a )
>    LATIN SMALL LETTER A
>
> But this is not confined to single code points; it may include sequences
> of one or more code points, such as the following homoglyphs:
>
>    1. U+00E5 <http://unicode.org/cldr/utility/character.jsp?a=00E5> ( å )
>    LATIN SMALL LETTER A WITH RING ABOVE
>    2. U+0061 <http://unicode.org/cldr/utility/character.jsp?a=0061>,
>    U+030A <http://unicode.org/cldr/utility/character.jsp?a=030A> ( å )
>    LATIN SMALL LETTER A, COMBINING RING ABOVE
>    3. U+0430 <http://unicode.org/cldr/utility/character.jsp?a=0430>,
>    U+030A <http://unicode.org/cldr/utility/character.jsp?a=030A> ( а̊ )
>    CYRILLIC SMALL LETTER A, COMBINING RING ABOVE
>
> Note that in some cases the overlap among glyph sets is very large; they
> are essentially the same. That is the case for #1, #2, and #3 above.
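
As an illustration of the three sequences listed above, here is a small sketch, assuming Python 3 and its standard unicodedata module. The names SEQUENCES and describe are hypothetical. It prints the code points and their Unicode names, which differ even where the rendered glyphs do not.

```python
import unicodedata

# The three sequences from the list above, written as escapes so the
# code points stay unambiguous however this message is rendered.
SEQUENCES = {
    "#1": "\u00E5",
    "#2": "\u0061\u030A",
    "#3": "\u0430\u030A",
}

def describe(text):
    """Return 'U+XXXX NAME' for each code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

for label, seq in SEQUENCES.items():
    print(label, describe(seq))
```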
>
> In other cases, the overlap is much smaller: an italic glyph for U+0438
> <http://unicode.org/cldr/utility/character.jsp?a=0438> ( и ) CYRILLIC
> SMALL LETTER I normally looks identical to an italic glyph for U+0075
> <http://unicode.org/cldr/utility/character.jsp?a=0075> ( u ) LATIN SMALL
> LETTER U, but non-italic glyphs are normally different. And there is a
> whole range between "essentially the same" glyph set and just a narrow
> overlap.
>
> *Confusables.* All homoglyphs are confusables. Confusables are just a bit
> broader. The glyph sets don't have to intersect: it is enough that some
> glyphs in each set are confusably similar. (More on that in TR36.)
>
> *Canonical Equivalence.* This is a specification for when Unicode
> considers that two sequences of code points are to be regarded as "meaning
> the same thing". Of course, there are other environments where "meaning the
> same thing" can be differently and more broadly interpreted, such as "has
> the same case folding", or "is a homoglyph". But canonical equivalence is
> the core Unicode definition.
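
The difference between canonical equivalence and a broader notion such as "has the same case folding" can be sketched as follows, assuming Python 3 and its standard unicodedata module. The function names are hypothetical, and the test characters (U+212B, U+00C5, U+00E5) are my own choice of example, not taken from the message.

```python
import unicodedata

def nfd(text):
    """Return text in Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)

def canonically_equivalent(a, b):
    """True if a and b have the same canonical decomposition."""
    return nfd(a) == nfd(b)

def same_case_folding(a, b):
    """A broader test: equal after case folding the decomposed forms."""
    return nfd(nfd(a).casefold()) == nfd(nfd(b).casefold())

angstrom_sign = "\u212B"    # ANGSTROM SIGN
capital_a_ring = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
small_a_ring = "\u00E5"     # LATIN SMALL LETTER A WITH RING ABOVE

print(canonically_equivalent(angstrom_sign, capital_a_ring))  # True
print(canonically_equivalent(capital_a_ring, small_a_ring))   # False
print(same_case_folding(capital_a_ring, small_a_ring))        # True
```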
>
> So the following passage from Vint causes some head-scratching.
>
>> This is not about "confusables" in the sense that some characters look
>> like others.
>>
>> It is about the fact that the same glyph has multiple encodings that do
>> not collapse to an unambiguous canonical form.
>
>
> Let's walk through the example above.
>
>    1. The same glyph (for #1, #2, and #3 above) has multiple encodings
>    (namely #1, #2, and #3).
>    2. They are homoglyphs, and thus confusables.
>    3. #1 and #2 collapse to the same canonical form (#1)
>    4. However, #3 does *not* collapse to that form, despite "having the
>    same glyph". In terms of Unicode, it "doesn't mean the same thing".
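
The walk-through above can be checked with a short sketch, assuming Python 3 and its standard unicodedata module. The names SEQUENCES and group_by_canonical_form are hypothetical. It groups the three sequences by their NFC form: #1 and #2 land in one group, while #3 stays on its own.

```python
import unicodedata
from collections import defaultdict

# The three sequences from the earlier list, as code point escapes.
SEQUENCES = {
    "#1": "\u00E5",          # precomposed Latin a with ring above
    "#2": "\u0061\u030A",    # Latin a + combining ring above
    "#3": "\u0430\u030A",    # Cyrillic a + combining ring above
}

def group_by_canonical_form(sequences):
    """Map each NFC form to the labels of the sequences that share it."""
    groups = defaultdict(list)
    for label, seq in sequences.items():
        groups[unicodedata.normalize("NFC", seq)].append(label)
    return dict(groups)

groups = group_by_canonical_form(SEQUENCES)
for form, labels in groups.items():
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in form)
    print(labels, "->", code_points)
```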
>
> And #3 is not alone. Because of the combinatorics, there are probably more
> cases like #3 than there are Unicode characters! So U+08A1 is not at all
> an isolated case: aside from the other Arabic characters cited there are
> indefinitely many other cases of sequences that are not canonically
> equivalent, but have "the same glyph".
>
> The purpose of this is *not* to show that Vint is wrong; it is instead
> to show that miscommunication is causing some fundamental misunderstandings.
>
> Mark <https://google.com/+MarkDavis>
>
> *— Il meglio è l’inimico del bene —*
>