Unicode & IETF

Mark Davis ☕️ mark at macchiato.com
Tue Aug 12 17:18:22 CEST 2014

This sentence actually makes my point:

> Unless canonicalization produces only one representation, comparisons can
> fail and create unintended results.

Unicode canonicalization (NFC) of canonically equivalent sequences always
produces a unique representation (the NFC form), if by "representation" you
mean "sequence of code points". And if by "comparison" you mean "code point
comparison", then such a comparison cannot fail.
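
As a small illustration, here is a minimal sketch in Python, assuming the
standard library's unicodedata module (the variable and function names are
made up for this example). It takes the two canonically equivalent encodings
of å and checks that they differ as raw code point sequences but are
identical after NFC:

```python
import unicodedata

precomposed = "\u00E5"       # U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
decomposed = "\u0061\u030A"  # U+0061 LATIN SMALL LETTER A + U+030A COMBINING RING ABOVE


def nfc_code_points(s):
    """Return the NFC form of s as a list of code point integers."""
    return [ord(ch) for ch in unicodedata.normalize("NFC", s)]


# Before normalization the raw code point sequences differ.
raw_equal = precomposed == decomposed

# After NFC both are the same sequence of code points, so a plain
# code point comparison succeeds.
nfc_equal = nfc_code_points(precomposed) == nfc_code_points(decomposed)

print(raw_equal, nfc_equal)
```

The comparison is done on code points on purpose: that is the sense of
"representation" in which NFC gives a unique answer.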

If you mean something else by "representation" and "comparison", then you
have to define exactly what you mean. And what I'm saying is that it would
help, since we are talking about Unicode, to use the Unicode terms. So, for
example, you might mean by "representation" a sequence of glyphs...
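
If "representation" is taken at the glyph level, normalization gives no such
guarantee. A hedged sketch, assuming Python's standard unicodedata module
(the helper name canonically_equivalent is hypothetical, introduced only for
this illustration): the Latin and Cyrillic sequences from the quoted message
below normally look alike, yet they are not canonically equivalent, so NFC
leaves them distinct.

```python
import unicodedata

latin = "\u0061\u030A"     # U+0061 LATIN SMALL LETTER A + U+030A COMBINING RING ABOVE
cyrillic = "\u0430\u030A"  # U+0430 CYRILLIC SMALL LETTER A + U+030A COMBINING RING ABOVE


def canonically_equivalent(a, b):
    """True if a and b have the same NFC form, compared code point by code point."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Similar glyphs, but different code point sequences even after NFC.
same = canonically_equivalent(latin, cyrillic)

# The Cyrillic sequence has no precomposed form, so NFC leaves it unchanged.
cyrillic_nfc = [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", cyrillic)]

print(same, cyrillic_nfc)
```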

Mark <https://google.com/+MarkDavis>

 *— Il meglio è l’inimico del bene —*

On Tue, Aug 12, 2014 at 8:09 AM, Vint Cerf <vint at google.com> wrote:

> mark,
> the problem is poverty of vocabulary then. I said nothing about "meaning"
> only about encoding and the side effects of having two ways to represent
> the same <character? glyph? thing?>. Unless canonicalization produces only
> one representation, comparisons can fail and create unintended results.
> v
> On Tue, Aug 12, 2014 at 10:53 AM, Mark Davis ☕️ <mark at macchiato.com>
> wrote:
>> I just had a Princess Bride moment 😮.
>> Often when I've had disagreements with intelligent people, the
>> disagreement turns out to be a difference in the use of terms.
>> Bear with me a bit, while I set out how Unicode uses the terms, because
>> its usage appears to differ from the IETF's.
>> (I'll simplify a bit, but nothing material for the purpose of this
>> discussion.)
>> First, the term "character" has so many different meanings that it is
>> best to avoid it completely where clarity is needed. So let's just talk
>> about assigned Unicode code points and glyphs.
>> *Glyphs.* An assigned Unicode code point has a set of glyphs (shapes)
>> that can normally represent it. Think of the letter 'a', for example. Not
>> only has it different glyphs based on font-family, such as the following:
>> a, a, a, a, a, a, a
>> but also variations within a font (regular vs italic), weights (not only
>> bold and light, but arbitrary weights in between), width, size, etc. The
>> set is theoretically unbounded, although there are of course physical
>> limits. For more, see: http://www.w3.org/TR/css3-fonts/
>> *Homoglyphs. *When two assigned Unicode code points have intersecting
>> sets of glyphs, they are called homoglyphs. Examples:
>>    1. U+0430 <http://unicode.org/cldr/utility/character.jsp?a=0430> ( а )
>>    2. U+0061 <http://unicode.org/cldr/utility/character.jsp?a=0061> ( a )
>> But this is not confined to single code points; it may include sequences
>> of one or more code points, such as the following homoglyphs:
>>    1. U+00E5 <http://unicode.org/cldr/utility/character.jsp?a=00E5> ( å )
>>    2. U+0061 <http://unicode.org/cldr/utility/character.jsp?a=0061>,
>>    U+030A <http://unicode.org/cldr/utility/character.jsp?a=030A> ( å )
>>    3. U+0430 <http://unicode.org/cldr/utility/character.jsp?a=0430>,
>>    U+030A <http://unicode.org/cldr/utility/character.jsp?a=030A> ( а̊ )
>> Note that in some cases the overlap among glyph sets is very large; they
>> are essentially the same. That is the case for #1, #2, and #3 above.
>> In other cases, the overlap is much smaller: an italic glyph for U+0438
>> <http://unicode.org/cldr/utility/character.jsp?a=0438> ( и ) CYRILLIC
>> SMALL LETTER I normally looks identical to an italic glyph for U+0075
>> <http://unicode.org/cldr/utility/character.jsp?a=0075> ( u ) LATIN SMALL
>> LETTER U, but non-italic glyphs are normally different. And there is a
>> whole range between "essentially the same" glyph set and just a narrow
>> overlap.
>> *Confusables.* All homoglyphs are confusables. Confusables are just a
>> bit broader. The glyph sets don't have to intersect: it is enough that some
>> glyphs in each set are confusably similar. (More on that in TR36.)
>> *Canonical Equivalence. *This is a specification for when Unicode
>> considers that two sequences of code points are to be regarded as "meaning
>> the same thing". Of course, there are other environments where "meaning the
>> same thing" can be differently and more broadly interpreted, such as "has
>> the same case folding", or "is a homoglyph". But canonical equivalence is
>> the core Unicode definition.
>> So the following passage from Vint causes some head-scratching.
>>> This is not about "confusables" in the sense that some characters look
>>> like others.
>>> It is about the fact that the same glyph has multiple encodings that do
>>> not collapse to an unambiguous canonical form.
>> Let's walk through the example above.
>>    1. The same glyph (for #1, #2, and #3 above) has multiple encodings
>>    (namely #1, #2, and #3).
>>    2. They are homoglyphs, and thus confusables.
>>    3. #1 and #2 collapse to the same canonical form (#1).
>>    4. However, #3 does *not* collapse to that form, despite "having the
>>    same glyph". In terms of Unicode, it "doesn't mean the same thing".
>> And #3 is not alone. Because of the combinatorics, there are probably
>> more cases like #3 than there are Unicode characters! So U+08A1 is not
>> at all an isolated case: aside from the other Arabic characters cited, there
>> are indefinitely many other cases of sequences that are not canonically
>> equivalent, but have "the same glyph".
>> The purpose of this is *not* to show that Vint is wrong; it is instead
>> to show that miscommunication is causing some fundamental misunderstandings.
>> Mark <https://google.com/+MarkDavis>
>> *— Il meglio è l’inimico del bene —*