Unicode 7.0.0, (combining) Hamza Above, and normalization for comparison

Mark Davis ☕️ mark at macchiato.com
Wed Aug 6 19:48:52 CEST 2014


On Wed, Aug 6, 2014 at 7:43 AM, Vint Cerf <vint at google.com> wrote:

> The point here is that there are two ways to encode the same character and
> none of the normalization mechanisms transforms the representation into a
> single, canonical encoding. The result is ambiguity for comparing one label
> to another in the domain names, leading to the potential for phishing. The
> principle is to try to avoid that outcome.


When someone says "two ways to encode the same *character*", they have to
make clear which of the *very* many senses of "character" is meant. One
quite common sense is "has the same visual appearance" (or is confusingly
similar). Another is "canonically equivalent" according to Unicode. These
senses definitely differ: there are code point sequences that have the same
appearance but are not canonically equivalent, such as Greek omicron and
Latin o.

The Unicode consortium defines the following to *not* be canonically
equivalent, even though they may look the same, which is why the former was
encoded.

   1. U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE
   2. U+0628 ARABIC LETTER BEH + U+0654 ARABIC HAMZA ABOVE

Because these sequences are not canonically equivalent, the consortium
encoded #1 for use in Fula. This was intentional, according to the
character encoding model for Arabic, *not an oversight*.
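
For anyone who wants to check this on their own machine, here is a minimal
sketch using Python's standard unicodedata module. The helper name
canonically_equivalent is my own, not part of any Unicode API, and the
result depends on the Unicode version bundled with the interpreter (U+08A1
is only assigned from Unicode 7.0 on).

```python
import unicodedata

# The two representations under discussion.
precomposed = "\u08A1"       # ARABIC LETTER BEH WITH HAMZA ABOVE
sequence = "\u0628\u0654"    # ARABIC LETTER BEH + ARABIC HAMZA ABOVE


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: two strings are canonically equivalent
    when their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# No normalization form maps one representation onto the other.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    same = unicodedata.normalize(form, precomposed) == unicodedata.normalize(form, sequence)
    print(form, same)

print(canonically_equivalent(precomposed, sequence))
```

Every comparison should report that the two differ, which matches the
statement that they are not canonically equivalent.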

Similarly, the consortium defines the following to *not* be canonically
equivalent, even though they may look the same.

   1. U+00D8 (Ø) LATIN CAPITAL LETTER O WITH STROKE
   2. U+004F (O) LATIN CAPITAL LETTER O + U+0338 ( ̸ ) COMBINING LONG
   SOLIDUS OVERLAY

Thus if (say) Danish had been a relatively obscure language, and Ø had not
been encoded, it would have been ok to encode #1 for Danish.

Compare that with the following, which both look the same, *and* which are
defined to be canonically equivalent.

   1. U+00D6 ( Ö ) LATIN CAPITAL LETTER O WITH DIAERESIS
   2. U+004F ( O ) LATIN CAPITAL LETTER O + U+0308 ( ̈ ) COMBINING DIAERESIS
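
The contrast between the two Latin pairs above can be sketched with Python's
standard unicodedata module. The helper name canonically_equivalent is an
illustrative assumption of mine, not a Unicode-defined function; it simply
compares NFD forms.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare the NFD forms of two strings."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


o_stroke = "\u00D8"          # LATIN CAPITAL LETTER O WITH STROKE
o_solidus = "O\u0338"        # O + COMBINING LONG SOLIDUS OVERLAY

o_diaeresis = "\u00D6"       # LATIN CAPITAL LETTER O WITH DIAERESIS
o_combining = "O\u0308"      # O + COMBINING DIAERESIS

# The stroke pair looks alike but is not canonically equivalent.
print(canonically_equivalent(o_stroke, o_solidus))

# The diaeresis pair is canonically equivalent: NFC composes the sequence.
print(canonically_equivalent(o_diaeresis, o_combining))
print(unicodedata.normalize("NFC", o_combining) == o_diaeresis)
```

The first check should come out false and the other two true, so
normalization unifies the diaeresis pair and leaves the stroke pair distinct.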

The consortium supplies specifications and code for determining canonical
equivalence, and for confusability. The latter is constantly being refined
as more information becomes available, so it is not appropriate for use at
a protocol level, but rather for higher-level tools and processes.

If one means a third sense of "character" (neither "glyph" nor "canonically
equivalent"), then it needs to be made clear what that sense is, and what
mechanism can be used to determine when two sequences do and don't
represent the same "character" in that sense.

Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*


More information about the Idna-update mailing list