IAB Statement on Identifiers and Unicode 7.0.0

Mark Davis ☕️ mark at macchiato.com
Wed Jan 28 11:41:02 CET 2015


On Wed, Jan 28, 2015 at 10:39 AM, Vint Cerf <vint at google.com> wrote:

> i had something different in mind. What was key to IDNA2008 was the
> uniqueness of the UNICODE/PUNYCODE representations. Essentially, after
> normalization, one expects that the two strings are unambiguously
> equivalent.
>

They are, according to Unicode normalization.

> mapping from normalized unicode to punycode and back should produce the
> same (character for character) string.
>

Yes, if Y = NFC(X), then unicode(punycode(Y)) = Y
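
A minimal sketch of that round trip, assuming Python's standard
unicodedata module and its built-in "punycode" codec (which encodes a
bare label, without the "xn--" prefix). The helper name roundtrip is
hypothetical and only for illustration:

```python
import unicodedata


def roundtrip(label: str) -> str:
    """NFC-normalize a label, Punycode-encode it, and decode it back."""
    y = unicodedata.normalize("NFC", label)
    encoded = y.encode("punycode")  # ASCII-only bytes
    return encoded.decode("punycode")


samples = [
    "\u08A1",         # BEH WITH HAMZA ABOVE as a single code point
    "\u0628\u0654",   # BEH followed by combining HAMZA ABOVE
    "bu\u0308cher",   # decomposed input; NFC composes u + U+0308
]

for s in samples:
    y = unicodedata.normalize("NFC", s)
    # The decoded string matches Y = NFC(X) character for character.
    print(roundtrip(s) == y)
```

The comparison is against Y, not against the raw input X: a decomposed
input comes back in its NFC form.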



> The problem that the Hamza discussion illustrates, as I understand it, is
> that there is no normalization that produces this result if one string uses
> the combined character and another uses the composed character sequence -
> no normalization produces an unambiguous result.
>

It does produce an unambiguous result in your sense of "mapping from
normalized unicode to punycode and back should produce the same (character
for character) string."

U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE

*NFC(U+08A1) = U+08A1*


U+0628 ARABIC LETTER BEH, U+0654 ARABIC HAMZA ABOVE

*NFC(U+0628, U+0654) = U+0628, U+0654*
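
The two results above can be checked with a short sketch, assuming
Python's standard unicodedata module (the helper name code_points is
hypothetical):

```python
import unicodedata

precomposed = "\u08A1"      # ARABIC LETTER BEH WITH HAMZA ABOVE
sequence = "\u0628\u0654"   # ARABIC LETTER BEH + ARABIC HAMZA ABOVE


def code_points(s: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


nfc_precomposed = unicodedata.normalize("NFC", precomposed)
nfc_sequence = unicodedata.normalize("NFC", sequence)

print(code_points(nfc_precomposed))      # U+08A1
print(code_points(nfc_sequence))         # U+0628 U+0654
print(nfc_precomposed == nfc_sequence)   # False
```

Each form is stable under NFC, and the two remain distinct strings.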


These happen to be confusable (with a good font and rendering engine), but
Unicode doesn't consider them to be the same under normalization. That is
the case for *thousands of other characters or character sequences* that
are confusable, but don't normalize together according to NFC.

For example, the same is true for:

U+006F LATIN SMALL LETTER O

*NFC(U+006F) = U+006F*

U+043E CYRILLIC SMALL LETTER O

*NFC(U+043E) = U+043E*

And for:

U+00F8 ( ø ) LATIN SMALL LETTER O WITH STROKE

*NFC(U+00F8) = U+00F8*

U+006F, U+0337 ( o̷ ) LATIN SMALL LETTER O, COMBINING SHORT SOLIDUS OVERLAY

*NFC(U+006F, U+0337) = U+006F, U+0337*

(With a good font and rendering engine, <U+006F, U+0337> will look the
same as U+00F8.)
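
As a sketch of the same point, assuming Python's standard unicodedata
module (the helper name nfc_equal is hypothetical), both pairs stay
distinct under NFC. The last pair, U+00F6 versus <U+006F, U+0308>, is
added here only as a contrast: it is canonically equivalent, so NFC
does map it together.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Return True when two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Confusable pairs that NFC keeps distinct.
print(nfc_equal("\u006F", "\u043E"))         # Latin o vs Cyrillic o: False
print(nfc_equal("\u00F8", "\u006F\u0337"))   # o with stroke vs o + overlay: False

# Contrast: a canonically equivalent pair that NFC does unify.
print(nfc_equal("\u00F6", "\u006F\u0308"))   # o with diaeresis: True
```

Whether two strings look alike and whether they normalize together are
separate questions; only the second is answered by NFC.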

Unicode does not guarantee that all confusable characters will be mapped
together by NFC, *and all the way along we have made it very clear that it
does not guarantee it.* The confusability issue is quite complex, and as
Asmus pointed out, not something that is amenable to solution in the lowest
level protocol (IDNA2008). There is far more about this in
http://unicode.org/reports/tr36/ and http://unicode.org/reports/tr39/


>
> v
>
>
> On Wed, Jan 28, 2015 at 3:43 AM, Mark Davis ☕ <
> mark at macchiato.com> wrote:
>
>>
>> On Wed, Jan 28, 2015 at 9:20 AM, Vint Cerf <vint at google.com> wrote:
>>
>>> I am reading your message as saying "ambiguity is ok if there are few
>>> instances of it" while some of us would like the handling of identifiers
>>> encoded with Unicode to be unambiguous.
>>>
>>
>> The sense of "unambiguous" that matters to users is that when they read a
>> sequence of glyphs, their interpretation of the underlying character
>> sequence is correct (in normal environments, with common fonts).
>>
>> That level of "unambiguous" was impossible, even before Unicode.
>>
>> Take 8859-5, with both o and Russian o, or ASCII with "google.corn" vs "
>> goog1e.com". [Both the 1 and lowercase L are an issue, but also in many
>> fonts—in common use—users will read the (r + n) in the former as an m.]
>>
>> To extend Andrew's death analogy, there is no way that we can all live
>> forever. However, there are clearly medical processes and social policies
>> that can improve and extend the years that we all have. But to be
>> productive, the focus needs to be on the big ticket items, and thus needs
>> to be prioritized by real data.
>>
>> Mark <https://google.com/+MarkDavis>
>>
>> *— Il meglio è l’inimico del bene —*
>>
>
>


More information about the Idna-update mailing list