Unicode & IETF

Tue Aug 12 19:31:49 CEST 2014

(It's strange how the subject line drifted to the current one.)

> the problem is poverty of vocabulary then. I said nothing about "meaning"
> only about encoding and the side effects of having two ways to represent the same <character? glyph? thing?>. Unless canonicalization produces only one representation, comparisons can fail and create unintended results.

That's very 'mathematical'.  Either A) the system has to ignore certain linguistic considerations in favor of mathematical precision, or B) the canonicalization has to allow linguistic variation at the expense of mathematical certainty.  With IDN we sort of have both.

For the first, since IDNA2003 we've sacrificed some linguistic variation for precision.  The Turkish I is an obvious example.  I don't want to argue right/wrong; I'm just pointing out that the result is probably what neither a Turkish user nor a non-Turkish user would expect for those 4 characters (I, ı, İ, i).
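
To make the Turkish I point concrete, here is a rough sketch in Python.  Note the assumption: I'm using the built-in str.casefold() as a stand-in for a locale-independent case mapping.  It is not the actual IDNA2003 nameprep table, and the names TURKISH_I and fold() are mine, purely for illustration.

```python
# The four characters involved in the "Turkish I" problem.
TURKISH_I = {
    "capital I": "\u0049",         # I
    "small i": "\u0069",           # i
    "capital dotted I": "\u0130",  # İ
    "small dotless i": "\u0131",   # ı
}


def fold(text: str) -> str:
    """Locale-independent case folding (assumption: a stand-in for the
    IDNA2003 mapping step, not the real nameprep tables)."""
    return text.casefold()


folded = {name: fold(ch) for name, ch in TURKISH_I.items()}

for name, result in folded.items():
    # Show the code points so the combining dot is visible.
    print(name, "->", " ".join(f"U+{ord(c):04X}" for c in result))
```

Under this folding, capital I goes to plain i rather than dotless ı (not what Turkish orthography pairs it with), and capital İ becomes i followed by a combining dot above, which a non-Turkish user is unlikely to anticipate either.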

DNS has never really had 'certainty'.  For German users before IDN, it was unclear without testing whether to use a or ae when a word had an a-umlaut.  Now there's an additional form that users can try.

So, I hypothesize that canonicalization is important in that it provides consistent output for the same inputs.  We know that it's going to fail linguistically, so we need to ensure that it remains consistent.  If it remains consistent, then ambiguities can be resolved by bundling or blocking at the registrar level, or by anti-phishing, blacklisting, and similar tactics at the client level.
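
What I mean by "consistent" can be sketched like this.  The assumptions: canonical() is a hypothetical helper of my own, built from Python's unicodedata.normalize() with NFC plus case folding.  It only illustrates the property and is not the IDNA algorithm.

```python
import unicodedata


def canonical(label: str) -> str:
    """Hypothetical canonical form: case-fold, then compose with NFC.
    An illustration of consistency only, not the IDNA mapping."""
    return unicodedata.normalize("NFC", label.casefold())


# Two encodings of the same visible string, using a-umlaut.
precomposed = "b\u00e4r"   # a-umlaut as one code point
decomposed = "ba\u0308r"   # a followed by a combining diaeresis

# A raw comparison fails even though the strings look identical.
raw_equal = precomposed == decomposed

# After canonicalization both inputs give the same output.
canonical_equal = canonical(precomposed) == canonical(decomposed)

# Running it again changes nothing for this input.
stable = canonical(canonical(decomposed)) == canonical(decomposed)

print(raw_equal, canonical_equal, stable)
```

Nothing here makes the a versus ae question go away.  It only guarantees that whatever the user typed maps to one predictable string, which is what bundling or blocking can then be built on.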

When moving the needle through the gray area between linguistic permissiveness and mathematical precision, I would prefer to err on the side of allowing people to type the things they think they need to type.

-Shawn