UTS 46 (was: IDN processing-related security considerations for draft-ietf-websec-strict-transport-sec)

Tue Oct 11 07:17:51 CEST 2011

On 10 October 2011 19:33, John C Klensin wrote:

> I'm not sure what you are asking about... or for.

Hi John, my main point was that the Unicode IDNA online tool
works again; I could test xn--cocacola.  It contains three
occurrences of two unassigned Unicode points sometimes shown
as U+FFFD.  And the remaining PVALID points are in a script
I cannot read; so from my POV I _hope_ that user agents will
offer to display only scripts selected by me, and otherwise
stick to raw XN-labels.
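
For anyone who wants to repeat that kind of test locally, here is a
minimal sketch, assuming Python's standard "punycode" codec and
unicodedata module.  The function name describe_label is made up for
illustration, and "unassigned" only means "category Cn in the Unicode
version this Python build ships", which also covers noncharacters.

```python
import unicodedata

def describe_label(a_label):
    """Decode one XN-label and describe each code point in it.

    Returns a list of (code point, description) tuples.
    """
    if not a_label.lower().startswith("xn--"):
        raise ValueError("not an XN-label: %r" % a_label)
    # Strip the "xn--" introducer and undo the Punycode encoding.
    u_label = a_label[4:].encode("ascii").decode("punycode")
    rows = []
    for ch in u_label:
        if unicodedata.category(ch) == "Cn":
            # Unknown to this database version; may be assigned later.
            desc = "<unassigned in Unicode %s>" % unicodedata.unidata_version
        else:
            desc = unicodedata.name(ch, "<unnamed>")
        rows.append(("U+%04X" % ord(ch), desc))
    return rows
```

describe_label("xn--cocacola") can be called the same way; no output
is shown here because it depends on the Unicode version in use.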

That preference isn't affected by the IDNA version or UTS 46.
If user agents can somehow display only "my" scripts, then
they would also be able to flag, say, Latin + Cyrillic mixtures,
no matter whether that is an otherwise "valid" U-label below .net
or .blogspot.com or .xn--xyzzy.dyndns.org.
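
A rough sketch of such a mixture check, assuming Python and taking
the first word of the Unicode character name (LATIN, CYRILLIC, ...)
as a stand-in for the script property, which the standard library
does not expose directly.  The names label_scripts and is_mixed are
hypothetical, and a real user agent would need proper script data.

```python
import unicodedata

def label_scripts(u_label):
    """Return the approximate set of scripts used in one U-label."""
    scripts = set()
    for ch in u_label:
        # Digits and the hyphen are treated as script-neutral here.
        if ch.isdigit() or ch == "-":
            continue
        name = unicodedata.name(ch, "")
        # Assumption: the first word of the name approximates the script.
        scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts

def is_mixed(u_label):
    """True if the label draws on more than one approximate script."""
    return len(label_scripts(u_label)) > 1
```

With this, a label that swaps U+0430 (Cyrillic a) into an otherwise
Latin word is flagged, while a plain Latin label with umlauts is not.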

> An unassigned code point can, in principle, always be assigned
> in some future version of Unicode.  I suppose one could make
> predictions about likelihood on a script by script or block by
> block basis, but they would be predictions, not firm promises.

ACK, no problem with that.  If user agents do not know a given
Unicode point they will handle it as "unassigned".  And their
"knowledge" can be in ROM (or similar scenarios).

> So what issue do you see and what do you think should be done
> about it?

Nothing relevant for this list; I know where I can disable IDNA
in Firefox, and I'd know how to submit Chrome feature requests.

Those are just two examples of user agents where I cannot select
which scripts I know.  Apparently UTS 46 got it "right" for the few
characters I'd really need (äöüß as it used to be in IDNA2003),
for a very subjective concept of "right".

>> Different ??-- introducers identifying selected subsets of
>> relevant scripts could be an idea.

> Yes.  And, has been discussed many times before, one could use
> different introducers (or one introducer and a language tag) to
> identify labels by language.   However, such a strategy would
> not change the exact match behavior of DNS servers, so the user
> would need to know exactly what language was in use and how
> (e.g., to what level of precision) it was coded in order to
> successfully look something up.   I can think of a whole
> collection of reasons why that is impractical.   Using different
> prefixes (introducers) to identify different script subsets
> would have the same problems or worse because, again, the user
> would need to be able to identify the intentions of the
> registrant in order to look up a string.  Perhaps YMMD.

Yes, I'd like it better if there were 36*36 potential Unicode
subsets indicated by aa-- to 99-- labels, where registered subsets
are specified in an RFC and/or IANA registry.  I could then pick
"show valid cy-- labels for Cyrillic" and "valid nn-- labels for
what I consider no-nonsense Latin (no sharp s, no long s, no IPA)".

But obviously that's not what happened in IDNA, and emulating this
arguably desired behaviour on top of xn-- is left as an exercise
for browser developers.
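
Purely as an illustration of that imagined scheme (it does not exist;
"xn" is the only introducer IDNA actually uses), a user-agent filter
could look like the sketch below.  The tags "cy" and "nn" are the
made-up examples from above, and display_as_unicode is a hypothetical
name.

```python
import re

# Two characters from [a-z0-9] followed by "--": 36 * 36 = 1296 tags.
PREFIX_RE = re.compile(r"^([a-z0-9]{2})--", re.IGNORECASE)

# Hypothetical user preference: subset tags this user opted into.
ACCEPTED = {"cy", "nn"}

def display_as_unicode(label, accepted=ACCEPTED):
    """Decide whether a label carries a subset tag the user selected.

    Labels with any other tag, or with no tag at all, would be left
    in their raw ASCII form.
    """
    m = PREFIX_RE.match(label)
    return bool(m) and m.group(1).lower() in accepted
```

Everything with an unselected tag, including today's xn-- labels,
would stay in raw form under this preference.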

I posted the xn--cocacola update here, because I stumbled over an
unrelated ICANN draft, where I saw two "long s" Unicode points
U+1E9C and U+1E9D without the real thing U+017F, but including
U+00DF:

<http://www.icann.org/en/topics/new-gtlds/latin-vip-issues-report-07oct11-en.pdf>

That is ominously inconsistent, but the Unicode IDN FAQ at least
answered my questions about U+00DF.  Oddly, I had read RFC 5890 and
RFC 5894 before, and still missed the essential fact that IDNA2008
allows U+00DF.  Of course there was never a chance to get this right
for everybody.

This ICANN draft doesn't use the word "multi-stakeholder" anywhere
and references IDNA2008; some folks on this list might like it.
There are similar ICANN drafts for other scripts, I read only the
Latin paper.

>> I'm not convinced that any "transitional" labels containing
>> various IDNA2008 DISALLOWED Unicode points "go away", why
>> should they, ever?

> And that is a different version of the concern that many of us
> have about the UTS 46 approach.  From our point of view, the
> incompatible changes associated with IDNA2008 are a necessary
> consequence of eliminating properties of IDNA2003 that we
> believe to be serious problems: the difficulties of recovering
> the labels that users entered in native character form from the
> Punycode-encoded ACE forms, permitting problematic punctuation
> and other non-letter/ non-numeric characters, doing more
> checking at lookup time because of the unpredictability of
> properties of newly-assigned code points, discarding characters
> that are required to make differentiations important for some
> scripts (notably ZWJ and ZWNJ), eliminating side-effects of case
> folding that were problematic for some writing systems,
> providing a reasonable level of Unicode version independence,
> and tidying up a lot of details.  The WG could have accepted
> some of those changes and not others, but didn't.  The list
> represents the rough consensus of the WG and the IETF.

I certainly won't miss any "I<love>something" labels, my keyboard
has no <love>-key, and I don't know the Unicode points (plural)
for <love>-dingbats by <heart>.

But I'll hate U+00DF forever, that is just the consequence of too
many years with QWERTY before QWERTZ came around, plus a spelling
reform not matching what I learned in school (BTW, in theory I
like this reform, because it simplified the U+00DF rules, but in
practice I rarely get it right), plus the stupi^H^H^Hrange U+1E9E.

> From that point of view, UTR 46 is "preserve parts of IDNA2003
> forever".  It isn't really a transition strategy because there
> is no real transition model.  It isn't a compatibility strategy
> because, if different implementations make different decisions
> about what mappings to use (perhaps under local pressure to make
> some code points or some IDNA2008 treatments available), then we
> end up with even more confusing incompatibility problems.

ACK, generally.  But UTS 46 has a point wrt U+00DF, and I expect
the same behaviour for äöü vs. ÄÖÜ as for aou vs. AOU in labels.
Apparently IDNA2008 no longer enforces this behaviour.
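
The old mapping behaviour can still be observed with Python's
built-in "idna" codec, which follows IDNA2003 (RFC 3490 with
Nameprep).  The sketch below only shows that side; the last line
uses the bare "punycode" codec to show the distinct A-label that
results when U+00DF is kept and no mapping step is applied.

```python
# IDNA2003 via the standard library: case differences are mapped away,
# so upper-case and lower-case umlauts end up in the same A-label.
upper = "ÄÖÜ".encode("idna")
lower = "äöü".encode("idna")

# IDNA2003 also maps U+00DF (sharp s) to "ss", leaving plain ASCII.
mapped = "straße".encode("idna")

# Without any mapping, U+00DF survives and gets its own A-label.
kept = b"xn--" + "straße".encode("punycode")

print(upper == lower, mapped, kept)
```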

> again, I'm not sure what you are suggesting or asking for.

Nothing in particular.  I used xn--cocacola in an example because
I thought it was a known funny XN-label, and was caught off guard
when Jeff reported that this "works" to some degree with Firefox.
Everything else was just a side effect of my attempts to understand
why some folks don't like IDNA2008.  Meanwhile I got it.

-Frank