IDN processing-related security considerations for draft-ietf-websec-strict-transport-sec

Mon Oct 10 19:33:15 CEST 2011

--On Sunday, October 09, 2011 19:24 +0200 Frank Ellermann
<hmdmhdfmhdjmzdtjmzdtzktdkztdjz at gmail.com> wrote:

> Update:
>...
> U+0BFE and U+0BFF are unassigned Unicode points in the Tamil
> block; at the moment xn--cocacola is a "fake A-label".  Sadly
> XN-labels do not tell me if mixing Tamil and Telugu will be
> always utter dubious.

Frank,

I'm not sure what you are asking about... or for.  An unassigned
code point can, in principle, always be assigned in some future
version of Unicode.  I suppose one could make predictions about
likelihood on a script by script or block by block basis, but
they would be predictions, not firm promises.
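
To make the version dependence concrete, here is a minimal sketch, assuming
Python's standard "punycode" codec and the Unicode database bundled with
whichever Python runs it.  The function name is hypothetical.  It decodes the
part of a label after the prefix and reports each code point's general
category, where "Cn" means "unassigned in this version of Unicode".  It does
no IDNA validation at all.

```python
import unicodedata

ACE_PREFIX = "xn--"

def code_point_categories(label):
    """Decode an ACE-prefixed label; return (code point, category) pairs.

    Category "Cn" means unassigned in the Unicode version bundled with
    this Python. A later Unicode version may assign the same code point.
    """
    if not label.lower().startswith(ACE_PREFIX):
        raise ValueError("label does not carry the ACE prefix")
    body = label[len(ACE_PREFIX):]
    decoded = body.encode("ascii").decode("punycode")
    return [(ord(ch), unicodedata.category(ch)) for ch in decoded]

if __name__ == "__main__":
    print("Unicode version:", unicodedata.unidata_version)
    for cp, cat in code_point_categories("xn--cocacola"):
        print(f"U+{cp:04X} {cat}")
```

Which code points come back as "Cn" depends on the Unicode version printed on
the first line, which is the point: the answer is a snapshot, not a promise.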

If the concern is confusion between A-labels (or fake A-labels)
and ASCII strings that don't contain the prefix, that is
inherent in this type of encoding.  We will need to hope that we
can avoid most such concerns by careful UI design and repel most
of the others by careful explanation.

So what issue do you see and what do you think should be done
about it?

> Different ??-- introducers identifying selected subsets of
> relevant scripts could be an idea.

Yes.  And, as has been discussed many times before, one could use
different introducers (or one introducer and a language tag) to
identify labels by language.   However, such a strategy would
not change the exact match behavior of DNS servers, so the user
would need to know exactly what language was in use and how
(e.g., to what level of precision) it was coded in order to
successfully look something up.   I can think of a whole
collection of reasons why that is impractical.   Using different
prefixes (introducers) to identify different script subsets
would have the same problems or worse because, again, the user
would need to be able to identify the intentions of the
registrant in order to look up a string.  Perhaps YMMV.
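
A toy sketch of the lookup problem, under loud assumptions: the introducers
"xt--" and "xu--", the zone contents, and the function names below are all
invented for illustration and exist in no protocol.  The only real property
being modeled is that the DNS matches labels exactly (apart from ASCII case).

```python
# Hypothetical zone: the registrant chose the invented "xt--" introducer.
ZONE = {"xt--example": "192.0.2.1"}

# Invented set of per-script introducers a client might have to try.
INTRODUCERS = ["xn--", "xt--", "xu--"]

def lookup(label):
    """Exact-match lookup, ignoring ASCII case only."""
    return ZONE.get(label.lower())

def lookup_by_guessing(encoded_body):
    """Try every introducer; return (queries sent, answer or None)."""
    for attempts, prefix in enumerate(INTRODUCERS, start=1):
        answer = lookup(prefix + encoded_body)
        if answer is not None:
            return attempts, answer
    return len(INTRODUCERS), None

if __name__ == "__main__":
    print(lookup("xn--example"))          # wrong introducer: no match
    print(lookup_by_guessing("example"))  # found only by trying each prefix
```

A client that does not already know the registrant's choice is reduced to
guessing, and each extra introducer adds another query per label.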

>   In other words, meanwhile
> I found UTS 46 and its IDN FAQ.  This was a brave attempt to
> rescue IDNA2008, but I'm not convinced that any "transitional"
> labels containing various IDNA2008 DISALLOWED Unicode points
> "go away", why should they, ever?

And that is a different version of the concern that many of us
have about the UTS 46 approach.  From our point of view, the
incompatible changes associated with IDNA2008 are a necessary
consequence of eliminating properties of IDNA2003 that we
believe to be serious problems: the difficulties of recovering
the labels that users entered in native character form from the
Punycode-encoded ACE forms, permitting problematic punctuation
and other non-letter/non-numeric characters, doing more
checking at lookup time because of the unpredictability of
properties of newly-assigned code points, discarding characters
that are required to make differentiations important for some
scripts (notably ZWJ and ZWNJ), eliminating side-effects of case
folding that were problematic for some writing systems,
providing a reasonable level of Unicode version independence,
and tidying up a lot of details.  The WG could have accepted
some of those changes and not others, but didn't.  The list
represents the rough consensus of the WG and the IETF.

From that point of view, UTS 46 is "preserve parts of IDNA2003
forever".  It isn't really a transition strategy because there
is no real transition model.  It isn't a compatibility strategy
because, if different implementations make different decisions
about what mappings to use (perhaps under local pressure to make
some code points or some IDNA2008 treatments available), then we
end up with even more confusing incompatibility problems.
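
One way to see the incompatibility is to compare two treatments of the same
input.  This is a rough sketch, not an IDNA2008 implementation: it relies on
Python's built-in "idna" codec, which implements IDNA2003 with its mappings,
and on plain Punycode with the prefix added and no mapping or validation.
The sample label "straße" and the function names are my own choices for
illustration.

```python
def idna2003_lookup_form(label):
    """IDNA2003 ToASCII via Python's built-in codec (Nameprep mapping applied)."""
    return label.encode("idna")

def unmapped_ace_form(label):
    """Punycode plus prefix with no mapping at all.

    Only an approximation: real IDNA2008 also validates every code point
    and applies contextual rules, none of which happens here.
    """
    if label.isascii():
        return label.encode("ascii")
    return b"xn--" + label.encode("punycode")

if __name__ == "__main__":
    label = "straße"
    print(idna2003_lookup_form(label))  # mapping folds the sharp s to "ss"
    print(unmapped_ace_form(label))     # sharp s kept and encoded
```

The two functions produce different query names for one user-typed string, so
which name gets looked up depends on which rules an implementation applies.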

Speaking for myself, I care a lot about backward compatibility
problems.  Had they dominated for me in the discussions leading
to IDNA2008, I would have argued for grandfathering existing
registrations even though doing so and preserving the other
advantages of IDNA2008 would have created a vastly more complex
protocol and implementations.   But I'm looking at the curve of
IDN adoption and deployment into actual use and the likely
changes in the shape of that curve as IDN TLDs become widely
available and supported.   And that leads me to the conclusion
that long-term smooth IDN operation and usability justify our
saying to those comparatively few existing labels and uses
"sorry, you were an early adopter and early adopters sometimes
get burned as things evolve" rather than incur that complexity,
long-term unpredictability, and limitations of trying to
preserve strict compatibility with IDNA2003.  That there are
comparatively few problematic labels in use as compared to the
number of IDNs we expect to see in active use a decade from now
strongly influenced that conclusion.

YMMV.  Others certainly differ.   But, again, I'm not sure what
you are suggesting or asking for.