Filtering at lookup? (was: Re: Historic scripts as MAYBE?)

Wed Apr 30 18:59:46 CEST 2008

Martin's email seemed to focus on DISALLOWED and UNASSIGNED, while
John's response seemed to address CONTEXTJ. Let me try to address
Martin's email by focusing first on DISALLOWED.

U+2044 (FRACTION SLASH) is currently DISALLOWED in the IDNA 2008 :-)
drafts. This character is displayed using the same glyph as slash (/)
in some fonts. It is dangerous in domain names because domain names
are often used in URIs (and IRIs), where the slash is a syntax
character. For example, we might have a malicious domain name/URI such
as:

http://www.paypal.com/secure.cc/login.html

where the slash after paypal.com is actually U+2044. Now, I think we
can agree that it would be undesirable for the user agent to *display*
this domain name in Unicode format, where U+2044 looks like the normal
ASCII slash.
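
To make the confusion concrete, here is a minimal Python sketch. The URL is the
hypothetical one above, and host_of() is an illustrative helper of my own (not
part of any spec or library) that simply takes everything between "://" and the
first ASCII slash:

```python
import unicodedata

FRACTION_SLASH = "\u2044"

# Hypothetical spoofed URL: the separator after "paypal.com" is U+2044, not "/".
spoofed = "http://www.paypal.com" + FRACTION_SLASH + "secure.cc/login.html"


def host_of(url: str) -> str:
    """Return the text between '://' and the first ASCII '/' (the host part)."""
    rest = url.split("://", 1)[1]
    return rest.split("/", 1)[0]


host = host_of(spoofed)

print(unicodedata.name("/"))             # name of the real URI syntax character
print(unicodedata.name(FRACTION_SLASH))  # name of the look-alike
print(host.endswith("secure.cc"))        # the host really ends in secure.cc
```

Since U+2044 is not the URI syntax character, the whole string up to the first
real slash is treated as the host, so the name being resolved sits under
secure.cc rather than paypal.com.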

Where we don't seem to agree is how to avoid *displaying* it. One
way is for the spec to simply disallow the *display* of such
characters.

Another way is for the spec to disallow the *lookup* (resolution) of
such domain names. One could say that this is a somewhat more
conservative or careful approach.
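
As a rough sketch of what the lookup-side check could look like (in Python, and
under loud assumptions): DISALLOWED_SAMPLE is a one-entry stand-in for the real
DISALLOWED table, containing only the U+2044 example from this message, and
UNASSIGNED is approximated by Unicode general category Cn in whatever Unicode
version the Python runtime ships. The function names are hypothetical, and the
real tables in the drafts are far larger.

```python
import unicodedata

# Illustrative stand-in for the DISALLOWED table; only U+2044 is taken
# from this message. The real table is much larger.
DISALLOWED_SAMPLE = {"\u2044"}


def refuse_reason(label: str):
    """Return why a label must not be looked up, or None if it passes."""
    for ch in label:
        if ch in DISALLOWED_SAMPLE:
            return f"DISALLOWED U+{ord(ch):04X}"
        # Assumption: treat general category Cn as UNASSIGNED.
        if unicodedata.category(ch) == "Cn":
            return f"UNASSIGNED U+{ord(ch):04X}"
    return None


def may_look_up(domain: str) -> bool:
    """Check every dot-separated label before handing the name to DNS."""
    return all(refuse_reason(label) is None for label in domain.split("."))


print(may_look_up("www.paypal.com"))
print(may_look_up("www.paypal.com\u2044secure.cc"))
```

With a check like this in the client, the spoofed name is refused before any
DNS query is made, regardless of what a registry was willing to register.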

Of course, there are many DISALLOWED characters that do not seem as
dangerous as U+2044. However, I believe most of us have acknowledged
that it would take far too long if the working group had to make a
decision for *every* character in Unicode. This is why we have divided
Unicode into large swaths, and have tried to rationalize this
approach. Do you disagree?

Erik

On Wed, Apr 30, 2008 at 5:22 AM, John C Klensin <klensin at jck.com> wrote:
>
>  --On Wednesday, 30 April, 2008 15:05 +0900 Martin Duerst
>  <duerst at it.aoyama.ac.jp> wrote:
>
>  > I think I understand most positions in the Historic scripts as
>  > MAYBE? thread now. But I think what hasn't been fully
>  > discussed is the assumption that clients should/must filter
>  > domain names before resolving them, and never resolve domain
>  > names containing DISALLOWED or UNASSIGNED characters. This is
>  > quite different from IDNA 2003.
>  >
>  > As far as I understand, John and Patrik in particular are very
>  > convinced that this is necessary. But I, for one, am absolutely
>  > not convinced this is necessary, and think that we need to have
>  > a really serious discussion about it.
>  >
>  > As far as I understand, there were mainly two kinds of
>  > arguments for why clients must check:
>  > 1) It makes sure registries/registrars are kept in check, and
>  >    have a strong incentive to not register disallowed or
>  >    unassigned stuff.
>  > 2) It helps deal with some edge cases related to normalization
>  >    and such.
>  >
>  > I personally think 1) has some validity, but I won't be
>  > convinced before I see some reports of real examples that
>  > produced something seriously undesirable.
>  >
>  > As for 2), whenever I read such arguments, I always felt that
>  > they were so convoluted and exceptional that they didn't really
>  > make a serious engineering argument.
>
>  Martin,
>
>  Let me address (2) because it is short.  Will come back to (1)
>  unless someone else does in between.
>
>  While the normalization edge cases are also relevant, the most
>  important cases involve the joiner characters that were mapped
>  out in IDNA2003.  Those characters have turned out to be
>  critical to construction of reasonable mnemonics in a few
>  languages.  If used in context with scripts where they are not
>  expected, they are invisible and an invitation to all sorts of
>  trouble.  To put a positive gloss on things, the IDNA2003
>  decision was that they weren't important and that the rule sets
>  that would be required to handle them were not worth the effort.
>  Subsequent experience and strong input from a few script
>  communities has convinced us that they are important enough to
>  justify the contextual rule machinery and checking at lookup
>  time.
>
>  While I think there were errors made in IDNA2003, this isn't one
>  of them.   There was a tradeoff to be made between the perceived
>  need for these characters and the risks they caused and
>  complexity they would have added.  The WG decided to minimize
>  the risks and complexity.  Because of issues with some scripts,
>  notably the Indic ones, the need is higher than we judged at the
>  time and the complexity needed to accommodate them while keeping
>  the risk level from rising therefore seems worth it.
>
>  Whether one needs to apply those rules at lookup time depends on
>  the degree to which we believe that predictable behavior is
>  important and on how much we can trust registry behavior.  On
>  the latter, I've made overly-strong statements in the past and
>  have been interpreted as saying things that are even more harsh.
>  But we know that there are many millions of zone administrators
>  ("registries") out there, not just a few hundred.   The care
>  with which those registries are administered inevitably varies
>  over a very broad range, from the extremely casual to the quite
>  rigorous.  If someone is trying to register a string without
>  understanding the rules or the reasons for them, or has
>  concluded or been persuaded that their preferences are more
>  important than the rules (or has evil intent), we need to
>  assume that they will have little trouble finding somewhere to
>  register it (whether the place that they find is driven by
>  carelessness, cost-minimization, or greed is irrelevant).  Our
>  _only_ way to discourage forum-shopping of that type and to
>  assure predictable behavior involves standardizing enough
>  lookup-time checking that such people can, at least, not be
>  assured that their names will be looked up.
>
>      john