Filtering at lookup? (was: Re: Historic scripts as MAYBE?)

Wed Apr 30 14:22:55 CEST 2008

--On Wednesday, 30 April, 2008 15:05 +0900 Martin Duerst
<duerst at it.aoyama.ac.jp> wrote:

> I think I understand most positions in the Historic scripts as
> MAYBE? thread now. But I think what hasn't been fully
> discussed is the assumption that clients should/must filter
> domain names before resolving them, and never resolve domain
> names containing DISALLOWED or UNASSIGNED characters. This is
> quite different from IDNA 2003.
> 
> As far as I understand, John and Patrik in particular are very
> convinced that this is necessary. But I, for one, am absolutely
> not convinced this is necessary, and think that we need to have
> a really serious discussion about it.
> 
> As far as I understand, there were mainly two kinds of
> arguments for why clients must check:
> 1) It makes sure registries/registrars are kept in check, and
>    have a strong incentive to not register disallowed or
>    unassigned stuff.
> 2) It helps deal with some edge cases related to normalization
>    and such.
> 
> I personally think 1) has some validity, but I won't be
> convinced before I see some reports of real examples that
> produced something seriously undesirable.
> 
> As for 2), whenever I read such arguments, I always felt that
> they were so convoluted and exceptional that they didn't really
> make a serious engineering argument.

Martin,

Let me address (2) because it is short.  I will come back to (1)
unless someone else does in the meantime.

While the normalization edge cases are also relevant, the most
important cases involve the joiner characters that were mapped
out in IDNA2003.  Those characters have turned out to be
critical to construction of reasonable mnemonics in a few
languages.  If used in context with scripts where they are not
expected, they are invisible and an invitation to all sorts of
trouble.  To put a positive gloss on things, the IDNA2003
decision was that they weren't important and that the rule sets
that would be required to handle them were not worth the effort.
Subsequent experience and strong input from a few script
communities has convinced us that they are important enough to
justify the contextual rule machinery and checking at lookup
time.
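
As a rough illustration of what a contextual check at lookup time
might look like, here is a minimal Python sketch.  It assumes one
deliberately simplified rule: a joiner (ZWJ or ZWNJ) is accepted
only when the character immediately before it is a virama
(canonical combining class 9).  The function name and the rule
itself are assumptions made for illustration, not the rule set
under discussion, which would need to cover more contexts.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER
VIRAMA_CCC = 9   # canonical combining class assigned to virama characters


def joiners_in_context(label: str) -> bool:
    """Return False if any joiner appears outside the assumed context.

    Simplified, hypothetical rule: a joiner must directly follow a
    character whose canonical combining class is Virama.
    """
    for i, ch in enumerate(label):
        if ch in (ZWNJ, ZWJ):
            if i == 0 or unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
                return False
    return True
```

Under that assumed rule, a Devanagari sequence such as U+0915
U+094D U+200D U+0937 passes, while a joiner dropped invisibly into
a Latin-script label is refused.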

While I think there were errors made in IDNA2003, this isn't one
of them.   There was a tradeoff to be made between the perceived
need for these characters and the risks they caused and
complexity they would have added.  The WG decided to minimize
the risks and complexity.  Because of issues with some scripts,
notably the Indic ones, the need is higher than we judged at the
time and the complexity needed to accommodate them while keeping
the risk level from rising therefore seems worth it.

Whether one needs to apply those rules at lookup time depends on
the degree to which we believe that predictable behavior is
important and on how much we can trust registry behavior.  On
the latter, I've made overly-strong statements in the past and
have been interpreted as saying things that are even more harsh.
But we know that there are many millions of zone administrators
("registries") out there, not just a few hundred.   The care
with which those registries are administered inevitably varies
over a very broad range, from the extremely casual to the quite
rigorous.  If someone is trying to register a string without
understanding the rules or the reasons for them, or has
concluded or been persuaded that their preferences are more
important than the rules (or has evil intent), we need to
assume that they will have little trouble finding somewhere to
register it (whether the place that they find is driven by
carelessness, cost-minimization, or greed is irrelevant).  Our
_only_ way to discourage forum-shopping of that type and to
assure predictable behavior involves standardizing enough
lookup-time checking that such people can, at least, not be
assured that their names will be looked up.
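
To make the idea of lookup-time checking concrete, the following
Python sketch refuses to look up a label that contains an
unassigned code point or one from a caller-supplied disallowed
set.  The function name and the `disallowed` parameter are
hypothetical; a real client would take the DISALLOWED set from
the protocol tables rather than from an argument, and the
"unassigned" test here depends on the Unicode version of the
local `unicodedata` module.

```python
import unicodedata


def lookup_allowed(label: str, disallowed: frozenset = frozenset()) -> bool:
    """Return True only if every character may be looked up.

    Hypothetical filter: rejects unassigned code points (general
    category Cn in the local Unicode database) and anything in the
    caller-supplied disallowed set.
    """
    for ch in label:
        if unicodedata.category(ch) == "Cn":
            return False  # unassigned in this Unicode version
        if ch in disallowed:
            return False  # explicitly disallowed by the caller's table
    return True
```

A client that applies such a filter before resolution simply never
sends the query, so a carelessly registered name gains nothing from
having found a registry willing to accept it.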

     john