Filtering at lookup? (was: Re: Historic scripts as MAYBE?)
Erik van der Poel
erikv at google.com
Wed Apr 30 18:59:46 CEST 2008
Martin's email seemed to focus on DISALLOWED and UNASSIGNED, while
John's response seemed to address CONTEXTJ. Let me try to address
Martin's email by focusing first on DISALLOWED.

U+2044 (FRACTION SLASH) is currently DISALLOWED in the IDNA 2008 :-)
drafts. This character is displayed using the same glyph as slash (/)
in some fonts. It is dangerous in domain names because domain names
are often used in URIs (and IRIs), where the slash is a syntax
character. For example, we might have a malicious domain name/URI such
as:

http://www.paypal.com/secure.cc/login.html

where the slash after paypal.com is actually U+2044. Now, I think we
can agree that it would be undesirable for the user agent to *display*
this domain name in Unicode format, where U+2044 looks like the normal
ASCII slash.

Where we don't seem to agree is how to avoid *displaying* that. One
way is for the spec to simply disallow the *display* of such
characters.

Another way is for the spec to disallow the *lookup* (resolution) of
such domain names. One could say that this is a somewhat more
conservative or careful approach.
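
To make the lookup-side option concrete, here is a minimal Python sketch
of a client that refuses to resolve a name containing a flagged
character. The set ILLUSTRATIVE_DISALLOWED and the function names are
hypothetical and exist only for this illustration; the real DISALLOWED
category is derived from Unicode properties in the drafts and is far
larger than the single code point used here.

```python
import unicodedata

# Hypothetical, deliberately tiny stand-in for the DISALLOWED category.
# Only U+2044 comes from the discussion above; the real category is
# defined by the IDNA 2008 drafts, not by this set.
ILLUSTRATIVE_DISALLOWED = {0x2044}  # FRACTION SLASH


def disallowed_code_points(domain):
    """Return (code point, Unicode name) pairs for flagged characters."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in domain
        if ord(ch) in ILLUSTRATIVE_DISALLOWED
    ]


def may_look_up(domain):
    """Lookup-time filter: refuse to resolve if any character is flagged."""
    return not disallowed_code_points(domain)


# In the deceptive URI the first "slash" is U+2044, so the host part runs
# all the way to the first real ASCII slash.
spoofed_host = "www.paypal.com\u2044secure.cc"
honest_host = "www.paypal.com"

print(may_look_up(honest_host))
print(may_look_up(spoofed_host), disallowed_code_points(spoofed_host))
```

A display-side filter would run the same check but fall back to showing
the name in its ASCII form instead of refusing to resolve it.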

Of course, there are many DISALLOWED characters that do not seem as
dangerous as U+2044. However, I believe most of us have acknowledged
that it would take far too long if the working group had to make a
decision for *every* character in Unicode. This is why we have divided
Unicode into large swaths, and have tried to rationalize this
approach. Do you disagree?

Erik

On Wed, Apr 30, 2008 at 5:22 AM, John C Klensin <klensin at jck.com> wrote:
> --On Wednesday, 30 April, 2008 15:05 +0900 Martin Duerst
> <duerst at it.aoyama.ac.jp> wrote:
>
> > I think I understand most positions in the Historic scripts as
> > MAYBE? thread now. But I think what hasn't been fully
> > discussed is the assumption that clients should/must filter
> > domain names before resolving them, and never resolve domain
> > names containing DISALLOWED or UNASSIGNED characters. This is
> > quite different from IDNA 2003.
> >
> > As far as I understand, John and Patrik in particular are very
> > convinced that this is necessary. But I, for one, am absolutely
> > not convinced this is necessary, and think that we need to have
> > a really serious discussion about it.
> >
> > As far as I understand, there were mainly two kinds of
> > arguments for why clients must check:
> > 1) It makes sure registries/registrars are kept in check, and
> > have a strong incentive to not register disallowed or
> > unassigned stuff.
> > 2) It helps deal with some edge cases related to normalization
> > and such.
> >
> > I personally think 1) has some validity, but I won't be
> > convinced before I see some reports of real examples that
> > produced something seriously undesirable.
> >
> > As for 2), whenever I read such arguments, I always felt that
> > they were so convoluted and exceptional that they didn't really
> > make a serious engineering argument.
>
> Martin,
>
> Let me address (2) because it is short. Will come back to (1)
> unless someone else does in between.
>
> While the edge normalization cases are also relevant, the most
> important cases involve the joiner characters that were mapped
> out in IDNA2003. Those characters have turned out to be
> critical to construction of reasonable mnemonics in a few
> languages. If used in context with scripts where they are not
> expected, they are invisible and an invitation to all sorts of
> trouble. To put a positive gloss on things, the IDNA2003
> decision was that they weren't important and that the rule sets
> that would be required to handle them were not worth the effort.
> Subsequent experience and strong input from a few script
> communities has convinced us that they are important enough to
> justify the contextual rule machinery and checking at lookup
> time.
>
> While I think there were errors made in IDNA2003, this isn't one
> of them. There was a tradeoff to be made between the perceived
> need for these characters and the risks they caused and
> complexity they would have added. The WG decided to minimize
> the risks and complexity. Because of issues with some scripts,
> notably the Indic ones, the need is higher than we judged at the
> time and the complexity needed to accommodate them while keeping
> the risk level from rising therefore seems worth it.
>
> Whether one needs to apply those rules at lookup time depends on
> the degree to which we believe that predictable behavior is
> important and on how much we can trust registry behavior. On
> the latter, I've made overly-strong statements in the past and
> have been interpreted as saying things that are even more harsh.
> But we know that there are many millions of zone administrators
> ("registries") out there, not just a few hundred. The care
> with which those registries are administered inevitably varies
> over a very broad range, from the extremely casual to the quite
> rigorous. If someone is trying to register a string without
> understanding the rules or the reasons for them, or has
> concluded or been persuaded that their preferences are more
> important than the rules (or has evil intent), we need to
> assume that they will have little trouble finding somewhere to
> register it (whether the place that they find is driven by
> carelessness, cost-minimization, or greed is irrelevant). Our
> _only_ way to discourage forum-shopping of that type and to
> assure predictable behavior involves standardizing enough
> lookup-time checking that such people can, at least, not be
> assured that their names will be looked up.
>
> john