[R-C] Most problems are at the bottleneck: was Re: bufferbloat-induced delay at a non-bottleneck node

Jim Gettys jg at freedesktop.org
Thu Oct 13 16:33:04 CEST 2011


Sorry for the length of this.

Problems are usually at the bottleneck, unless you are suffering from
general network congestion.

The most common bottlenecks turn out to be your broadband link, and also
the 802.11 link between your device and your home network (in a home
environment). Since the bandwidths are roughly comparable, the
bottleneck shifts back and forth between them, and can easily be
different in different directions.

There can be problems elsewhere: peering points and firewall gateways
are also common trouble spots, along with general congestion in networks
that are run without AQM (which is, unfortunately, a significant
fraction of the Internet, and I hypothesise is even more common in
corporate networks).

That TCP congestion avoidance is mostly toast is not good news: we could
see a return of classic congestion collapse, and I have one unconfirmed
report of one significant network having done so.

On 10/12/2011 01:12 AM, Randell Jesup wrote:
> Jim:  We're moving this discussion to the newly-created mailing
> sub-list -
>    Rtp-congestion at alvestrand.no
>    http://www.alvestrand.no/mailman/listinfo/rtp-congestion
>
> If you'd like to continue this discussion (and I'd love you to do so),
> please join the mailing list.  (Patrick, you may want to join too and
> read the very small backlog of messages (perhaps 10 so far)).
>
> On 10/11/2011 4:17 PM, Jim Gettys wrote:
>> On 10/11/2011 03:11 AM, Henrik Lundin wrote:
>>>
>>>
>>> I do not agree with you here. When an over-use is detected, we propose
>>> to measure the /actual/ throughput (over the last 1 second), and set
>>> the target bitrate to beta times this throughput. Since the measured
>>> throughput is a rate that evidently was feasible (at least during that
>>> 1 second), any beta < 1 should assert that the buffers get drained,
>>> but of course at different rates depending on the magnitude of beta.
>> Take a look at the data from the ICSI netalyzr: you'll find scatter
>> plots at:
>>
>> http://gettys.wordpress.com/2010/12/06/whose-house-is-of-glasse-must-not-throw-stones-at-another/
>>
>>
>> Note the different coloured lines.  They represent the amount of
>> buffering measured in the broadband edge in *seconds*.  Also note that
>> for various reasons, the netalyzr data is actually likely
>> underestimating the problem.
>
> Understood.  Though that's not entirely relevant to this problem,
> since the congestion-control mechanisms we're using/designing here are
> primarily buffer-sensing algorithms that attempt to keep the buffers
> in a drained state.  If there's no competing traffic at the
> bottleneck, they're likely to do so fairly well, though more
> simulation and real-world tests are needed.  I'll note that several
> organizations (Google/GIPS, Radvision and my old company WorldGate)
> had found that these types of congestion-control algorithms are quite
> effective in practice.

Except that both transient bufferbloat (e.g. what I described in my
screed against IW10) and even a single long-lived TCP flow, for many
applications (on anything other than Windows XP), will fully saturate
the link.


>
> However, it isn't irrelevant to the problem either:
>
> This class of congestion-control algorithms are subject to "losing" if
> faced with a sustained high-bandwidth TCP flow like some of your
> tests, since they back off when TCP isn't seeing any restriction
> (loss) yet. Eventually TCP will fill the buffers.

Exactly.  In my tests, it took about 10 seconds to fill a 1 second
buffer (which is not atypical of the current amount of bloat).

>
> More importantly, perhaps, bufferbloat combined with the high 'burst'
> nature of browser network systems (and websites) optimizing for
> page-load time means you can get a burst of data at a congestion point
> that isn't normally the bottleneck.

Yup. Ergo the screed, trying to get people to stop before making things
worse.  The irony is that I do understand that, were it not for the fact
that browsers have long since discarded HTTP's two-connection rule, IW10
might be a good idea and help encourage better behaviour.


>
> The basic scenario goes like this:
>
> 1. established UDP flow near bottleneck limit at far-end upstream
> 2. near-end browser (or browser on another machine in the same house)
>    initiates a page-load
> 3. near-end browser opens "many" tcp connections to the site and
>    other sites that serve pieces (ads, images, etc) of the page.
> 4. Rush of response data saturates the downstream link to the
>    near-end, which was not previously the bottleneck.  Due to
>    bufferbloat, this can cause a significant amount of data to be
>    temporarily buffered, delaying competing UDP data significantly
>    (tenths of a second, perhaps >1 second in cases).  This is hard
>    to model accurately; real-world tests are important.

I've seen up to 150ms on 50Mbps cable service.  Other experiments,
particularly controlled ones, are very welcome.

At 10Mbps, that could be over 0.5 seconds.  And it depends on the web
site and whether IW10 has been turned on.
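
To put numbers on that (simple arithmetic; the point is that the same
burst of buffered bytes hurts much more on a slower link):

# Convert an observed queueing delay at one link rate into the delay the
# same amount of buffered data would cause at another rate.
def queued_bytes(delay_s, rate_bps):
    """Bytes sitting in the queue to produce delay_s at rate_bps."""
    return delay_s * rate_bps / 8

def delay_at(bytes_queued, rate_bps):
    """Queueing delay those bytes cause at a different link rate."""
    return bytes_queued * 8 / rate_bps

burst = queued_bytes(0.150, 50e6)   # 150 ms at 50 Mbit/s ~= 937,500 bytes
print(delay_at(burst, 10e6))        # ~0.75 s at 10 Mbit/s: "over 0.5 seconds"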

> 5. Congestion-control algorithm notices transition to buffer-
>    induced delay, and tells the far side to back off.  The latency
>    of this decision may help us avoid over-reacting, as we have to
>    see increasing delay which takes a number of packets (at least
>    1/10 second, and easily could be more).  Also, the result of
>    the above "inrush"/pageload-induced latency may not trigger the
>    congestion mechanisms we discuss here, as we might see a BIG jump
>    in delay followed by steady delay or a ramp down (since if the
>    buffer has suddenly jumped from drained to full, all it can do is
>    be stable or drain).
>
> Note that Google's current algorithm (which you comment on above) uses
> recent history for choosing the reduction; in this case it's hard to
> say what the result would be: if it invokes the backoff at the start
> of the pageload, then the bandwidth received recently is the current
> bandwidth, so the new bandwidth is current minus small_delta.  If it
> happens after data has queued behind the burst of TCP traffic, then
> when the backoff is generated we'll have gotten almost no data through
> "recently" and we may back off all the way to min bandwidth; an
> over-reaction, depending on the time constant and level of how fast
> that burst can fill the downstream buffers.
>
> Now, in practice this is likely messier and the pageload doesn't
> generate a huge sudden block of data that fills the buffers, so
> there's some upward slope to delay as you head to saturation of the
> downstream buffers.  And there's very little you can do about this -
> and backing off a lot may help in that the less data you put onto the
> end of this overloaded queue (assuming the pageload flow has ended or
> soon will), the sooner the queue will drain and low-latency will be
> re-established.

and your poor jitter buffers are *really* unhappy.
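
For concreteness, here is a minimal sketch (in Python, purely
illustrative; not GIPS's or anyone's actual code) of the receive-side
backoff rule being discussed: on an over-use signal, measure the
throughput actually received over the last second and set the target to
beta times that.  The window, beta and floor values are assumptions.

import time
from collections import deque

class DelayBasedRateController:
    """Toy sketch of the backoff rule discussed above (illustrative only)."""

    def __init__(self, beta=0.9, window_s=1.0, min_rate_bps=64000):
        self.beta = beta                  # back-off factor, beta < 1
        self.window_s = window_s          # throughput measurement window
        self.min_rate_bps = min_rate_bps  # don't back off below this floor
        self.arrivals = deque()           # (arrival_time, payload_bytes)
        self.target_rate_bps = None

    def on_packet(self, payload_bytes, now=None):
        now = time.time() if now is None else now
        self.arrivals.append((now, payload_bytes))
        # Keep only the last window_s seconds of arrivals.
        while self.arrivals and now - self.arrivals[0][0] > self.window_s:
            self.arrivals.popleft()

    def measured_throughput_bps(self):
        total_bytes = sum(size for _, size in self.arrivals)
        return total_bytes * 8 / self.window_s

    def on_overuse_detected(self):
        """Called when rising one-way delay signals a filling queue."""
        rate = self.beta * self.measured_throughput_bps()
        self.target_rate_bps = max(rate, self.min_rate_bps)
        return self.target_rate_bps

The failure mode Randell describes falls straight out of this: if the
over-use signal arrives only after a competing burst has starved the
flow, measured_throughput_bps() is near zero and the new target
collapses toward the floor.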

>
> Does the ICSI data call out *where* the buffer-bloat occurs?

The data is primarily about the broadband connection.

The reason we can say that confidently is that the structure present in
the plot falls on powers-of-two increments.

Buffering in the home router, or in your host operating system, is
counted in packets (typically 1500 bytes each), and would not show that
structure.
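
To see why byte-counted buffers show up as structure: a buffer of a
given power-of-two size adds latency of (size / rate), so each size
traces out its own curve as the uplink rate varies, and those curves are
the bands in the scatter plots.  A quick illustration (the sizes and
rates below are just examples):

# Latency contributed by a full, byte-sized buffer at various uplink
# rates.  Each power-of-two buffer size yields its own curve in a
# latency-vs-rate plot, i.e. the banding visible in the Netalyzr data.
for buf_kb in (64, 128, 256, 512):            # example power-of-two sizes
    for rate_mbps in (0.5, 1, 2, 4, 8):
        delay = buf_kb * 1024 * 8 / (rate_mbps * 1e6)
        print(f"{buf_kb:4d} KB at {rate_mbps:4.1f} Mbit/s -> {delay:5.2f} s")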

Also note that your home router and host are often *much worse* than
the broadband connection.

On current hardware, there are typically 200-300 packets of buffering
just in the ring buffers of the network device.  On top of that, there
may be another 1000 packets of buffering (e.g. the transmit queue in
Linux).
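
As a rough illustration of how much delay that host-side buffering alone
can represent (the packet size and effective rate below are examples; on
Linux the transmit queue length is visible in
/sys/class/net/<iface>/tx_queue_len, while ring sizes need ethtool):

# Worst-case delay from host buffering: ring buffer plus transmit queue,
# all full of 1500-byte packets, draining at the link's effective rate.
def host_buffer_delay_s(ring_pkts, txqueue_pkts, pkt_bytes, rate_bps):
    return (ring_pkts + txqueue_pkts) * pkt_bytes * 8 / rate_bps

# e.g. a 256-packet ring plus a 1000-packet txqueue at an effective
# 10 Mbit/s (a busy 802.11 link): about 1.5 seconds of potential delay.
print(host_buffer_delay_s(256, 1000, 1500, 10e6))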

And there can be bloat elsewhere: peering disputes seem to be causing
it, as can firewall relays.

RED is not being deployed where it should be, and you can have
application-level bufferbloat at firewall relays.
http://gettys.wordpress.com/2010/12/17/red-in-a-different-light/

>
>> Then realise that when congested, nothing you do can react faster than
>> the RTT including the buffering.
>>
>> So if your congestion is in the broadband edge (where it often/usually
>> is), you are in a world of hurt, and you can't use any algorithm that
>> has fixed time constants, even one as long as 1 second.
>>
>> Wish this weren't so, but it is.
>>
>> Bufferbloat is a disaster...
>
> Given the loss-based algorithms for TCP/etc, yes.  We have to figure
> out how to (as reliably *as possible*) deliver low-latency data in
> this environment.

Personal Opinion
-----------------------

Well, here's my honest opinion, formed over the last 15 months.

We can't really make our jitter buffers big enough to give decent
audio/video when bufferbloat is present, unless you like talking to
someone half way to the moon (or further).  Netalyzr shows the problem
in broadband, but our OSes and home routers are often even worse.

Even one TCP connection (moving big data) can induce severe latency on
a large fraction of the existing broadband infrastructure; as Windows XP
retires and more and more bulk-data applications (e.g. backup) are
deployed, I believe we'll hurt more and more.

It's impossible to make any servo system react faster than the RTT; and
bufferbloat sometimes causes that RTT to go insane.

We can't forklift upgrade all the TCP implementations, which will
compete with low-latency audio/video.  So "fixing" TCP isn't going to
happen fast enough to be useful.  That doesn't mean it shouldn't happen,
just that it's a 5-15 year project to do so.

We do have to do congestion avoidance, at least as well as TCP would do
it (if bufferbloat weren't endemic).

Delay based congestion avoidance algorithms are likely to lose relative
to loss based ones, as far as I understand.  So that means that the same
issue applies as "fixing" TCP.

So the conclusion I came to this time last year was that bufferbloat
was a disaster for the immersive teleconferencing I'm supposed to be
working on, and I switched to working solely on bufferbloat and getting
it fixed, because to make any of this work well (which means not
generating service calls), we have to fix it.

Timestamps are *really* useful for detecting bufferbloat, and detecting
bufferbloat suffering is key to getting people aware of it and motivated
to fix it.  I'd really like to be able to tell people what's going on in
a reliable way, to motivate them to at least fix the gear under their
control and/or put pressure on those they pay to provide service.
Identifying where the bottleneck at fault is located is key to this.
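
A minimal sketch of what the timestamps buy you (the class and names
here are illustrative, not from any RTP stack, and clock skew is
ignored): take the minimum observed one-way delay as the baseline, and
report anything above it as queueing delay.

class QueueDelayEstimator:
    """Estimate queueing delay from per-packet send/receive timestamps.

    The minimum observed (receive - send) difference absorbs the unknown
    clock offset and the base propagation delay; anything above that
    minimum is (mostly) time spent sitting in buffers.  Clock drift is
    ignored here for brevity.
    """

    def __init__(self):
        self.base_delta = None

    def on_packet(self, send_ts, recv_ts):
        delta = recv_ts - send_ts
        if self.base_delta is None or delta < self.base_delta:
            self.base_delta = delta
        return delta - self.base_delta   # estimated queueing delay, seconds

est = QueueDelayEstimator()
for send_ts, recv_ts in [(0.00, 0.83), (0.02, 0.85), (0.04, 0.95)]:
    print("queueing delay ~ %.2f s" % est.on_packet(send_ts, recv_ts))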

So we have to provide "back pressure" into the economic system to get
people to fix the network.  But trying to engineer around this entirely
is, I believe, futile and counter-productive: we have to fix the
Internet.  Fixing the broadband edge will cost on the order of
$100/subscriber: this isn't an insane price to pay, as even one or two
service calls cost more than that.

Does this mean we're doomed?
-----------------------------------------

I hope not. I think there is going to have to be a multi-prong attack on
the problem.

My sense is that the worst problems are in the home and on wireless
networks.  As I can't work on wireless networks other than 802.11, I've
focussed there.  But in the home, courtesy of Linux being commonly used
in home routers, we have the ability to do a whole lot.

In the short/immediate term, mitigations are possible.  My home network
now works tremendously better than it did a year ago, and yours can
immediately too, even with many existing home routers.  But doing so is
probably beyond non-network wizards today.

The CeroWrt build of OpenWrt is a place where we're working on both
mitigations and solutions for bufferbloat in home routers (along with
other things that have really annoyed us about what we can buy
commercially).  See: http://www.bufferbloat.net/news/19 . Please come
help out.  The immediate mitigations include just tuning the router's
buffering to something more sensible.

Over the next several months, we hope to start testing AQM algorithms.

Note that even the traditional 100ms "rule of thumb" buffer sizing
http://gettys.wordpress.com/2011/07/06/rant-warning-there-is-no-single-right-answer-for-buffering-ever/
is still too high; we really need AQM in our home routers, our broadband
gear, and our operating systems.  The long-standing telephony standard
for "good enough" latency is 150ms, and to leave the gate having already
lost 100ms isn't good; if both ends are congested, you are at 200ms plus
the delay in the network.
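
Spelled out as arithmetic (the network transit figure below is just an
example; the 150ms and 100ms numbers are the ones above):

# Mouth-to-ear budget vs. buffering losses at each congested end.
budget_ms = 150          # long-standing "good enough" telephony target
per_end_buffer_ms = 100  # traditional rule-of-thumb buffer, per congested end
network_ms = 50          # example transit delay (assumption)

total_ms = 2 * per_end_buffer_ms + network_ms
print("%d ms total vs. %d ms budget: over by %d ms"
      % (total_ms, budget_ms, total_ms - budget_ms))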

Now, if you are willing to bandwidth shape your broadband service
strongly, you can already do much better than 100ms today.  That
requires you to tune your home router (if it is capable).  I'll be
posting a more "how to" entry in the blog sometime soon; but network
geeks should be able to hack what I wrote before at:
http://gettys.wordpress.com/2010/12/13/mitigations-and-solutions-of-bufferbloat-in-home-routers-and-operating-systems/ 
Your other strategy, as I'll outline in my how-to-ish document, is to
ensure your wireless bandwidth is *always* higher than your broadband
bandwidth, ensuring the bottleneck is at a point in the network you can
control.  This still doesn't solve downstream transient bufferbloat at
bad web sites, but I think it will stop upstream/downstream elephant
flows from killing you.
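
As a sketch of the shaping arithmetic (the 90% figure is a common rule
of thumb for keeping the queue in gear you control, not a number from
this post, and the link speeds are examples):

# Shape a bit below the measured broadband rates so the queue builds in
# the home router (which we can manage) instead of in the modem/CMTS.
def shaping_targets(down_bps, up_bps, fraction=0.90):
    return down_bps * fraction, up_bps * fraction

measured_down, measured_up = 20e6, 2e6        # example measured link rates
shape_down, shape_up = shaping_targets(measured_down, measured_up)
print("shape downstream to %.1f Mbit/s, upstream to %.2f Mbit/s"
      % (shape_down / 1e6, shape_up / 1e6))

# And keep the wireless hop faster than the broadband link, so the
# bottleneck stays at a point you control:
wireless_bps = 30e6                           # example effective 802.11 rate
assert wireless_bps > measured_down, "wireless is the (unmanaged) bottleneck"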

Another form of mitigation is getting the broadband buffering back under
control.  That will get us back to the vicinity of the traditional "rule
of thumb" (see
http://gettys.wordpress.com/2011/07/13/progress-on-the-cable-front/ ),
which is a lot better than where we are now.

Since I wrote that, I've confirmed (most recently last week) that the
cable modem and CMTS changes are well under way; it appears that
deployment will start sometime mid/late 2012.  You may need to buy a new
cable modem when the time comes (though ones with the upgrade will
probably start shipping this year).  I have no clue if older existing
cable modems will ever see firmware upgrades, though I predict DOCSIS 2
modems almost certainly will not.  I am hopeful that the cable
industry's motion will force mitigation into DSL and fiber, at least
eventually.  But this just gets us back to the 100ms range (maybe
worse, given PowerBoost).

Obviously, if your network operator doesn't run AQM, then they should
and you should help educate them.

Solutions
======
Solutions come in a number of forms.

We need AQM that works and is self-tuning.  And we need it even in our
operating systems.  The challenge here is that classic RED (1993) and
similar algorithms won't work in the face of highly variable bandwidth.
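
For reference, here is a stripped-down sketch of classic RED in Python
(illustrative only, omitting the count/idle-time details).  Note that
min_th and max_th are fixed queue lengths: with highly variable
bandwidth the same queue length can mean 5ms or 5s of delay, which is
why fixed thresholds can't self-tune.

import random

class ClassicRED:
    """Very stripped-down RED (1993): drop probability grows with the
    EWMA of the queue length between two *fixed* thresholds."""

    def __init__(self, min_th=5, max_th=15, max_p=0.1, w_q=0.002):
        self.min_th, self.max_th = min_th, max_th   # thresholds in packets
        self.max_p, self.w_q = max_p, w_q
        self.avg = 0.0

    def should_drop(self, queue_len):
        # Exponentially weighted moving average of the instantaneous queue.
        self.avg = (1 - self.w_q) * self.avg + self.w_q * queue_len
        if self.avg < self.min_th:
            return False
        if self.avg >= self.max_th:
            return True
        p = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
        return random.random() < p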

Traffic classification can at best move who suffers when, but doesn't
fix the problem.  I still want it, however.  Doing it for real in the
broadband edge will be "interesting", as to who classifies traffic and
how; today, you typically only get one queue anyway (though the
technologies will often support multiple queues).

Classification would also be really nice: but today, most broadband
systems have exactly one queue that you have access to.  Carriers' VoIP
is generally provisioned separately; they have an (unintended, I
believe) fundamental advantage right now.  It turns out that (part of)
the gaming industry has discovered diffserv, having noticed that Linux's
PFIFO-FAST queue discipline implements it.  So you can get some help by
marking traffic.  Andrew McGregor had the interesting idea that maybe
the broadband headends could observe how traffic is being marked and
classify similarly in the downstream direction.
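
Marking from an application is easy enough; whether anything along the
path honours the mark is another matter.  A hedged example (the address
and port are placeholders, and which pfifo_fast band the mark lands in
depends on the qdisc's priomap):

import socket

# Mark a UDP socket's traffic with DSCP EF (46); the TOS byte is DSCP << 2.
EF = 46
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, EF << 2)

# Whether this buys anything depends on what each hop does with the
# mark: Linux's default pfifo_fast qdisc looks at the TOS bits, but
# broadband gear and ISPs may ignore or remark it.
sock.sendto(b"probe", ("192.0.2.1", 5004))   # example address (TEST-NET-1)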

Even with only one queue, at least we can control what happens in the
upstream direction (at least if we can keep the buffers from filling in
the broadband gear).  In the short term, bandwidth shaping is our best
tool, and I'm working on other ideas as well.

Getting all the queues lined up is still going to take some effort,
between diffserv marking, 802.11 queues, ethernet queues, etc...

I also believe that we need the congestion exposure stuff going on in
the IETF in the long term, to provide disincentives for abuse of the
network, as well as proper accounting of congestion.

What should this group do?
================

I have not seen a way to really engineer around bufferbloat at the
application layer, nor even in the network stack. It's why I'm working
on bufferbloat rather than teleconferencing, which I was hired to work
on; if we don't fix that, we can't really succeed properly on the
teleconferencing front.

I believe therefore:
    o work on the real-time applications problem should not stop in the
meanwhile; it is the compelling set of applications to motivate fixing
the Internet.
    o exposing the bloat problem so that blame can be apportioned is
*really* important.  Timestamps in RTP would help greatly in doing so.
Modern TCPs (may) have the TCP timestamp option turned on (I know modern
Linux systems do), so I don't know of anything needed there beyond
ensuring the TCP information is made available somehow, if it isn't
already.  Being able to reliably tell people "the network is broken; you
need to fix your OS/your router/your broadband gear" is productive, and
to deploy IPv6 we're looking at deploying new home kit anyway.
    o designing good congestion avoidance that will work in an
unbroken, unbloated network is clearly needed.  But I don't think heroic
engineering around bufferbloat is worthwhile right now for RTP; that
effort is better put into the solutions outlined above, I think.  Trying
to do so when we've already lost the war (teleconferencing isn't
interesting when talking half way to the moon) is not productive, and
getting stable servo systems to work not just at the 100ms level but at
the multi-second level, when the multi-second level isn't even usable by
the application, is a waste.  RTP == Real-Time Transport Protocol; when
the network is no longer real time, that is an oxymoron.
    o worrying about how to get diffserv actually usable (so that we
can classify at the broadband head end) seems worthwhile to me.  I'd
like to get the web mice (transient bufferbloat) to not interfere with
audio/video traffic.  I like Andrew McGregor's idea, but don't know if
it will hold water.  That we can expect diffserv to sort of work in the
upstream direction already is good news; but we also need downstream to
work.
    o come help on the home router problem; if you want teleconferencing
to really work well, it needs lots of TLC.  And we have the ability to
not just write specs, but to demonstrate working code here.

                    - Jim