[R-C] Packet loss response - but how?

Randell Jesup randell-ietf at jesup.org
Fri May 4 18:33:06 CEST 2012


On 5/4/2012 9:50 AM, Bill Ver Steeg (versteb) wrote:
> The RTP timestamps are certainly our friends.
>
> I am setting up to run some experiments with the various common buffer
> management algorithms to see what conclusions can be drawn from
> inter-packet arrival times. I suspect that the results will vary wildly
> from the RED-like algorithms to the more primitive tail-drop-like
> algorithms. In the case of RED-like algorithms, we will hopefully not
> get too much delay/bloat before the drop event provides a trigger. For
> the tail-drop-like algorithms, we may have to use the increasing
> delay/bloat trend as a trigger.

This would match my experience.  As mentioned, I found that access-link 
congestion loss (especially when not competing with sustained TCP flows, 
which is the normal case for home use when no one is watching 
Netflix...) results in a sawtooth delay pattern with losses at the delay 
drops.  This also happens (with more noise and often a faster ramp) when 
competing, especially against small numbers of flows.  Not really 
unexpected.  Since this sort of drop corresponds to a full buffer, it's 
pretty much a 'red flag' for a realtime flow.
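
A minimal sketch of how that "loss at the delay drop" signature might be 
flagged (the names and thresholds here are illustrative assumptions, not 
my actual code):

    // Flag a loss that coincides with a sharp fall in one-way delay
    // following a ramp -- the tail-drop "full buffer" signature.
    struct DelaySample {
        double owdMs;      // relative one-way delay from RTP timestamps
        bool   lossBefore; // a sequence gap immediately preceded this packet
    };

    bool looksLikeTailDrop(const DelaySample& prev, const DelaySample& cur,
                           double rampMs /* delay gained over recent window */) {
        const double kDropThresholdMs = 20.0;  // assumed: "sharp" fall in delay
        const double kRampThresholdMs = 30.0;  // assumed: buffer had been filling
        bool delayCollapsed = (prev.owdMs - cur.owdMs) > kDropThresholdMs;
        return cur.lossBefore && delayCollapsed && rampMs > kRampThresholdMs;
    }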

RED drops I found to be more useful in avoiding delay (of course).  My 
general mechanism was to reduce the transmission rate (bandwidth 
estimate) by an amount proportional to the drop rate, with 
tail-queue-type drops (the sawtooth case) causing much sharper bandwidth 
reductions.  I simply assumed all drops were in some way related to 
congestion.
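
Something along these lines (a sketch only; the scaling factors and the 
floor are assumptions, not the constants I actually used):

    // Cut the bandwidth estimate in proportion to the observed drop rate,
    // and cut harder when the losses carry the tail-drop signature.
    double adjustedEstimate(double currentBps, double dropRate /* 0..1 */,
                            bool tailDropSignature) {
        const double kRandomFactor   = 0.5;  // assumed scaling for "random" drops
        const double kTailDropFactor = 2.0;  // assumed: sawtooth drops cut harder
        double factor    = tailDropSignature ? kTailDropFactor : kRandomFactor;
        double reduction = currentBps * dropRate * factor;
        double floorBps  = currentBps * 0.5;  // assumed: at most halve per event
        double next      = currentBps - reduction;
        return next > floorBps ? next : floorBps;
    }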

>   As I think about the LEDBAT discussions,
> I am concerned about the interaction between the various algorithms -
> but some data should be informative.

Absolutely.

> We may even be able to differentiate between error-driven loss and
> congestion driven loss, particularly if the noise is on the last hop of
> the network and thus downstream of the congested queue (which is
> typically where the noise occurs). In my tiny brain, you should be able
> to see a gap in the time record corresponding to a packet that was
> dropped due to last-mile noise. A packet dropped in the queue upstream
> of the last mile bottleneck would not have that type of time gap. You do
> need to consider cross traffic in this thought exercise, but statistical
> methods may be able to separate persistent congestion from persistent
> noise-driven loss.

Exactly the mechanism I used to differentiate "fishy" losses from 
"random" ones; "fishy" losses, as mentioned, cause bigger responses.  I 
still dropped bandwidth on "random" drops, which can be congestion drops 
from RED in a core router so long as the router queue isn't too long; 
you'd also see those from "minimal queue" tail-drop routers.  I used a 
jitter buffer for determining losses (and for my filter info) that was 
separate from the normal jitter buffer, which, being adaptive, might not 
hold the data long enough for my purposes.  I actually kept around a 
second of delay/loss data on the video channel, and *if there was no 
loss* or large delay ramp, I only reported stats every second or two.
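
A sketch of that gap test, under the assumption that you can compare the 
spacing implied by the sender-side timestamps against the observed 
receive-time spacing around the hole (the names and the tolerance are 
made up for illustration):

    // If the receive-time spacing around a missing sequence number roughly
    // covers the missing packet's expected transmission spacing, the loss
    // smells like last-mile noise ("random"); if the surviving packets
    // arrived back-to-back, the packet likely died in the bottleneck queue
    // upstream ("fishy").
    enum class LossKind { Random, Fishy };

    LossKind classifyLoss(double expectedGapMs,  // spacing implied by timestamps
                          double arrivalGapMs)   // observed receive-time spacing
    {
        const double kTolerance = 0.5;  // assumed: accept half the expected gap
        return (arrivalGapMs >= expectedGapMs * kTolerance) ? LossKind::Random
                                                            : LossKind::Fishy;
    }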

> TL;DR - We can probably tell that we have queues building prior to the
> actual loss event, particularly when we need to overcome limitations of
> poor buffer management algorithms.

If the queues are large enough, or if the over-bandwidth is low enough, 
yes.  If there's a heavy burst of traffic (think modern browsers 
minimizing page-load time by opening many parallel connections to 
sharded servers), then you may not get a chance; you may go from no 
delay to 200ms of taildrop in an RTT or two - or even between two 20 or 
30ms packets.  And you need to filter enough to decide whether it's 
jitter or delay.  (You can make an argument that 'jitter' is really the 
sum of deltas in path queues, but that doesn't help you much in deciding 
whether to react to it or not.)
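
For the filtering step, something as simple as smoothing plus a trend 
check over a short window can work; this is just an illustrative sketch 
with assumed gains and thresholds, not what I shipped:

    #include <cstddef>
    #include <deque>

    // Smooth the one-way-delay samples and only treat a rise as queue
    // growth (rather than jitter) when the smoothed value keeps climbing
    // across several packets.
    class DelayTrend {
    public:
        // Feed one one-way-delay sample (ms); returns true on a sustained ramp.
        bool update(double owdMs) {
            smoothed_ = (smoothed_ < 0.0) ? owdMs
                                          : 0.9 * smoothed_ + 0.1 * owdMs;
            history_.push_back(smoothed_);
            if (history_.size() > kWindow) history_.pop_front();
            return history_.size() == kWindow &&
                   (history_.back() - history_.front()) > kRampMs;
        }
    private:
        static constexpr std::size_t kWindow = 16;  // assumed: ~16 packets of history
        static constexpr double kRampMs = 15.0;     // assumed: 15 ms rise = real queue
        double smoothed_ = -1.0;
        std::deque<double> history_;
    };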

Data would be useful... :-)

Generally, for realtime media you really want to be undershooting 
slightly most of the time in order to make sure the queues stay at/near 
0.  The more uncertainty you have, the more you want to undershoot.  A 
stable delay signal makes it fairly safe to probe for additional 
bandwidth because you'll get a quick response, and if the probe is a 
"small" step relative to the current bandwidth, then the time to 
recognize the filtered delay signal, inform the other side, and have 
them adapt (roughly filter delay + RTT + encoding delay) bounds how much 
extra queue the probe can build.
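
Rough back-of-the-envelope on that bound, with made-up but plausible 
numbers (the probe step, delays and bottleneck rate below are 
assumptions, not measurements):

    #include <cstdio>

    int main() {
        // Extra queue ~= probe excess rate * reaction time
        double probeExcessKbps = 50.0;            // assumed: "small" probe step
        double reactionMs = 100.0 + 80.0 + 30.0;  // filter + RTT + encode (assumed)
        double extraQueueBits = probeExcessKbps * 1000.0 * (reactionMs / 1000.0);
        // At an assumed 1 Mbps bottleneck, that queue is ~10 ms of delay.
        std::printf("extra queue: %.0f bits (~%.1f ms at 1 Mbps)\n",
                    extraQueueBits, extraQueueBits / 1e6 * 1000.0);
        return 0;
    }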

High jitter can be the result of wireless or cross-traffic, unfortunately.

  Also, especially near startup, my equivalent to slow-start was much 
more aggressive initially to find the safe point, but with each 
overshoot (and drop back below the apparent rate) in the same bandwidth 
range I would reduce the magnitude of the next probes until we had 
pretty much determined the safe rate.  This is most effective at finding 
the channel bandwidth when there is no significant sustained competing 
traffic on the bottleneck link.  If I believed I'd found the channel 
bandwidth, I
would remember that, and be much less likely to probe over that limit, 
though I would do so occasionally to see if there had been a change.  
This allowed for faster recovery from short-duration competing traffic 
(the most common case) without overshooting the channel bandwidth.  Note 
that the more effective your queue detection logic is, the less you need 
that sort of heuristic; it may have been overkill on my part.
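
In pseudocode-ish C++ terms (purely illustrative; the decay factor, the 
step floor and the re-probe interval are assumptions), the heuristic 
looks something like:

    // Start with aggressive probes, shrink the probe step each time we
    // overshoot in the same bandwidth range, and once a ceiling looks
    // established only probe past it occasionally.
    class ProbeScheduler {
    public:
        // Call when a probe at rateBps caused an overshoot (delay/loss response).
        void onOvershoot(double rateBps) {
            ceilingBps_ = rateBps;
            probeStepBps_ *= 0.5;                 // assumed: halve the next probe
            if (probeStepBps_ < kMinStepBps) probeStepBps_ = kMinStepBps;
        }
        // Next rate to try, given the current stable rate and the time since
        // we last probed above the remembered ceiling.
        double nextProbe(double currentBps, double secsSinceCeilingProbe) {
            double target = currentBps + probeStepBps_;
            bool allowAboveCeiling = secsSinceCeilingProbe > kReprobeSecs;
            if (ceilingBps_ > 0.0 && target > ceilingBps_ && !allowAboveCeiling)
                target = ceilingBps_;             // stay under the known channel rate
            return target;
        }
    private:
        static constexpr double kMinStepBps  = 10000.0;   // assumed floor
        static constexpr double kReprobeSecs = 15.0;      // assumed: occasional re-probe
        double probeStepBps_ = 200000.0;                  // assumed aggressive start
        double ceilingBps_   = 0.0;
    };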

-- 
Randell Jesup
randell-ietf at jesup.org
