[R-C] RRTCC issues: loss, decrease rate

Randell Jesup randell-ietf at jesup.org
Mon Aug 6 18:10:15 CEST 2012


This is the first of a few messages on the details of RRTCC I expect to 
post, based on analysis (not testing) of the algorithm, and on my 8 years 
of experience with an algorithm I designed that had very similar theory 
and underpinnings.  That algorithm has been used in hundreds of thousands 
of devices, mostly for peer-to-peer calls in residential settings.

Please feel free to critique!  I make no assertion that this analysis is 
guaranteed correct (in fact I'm sure several ways in which it's wrong 
will be pointed out), but I think it will be a helpful starting point.  
I also realize there are some simplifications assumed below; I've tried 
to note them.


The first focus is on loss.  I'll primarily focus on the impact of 
tail-drop policies for now.

Loss affects a stream in a number of ways:

1) Directly - loss of a packet in the RTP stream
2) Indirectly - loss of a packet for another traffic stream, for this or 
another destination
3) "Random" loss (non-congestion)

Note: not all channels experience "random" loss, and some apparently 
"random" loss is really momentary congestion of a core router or some 
type of AQM.  Since in this case we're focusing on tail-drop routers at 
the bottleneck, we'll assume this category can be modeled as 
non-congestive random losses.

Obviously, the more inputs and knowledge of the streams we have 
(especially of traffic between the two endpoints, via any port, not just 
the 5-tuple), the better the model can perform.  A confounding issue to 
be discussed later is how differential packet marking affects congestion 
and bandwidth models.

Since RRTCC works from inter-packet delays compared to sending times 
(taken either from RTP timestamps or from header extensions carrying 
closer-to-the-stack sending times), let's look at how these types of 
loss affect the signals seen by the receiver and the Kalman filter.

1) Direct loss
-------------------
In this case, we'll see the loss directly.  This is common in 
access-link self-congestion, where our stream(s) are the only 
significant users of bandwidth (or there are very few users) and we've 
exceeded the capacity of the physical link.

The inter-packet delay was likely increasing steadily leading up to the 
loss.  For example, if the other side is sending at 220Kbps at 20ms 
intervals with an access bottleneck (upstream at their side, or 
downstream at our side) of 200Kbps, and there's no other traffic on the 
link, then the packets would have been coming in about 22ms apart 
instead of 20ms.
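
To make the arithmetic explicit (numbers from the example above; the 
packet size is just what the rate and interval imply):

# Quick check of the example: 20ms packets at 220Kbps draining through a
# 200Kbps bottleneck take 20ms * (220/200) = 22ms each to serialize, so
# arrivals spread out to ~22ms apart.
send_interval_ms = 20.0
send_rate_kbps = 220.0
link_rate_kbps = 200.0

packet_bits = send_rate_kbps * send_interval_ms     # 4400 bits per packet
service_time_ms = packet_bits / link_rate_kbps      # 22.0 ms per packet
print(service_time_ms)                              # -> 22.0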

Let's assume a very short queue depth (no bufferbloat!) of 2 packets 
(40ms), and look at what happens.  For whatever reason (packet loss, a 
noisy signal to the filter, a long RTT, cross-traffic that just went 
away, etc.), let's assume that the system didn't react before the loss 
happened.

When a packet (N+2 here) is lost, there must have been 2 packets in the 
buffer.  The receiver will see packets N, N+1, and then N+3 and N+4.  N 
will have been delayed around 38ms, N+1 around 40ms, N+3 about 22ms, and 
N+4 around 24ms.  Pictorially, this would look like a sawtooth in the 
delay profile.
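
Here's a toy model (illustrative only, using the numbers from the 
example) that plays out that scenario: delay climbs by 2ms per packet 
until the ~40ms buffer fills, one packet is tail-dropped, and the next 
delivered packet shows a much smaller delay - the sawtooth:

# Toy model of the scenario above: 20ms sends, 22ms of service per packet
# (220Kbps into a 200Kbps link), roughly 40ms of buffer.  Delay climbs by
# 2ms per packet, one packet is tail-dropped when the buffer is full, and
# the next delivered packet shows a sudden drop in delay.
def simulate_tail_drop(num_packets=14, send_ms=20.0, service_ms=22.0,
                       max_backlog_ms=40.0):
    link_free_at = 0.0
    results = []                                     # (index, delay_ms or None)
    for i in range(num_packets):
        now = i * send_ms
        backlog = max(0.0, link_free_at - now)       # work queued ahead of us
        if backlog + service_ms > max_backlog_ms:    # buffer full: tail drop
            results.append((i, None))
            continue
        link_free_at = max(now, link_free_at) + service_ms
        results.append((i, link_free_at - now))      # queueing + serialization
    return results

for i, delay in simulate_tail_drop():
    print(i, "lost" if delay is None else f"{delay:.0f}ms")
# Delays climb 22, 24, ... 38, 40, then a loss, then 22, 24, ...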

If you feed these samples naively into the filter, its estimate will 
start to move down away from the 10% slope, and might indicate a flat 
queue-depth profile or perhaps even a negative (draining) one for a short 
time, when in fact we've been over-bandwidth the entire time.  The exact 
behavior depends on the filter type and usage, but this certainly 
violates some of the assumptions about error in a Kalman filter (for 
example, a Gaussian error distribution).  At least in this case, a Kalman 
filter might not be the optimum choice.
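
To see that effect concretely, here's a toy stand-in for the filter (a 
simple exponentially-weighted average of the per-packet delay deltas - an 
assumption for illustration only, not the actual Kalman filter) fed with 
the deltas from the example:

# Toy illustration: an EWMA of per-packet delay deltas stands in for the
# real filter.  The steady +2ms deltas say "queue growing"; the single
# -18ms delta caused by the loss drags the estimate negative ("queue
# draining") even though we were over-bandwidth the whole time.
delays_ms = [34, 36, 38, 40, 22, 24, 26]       # received packets around the loss
deltas = [b - a for a, b in zip(delays_ms, delays_ms[1:])]

estimate, alpha = 2.0, 0.25                    # start near the true +2ms/packet trend
for d in deltas:
    estimate = alpha * d + (1 - alpha) * estimate
    print(f"delta {d:+3d}ms -> trend estimate {estimate:+.1f}ms/packet")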

* Possible modifications:
a) drop samples where there's a loss, since losses frequently perturb 
the delta of the next packet
     This will reduce the 'noise' in the inputs to the filter, 
especially in simple cases like above.
b) increase the "uncertainty" of the next packet dramatically in the 
input to the filter
     This is an attempt to get the Kalman filter to weight the packet 
much lower, but not ignore it entirely.  I don't see this as very useful 
in practice.
c) look at whether there was a significant drop in delay following a 
loss, and use that as a separate indicator that we're over-bandwidth

In my algorithm, I termed these "fishy" losses - they implied that a 
queue had suddenly shortened.  Too many of these (more than 1) in a short 
period of time meant I would decrease sending bandwidth by a significant 
extra amount, since the implication is not just delay but a full queue, 
which I really want to get out of fast.  This is a form of modification 
'c'.
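
A minimal sketch of that kind of detector (the names, thresholds, and 
window here are illustrative, not the actual code I used): a loss 
immediately followed by a large drop in one-way delay counts as "fishy", 
and too many of those in a short window triggers an extra rate cut:

# Illustrative "fishy" loss detector (names and thresholds are assumptions):
# a loss immediately followed by a large drop in one-way delay suggests a
# full queue that suddenly shortened.
from collections import deque

class FishyLossDetector:
    def __init__(self, delay_drop_ms=10.0, window_ms=2000.0, max_fishy=1):
        self.delay_drop_ms = delay_drop_ms    # how big a drop counts as "fishy"
        self.window_ms = window_ms            # look-back window for counting
        self.max_fishy = max_fishy            # more than this => extra rate cut
        self.prev_delay = None
        self.loss_pending = False
        self.fishy_times = deque()

    def on_loss(self):
        self.loss_pending = True

    def on_packet(self, now_ms, delay_ms):
        """Returns True when enough fishy losses suggest an extra rate cut."""
        if (self.loss_pending and self.prev_delay is not None
                and self.prev_delay - delay_ms >= self.delay_drop_ms):
            self.fishy_times.append(now_ms)
        self.loss_pending = False
        self.prev_delay = delay_ms
        while self.fishy_times and now_ms - self.fishy_times[0] > self.window_ms:
            self.fishy_times.popleft()
        return len(self.fishy_times) > self.max_fishy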

If we're not the only stream on the bottleneck, and we just got 
"unlucky" and had our packet dropped, similar issues may occur, but the 
reduction in delay of the next packet may be less or much less, since 
we're typically metering our packets in at a frame rate.  With enough 
other traffic, the delay of the next packet will be roughly flat (around 
40ms in the case above).  So this modification (c) is of most use in 
detecting the bandwidth/queue limit of a relatively idle link, typically 
an access (especially upstream) link.  It therefore may want to be 
combined with other mechanisms.

If we see a direct loss without a significant reduction in delay, we 
need to assume it's either a congested link (not a physical-layer limit 
we're hitting on an idle link) or it's a "random" loss.  Losses on a 
congested link also indicate a full queue, and so even though the delay 
stopped increasing (or stayed on-average stable after you've hit the 
maximum and started dropping), you still want to decrease the sending 
rate.  If it's a "random" loss (and not AQM), this causes a minor 
under-utilization, but for a truly congested link, or if AQM is causing 
"random" drops, it's a signal that we should reduce to try to start the 
queue draining.  To avoid over-reaction and 'hunting', it may be good to 
use some type of threshold, perhaps averaged or filtered.  If the loss 
rate reached ~5%, I'd make a mild cut in sending rate on top of any 
filter-suggested cuts, and if it reached ~10% I'd make a strong cut.
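
As a sketch of that last heuristic (the ~5%/~10% thresholds are the ones 
I mention above; the smoothing factor and the cut sizes are just 
illustrative):

# Sketch of the loss-rate reaction described above.  The ~5% / ~10%
# thresholds come from the text; the EWMA smoothing and the cut sizes
# (10% mild, 30% strong) are illustrative assumptions.
class LossRateCutter:
    def __init__(self, alpha=0.3, mild_loss=0.05, strong_loss=0.10,
                 mild_cut=0.10, strong_cut=0.30):
        self.alpha = alpha              # smoothing to avoid 'hunting'
        self.smoothed_loss = 0.0
        self.mild_loss, self.strong_loss = mild_loss, strong_loss
        self.mild_cut, self.strong_cut = mild_cut, strong_cut

    def update(self, interval_loss_fraction, current_rate_bps):
        """Feed the loss fraction for the last report interval; returns the
        rate to use on top of any filter-suggested changes."""
        self.smoothed_loss = (self.alpha * interval_loss_fraction
                              + (1 - self.alpha) * self.smoothed_loss)
        if self.smoothed_loss >= self.strong_loss:
            return current_rate_bps * (1 - self.strong_cut)   # strong cut
        if self.smoothed_loss >= self.mild_loss:
            return current_rate_bps * (1 - self.mild_cut)     # mild cut
        return current_rate_bps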

(to be continued in another post)

-- 
Randell Jesup
randell-ietf at jesup.org	


