Maintaining Time-Decaying Stream Aggregates - Cohen-Wang

Maintaining Time-Decaying
Stream Aggregates
Edith Cohen
Martin Strauss
AT&T Labs-research
PODS 2003
The Problem
• A data stream is a sequence of data items
observed over time.
• Presence of multiple massive data streams.
• Storage constraints allow maintaining only a
compact summary of the “essence” of the
information in each stream.
• Relevance of information decays with time.
• Thus, when aggregating across time, older
information should be discounted.
Applications
• IP routing - RED protocol: time-decayed
average of previous queue lengths is used to
estimate impending congestion at router
• Internet gateway selection: tracks the quality
(e.g., packet loss rate) of alternative paths to
select a more reliable one.
• Usage statistics of phone customers: AT&T
has about 100M customers.
• More …
Decay Functions
• A decay function is a non-increasing function
g(x)>=0 defined for x>=1.
• f(t) >= 0 is the value of the data item
observed at time t.
• The weight at time T of an item obtained at
time t is g(T-t)
• The decayed value of the item is f(t)g(T-t)
Time-Decaying Sum
$V_g(T) = \sum_{t < T} f(t)\, g(T - t)$
• When f(t) are 0/1 we refer to the problem
as time-decaying count.
• Maintaining the decaying sum exactly can
generally consume linear bits.
• We consider approximately maintaining it to
within a factor of 1 ± ε
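As a reference point for the approximate schemes, the exact decayed sum can be computed directly from its definition if every item is stored. The sketch below assumes the stream is kept as a dictionary from observation time t to f(t); the function name and the container are illustrative choices, not part of the talk.

```python
def decayed_sum(values, g, T):
    """Exact V_g(T): sum of f(t) * g(T - t) over all observed t < T.

    `values` maps an observation time t to f(t). Keeping every item is
    exactly the linear storage cost that the approximate schemes avoid.
    """
    return sum(f_t * g(T - t) for t, f_t in values.items() if t < T)


# 0/1 items observed at times 1, 2 and 4, weighted with g(x) = 1/x at T = 5.
stream = {1: 1, 2: 1, 4: 1}
print(decayed_sum(stream, lambda x: 1 / x, T=5))
```

With 0/1 values this is the time-decaying count; any non-increasing g defined for x >= 1 can be passed in.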
Time-Decaying Average
• Time-decaying weighted average of observed
values.
• $f(t_i)$ is the value of the item observed at time $t_i$
$A_g(T) = \frac{\sum_{i \mid t_i < T} f(t_i)\, g(T - t_i)}{\sum_{i \mid t_i < T} g(T - t_i)}$
Maintaining time-decaying average reduces to
maintaining two time-decaying sums
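The reduction can be made concrete: the numerator is a decayed sum of the values and the denominator is a decayed count of the items. A minimal sketch, assuming items arrive as (t_i, f(t_i)) pairs; the function name is hypothetical.

```python
def decayed_average(items, g, T):
    """A_g(T) as the ratio of two decayed sums over items (t_i, f(t_i))."""
    weighted = sum(v * g(T - t) for t, v in items if t < T)  # decayed sum of values
    weights = sum(g(T - t) for t, _ in items if t < T)       # decayed count of items
    return weighted / weights if weights > 0 else None


# Two observations; the more recent one (time 3) dominates under g(x) = 1/x.
print(decayed_average([(1, 10.0), (3, 20.0)], lambda x: 1 / x, T=4))
```

Any scheme that approximates each of the two sums to within 1 ± ε therefore approximates the average as well, with a slightly larger relative error.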
Interesting Families of Decay
Functions
• Exponential decay [Jacobson 88] ExpD
$g(x) = \exp(-\lambda x)$
• Sliding Windows [DGIM02] SliWinW
g(x)=1 for x<W
g(x)=0 otherwise
• Polynomial decay PolyD
$g(x) = 1/x^{\alpha}$
• General Decay functions…
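The three families can be written as small function factories. This sketch assumes the usual parameterizations: a rate `lam` for exponential decay, a window length `W` for sliding windows (with the x<W cutoff as stated on the slide), and an exponent `alpha` for polynomial decay. The names are illustrative.

```python
import math


def exp_decay(lam):
    """ExpD: g(x) = exp(-lam * x)."""
    return lambda x: math.exp(-lam * x)


def sliding_window(W):
    """SliWinW: g(x) = 1 for x < W, 0 otherwise."""
    return lambda x: 1.0 if x < W else 0.0


def poly_decay(alpha):
    """PolyD: g(x) = 1 / x**alpha."""
    return lambda x: 1.0 / x ** alpha


# Each family is non-increasing and non-negative on x >= 1.
for g in (exp_decay(0.5), sliding_window(3), poly_decay(2.0)):
    print([round(g(x), 4) for x in range(1, 6)])
```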
Exponential Decay
• Used in networking applications (RED)
• Very simple maintenance:
VExpD (t )  f (t )  exp( )VExpD (t 1)
Lemma:
• Exact tracking requires (N ) storage bits
• Approximate tracking uses (logN ) bits
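The recurrence needs only one stored number per stream. The sketch below assumes a rate parameter `lam` and keeps the state in a floating-point register, which is a simplification: it does not model the bit-level precision that the Θ(log N) bound is about. Under this recurrence the item observed at time t carries weight exp(-lam * (T - t)) at time T, so the newest item has weight 1.

```python
import math


class ExpDecaySum:
    """Single-register tracker: V(t) = f(t) + exp(-lam) * V(t - 1)."""

    def __init__(self, lam):
        self.factor = math.exp(-lam)  # per-time-unit decay factor
        self.value = 0.0

    def update(self, f_t):
        """Advance one time unit and absorb the value f_t observed now."""
        self.value = f_t + self.factor * self.value
        return self.value


tracker = ExpDecaySum(lam=0.5)
for f_t in (1, 0, 2):
    tracker.update(f_t)
print(tracker.value)
```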
Sliding Window Decay
Lemma: [DGIM02]
Sliding window decay can be approximately
tracked using Θ(log^2 W) bits (for 0/1 or
polynomial-size values).
• “Sharp Threshold”
• Upper bound using the Exponential
Histogram (EH) technique.
Polynomial Decay
Lemma:
Lower bound: Ω(log N)
Upper bound: O(log N log log N)
(N is elapsed time)
• Often more appropriate to applications
than Exponential or Sliding Window decay
• More efficient than SliWin decay (nearly
quadratic gap), almost as efficient as
Exponential decay.
General Decay Functions
• Lemma:
Can be (approximately) maintained using
O(log^2 N) bits (N is the minimum of elapsed
time and min x for which g(x)=0)
• Algorithm based on an adaptation of the
Exponential Histograms technique.
• Sliding windows (with Ω(log^2 N))
[DGIM02] are as “hard” to maintain as
general decay
Why Polynomial Decay?
• Link performance over time
[Figure: performance (good/bad) of Link A and Link B over time, up to time t0]
Which link should we select past time t0?
Initially A or B, eventually B.
Link Selection Example (cont)
• Polynomial decay (by tuning parameter):
Initially A or B, eventually B.
• Exponential decay:
Constant relative value of A and B:
Either A forever or B forever
• Sliding Window decay:
First B then A then same…
Poly decay can model our expectation (also
other smooth subexponential functions…)
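The contrast between polynomial and exponential decay can be checked numerically. The sketch below uses made-up link histories (they are not the data behind the figure): link A was bad for a long stretch long ago, link B was bad briefly just before t0 = 20, and both are good afterwards. The link with the smaller decayed count of bad time units is preferred. The parameter values `alpha=2.0` and `lam=0.1` are arbitrary assumptions.

```python
import math


def decayed_badness(bad_times, g, T):
    """Decayed count of bad time units observed before time T."""
    return sum(g(T - t) for t in bad_times if t < T)


# Hypothetical histories, chosen only for illustration.
LINK_A = range(0, 10)    # bad during times 0..9
LINK_B = range(15, 18)   # bad during times 15..17


def poly(x, alpha=2.0):
    return 1.0 / x ** alpha


def expd(x, lam=0.1):
    return math.exp(-lam * x)


def preferred(g, T):
    """Pick the link with the smaller decayed badness at time T."""
    a = decayed_badness(LINK_A, g, T)
    b = decayed_badness(LINK_B, g, T)
    return "A" if a < b else "B"


for T in (21, 50, 200):
    ratio = decayed_badness(LINK_A, expd, T) / decayed_badness(LINK_B, expd, T)
    print(T, "poly:", preferred(poly, T), "exp:", preferred(expd, T),
          "exp ratio A/B:", round(ratio, 6))
```

Under exponential decay both sums shrink by the same factor exp(-lam) per time unit once no new bad events arrive, so their ratio, and hence the choice, never changes. Under polynomial decay the weights of old and recent events approach each other as T grows, so the link with fewer bad units in total eventually wins.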
Summary of Bounds
Bound    Exp decay    Poly decay            SliWin decay    General decay
Upper    O(log N)     O(log N log log N)    O(log^2 N)      O(log^2 N)
Lower    Ω(log N)     Ω(log N)              Ω(log^2 N)      Ω(log^2 N)
• Approximate to within 1 ± ε
• N is minimum of elapsed time and min x for
which g(x)=0
Bucketing the Stream
[Figure: a 0/1 stream along the time axis, split into three buckets]
1 0 0 1 : Time width: 4, Count: 2
1 0 1 : Time width: 3, Count: 2
0 0 1 : Time width: 3, Count: 1
Merge (first two buckets): Time width: 7, Count: 4
• Histogram determined by time boundaries and bucket counts
• Time boundaries can be fixed (counts maintained per stream)
• Counts can be fixed (time boundaries maintained per stream)
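A bucket in such a histogram only needs its time width and its count, and merging two adjacent buckets adds both. A minimal sketch; the `Bucket` type and `merge` function are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bucket:
    width: int  # number of consecutive time units covered
    count: int  # number of 1s observed in that span


def merge(older, newer):
    """Merge two adjacent buckets into one covering both spans."""
    return Bucket(older.width + newer.width, older.count + newer.count)


# Buckets of width 4 and width 3, each holding two 1s, merged into one.
print(merge(Bucket(width=4, count=2), Bucket(width=3, count=2)))
```

After a merge, the positions of the individual 1s inside the combined span are no longer known, which is the source of the approximation error.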
Exponential Histograms [DGIM02]
• Introduced for Sliding Windows (SliWinW)
• Each new item is placed in a new bucket.
• Two buckets are merged when their
combined count is at most a fraction of the
combined count of all earlier buckets.
• Buckets with start time greater than W are
discarded.
• Bucket counts are independent of stream
• Sum of bucket counts is a constant-factor
approximation for SliWinW
Exponential Histograms (cont)
• Example for factor 2 approximation: (bucket
counts)
• 1
• 1, 1
• 1, 1, 1
• 1, 1, 2 (merge)
• 1, 1, 1, 2
• 1, 1, 2, 2 (merge)
• Values with time “in question” (before or after W)
are aggregated in least recent bucket.
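The bucket-count sequence listed above can be reproduced with a simple merge rule. The sketch below assumes an all-ones stream and allows at most three buckets of any one size, merging the two oldest of a size when a fourth appears; this limit is an assumption chosen to match the listed sequence. Timestamps and window expiry are left out.

```python
def eh_insert(buckets, max_per_size=3):
    """Add one item to a list of bucket counts ordered most-recent first."""
    buckets.insert(0, 1)  # each new item starts in its own bucket
    size = 1
    while buckets.count(size) > max_per_size:
        # Buckets of equal size are contiguous; find the oldest one.
        oldest = len(buckets) - 1 - buckets[::-1].index(size)
        # Replace the two oldest buckets of this size by one of double size.
        buckets[oldest - 1:oldest + 1] = [2 * size]
        size *= 2  # the merge may overflow the next size class
    return buckets


state = []
for _ in range(6):
    eh_insert(state)
    print(state)
```

Because merges depend only on how many items have arrived, the bucket counts are the same for every stream; only the time boundaries differ.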
EH properties
• Number of buckets is O(log W), for each
bucket we need to record exact start time,
thus we need O(log W) storage per bucket.
(total is O(log^2 W))
• An EH for Sliding Window W can be used to
approximate Sliding Window j for all j<W
Lemma:
EH can be used to approximate general decay
functions. (With W= minimum of elapsed time and
min x for which g(x)=0.)
Reducing any Decay Function
to Sliding Windows.
• Decay function g(x)
$V_g(T) = \sum_{T-N \le t < T} f(t)\, g(T-t)$
$= g(N) \sum_{T-N \le t < T} f(t) + \sum_{i=1}^{N-1} \big(g(N-i) - g(N+1-i)\big) \sum_{T-N+i \le t < T} f(t)$
$= g(N)\, SliWin_N(T) + \sum_{i=1}^{N-1} \big(g(N-i) - g(N+1-i)\big)\, SliWin_{N-i}(T)$
From (approximate) SliWinW for all W<=N we can
compute (approximate) decayed sum according to g().
With an EH with W=N we can compute (approximately)
decayed sums according to all decay functions g() up to
elapsed time N (or forever if g(N)=0).
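The reduction is a telescoping identity and can be verified with exact arithmetic. The sketch below assumes the stream is a list `f` indexed by time 0..T-1, that N <= T, and that SliWin_W(T) denotes the plain sum of the W most recent values; the function names are illustrative. Exact window sums are used here, whereas the scheme in the talk would plug in approximate ones from an EH.

```python
from fractions import Fraction


def window_sum(f, T, W):
    """SliWin_W(T): sum of f(t) over the W most recent times T-W <= t < T."""
    return sum(f[t] for t in range(T - W, T))


def decayed_sum_direct(f, g, T, N):
    """V_g(T) over the last N time units, straight from the definition."""
    return sum(f[t] * g(T - t) for t in range(T - N, T))


def decayed_sum_from_windows(f, g, T, N):
    """V_g(T) as a weighted combination of sliding-window sums."""
    total = g(N) * window_sum(f, T, N)
    for i in range(1, N):
        total += (g(N - i) - g(N + 1 - i)) * window_sum(f, T, N - i)
    return total


f = [3, 0, 1, 4, 1, 5]

def g(x):
    return Fraction(1, x)  # polynomial decay 1/x, kept exact

print(decayed_sum_direct(f, g, T=6, N=6), decayed_sum_from_windows(f, g, T=6, N=6))
```

Since g is non-increasing, every coefficient g(N-i) - g(N+1-i) is non-negative, so a 1 ± ε approximation of each window sum carries over to the combined sum.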
Weight-Based Merging
• Bucket start times depend only on elapsed
time.
• WBM Histograms apply to decay functions
where g(x)/g(x+1) is non-increasing.
• Number of buckets is O(log(g(1)/g(N))).
• O(log log N) storage per bucket (for
approximate bucket counts).
• More efficient than EH on decay that is
slightly super-polynomial or slower.
• O(log N log log N) storage for polynomial decay
WBM Histograms – How?
• Region boundaries b1,b2,b3,… :
$b_1 = \arg\max_x \{(1+\epsilon)\, g(x-1) \ge g(1)\}$
$b_i = \arg\max_x \{(1+\epsilon)\, g(x-1) \ge g(b_{i-1})\}$
• Current most-recent bucket is sealed and
new bucket is started at T s.t. T mod b1=0
• Two consecutive buckets that are in the
same region (according to elapsed start
and end times) are merged.
• At most 2 buckets per region
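The region boundaries can be computed by a direct scan. The sketch below reads each boundary as the largest x for which (1+ε) g(x-1) is still at least the weight at the previous boundary, and treats b_0 = 1 as the starting point; both the reading and the helper name are assumptions. It uses exact fractions so that ties at a boundary are decided correctly, and it only terminates for decay functions that actually decrease.

```python
from fractions import Fraction


def region_boundaries(g, one_plus_eps, count):
    """Return the first `count` region boundaries b_1, b_2, ... for decay g."""
    boundaries = []
    prev = 1  # assumed b_0 = 1, so the first region starts at weight g(1)
    for _ in range(count):
        x = prev + 1  # x = prev + 1 always satisfies the condition
        # Advance while x + 1 still satisfies (1+eps) * g((x+1) - 1) >= g(prev).
        while one_plus_eps * g(x) >= g(prev):
            x += 1
        boundaries.append(x)
        prev = x
    return boundaries


# g(x) = 1/x with (1+eps) = 2.
print(region_boundaries(lambda x: Fraction(1, x), 2, 3))
```

With these boundaries, the regions cover elapsed times 1..b_1-1, b_1..b_2-1, and so on, and within a region the weights differ by at most a factor of (1+ε). For g(x)=1/x and (1+ε)=2 this groups the weights as 1, 1/2, then 1/3 to 1/6, then 1/7 to 1/14.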
WBMH Example
• g(x)=1/x, (1+ε)=2
• Regions:
{1, 1/2}, {1/3, 1/4, 1/5, 1/6}, {1/7, 1/8, …, 1/14}
[Figure: histogram buckets at times T=1 through T=6]
Conclusion
Summary:
• Efficient computation of time-decayed
sum/averages for general decay functions.
• Very efficient computation for polynomial
decay
• Open question:
Is O(log N) storage achievable for polynomial decay?
• Subsequent related work:
Spatial decay (sensor nets/p2p nets)