BRAID: Discovering Lag Correlations in Multiple Streams
Yasushi Sakurai (NTT Cyber Space Labs)
Spiros Papadimitriou (Carnegie Mellon Univ.)
Christos Faloutsos (Carnegie Mellon Univ.)
Motivation
- Data-stream applications
  - Network analysis
  - Sensor monitoring
  - Financial data analysis
  - Moving object tracking
- Goal
  - Monitor multiple numerical streams
  - Determine which pairs are correlated with lags
  - Report the value of each such lag (if any)
SIGMOD 2005, Y. Sakurai et al.
Lag Correlations
- Examples
  - A decrease in interest rates typically precedes an increase in house sales by a few months
  - Higher amounts of fluoride in the drinking water lead to fewer dental cavities, some years later
  - High CPU utilization on server 1 precedes high CPU utilization on server 2 by a few minutes
Lag Correlations
- Example of lag-correlated sequences
  - These sequences are correlated with lag l = 1300 time-ticks
[Figure: example sequences and their CCF (Cross-Correlation Function)]
Lag Correlations
- Example of lag-correlated sequences
  - Fast (high performance)
  - Nimble (low memory consumption)
  - Accurate (good approximation)
[Figure: CCF (Cross-Correlation Function)]
Problem #1: PAIR of sequences
- Given two co-evolving sequences X and Y, determine
  - Whether there is a lag correlation
  - If yes, what the lag length l is
- Any time, on semi-infinite streams
[Figure: sequences X and Y as input; example answer "yes; l = 1,300"]
Problem #2: k-way
- Given k numerical sequences X_1, …, X_k, report
  - Which pairs (if any) have a lag correlation
  - The corresponding lag for such pairs
- Again, 'any time', in a streaming fashion
[Figure: sequences X_1, X_2, ..., X_k as input; example answer "X_1 and X_2; l = 1,300"]
Our solution, BRAID
- Characteristics:
  - 'Any-time' processing, and fast: computation time per time tick is constant
  - Nimble: memory space requirement is sub-linear in the sequence length
  - Accurate: approximation introduces small error
Related Work
- Sequence indexing
  - Agrawal et al. (FODO 1993)
  - Faloutsos et al. (SIGMOD 1994)
  - Keogh et al. (SIGMOD 2001)
- Compression (wavelet and random projections)
  - Gilbert et al. (VLDB 2001)
  - Guha et al. (VLDB 2004)
  - Dobra et al. (SIGMOD 2002)
  - Ganguly et al. (SIGMOD 2003)
Related Work
- Data Stream Management
  - Abadi et al. (VLDB Journal 2003)
  - Motwani et al. (CIDR 2003)
  - Chandrasekaran et al. (CIDR 2003)
  - Cranor et al. (SIGMOD 2003)
Related Work
- Pattern discovery
  - Clustering for data streams: Guha et al. (TKDE 2003)
  - Monitoring multiple streams: Zhu et al. (VLDB 2002)
  - Forecasting: Yi et al. (ICDE 2000), Papadimitriou et al. (VLDB 2003)
- None of the previously published methods focuses on this problem
Overview
- Introduction / Related work
- Background
- Main ideas
- Theoretical analysis
- Experimental results
Background
- Lag correlation
  - Positively correlated: above +γ
  - Un-correlated: between -γ and +γ
  - Anti-correlated: lower than -γ
[Figure: CCF (Cross-Correlation Function), with axes Correlation and Lag]
Background (details)
- Definition of the 'score': the absolute value of R(l)
  $$\mathrm{score}(l) = |R(l)|, \qquad R(l) = \frac{\sum_{t=l+1}^{n} (x_t - \bar{x})(y_{t-l} - \bar{y})}{\sqrt{\sum_{t=l+1}^{n} (x_t - \bar{x})^2}\,\sqrt{\sum_{t=1}^{n-l} (y_t - \bar{y})^2}}$$
- Lag correlation: given a threshold γ, there is a lag correlation at lag l if
  - score(l) > γ
  - score(l) is a local maximum
  - l is the earliest such maximum, if more maxima exist
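The definition above can be sketched directly in Python. This is a minimal, non-streaming illustration and not the BRAID algorithm itself. The names `lag_correlation`, `earliest_lag`, and `max_lag` are hypothetical, and the means are assumed to be taken over the overlapping windows x_{l+1..n} and y_{1..n-l}.

```python
import math

def lag_correlation(x, y, lag):
    """R(lag): correlation between x[lag:] and y[:n-lag] (x_t paired with y_{t-lag})."""
    n = len(x)
    xs, ys = x[lag:], y[:n - lag]
    m = len(xs)
    if m < 2:
        return 0.0
    # Assumption: means are taken over the overlapping windows.
    mx, my = sum(xs) / m, sum(ys) / m
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def earliest_lag(x, y, gamma, max_lag):
    """Earliest lag whose score exceeds gamma and is a local maximum, else None."""
    scores = [abs(lag_correlation(x, y, l)) for l in range(max_lag + 1)]
    for l, s in enumerate(scores):
        # A missing neighbor at either end is treated as lower than any score.
        left = scores[l - 1] if l > 0 else float("-inf")
        right = scores[l + 1] if l < max_lag else float("-inf")
        if s > gamma and s >= left and s >= right:
            return l
    return None
```

Scanning every lag this way costs O(n) work per lag, which is the cost the rest of the talk sets out to avoid.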
Why not 'naive'?
- Naive solution:
  - Compute the correlation coefficient for each lag l = 0, 1, 2, 3, …, n/2
- But:
  - O(n) space
  - O(n^2) time, or O(n log n) time with FFT
[Figure: axes Correlation, Time (up to t = n), and Lag (0 to n/2)]
Main Idea (1)
- Incremental computing:
  - The correlation coefficient of two sequences is 'algebraic', so it can be computed incrementally
- We need to maintain only 6 'sufficient statistics':
  - Sequence length n
  - Sum of X, square sum of X
  - Sum of Y, square sum of Y
  - Inner product for X and the shifted Y
Main Idea (1) (details)
- Incremental computing:
  - Sequence length: n
  - Sum of X: $Sx(1,n) = \sum_{t=1}^{n} x_t$
  - Square sum of X: $Sxx(1,n) = \sum_{t=1}^{n} x_t^2$
  - Inner product for X and the shifted Y: $Sxy(l) = \sum_{t=l+1}^{n} x_t \, y_{t-l}$
- Compute R(l) incrementally:
  $$R(l) = \frac{C(l)}{\sqrt{Vx(l+1,n) \cdot Vy(1,n-l)}}$$
  - Covariance of X and Y: $C(l) = Sxy(l) - \frac{Sx(l+1,n) \cdot Sy(1,n-l)}{n-l}$
  - Variance of X: $Vx(l+1,n) = Sxx(l+1,n) - \frac{(Sx(l+1,n))^2}{n-l}$
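As a rough illustration of the incremental idea for a single fixed lag, the sketch below keeps the six sufficient statistics and updates them in constant time per tick. The class name `LagStats` and its methods are hypothetical. It buffers the last l+1 values of Y to obtain y_{t-l}, which is a simplification and not the smoothing scheme used by BRAID.

```python
import math
from collections import deque

class LagStats:
    """Sufficient statistics for R(lag), updated once per time tick."""

    def __init__(self, lag):
        self.lag = lag
        self.m = 0  # number of overlapping pairs, i.e. n - lag
        self.sx = self.sxx = 0.0   # Sx(l+1, n), Sxx(l+1, n)
        self.sy = self.syy = 0.0   # Sy(1, n-l), Syy(1, n-l)
        self.sxy = 0.0             # Sxy(l)
        # Assumption: keep the last lag+1 values of Y so y_{t-lag} is available.
        self.ybuf = deque(maxlen=lag + 1)

    def update(self, x_t, y_t):
        self.ybuf.append(y_t)
        if len(self.ybuf) <= self.lag:
            return  # fewer than lag+1 ticks seen: no pair (x_t, y_{t-lag}) yet
        y_old = self.ybuf[0]  # this is y_{t-lag}
        self.m += 1
        self.sx += x_t
        self.sxx += x_t * x_t
        self.sy += y_old
        self.syy += y_old * y_old
        self.sxy += x_t * y_old

    def correlation(self):
        if self.m < 2:
            return 0.0
        c = self.sxy - self.sx * self.sy / self.m
        vx = self.sxx - self.sx ** 2 / self.m
        vy = self.syy - self.sy ** 2 / self.m
        if vx <= 0 or vy <= 0:
            return 0.0
        return c / math.sqrt(vx * vy)
```

One such object per lag gives constant-time updates per lag, but tracking every lag up to n/2 still costs O(n) per tick.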
Main Idea (1)
- Complexity

| | Naive | Naive (incremental) | BRAID |
|---|---|---|---|
| Space | O(n) | O(n) | |
| Comp. time | O(n log n) | O(n) | |

Better, but not good enough!
Main Idea (2)
- Geometric lag probing
[Figure: axes Correlation and Lag]
Main Idea (2)
- Geometric lag probing
- I.e., compute the correlation coefficient for lags l = 0, 1, 2, 4, …, 2^h
  - O(log n) estimations
[Figure: axes Correlation and Lag, with probes at lags 0, 1, 2, 4, 8]
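A small sketch of the probe set, assuming the maximum lag of interest is passed in as `max_lag` (a hypothetical parameter; the function name `geometric_lags` is also illustrative):

```python
def geometric_lags(max_lag):
    """Lags 0, 1, 2, 4, ..., 2^h, keeping only those not exceeding max_lag."""
    lags = [0]
    lag = 1
    while lag <= max_lag:
        lags.append(lag)
        lag *= 2
    return lags
```

For a maximum lag of n/2 this yields about log2(n) probes, which is where the O(log n) number of estimations comes from.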
Main Idea (2)
- Geometric lag probing

| | Naive | Naive (incremental) | BRAID |
|---|---|---|---|
| Space | O(n) | O(n) | |
| Comp. time | O(n log n) | O(n) | O(log n) |

- But, so far, we still need O(n) space because the longest lag is n/2
Main Idea (3)
- Sequence smoothing
- Reminder: the naive approach
[Figure: axes Correlation, Time (up to t = n), and Lag]
Main Idea (3)
- Sequence smoothing
  - Means of windows for each level
  - Sufficient statistics computed from the means
  - CCF computed from the sufficient statistics
  - But it allows partial redundancy
[Figure: axes Correlation, Level (h = 0), Time (up to t = n), and Lag]
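The smoothing step can be illustrated as follows, assuming non-overlapping windows of length 2^h aligned to the start of the sequence. The slides do not spell out the window alignment, so that choice, along with the names `window_means` and `next_level`, is an assumption of this sketch.

```python
def window_means(seq, level):
    """Means of consecutive non-overlapping windows of length 2**level."""
    w = 2 ** level
    # Only complete windows are used; a trailing partial window is dropped.
    return [sum(seq[i:i + w]) / w for i in range(0, len(seq) - w + 1, w)]

def next_level(means):
    """Level h+1 means, obtained by averaging adjacent pairs of level-h means."""
    return [(means[i] + means[i + 1]) / 2 for i in range(0, len(means) - 1, 2)]
```

Because each level halves the number of values, a level-h sequence has about n / 2^h entries, so coarser levels are cheap to keep for the longer lags.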
Putting it all together:
- Geometric lag probing + smoothing
  - Use colored windows
  - Keep track of only a geometric progression of the lag values: l = {0, 1, 2, 4, 8, …, 2^h, …}
[Figure, shown in successive steps: windows of X and Y at each level, with axes Correlation, Level, Time, and Lag]
  - h = 0: l = 0 (t = n)
  - h = 0: l = 1 (t = n)
  - h = 1: l = 2 (t_h = n/2)
  - h = 2: l = 4 (t_h = n/4)
  - h = 3: l = 8 (t_h = n/8)
Putting it all together:
- Geometric lag probing + smoothing
  - Use colored windows
  - Keep track of only a geometric progression of the lag values: l = {0, 1, 2, 4, 8, …, 2^h, …}
  - Use a cubic spline to interpolate
[Figure: axes Correlation, Level (h = 0), Time (up to t = n), and Lag]
Thus:
- Complexity

| | Naive | Naive (incremental) | BRAID |
|---|---|---|---|
| Space | O(n) | O(n) | O(log n) |
| Comp. time | O(n log n) | O(n) | O(1) (*) |

(*) Computation time: O(log n); and actually, amortized time: O(1)
Overview (details)
- Introduction / Related work
- Background
- Main ideas
  - Enhancing the accuracy
- Theoretical analysis
- Experimental results
Enhanced Probing Scheme
- Q: How to probe more densely than 2^h?
- A: Probe in a mixture of geometric and arithmetic progressions
[Figure: axes Correlation, Level (h = 0), Time (up to t = n), and Lag]
Enhanced Probing Scheme
- Basic scheme: b = 1 (one number for each level)
- Enhanced scheme: b > 1
  - Example of b = 4
  - Probing the CCF in a mixture of geometric and arithmetic progressions: l = {0, 1, …, 7; 8, 10, 12, 14; 16, 20, 24, 28; 32, 40, …}
[Figure: probes with step 1, step 2, and step 4 along the Lag axis; other axes Correlation, Level (h = 0), and Time (up to t = n)]
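One way to generate such a probe set is sketched below. The rule used here (lags 0 to 2b-1 with step 1, then b lags per level with the step doubling at each level) is inferred from the b = 4 example above and is an assumption, as are the names `probe_lags` and `max_lag`.

```python
def probe_lags(b, max_lag):
    """Mixed geometric/arithmetic probe lags with b probes per level."""
    # Finest level: lags 0 .. 2b-1 with step 1.
    lags = list(range(0, min(2 * b, max_lag + 1)))
    step = 2
    # Coarser levels: b lags starting at b*step, spaced by step; step doubles.
    while b * step <= max_lag:
        for j in range(b):
            lag = (b + j) * step
            if lag <= max_lag:
                lags.append(lag)
        step *= 2
    return lags
```

With b = 1 this reduces to the basic geometric scheme, for example `probe_lags(1, 8)` gives `[0, 1, 2, 4, 8]`.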