Efficient Processing of Massive Data Streams for Mining and Monitoring
Mirek Riedewald
Department of Computer Science, Cornell University
Acknowledgements
• Al Demers
• Abhinandan Das
• Alin Dobra
• Sasha Evfimievski
• Johannes Gehrke
• KD-D initiative (Art Becker et al.)
Introduction
• Data streams versus databases
  • Infinite stream, continuous queries
  • Limited resources
• Network monitoring
  • High arrival rates, approximation [CGJSS02]
• Stock trading
  • Complex computation [ZS02]
• Retail, E-business, Intelligence, Medical Surveillance
  • Identify relevant information on-the-fly, archive for data mining
  • Exact results, error guarantees
Information Spheres
• Local Information Sphere
  • Within each organization
  • Continuous processing of distributed data streams
  • Online evaluation of thousands of triggers
  • Storage/archival of important data
• Global Information Sphere
  • Between organizations
  • Share data in a privacy-preserving way
Local Information Sphere
Distributed data stream event processing and online data mining
• Technical challenges
  • Blocking operators, unbounded state
  • Graceful degradation under increasing load
  • Integration with archive
  • Processing of physically distributed streams
Event Matching, Correlation
• Join of data streams

Brand  Mpix  Price  |  Mpix   Price
Canon  3.0   200    |  > 2.0  < 250
Event Matching, Correlation
• Join of data streams

Brand  Mpix  Price  |  Mpix   Price
Canon  3.0   200    |  > 2.0  < 250
Fuji   3.0   100    |  > 4.0  < 400
Event Matching, Correlation
• Join of data streams

Brand  Mpix  Price  |  Mpix   Price
Canon  3.0   180    |  > 2.0  < 250
Fuji   3.0   220    |  > 4.0  < 400
Kodak  4.0   340    |  = 3.0  < 200

• Equi-join, text similarity, geographical proximity, …
• Problem: unbounded state, computation
Window Joins
• Restrict join to window of most recent records (tuples)
  • Landmark window
  • Sliding window based on time or number of records
• Problem definition
  • Window based on time: size w
  • Synchronous record arrival
  • Equi-join
Abstract Model
• Data streams R(A,…), S(A,…)
• Compute equi-join on A
  • Match all r and s of streams R, S such that r.A=s.A
• Sliding window of size w

R: 1 1 1
S: 2 3 1
Output: (r0,s2), (r1,s2), (r2,s2)
Abstract Model (cont.)
• Data streams R(A,…), S(A,…)
• Compute equi-join on A
  • Match all r and s of streams R, S such that r.A=s.A
• Sliding window of size w

R: 1 1 1 3
S: 2 3 1 1
Output: (r0,s2), (r1,s2), (r2,s2)
        (r3,s1), (r1,s3), (r2,s3)
Abstract Model (cont.)
• Data streams R(A,…), S(A,…)
• Compute equi-join on A
  • Match all r and s of streams R, S such that r.A=s.A
• Sliding window of size w

R: 1 1 1 3 2
S: 2 3 1 1 4
Output: (r0,s2), (r1,s2), (r2,s2)
        (r3,s1), (r1,s3), (r2,s3)
        No new output
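The running example can be replayed with a few lines of Python. This is a minimal sketch, not the talk's implementation: `window_join` is a hypothetical helper, and the window semantics (two records join when their arrival positions differ by less than w) are an assumption inferred from the example output, where r0 no longer joins with s3 for w=3.

```python
def window_join(R, S, w):
    """Return, for each time step t, the join pairs first produced at t.

    A pair (i, j) stands for (r_i, s_j). Assumption: a pair qualifies when
    R[i] == S[j] and the two arrival positions differ by less than w.
    """
    steps = []
    for t in range(min(len(R), len(S))):
        oldest = max(0, t - w + 1)  # oldest position still inside the window
        # The newly arrived s_t meets the older R records still in the window.
        new_pairs = [(i, t) for i in range(oldest, t) if R[i] == S[t]]
        # The newly arrived r_t meets every S record in the window, including s_t.
        new_pairs += [(t, j) for j in range(oldest, t + 1) if R[t] == S[j]]
        steps.append(new_pairs)
    return steps


steps = window_join([1, 1, 1, 3, 2], [2, 3, 1, 1, 4], w=3)
for t, pairs in enumerate(steps):
    print(t, [f"(r{i},s{j})" for i, j in pairs])
```

Under these assumptions, steps 2 and 3 produce the pairs listed on the slides, and step 4 produces nothing.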
Limited Resources
• Focus on limited memory M<2w
• State of the art: random load shedding [KNV03]
  • Random sample of streams
• Desired approach: semantic load shedding
• Goal: graceful degradation
• Approximation
  • Set-valued result: Error measure?
Set-Approximation Error
• What is a good error measure?
• Information Retrieval, Statistics, Data Mining
  • Matching coefficient: $|A \cap B|$
  • Dice coefficient: $2|A \cap B| / (|A| + |B|)$
  • Jaccard coefficient: $|A \cap B| / |A \cup B|$
  • Cosine coefficient: $|A \cap B| / \sqrt{|A| \cdot |B|}$
  • Overlap coefficient: $|A \cap B| / \min\{|A|, |B|\}$
  • Earth Mover's Distance (EMD) [RTG98]
  • Match And Compare (MAC) [IP99]
• Join: subset of output result
  • EMD, Overlap coefficient trivially 0 or 1
  • Others (except MAC) reduce to MAX-subset error measure
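The reduction to MAX-subset can be checked numerically. The sketch below assumes the standard textbook definitions of the coefficients (in particular the cosine coefficient with a square root in the denominator) and uses a hypothetical helper `coefficients`. When the approximate result A is a subset of the exact result B, the overlap coefficient is always 1, while the other four grow with |A|.

```python
from math import sqrt


def coefficients(A, B):
    """Set-similarity coefficients for two non-empty sets A and B."""
    inter = len(A & B)
    return {
        "matching": inter,
        "dice": 2 * inter / (len(A) + len(B)),
        "jaccard": inter / len(A | B),
        "cosine": inter / sqrt(len(A) * len(B)),
        "overlap": inter / min(len(A), len(B)),
    }


exact = set(range(10))                        # exact join result B
small = coefficients(set(range(3)), exact)    # approximate result with 3 tuples
large = coefficients(set(range(6)), exact)    # approximate result with 6 tuples

for name in small:
    print(name, small[name], large[name])
```

Because every coefficient except overlap increases with the size of the retained subset, maximizing any of them amounts to maximizing the number of output tuples.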
Optimization Problem
Select records to be kept in memory such that the result size is maximized subject to memory constraints
• Lightweight online technique
• Adaptivity in presence of memory fluctuations
Optimal Offline Algorithm
• What is the best result that can be achieved?
  • Optimal sampling strategy for MAX-subset
  • Benchmark for the evaluation of any online algorithm
• Same optimization problem, but knows the future
  • Finite subsets of input streams
  • Formulate as linear flow problem
Generation of Flow Model
[Figure: flow network for M=2, w=3, R=1,1,1,3, S=2,3,1,1 with fixed memory allocation. Arcs carry costs (labels such as -1, -3, 3). Legend: "Keep in memory" and "Replace" arcs; capacity 0..1, linear cost.]
Correspondence to Windows
[Figure, four build steps: flow model for R=1,1,1,3 and S=2,3,1,1; the later steps add arc labels of -1.]
Complexity
• Integer solution exists
• Optimal solution found in $O(n^2 m \log n)$
  • N: input size of a single stream
  • #nodes: n < 2wN + N + 2
  • #arcs: m < 2n + M + 1
• Reasonable costs for benchmarking
  • Approx. 1 GB memory (w=800, M=800)
  • Approx. 1 h computation time
Optimal Flow
[Figure: optimal flow in the network for M=2, w=3, R=1,1,1,3, S=2,3,1,1 with fixed memory allocation. Arcs carry costs (labels such as -1, -3, 3). Legend: "Keep in memory" and "Replace" arcs; capacity 0..1, linear cost.]
Easy to Extend
[Figure: flow network for M=2, w=3, R=1,1,1,3, S=2,3,1,1 with variable memory allocation. Arcs carry costs (labels such as -1, -3, 3). Legend: "Keep in memory" and "Replace" arcs; capacity 0..1, linear cost.]
Online Heuristics
• Maximize expected output
  • PROB: sort tuples by join partner arrival probability
  • LIFE: sort tuples by product of partner arrival probability and remaining lifetime
• Maintain stream statistics
  • Histograms [DGIM02, TGIK02], wavelets [GKMS01], quantiles [GKMS02, GK01]
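The two priorities can be sketched as follows. This is an illustration under stated assumptions, not the talk's code: the function names are hypothetical, the partner arrival probability is estimated from a plain frequency table (standing in for the histogram, wavelet, or quantile summaries), and the remaining lifetime of a tuple is taken to be `arrival + w - now`.

```python
def prob_priority(value, partner_freq):
    """PROB: estimated probability that a partner-stream tuple carries `value`."""
    total = sum(partner_freq.values())
    return partner_freq.get(value, 0) / total if total else 0.0


def life_priority(value, arrival, now, w, partner_freq):
    """LIFE: partner arrival probability times remaining lifetime in the window."""
    remaining = max(0, arrival + w - now)  # assumed definition of remaining lifetime
    return prob_priority(value, partner_freq) * remaining


def choose_victim(tuples, now, w, partner_freq, heuristic="PROB"):
    """Return the (value, arrival) tuple with the lowest priority, i.e. the one to drop."""
    def priority(t):
        value, arrival = t
        if heuristic == "PROB":
            return prob_priority(value, partner_freq)
        return life_priority(value, arrival, now, w, partner_freq)
    return min(tuples, key=priority)


partner_freq = {1: 6, 2: 4}      # observed value counts on the partner stream
memory = [(1, 0), (2, 4)]        # (join value, arrival time) of tuples held in memory
prob_victim = choose_victim(memory, now=4, w=5, partner_freq=partner_freq, heuristic="PROB")
life_victim = choose_victim(memory, now=4, w=5, partner_freq=partner_freq, heuristic="LIFE")
print(prob_victim, life_victim)
```

In this example the heuristics disagree: PROB drops the tuple with the rarer join value, while LIFE drops the more frequent value because that tuple is about to expire.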
Approximation Quality
Effect of Skew
Summary
• Information sphere architecture
• Optimal algorithm and fast, efficient heuristics for sliding window joins
• Open problems
  • Other set error measures, resource models
  • Other joins: compress records
  • Complex queries
  • Distributed processing
  • Integration with other techniques into local information sphere
Related Work
• Aurora (Brown, MIT), STREAM (Stanford), Telegraph (Berkeley), NiagaraCQ (Wisconsin, OGI)
• Memory requirements [ABBMW02, TM02]
• Aggregation
  • Alon, Bar-Yossef, Datar, Dobra, Garofalakis, Gehrke, Gibbons, Gilbert, Indyk, Korn, Kotidis, Koudas, Matias, Motwani, Muthukrishnan, Rastogi, Srivastava, Strauss, Szegedy
Other Results [DGR03]
• Integration with archive
  • Load smoothing, not shedding
  • Novel "error" measure: archive access cost
• Static join for sensor networks
  • Maximize result size subject to constraints on energy consumption
  • Polynomial dynamic programming solution
  • Fast 2-approximation algorithms
  • NP-hardness proof for join of 3 or more streams
Other Results (cont.) [DGGR02]
• Computation of aggregates over streams for multiple joins
• Small pseudo-random sketch synopses (randomized linear projections)
• Explicit, tunable error guarantees
• Sketch partitioning to boost accuracy (intelligently partition join attribute space)
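The idea of a randomized linear projection can be illustrated for the simplest case, the size of a two-way equi-join. The sketch below is an assumption-laden illustration in the style of AMS sketches, not the method of [DGGR02]: all function names are hypothetical, and the random signs come from a seeded generator with a lookup table, whereas a space-efficient implementation would use four-wise independent hash functions instead.

```python
import random


def make_signs(seed):
    """Return a function that maps each join value to a fixed random sign (+1 or -1)."""
    rng = random.Random(seed)
    cache = {}  # illustration only; a real synopsis would hash instead of storing signs

    def sign(value):
        if value not in cache:
            cache[value] = rng.choice((-1, 1))
        return cache[value]

    return sign


def sketch(stream, sign):
    """Linear projection of the stream's frequency vector onto the sign vector."""
    return sum(sign(v) for v in stream)


def estimate_join_size(R, S, copies=64):
    """Average the product of the two sketches over independent sign functions."""
    estimates = []
    for seed in range(copies):
        sign = make_signs(seed)
        estimates.append(sketch(R, sign) * sketch(S, sign))
    return sum(estimates) / len(estimates)


# With a single join value the estimate is exact: 5 * 3 matching pairs.
print(estimate_join_size([7] * 5, [7] * 3))
```

Each sketch is a single counter that can be updated as records arrive, and more copies reduce the variance of the estimate, which is where tunable error guarantees come from.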
Thanks! Questions?