Matrix Sketching over Sliding Windows
Zhewei Wei (1), Xuancheng Liu (1), Feifei Li (2), Shuo Shang (1), Xiaoyong Du (1), Ji-Rong Wen (1)
(1) School of Information, Renmin University of China
(2) School of Computing, The University of Utah
Matrix data
• Modern data sets are modeled as large matrices. Think of $A \in \mathbb{R}^{n \times d}$ as $n$ rows in $\mathbb{R}^{d}$.
•
  Data              Rows            Columns          d              n
  Textual           Documents       Words            10^5 – 10^7    >10^10
  Actions           Users           Types            10^1 – 10^4    >10^7
  Visual            Images          Pixels, SIFT     10^5 – 10^6    >10^8
  Audio             Songs, tracks   Frequencies      10^5 – 10^6    >10^8
  Machine Learning  Examples        Features         10^2 – 10^4    >10^6
  Financial         Prices          Items, Stocks    10^3 – 10^5    >10^6
Singular Value Decomposition (SVD)
$$A = U \Sigma V^{T}, \qquad \Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_d)$$
• Principal component analysis (PCA)
• K-means clustering
• Latent semantic indexing (LSI)
SVD & Eigenvalue decomposition
Covariance matrix: $A^{T} A$
$$A^{T} A = V \Sigma^{2} V^{T}, \qquad \Sigma^{2} = \operatorname{diag}(\sigma_1^{2}, \sigma_2^{2}, \ldots, \sigma_d^{2})$$
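The relation between the SVD of $A$ and the eigendecomposition of $A^{T} A$ can be checked numerically. The following is a minimal sketch assuming NumPy and a small random matrix; the sizes and variable names are illustrative and not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 4))            # n = 50 rows, d = 4 columns (arbitrary sizes)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigendecomposition of the symmetric d x d matrix A^T A (eigenvalues ascending)
eigvals, eigvecs = np.linalg.eigh(A.T @ A)

sq_singular_desc = s ** 2                   # squared singular values of A
eigvals_desc = eigvals[::-1]                # eigenvalues of A^T A, descending

# A^T A rebuilt from the right singular vectors: V diag(s^2) V^T
reconstructed = Vt.T @ np.diag(s ** 2) @ Vt
```

The squared singular values match the eigenvalues of $A^{T} A$, and `reconstructed` agrees with `A.T @ A` up to floating-point error.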
Matrix Sketching
• Computing SVD is slow (and offline).
• Matrix sketching: approximate a large matrix $A \in \mathbb{R}^{n \times d}$ with $B \in \mathbb{R}^{\ell \times d}$, $\ell \ll n$, in an online fashion.
• Row-update stream: each update receives a row.
• Covariance error [Liberty2013, Ghashami2014, Woodruff2016]: $\|A^{T} A - B^{T} B\| / \|A\|_F^{2} \le \varepsilon$.
• Feature hashing [Weinberger2009], random projection [Papadimitriou2011], …
• Frequent Directions (FD) [Liberty2013]:
  - $B \in \mathbb{R}^{\ell \times d}$, $\ell = 1/\varepsilon$, s.t. covariance error $\le \varepsilon$.
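As an illustration of the Frequent Directions idea, here is a minimal sketch of one common variant: it buffers up to 2ℓ rows and, when the buffer is full, subtracts the (ℓ+1)-th largest squared singular value from all squared singular values. The buffer size, the function names `frequent_directions` and `_shrink`, and the test data are assumptions made for this example, not details from the slides.

```python
import numpy as np

def _shrink(B, ell):
    """Shrink a full buffer so that at most `ell` rows remain non-zero."""
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    # Squared (ell+1)-th singular value; 0 if the buffer has rank <= ell
    delta = s[ell] ** 2 if len(s) > ell else 0.0
    s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
    new_B = np.zeros_like(B)
    new_B[:len(s)] = s_shrunk[:, None] * Vt
    return new_B, min(ell, len(s))

def frequent_directions(A, ell):
    """Process the rows of A one at a time; return a sketch with at most 2*ell rows."""
    d = A.shape[1]
    B = np.zeros((2 * ell, d))
    filled = 0
    for a in A:
        if filled == 2 * ell:               # buffer full: shrink before inserting
            B, filled = _shrink(B, ell)
        B[filled] = a
        filled += 1
    return B[:filled]

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 30))
ell = 10
B = frequent_directions(A, ell)

# Covariance error: spectral norm of A^T A - B^T B, relative to ||A||_F^2
err = np.linalg.norm(A.T @ A - B.T @ B, 2) / np.linalg.norm(A, "fro") ** 2
```

For this variant each shrink step removes at least (ℓ+1) times the subtracted value from the squared Frobenius norm of the buffer, so `err` stays below 1/ℓ, which corresponds to ε when ℓ = 1/ε.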
Matrix Sketching over Sliding Windows
• Each row is associated with a timestamp.
• Maintain $B_W$ for $A_W$: the rows in sliding window $W$.
  Covariance error: $\|A_W^{T} A_W - B_W^{T} B_W\| / \|A_W\|_F^{2} \le \varepsilon$
• Sequence-based window: the past $N$ rows ($A_W$: $N$ rows).
• Time-based window: rows in a past time period $\Delta$ ($A_W$: rows in $\Delta$ time units).
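The two window types and the window covariance error can be written down directly. This is a small sketch assuming NumPy arrays, a half-open time window (now − Δ, now], and hypothetical helper names; none of these choices come from the slides.

```python
import numpy as np

def sequence_window(rows, N):
    """Sequence-based window: the most recent N rows."""
    return rows[-N:]

def time_window(rows, timestamps, now, delta):
    """Time-based window: rows with timestamp in (now - delta, now] (assumed convention)."""
    mask = (timestamps > now - delta) & (timestamps <= now)
    return rows[mask]

def covariance_error(A_W, B_W):
    """Spectral norm of A_W^T A_W - B_W^T B_W, relative to ||A_W||_F^2."""
    diff = A_W.T @ A_W - B_W.T @ B_W
    return np.linalg.norm(diff, 2) / np.linalg.norm(A_W, "fro") ** 2

rows = np.arange(12, dtype=float).reshape(6, 2)      # six 2-dimensional rows
timestamps = np.array([1, 2, 3, 5, 8, 9])

seq = sequence_window(rows, N=3)                      # last three rows
recent = time_window(rows, timestamps, now=9, delta=4)  # timestamps 8 and 9
```

Materializing $A_W$ like this needs the whole window in memory; it is useful only as a reference when measuring the error of a sketch $B_W$.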
Motivation 1: Sliding windows vs. unbounded streams
• The sliding window model is more appropriate in many real-world applications.
• This is particularly so in the areas of data analysis where matrix sketching techniques are widely used.
• Applications:
  - Analyzing tweets for the past 24 hours.
  - Sliding window PCA for detecting changes and anomalies [Papadimitriou2006, Qahtan2015].
Motivation 2: Lower bound
• Unbounded stream solution: use $O(d^{2})$ space to store $A^{T} A$.
  - Update: $A^{T} A \leftarrow A^{T} A + a_i^{T} a_i$
Theorem 4.1. An algorithm that returns $A^{T} A$ for any sequence-based sliding window must use $\Omega(Nd)$ bits of space.
• Matrix sketching is necessary for sliding windows, even when the dimension $d$ is small.
• Matrix sketching over sliding windows requires new techniques.
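The contrast can be made concrete with a short sketch. In an unbounded stream the rank-one update alone suffices, but in a sequence-based window the expiring row must also be subtracted, which forces an exact method to remember all N rows of the window. The class name `ExactWindowCovariance` and the test sizes below are hypothetical and serve only as an illustration.

```python
from collections import deque

import numpy as np

class ExactWindowCovariance:
    """Exact A_W^T A_W over the last N rows; stores the whole window (N * d numbers)."""

    def __init__(self, d, N):
        self.cov = np.zeros((d, d))
        self.window = deque()
        self.N = N

    def update(self, a):
        a = np.asarray(a, dtype=float)
        self.cov += np.outer(a, a)          # A^T A <- A^T A + a_i^T a_i
        self.window.append(a)
        if len(self.window) > self.N:       # oldest row leaves the window
            old = self.window.popleft()
            self.cov -= np.outer(old, old)  # requires having kept the old row

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))
tracker = ExactWindowCovariance(d=3, N=5)
for row in A:
    tracker.update(row)
```

After the loop, `tracker.cov` equals the covariance matrix of the last five rows, at the cost of keeping those rows in `tracker.window`.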
Three algorithms
• Sampling:
  - Sample $a_i$ w.p. proportional to $\|a_i\|^{2}$ [Frieze2004].
  - Priority sampling [Efraimidis2006] + sliding window top-k.
• LM-FD: Exponential Histogram (Logarithmic method) [Datar2002] + Frequent Directions.
• DI-FD: Dyadic interval techniques [Arasu2004] + Frequent Directions.

  Sketches  Update                                              Space                                               Window           Interpretable?
  Sampling  $\frac{d}{\varepsilon^{2}} \log\log NR$             $\frac{d}{\varepsilon^{2}} \log NR$                 Sequence & time  Yes
  LM-FD     $d \log \varepsilon NR$                             $\frac{1}{\varepsilon^{2}} \log \varepsilon NR$     Sequence & time  No
  DI-FD     $\frac{d}{\varepsilon} \log \frac{R}{\varepsilon}$  $\frac{R}{\varepsilon} \log \frac{R}{\varepsilon}$  Sequence         No
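To give a feel for the sampling approach, here is a simplified, non-windowed sketch of weighted sampling with random keys in the style of [Efraimidis2006]: each row gets the key u^(1/w) with weight w equal to its squared norm, and the ℓ rows with the largest keys are kept. The function name `weighted_sample` is hypothetical, and the sliding window top-k maintenance and the rescaling of sampled rows are deliberately left out.

```python
import heapq
import random

def weighted_sample(rows, ell, seed=0):
    """Keep the ell rows with the largest keys u ** (1 / w), where w = squared norm."""
    rng = random.Random(seed)
    heap = []                                # min-heap of (key, index, row)
    for i, row in enumerate(rows):
        w = sum(x * x for x in row)
        if w == 0:
            continue                         # zero rows carry no weight; never sampled
        key = rng.random() ** (1.0 / w)
        if len(heap) < ell:
            heapq.heappush(heap, (key, i, row))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, i, row))
    # Return the sampled rows, largest key first
    return [row for _, _, row in sorted(heap, reverse=True)]

rows = [[3.0, 4.0], [0.0, 0.0], [1.0, 0.0], [0.1, 0.1]]
sample = weighted_sample(rows, ell=2)
```

Rows with large squared norm tend to receive keys close to 1 and are therefore more likely to be kept. Because the sample consists of actual input rows, a sketch built this way is interpretable.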
Three algorithms (summary)

  Sketches  Update  Space                 Window           Interpretable?
  Sampling  Slow    Large                 Sequence & time  Yes
  LM-FD     Fast    Small                 Sequence & time  No
  DI-FD     Slow    Best for small $R$    Sequence         No

• Interpretable: rows of the sketch $B$ come from $A$.
• $R$: ratio between the maximum squared norm and the minimum squared norm of the rows.
Experiments: space vs. error
[Figure: three panels, one per data set, with $R = 8.35$, $R = 1$, and $R = 90089$]
Experiments: time vs. space
[Figure: three panels, one per data set, with $R = 8.35$, $R = 1$, and $R = 90089$]
Conclusions
• First attempt to tackle the sliding window matrix sketching problem.
• Lower bounds show that the sliding window model differs from the unbounded streaming model for the matrix sketching problem.
• We propose algorithms for both time-based and sequence-based windows, with theoretical guarantees and an experimental evaluation.
Thanks!