Scalable Machine Learning




  1. Scalable Machine Learning, Lecture 3: Data Streams. Alex Smola, Yahoo! Research and ANU. http://alex.smola.org/teaching/berkeley2012 (Stat 260 SP 12)

  2. 3. Data Streams: Building realtime analytics at home

  3. Data Streams
  • Data & Applications
  • Moments
    • Flajolet-Martin counter
    • Alon-Matias-Szegedy sketch
  • Heavy hitter detection
    • Lossy counting
    • Space saving
  • Semiring statistics
    • Bloom filter
    • CountMin sketch
  • Realtime analytics
    • Fault tolerance and scalability
    • Interpolating sketches

  4. 3.1 Streams

  5. Data Streams
  • Cannot replay data
  • Limited memory / computation / realtime analytics
  • Time series: observe instances (x_t, t)
    stock symbols, acceleration data, video, server logs, surveillance
  • Cash register: observe (weighted) instances x_i, always positive increments
    query stream, user activity, network traffic, revenue, clicks
  • Turnstile: increments and decrements (possibly requiring nonnegativity)
    caching, windowed statistics

  6. Website Analytics (NIPS)
  • Continuous stream of users (tracked with a cookie)
  • Many sites signed up for the analytics service
  • Find hot links / frequent users / click probability, right now

  7. Query Stream
  • Item stream
  • Find heavy hitters
  • Detect trends early (e.g. "Osama bin Laden killed")
  • Frequent combinations (cf. frequent items)
  • Source distribution
  • In real time

  8. Network Traffic Analysis
  • TCP/IP packets
  • On a switch with a limited memory footprint
  • Realtime analytics
  • Busiest connections
  • Trends
  • Protocol-level data
  • Distributed information gathering

  9. Financial Time Series
  • Real time prediction
  • Missing data
  • Metadata (news, quarterly reports, financial background)
  • Time-stamped data stream
  • Multiple sources
  • Different time resolutions

  10. News
  • Realtime news stream
  • Multiple sources (Reuters, AP, CNN, ...)
  • Same story from multiple sources
  • Stories are related

  11. 3.2 Moments

  12. Warmup
  • Stream of m items x_i
  • Want to compute statistics of what we have seen
  • Small cardinality n
    • Trivial to compute aggregate counts (dictionary lookup; see the sketch below)
    • Memory is O(n)
    • Computation is O(log n) for storage & lookup
  • Large cardinality n
    • Exact storage of counts impossible
    • Exact test for previous occurrence impossible
    • Need an approximate (dynamic) data structure
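
For the small-cardinality case, exact counting really is just a dictionary. A minimal Python sketch (the stream is simply an iterable here):

```python
# Exact per-item counts for the "small cardinality" case: a dictionary suffices,
# with memory growing in the number of distinct items.
from collections import Counter

def exact_counts(stream):
    counts = Counter()
    for x in stream:
        counts[x] += 1          # cheap lookup/update per item
    return counts

print(exact_counts(["a", "b", "a", "c", "a"]))   # Counter({'a': 3, 'b': 1, 'c': 1})
```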


  14. Finding the Missing Item
  • Sequence of instances [1..N]
  • One of them is missing
  • Identify it
  • Algorithm (sketched below)
    • Compute the sum s := Σ_{i=1}^{N} i
    • For each item decrement s via s ← s − x_i
    • At the end s identifies the missing item
  • We only need the least significant log N bits
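
A minimal sketch of the single-missing-item trick above; N and the stream contents are toy values for illustration:

```python
# Keep s = 1 + 2 + ... + N and subtract every item seen; whatever remains is the
# missing value. Only O(log N) bits of state are needed (the running sum, which
# could even be kept modulo N + 1).
def find_missing(stream, N):
    s = N * (N + 1) // 2        # sum of 1..N
    for x in stream:
        s -= x                  # decrement for each observed item
    return s                    # the one value that was never subtracted

print(find_missing([1, 2, 4, 5], N=5))   # -> 3
```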



  17. Finding the Missing Items
  • Sequence of instances [1..N]
  • Up to k of them are missing
  • Identify them
  • Algorithm (sketched below for k = 2)
    • Compute the power sums s_p := Σ_{i=1}^{N} i^p for p up to k
    • For each item decrement all s_p via s_p ← s_p − x_i^p
    • Identify the missing items by solving the polynomial system
  • We only need the least significant log N bits
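
A sketch of the same idea for k = 2: maintain the deficits of the first two power sums and solve the resulting quadratic. The general case solves a degree-k polynomial system (e.g. via Newton's identities), which this sketch does not attempt:

```python
# Two missing items a, b satisfy a + b = s1 and a^2 + b^2 = s2, where s1, s2 are
# the deficits of the first two power sums. Recover them from the quadratic
# z^2 - s1*z + ab = 0 with ab = (s1^2 - s2) / 2.
import math

def find_two_missing(stream, N):
    s1 = N * (N + 1) // 2                    # sum_{i=1}^N i
    s2 = N * (N + 1) * (2 * N + 1) // 6      # sum_{i=1}^N i^2
    for x in stream:
        s1 -= x
        s2 -= x * x
    prod = (s1 * s1 - s2) // 2               # a * b
    disc = math.isqrt(s1 * s1 - 4 * prod)    # sqrt of the discriminant
    return (s1 - disc) // 2, (s1 + disc) // 2

print(find_two_missing([1, 4, 5], N=5))      # -> (2, 3)
```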


  19. Estimating F_k

  20. Moments
  • Characterize the skewness of the distribution
  • Sequence of instances
  • Instantaneous estimates F_p := Σ_{x ∈ X} n_x^p (exact reference computation below)
  • Special cases
    • F_0 is the number of distinct items
    • F_1 is the number of items (trivial to estimate)
    • F_2 describes the 'variance' (used e.g. for database query plans)
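
As a reference point for the sketches that follow, a direct (non-streaming) computation of F_p from the definition on this slide:

```python
# Brute-force frequency moments: F_p = sum over distinct items x of n_x^p,
# where n_x is the number of occurrences of x. Only usable when all counts fit
# in memory; this is the quantity the streaming sketches approximate.
from collections import Counter

def exact_moment(stream, p):
    counts = Counter(stream)
    if p == 0:
        return len(counts)                      # F_0: number of distinct items
    return sum(n ** p for n in counts.values()) # F_p = sum_x n_x^p

data = ["a", "b", "a", "c", "a", "b"]
print(exact_moment(data, 0), exact_moment(data, 1), exact_moment(data, 2))  # 3 6 14
```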

  21. Flajolet-Martin Counter
  • Assume perfect hash functions (simplifies the proof)
  • Design a hash over log n bits with Pr(h(x) = j) = 2^{-j}
    [figure: example log n-bit hash values]
  • h(x) is the position of the rightmost 0 (LSB is position 1)
  • CDF for the maximum over n items: F(j) = (1 − 2^{-j})^n
    (the CDF of the maximum of n random variables is F^n)

  22. Flajolet-Martin Counter
  • Intuitively expect that max_{x ∈ X} h(x) ≈ log |X|
  • Repetitions of the same element do not matter
  • Need O(log log |X|) bits to store the counter (see the sketch below)
  • High-probability bound on the range:
    Pr( | max_{x ∈ X} h(x) − log |X| | > log c ) ≤ 2/c
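
A minimal Flajolet-Martin sketch along the lines of the two slides above. The geometric hash here (position of the lowest set bit of a SHA-1 hash, giving Pr(h(x) = j) = 2^{-j}) is a stand-in construction, not necessarily the one on the slides, and a single counter is kept purely for illustration; practical versions combine several counters:

```python
# A single FM counter: hash each item to a value j with Pr(j) = 2^-j and keep
# only the maximum. The maximum is roughly log2 of the number of distinct items,
# so only O(log log |X|) bits of state are needed.
import hashlib

def geometric_hash(x):
    h = int.from_bytes(hashlib.sha1(str(x).encode()).digest(), "big")
    # position of the lowest set bit (1-indexed): Pr(position = j) = 2^-j
    return (h & -h).bit_length()

def fm_estimate_log_cardinality(stream):
    max_j = 0
    for x in stream:
        max_j = max(max_j, geometric_hash(x))   # repetitions of x do not matter
    return max_j                                 # ~ log2(#distinct items)

stream = [i % 1000 for i in range(100_000)]      # 1000 distinct items
print(fm_estimate_log_cardinality(stream))       # typically close to 10
```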

  23. Proof
  (for a version with 2-way independent hash functions see Alon, Matias and Szegedy)
  • Upper bound
    • Trivial: |X| · 2^{-j} ≤ 1/c ⇒ 2^j ≥ c |X|
    • With probability at most 1/c the upper bound is exceeded (using the union bound)
  • Lower bound
    • The probability of not exceeding j is bounded by
      (1 − 2^{-j})^{|X|} ≤ exp(−|X| · 2^{-j}) ≤ e^{-c}
    • Solving for j, this holds whenever 2^j ≤ |X| / c

  24. Variations on the FM Counter
  • Lossy counting
    • Increment the counter (current value c) with probability p^c for p < 0.5
    • Yields an estimate of the log-count (normalization!)
  • Use FM counters instead of bits inside a Bloom filter ... more later
  • A log n rather than log log n array
    • Set a bit according to the hash
      [figure: bit array with wasted positions at both ends]
    • Count the run of consecutive 1s instead of the largest set bit, and fill gaps
  • The log log bounds are tight (see the AMS lower bound)
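
The lossy-counting bullet above is terse; the sketch below shows the classic Morris-style probabilistic counter, assuming the increment probability is 2^{-c} at counter value c (i.e. p = 1/2). The counter value tracks the log of the count, and 2^c − 1 is an unbiased estimate of the count:

```python
# Morris-style probabilistic counter (assumed parameterization p = 1/2):
# increment the counter with probability 2^-c, where c is its current value.
# The counter stays around log2(n) while 2**c - 1 estimates n in expectation.
import random

def morris_count(n_events, rng=random.random):
    c = 0
    for _ in range(n_events):
        if rng() < 2.0 ** -c:    # increment ever more rarely as c grows
            c += 1
    return c

random.seed(0)
c = morris_count(100_000)
print(c, 2 ** c - 1)             # tiny counter; estimate in the right ballpark
```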

  25. Computing F_2
  • Strategy
    • Design a random variable with E[X_ij] = F_2
    • Take the average over subsets: X̄_i := (1/a) Σ_{j=1}^{a} X_ij
    • The estimate is the median: X := med(X̄_1, ..., X̄_b)
  • Random variable
    X_ij := [ Σ_{x ∈ stream} σ(x, i, j) ]²
  • σ is a Rademacher hash with equiprobable values in {±1}
  • In expectation all cross terms cancel out, yielding F_2 (sketched below)
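
A sketch of the estimator described above. The Rademacher hash σ is emulated with Python's built-in hash, which is for illustration only; a real implementation would use, e.g., 4-wise independent hash functions:

```python
# AMS F_2 sketch: b groups of a counters; each counter accumulates the signed
# sum of sigma(x, i, j) over the stream, is squared at the end, the counters in
# a group are averaged, and the median over groups is returned.
import statistics

def sigma(x, i, j):
    # pseudo-random sign in {-1, +1}, fixed for a given (x, i, j)
    return 1 if hash((x, i, j)) & 1 else -1

def ams_f2(stream, a=16, b=7):
    acc = [[0] * a for _ in range(b)]
    for x in stream:
        for i in range(b):
            for j in range(a):
                acc[i][j] += sigma(x, i, j)                    # running signed sums
    group_means = [sum(z * z for z in row) / a for row in acc]  # average of squares
    return statistics.median(group_means)                       # median over groups

data = ["a", "b", "a", "c", "a", "b"]
print(ams_f2(data))                              # exact F_2 here is 14
```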

  26. Average-Median Theorem
  • Random variables X_ij with mean μ and variance σ²
  • Mean estimates X̄_i := (1/a) Σ_{j=1}^{a} X_ij and median X := med(X̄_1, ..., X̄_b)
  • The probability of deviation is bounded by
    Pr( |X − μ| ≥ ε ) ≤ δ  for  a = 8 σ² ε^{-2}  and  b = (8/3) log(1/δ)
  • Note: Alon, Matias & Szegedy claim b = 2 log(1/δ), but the Chernoff bounds
    don't work out AFAIK

  27. Proof
  • Bounding the mean
    • Pick a = 8 σ² ε^{-2} and apply the Chebyshev bound to see that
      Pr( |X̄_i − μ| > ε ) ≤ 1/8
  • Bounding the median
    • Ensure that for at least half of the X̄_i the deviation is small
    • The failure probability of each X̄_i is at most 1/8
    • Chernoff bound (Mitzenmacher & Upfal, Theorem 4.4):
      Pr( X ≥ (1 + t) μ ) ≤ exp(−μ t² / 3)
    • Plug in t = 3 and μ = b/8; with b = (8/3) log(1/δ) the failure probability
      is at most exp(−3b/8) ≤ δ

  28. Computing F_2
  • Mean
    E[X_ij] = E[ ( Σ_{x ∈ stream} σ(x, i, j) )² ] = E[ ( Σ_{x ∈ X} n_x σ(x, i, j) )² ]
            = Σ_{x ∈ X} n_x² = F_2
  • Variance
    E[X_ij²] = E[ ( Σ_{x ∈ X} n_x σ(x, i, j) )⁴ ]
             = 3 Σ_{x, x' ∈ X} n_x² n_{x'}² − 2 Σ_{x ∈ X} n_x⁴
    Var[X_ij] = E[X_ij²] − (E[X_ij])² = 2 Σ_{x, x' ∈ X} n_x² n_{x'}² − 2 Σ_{x ∈ X} n_x⁴ ≤ 2 F_2²
  • Plugging into the Average-Median theorem shows that the algorithm uses
    O( ε^{-2} log(1/δ) log(|X| n) ) bits

  29. Computing F_k in General
  • Random variable with expectation F_k
  • Pick a uniformly random element in the sequence
  • Start counting its occurrences until the end
    [figure: example stream "as random as can be" with remaining-occurrence counts]
  • Use the count r_ij to form X_ij = m ( r_ij^k − (r_ij − 1)^k )
  • Apply the Average-Median theorem (sketched below)
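
A sketch of this estimator. The uniformly random position is obtained by reservoir sampling so that a single pass suffices; for clarity the averaging and median steps below simply rerun the pass, whereas a streaming implementation would maintain all a·b estimators in parallel:

```python
# Single AMS F_k estimator: pick a uniformly random position of the stream
# (reservoir sampling), count how often that element reappears from there to
# the end (r), and return m * (r^k - (r-1)^k), which has expectation F_k.
import random
import statistics

def fk_single_estimate(stream, k, rng=random):
    tracked, r, m = None, 0, 0
    for x in stream:
        m += 1
        if rng.random() < 1.0 / m:   # reservoir sampling: uniform random position
            tracked, r = x, 1        # start counting afresh from this position
        elif x == tracked:
            r += 1                   # another occurrence after the sampled position
    return m * (r ** k - (r - 1) ** k)

def fk_estimate(stream, k, a=50, b=5, rng=random):
    groups = [statistics.mean(fk_single_estimate(stream, k, rng) for _ in range(a))
              for _ in range(b)]
    return statistics.median(groups)  # average-median step from slide 26

random.seed(1)
data = ["a", "b", "a", "c", "a", "b"]
print(fk_estimate(data, k=2))        # exact F_2 is 14; the estimate is noisy but close
```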

  30. More F_k
  • Mean via a telescoping sum:
    E[X_ij] = [ 1^k + (2^k − 1^k) + ... + (n_1^k − (n_1 − 1)^k) ]
              + ... + [ ... + (n_|X|^k − (n_|X| − 1)^k) ]
            = Σ_{x ∈ X} n_x^k = F_k
  • Variance by brute-force algebra: Var[X_ij] ≤ E[X_ij²] ≤ k |X|^{1 − 1/k} F_k²
  • We need at most O( k |X|^{1 − 1/k} ε^{-2} log(1/δ) (log m + log |X|) ) bits
    to estimate F_k; the rate is tight
  • For large k this is no better than brute force


  32. Uniform sampling
