
Data Streams Tutorial
Andrew McGregor, University of Massachusetts, Amherst

Data Stream Model: [Morris 78] [Munro, Paterson 78] [Flajolet, Martin 85] [Alon, Matias, Szegedy 96] [Henzinger, Raghavan, Rajagopalan 98]


First Idea: Sketches

(Picture: the sketch is a linear map, Z · (f_1, f_2, ..., f_n)^T = (t_1, ..., t_k)^T, where Z has k rows and n columns.)

• Algorithm uses a (random) projection matrix Z such that the relevant properties of f can be estimated from the sketch Zf.
• Easy to Update: On seeing "i", add the i-th column of Z to the sketch.
• Store Matrix Implicitly: Need to be able to efficiently generate any entry of Z from a "small" random seed.
• Gives an Õ(k)-space algorithm, under seed and precision assumptions.

Algorithm for Estimating F_2

As before, the sketch is t = Zf with t = (t_1, ..., t_k); the square of each sketch entry is concentrated around F_2.

Consider a row z of the projection matrix. Let the entries of z be uniform in {-1, 1}, chosen with 4-wise independence, and let t = z·f.

Expectation: E(t²) = Σ_{i,j} E(z_i z_j) f_i f_j = F_2
Variance: Var(t²) ≤ Σ_{i,j,k,l} E(z_i z_j z_k z_l) f_i f_j f_k f_l < 6 F_2²

By Chebyshev (together with a standard median-of-means step), setting k = O(ε⁻² log δ⁻¹) ensures that, with probability 1-δ, the average of the squared sketch entries is (1±ε) F_2.
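Below is a small Python sketch (not from the tutorial) of this estimator. The ±1 entries of each row are generated with (approximately) 4-wise independent signs from a degree-3 polynomial over a prime field, the sketch t = Zf is updated one column at a time, and the average of the squared entries is returned; the class name, the choice of prime, and the plain averaging with no median step are illustrative simplifications.

```python
import random

# A degree-3 polynomial over a prime field gives (approximately) 4-wise
# independent +/-1 values from a small seed; the prime P is an illustrative
# choice and is assumed to exceed the universe size n.
P = 2_147_483_647

class F2Estimator:
    def __init__(self, k):
        # One random polynomial (the "small seed") per row of Z.
        self.rows = [tuple(random.randrange(P) for _ in range(4)) for _ in range(k)]
        self.t = [0] * k   # the sketch t = Zf

    def _sign(self, coeffs, i):
        a, b, c, d = coeffs
        v = (a + b * i + c * i * i + d * i * i * i) % P
        return 1 if v % 2 == 0 else -1   # entry z_i in {-1, +1}

    def update(self, i):
        # On seeing item i, add the i-th column of Z to the sketch.
        for row, coeffs in enumerate(self.rows):
            self.t[row] += self._sign(coeffs, i)

    def estimate(self):
        # Average of the squared sketch entries concentrates around F_2.
        return sum(x * x for x in self.t) / len(self.t)

# Usage: the stream below has frequencies {5: 3, 2: 2, 9: 1}, so F_2 = 14.
est = F2Estimator(k=400)
for item in [5, 2, 5, 9, 2, 5]:
    est.update(item)
print(est.estimate())   # close to 14 with high probability
```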

Second Idea: Sampling

• Let's sample from S = [a_1, a_2, a_3, ..., a_m], where each a_i ∈ [n].
• Distribution Sampling: Return i with probability f_i/m.
• Universe Sampling: Return (i, f_i) where i is chosen uniformly from [n].
• AMS Sampling: Return (i, r) with i chosen with probability f_i/m and r chosen uniformly from [f_i].
• One-pass implementation: sample a_j for j uniform in [m], let i = a_j, and compute r = |{j′ ≥ j : a_{j′} = a_j}|.
• Useful for estimating Σ_i g(f_i) because E[m·(g(r) − g(r−1))] = Σ_i g(f_i); see the sketch after this list.
• L_p Sampling: Return i with probability f_i^p / F_p.
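As a concrete illustration (not from the slides), here is a minimal Python sketch of the one-pass AMS sampler: a uniformly random position j is kept by reservoir sampling, and r counts the occurrences of a_j from position j onward. The function names and the averaging over repeated trials, which simply re-reads the list here, are illustrative choices.

```python
import random

def ams_sample(stream):
    # One pass: keep a uniformly random position j by reservoir sampling,
    # set i = a_j, and count the occurrences of a_j from position j onward.
    i, r, m = None, 0, 0
    for item in stream:
        m += 1
        if random.randrange(m) == 0:   # position m is kept with probability 1/m
            i, r = item, 1             # restart the suffix count at the new sample
        elif item == i:
            r += 1                     # another occurrence after the sampled one
    return i, r, m

def estimate_sum_g(stream, g, trials=1000):
    # Average independent estimates of sum_i g(f_i); for simplicity this
    # re-reads the list, whereas a true one-pass algorithm would run the
    # trials in parallel over the same stream.
    total = 0.0
    for _ in range(trials):
        _, r, m = ams_sample(stream)
        total += m * (g(r) - g(r - 1))
    return total / trials

# Usage: with g(x) = x^2 this estimates F_2 = 3^2 + 2^2 + 1^2 = 14.
data = [5, 2, 5, 9, 2, 5]
print(estimate_sum_g(data, lambda x: x * x))
```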

L_0 Sampling

Suppose we know F_0. Pick a hash function h: [n] → [F_0].

Algorithm: Maintain values c and id, initially 0.
For each j in the stream: if h(j) = 1, set c ← c+1 and id ← id+j.
Return id/c if all elements hashing to 1 were the same.

Claim: This happens with constant probability.
Note: We also need a way to check that the elements hashing to 1 really were the same.

To drop the assumption that F_0 is known, run O(log n) copies guessing F_0 = 2^i; at least one instantiation works with constant probability. The algorithm is a sketch and works with deletions!
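A minimal Python sketch of this procedure for an insert-only stream, with the O(log n) guesses F_0 = 2^i built in. The salted-hash simulation of h, and the sum-of-squares test (the equality case of Cauchy-Schwarz) used to verify that all elements hashing to the designated bucket were identical, are stand-ins for the hash function and the check the slides allude to.

```python
import random

def l0_sample(stream, n):
    # One copy per guess F_0 = 2^i (the last slide's trick), all run in parallel.
    copies = []
    for i in range(1, n.bit_length() + 1):
        guess = 1 << i
        salt = random.random()
        # hash((salt, j)) % guess == 0 plays the role of "h(j) = 1".
        copies.append({"salt": salt, "guess": guess, "c": 0, "id": 0, "sq": 0})

    for j in stream:
        for cp in copies:
            if hash((cp["salt"], j)) % cp["guess"] == 0:
                cp["c"] += 1
                cp["id"] += j
                cp["sq"] += j * j   # extra counter used only for the check below

    for cp in copies:
        c, idsum, sq = cp["c"], cp["id"], cp["sq"]
        # For an insert-only stream, c * sq == idsum^2 holds exactly when all
        # contributing elements were identical (equality case of Cauchy-Schwarz).
        if c > 0 and c * sq == idsum * idsum:
            return idsum // c   # an (approximately) uniform element of the support
    return None                 # every guess failed; rerun with fresh hash functions

# Usage: the support of this stream is {2, 4, 9}.
print(l0_sample([4, 9, 4, 4, 2, 9, 2], n=16))
```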

Third Idea: Lower Bounds

(Picture: Alice holds x ∈ {0,1}^n, Bob holds y ∈ {0,1}^n.)

• Many space lower bounds in the data stream model use reductions from communication complexity.
• Example: Alice and Bob have x, y ∈ {0,1}^n and Bob wants to decide DISJOINTNESS, i.e., is there an i with x_i = y_i = 1?
• Thm: Any 1/3-error protocol for DISJOINTNESS requires Ω(n) bits of communication.
• Corollary: Any 1/3-error stream algorithm that checks whether a graph is triangle-free needs Ω(n²) bits of memory.

Lower Bound for Triangle Detection

Alice and Bob have X, Y ∈ {0,1}^{n×n}. For Bob to check whether X_ij = Y_ij = 1 for some i, j requires Ω(n²) communication.

Let A be an s-space algorithm that checks for triangles. Consider a 3-layer graph on (U, V, W) with |U| = |V| = |W| = n.

Alice runs A on E_1 = {u_i w_i : 1 ≤ i ≤ n} and E_2 = {u_i v_j : X_ij = 1}, then sends the memory contents to Bob, who continues running A on E_3 = {v_j w_i : Y_ij = 1}.

The graph contains a triangle u_i v_j w_i exactly when X_ij = Y_ij = 1 for some i, j, so the output of A resolves the matrix question and s = Ω(n²).

Useful Communication Results

• Indexing:
  • Alice has x ∈ {0,1}^n, Bob has i ∈ [n]. Bob wants to learn x_i.
  • One-way communication requires Ω(n) bits, even if Bob also knows the first i−1 bits of x.
• Gap-Hamming:
  • Alice and Bob have x, y ∈ {0,1}^n. Distinguish Δ(x,y) < n/2 − √n from Δ(x,y) > n/2 + √n.
  • Requires Ω(n) communication.
• Multi-Party Disjointness:
  • t players have x_1, x_2, ..., x_t ∈ {0,1}^n. Distinguish the case that x_1i = x_2i = ... = x_ti = 1 for some i from the case that the vectors are pairwise orthogonal.
  • Requires Ω(n/t) communication.

Bonus! The Fourth Idea

• Algorithmic tools will only get you so far; sometimes you need to come up with neat ad hoc solutions.
• Graph Distances: Given a stream of edges, approximate the shortest-path distance between any two nodes.
• k-Center: Given a stream of points, find a set of centers that minimizes the maximum distance from a point to its nearest center.

Approximate Distances

The edges define the shortest-path graph metric d_G. An α-spanner of G = (V, E) is a subgraph H = (V, E′) such that for all u, v: d_G(u,v) ≤ d_H(u,v) ≤ α·d_G(u,v).

Algorithm: Let E′ be initially empty; add (u,v) to E′ if currently d_H(u,v) > 2t−1.

Analysis: Each distance increases by at most a factor of 2t−1, and |E′| = O(n^(1+1/t)) because every cycle in H has length > 2t.
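A small Python sketch of this greedy construction, assuming an insert-only edge stream; the bounded BFS used to test whether d_H(u,v) ≤ 2t−1 is written for clarity rather than efficiency.

```python
from collections import defaultdict

def spanner(edge_stream, t):
    # Keep edge (u, v) only if u and v are currently more than 2t-1 apart in H.
    H = defaultdict(set)

    def within(u, v, limit):
        # Bounded BFS in H: is d_H(u, v) <= limit?
        if u == v:
            return True
        frontier, seen = {u}, {u}
        for _ in range(limit):
            frontier = {w for x in frontier for w in H[x] if w not in seen}
            if v in frontier:
                return True
            seen |= frontier
        return False

    for u, v in edge_stream:
        if not within(u, v, 2 * t - 1):
            H[u].add(v)
            H[v].add(u)
    return sorted((u, v) for u in H for v in H[u] if u < v)

# Usage: with t = 2 this keeps a 3-spanner of the streamed graph.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (1, 4)]
print(spanner(edges, t=2))   # [(1, 2), (2, 3), (3, 4)]
```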

k-Center Clustering

A 2-approximation in O(k) space if you already know OPT.
A (2+ε)-approximation in O(k ε⁻¹ log Δ) space if 1 ≤ OPT ≤ Δ.

Better algorithm, O(k ε⁻¹ log ε⁻¹) space: Instantiate the basic algorithm with guesses 1, (1+ε), (1+ε)², ..., 2ε⁻¹.

If guess r stops working at the (j+1)-th point: let q_1, ..., q_k be the centers chosen so far. Then p_1, ..., p_j are all within 2r of some q_i, so the optimal k-center cost for {q_1, ..., q_k, p_{j+1}, ..., p_n} is at most OPT + 2r.
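Here is a minimal Python sketch of the basic rule for a known guess r ≥ OPT, written for Euclidean points (an illustrative assumption): a point becomes a new center only if it is more than 2r from every existing center, and the guess is declared dead if more than k centers are ever needed. The (2+ε)-approximation runs this rule in parallel over the geometric sequence of guesses above.

```python
import math

def k_center_known_opt(points, k, r):
    # Streaming rule for a known guess r >= OPT: a point becomes a new center
    # only if it is more than 2r from every existing center.
    centers = []
    for p in points:
        if all(math.dist(p, c) > 2 * r for c in centers):
            centers.append(p)
            if len(centers) > k:
                return None      # guess r was too small: restart with a larger guess
    return centers               # every streamed point is within 2r of some center

# Usage: two well-separated clusters, k = 2, guessed radius r = 1.
pts = [(0, 0), (0.5, 0.5), (10, 10), (10.5, 9.5), (0.2, 0.1)]
print(k_center_known_opt(pts, k=2, r=1))   # [(0, 0), (10, 10)]
```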
