Sketching and Streaming for Distributions: Piotr Indyk, Andrew McGregor

  1. Sketching and Streaming for Distributions
     Piotr Indyk (Massachusetts Institute of Technology), Andrew McGregor (University of California, San Diego)
     Main Material:
     • Stable distributions, pseudo-random generators, embeddings, and data stream computation. Piotr Indyk (FOCS 2000)
     • Sketching information divergences. Sudipto Guha, Piotr Indyk, Andrew McGregor (COLT 2007)
     • Declaring independence via the sketching of sketches. Piotr Indyk, Andrew McGregor (SODA 2008)

  6. The Problem
     • List of m red values and m green values in [n]: 3,5,3,7,5,4,8,5,3,7,5,4,8,6,3,2,6,4,7,3,4, ...
     • Define distributions $(p_1, \dots, p_n)$ and $(q_1, \dots, q_n)$
     • How "different" are p and q?
       Variational: $\sum_i |p_i - q_i|$
       Kullback-Leibler: $\sum_i p_i \log(p_i / q_i)$
       Hellinger: $\sum_i (\sqrt{p_i} - \sqrt{q_i})^2$
       Euclidean: $\sum_i (p_i - q_i)^2$
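The four measures on this slide can be computed directly when the whole list fits in memory. The following is a minimal sketch, assuming that p and q are the empirical frequencies of the red and green values (the slide does not spell out how the distributions are defined); the function names and the example split into `red` and `green` are hypothetical.

```python
import math
from collections import Counter

def empirical(values, n):
    """Empirical distribution over [n] = {1, ..., n} (assumed definition of p and q)."""
    counts = Counter(values)
    m = len(values)
    return [counts.get(i, 0) / m for i in range(1, n + 1)]

def variational(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def kullback_leibler(p, q):
    total = 0.0
    for a, b in zip(p, q):
        if a == 0:
            continue          # terms with p_i = 0 contribute nothing
        if b == 0:
            return math.inf   # p_i > 0 but q_i = 0: divergence is infinite
        total += a * math.log(a / b)
    return total

def hellinger(p, q):
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Hypothetical split of a short list into red and green values, with n = 8.
red = [3, 5, 3, 7, 5, 4, 8, 5]
green = [3, 7, 5, 4, 8, 6, 3, 2]
p, q = empirical(red, 8), empirical(green, 8)
print(variational(p, q), kullback_leibler(p, q), hellinger(p, q), euclidean(p, q))
```

This baseline stores the full count vectors, which is exactly what the streaming setting rules out when m and n are huge.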

  7. The Problem
     • List of m red values and m green values in [n]: 3,5,3,7,5,4,8,5,3,7,5,4,8,6,3,2,6,4,7,3,4, ...
     • Define distributions $(p_1, \dots, p_n)$ and $(q_1, \dots, q_n)$
     • How "different" are p and q?
       $D_f(p, q) = \sum_i p_i \, f(q_i / p_i)$
       $B_F(p, q) = \sum_i \left[ F(p_i) - F(q_i) - (p_i - q_i) F'(q_i) \right]$
       where f and F are convex and f(1) = 0.
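As a worked check, the named measures from the previous slide can be recovered from these two general forms. The particular choices of f and F below are standard ones supplied here for illustration; the slide itself does not list them.

```latex
\begin{align*}
f(t) = |1 - t| &: \quad D_f(p,q) = \sum_i p_i \left| 1 - \frac{q_i}{p_i} \right| = \sum_i |p_i - q_i|
  && \text{(Variational)} \\
f(t) = -\log t &: \quad D_f(p,q) = -\sum_i p_i \log \frac{q_i}{p_i} = \sum_i p_i \log \frac{p_i}{q_i}
  && \text{(Kullback-Leibler)} \\
f(t) = (1 - \sqrt{t})^2 &: \quad D_f(p,q) = \sum_i p_i \left( 1 - \sqrt{\frac{q_i}{p_i}} \right)^2 = \sum_i \left( \sqrt{p_i} - \sqrt{q_i} \right)^2
  && \text{(Hellinger)} \\
F(x) = x^2 &: \quad B_F(p,q) = \sum_i \left[ p_i^2 - q_i^2 - 2 q_i (p_i - q_i) \right] = \sum_i (p_i - q_i)^2
  && \text{(Euclidean)}
\end{align*}
```

Each f above is convex with f(1) = 0, and F(x) = x^2 is convex, as the slide requires.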

  12. The Catch...
      • What if m and n are huge and you can't store the list?
      • Applications: monitoring internet traffic, I/O-efficient external memory, processing huge log files, database query planning, sensor networks, ...
      • Data Stream Model:
        No control over the order of the stream
        Limited working memory, e.g., polylog(n, m) space
        Limited time to process each element
      • Previous work: quantiles, frequency moments, histograms, clustering, entropy, graph problems... see, e.g., Muthukrishnan, "Data Streams: Algorithms and Applications"

  16. Today's Talk
      • Sketching $L_p$ distances ($0 < p \le 2$):
        • $(1+\epsilon)$-approximation with probability $1-\delta$ in $\tilde{O}(\epsilon^{-2} \ln \delta^{-1})$ space
        • Stable distributions and pseudo-random generators
        • Stable distributions, pseudo-random generators, embeddings & data stream computation (Indyk, FOCS 2000)
      • Impossibility of Extending to Other Divergences:
        • Can we sketch other divergences such as Hellinger?
        • Lower bounds via communication complexity
        • Sketching information divergences (Guha, Indyk, McGregor, COLT 2007)
      • Using sketches to test independence:
        • Testing independence between data streams
        • Declaring independence via the sketching of sketches (Indyk, McGregor, SODA 2008)

  17. 1. Sketching $L_p$ distances: p-stable distributions, pseudo-random generators
      2. The Unsketchables: information divergences, communication complexity
      3. Sketching Sketches: identifying correlations in data streams

  21. Stable Distributions
      • A p-stable distribution $\mu$ has the following property: if $X, Y, Z \sim \mu$ are independent and $a, b \in \mathbb{R}$, then
        $aX + bY \sim (|a|^p + |b|^p)^{1/p} \, Z$
      • Examples:
        Normal(0,1) is 2-stable: density $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
        Cauchy is 1-stable: density $\frac{1}{\pi} \cdot \frac{1}{1 + x^2}$
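The stability property can be checked numerically for both examples. The sketch below is an illustration only: it compares the sample median of |aX + bY| with the value the property predicts, using the fact that the median of |Cauchy| is 1 and the median of |Normal(0,1)| is the 0.75 quantile of the standard normal. The function names, sample size, and seed are arbitrary choices, not taken from the slides.

```python
import math
import random
import statistics

def cauchy(rng):
    """Standard Cauchy sample via the inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))

def median_abs(samples):
    return statistics.median(abs(s) for s in samples)

def check_stability(a, b, trials=100_000, seed=0):
    """Compare observed and predicted medians of |aX + bY| for p = 1 and p = 2."""
    rng = random.Random(seed)
    cauchy_mix = [a * cauchy(rng) + b * cauchy(rng) for _ in range(trials)]
    normal_mix = [a * rng.gauss(0, 1) + b * rng.gauss(0, 1) for _ in range(trials)]
    normal_median_abs = statistics.NormalDist().inv_cdf(0.75)  # median of |N(0,1)|
    return {
        "cauchy_observed": median_abs(cauchy_mix),
        "cauchy_predicted": (abs(a) + abs(b)) * 1.0,            # p = 1, median(|Cauchy|) = 1
        "normal_observed": median_abs(normal_mix),
        "normal_predicted": math.hypot(a, b) * normal_median_abs,  # p = 2
    }

print(check_stability(3.0, -4.0))
```

Medians are used instead of means because the Cauchy distribution has no finite mean, which is also why the algorithm on the next slides takes a median.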

  27. Approximating $L_1$ and $L_2$
      • Let $\mu$ be a p-stable distribution ($0 < p \le 2$)
      • Ideal Algorithm:
        For i = 1 to k:
          Let x be a length-n vector with $x_j \sim \mu$
          Compute $t_i = |x \cdot (p - q)|$
        Return $\mathrm{median}(t_1, t_2, \dots, t_k) / \mathrm{median}(|\mu|)$
        Easy to compute $x \cdot (p - q)$: for stream 3,5,3,7,5, ... compute $x_3 - x_5 + x_3 - x_7 - x_5 - \dots$ (add $x_j$ for a red value j, subtract it for a green value j) and scale.
      • Lemma: Returns $(1 \pm \epsilon) L_p(p - q)$ with prob. $1 - \delta$ if $k = \tilde{O}(\epsilon^{-2} \ln \delta^{-1})$.
      • Proof:
        • Each $t_i \sim L_p(p - q) \, |\mu|$ by the p-stability property.
        • Apply Chernoff bounds.
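A minimal sketch of the ideal algorithm for p = 1, using Cauchy variables (for which median(|μ|) = 1). It assumes that p and q are the empirical frequencies of the red and green values and that each stream element arrives tagged with its colour; the function name, the tuple format, and the example stream are hypothetical. As on the slide, it stores every random vector explicitly, so it is not yet small-space.

```python
import math
import random
import statistics

def cauchy(rng):
    """Standard Cauchy sample via the inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))

def l1_sketch_estimate(stream, n, m, k, seed=0):
    """Estimate L1(p - q) in one pass.

    stream: iterable of (colour, value) with colour "red" or "green"
            and value in 1..n (assumed input format).
    m:      number of values of each colour, used for scaling.
    """
    rng = random.Random(seed)
    # k independent length-n vectors with Cauchy entries (stored explicitly here).
    x = [[cauchy(rng) for _ in range(n)] for _ in range(k)]
    dots = [0.0] * k  # dots[i] accumulates x_i . (p - q)
    for colour, value in stream:
        sign = 1.0 if colour == "red" else -1.0
        for i in range(k):
            dots[i] += sign * x[i][value - 1] / m
    # t_i = |x_i . (p - q)|; dividing by median(|Cauchy|) = 1 is a no-op.
    return statistics.median(abs(d) for d in dots)

red = [3, 5, 3, 7, 5, 4, 8, 5]
green = [3, 7, 5, 4, 8, 6, 3, 2]
stream = [("red", v) for v in red] + [("green", v) for v in green]
print(l1_sketch_estimate(stream, n=8, m=8, k=4000))  # exact L1 distance here is 0.5
```

Only the k running dot products depend on the stream, which is what makes the estimate computable in a single pass regardless of the order of the elements.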

  36. Sketches and Space
      • Sketch/Embedding into Small Dimension:
        • Let $x_1, x_2, \dots, x_k$ be length-n vectors with each entry $x_{i,j} \sim \mu$
        • Let $C(y) = (x_1 \cdot y, \dots, x_k \cdot y)$
        • Approximate $L_1(p - q)$ from $C(p)$ and $C(q)$
        • CAUTION: Not an embedding into a normed space.
      • Can we also construct the sketch in small space?
        • Storing all $x_i$ requires $\Omega(nk)$ space.
        • Generate $x_i$ with Nisan's pseudo-random generator.
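The sketch C is linear, so C(p) - C(q) = C(p - q) and the distance can be estimated from the two sketches alone. The code below illustrates this for p = 1 and also regenerates each entry of the random vectors on demand from a short seed instead of storing them. Note the assumption: a seeded call to Python's `random.Random` is only a stand-in used to show the "regenerate, don't store" idea. It is not Nisan's pseudo-random generator and carries none of its guarantees. All names are hypothetical.

```python
import math
import random
import statistics

def entry(seed, i, j):
    """Entry j of vector x_i, regenerated on demand from (seed, i, j).

    Stand-in for a proper pseudo-random generator; NOT Nisan's construction.
    """
    rng = random.Random(f"{seed}:{i}:{j}")
    return math.tan(math.pi * (rng.random() - 0.5))  # standard Cauchy

def sketch(y, k, seed=0):
    """C(y) = (x_1 . y, ..., x_k . y), skipping zero coordinates of y."""
    return [
        sum(entry(seed, i, j) * yj for j, yj in enumerate(y) if yj != 0)
        for i in range(k)
    ]

def estimate_l1(cp, cq):
    """Median of |C(p)_i - C(q)_i|; median(|Cauchy|) = 1, so no rescaling."""
    return statistics.median(abs(a - b) for a, b in zip(cp, cq))

p = [0.5, 0.25, 0.25, 0.0]
q = [0.25, 0.25, 0.0, 0.5]
cp, cq = sketch(p, k=2000), sketch(q, k=2000)
print(estimate_l1(cp, cq))  # exact L1(p - q) is 1.0 for this example
```

Both sketches must be built with the same seed, since the estimate relies on the same vectors being applied to p and q. The estimator is a median rather than a norm of C(p) - C(q), which is one way to read the slide's caution that this is not an embedding into a normed space.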
