
Data Streams Tutorial
Andrew McGregor, University of Massachusetts, Amherst

Data Stream Model: [Morris 78] [Munro, Paterson 78] [Flajolet, Martin 85] [Alon, Matias, Szegedy 96] [Henzinger, Raghavan, Rajagopalan 98]


First Idea: Sketches

(Picture: the sketch is a linear map, Z · (f_1, f_2, ..., f_n)^T = (t_1, ..., t_k)^T, where Z has k rows and n columns.)

• Algorithm uses a (random) projection matrix Z such that the relevant properties of f can be estimated from the sketch Zf.
• Easy to Update: On seeing "i", add the i-th column of Z to the sketch.
• Store Matrix Implicitly: Need to be able to efficiently generate any entry of Z from a "small" random seed.
• Gives an Õ(k)-space algorithm, under seed and precision assumptions.

Algorithm for Estimating F_2

As before, the sketch is t = Zf with t = (t_1, ..., t_k); the square of each sketch entry is concentrated around F_2.

Consider a row z of the projection matrix. Let the entries of z be uniform in {-1, 1}, chosen with 4-wise independence, and let t = z·f.

Expectation: E(t²) = Σ_{i,j} E(z_i z_j) f_i f_j = F_2
Variance: Var(t²) ≤ Σ_{i,j,k,l} E(z_i z_j z_k z_l) f_i f_j f_k f_l < 6 F_2²

By Chebyshev (together with a standard median-of-means step), setting k = O(ε⁻² log δ⁻¹) ensures that, with probability 1-δ, the average of the squared sketch entries is (1±ε) F_2.
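Below is a small Python sketch (not from the tutorial) of this estimator. The ±1 entries of each row are generated with (approximately) 4-wise independent signs from a degree-3 polynomial over a prime field, the sketch t = Zf is updated one column at a time, and the average of the squared entries is returned; the class name, the choice of prime, and the plain averaging with no median step are illustrative simplifications.

```python
import random

# A degree-3 polynomial over a prime field gives (approximately) 4-wise
# independent +/-1 values from a small seed; the prime P is an illustrative
# choice and is assumed to exceed the universe size n.
P = 2_147_483_647

class F2Estimator:
    def __init__(self, k):
        # One random polynomial (the "small seed") per row of Z.
        self.rows = [tuple(random.randrange(P) for _ in range(4)) for _ in range(k)]
        self.t = [0] * k   # the sketch t = Zf

    def _sign(self, coeffs, i):
        a, b, c, d = coeffs
        v = (a + b * i + c * i * i + d * i * i * i) % P
        return 1 if v % 2 == 0 else -1   # entry z_i in {-1, +1}

    def update(self, i):
        # On seeing item i, add the i-th column of Z to the sketch.
        for row, coeffs in enumerate(self.rows):
            self.t[row] += self._sign(coeffs, i)

    def estimate(self):
        # Average of the squared sketch entries concentrates around F_2.
        return sum(x * x for x in self.t) / len(self.t)

# Usage: the stream below has frequencies {5: 3, 2: 2, 9: 1}, so F_2 = 14.
est = F2Estimator(k=400)
for item in [5, 2, 5, 9, 2, 5]:
    est.update(item)
print(est.estimate())   # close to 14 with high probability
```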

Second Idea: Sampling

• Let's sample from S = [a_1, a_2, a_3, ..., a_m], where each a_i ∈ [n].
• Distribution Sampling: Return i with probability f_i/m.
• Universe Sampling: Return (i, f_i) where i is chosen uniformly from [n].
• AMS Sampling: Return (i, r) with i chosen with probability f_i/m and r chosen uniformly from [f_i].
• One-pass implementation: sample a_j for j uniform in [m], let i = a_j, and compute r = |{j′ ≥ j : a_{j′} = a_j}|.
• Useful for estimating Σ_i g(f_i) because E[m·(g(r) − g(r−1))] = Σ_i g(f_i); see the sketch after this list.
• L_p Sampling: Return i with probability f_i^p / F_p.
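As a concrete illustration (not from the slides), here is a minimal Python sketch of the one-pass AMS sampler: a uniformly random position j is kept by reservoir sampling, and r counts the occurrences of a_j from position j onward. The function names and the averaging over repeated trials, which simply re-reads the list here, are illustrative choices.

```python
import random

def ams_sample(stream):
    # One pass: keep a uniformly random position j by reservoir sampling,
    # set i = a_j, and count the occurrences of a_j from position j onward.
    i, r, m = None, 0, 0
    for item in stream:
        m += 1
        if random.randrange(m) == 0:   # position m is kept with probability 1/m
            i, r = item, 1             # restart the suffix count at the new sample
        elif item == i:
            r += 1                     # another occurrence after the sampled one
    return i, r, m

def estimate_sum_g(stream, g, trials=1000):
    # Average independent estimates of sum_i g(f_i); for simplicity this
    # re-reads the list, whereas a true one-pass algorithm would run the
    # trials in parallel over the same stream.
    total = 0.0
    for _ in range(trials):
        _, r, m = ams_sample(stream)
        total += m * (g(r) - g(r - 1))
    return total / trials

# Usage: with g(x) = x^2 this estimates F_2 = 3^2 + 2^2 + 1^2 = 14.
data = [5, 2, 5, 9, 2, 5]
print(estimate_sum_g(data, lambda x: x * x))
```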

L_0 Sampling

Suppose we know F_0. Pick a hash function h: [n] → [F_0].

Algorithm: Maintain values c and id, initially 0.
For each j in the stream: if h(j) = 1, set c ← c+1 and id ← id+j.
Return id/c if all elements hashing to 1 were the same.

Claim: This happens with constant probability.
Note: We also need a way to check that the elements hashing to 1 really were the same.

To drop the assumption that F_0 is known, run O(log n) copies guessing F_0 = 2^i; at least one instantiation works with constant probability. The algorithm is a sketch and works with deletions!
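A minimal Python sketch of this procedure for an insert-only stream, with the O(log n) guesses F_0 = 2^i built in. The salted-hash simulation of h, and the sum-of-squares test (the equality case of Cauchy-Schwarz) used to verify that all elements hashing to the designated bucket were identical, are stand-ins for the hash function and the check the slides allude to.

```python
import random

def l0_sample(stream, n):
    # One copy per guess F_0 = 2^i (the last slide's trick), all run in parallel.
    copies = []
    for i in range(1, n.bit_length() + 1):
        guess = 1 << i
        salt = random.random()
        # hash((salt, j)) % guess == 0 plays the role of "h(j) = 1".
        copies.append({"salt": salt, "guess": guess, "c": 0, "id": 0, "sq": 0})

    for j in stream:
        for cp in copies:
            if hash((cp["salt"], j)) % cp["guess"] == 0:
                cp["c"] += 1
                cp["id"] += j
                cp["sq"] += j * j   # extra counter used only for the check below

    for cp in copies:
        c, idsum, sq = cp["c"], cp["id"], cp["sq"]
        # For an insert-only stream, c * sq == idsum^2 holds exactly when all
        # contributing elements were identical (equality case of Cauchy-Schwarz).
        if c > 0 and c * sq == idsum * idsum:
            return idsum // c   # an (approximately) uniform element of the support
    return None                 # every guess failed; rerun with fresh hash functions

# Usage: the support of this stream is {2, 4, 9}.
print(l0_sample([4, 9, 4, 4, 2, 9, 2], n=16))
```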

Third Idea: Lower Bounds

(Picture: Alice holds x ∈ {0,1}^n, Bob holds y ∈ {0,1}^n.)

• Many space lower bounds in the data stream model use reductions from communication complexity.
• Example: Alice and Bob have x, y ∈ {0,1}^n and Bob wants to decide DISJOINTNESS, i.e., is there an i with x_i = y_i = 1?
• Thm: Any 1/3-error protocol for DISJOINTNESS requires Ω(n) bits of communication.
• Corollary: Any 1/3-error stream algorithm that checks whether a graph is triangle-free needs Ω(n²) bits of memory.

Lower Bound for Triangle Detection

Alice and Bob have X, Y ∈ {0,1}^{n×n}. For Bob to check whether X_ij = Y_ij = 1 for some i, j requires Ω(n²) communication.

Let A be an s-space algorithm that checks for triangles. Consider a 3-layer graph on (U, V, W) with |U| = |V| = |W| = n.

Alice runs A on E_1 = {u_i w_i : 1 ≤ i ≤ n} and E_2 = {u_i v_j : X_ij = 1}, then sends the memory contents to Bob, who continues running A on E_3 = {v_j w_i : Y_ij = 1}.

The graph contains a triangle u_i v_j w_i exactly when X_ij = Y_ij = 1 for some i, j, so the output of A resolves the matrix question and s = Ω(n²).

Useful Communication Results

• Indexing:
  • Alice has x ∈ {0,1}^n, Bob has i ∈ [n]. Bob wants to learn x_i.
  • One-way communication requires Ω(n) bits, even if Bob also knows the first i−1 bits of x.
• Gap-Hamming:
  • Alice and Bob have x, y ∈ {0,1}^n. Distinguish Δ(x,y) < n/2 − √n from Δ(x,y) > n/2 + √n.
  • Requires Ω(n) communication.
• Multi-Party Disjointness:
  • t players have x_1, x_2, ..., x_t ∈ {0,1}^n. Distinguish the case that x_1i = x_2i = ... = x_ti = 1 for some i from the case that the vectors are pairwise orthogonal.
  • Requires Ω(n/t) communication.

Bonus! The Fourth Idea

• Algorithmic tools will only get you so far; sometimes you need to come up with neat ad hoc solutions.
• Graph Distances: Given a stream of edges, approximate the shortest-path distance between any two nodes.
• k-Center: Given a stream of points, find a set of centers that minimizes the maximum distance from a point to its nearest center.

Approximate Distances

The edges define the shortest-path graph metric d_G. An α-spanner of G = (V, E) is a subgraph H = (V, E′) such that for all u, v: d_G(u,v) ≤ d_H(u,v) ≤ α·d_G(u,v).

Algorithm: Let E′ be initially empty; add (u,v) to E′ if currently d_H(u,v) > 2t−1.

Analysis: Each distance increases by at most a factor of 2t−1, and |E′| = O(n^(1+1/t)) because every cycle in H has length > 2t.
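A small Python sketch of this greedy construction, assuming an insert-only edge stream; the bounded BFS used to test whether d_H(u,v) ≤ 2t−1 is written for clarity rather than efficiency.

```python
from collections import defaultdict

def spanner(edge_stream, t):
    # Keep edge (u, v) only if u and v are currently more than 2t-1 apart in H.
    H = defaultdict(set)

    def within(u, v, limit):
        # Bounded BFS in H: is d_H(u, v) <= limit?
        if u == v:
            return True
        frontier, seen = {u}, {u}
        for _ in range(limit):
            frontier = {w for x in frontier for w in H[x] if w not in seen}
            if v in frontier:
                return True
            seen |= frontier
        return False

    for u, v in edge_stream:
        if not within(u, v, 2 * t - 1):
            H[u].add(v)
            H[v].add(u)
    return sorted((u, v) for u in H for v in H[u] if u < v)

# Usage: with t = 2 this keeps a 3-spanner of the streamed graph.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (1, 4)]
print(spanner(edges, t=2))   # [(1, 2), (2, 3), (3, 4)]
```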

k-Center Clustering

A 2-approximation in O(k) space if you already know OPT.
A (2+ε)-approximation in O(k ε⁻¹ log Δ) space if 1 ≤ OPT ≤ Δ.

Better algorithm, O(k ε⁻¹ log ε⁻¹) space: Instantiate the basic algorithm with guesses 1, (1+ε), (1+ε)², ..., 2ε⁻¹.

If guess r stops working at the (j+1)-th point: let q_1, ..., q_k be the centers chosen so far. Then p_1, ..., p_j are all within 2r of some q_i, so the optimal k-center cost for {q_1, ..., q_k, p_{j+1}, ..., p_n} is at most OPT + 2r.
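Here is a minimal Python sketch of the basic rule for a known guess r ≥ OPT, written for Euclidean points (an illustrative assumption): a point becomes a new center only if it is more than 2r from every existing center, and the guess is declared dead if more than k centers are ever needed. The (2+ε)-approximation runs this rule in parallel over the geometric sequence of guesses above.

```python
import math

def k_center_known_opt(points, k, r):
    # Streaming rule for a known guess r >= OPT: a point becomes a new center
    # only if it is more than 2r from every existing center.
    centers = []
    for p in points:
        if all(math.dist(p, c) > 2 * r for c in centers):
            centers.append(p)
            if len(centers) > k:
                return None      # guess r was too small: restart with a larger guess
    return centers               # every streamed point is within 2r of some center

# Usage: two well-separated clusters, k = 2, guessed radius r = 1.
pts = [(0, 0), (0.5, 0.5), (10, 10), (10.5, 9.5), (0.2, 0.1)]
print(k_center_known_opt(pts, k=2, r=1))   # [(0, 0), (10, 10)]
```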
