B669 Sublinear Algorithms for Big Data, Qin Zhang. Part 1: Sublinear in Space


  1. B669 Sublinear Algorithms for Big Data. Qin Zhang

  2. Part 1: Sublinear in Space

  3. The model and challenge. The data stream model (Alon, Matias and Szegedy 1996). (Figure: items $a_1, a_2, \ldots, a_n$ stream past a CPU with attached RAM.) Why hard? We cannot store everything. Applications: Internet routers, stock data, ad auctions, flight logs on tape, etc. (The next 4 slides are courtesy of Jeff Phillips.)

  4. Network routers. (Figure labels: packets, Internet, router, limited space.)
     • Data per day: at least 1 terabyte.
     • A packet takes 8 nanoseconds to pass through the router.
     • A few million packets per second.
     What statistics can we keep on the data? For example, we may want to detect anomalies for security.

  5. Telephone switch. (Figure labels: txt, msg, switch, limited space.) Cell phones connect through switches.
     • Each message: 1000 bytes.
     • 500 million calls / day.
     • 1 Terabyte per month second
     Search for characteristics of dropped calls?

  6. Ad auction. (Figure labels: page view, keyword search, ad click, ad served, server, limited space, delivery model.) Serving ads on the web: Google, Yahoo!, Microsoft.
     • Yahoo.com viewed 77 trillion times.
     • 2 million / hour.
     • Each page serves ads; which ones?
     How to update the ad delivery model?

  7. Flight logs on tape. (Figure labels: CPU, statistics.) All airplane logs over Washington, DC.
     • About 500 to 1000 flights per day.
     • 50 years, about 9 million flights in total.
     • Each flight has a trajectory, passenger count, and control dialog.
     Stored on tape. Can only make 1 (or $O(1)$) pass! What statistics can be gathered?

  10. (Last lecture) Maintain a sample for sliding windows. Task: find a uniform sample from the last $w$ items.
      Algorithm:
      – For each $x_i$, pick a random value $v_i \in (0, 1)$.
      – In a window $\langle x_{j-w+1}, \ldots, x_j \rangle$, return the $x_i$ with the smallest $v_i$.
      – To do this, maintain the set of all $x_i$ in the sliding window whose $v_i$ is minimal among subsequent values.
      Correctness: obvious, since each item in the window is equally likely to hold the smallest $v_i$.
      Space (expected): $1/w + 1/(w-1) + \cdots + 1/1 = H_w = O(\log w)$.
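
The sliding-window sampler above can be sketched in Python as follows. This is a minimal illustration, not the lecture's code: the class name `SlidingWindowSampler`, its methods, and the use of a deque are assumptions made for the example. The deque holds exactly the items whose random value is smaller than that of every later item, so its front is always the current sample.

```python
import random
from collections import deque


class SlidingWindowSampler:
    """Uniform sample from the last w items using random priorities (illustrative sketch)."""

    def __init__(self, w, rng=None):
        self.w = w
        self.rng = rng or random.Random()
        self.t = 0            # number of items seen so far
        self.cands = deque()  # (index, priority, item); priorities increase front to back

    def add(self, item):
        v = self.rng.random()
        # An older item with a larger priority can never be the window minimum again.
        while self.cands and self.cands[-1][1] >= v:
            self.cands.pop()
        self.cands.append((self.t, v, item))
        self.t += 1
        # Drop the front candidate once it has slid out of the last w items.
        while self.cands[0][0] < self.t - self.w:
            self.cands.popleft()

    def sample(self):
        """Return the item with the smallest priority in the current window."""
        return self.cands[0][2]
```

Calling `add` once per stream item and `sample` at any time returns an item from the last `w` items; the number of stored candidates is what the expected-space bound on the slide counts.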

  11. §1.0 An overview of problems

  14. Statistics. Denote the stream by $A = a_1, a_2, \ldots, a_m$, where $m$ is the length of the stream, which is unknown at the beginning. Let $n$ be the size of the item universe $[n]$. Let $f_i$ be the frequency of item $i$ in the stream. On seeing $a_j = (i, \Delta)$, update $f_i \leftarrow f_i + \Delta$ (special case: $\Delta \in \{1, -1\}$, corresponding to insertions/deletions).
      Entropy: the empirical entropy of the data set is $H(A) = \sum_{i \in [n]} \frac{f_i}{m} \log \frac{m}{f_i}$. App: very useful in "change" (e.g., anomalous events) detection.
      Frequency moments: $F_p = \sum_i f_i^p$.
      • $F_0$: number of distinct items.
      • $F_1$: total number of items.
      • $F_2$: size of the self-join.
      General $F_p$ ($p > 1$): a good measure of the skewness of the data.
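
To make the definitions concrete, here is a small Python sketch that computes these statistics exactly by storing every frequency. It is a linear-space baseline for checking small examples, not a streaming algorithm. The function names are illustrative, the logarithm is taken base 2 (the slide does not fix a base), and all final frequencies are assumed to be positive.

```python
import math
from collections import defaultdict


def frequencies(stream):
    """Apply updates (item, delta) and return the non-zero frequencies f_i."""
    f = defaultdict(int)
    for item, delta in stream:
        f[item] += delta
    return {i: c for i, c in f.items() if c != 0}


def empirical_entropy(f):
    """H(A) = sum_i (f_i / m) * log2(m / f_i), with m = sum_i f_i (assumes f_i > 0)."""
    m = sum(f.values())
    return sum((c / m) * math.log2(m / c) for c in f.values())


def frequency_moment(f, p):
    """F_p = sum_i f_i^p; F_0 counts the items with non-zero frequency."""
    if p == 0:
        return len(f)
    return sum(c ** p for c in f.values())
```

For example, the stream `[("a", 1), ("b", 1), ("a", 1), ("c", 1), ("c", -1)]` leaves frequencies `a: 2, b: 1`, so $F_0 = 2$, $F_1 = 3$ and $F_2 = 5$.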

  16. Statistics (cont.)
      Heavy hitters: the set of items whose frequency is at least a threshold. (Figure: frequencies of items 1 to 8 with $|A| = m$ and threshold $0.01m$; items reaching the threshold are labeled "Included".) App: popular IP destinations, ...
      Quantile: the $\phi$-quantile of $A$ is some $x$ such that there are at most $\phi m$ items of $A$ that are smaller than $x$ and at most $(1 - \phi) m$ items of $A$ that are greater than $x$.
      All-quantile: a data structure from which the $\phi$-quantile for any $0 \le \phi \le 1$ can be extracted. App: distribution of package sizes, ...
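
As a reference point for these definitions, the following Python sketch computes heavy hitters and a $\phi$-quantile exactly from a fully stored, insert-only list of items. The function names and the `threshold_fraction` parameter are assumptions for illustration; a streaming algorithm would approximate the same answers in much less space.

```python
from collections import Counter


def heavy_hitters(items, threshold_fraction):
    """Items whose frequency is at least threshold_fraction * m, where m = len(items)."""
    counts = Counter(items)
    m = len(items)
    return {x for x, c in counts.items() if c >= threshold_fraction * m}


def quantile(items, phi):
    """Return some x with at most phi*m items smaller and at most (1-phi)*m items greater."""
    s = sorted(items)
    m = len(s)
    k = min(int(phi * m), m - 1)  # rank floor(phi * m), clamped so that phi = 1 is valid
    return s[k]
```

Choosing rank $\lfloor \phi m \rfloor$ in sorted order satisfies both conditions of the definition: at most $\phi m$ items precede it and at most $(1 - \phi) m$ items follow it.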

  17. Statistics (cont.)
      $L_p$ sampling: let $x \in \mathbb{R}^n$ be a non-zero vector. For $p > 0$ we call the $L_p$ distribution corresponding to $x$ the distribution on $[n]$ that takes $i$ with probability $\frac{|x_i|^p}{\|x\|_p^p}$, with $\|x\|_p = \left(\sum_{i \in [n]} |x_i|^p\right)^{1/p}$. In particular, for $p = 0$, $L_0$ sampling selects an element uniformly at random from the non-zero coordinates of $x$.
      App: an extremely useful tool for constructing graph sketches, finding duplicates, etc.
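
The $L_p$ distribution can be written out directly when the whole vector $x$ is available. The Python sketch below does that; it only illustrates the target distribution and is not a small-space streaming sampler. The function names are hypothetical and coordinates are 0-indexed.

```python
import random


def lp_distribution(x, p):
    """Map each index i to its L_p sampling probability |x_i|^p / ||x||_p^p."""
    if p == 0:
        # L_0: uniform over the non-zero coordinates.
        support = [i for i, xi in enumerate(x) if xi != 0]
        return {i: 1.0 / len(support) for i in support}
    weights = [abs(xi) ** p for xi in x]
    total = sum(weights)
    return {i: w / total for i, w in enumerate(weights) if w > 0}


def lp_sample(x, p, rng=random):
    """Draw one index from the L_p distribution of x."""
    dist = lp_distribution(x, p)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]
```

For `x = [3, 0, -4]` and `p = 2`, index 0 is drawn with probability 9/25 and index 2 with probability 16/25; with `p = 0` each of the two non-zero coordinates has probability 1/2.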

  20. Graphs. Denote the stream by $A = a_1, a_2, \ldots, a_m$, where $a_i = ((u_i, v_i), \text{insert/delete})$ and $(u_i, v_i)$ is an edge.
      Connectivity: test if a graph is connected.
      Matching: estimate the size of the maximum matching of a graph.
      Diameter: compute the diameter of a graph (that is, the maximum distance between two nodes).
      Triangle counting: compute the number of triangles of a graph. App: useful for finding communities in a social network (fraction of $v$'s neighbors who are neighbors themselves).
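
For small inputs, the graph problems can be checked against an exact baseline that simply materializes the final graph. The Python sketch below does this for connectivity and triangle counting; the function names and the `"insert"`/`"delete"` strings are illustrative assumptions, and vertices are assumed to be comparable (for example, integers). It stores every edge, so it is not sublinear in space.

```python
from collections import defaultdict, deque


def final_graph(stream):
    """Apply ((u, v), op) updates, op in {"insert", "delete"}, and return adjacency sets."""
    adj = defaultdict(set)
    for (u, v), op in stream:
        if op == "insert":
            adj[u].add(v)
            adj[v].add(u)
        else:
            adj[u].discard(v)
            adj[v].discard(u)
    return adj


def is_connected(adj, vertices):
    """Breadth-first search from one vertex; connected iff every vertex is reached."""
    vertices = list(vertices)
    if not vertices:
        return True
    seen = {vertices[0]}
    queue = deque([vertices[0]])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(set(vertices))


def count_triangles(adj):
    """Count each triangle u < v < w once via common neighbors of the edge (u, v)."""
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                total += sum(1 for w in adj[u] & adj[v] if v < w)
    return total
```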

  22. Graphs (cont.)
      Spanner: given a graph $G = (V, E)$, we say that a subgraph $H = (V, E')$ is an $\alpha$-spanner for $G$ if $\forall u, v \in V$, $d_G(u, v) \le d_H(u, v) \le \alpha \cdot d_G(u, v)$. That is, the subgraph (approximately) maintains pairwise distances.
      Graph sparsification: given a graph $G = (V, E)$, denote the minimum cut of $G$ by $\lambda(G)$, and by $\lambda_A(G)$ the capacity of the cut $(A, V \setminus A)$. We say that a weighted subgraph $H = (V, E', w)$ is an $\epsilon$-sparsification for $G$ if $\forall A \subset V$, $(1 - \epsilon) \lambda_A(G) \le \lambda_A(H) \le (1 + \epsilon) \lambda_A(G)$.
      App: synopses for massive graphs. A graph synopsis is a subgraph of much smaller size that keeps properties of the original graph.
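
Both definitions can be verified by brute force on small examples. The Python sketch below checks the $\alpha$-spanner condition for unweighted graphs with vertices `0..n-1` and evaluates the capacity $\lambda_A$ of a single cut in a weighted edge list. The helper names and input formats are assumptions for illustration; the code checks the definitions and does not construct a spanner or a sparsifier.

```python
from collections import deque


def bfs_distances(n, edges, source):
    """Hop distances from source in an unweighted, undirected graph on vertices 0..n-1."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [float("inf")] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if dist[w] == float("inf"):
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def is_spanner(n, edges_g, edges_h, alpha):
    """True iff H is a subgraph of G and d_H(u, v) <= alpha * d_G(u, v) for all u, v."""
    if not set(map(frozenset, edges_h)) <= set(map(frozenset, edges_g)):
        return False
    for s in range(n):
        dg = bfs_distances(n, edges_g, s)
        dh = bfs_distances(n, edges_h, s)
        if any(dh[t] > alpha * dg[t] for t in range(n)):
            return False
    return True


def cut_capacity(weighted_edges, side_a):
    """Total weight of edges (u, v, w) with exactly one endpoint in side_a."""
    return sum(w for u, v, w in weighted_edges if (u in side_a) != (v in side_a))
```

The lower bound $d_G(u, v) \le d_H(u, v)$ holds automatically for any subgraph, which is why only the upper bound is tested. For instance, removing one edge from a 4-cycle leaves a path that is a 3-spanner but not a 2-spanner.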

  25. Geometry. Denote the stream by $A = a_1, a_2, \ldots, a_m$, where $a_i = (\text{location}, \text{ins/del})$.
      Earth-mover distance: given two multisets $A, B$ of the same size in the grid $[\Delta]^2$, the earth-mover distance is defined as the minimum cost of a perfect matching between points in $A$ and $B$: $\mathrm{EMD}(A, B) = \min_{\pi : A \to B \text{ a bijection}} \sum_{a \in A} \|a - \pi(a)\|$. App: a good measure of the similarity of two images.
      Clustering ($k$-center): cluster a set of points $X = (x_1, x_2, \ldots, x_m)$ into clusters $c_1, c_2, \ldots, c_k$ with representatives $r_1 \in c_1, r_2 \in c_2, \ldots, r_k \in c_k$ to minimize $\max_i \min_j d(x_i, r_j)$. App: (see wiki page).
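
Both objectives can be evaluated directly on tiny inputs. The Python sketch below computes the earth-mover distance by trying every bijection (so it is only usable for a handful of points) and evaluates the $k$-center cost of a given set of representatives. The Euclidean norm is an assumption, since the slide does not say which norm is meant, and the function names are illustrative.

```python
import math
from itertools import permutations


def emd(points_a, points_b):
    """Minimum total Euclidean distance over all bijections from points_a to points_b."""
    if len(points_a) != len(points_b):
        raise ValueError("EMD is defined here for multisets of the same size")
    return min(
        sum(math.dist(a, b) for a, b in zip(points_a, perm))
        for perm in permutations(points_b)
    )


def k_center_cost(points, representatives):
    """max over points of the distance to the nearest representative."""
    return max(min(math.dist(x, r) for r in representatives) for x in points)
```

The brute-force matching runs in time proportional to the number of permutations; it is meant only to make the definition executable, not to suggest how the quantity is computed on a stream.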

  28. Strings. Denote the stream by $A = a_1, a_2, \ldots, a_m$, where $a_i = (i, \text{ins/del})$.
      Distance to sortedness: $\mathrm{LIS}(A)$ = length of the longest increasing subsequence of the sequence $A$. $\mathrm{DistSort}(A)$ = minimum number of elements that must be deleted from $A$ to get a sorted sequence = $|A| - \mathrm{LIS}(A)$. App: a good measure of network latency.
      Edit distance: given two strings $A$ and $B$, the minimum number of insertions/deletions/substitutions needed to convert $A$ to $B$. App: a standard measure of the similarity of two strings/documents.
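
The two string measures have standard exact algorithms that are handy as ground truth when the full input fits in memory. The Python sketch below uses patience sorting with binary search for LIS and the classic dynamic program for edit distance; neither is a streaming algorithm. The function names are illustrative, and "increasing" is taken to mean strictly increasing, which is an assumption.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []  # tails[k] = smallest possible last element of an increasing run of length k+1
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)


def dist_sort(seq):
    """DistSort(A) = |A| - LIS(A)."""
    return len(seq) - lis_length(seq)


def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]
```

For example, `[3, 1, 2, 5, 4]` has a longest increasing subsequence of length 3, so two deletions sort it, and turning "kitten" into "sitting" takes three edits.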
