
Sublinear Algorithms for Big Data. Qin Zhang. Part 3: Sublinear in Time (PowerPoint presentation).



  1. Sublinear Algorithms for Big Data. Qin Zhang.

  2. Part 3: Sublinear in Time

  3. Sublinear in time. Given a social network graph, if we have no time to ask everyone, can we still compute something non-trivial? For example, the average number of friends per individual?

  4. Average degree of a graph. Problem definition: Given a simple graph G = (V, E) (no parallel edges, no self-loops), its average degree is d̄ = (∑_{v∈V} d(v)) / |V|. Representation of G: degree + adjacency list. Our algorithms only make the following operations (queries): • Degree queries: on v, return d(v). • Neighbor queries: on (v, j), return the j-th neighbor of v.

  6. Naive approach fails. Naive sampling: pick a set S of s random nodes and output (∑_{v∈S} d(v)) / s. How large must s be to get an O(1) multiplicative approximation? Ω(n)! In general, given n numbers, estimating their average requires Ω(n) queries. But maybe degree sequences are special, and we can make use of that? • (n − 1, 0, …, 0) is NOT possible as a degree sequence. • (n − 1, 1, …, 1) is possible.
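A small experiment (not from the slides) illustrates why the possible degree sequence (n − 1, 1, …, 1), i.e. a star graph, already fools naive sampling by a factor of about 2: almost all of the average degree is contributed by one node that a small sample almost never hits.

```python
import random

# Degree sequence (n-1, 1, ..., 1): a star graph. The true average degree
# is 2(n-1)/n, i.e. about 2, but almost all of it comes from one node.
n = 10**6
degrees = [n - 1] + [1] * (n - 1)
true_avg = sum(degrees) / n

# Naive sampling: average the degrees of s uniformly random nodes.
rng = random.Random(0)
s = 100
estimate = sum(rng.choice(degrees) for _ in range(s)) / s

# Unless the sample happens to hit the center (probability s/n = 1e-4),
# the estimate is exactly 1 -- off from the true average by a factor ~2.
print(true_avg, estimate)
```

To beat factor 2 with naive sampling you would have to hit the center, which takes Ω(n) samples in expectation.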

  9. Some lower bounds for approximation. An extreme case: a graph with 0 edges vs. a graph with 1 edge. Distinguishing them (i.e., getting any multiplicative approximation) requires Ω(n) queries. Another example: • an n-cycle, vs. • an (n − c√n)-cycle plus a c√n-clique. Finding a clique node requires Ω(√n) queries. We will assume the graph has Ω(n) edges from now on.

  12. (2 + ε)-approximation

  13. The algorithm. Algorithm: 1. Take 8/ε subsets S_1, S_2, …, S_{8/ε} independently at random from V, each of size Θ(√n / ε^{O(1)}). 2. Output the smallest number in {d̄_{S_1}, d̄_{S_2}, …, d̄_{S_{8/ε}}}, where d̄_{S_i} is the average degree of the nodes in S_i. Analysis on board. Theorem: This algorithm runs in time O(√n / ε^{O(1)}) and, with probability 2/3, outputs a (2 + ε)-approximation.
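The two steps above can be sketched as follows. This is a loose sketch, not the slides' exact algorithm: the poly(1/ε) factor in the subset size is a placeholder, and `degree(v)` stands in for a degree-query oracle.

```python
import math
import random

def smallest_subset_average(degree, n, eps, rng=random.Random(42)):
    """Sketch of the (2+eps)-approximation: draw 8/eps random subsets of
    size ~sqrt(n)/eps (the exact eps^{O(1)} factor is an assumption)
    and return the smallest subset-average degree.

    degree(v) models a degree query for vertex v in {0, ..., n-1}.
    """
    k = math.ceil(8 / eps)
    s = math.ceil(math.sqrt(n) / eps)  # Theta(sqrt(n) / eps^{O(1)})
    averages = []
    for _ in range(k):
        S = [rng.randrange(n) for _ in range(s)]
        averages.append(sum(degree(v) for v in S) / s)
    return min(averages)

# Sanity check on an n-cycle, where every degree is 2: every subset
# average equals 2, so the output is exactly the average degree.
est = smallest_subset_average(lambda v: 2, n=1000, eps=0.5, rng=random.Random(1))
```

Taking the minimum over 8/ε independent subsets is what suppresses the rare overestimates caused by sampling a few very-high-degree nodes.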

  14. (1 + ε)-approximation

  15. The idea. Idea: group nodes of similar degrees and estimate the average within each group. Buckets: set β = ε/c (c a constant) and t = O(log n / ε) (the number of buckets). For i ∈ {0, …, t − 1}, set B_i = {v | (1+β)^{i−1} < d(v) ≤ (1+β)^i}. Writing d(X) = ∑_{x∈X} d(x), the total degree of the nodes in B_i satisfies d(B_i) ∈ ((1+β)^{i−1}|B_i|, (1+β)^i|B_i|], so the total degree of V satisfies d(V) ∈ (∑_i (1+β)^{i−1}|B_i|, ∑_i (1+β)^i|B_i|].
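As a quick sanity check on the bucket definition, the bucket index of a degree can be computed directly (a hypothetical helper, not from the slides; degrees are assumed to be ≥ 1 here):

```python
import math

def bucket_index(d, beta):
    """Index i with (1+beta)**(i-1) < d <= (1+beta)**i, i.e. d is in B_i.
    Degree d = 1 lands in B_0, since (1+beta)**(-1) < 1 <= (1+beta)**0.
    Assumes d >= 1 (isolated vertices fall in no bucket)."""
    if d <= 1:
        return 0
    return math.ceil(math.log(d, 1 + beta))

# With beta = 0.5: (1.5)**1 = 1.5 < 2 <= (1.5)**2 = 2.25, so degree 2
# is in bucket B_2.
```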

  16. The first try. Algorithm: 1. Take a sample S of size s = 10000·(√n/ε)·t. 2. Let S_i := S ∩ B_i (the samples that fall into the i-th bucket). 3. Estimate the fraction of nodes in B_i by ρ_i = |S_i|/s. Note: for all i, E[ρ_i] = E[|S_i|/s] = |B_i|/n. 4. Output ∑_i ρ_i (1+β)^{i−1}. Does this work? What if for some level i, |S_i| is small (that is, |B_i| is small)? For those i's, ρ_i will not be very accurate…

  17. The second try. Idea: set ρ_i = 0 for small buckets. Algorithm: Set η = 10000. 1. Take a sample S of size s = 10000c·(√n/ε)·t. 2. Let S_i := S ∩ B_i (the samples that fall into the i-th bucket). 3. For each i, set ρ_i = 0 if |S_i| ≤ η; set ρ_i = |S_i|/s otherwise. 4. Output ∑_i ρ_i (1+β)^{i−1}. Note that we no longer have E[ρ_i] = E[|S_i|/s] = |B_i|/n for all i, but we can still show good bounds (on board). Theorem: This algorithm runs in time O(√n / ε^{O(1)}) and, with probability 2/3, outputs a (2 + ε)-approximation.
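The second try can be sketched in code as follows. The constants here (c, the sample size, and the small-bucket threshold η) are loose assumptions standing in for the slides' values, and `degree` again models a degree-query oracle.

```python
import math
import random

def bucketed_avg_degree(degree, n, eps, c=2, rng=random.Random(0)):
    """Sketch of the bucketed estimator: sample s nodes, bucket them by
    degree, drop buckets with at most eta samples, and output
    sum_i rho_i * (1+beta)**(i-1). Constants are assumptions."""
    beta = eps / c
    t = math.ceil(math.log(n) / eps)        # number of buckets
    s = math.ceil(math.sqrt(n) / eps * t)   # sample size (sketch)
    eta = s / (100 * t)                     # small-bucket threshold (assumed)
    counts = {}
    for _ in range(s):
        d = degree(rng.randrange(n))
        i = 0 if d <= 1 else math.ceil(math.log(d, 1 + beta))
        counts[i] = counts.get(i, 0) + 1
    return sum((cnt / s) * (1 + beta) ** (i - 1)
               for i, cnt in counts.items() if cnt > eta)

# On a 2-regular graph every sample lands in a single bucket, so the
# output is that bucket's lower endpoint (1+beta)**(i-1), which is
# within a (1+beta) factor of the true average degree 2.
est = bucketed_avg_degree(lambda v: 2, n=1000, eps=0.5)
```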

  18. An improved algorithm (if neighbor queries are allowed). Idea: estimate the degree contributed by large-small edges (see the analysis on board) more precisely. Algorithm: Set η = 10000. 1. Take a sample S of size s = 1000c·(√n/ε)·t. 2. For all i: (a) If |S_i| ≥ η, set ρ_i = |S_i|/s; otherwise set ρ_i = 0. (b) For all v ∈ S_i, pick a random neighbor u of v, and set χ(v) = 1 if u is in a small bucket B_j. (c) Set α_i = |{v ∈ S_i | χ(v) = 1}| / |S_i|. 3. Output ∑_i ρ_i (1 + α_i)(1+β)^{i−1}. Theorem: This algorithm runs in time O(√n / ε^{O(1)}) and, with probability 2/3, outputs a (1 + ε)-approximation.

  19. Minimum Spanning Tree

  20. The connection between #CC and MST. Assume a connected graph G = (V, E) has maximum degree D, and all edge weights are in {1, 2, …, W}. Let G_i = (V, E_i) denote the subgraph containing the edges of weight at most i, and let c_i be the number of connected components of G_i. We have MST(G) = n − W + ∑_{i=1}^{W−1} c_i. We thus only need to approximate c_i for each i = 1, …, W − 1.
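The identity MST(G) = n − W + ∑_{i=1}^{W−1} c_i can be checked directly on a small weighted graph (the example graph below is chosen for illustration, not taken from the slides), comparing against an exact Kruskal-style MST:

```python
def num_components(n, edges):
    """Number of connected components of an n-vertex graph (union-find)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    comps = n
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            comps -= 1
    return comps

def mst_weight(n, weighted_edges):
    """Exact MST weight via Kruskal, for comparison."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    total = 0
    for w, u, v in sorted(weighted_edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            total += w
    return total

# A small connected graph on n = 5 vertices with weights in {1, ..., W}.
n, W = 5, 3
wedges = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (1, 3, 4), (2, 0, 4), (3, 1, 3)]

# MST(G) = n - W + sum_{i=1}^{W-1} c_i, where c_i is the number of
# components of the subgraph G_i keeping only edges of weight <= i.
identity = n - W + sum(
    num_components(n, [(u, v) for w, u, v in wedges if w <= i])
    for i in range(1, W))
```

For this graph both sides evaluate to the same MST weight, as the identity promises for any connected graph with integer weights in {1, …, W}.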

  22. A sublinear algorithm for #CC. 1. Sample a random set of r = c_0/ε² vertices u_1, …, u_r. 2. For each sampled vertex u_i, grow a BFS tree T_{u_i} rooted at u_i as follows. (a) Choose X according to Pr[X ≥ k] = 1/k. (b) Run BFS starting at u_i until either i. the whole connected component containing u_i has been fully explored, or ii. X vertices have been explored. (c) If the BFS stopped in the first case, set α_i = 1; otherwise set α_i = 0. 3. Output (n/r) ∑_{i=1}^r α_i. Theorem: This algorithm runs in time O(D log n / (ε²ρ)) (D is the maximum degree of nodes in G) and, with probability 1 − ρ, outputs an answer with additive error εn.
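The truncated-BFS estimator above can be sketched as follows (adjacency-list representation assumed). Since α_i = 1 exactly when X ≥ |C(u_i)| and Pr[X ≥ k] = 1/k, we get E[α_i] = 1/|C(u_i)|, so (n/r)∑α_i has expectation equal to the number of components.

```python
import random
from collections import deque

def component_fits(adj, u, X):
    """BFS from u; True iff u's whole component has at most X vertices."""
    seen = {u}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen:
                if len(seen) >= X:
                    return False   # would explore more than X vertices
                seen.add(w)
                queue.append(w)
    return True

def estimate_cc(adj, r, rng=random.Random(0)):
    """Sketch of the #CC estimator: alpha_i = 1 iff X >= |C(u_i)|, so
    E[alpha_i] = 1/|C(u_i)| and (n/r)*sum(alpha_i) estimates #CC."""
    n = len(adj)
    total = 0
    for _ in range(r):
        u = rng.randrange(n)
        # X = floor(1/(1-U)) with U uniform in [0,1) gives Pr[X >= k] = 1/k.
        X = int(1 / (1 - rng.random()))
        if component_fits(adj, u, X):
            total += 1
    return n * total / r

# Three disjoint edges: 6 vertices, 3 components of size 2 each.
adj = [[1], [0], [3], [2], [5], [4]]
est = estimate_cc(adj, r=20000)
```

Here each α_i is a Bernoulli(1/2) variable (components have size 2, and Pr[X ≥ 2] = 1/2), so the estimate concentrates around 6 · 1/2 = 3, the true component count.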

  23. An improved algorithm for #CC. 1. Sample a random set of r = c_0/ε² vertices u_1, …, u_r. 2. For each sampled vertex u_i, grow a BFS tree T_{u_i} rooted at u_i as follows. Set α_i = 0 and f = 0. (∗) Flip a coin and set f = f + 1. If (heads) ∧ (|T_{u_i}| < W = 4/ε) ∧ (no visited vertex has degree > d∗ = O(d̄/ε)), then let B = |T_{u_i}| and continue to grow T_{u_i} by B steps. i. If during any of the B steps the component of G containing u_i has been fully explored, set α_i = 2 if B′ = 0, and α_i = d_{u_i}·2^f / B′ otherwise, where B′ ∈ [B, 2B] is the number of edges visited in the BFS so far. ii. Else, repeat step (∗). 3. Output (n/(2r)) ∑_{i=1}^r α_i. Theorem: This algorithm runs in time O(d̄ log(d̄/ε) / (ε²ρ)) and, with probability 1 − ρ, outputs an answer with additive error εn.

  24. Back to MST. Set ε = φ/(2W) and ρ = 1/(4W) when approximating each c_i. The total running time is then O(D · W³ · log n / φ²) to approximate the MST weight up to a factor of 1 + φ. This can be improved to Õ(DW/φ²).

  26. Some slides are based on Ronitt Rubinfeld's course: http://stellar.mit.edu/S/course/6/sp13/6.893
