  1. Big-Data Algorithms: Overview. Reference: http://www.sketchingbigdata.org/fall17/lec/lec1.pdf

  2. What’s the problem here?
     - So far, linear (i.e., linear-cost) algorithms have been the “gold standard”.
     - What if linear algorithms aren’t good enough? Example: searching the web for pages of interest.

  3. Topics of Interest
     - Sketching: compression of a data set that allows queries. A compression $C(x)$ of some data set $x$ that allows us to query $f(x)$. We may want to compute $f(x, y)$ from $C(x)$ and $C(y)$. We may also want composable compression: if $x = x_1 x_2 \ldots x_n$, we would like to compute $C(x_1 x_2 \ldots x_n x_{n+1}) = C(x \circ x_{n+1})$ using just $C(x)$ and $x_{n+1}$.
     - Streaming: we may not be able to store a huge data set. We need to process a stream of data, arriving one chunk at a time, on the fly, and must answer queries with sublinear memory.
     - Dimensionality reduction: for example, spam filtering with the bag-of-words model. Let $d$ be a dictionary of words and represent an email by a vector $v$, where $v_i$ is the number of times $d_i$ appears in the message; then $\dim v = |d|$ (see the sketch after this list).
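A minimal Python illustration of the bag-of-words representation; the dictionary and message here are hypothetical toy data:

```python
from collections import Counter

def bag_of_words(message: str, dictionary: list[str]) -> list[int]:
    """Return v with v[i] = number of times dictionary[i] occurs in message."""
    counts = Counter(message.lower().split())
    return [counts[word] for word in dictionary]

d = ["cheap", "meds", "hello", "meeting"]    # toy dictionary
v = bag_of_words("Cheap meds cheap now", d)
print(v)                                     # [2, 1, 0, 0]; dim v == len(d)
```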

  4. Large-scale matrix computation, such as least squares regression: suppose we want to learn $f : \mathbb{R}^n \to \mathbb{R}$, where $f = \langle b, \cdot \rangle$ for some $b \in \mathbb{R}^n$, with
     $$\langle u, v \rangle = \sum_{i=1}^{n} u_i v_i \quad \forall u, v \in \mathbb{R}^n.$$
     Collect data $\{(x_i \in \mathbb{R}^n, y_i \in \mathbb{R}) : 1 \le i \le m\}$. We want to compute the $b$ minimizing
     $$\|Xb - y\|_2 = \left( \sum_{i=1}^{m} (y_i - \langle b, x_i \rangle)^2 \right)^{1/2},$$
     where $X \in \mathbb{R}^{m \times n}$ is composed of the rows $x_1^T, \ldots, x_m^T$ and $\|\cdot\|_2 = \sqrt{\langle \cdot, \cdot \rangle}$ is the $\ell_2$-norm. Also, principal component analysis, given by the singular value decomposition of a matrix: which features are most important?
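As a small illustration of this least-squares setup, the following sketch fits $b$ on synthetic data using NumPy’s `np.linalg.lstsq`; all data here is made up:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5
X = rng.standard_normal((m, n))                 # rows are the x_i^T
b_true = rng.standard_normal(n)
y = X @ b_true + 0.01 * rng.standard_normal(m)  # y_i ~= <b, x_i> plus noise

# b_hat minimizes ||Xb - y||_2
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.norm(b_hat - b_true))           # small, since noise is small
```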

  5. Approximate Counting Problem: monitor a sequence of events and allow an approximate count of the number of events so far, at any time. Create a data structure maintaining a single integer $n$ (initialized to zero) and supporting the operations
     - init(): set $n \leftarrow 0$.
     - update(): increment $n$.
     - query(): output (an estimate of) $n$.
     Why approximation? If we want the exact value, we can store $n$ via a counter, a sequence of $\lceil \log n \rceil$ bits (“log” means “$\log_2$”). We can’t do better: if we use $f(n)$ bits to store $n$, then there are only $2^{f(n)}$ configurations. To store the exact value of every integer up to $n$, we must have
     $$2^{f(n)} \ge n \implies f(n) \ge \log n \implies f(n) \ge \lceil \log n \rceil,$$
     the last step since $f(n) \in \mathbb{Z}$.
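A trivial baseline with this interface, for comparison with the Morris sketches below:

```python
class ExactCounter:
    """Exact counting baseline: stores n itself, i.e. ~ceil(log2 n) bits."""

    def __init__(self):      # init(): set n <- 0
        self.n = 0

    def update(self):        # increment n
        self.n += 1

    def query(self):         # exact answer, no estimation
        return self.n
```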

  6. If we want a sublinear-space algorithm, we need an estimate $\tilde{n}$ of $n$. We want to know that for some $\varepsilon, \delta \in (0, 1)$ we have
     $$P(|\tilde{n} - n| > \varepsilon n) < \delta.$$
     Equivalently: $P(|\tilde{n} - n| \le \varepsilon n) \ge 1 - \delta$.

  7. Morris’ algorithm: uses an integer counter $X$, with data structure operations
     - init(): sets $X \leftarrow 0$
     - update(): increments $X$ with probability $2^{-X}$
     - query(): outputs $\tilde{n} = 2^X - 1$
     Intuitively, $X$ attempts to store a value approximately $\log n$ (see the sketch below). How good is this? Not so great; we’ll see that
     $$P(|\tilde{n} - n| > \varepsilon n) < \frac{1}{2\varepsilon^2}.$$
     Since $\varepsilon < 1$, the right-hand side exceeds $\frac{1}{2}$, so this bound by itself is too weak to be meaningful!
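A direct Python sketch of Morris’ algorithm as described above:

```python
import random

class Morris:
    """Morris' approximate counter: X holds roughly log2(n)."""

    def __init__(self):                        # init(): X <- 0
        self.X = 0

    def update(self):                          # increment X w.p. 2^-X
        if random.random() < 2.0 ** -self.X:
            self.X += 1

    def query(self):                           # estimate n~ = 2^X - 1
        return 2 ** self.X - 1

c = Morris()
for _ in range(10_000):
    c.update()
print(c.query())   # a (noisy) estimate of 10_000
```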

  8. Improvement, Morris+: create $s$ independent copies of Morris and average their outputs. Calling these estimators $\tilde{n}_1, \ldots, \tilde{n}_s$, the output is
     $$\tilde{n} = \frac{1}{s} \sum_{i=1}^{s} \tilde{n}_i.$$
     Then
     $$P(|\tilde{n} - n| > \varepsilon n) < \frac{1}{2 s \varepsilon^2},$$
     so $P(|\tilde{n} - n| > \varepsilon n) < \delta$ for $s > \frac{1}{2\varepsilon^2\delta} = \Theta(1/\delta)$. Better!
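Building on the `Morris` class sketched above, Morris+ averages $s$ independent copies:

```python
class MorrisPlus:
    """Morris+: average of s independent Morris counters."""

    def __init__(self, s: int):
        self.copies = [Morris() for _ in range(s)]

    def update(self):
        for c in self.copies:
            c.update()

    def query(self):
        return sum(c.query() for c in self.copies) / len(self.copies)
```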

  9. Improvement, Morris++: reduces the dependence on the failure probability $\delta$ from $\Theta(1/\delta)$ copies to $\Theta(\log(1/\delta))$. Run $t$ instances of Morris+, each with failure probability $\frac{1}{3}$, so that $s = \Theta(1/\varepsilon^2)$ suffices for each instance. Now output the median estimate of these $t$ Morris+ instances. Calling this output $\tilde{n}$, it turns out that $P(|\tilde{n} - n| > \varepsilon n) < \delta$ for $t = \Theta(\log(1/\delta))$.
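Continuing the sketch, Morris++ takes the median of $t$ Morris+ instances; the choice of $t$ and $s$ is left to the caller, per the bounds above:

```python
import statistics

class MorrisPlusPlus:
    """Morris++: median of t Morris+ instances (each with failure prob. 1/3)."""

    def __init__(self, t: int, s: int):
        self.instances = [MorrisPlus(s) for _ in range(t)]

    def update(self):
        for inst in self.instances:
            inst.update()

    def query(self):
        return statistics.median(inst.query() for inst in self.instances)
```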

  10. Probability Review. Let $X$ be a random variable taking values in $S \subseteq \mathbb{R}$. The expected value of $X$ is
      $$E[X] = \sum_{j \in S} j \cdot P(X = j).$$
      The variance of $X$ is
      $$\mathrm{Var}[X] = E\left[(X - E[X])^2\right].$$
      Linearity of expected value: let $X$ and $Y$ be random variables. Then $E[aX + bY] = a\,E[X] + b\,E[Y]$ for all $a, b \in \mathbb{R}$. Markov’s inequality: if $X$ is a nonnegative random variable, then
      $$P(X > \lambda) < \frac{E[X]}{\lambda} \quad \forall \lambda > 0.$$

  11. Chebyshev’s inequality: let $X$ be a random variable with finite variance. Then
      $$P(|X - E[X]| > \lambda) < \frac{E\left[(X - E[X])^2\right]}{\lambda^2} = \frac{\mathrm{Var}[X]}{\lambda^2} \quad \forall \lambda > 0.$$
      More generally, if $p \ge 1$, then
      $$P(|X - E[X]| > \lambda) < \frac{E\left[|X - E[X]|^p\right]}{\lambda^p} \quad \forall \lambda > 0.$$
      Chernoff’s inequality: suppose $X_1, X_2, \ldots, X_n$ are independent random variables with $X_i \in [0, 1]$. Let $X = \sum_{i=1}^{n} X_i$ and $\mu = E[X]$. Then
      $$P(|X - E[X]| > \varepsilon\,E[X]) \le 2 \cdot e^{-\varepsilon^2 \mu / 3} \quad \forall \varepsilon \in (0, 1).$$
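A quick empirical check of the Chernoff bound on a sum of Bernoulli variables; the parameters here are arbitrary, chosen only for illustration:

```python
import math
import random

# n Bernoulli(p) variables; measure deviations larger than eps * mu.
n, p, eps, trials = 1000, 0.5, 0.1, 2000
mu = n * p
failures = sum(
    abs(sum(random.random() < p for _ in range(n)) - mu) > eps * mu
    for _ in range(trials)
)
print(f"empirical failure rate: {failures / trials:.4f}")
print(f"Chernoff bound:         {2 * math.exp(-eps**2 * mu / 3):.4f}")
```

The empirical rate comes out far below the bound, as expected: Chernoff is an upper bound, not an exact tail probability.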

  12. Analysis of Morris’ algorithm. Let $X_n$ be the value of $X$ after $n$ updates. Claim: $E[2^{X_n}] = n + 1$ for $n \in \mathbb{N}_0$. Proof of claim: by induction, the base case $n = 0$ being
      $$E[2^{X_n}] = E[2^{X_0}] = E[1] = 1 = n + 1.$$

  13. Induction step: suppose that $E[2^{X_n}] = n + 1$ for some $n \in \mathbb{N}_0$. Then
      $$\begin{aligned}
      E[2^{X_{n+1}}] &= \sum_{j=0}^{\infty} P(X_n = j) \cdot E\left[2^{X_{n+1}} \mid X_n = j\right] \\
      &= \sum_{j=0}^{\infty} P(X_n = j) \cdot \left( \left(1 - \frac{1}{2^j}\right) 2^j + \frac{1}{2^j} \cdot 2^{j+1} \right) \\
      &= \sum_{j=0}^{\infty} P(X_n = j)\, 2^j + \sum_{j=0}^{\infty} P(X_n = j) \\
      &= E[2^{X_n}] + 1 = (n + 1) + 1,
      \end{aligned}$$
      as required.
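An empirical check of the claim, using the `Morris` class sketched earlier; averaging $2^{X_n}$ over many independent runs should land near $n + 1$:

```python
# Average 2^X_n over many independent runs of the Morris sketch above.
n, runs = 100, 10_000
total = 0
for _ in range(runs):
    c = Morris()
    for _ in range(n):
        c.update()
    total += 2 ** c.X
print(total / runs)   # should be close to n + 1 = 101
```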
