  1. Algorithms for Big Data (IV)
     Chihao Zhang, Shanghai Jiao Tong University, Oct. 11, 2019

  2. Review of the Last Lecture
     Last time, we introduced the AMS algorithm for counting distinct elements in the
     streaming model. We are given a sequence of numbers ⟨a_1, ..., a_m⟩ where each
     a_i ∈ [n]. It defines a frequency vector f = (f_1, ..., f_n) where
     f_i = |{ k ∈ [m] : a_k = i }|. We want to compute the number
     d = |{ i ∈ [n] : f_i > 0 }|.
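For concreteness, here is a minimal (non-streaming) Python computation of f and d that the streaming algorithms below approximate; the function name and the demo stream are illustrative, not from the slides.

```python
from collections import Counter

def exact_distinct(stream):
    """Exact frequency vector f and number d of distinct elements."""
    f = Counter(stream)                      # f[i] = |{ k : a_k = i }|
    d = sum(1 for v in f.values() if v > 0)  # d = |{ i : f_i > 0 }|
    return f, d

f, d = exact_distinct([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
print(d)  # 7
```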

  3. Algorithm: AMS Algorithm for Counting Distinct Elements
     Init: a random hash function h : [n] → [n] from a 2-universal family; Z ← 0.
     On input y:
       if zeros(h(y)) > Z then
         Z ← zeros(h(y))
       end if
     Output: d̂ = 2^(Z + 1/2).
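A minimal Python sketch of the AMS estimator above. The (a·y + b) mod p construction of the 2-universal family, the prime p, and the helper names are my assumptions, not fixed by the slides.

```python
import random

def zeros(x):
    """Number of trailing zeros in the binary representation of x.
    Convention here: zeros(0) = 0, a simplification."""
    return 0 if x == 0 else (x & -x).bit_length() - 1

def ams_distinct(stream, n, p=(1 << 61) - 1):
    """AMS estimator: track Z = max zeros(h(a_i)), output 2^(Z + 1/2)."""
    a, b = random.randrange(1, p), random.randrange(p)  # h from a 2-universal family
    h = lambda y: ((a * y + b) % p) % n
    Z = 0
    for y in stream:
        Z = max(Z, zeros(h(y)))
    return 2 ** (Z + 0.5)
```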

  4. We also introduced the BJKST algorithm, a refinement of the AMS algorithm.
     Using O(log(1/δ) · log n) bits of memory, we can obtain
     Pr[ d/3 ≤ d̂ ≤ 3d ] ≥ 1 − δ.
     We will show today that the BJKST algorithm can produce a d̂ which is a
     (1 ± ε)-approximation of d for any ε > 0.

  5. The BJKST Algorithm
     The following refinement is due to Bar-Yossef, Jayram, Kumar, Sivakumar and
     Trevisan.

     Algorithm: BJKST Algorithm for Counting Distinct Elements
     Init: random hash functions h : [n] → [n] and g : [n] → [b·ε⁻⁴ · log² n], both
           from 2-universal families; Z ← 0, B ← ∅.
     On input y:
       if zeros(h(y)) ≥ Z then
         B ← B ∪ { (g(y), zeros(h(y))) }
         while |B| ≥ c/ε² do
           Z ← Z + 1
           Remove all (α, β) with β < Z from B
         end while
       end if
     Output: d̂ = |B| · 2^Z
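A minimal sketch of BJKST in the same style, taking g to be the identity (as the analysis on slide 7 does) and reusing zeros() and the hash construction from the AMS sketch; the constant c is left as a parameter since the slides do not fix it.

```python
import random

def bjkst_distinct(stream, n, eps, c=100, p=(1 << 61) - 1):
    """BJKST with g = identity: keep elements with many trailing zeros in B."""
    a, b = random.randrange(1, p), random.randrange(p)  # h from a 2-universal family
    h = lambda y: ((a * y + b) % p) % n
    Z, B = 0, {}                    # B maps y -> zeros(h(y))
    cap = int(c / eps ** 2)
    for y in stream:
        zy = zeros(h(y))            # zeros() as in the AMS sketch above
        if zy >= Z:
            B[y] = zy
            while len(B) >= cap:    # enforce the cap on |B|
                Z += 1
                B = {k: v for k, v in B.items() if v >= Z}
    return len(B) * 2 ** Z
```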

  6. The algorithm maintains a bucket B, which stores those y whose zeros(h(y)) is
     larger than the current Z. We set a cap L = c/ε² for the size of B:
     ▶ if L = ∞, B stores all entries, and the algorithm is exact;
     ▶ if L = 2, the algorithm is equivalent to AMS.
     Therefore, the size of B is a trade-off between the memory consumption and the
     accuracy of the algorithm.

  7. Analysis
     To analyze the algorithm, we first assume that g is simply the identity function
     from [n] to [n], namely g(y) = y for all y ∈ [n]. We then need to store the
     whole B, whose size is O(ε⁻²).
     Similar to AMS, for every k ∈ [n], let X_{k,r} be the indicator that h(k) has at
     least r trailing zeros. Define Y_r = Σ_{k ∈ [n] : f_k > 0} X_{k,r} as the number
     of h(a_i) with at least r trailing zeros. We already know from the last lecture
     that E[Y_r] = d/2^r and Var[Y_r] ≤ d/2^r.

  8. If Z = t at the end of the algorithm, then Y_t = |B| and d̂ = Y_t · 2^t.
     We use A to denote the bad event that |Y_t · 2^t − d| ≥ εd, or equivalently
     |Y_t − d/2^t| ≥ εd/2^t.
     We will bound the probability of A using the following argument:
     ▶ if t is small, then E[Y_t] = d/2^t is large, so we can apply concentration
       inequalities;
     ▶ the value t is unlikely to be very large.
     We let s be the threshold for small/large values mentioned above.

  9. Pr[A] = Σ_{r=1}^{log n} Pr[ |Y_r − d/2^r| ≥ εd/2^r ∧ t = r ]
           ≤ Σ_{r=1}^{s−1} Pr[ |Y_r − d/2^r| ≥ εd/2^r ] + Σ_{r=s}^{log n} Pr[ t = r ]
           = Σ_{r=1}^{s−1} Pr[ |Y_r − E[Y_r]| ≥ εd/2^r ] + Pr[ Y_{s−1} ≥ c/ε² ]
           ≤ Σ_{r=1}^{s−1} 2^r/(ε²d) + ε²d/(c · 2^{s−1})
           ≤ 2^s/(ε²d) + ε²d/(c · 2^{s−1}).
     (The fourth line applies Chebyshev with Var[Y_r] ≤ d/2^r to each summand and
     Markov with E[Y_{s−1}] = d/2^{s−1} to the tail term; note that t ≥ s means the
     cap was hit, i.e., Y_{s−1} ≥ c/ε². The last line sums the geometric series.)
     So if we choose s such that d/2^s = Θ(ε⁻²), Pr[A] can be bounded by any constant
     (depending on c).

  10. Space Complexity
      We need to store
      ▶ the function h: O(log n);
      ▶ the function g: O(log n);
      ▶ the bucket B: O((c/ε²) · log |ran(g)|) = O((c/ε²) · log n).
      The bottleneck is to store B. Instead of using the identity function g, we can
      tolerate collisions (with at most constant probability). This helps to reduce
      the memory needed (Exercise).

  11. Frequency Estimation
      Consider a stream of numbers ⟨a_1, ..., a_m⟩ and its frequency vector
      f = (f_1, ..., f_n). Another fundamental problem is to estimate f_a for each
      query a ∈ [n].
      It is closely related to the Frequent problem, which asks for the set
      { j : f_j > m/k }.
      We now describe a deterministic algorithm for Frequency-Estimation.

  12. Misra-Gries Algorithm

      Algorithm: Misra-Gries Algorithm for Frequency-Estimation
      Init: an empty table A.
      On input y:
        if y ∈ keys(A) then
          A[y] ← A[y] + 1
        else if |keys(A)| ≤ k − 1 then
          A[y] ← 1
        else
          for all ℓ ∈ keys(A) do
            A[ℓ] ← A[ℓ] − 1
            if A[ℓ] = 0 then
              Remove ℓ from A
            end if
          end for
        end if

  13. Algorithm: Misra-Gries (cont'd)
      Output: on query j,
        if j ∈ keys(A) then
          f̂_j = A[j]
        else
          f̂_j = 0
        end if
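A direct Python transcription of the update and query rules above; the class name and the small demo stream are mine.

```python
class MisraGries:
    """Deterministic frequency estimation storing at most k keys."""
    def __init__(self, k):
        self.k = k
        self.A = {}                        # the table A

    def update(self, y):
        if y in self.A:
            self.A[y] += 1
        elif len(self.A) <= self.k - 1:
            self.A[y] = 1
        else:
            for l in list(self.A):         # decrement every counter
                self.A[l] -= 1
                if self.A[l] == 0:
                    del self.A[l]

    def query(self, j):
        return self.A.get(j, 0)            # f_hat_j

mg = MisraGries(k=3)
for y in [1, 1, 1, 2, 3, 4]:
    mg.update(y)
print(mg.query(1))  # 2, within [f_1 - m/k, f_1] = [1, 3], matching the next slide
```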

  14. Analysis
      The algorithm uses O(k(log m + log n)) bits of memory.
      It is not hard to see that for each j ∈ [n], the output f̂_j satisfies
      f_j − m/k ≤ f̂_j ≤ f_j.
      If f_j > m/k, then j is in the table A. The converse is not true!

  15. In Misra-Gries, we compute a table A. The table A stores information about the
      stream, so we can extract frequencies from it.
      However, Misra-Gries suffers from the following main drawbacks:
      ▶ given two tables A_1 and A_2 with respect to streams σ_1 and σ_2
        respectively, we don't know how to obtain the table for σ_1 ∘ σ_2
        (algorithms with this property are called sketches);
      ▶ it does not extend to the turnstile model.
      In the turnstile model, each entry of the stream is a pair (a_j, Δ_j). Upon
      receiving (a_j, Δ_j), we update f_{a_j} to f_{a_j} + Δ_j.

  16. Count Sketch

      Algorithm: Count Sketch
      Init: an array C[j] for j ∈ [k], where k = 3/ε²;
            a random hash function h : [n] → [k] from a 2-universal family;
            a random hash function g : [n] → {−1, 1} from a 2-universal family.
      On input (y, Δ):
        C[h(y)] ← C[h(y)] + Δ · g(y)
      Output: on query a, output f̂_a = g(a) · C[h(a)].
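A minimal single-row implementation of the Count Sketch above, handling turnstile updates; realizing the two 2-universal families via (a·y + b) mod p is my assumption.

```python
import math
import random

class CountSketch:
    """One row of Count Sketch with k = ceil(3 / eps^2) counters."""
    def __init__(self, eps, p=(1 << 61) - 1):
        self.k = math.ceil(3 / eps ** 2)
        self.C = [0] * self.k
        a1, b1 = random.randrange(1, p), random.randrange(p)
        a2, b2 = random.randrange(1, p), random.randrange(p)
        self.h = lambda y: ((a1 * y + b1) % p) % self.k       # bucket hash h
        self.g = lambda y: 1 - 2 * (((a2 * y + b2) % p) % 2)  # sign hash g in {-1, +1}

    def update(self, y, delta=1):
        self.C[self.h(y)] += delta * self.g(y)   # turnstile update (y, Δ)

    def query(self, a):
        return self.g(a) * self.C[self.h(a)]     # f̂_a
```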

  17. Analysis
      Let X = f̂_a be the output on the query a. For every j ∈ [n], let Y_j be the
      indicator of h(j) = h(a). We have
      X = g(a) · Σ_{j=1}^{n} f_j · g(j) · Y_j.
      Therefore
      E[X] = E[ g(a) · g(a) · f_a · Y_a + Σ_{j ∈ [n]\{a}} g(a) · f_j · g(j) · Y_j ]
           = f_a,
      since g(a)² = 1, Y_a = 1, and E[g(a) · g(j)] = 0 for every j ≠ a.
      Let Z ≜ Σ_{j ∈ [n]\{a}} f_j · g(a) · g(j) · Y_j; then X = f_a + Z and
      Var[X] = Var[Z].

  18. E[Z²] = E[ ( Σ_{j ∈ [n]\{a}} f_j · g(a) · g(j) · Y_j )² ]
            = E[ Σ_{j ∈ [n]\{a}} f_j² · Y_j²
                 + Σ_{j ≠ j' ∈ [n]\{a}} f_j · f_{j'} · g(j) · g(j') · Y_j · Y_{j'} ]
            = E[ Σ_{j ∈ [n]\{a}} f_j² · Y_j² ]
            = Σ_{j ∈ [n]\{a}} f_j² · E[Y_j²].
      Note that for every j ≠ a,
      E[Y_j²] = Pr[ h(j) = h(a) ] = 1/k.
      Therefore
      E[Z²] = Σ_{j ∈ [n]\{a}} f_j²/k ≤ ‖f‖₂²/k.

  19. Hence
      Var[X] = Var[Z] = E[Z²] − (E[Z])² ≤ ‖f‖₂²/k.
      By Chebyshev,
      Pr[ |f̂_a − f_a| ≥ ε‖f‖₂ ] ≤ 1/(kε²) = 1/3.
      We can then use the median trick to boost the algorithm so that
      ▶ Pr[ |f̂_a − f_a| ≥ ε‖f‖₂ ] ≤ δ;
      ▶ it costs O(ε⁻² · log(1/δ) · (log m + log n)) bits of memory.
      Compare the performance (in terms of accuracy and space consumption) of
      Misra-Gries and Count Sketch (Exercise).
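A sketch of the median trick over independent copies of the CountSketch class above; taking t = O(log(1/δ)) copies drives the failure probability below δ by a Chernoff bound, though the concrete constant is left open here.

```python
import statistics

class MedianCountSketch:
    """Answer queries with the median of t independent Count Sketch copies."""
    def __init__(self, eps, t):
        self.copies = [CountSketch(eps) for _ in range(t)]

    def update(self, y, delta=1):
        for cs in self.copies:
            cs.update(y, delta)

    def query(self, a):
        # Each copy errs with probability <= 1/3; the median errs only if at
        # least half the copies err, which is exponentially unlikely in t.
        return statistics.median(cs.query(a) for cs in self.copies)
```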
