Algorithms for Big Data (IV)

Chihao Zhang

Shanghai Jiao Tong University

Oct. 11, 2019
Review of the Last Lecture

Last time, we introduced the AMS algorithm for counting distinct elements in the streaming model.

We are given a sequence of numbers $\langle a_1, \dots, a_m \rangle$ where each $a_i \in [n]$.

It defines a frequency vector $f = (f_1, \dots, f_n)$ where $f_i = \left|\{ k \in [m] : a_k = i \}\right|$.

We want to compute the number $d = \left|\{ i \in [n] : f_i > 0 \}\right|$.
Algorithm: AMS Algorithm for Counting Distinct Elements

Init:    a random hash function $h : [n] \to [n]$ from a 2-universal family;
         $Z \leftarrow 0$.
On input $y$:
         if $\mathrm{zeros}(h(y)) > Z$ then
             $Z \leftarrow \mathrm{zeros}(h(y))$
         end if
Output:  $\hat{d} = 2^{Z + 1/2}$.
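As a concrete illustration, here is a minimal Python sketch of the AMS estimator above. It assumes the linear family $h(x) = (ax + b) \bmod p$ with a fixed prime $p \ge n$ as the 2-universal family; the class and method names are ours, not part of the lecture.

```python
import random

class AMSDistinct:
    """A minimal sketch of the AMS distinct-elements estimator.

    Assumption: h(x) = (a*x + b) mod p, p a fixed prime >= n, is used
    as the 2-universal hash family.
    """

    def __init__(self, p=(1 << 61) - 1):
        self.p = p                      # prime modulus, must satisfy p >= n
        self.a = random.randrange(1, p)
        self.b = random.randrange(p)
        self.z = 0                      # current value of Z

    @staticmethod
    def _zeros(v):
        # number of trailing zero bits of v; zeros(0) treated as "very large"
        return 64 if v == 0 else (v & -v).bit_length() - 1

    def update(self, y):
        # on input y: Z <- max(Z, zeros(h(y)))
        self.z = max(self.z, self._zeros((self.a * y + self.b) % self.p))

    def estimate(self):
        # output d-hat = 2^(Z + 1/2)
        return 2 ** (self.z + 0.5)
```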
We also introduced the BJKST algorithm, a refinement of the AMS algorithm.

Using $O(\log \frac{1}{\delta} \cdot \log n)$ bits of memory, we can obtain
$$\Pr\left[\frac{d}{3} \le \hat{d} \le 3d\right] \ge 1 - \delta.$$

We will show today that the BJKST algorithm can produce $\hat{d}$ which is a $(1 \pm \varepsilon)$-approximation of $d$ for any $\varepsilon > 0$.
The BJKST Algorithm

The following refinement is due to Bar-Yossef, Jayram, Kumar, Sivakumar and Trevisan.

Algorithm: BJKST Algorithm for Counting Distinct Elements

Init:    random hash functions $h : [n] \to [n]$ and $g : [n] \to [b\varepsilon^{-4}\log^2 n]$, both from 2-universal families;
         $Z \leftarrow 0$, $B \leftarrow \emptyset$.
On input $y$:
         if $\mathrm{zeros}(h(y)) \ge Z$ then
             $B \leftarrow B \cup \{(g(y), \mathrm{zeros}(h(y)))\}$
             while $|B| \ge c/\varepsilon^2$ do
                 $Z \leftarrow Z + 1$
                 Remove all $(\alpha, \beta)$ with $\beta < Z$ from $B$
             end while
         end if
Output:  $\hat{d} = |B| \cdot 2^Z$
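Again as an illustration, a minimal Python sketch of BJKST. It assumes, as in the analysis below, that $g$ is the identity, so the bucket stores pairs $(y, \mathrm{zeros}(h(y)))$ directly; the default value of the constant $c$ is purely illustrative, and the hash family is the same assumed linear family as before.

```python
import random

class BJKST:
    """A minimal sketch of the BJKST distinct-elements estimator.

    Assumptions: g is the identity, so B is a dict y -> zeros(h(y));
    c is a tunable constant (576 is an illustrative default).
    """

    def __init__(self, eps, c=576, p=(1 << 61) - 1):
        self.p = p
        self.a = random.randrange(1, p)
        self.b = random.randrange(p)
        self.cap = int(c / eps ** 2)    # the threshold c / eps^2 on |B|
        self.z = 0                      # current value of Z
        self.bucket = {}                # B: maps y -> zeros(h(y))

    @staticmethod
    def _zeros(v):
        return 64 if v == 0 else (v & -v).bit_length() - 1

    def update(self, y):
        zy = self._zeros((self.a * y + self.b) % self.p)
        if zy >= self.z:
            self.bucket[y] = zy
            while len(self.bucket) >= self.cap:
                self.z += 1             # raise Z and evict shallow entries
                self.bucket = {k: v for k, v in self.bucket.items()
                               if v >= self.z}

    def estimate(self):
        # output d-hat = |B| * 2^Z
        return len(self.bucket) * 2 ** self.z
```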
The algorithm maintains a bucket $B$, which stores those $y$ whose $\mathrm{zeros}(h(y))$ is larger than the current $Z$.

We set a cap $L = \frac{c}{\varepsilon^2}$ for the size of $B$:
▶ if $L = \infty$, $B$ stores all entries, and the algorithm is exact;
▶ if $L = 2$, the algorithm is equivalent to AMS.

Therefore, the size of $B$ is a trade-off between the memory consumption and the accuracy of the algorithm.
Analysis

To analyze the algorithm, we first assume that $g$ is simply the identity function from $[n]$ to $[n]$, namely $g(y) = y$ for all $y \in [n]$. We then need to store the whole $B$, whose size is $O(\varepsilon^{-2})$.

Similar to AMS, for every $k \in [n]$, let $X_{k,r}$ be the indicator that $h(k)$ has at least $r$ trailing zeros.

Define $Y_r = \sum_{k \in [n] : f_k > 0} X_{k,r}$ as the number of $h(a_i)$ with at least $r$ trailing zeros.

We already know from the last lecture that $\mathbb{E}[Y_r] = \frac{d}{2^r}$ and $\mathrm{Var}[Y_r] \le \frac{d}{2^r}$.
If $Z = t$ at the end of the algorithm, then $Y_t = |B|$ and $\hat{d} = Y_t \cdot 2^t$.

We use $A$ to denote the bad event that $\left|Y_t \cdot 2^t - d\right| \ge \varepsilon d$, or equivalently
$$\left|Y_t - \frac{d}{2^t}\right| \ge \frac{\varepsilon d}{2^t}.$$

We will bound the probability of $A$ using the following argument:
▶ if $t$ is small, then $\mathbb{E}[Y_t] = \frac{d}{2^t}$ is large, so we can apply concentration inequalities;
▶ the value $t$ is unlikely to be very large.

We let $s$ be the threshold for small/large values mentioned above.
$$\begin{aligned}
\Pr[A] &= \sum_{r=1}^{\log n} \Pr\left[\left|Y_r - \frac{d}{2^r}\right| \ge \frac{\varepsilon d}{2^r} \,\wedge\, t = r\right] \\
&\le \sum_{r=1}^{s-1} \Pr\left[\left|Y_r - \frac{d}{2^r}\right| \ge \frac{\varepsilon d}{2^r}\right] + \sum_{r=s}^{\log n} \Pr[t = r] \\
&= \sum_{r=1}^{s-1} \Pr\left[\left|Y_r - \mathbb{E}[Y_r]\right| \ge \frac{\varepsilon d}{2^r}\right] + \Pr\left[Y_{s-1} \ge c/\varepsilon^2\right] \\
&\le \sum_{r=1}^{s-1} \frac{2^r}{\varepsilon^2 d} + \frac{\varepsilon^2 d}{c \cdot 2^{s-1}} \le \frac{2^s}{\varepsilon^2 d} + \frac{\varepsilon^2 d}{c \cdot 2^{s-1}}.
\end{aligned}$$

So if we choose $s$ such that $\frac{d}{2^s} = \Theta(\varepsilon^{-2})$, then $\Pr[A]$ can be bounded by any constant (depending on $c$).
Space Complexity

We need to store
▶ the function $h$: $O(\log n)$;
▶ the function $g$: $O(\log n)$;
▶ the bucket $B$: $O\left(\frac{c}{\varepsilon^2} \cdot \log |\mathrm{ran}(g)|\right) = O\left(\frac{c}{\varepsilon^2} \log n\right)$.

The bottleneck is to store $B$.

Instead of using the identity function $g$, we can tolerate collisions (with at most constant probability). This helps to reduce the memory needed (Exercise).
Frequency Estimation

Consider a stream of numbers $\langle a_1, \dots, a_m \rangle$ and its frequency vector $f = (f_1, \dots, f_n)$.

Another fundamental problem is to estimate $f_a$ for each query $a \in [n]$.

It is closely related to the Frequency problem, which asks for the set $\{ j : f_j > m/k \}$.

We now describe a deterministic algorithm for Frequency-Estimation.
Misra-Gries Algorithm

Algorithm: Misra-Gries Algorithm for Frequency-Estimation

Init:    an empty table $A$.
On input $y$:
         if $y \in \mathrm{keys}(A)$ then
             $A[y] \leftarrow A[y] + 1$
         else if $|\mathrm{keys}(A)| \le k - 1$ then
             $A[y] \leftarrow 1$
         else
             for all $\ell \in \mathrm{keys}(A)$ do
                 $A[\ell] \leftarrow A[\ell] - 1$
                 if $A[\ell] = 0$ then
                     Remove $\ell$ from $A$
                 end if
             end for
         end if
Algorithm: Misra-Gries (cont'd)

Output:  On query $j$,
         if $j \in \mathrm{keys}(A)$ then
             $\hat{f}_j = A[j]$
         else
             $\hat{f}_j = 0$
         end if
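The pseudocode above translates almost line by line into Python; here is a minimal sketch (the class and method names are ours).

```python
class MisraGries:
    """A minimal sketch of the Misra-Gries frequency estimator
    with parameter k."""

    def __init__(self, k):
        self.k = k
        self.table = {}                 # the table A

    def update(self, y):
        if y in self.table:
            self.table[y] += 1
        elif len(self.table) <= self.k - 1:
            self.table[y] = 1
        else:
            # decrement every counter; drop keys whose counter hits zero
            for key in list(self.table):
                self.table[key] -= 1
                if self.table[key] == 0:
                    del self.table[key]

    def query(self, j):
        return self.table.get(j, 0)     # f-hat_j = A[j] if present, else 0
```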
Analysis

The algorithm uses $O(k(\log m + \log n))$ bits of memory.

It is not hard to see that for each $j \in [n]$, the output $\hat{f}_j$ satisfies
$$f_j - \frac{m}{k} \le \hat{f}_j \le f_j.$$

If $f_j > m/k$, then $j$ is in the table $A$. The reverse is not correct!
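A toy run, assuming the MisraGries class sketched above, checks this guarantee on a small stream:

```python
# Check f_j - m/k <= f-hat_j <= f_j on a tiny stream
stream = [1] * 5 + [2] * 4 + [3, 4, 5]   # m = 12
mg = MisraGries(k=3)                     # additive error at most m/k = 4
for y in stream:
    mg.update(y)
for j in range(1, 6):
    fj = stream.count(j)
    assert fj - len(stream) / 3 <= mg.query(j) <= fj
```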
In Misra-Gries, we compute a table $A$. The table $A$ stores information about the stream, so we can extract frequencies from it.

However, Misra-Gries suffers from the following main drawbacks:
▶ given two tables $A_1$ and $A_2$ with respect to streams $\sigma_1$ and $\sigma_2$ respectively, we don't know how to obtain the table for $\sigma_1 \circ \sigma_2$ (algorithms with this property are called sketches);
▶ it does not extend to the turnstile model.

In the turnstile model, each entry of the stream is a pair $(a_j, \Delta_j)$. Upon receiving $(a_j, \Delta_j)$, we update $f_{a_j}$ to $f_{a_j} + \Delta_j$.
Count Sketch

Algorithm: Count Sketch

Init:    an array $C[j]$ for $j \in [k]$ where $k = \frac{3}{\varepsilon^2}$;
         a random hash function $h : [n] \to [k]$ from a 2-universal family;
         a random hash function $g : [n] \to \{-1, 1\}$ from a 2-universal family.
On input $(y, \Delta)$:
         $C[h(y)] \leftarrow C[h(y)] + \Delta \cdot g(y)$
Output:  On query $a$, output $\hat{f}_a = g(a) \cdot C[h(a)]$.
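A minimal Python sketch of this single-row Count Sketch follows. It assumes both $h$ and $g$ are built from the linear family $(ax + b) \bmod p$, with $g$ taking the parity of the hash value as the sign; this is one illustrative choice of family, not necessarily the one intended in the slides.

```python
import random

class CountSketch:
    """A minimal single-row Count Sketch for the turnstile model.

    Assumption: h and g come from the linear family (a*x + b) mod p;
    g maps the parity of the hash value to a sign in {-1, +1}.
    """

    def __init__(self, eps, p=(1 << 61) - 1):
        self.k = max(1, round(3 / eps ** 2))
        self.p = p
        self.ha, self.hb = random.randrange(1, p), random.randrange(p)
        self.ga, self.gb = random.randrange(1, p), random.randrange(p)
        self.C = [0] * self.k           # the array C

    def _h(self, x):                    # h : [n] -> [k]
        return ((self.ha * x + self.hb) % self.p) % self.k

    def _g(self, x):                    # g : [n] -> {-1, +1}
        return 1 - 2 * (((self.ga * x + self.gb) % self.p) & 1)

    def update(self, y, delta=1):
        # on input (y, Delta): C[h(y)] += Delta * g(y)
        self.C[self._h(y)] += delta * self._g(y)

    def query(self, a):
        return self._g(a) * self.C[self._h(a)]
```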
Analysis

Let $X = \hat{f}_a$ be the output on the query $a$. For every $j \in [n]$, let $Y_j$ be the indicator of $h(j) = h(a)$.

We have
$$X = g(a) \cdot \sum_{j=1}^{n} f_j \cdot g(j) \cdot Y_j.$$

$$\mathbb{E}[X] = \mathbb{E}\left[g(a) \cdot g(a) \cdot f_a \cdot Y_a + \sum_{j \in [n] \setminus \{a\}} g(a) \cdot f_j \cdot g(j) \cdot Y_j\right] = f_a.$$

Let $Z \triangleq \sum_{j \in [n] \setminus \{a\}} f_j \cdot g(a) \cdot g(j) \cdot Y_j$; then $X = f_a + Z$ and $\mathrm{Var}[X] = \mathrm{Var}[Z]$.
$$\begin{aligned}
\mathbb{E}\left[Z^2\right] &= \mathbb{E}\left[\left(\sum_{j \in [n] \setminus \{a\}} f_j \cdot g(a) \cdot g(j) \cdot Y_j\right)^2\right] \\
&= \mathbb{E}\left[\sum_{j \in [n] \setminus \{a\}} f_j^2 \cdot Y_j^2 + \sum_{\substack{j, j' \in [n] \setminus \{a\} \\ j \ne j'}} f_j \cdot f_{j'} \cdot g(j) \cdot g(j') \cdot Y_j \cdot Y_{j'}\right] \\
&= \mathbb{E}\left[\sum_{j \in [n] \setminus \{a\}} f_j^2 \cdot Y_j^2\right] = \sum_{j \in [n] \setminus \{a\}} f_j^2 \cdot \mathbb{E}\left[Y_j^2\right].
\end{aligned}$$

Note that for every $j \ne a$,
$$\mathbb{E}\left[Y_j^2\right] = \mathbb{E}[Y_j] = \Pr[h(j) = h(a)] = \frac{1}{k}.$$

Therefore
$$\mathbb{E}\left[Z^2\right] = \sum_{j \in [n] \setminus \{a\}} \frac{f_j^2}{k} \le \frac{\|f\|_2^2}{k}.$$
$$\mathrm{Var}[X] = \mathrm{Var}[Z] = \mathbb{E}\left[Z^2\right] - (\mathbb{E}[Z])^2 \le \frac{\|f\|_2^2}{k}.$$

By Chebyshev,
$$\Pr\left[\left|\hat{f}_a - f_a\right| \ge \varepsilon \|f\|_2\right] \le \frac{1}{k \varepsilon^2} = \frac{1}{3}.$$

We can then use the median trick to boost the algorithm (see the sketch below) so that
▶ $\Pr\left[\left|\hat{f}_a - f_a\right| \ge \varepsilon \|f\|_2\right] \le \delta$;
▶ it costs $O\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta} \left(\log m + \log n\right)\right)$ bits of memory.

Compare the performance (in terms of accuracy and space consumption) of Misra-Gries and Count Sketch (Exercise).
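A minimal sketch of the median-trick boosting, assuming the CountSketch class above: run $t = O(\log \frac{1}{\delta})$ independent copies and answer each query with the median of their estimates (the constant 18 below is an illustrative choice, not the optimal one).

```python
import math
import statistics

class MedianCountSketch:
    """Median-trick boosting: t independent Count Sketches,
    queries answered by the median estimate."""

    def __init__(self, eps, delta):
        t = max(1, math.ceil(18 * math.log(1 / delta)))  # t = O(log(1/delta))
        self.copies = [CountSketch(eps) for _ in range(t)]

    def update(self, y, delta=1):
        for s in self.copies:
            s.update(y, delta)

    def query(self, a):
        return statistics.median(s.query(a) for s in self.copies)
```

Since each copy errs with probability at most $1/3$, the median errs only if at least half the copies err, which by a Chernoff bound happens with probability at most $\delta$ for $t = O(\log \frac{1}{\delta})$.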