Lecture 2 Barna Saha AT&T-Labs Research September 12, 2013
Outline
◮ Concentration Inequalities Revisited
◮ Universal Family of Hash Functions
◮ Counting Distinct Items
◮ Analysis of Algorithm from Lecture 0
◮ AMS Algorithm for Counting Distinct Elements
First and Second Moment Bounds
◮ Markov Inequality: For any positive random variable $X$ and $t > 0$,
$$\Pr[X > t] \le \frac{E[X]}{t}$$
◮ Chebyshev Inequality: For any random variable $X$ and $t > 0$,
$$\Pr\big[\,|X - E[X]| > t\,\big] \le \frac{\mathrm{Var}[X]}{t^2}$$
The Chernoff Bound
◮ Let $X_1, X_2, \ldots, X_n$ be $n$ independent Bernoulli random variables with $\Pr(X_i = 1) = p_i$. Let $X = \sum_i X_i$. Hence,
$$E[X] = E\Big[\sum_i X_i\Big] = \sum_i E[X_i] = \sum_i \Pr(X_i = 1) = \sum_i p_i = \mu \text{ (say)}.$$
Then the Chernoff Bound says that for any $\epsilon > 0$,
$$\Pr(X > (1+\epsilon)\mu) \le \left(\frac{e^{\epsilon}}{(1+\epsilon)^{(1+\epsilon)}}\right)^{\mu}$$
and
$$\Pr(X < (1-\epsilon)\mu) \le \left(\frac{e^{-\epsilon}}{(1-\epsilon)^{(1-\epsilon)}}\right)^{\mu}$$
When $0 < \epsilon < 1$, the above expressions can be further simplified to
$$\Pr(X > (1+\epsilon)\mu) \le e^{-\mu\epsilon^2/3} \quad \text{and} \quad \Pr(X < (1-\epsilon)\mu) \le e^{-\mu\epsilon^2/2}$$
Hence
$$\Pr(|X - \mu| > \epsilon\mu) \le 2e^{-\mu\epsilon^2/3}$$
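To get a feel for how loose or tight these bounds are, the following sketch compares the exact upper tail of a binomial variable with the general and the simplified Chernoff bounds. The parameter values (n = 200, p = 0.5, ε = 0.2) and all function names are illustrative assumptions, not taken from the lecture.

```python
import math

def binomial_upper_tail(n, p, threshold):
    """Exact Pr(X > threshold) for X ~ Binomial(n, p)."""
    start = math.floor(threshold) + 1
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(start, n + 1))

def chernoff_upper(mu, eps):
    """General bound (e^eps / (1 + eps)^(1 + eps))^mu."""
    return (math.exp(eps) / (1 + eps) ** (1 + eps)) ** mu

def chernoff_upper_simple(mu, eps):
    """Simplified bound exp(-mu * eps^2 / 3), stated for 0 < eps < 1."""
    return math.exp(-mu * eps**2 / 3)

# Illustrative parameters (assumed): 200 fair coin flips, 20% deviation.
n, p, eps = 200, 0.5, 0.2
mu = n * p
exact = binomial_upper_tail(n, p, (1 + eps) * mu)
general = chernoff_upper(mu, eps)
simple = chernoff_upper_simple(mu, eps)
print(f"exact={exact:.3g} general={general:.3g} simple={simple:.3g}")
```

For these parameters the exact tail is well below both bounds, and the general bound is tighter than the simplified one.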
Universal Hash Family
◮ A family of hash functions $\mathcal{H} = \{h \mid h : [N] \to [M]\}$ is called a pairwise independent family of hash functions if for all $i \ne j \in [N]$ and any $k, l \in [M]$,
$$\Pr_{h \leftarrow \mathcal{H}}\big[h(i) = k \wedge h(j) = l\big] = \frac{1}{M^2} \quad \text{(strongly universal hash family)} \qquad (1)$$
◮ Hash functions are uniform over $[M]$:
$$\Pr_{h \leftarrow \mathcal{H}}\big[h(i) = k\big] = \frac{1}{M} \qquad (2)$$
$$\Pr_{h \leftarrow \mathcal{H}}\big[h(i) = h(j)\big] = \frac{1}{M} \quad \text{(weakly universal hash family)} \qquad (3)$$
◮ Construction: Let $p$ be a prime. For any $a, b \in \mathbb{Z}_p = \{0, 1, 2, \ldots, p-1\}$, define $h_{a,b} : \mathbb{Z}_p \to \mathbb{Z}_p$ by $h_{a,b}(x) = ax + b \bmod p$. Then the collection of functions $\mathcal{H} = \{h_{a,b} \mid a, b \in \mathbb{Z}_p\}$ is a pairwise independent hash family.
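The construction can be checked by brute force for a small prime: for every pair of distinct inputs and every pair of target values, exactly one of the p² functions h_{a,b} should hit both targets, which gives probability 1/p². The sketch below does this enumeration; the function names are hypothetical, and the check with p = 6 is included only to illustrate that primality matters.

```python
from itertools import product

def make_hash(a, b, p):
    """Return h_{a,b}(x) = (a*x + b) mod p."""
    return lambda x: (a * x + b) % p

def joint_count(p, i, j, k, l):
    """Number of pairs (a, b) in Z_p x Z_p with h(i) = k and h(j) = l."""
    return sum(1 for a, b in product(range(p), repeat=2)
               if (a * i + b) % p == k and (a * j + b) % p == l)

def is_pairwise_independent(p):
    """True if every (i != j, k, l) is hit by exactly 1 of the p^2 functions."""
    return all(joint_count(p, i, j, k, l) == 1
               for i in range(p) for j in range(p) if i != j
               for k in range(p) for l in range(p))

print(is_pairwise_independent(5))  # prime modulus
print(is_pairwise_independent(6))  # composite modulus, for contrast
```

With a composite modulus such as 6, some target pairs are unreachable (for example i = 0, j = 2 with l − k odd), so the exhaustive check fails.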
Counting Distinct Items
Algorithm 1 [$a, \epsilon, \delta$]
  $\epsilon' = \epsilon/2$
  for $t = 1, \lceil (1+\epsilon') \rceil, \lceil (1+\epsilon')^2 \rceil, \ldots, \lceil (1+\epsilon')^{\log_{1+\epsilon'} n} \rceil$ do
    $\delta' = \frac{\epsilon' \delta}{\log n}$
    {Run in parallel}
    $b_t$ = ESTIMATE($a, t, \epsilon', \delta'$) {$b_t$ is a boolean variable YES/NO}
  end for
  return the smallest value of $t$ such that $b_{t-1}$ = YES and $b_t$ = NO; if no such $t$ exists, return $n$
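The outer loop of Algorithm 1 can be sketched separately from ESTIMATE: enumerate the geometric sequence of guesses, then pick the first guess where the answer flips from YES to NO. In the sketch below the helper names are hypothetical, duplicate guesses produced by the ceiling are collapsed, the last guess is capped at n, and "b_{t-1}" is read as the answer for the previous guess in the sequence; these are interpretation choices, not details stated on the slide.

```python
import math

def guesses(n, eps):
    """Guesses 1, ceil(1+eps'), ceil((1+eps')^2), ..., capped at n (eps' = eps/2)."""
    eps1 = eps / 2
    ts, k = [], 0
    while (1 + eps1) ** k < n:
        t = math.ceil((1 + eps1) ** k)
        if not ts or t > ts[-1]:  # the ceiling can repeat small values
            ts.append(t)
        k += 1
    if not ts or ts[-1] < n:
        ts.append(n)
    return ts

def smallest_transition(ts, answers, n):
    """Smallest guess whose answer is NO while the previous guess answered YES."""
    for prev, cur in zip(ts, ts[1:]):
        if answers[prev] == "YES" and answers[cur] == "NO":
            return cur
    return n  # no YES -> NO transition found

ts = guesses(100, 1.0)
# Hypothetical ESTIMATE answers: YES for every guess up to 26, NO afterwards.
answers = {t: "YES" if t <= 26 else "NO" for t in ts}
print(ts, smallest_transition(ts, answers, 100))
```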
Counting Distinct Items
Algorithm 2 [ESTIMATE($a, t, \epsilon', \delta'$)]
  count ← 0
  for $i = 1$ to $\frac{c}{\epsilon'^2} \log \frac{1}{\delta'}$ do
    Select a hash function $h_i$ uniformly at random from a fully independent hash family $\mathcal{H}$ {run in parallel}
    $b_t^i$ ← NO
    repeat
      Consider the current element in the stream $a$, say $a_l = (j, \nu)$
      if $h_i(j) == 1$ then
        $b_t^i$ ← YES, BREAK
      end if
    until $a$ is exhausted
    if $b_t^i$ == NO then
      count = count + 1
    end if
  end for
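Counting Distinct Items
Algorithm 3 [ESTIMATE($a, t, \epsilon', \delta'$)] continued
  if count $\ge \frac{1}{e} \cdot \frac{c}{\epsilon'^2} \log \frac{1}{\delta'}$ then
    return NO
  else
    return YES
  end if
◮ Space Complexity: $O\left(\frac{1}{\epsilon^3} \log n \left(\log \frac{1}{\delta} + \log\log n + \log \frac{1}{\epsilon}\right)\right)$
◮ Time Complexity: $O\left(\frac{1}{\epsilon^3} \log n \left(\log \frac{1}{\delta} + \log\log n + \log \frac{1}{\epsilon}\right)\right)$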
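A minimal simulation of ESTIMATE might look as follows. Several details are assumptions rather than facts from the slides: each hash function is taken to map items uniformly into {1, ..., t} (so a single item hits 1 with probability 1/t), the fully independent hash is simulated with a memoized table of random values, the constant c defaults to an arbitrary 4, and the rounds rescan a stored list instead of running in parallel over one pass.

```python
import math
import random

def estimate(stream, t, eps, delta, c=4, rng=None):
    """Return "YES" or "NO" for the guess t, following Algorithms 2 and 3.

    stream: list of (item, value) pairs; rescanned once per round.
    c: unspecified constant from the slides (4 is an arbitrary choice).
    """
    rng = rng or random.Random()
    rounds = math.ceil(c / eps**2 * math.log(1 / delta))
    count = 0  # number of rounds in which no item hashed to 1
    for _ in range(rounds):
        table = {}  # memoized random values stand in for a fully independent hash
        hit = False
        for j, _nu in stream:
            if j not in table:
                table[j] = rng.randint(1, t)  # assumed range {1, ..., t}
            if table[j] == 1:
                hit = True
                break
        if not hit:
            count += 1
    return "NO" if count >= rounds / math.e else "YES"

many = [(j, 1) for j in range(1000)]  # 1000 distinct items
few = [(7, 1)] * 5                    # a single distinct item, repeated
print(estimate(many, 10, 0.5, 0.1, rng=random.Random(0)))
print(estimate(few, 1000, 0.5, 0.1, rng=random.Random(0)))
```

With far more distinct items than t, almost every round sees a hit and the answer is YES; with far fewer, almost no round does and the answer is NO.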
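Counting Distinct Items
◮ Lemma: Consider the $i$th round of ESTIMATE($a, t, \epsilon', \delta'$) for any $i \in \left[\frac{C}{\epsilon^2} \log \frac{1}{\delta'}\right]$.
  ◮ If $DE > (1+\epsilon')t$ then $\Pr\big[b_t^i == \text{NO}\big] \le \frac{1}{e} - \frac{\epsilon}{2e}$.
  ◮ If $DE < (1-\epsilon')t$ then $\Pr\big[b_t^i == \text{NO}\big] \ge \frac{1}{e} + \frac{\epsilon}{2e}$.
◮ Lemma
  ◮ If $DE > (1+\epsilon')t$ then $\Pr\big[b_t == \text{NO}\big] \le \frac{\delta'}{2}$.
  ◮ If $DE < (1-\epsilon')t$ then $\Pr\big[b_t == \text{YES}\big] \le \frac{\delta'}{2}$.
◮ Lemma
  ◮ If $|DE - t| > \epsilon' t$ then $\Pr\big[\text{ERROR}\big] \le \delta'$.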
Counting Distinct Items
◮ Lemma: For all $t$ such that $|DE - t| > \epsilon' t$, $\Pr\big[\text{ERROR}\big] \le \delta$.
◮ Theorem: Algorithm 1 returns an estimate of $DE$ within $(1 \pm \epsilon)$ with probability $\ge (1 - \delta)$.
AMS Sketch for Counting Distinct Elements
◮ Uses a pairwise independent hash function
◮ Improved space and time complexity
◮ Worse approximation
Algorithm 4 AMS Counting Distinct Items
  Initialize
    $z$ ← 0
  End Initialize
  Process($a_l = (j, \nu)$)
    if zeros($h(j)$) > $z$ then
      $z$ ← zeros($h(j)$)
    end if
  End Process
  Estimate
    return $2^{z + \frac{1}{2}}$
  End Estimate
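Algorithm 4 translates almost line by line into code. The sketch below assumes that zeros(x) counts the trailing zero bits of x, draws h from the family h_{a,b}(x) = ax + b mod p with the Mersenne prime p = 2^31 − 1, and treats zeros(0) as the bit width of p; all three choices, along with the class and method names, are assumptions made for illustration. Hash values that are uniform over Z_p have only approximately geometric trailing-zero counts.

```python
import random

def zeros(x, width):
    """Trailing zero bits of x; x == 0 is treated as `width` zeros by convention."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1

class AMSDistinct:
    """AMS distinct-elements sketch: stores a hash (a, b) and one counter z."""

    def __init__(self, p=2_147_483_647, rng=None):
        rng = rng or random.Random()
        self.p = p                  # assumed prime modulus 2^31 - 1
        self.a = rng.randrange(p)   # h(x) = (a*x + b) mod p
        self.b = rng.randrange(p)
        self.width = p.bit_length()
        self.z = 0

    def process(self, j, nu=1):
        """Update z with the stream element (j, nu); nu is ignored."""
        r = zeros((self.a * j + self.b) % self.p, self.width)
        if r > self.z:
            self.z = r

    def estimate(self):
        return 2 ** (self.z + 0.5)

sketch = AMSDistinct(rng=random.Random(0))
for item in [1, 2, 3, 1, 2, 3, 3]:
    sketch.process(item)
print(sketch.z, sketch.estimate())
```

Because z only depends on the maximum over hashed items, repeated items never change the sketch, which is what makes it a distinct-elements estimator.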
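AMS Sketch for Counting Distinct Elements
◮ Define $X_{r,j} = 1$ if zeros($h(j)$) $\ge r$ and $0$ otherwise. Define $Y_r = \sum_j X_{r,j}$, where the sum is over the distinct items $j$ in the stream.
◮ Lemma
  ◮ $E[X_{r,j}] = \frac{1}{2^r}$
  ◮ $E[Y_r] = \frac{DE}{2^r}$
  ◮ $\mathrm{Var}[Y_r] \le \frac{DE}{2^r}$
◮ Lemma
  ◮ Consider the largest level $a$ such that $2^{a + \frac{1}{2}} < \frac{DE}{3}$. Then $\Pr[z \le a] < \frac{\sqrt{2}}{3}$.
  ◮ Consider the smallest level $b$ such that $2^{b + \frac{1}{2}} > 3\,DE$. Then $\Pr[z \ge b] < \frac{\sqrt{2}}{3}$.
  ◮ $\Pr\left[\frac{DE}{3} < 2^{z + \frac{1}{2}} < 3\,DE\right] \ge 1 - \frac{2\sqrt{2}}{3}$.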
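The slide states the second lemma without proof. One standard way to obtain it, sketched here under the assumption that the hash values of distinct items are pairwise independent (so the variance of the sum is the sum of the variances), applies Markov's inequality to the upper level and Chebyshev's inequality to the lower level:

```latex
\begin{align*}
\mathrm{Var}[Y_r] &= \sum_j \mathrm{Var}[X_{r,j}] \le \sum_j E[X_{r,j}] = E[Y_r] = \frac{DE}{2^r}
  && \text{(pairwise independence)} \\[4pt]
\Pr[z \ge b] &= \Pr[Y_b \ge 1] \le E[Y_b] = \frac{DE}{2^b} < \frac{\sqrt{2}}{3}
  && \text{(Markov, using } 2^{b+\frac12} > 3\,DE\text{)} \\[4pt]
\Pr[z \le a] &= \Pr[Y_{a+1} = 0]
  \le \Pr\big[\,|Y_{a+1} - E[Y_{a+1}]| \ge E[Y_{a+1}]\,\big] \\
  &\le \frac{\mathrm{Var}[Y_{a+1}]}{E[Y_{a+1}]^2}
  \le \frac{2^{a+1}}{DE} < \frac{\sqrt{2}}{3}
  && \text{(Chebyshev, using } 2^{a+\frac12} < \tfrac{DE}{3}\text{)}
\end{align*}
```

A union bound over the two bad events then gives the stated success probability of at least $1 - \frac{2\sqrt{2}}{3}$.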
AMS Sketch for Counting Distinct Elements
◮ Boosting the confidence: Median Trick. Keep $C \log \frac{1}{\delta}$ copies and return the median estimate.
◮ Theorem: There exists a randomized algorithm that returns an estimate of $DE$ satisfying
$$\Pr\left[\frac{DE}{3} < 2^{z + \frac{1}{2}} < 3\,DE\right] \ge 1 - \delta$$
using space $O\left(\log \frac{1}{\delta} \log n\right)$.
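The median trick can be sketched as running several independent copies of the AMS estimator and taking the median of their outputs. As before, the hash family h(x) = ax + b mod (2^31 − 1), the trailing-zeros reading of zeros, and the function names are illustrative assumptions; the number of copies is passed in directly because the constant C is not specified. The copies rescan a stored list here, whereas a streaming implementation would update all of them in a single pass.

```python
import random
import statistics

P = 2_147_483_647  # the Mersenne prime 2^31 - 1 (assumed modulus)

def trailing_zeros(x):
    """Trailing zero bits of x; 0 is treated as 31 zeros by convention."""
    return 31 if x == 0 else (x & -x).bit_length() - 1

def ams_estimate(stream, a, b):
    """One AMS copy with hash h(x) = (a*x + b) mod P; returns 2^(z + 1/2)."""
    z = 0
    for j in stream:
        z = max(z, trailing_zeros((a * j + b) % P))
    return 2 ** (z + 0.5)

def median_estimate(stream, copies, rng):
    """Median of `copies` independent AMS estimates over the same stream."""
    params = [(rng.randrange(P), rng.randrange(P)) for _ in range(copies)]
    return statistics.median(ams_estimate(stream, a, b) for a, b in params)

items = list(range(1, 501))  # 500 distinct items
est = median_estimate(items * 2, 9, random.Random(1))
print(est)
```

Using an odd number of copies keeps the median equal to one of the individual estimates, so the result is still of the form $2^{z + 1/2}$.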