Lecture 7 Barna Saha AT&T-Labs Research September 26, 2013
Outline
◮ Sampling
◮ Estimating $F_k$ [AMS'96]
◮ Reservoir Sampling
◮ Priority Sampling
Estimating $F_k$
◮ Suppose you know $m$, the stream length.
◮ Sample an index $p \in \{1, \dots, m\}$ uniformly at random, so each index is chosen with probability $\frac{1}{m}$. Suppose $a_p = l$.
◮ Compute $r = |\{ q : q \ge p,\ a_q = l \}|$, the number of occurrences of $l$ in the stream starting from $a_p$.
◮ Return $X = m\,(r^k - (r-1)^k)$.
◮ Show $E[X] = F_k$ and $\mathrm{Var}[X] \le k\,n^{1-\frac{1}{k}} (F_k)^2$.
Estimating $F_k$
◮ Maintain $s_1 = O\!\left(\frac{k\,n^{1-\frac{1}{k}}}{\epsilon^2}\right)$ such estimates $X_1, X_2, \dots, X_{s_1}$. Take the average, $Y = \frac{1}{s_1}\sum_{i=1}^{s_1} X_i$.
◮ Maintain $s_2 = O(\log\frac{1}{\delta})$ of these average estimates, $Y_1, Y_2, \dots, Y_{s_2}$, and take the median.
◮ The median is a $(1 \pm \epsilon)$ approximation of $F_k$ with probability $\ge 1 - \delta$.
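The two slides above can be sketched in Python as follows. This is a minimal illustration, assuming the whole stream is available as a list so that $m$ is known. The function names (`ams_single_estimate`, `ams_estimate`, `exact_moment`) are hypothetical, and `s1`, `s2` are passed in directly rather than derived from $\epsilon$ and $\delta$.

```python
import random
from statistics import median

def ams_single_estimate(stream, k, rng):
    """One basic estimate X = m * (r^k - (r-1)^k), stream length m known."""
    m = len(stream)
    p = rng.randrange(m)          # uniform position, probability 1/m each
    item = stream[p]
    # r = number of occurrences of the sampled item from position p onward
    r = sum(1 for q in range(p, m) if stream[q] == item)
    return m * (r**k - (r - 1)**k)

def ams_estimate(stream, k, s1, s2, rng=None):
    """Median of s2 averages, each taken over s1 independent basic estimates."""
    rng = rng or random.Random()
    means = [
        sum(ams_single_estimate(stream, k, rng) for _ in range(s1)) / s1
        for _ in range(s2)
    ]
    return median(means)

def exact_moment(stream, k):
    """F_k = sum of f_i^k over the distinct items, for comparison."""
    counts = {}
    for a in stream:
        counts[a] = counts.get(a, 0) + 1
    return sum(f**k for f in counts.values())
```

Averaging reduces the variance of a single estimate, and the median over several averages boosts the success probability, which is why the two parameters play different roles.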
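Estimating $F_k$
Lemma: $E[X] = F_k$.
Let $f_i$ denote the number of occurrences of item $i$ in the stream. Then
$$
\begin{aligned}
E[X] &= \sum_{i=1}^{n} \sum_{j=1}^{f_i} \frac{1}{m}\, E\big[X \mid i \text{ is sampled on its } j\text{-th occurrence}\big] \\
&= \sum_{i=1}^{n} \sum_{j=1}^{f_i} \frac{1}{m}\, m \big((f_i - j + 1)^k - (f_i - j)^k\big) \\
&= \sum_{i=1}^{n} \Big[1^k + (2^k - 1^k) + (3^k - 2^k) + \dots + \big(f_i^k - (f_i - 1)^k\big)\Big] \\
&= \sum_{i=1}^{n} f_i^k = F_k .
\end{aligned}
$$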
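The lemma can be checked exactly on a small stream by averaging $X$ over all $m$ equally likely sample positions. The sketch below is illustrative only; `basic_estimate_at` and `expected_estimate` are hypothetical helper names.

```python
from fractions import Fraction

def basic_estimate_at(stream, p, k):
    """Value of X when position p (0-based) is the sampled index."""
    m = len(stream)
    r = sum(1 for q in range(p, m) if stream[q] == stream[p])
    return m * (r**k - (r - 1)**k)

def expected_estimate(stream, k):
    """Exact E[X]: every position is sampled with probability 1/m."""
    m = len(stream)
    total = sum(basic_estimate_at(stream, p, k) for p in range(m))
    return Fraction(total, m)
```

For the stream `list("abracadabra")` and `k = 3`, the frequencies are 5, 2, 2, 1, 1, so $F_3 = 125 + 8 + 8 + 1 + 1 = 143$, and `expected_estimate` returns exactly that value.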
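Estimating $F_k$
Lemma: $\mathrm{Var}[X] \le k\,n^{1-\frac{1}{k}} (F_k)^2$.
Since $\mathrm{Var}[X] \le E[X^2]$, it suffices to bound $E[X^2]$:
$$
\begin{aligned}
E[X^2] &= \sum_{i=1}^{n} \sum_{j=1}^{f_i} \frac{1}{m}\, E\big[X^2 \mid i \text{ is sampled on its } j\text{-th occurrence}\big] \\
&= \sum_{i=1}^{n} \sum_{j=1}^{f_i} \frac{1}{m}\, m^2 \big((f_i - j + 1)^k - (f_i - j)^k\big)^2 \\
&= m \sum_{i=1}^{n} \Big[1^{2k} + (2^k - 1^k)^2 + (3^k - 2^k)^2 + \dots + \big(f_i^k - (f_i - 1)^k\big)^2\Big] \\
&\le m \sum_{i=1}^{n} \Big[k\,1^{2k-1} + k\,2^{k-1}(2^k - 1^k) + \dots + k\,f_i^{k-1}\big(f_i^k - (f_i - 1)^k\big)\Big]
\end{aligned}
$$
Using $a^k - b^k = (a - b)(a^{k-1} + b\,a^{k-2} + \dots + b^{k-1}) \le (a - b)\,k\,a^{k-1}$ for $a \ge b \ge 0$.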
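Estimating $F_k$
Continuing the bound on $E[X^2]$: since $j^{k-1} \le f_i^{k-1}$ for every $j \le f_i$, the inner sum telescopes, and $m = F_1$:
$$
\begin{aligned}
m \sum_{i=1}^{n} \Big[k\,1^{2k-1} + k\,2^{k-1}(2^k - 1^k) + \dots + k\,f_i^{k-1}\big(f_i^k - (f_i - 1)^k\big)\Big]
&\le m k \sum_{i=1}^{n} f_i^{k-1} \Big[1^k + (2^k - 1^k) + \dots + \big(f_i^k - (f_i - 1)^k\big)\Big] \\
&= m k \sum_{i=1}^{n} f_i^{2k-1} = m k F_{2k-1} \\
&= k F_1 F_{2k-1} \le k\,n^{1-\frac{1}{k}} \Big(\sum_{i=1}^{n} f_i^k\Big)^2 = k\,n^{1-\frac{1}{k}} (F_k)^2 .
\end{aligned}
$$
Reference: "The space complexity of approximating the frequency moments" by Alon, Matias, and Szegedy.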
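As a sanity check on the chain $E[X^2] \le k F_1 F_{2k-1} \le k\,n^{1-1/k}(F_k)^2$, the three quantities can be computed exactly for a small stream. This sketch assumes $n$ is taken to be the number of distinct items actually present, and the helper names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def moment(stream, k):
    """F_k = sum over distinct items of f_i^k."""
    return sum(f**k for f in Counter(stream).values())

def second_moment_of_estimate(stream, k):
    """Exact E[X^2], averaging X^2 over all m sample positions."""
    m = len(stream)
    seen = Counter()
    total = 0
    # Scan backwards so seen[a] equals r, the count of a from this position on.
    for a in reversed(stream):
        seen[a] += 1
        r = seen[a]
        total += (m * (r**k - (r - 1)**k)) ** 2
    return Fraction(total, m)

def variance_bound_chain(stream, k):
    """Return (E[X^2], k*F_1*F_{2k-1}, k*n^(1-1/k)*F_k^2)."""
    n = len(Counter(stream))  # assumption: n = number of distinct items seen
    return (
        second_moment_of_estimate(stream, k),
        k * moment(stream, 1) * moment(stream, 2 * k - 1),
        k * n ** (1 - 1 / k) * moment(stream, k) ** 2,
    )
```

On `list("abracadabra")` with `k = 2` the three values are 2057, 3146, and roughly 5478, so both inequalities hold with room to spare on this input.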
Uniform Random Sample from Stream Without Replacement
◮ What happens when you do not know $m$?
Check out: "Algorithms Every Data Scientist Should Know: Reservoir Sampling", http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/
Reservoir Sampling
◮ How do you find a uniform sample $s$ from the stream if you do not know $m$?
◮ Initially $s = a_1$.
◮ On seeing the $t$-th element, set $s = a_t$ with probability $\frac{1}{t}$.
◮ $\Pr[s = a_i] = \frac{1}{i}\left(1 - \frac{1}{i+1}\right)\left(1 - \frac{1}{i+2}\right)\cdots\left(1 - \frac{1}{t}\right) = \frac{1}{t}$
◮ Can you extend the AMS algorithm to a single pass now?
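A minimal Python sketch of the single-element reservoir, together with one possible answer to the closing question: keep a counter `r` next to the sample, reset it to 1 whenever the sample is replaced, and increment it whenever the current element equals the sample. The function names are hypothetical, and the one-pass variant is an illustration rather than the construction given in the lecture.

```python
import random

def reservoir_sample_one(stream, rng=None):
    """Uniform sample from a stream of unknown length (None if empty)."""
    rng = rng or random.Random()
    s = None
    for t, a in enumerate(stream, start=1):
        if rng.randrange(t) == 0:   # happens with probability 1/t
            s = a
    return s

def ams_one_pass(stream, k, rng=None):
    """Single-pass basic estimate X = t * (r^k - (r-1)^k), length not known upfront."""
    rng = rng or random.Random()
    sample, r, t = None, 0, 0
    for a in stream:
        t += 1
        if rng.randrange(t) == 0:   # replace the sample with probability 1/t
            sample, r = a, 1        # restart the occurrence count at this position
        elif a == sample:
            r += 1                  # another occurrence after the sampled position
    return t * (r**k - (r - 1)**k)
```

Because the sampled position is uniform over the elements seen so far, `r` ends up counting occurrences from a uniformly chosen position onward, which matches the quantity used in the two-pass description.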
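Reservoir Sampling of size $k$
◮ How do you find a uniform sample $s$ of size $k$ from the stream if you do not know $m$?
◮ Initially $s = \{a_1, a_2, \dots, a_k\}$.
◮ On seeing the $t$-th element ($t > k$), pick a number $r \in [1, t]$ uniformly at random.
◮ If $r \le k$, replace the $r$-th element of $s$ by $a_t$.
◮ For $i \ge k$: $\Pr[a_i \in s] = \frac{k}{i}\left(1 - \frac{1}{i+1}\right)\left(1 - \frac{1}{i+2}\right)\cdots\left(1 - \frac{1}{t}\right) = \frac{k}{t}$. An element with $i < k$ is in $s$ with probability 1 after step $k$, and the same product starting at $k$ again gives $\frac{k}{t}$.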
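The steps above translate almost line for line into Python. This is a small sketch with a hypothetical function name; it returns fewer than `k` items if the stream is shorter than `k`.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Uniform sample of size k, without replacement, from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for t, a in enumerate(stream, start=1):
        if t <= k:
            reservoir.append(a)        # the first k elements fill the reservoir
        else:
            r = rng.randint(1, t)      # uniform in [1, t], both ends inclusive
            if r <= k:
                reservoir[r - 1] = a   # replace the r-th slot (1-based) by a_t
    return reservoir
```

Each stored element is overwritten at step `t` with probability `1/t`, which is the factor `1 - 1/t` in the survival product on the slide.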
Priority Sampling
◮ Element $i$ has weight $w_i$.
◮ Keep a sample of size $k$ such that any subset sum query can be answered later.
◮ Uniform sampling: misses the few heavy hitters.
◮ Weighted sampling with replacement: produces duplicates of heavy hitters.
◮ Weighted sampling without replacement: the expression is very complicated and does not work for subset sums.
Priority Sampling
◮ For each item $i = 0, 1, \dots, n - 1$, generate a random number $\alpha_i \in [0, 1]$ uniformly at random.
◮ Assign priority $q_i = \frac{w_i}{\alpha_i}$ to the $i$-th element.
◮ Select the $k$ highest-priority items as the sample $S$.
Priority Sampling
◮ Let $\tau$ be the $(k+1)$-th highest priority.
◮ Set $\hat{w}_i = \max(w_i, \tau)$ if $i$ is in the sample, and $\hat{w}_i = 0$ otherwise.
◮ $E[\hat{w}_i] = w_i$
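The scheme on these slides can be sketched in a few lines of Python. This is an illustration under stated assumptions: weights are positive and held in a list, $\alpha_i$ is drawn from $(0, 1]$ to avoid division by zero, and $\tau$ is taken as 0 when there are at most $k$ items, so that all weights are kept exactly. The names `priority_sample` and `subset_sum_estimate` are hypothetical.

```python
import heapq
import random

def priority_sample(weights, k, rng=None):
    """Return {index: estimated weight} for the k highest-priority items."""
    rng = rng or random.Random()
    # q_i = w_i / alpha_i, with alpha_i = 1 - random() lying in (0, 1]
    prios = [(w / (1.0 - rng.random()), i) for i, w in enumerate(weights)]
    top = heapq.nlargest(k + 1, prios)
    # tau = (k+1)-th highest priority; assumed 0 if fewer than k+1 items exist
    tau = top[k][0] if len(top) > k else 0.0
    return {i: max(weights[i], tau) for _, i in top[:k]}

def subset_sum_estimate(sample, subset):
    """Estimate the total weight of `subset` from the sample alone."""
    return sum(w_hat for i, w_hat in sample.items() if i in subset)
```

A subset sum query is answered by adding the estimates $\hat{w}_i$ of the sampled items that fall in the subset; items outside the sample contribute 0.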
Priority Sampling
◮ $A(\tau')$: the event that $\tau'$ is the $k$-th highest priority among all $j \ne i$.
◮ For any value of $\tau'$, $E[\hat{w}_i \mid A(\tau')] = \Pr[i \in S \mid A(\tau')] \cdot \max(w_i, \tau')$.
◮ $\Pr[i \in S \mid A(\tau')] = \Pr\!\left[\frac{w_i}{\alpha_i} > \tau'\right] = \Pr\!\left[\alpha_i < \frac{w_i}{\tau'}\right] = \min\!\left(1, \frac{w_i}{\tau'}\right)$
◮ $E[\hat{w}_i \mid A(\tau')] = \max(w_i, \tau') \cdot \min\!\left(1, \frac{w_i}{\tau'}\right) = w_i$
◮ This holds for all $\tau'$, hence it holds unconditionally.
Priority Sampling
◮ Near optimality: the variance of the weight estimator is minimal among all $(k+1)$-sparse unbiased estimators.