Count-Min Sketch

Anil Maheshwari
anil@scs.carleton.ca
School of Computer Science, Carleton University, Canada
Outline

1. Majority element
2. Count-Min Sketch
3. Complexity Analysis
4. Markov's Inequality
5. Proof of the claim
6. Conclusions
Problem

Finding the Majority Element
Input: A stream of $n$ elements that is guaranteed to contain a majority element, i.e. an element that occurs at least $1 + \lfloor n/2 \rfloor$ times.
Output: The majority element.

An example: $n = 19$
Input stream = [3 2 4 7 2 2 3 2 2 1 4 2 2 2 1 1 2 3 2]
Straightforward Solutions

Solution 1: Store the stream in an array $A$. Sort and pick the middle element.
Complexity: $O(n \log n)$ time and $O(n)$ space

Solution 2: Count the frequency of each element.
Input: 3 2 4 7 2 2 3 2 2 1 4 2 2 2 1 1 2 3 2

Element     1   2   3   4   7
Frequency   3  10   3   2   1

Complexity: ?
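Both straightforward solutions can be sketched in a few lines of Python. This is only an illustration on the example stream; the function names are hypothetical and not part of the slides.

```python
from collections import Counter

# Example stream from the slides (n = 19, majority element 2).
stream = [3, 2, 4, 7, 2, 2, 3, 2, 2, 1, 4, 2, 2, 2, 1, 1, 2, 3, 2]

def majority_by_sorting(items):
    """Solution 1: sort a copy and pick the middle element."""
    return sorted(items)[len(items) // 2]

def majority_by_counting(items):
    """Solution 2: count every element and return the most frequent one."""
    counts = Counter(items)
    element, _ = counts.most_common(1)[0]
    return element

frequencies = Counter(stream)   # the frequency table of Solution 2
print(majority_by_sorting(stream), majority_by_counting(stream))
```

Both functions keep the whole stream (or one counter per distinct element) in memory, which is the cost the next slide questions.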
Do we need that much space?

Finding the Majority Element
Input: A stream of $n$ elements that is guaranteed to contain a majority element.
Output: The majority element.

Memory required in Solutions 1 & 2 $\geq$ number of distinct elements in the stream.
What if we can only use $O(1)$ space?
Majority Algorithm

Input: Array $A$ of size $n$ containing a majority element
Output: The majority element

c ← 0
for i = 1 to n do
    if c = 0 then
        current ← A[i]; c ← c + 1
    else
        if A[i] = current then
            c ← c + 1
        else
            c ← c − 1
        end
    end
end
return current

A[i]      3 2 4 7 2 2 3 2 . . .
current   . . .
c         0 . . .
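The pseudocode above translates almost line for line into Python. The sketch below assumes, as the slide does, that a majority element exists; the function name is illustrative.

```python
def majority_element(stream):
    """Single pass, two variables: a candidate and its counter."""
    current = None
    c = 0
    for item in stream:
        if c == 0:
            # No surviving candidate: adopt the current item.
            current = item
            c = 1
        elif item == current:
            c += 1
        else:
            # A different item cancels one occurrence of the candidate.
            c -= 1
    return current

stream = [3, 2, 4, 7, 2, 2, 3, 2, 2, 1, 4, 2, 2, 2, 1, 1, 2, 3, 2]
result = majority_element(stream)   # 2 on the example stream
```

If the stream has no majority element, the returned value is just some element of the stream, so a second verification pass would be needed in that case.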
Analysis of Majority Algorithm

Observations
1. The algorithm maintains only two variables: c and current.
2. Correctness: Each non-majority element can 'kill' at most one majority element.

Claim
By performing a single pass, using only $O(1)$ additional space, we can report the majority element of $A$ (if it exists).
Misra & Gries [82] Algorithm

Finding Heavy Hitters
Input: A stream of $n$ elements and a fixed integer $k < n$.
Output: Report all heavy hitters, i.e. elements that occur $\geq n/k$ times.

1. Initialize k bins, each with a null element and a counter set to 0.
2. For each element x in the stream do
       if x ∈ bin b then
           increment bin b's counter
       else if some bin has counter 0 then
           assign x to this bin and set its counter to 1
       else
           decrement the counter of every bin
3. Output the elements in the bins.
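A minimal Python sketch of the three steps above follows. It stores the bins in a dictionary and treats "a bin whose counter is 0" as a free slot, which is an implementation choice rather than something the slide prescribes; the function name is hypothetical.

```python
def misra_gries(stream, k):
    """Return a dict {element: counter} holding at most k bins."""
    bins = {}
    for x in stream:
        if x in bins:
            bins[x] += 1                 # x already owns a bin
        elif len(bins) < k:
            bins[x] = 1                  # a free (zero-counter) bin exists
        else:
            # All k bins are occupied by other elements: decrement every bin
            # and free the bins whose counter drops to 0.
            for y in list(bins):
                bins[y] -= 1
                if bins[y] == 0:
                    del bins[y]
    return bins

stream = [3, 2, 4, 7, 2, 2, 3, 2, 2, 1, 4, 2, 2, 2, 1, 1, 2, 3, 2]
bins = misra_gries(stream, k=3)
```

The bins may also contain elements that are not heavy hitters, so the output is a candidate set; the counters are lower bounds on the true frequencies.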
Analysis of Misra and Gries Algorithm

Claim
Let $f^*_x$ = frequency of $x$ in the stream. Each heavy hitter $x$ is in one of the bins with counter value $\geq f^*_x - n/k$.

Correctness: What can be the minimum value of the counter of a heavy hitter?

Running Time:
Initializing k bins: $O(k)$ time
Processing each element requires looking at $O(k)$ bins.
Total Run Time = $O(nk)$

Space: $O(k)$

Reference: J. Misra and D. Gries, "Finding repeated elements", Science of Computer Programming, Vol. 2 (2): 143-152, 1982.
Count-Min Sketch

Problem
For a data stream, using very little space, we want to report
1. All the elements that occur frequently, e.g. in at least 2% of the stream.
2. For each element, its (approximate) frequency.
Count-Min Sketch Data Structure

Input: An array (stream) $A$ consisting of $n$ numbers and $r$ hash functions $h_1, \ldots, h_r$, where $h_i : \mathbb{N} \to \{1, \ldots, b\}$
Output: CMS[·,·] table consisting of $r$ rows and $b$ columns

for i = 1 to r do
    for j = 1 to b do
        CMS[i, j] ← 0
    end
end
for i = 1 to n do
    for j = 1 to r do
        CMS[j, h_j(A[i])] ← CMS[j, h_j(A[i])] + 1
    end
end
return CMS[·,·]
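The construction can be sketched in Python as below. The slide leaves the hash functions unspecified, so this sketch assumes integer items and uses hash functions of the form ((a·x + c) mod p) mod b with random a and c; that choice, the prime `PRIME`, and all class and method names are assumptions made for illustration. Columns are 0-based here.

```python
import random

PRIME = 2_147_483_647  # assumed prime modulus (2^31 - 1) for the hash functions

class CountMinSketch:
    def __init__(self, rows, cols, seed=0):
        rng = random.Random(seed)
        self.cols = cols
        # One (a, c) pair per row defines h_j(x) = ((a*x + c) mod PRIME) mod cols.
        self.params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
                       for _ in range(rows)]
        self.table = [[0] * cols for _ in range(rows)]

    def _column(self, j, x):
        a, c = self.params[j]
        return ((a * x + c) % PRIME) % self.cols

    def add(self, x):
        """Increment one counter per row, as in the inner loop of the slide."""
        for j in range(len(self.table)):
            self.table[j][self._column(j, x)] += 1

    def estimate(self, x):
        """Minimum over the r counters that x maps to."""
        return min(self.table[j][self._column(j, x)]
                   for j in range(len(self.table)))

stream = [3, 2, 4, 7, 2, 2, 3, 2, 2, 1, 4, 2, 2, 2, 1, 1, 2, 3, 2]
cms = CountMinSketch(rows=3, cols=10, seed=1)
for item in stream:
    cms.add(item)
```

Every item adds exactly one to each row, so each row sums to n, and `estimate` can only overcount because of collisions.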
Illustration of Algorithm

Let $b = 10$ and $r = 3$.
Assume that stream $A = xyy$.
Assume the following $h$-values for $x$ and $y$:
For $x$: $h_1(x) = 3$, $h_2(x) = 8$, and $h_3(x) = 5$
For $y$: $h_1(y) = 6$, $h_2(y) = 8$, and $h_3(y) = 1$

CMS[·,·] =
        1   2   3   4   5   6   7   8   9   10
    1
    2
    3

for i = 1 to n do
    for j = 1 to r do
        CMS[j, h_j(A[i])] ← CMS[j, h_j(A[i])] + 1
    end
end
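The illustration can be replayed with a short script. The hash values are hard-coded from the slide; the variable names and the 0-based list indexing are choices made for this sketch.

```python
# h-values from the slide, one 1-based column per row.
H = {"x": (3, 8, 5), "y": (6, 8, 1)}
r, b = 3, 10

cms = [[0] * b for _ in range(r)]
for item in "xyy":
    for j in range(r):
        cms[j][H[item][j] - 1] += 1      # convert 1-based column to list index

def estimate(item):
    return min(cms[j][H[item][j] - 1] for j in range(r))

for row in cms:
    print(row)
# Row 1: column 3 holds 1 and column 6 holds 2.
# Row 2: column 8 holds 3, because x and y collide there.
# Row 3: column 1 holds 2 and column 5 holds 1.
print(estimate("x"), estimate("y"))      # prints: 1 2
```

The collision in row 2 inflates that counter to 3, but taking the minimum over the rows still recovers the exact frequencies here.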
Observations

Let $n$ = total number of items in the stream.
$f^*_x$ = true frequency of $x$ in the stream.
Let $f_x = \min\{\mathrm{CMS}[1, h_1(x)], \ldots, \mathrm{CMS}[r, h_r(x)]\}$.
Report $f_x$ as the estimate of the frequency of $x$.

Observations:
1. The size of the CMS table ($= br$) is independent of $n$.
2. The CMS table can be computed in $O(br + nr)$ time.
3. For any $x \in A$, and for any $j = 1, \ldots, r$, $\mathrm{CMS}[j, h_j(x)] \geq f^*_x$.
4. $f_x$ is an overestimate, as $f_x \geq f^*_x$.
Assume - Proof comes later

Claim
Let $b = 2/\epsilon$. Then $\Pr[f_x - f^*_x \geq \epsilon n] \leq \frac{1}{2^r}$.

Corollary
With probability at least $1 - 1/2^r$, $f^*_x \leq f_x \leq f^*_x + \epsilon n$.
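To get a feel for the numbers in the Claim, the sketch below evaluates the table width, the failure probability, and the additive error for one choice of parameters. The function name and the sample values ($\epsilon = 1/30$, $r = 10$, $n = 3000$) are illustrative assumptions; exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction
import math

def cms_guarantee(eps, r, n):
    """Return (b, failure probability, additive error) from the Claim."""
    b = math.ceil(2 / eps)            # number of columns, b = 2 / eps
    failure = Fraction(1, 2 ** r)     # Pr[f_x - f*_x >= eps * n] <= 1 / 2^r
    max_error = eps * n               # additive overestimate allowed, eps * n
    return b, failure, max_error

b, failure, max_error = cms_guarantee(Fraction(1, 30), r=10, n=3000)
print(b, failure, max_error)          # prints: 60 1/1024 100
```

So a table with 60 columns and 10 rows overestimates any single frequency by 100 or more with probability at most 1/1024, regardless of how many distinct elements the stream contains.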
Reporting Frequent Elements

Suppose we want to report all the elements of $A$ that occur approximately $\geq n/k$ times for some integer $k$.

In the Claim, set $\epsilon = 1/3k$. Then $b = 2/\epsilon = 6k$.
Construct a CMS table of size $br = 6kr$.
Scan $A$ and compute the entries in the CMS table.
Maintain a set of $O(k)$ items that occur most frequently among all the elements in $A$ scanned so far.
Heap Data Structure

The items are stored in a HEAP with $f_x$ values as the key.

What is a Heap?
An array that stores $n$ elements and supports:
Find Max or Min: Report the element with the largest/smallest key value in the Heap in $O(1)$ time.
Insert($x, k$): Insert element $x$ with key $k$ in the Heap in $O(\log n)$ time.
Delete($x$): Delete element $x$ from the Heap in $O(\log n)$ time.
. . .
Reporting Frequent Elements contd.

Assume we have scanned $i - 1$ items and have updated the CMS table and the heap.
Consider the $i$-th item (say $x = A[i]$) and perform the following:

1. For $j = 1$ to $r$: update the CMS table by executing $\mathrm{CMS}[j, h_j(x)] \leftarrow \mathrm{CMS}[j, h_j(x)] + 1$.
2. Let $f_x = \min\{\mathrm{CMS}[1, h_1(x)], \ldots, \mathrm{CMS}[r, h_r(x)]\}$.
   If $f_x \geq i/k$, do:
   1. If $x \in$ heap, delete $x$ and re-insert it with the updated $f_x$ value.
   2. If $x \notin$ heap, then insert it in the heap and remove all the elements whose count is less than $i/k$.
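The per-item procedure can be sketched in Python as follows. Several details are assumptions of this sketch rather than part of the slides: items are integers, the hash functions are of the form ((a·x + c) mod p) mod b, and since Python's `heapq` has no Delete($x$) operation, deletion is done lazily (outdated heap entries are skipped when they reach the top) and the eviction step runs after every qualifying update. All names are hypothetical.

```python
import heapq
import random

PRIME = 2_147_483_647  # assumed prime modulus for the hash functions

class CountMinSketch:
    def __init__(self, rows, cols, seed=0):
        rng = random.Random(seed)
        self.cols = cols
        self.params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
                       for _ in range(rows)]
        self.table = [[0] * cols for _ in range(rows)]

    def _column(self, j, x):
        a, c = self.params[j]
        return ((a * x + c) % PRIME) % self.cols

    def add(self, x):
        for j in range(len(self.table)):
            self.table[j][self._column(j, x)] += 1

    def estimate(self, x):
        return min(self.table[j][self._column(j, x)]
                   for j in range(len(self.table)))

def frequent_elements(stream, k, rows=5):
    cms = CountMinSketch(rows, 6 * k)   # b = 6k columns, as on the slide
    best = {}   # x -> current estimate f_x for the candidates in the heap
    heap = []   # min-heap of (f_x, x); may contain outdated entries
    for i, x in enumerate(stream, start=1):
        cms.add(x)
        fx = cms.estimate(x)
        if fx >= i / k:
            best[x] = fx                      # insert x or refresh its key
            heapq.heappush(heap, (fx, x))
            # Pop outdated entries and candidates whose count fell below i/k.
            while heap and (best.get(heap[0][1]) != heap[0][0]
                            or heap[0][0] < i / k):
                f, y = heapq.heappop(heap)
                if best.get(y) == f:          # a live entry that is too small
                    del best[y]
    return best

stream = [3, 2, 4, 7, 2, 2, 3, 2, 2, 1, 4, 2, 2, 2, 1, 1, 2, 3, 2]
candidates = frequent_elements(stream, k=3)
```

The returned dictionary maps each surviving candidate to its estimate $f_x$. Because estimates only overcount, a true heavy hitter is never missed at the moment it is processed, but collisions can occasionally let a lighter element through.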
Reporting Frequent Elements contd.

Claim [Cormode and Muthukrishnan 2005]
Elements that occur approx. $n/k$ times in a data stream of size $n$ can be reported in $O(kr + nr + n \log k)$ time using $O(kr)$ space with high probability.

Proof.
Recall Corollary: $f^*_x \leq f_x \leq f^*_x + \epsilon n = f^*_x + n/3k$.
This implies: the heap contains elements whose frequency is at least $n/k - n/3k = 2n/3k \approx 0.667\,n/k$ (with high probability).
Size of heap = $O(k)$
Time Complexity: $O(br + nr + n \log k) = O(kr + nr + n \log k)$, as $b = 2/\epsilon = 6k$.
Total Space = $O(br + k) = O(kr)$
Markov's Inequality

Theorem
Let $X$ be a non-negative discrete random variable and $s > 0$ be a constant. Then $P(X \geq s) \leq E[X]/s$.
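A quick numerical check of the theorem on a concrete distribution can help build intuition. The fair six-sided die used below is an illustrative choice, not an example from the slides.

```python
from fractions import Fraction

outcomes = range(1, 7)        # X = result of one fair die roll
p = Fraction(1, 6)            # probability of each outcome

expectation = sum(x * p for x in outcomes)          # E[X] = 7/2

def tail(s):
    """Exact P(X >= s)."""
    return sum(p for x in outcomes if x >= s)

def markov_bound(s):
    """Upper bound E[X] / s from Markov's inequality."""
    return expectation / s

for s in outcomes:
    print(s, tail(s), markov_bound(s))
```

For $s = 5$ the exact tail probability is $1/3$ while the bound gives $7/10$: the inequality holds but is generally not tight.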