Count-Min Sketch
Anil Maheshwari
School of Computer Science, Carleton University, Canada

Outline
1. Review
2. Count-Min Sketch
3. Complexity Analysis
4. Probability Preliminaries
5. Proof of the claim
6. Conclusions

Majority Element Problem

Finding the Majority Element
Input: A stream of n elements that is known to contain a majority element (an element occurring more than n/2 times).
Output: The majority element.

- Store the stream in an array A.
- Sort A and pick the middle element (if the elements can be ordered).
- Count the frequency of each element.
Issue: May need O(n) memory.

Majority Algorithm

Input: Array A of size n containing a majority element
Output: The majority element

c ← 0
for i = 1 to n do
    if c = 0 then
        current ← A[i]; c ← c + 1
    else
        if A[i] = current then
            c ← c + 1
        else
            c ← c − 1
        end
    end
end
return current

Analysis of Majority Algorithm

Observations
1. The algorithm maintains only two variables: c and current.
2. Correctness: Each non-majority element can 'kill' at most one majority element.

Claim
By performing a single pass, using only O(1) additional space, we can report the majority element of A (if it exists).

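The single-pass algorithm above can be sketched in Python as follows. The function names `majority_candidate` and `is_majority` are illustrative, not from the slides, and the second function is an optional verification pass that only matters when the input is not guaranteed to contain a majority element.

```python
def majority_candidate(stream):
    """Single pass over the stream using two variables: current and c."""
    current = None
    c = 0
    for item in stream:
        if c == 0:
            # No surviving candidate: adopt the incoming element.
            current = item
            c = 1
        elif item == current:
            c += 1
        else:
            # A different element 'kills' one copy of the candidate.
            c -= 1
    return current


def is_majority(items, candidate):
    """Optional second pass: does candidate occur more than n/2 times?"""
    return 2 * sum(1 for item in items if item == candidate) > len(items)


votes = ["a", "b", "a", "c", "a", "a", "b"]
winner = majority_candidate(votes)
print(winner, is_majority(votes, winner))  # a True
```

When no majority element exists, the returned candidate is arbitrary, which is why the claim is qualified with "if it exists".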
Misra & Gries [82] Algorithm

Finding Heavy Hitters
Input: A stream of n elements and a fixed integer k < n.
Output: Report all elements that occur ≥ n/k times.

1. Initialize k bins, each with a null element and a counter set to 0.
2. For each element x in the stream do
       if x is in some bin b then
           increment bin b's counter
       else if some bin has counter 0 then
           assign x to this bin and set its counter to 1
       else
           decrement the counter of every bin
3. Output the elements in the bins.

Analysis of Misra and Gries Algorithm

Claim
Let f*_x = frequency of x in the stream.
Each heavy hitter x is in one of the bins with counter value ≥ f*_x − n/k.

Running Time
- Initializing k bins: O(k) time.
- Processing each element requires looking at O(k) bins.
- Total running time = O(nk).

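A minimal Python sketch of the bin-based procedure, assuming the bins are kept in a dictionary so that "a bin whose counter is 0" is simply a free slot. The name `misra_gries` and the dictionary representation are choices made for this illustration; the slides describe the bins only abstractly.

```python
def misra_gries(stream, k):
    """Return the surviving bins as {element: counter}, with at most k entries."""
    bins = {}
    for x in stream:
        if x in bins:
            bins[x] += 1
        elif len(bins) < k:
            # A free bin (counter 0) exists: assign x to it with counter 1.
            bins[x] = 1
        else:
            # All k bins are occupied: decrement every counter and
            # free the bins whose counter reaches 0. x itself is dropped.
            for key in list(bins):
                bins[key] -= 1
                if bins[key] == 0:
                    del bins[key]
    return bins


stream = [1, 1, 1, 2, 3, 1, 4, 1, 5, 1]   # n = 10, element 1 occurs 6 times
print(misra_gries(stream, k=2))            # {1: 4}
```

Here n/k = 5, so the claim guarantees a counter of at least 6 − 5 = 1 for element 1; the run above ends with counter 4. The bins can also hold elements that are not heavy hitters, so the output is a candidate set.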
Generalize More

For a data stream, using very little space, we want to report:
1. All the elements that occur frequently, e.g., those that make up at least 2% of the stream.
2. For each element, its (approximate) frequency.

Count-Min Sketch Data Structure

Input: An array (stream) A consisting of n numbers, and r hash functions h_1, ..., h_r, where h_i : N → {1, ..., b}
Output: CMS[·, ·] table consisting of r rows and b columns

for i = 1 to r do
    for j = 1 to b do
        CMS[i, j] ← 0
    end
end
for i = 1 to n do
    for j = 1 to r do
        CMS[j, h_j(A[i])] ← CMS[j, h_j(A[i])] + 1
    end
end
return CMS[·, ·]

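The construction above could look roughly like this in Python. The slides take the hash functions h_1, ..., h_r as given; the helper `make_hash_functions` below is an assumption for illustration only, using the common form ((a·x + c) mod p) mod b with random a and c. Columns are numbered 0 to b − 1 here rather than 1 to b.

```python
import random


def make_hash_functions(r, b, seed=0):
    """Hypothetical helper: r functions mapping hashable items to 0..b-1."""
    p = (1 << 61) - 1                      # a large prime modulus
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(r)]

    def make(a, c):
        return lambda x: ((a * hash(x) + c) % p) % b

    return [make(a, c) for a, c in coeffs]


def build_cms(stream, hash_functions, b):
    """Build the r x b table: one increment per row for every stream item."""
    cms = [[0] * b for _ in hash_functions]     # initialization, O(br)
    for item in stream:                         # n items
        for j, h in enumerate(hash_functions):  # r updates per item
            cms[j][h(item)] += 1
    return cms


hs = make_hash_functions(r=3, b=10)
table = build_cms([5, 7, 7, 9], hs, b=10)
print([sum(row) for row in table])   # [4, 4, 4]
```

Each row receives exactly one increment per item, so every row sums to n regardless of which hash functions are used.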
Updating CMS table

An example with b = 10 and r = 3; assume that the stream is A = xyy.

After initialization:
      1  2  3  4  5  6  7  8  9 10
  1   0  0  0  0  0  0  0  0  0  0
  2   0  0  0  0  0  0  0  0  0  0
  3   0  0  0  0  0  0  0  0  0  0

Execution of Algorithm

An example with b = 10 and r = 3; assume that the stream is A = xyy.
Assume the following h-values for x and y:
For x: h_1(x) = 3, h_2(x) = 8, and h_3(x) = 5
For y: h_1(y) = 6, h_2(y) = 8, and h_3(y) = 1

(Blank CMS table with rows 1 to 3 and columns 1 to 10.)

Updating CMS table

Insertion of x: h_1(x) = 3, h_2(x) = 8, and h_3(x) = 5:
      1  2  3  4  5  6  7  8  9 10
  1   0  0  0  0  0  0  0  0  0  0
  2   0  0  0  0  0  0  0  0  0  0
  3   0  0  0  0  0  0  0  0  0  0

After inserting x:
      1  2  3  4  5  6  7  8  9 10
  1   0  0  1  0  0  0  0  0  0  0
  2   0  0  0  0  0  0  0  1  0  0
  3   0  0  0  0  1  0  0  0  0  0

Updating CMS table

Insertion of 1st y: h_1(y) = 6, h_2(y) = 8, and h_3(y) = 1, so y hashes to locations 6, 8, and 1:
      1  2  3  4  5  6  7  8  9 10
  1   0  0  1  0  0  0  0  0  0  0
  2   0  0  0  0  0  0  0  1  0  0
  3   0  0  0  0  1  0  0  0  0  0

After inserting 1st y:
      1  2  3  4  5  6  7  8  9 10
  1   0  0  1  0  0  1  0  0  0  0
  2   0  0  0  0  0  0  0  2  0  0
  3   1  0  0  0  1  0  0  0  0  0

Updating CMS table

Insertion of 2nd y (hashes to the same locations 6, 8, and 1):
      1  2  3  4  5  6  7  8  9 10
  1   0  0  1  0  0  1  0  0  0  0
  2   0  0  0  0  0  0  0  2  0  0
  3   1  0  0  0  1  0  0  0  0  0

After inserting 2nd y:
      1  2  3  4  5  6  7  8  9 10
  1   0  0  1  0  0  2  0  0  0  0
  2   0  0  0  0  0  0  0  3  0  0
  3   2  0  0  0  1  0  0  0  0  0

Observations on CMS Table Entries

Let n = total number of items in the stream.
Let f*_x = true frequency of x in the stream.
Let f_x = min{CMS[1, h_1(x)], ..., CMS[r, h_r(x)]}. This is the estimate of the frequency of x that we report.

1. The size of the CMS table (= br) is independent of n.
2. The CMS table can be computed in O(br + nr) time.
3. For any x ∈ A and any j = 1, ..., r: CMS[j, h_j(x)] ≥ f*_x, since every occurrence of x increments that entry and entries are never decremented.
4. Therefore, f_x ≥ f*_x (i.e., f_x never underestimates the true frequency).

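To connect the estimate f_x with the worked example (b = 10, r = 3, stream A = xyy), here is a small Python sketch that hard-codes the hash values assumed on the slides. The names `H`, `build`, and `estimate` are illustrative. Columns are stored 0-based, so each slide column is shifted by one.

```python
# Hash values assumed in the worked example (1-based columns, as on the slides).
H = {
    "x": [3, 8, 5],   # h_1(x), h_2(x), h_3(x)
    "y": [6, 8, 1],   # h_1(y), h_2(y), h_3(y)
}
R, B = 3, 10


def build(stream):
    """Fill the R x B table for a stream over the items x and y."""
    cms = [[0] * B for _ in range(R)]
    for item in stream:
        for j in range(R):
            cms[j][H[item][j] - 1] += 1   # shift to a 0-based column
    return cms


def estimate(cms, item):
    """f_item = minimum over the rows of the entry that item hashes to."""
    return min(cms[j][H[item][j] - 1] for j in range(R))


cms = build("xyy")
print(estimate(cms, "x"), estimate(cms, "y"))   # 1 2
```

In row 2 both x and y hash to column 8, so that entry holds 3 and overestimates both items. Taking the minimum over the rows recovers the true frequencies 1 and 2 in this instance.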
Assume - Proof comes later

Claim
Let b = 2/ε. Then Pr[f_x − f*_x ≥ εn] ≤ 1/2^r.

Corollary
With probability at least 1 − 1/2^r, f*_x ≤ f_x ≤ f*_x + εn.

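The Claim translates directly into table dimensions. For example, ε = 0.01 gives b = 200 columns, and r = 10 rows bound the failure probability by 1/2^10 = 1/1024. The sketch below assumes a target failure probability `delta`, a parameter that does not appear on the slides, and picks the smallest r with 1/2^r ≤ delta. The function name `cms_dimensions` is likewise illustrative.

```python
import math


def cms_dimensions(epsilon, delta):
    """Columns b and rows r so that Pr[f_x - f*_x >= epsilon * n] <= delta."""
    b = math.ceil(2 / epsilon)            # b = 2/epsilon from the Claim
    r = math.ceil(math.log2(1 / delta))   # smallest r with 1/2^r <= delta
    return b, r


b, r = cms_dimensions(epsilon=0.25, delta=1 / 8)
print(b, r, b * r)   # 8 3 24
```

The total table size b·r depends only on ε and the chosen failure probability, not on the stream length n.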
Reporting Frequent Elements

- Suppose we want to report all the elements of A that occur approximately ≥ n/k times for some integer k.
- In the Claim, set ε = 1/(3k). Then b = 2/ε = 6k.
- Construct a CMS table of size br = 6kr.
- Scan A and compute the entries in the CMS table.
- Maintain a set of O(k) items that occur most frequently among all the elements of A scanned so far. How?

Heap Data Structure

The items are stored in a HEAP with f_x values as the key.

What is a Heap?
An array that stores n elements and supports:
- Find Max or Min: Report the element with the largest/smallest key value in the heap in O(1) time.
- Insert(x, k): Insert element x with key k into the heap in O(log n) time.
- Delete(x): Delete element x from the heap in O(log n) time.
- ...

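As a concrete reference point, Python's standard `heapq` module implements a min-heap on top of a plain list. The sketch below shows Insert and Find Min with (key, item) pairs. Note that `heapq` offers no direct Delete(x) for an arbitrary element, so code built on it typically marks outdated entries and skips them when they reach the top; that workaround is an implementation choice, not something the slides prescribe.

```python
import heapq

heap = []
heapq.heappush(heap, (7, "x"))   # Insert(x, 7)
heapq.heappush(heap, (2, "y"))   # Insert(y, 2)
heapq.heappush(heap, (5, "z"))   # Insert(z, 5)

smallest = heap[0]               # Find Min in O(1): the root of the heap
popped = heapq.heappop(heap)     # remove the minimum in O(log n)
print(smallest, popped, heap[0]) # (2, 'y') (2, 'y') (5, 'z')
```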
Reporting Frequent Elements contd.

Assume we have scanned i − 1 items and have updated the CMS table and the heap.
Consider the i-th item (say x = A[i]) and perform the following:
1. For j = 1 to r: update the CMS table by executing CMS[j, h_j(x)] ← CMS[j, h_j(x)] + 1.
2. Let f_x = min{CMS[1, h_1(x)], ..., CMS[r, h_r(x)]}. If f_x ≥ i/k, do:
   1. If x ∈ heap, delete x and re-insert it with the updated f_x value.
   2. If x ∉ heap, insert it into the heap and remove all the elements whose count is less than i/k.

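The per-item procedure above can be sketched in Python as follows. Several details are assumptions made for this illustration rather than part of the slides: the function name `heavy_hitters`, the default r = 5, the hash functions of the form ((a·x + c) mod p) mod b, and the use of lazy deletion (a dictionary `tracked` holds the current key of each item, and outdated heap entries are skipped when popped) in place of a heap with a true Delete(x). Items are assumed to be mutually comparable, for example all strings, because ties on the key fall back to comparing the items.

```python
import heapq
import random


def heavy_hitters(stream, k, r=5, seed=0):
    """Return {item: f_item} for the items tracked at the end of the scan."""
    b = 6 * k                              # epsilon = 1/(3k), so b = 2/epsilon = 6k
    p = (1 << 61) - 1
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(r)]
    table = [[0] * b for _ in range(r)]

    tracked = {}   # item -> its current key f_item in the heap
    heap = []      # (key, item) pairs; may contain outdated pairs

    for i, x in enumerate(stream, start=1):
        # Step 1: update the CMS table.
        cols = [((a * hash(x) + c) % p) % b for a, c in coeffs]
        for j, col in enumerate(cols):
            table[j][col] += 1
        # Step 2: compute the estimate f_x.
        f_x = min(table[j][col] for j, col in enumerate(cols))

        if f_x * k >= i:                   # f_x >= i/k, in integer arithmetic
            is_new = x not in tracked
            tracked[x] = f_x               # re-insert with the updated key
            heapq.heappush(heap, (f_x, x))
            if is_new:
                # Remove all tracked items whose key is below i/k.
                while heap and heap[0][0] * k < i:
                    f, y = heapq.heappop(heap)
                    if tracked.get(y) == f:    # ignore outdated pairs
                        del tracked[y]
    return tracked


rng = random.Random(1)
stream = ["a"] * 50 + ["u%d" % j for j in range(50)]
rng.shuffle(stream)
result = heavy_hitters(stream, k=4)
print("a" in result, result["a"] >= 50)    # True True
```

Because f_x never underestimates, an item that truly occurs at least n/k times satisfies f_x ≥ i/k at its last occurrence and is never removed afterwards, so it is always among the tracked items. The tracked set may also contain items whose estimate has since dropped below n/k, since removal only happens when a new item is inserted.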