CS 498ABD: Algorithms for Big Data, Spring 2019
CountMin and Count Sketches
Lecture 10, February 14, 2019
Chandra (UIUC)

Heavy Hitters Problem
Heavy Hitters Problem: find all items $i$ such that $f_i > m/k$ for some fixed $k$. Heavy hitters are very frequent items.
We saw the deterministic Misra-Gries algorithm, which uses $O(k)$ space and finds the heavy hitters assuming they exist.
A two-pass algorithm correctly identifies the heavy hitters.

(Strict) Turnstile Model
Turnstile model: each update is $(i_j, \Delta_j)$, where $\Delta_j$ can be positive or negative.
Strict turnstile: we need $x_i \ge 0$ at all times for all $i$.
For frequent items, we want to estimate each $x_i$ up to an additive error.

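To make the model concrete, here is a minimal sketch that replays a list of updates and reports whether the stream stayed strict. The function name `apply_updates` and the example streams are illustrative assumptions, not part of the lecture.

```python
from collections import defaultdict

def apply_updates(updates):
    """Replay turnstile updates (i, delta); return the final vector x and
    whether x_i >= 0 held after every update (the strict turnstile condition)."""
    x = defaultdict(int)
    strict = True
    for i, delta in updates:
        x[i] += delta
        if x[i] < 0:
            strict = False  # some coordinate went negative at this point in time
    return dict(x), strict

# Deletions are allowed as long as no coordinate ever drops below zero.
ok_stream = [(1, 3), (2, 1), (1, -2)]
# Here item 1 is deleted before it is inserted, so the stream is not strict.
bad_stream = [(1, -1), (1, 2)]
```

Note that the second stream ends with a nonnegative vector but still violates the strict condition, because the condition must hold at all times.
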
Basic Hashing/Sampling Idea
Heavy Hitters Problem: find all items $i$ such that $f_i > m/k$.
Let $b_1, b_2, \ldots, b_k$ be the $k$ heavy hitters.
Suppose we pick $h : [n] \to [ck]$ for some $c > 1$.
$h$ spreads $b_1, \ldots, b_k$ among the buckets ($k$ balls into $ck$ bins).
In the ideal situation, each bucket can be used to count a separate heavy hitter.

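A short calculation suggests why a constant $c > 1$ already helps. It assumes, which the slide does not state, that $h$ is drawn from a pairwise independent family, so any two distinct items collide with probability $1/(ck)$. Under that assumption, a union bound over the other $k - 1$ heavy hitters gives:

```latex
\[
\Pr\bigl[\exists\, j' \ne j : h(b_{j'}) = h(b_j)\bigr]
  \;\le\; \sum_{j' \ne j} \Pr\bigl[h(b_{j'}) = h(b_j)\bigr]
  \;=\; \frac{k-1}{ck}
  \;<\; \frac{1}{c}.
\]
```

So a fixed heavy hitter gets a bucket free of other heavy hitters with probability more than $1 - 1/c$.
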
Part I: CountMin Sketch

CountMin Sketch [Cormode-Muthukrishnan]
CountMin-Sketch($w$, $d$):
    $h_1, h_2, \ldots, h_d$ are pairwise independent hash functions from $[n] \to [w]$.
    While (stream is not empty) do
        $e_t = (i_t, \Delta_t)$ is the current item
        for $\ell = 1$ to $d$ do
            $C[\ell, h_\ell(i_t)] \leftarrow C[\ell, h_\ell(i_t)] + \Delta_t$
    endWhile
    For $i \in [n]$ set $\tilde{x}_i = \min_{\ell = 1}^{d} C[\ell, h_\ell(i)]$.
Counter $C[\ell, j]$ simply holds the sum of all $x_i$ such that $h_\ell(i) = j$. That is,
$$C[\ell, j] = \sum_{i : h_\ell(i) = j} x_i.$$

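The pseudocode above can be sketched in Python roughly as follows. The class name, the seed parameter, and the hash family $h(i) = ((a i + b) \bmod p) \bmod w$ with $p = 2^{31} - 1$ are assumptions chosen for illustration; that family is a common stand-in that is only approximately pairwise independent, and it assumes item identifiers are integers smaller than $p$.

```python
import random

class CountMinSketch:
    """d rows of w counters; row l uses its own hash h_l : [n] -> [w]."""

    PRIME = 2_147_483_647  # 2^31 - 1, assumed larger than the universe size n

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        # One (a, b) pair per row defines h_l(i) = ((a*i + b) mod p) mod w.
        self.params = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(d)]
        self.C = [[0] * w for _ in range(d)]

    def _h(self, l, i):
        a, b = self.params[l]
        return ((a * i + b) % self.PRIME) % self.w

    def update(self, i, delta):
        """Process one stream update (i_t, Delta_t): add delta in every row."""
        for l in range(self.d):
            self.C[l][self._h(l, i)] += delta

    def estimate(self, i):
        """Return the minimum over rows of the counter that i hashes to."""
        return min(self.C[l][self._h(l, i)] for l in range(self.d))


cms = CountMinSketch(w=20, d=5, seed=1)
stream = [(3, 4), (8, 2), (3, -1), (15, 7), (8, 1)]  # strict turnstile updates
for item, delta in stream:
    cms.update(item, delta)
# True frequencies are x_3 = 3, x_8 = 3, x_15 = 7, so ||x||_1 = 13.
```

In the strict turnstile model every estimate is at least the true frequency and at most $\|x\|_1$, regardless of which hash functions were drawn; only the tightness of the estimate is probabilistic.
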
Intuition
Suppose there are $k$ heavy hitters $b_1, b_2, \ldots, b_k$.
Consider $b_i$: hash function $h_\ell$ sends $b_i$ to bucket $h_\ell(b_i)$.
$C[\ell, h_\ell(b_i)]$ counts $x_{b_i}$ and also the other items that hash to the same bucket $h_\ell(b_i)$, so we always overcount (since we are in the strict turnstile model).
Repeating with many hash functions and taking the minimum is the right thing to do: for $b_i$ the goal is to avoid other heavy hitters colliding with it.

Property of CountMin Sketch
Lemma: Let $d = \Omega(\log \frac{1}{\delta})$ and $w > \frac{2}{\epsilon}$. Then for any fixed $i \in [n]$, $x_i \le \tilde{x}_i$ and
$$\Pr[\tilde{x}_i \ge x_i + \epsilon \|x\|_1] \le \delta.$$
Unlike Misra-Gries, we have overestimates.
Actual items are not stored (recovering the heavy hitters requires extra work).
Works in the strict turnstile model and hence can handle deletions.
Space usage is $O\left(\frac{\log(1/\delta)}{\epsilon}\right)$ counters and hence $O\left(\frac{\log(1/\delta)}{\epsilon} \log m\right)$ bits.

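As a rough illustration of the space bound, the helper below picks concrete dimensions for given $\epsilon$ and $\delta$. The function name is hypothetical, and the choice $d = \lceil \log_2(1/\delta) \rceil$ is one way to instantiate $d = \Omega(\log \frac{1}{\delta})$, assuming each row fails with probability at most $1/2$.

```python
import math

def countmin_dimensions(eps, delta):
    """Return (w, d) with w > 2/eps and 2^(-d) <= delta."""
    w = math.floor(2 / eps) + 1           # smallest integer strictly above 2/eps
    d = math.ceil(math.log2(1 / delta))   # enough rows so that (1/2)^d <= delta
    return w, d

w, d = countmin_dimensions(eps=0.125, delta=0.01)
counters = w * d  # total number of counters in the sketch
```

With these parameters the sketch uses 17 columns and 7 rows, independent of the universe size $n$.
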
Analysis
Fix $\ell$: $h_\ell(i)$ is the bucket that $h_\ell$ hashes $i$ to.
$Z_\ell = C[\ell, h_\ell(i)]$ is the value of the counter that $i$ is hashed to.
$$E[Z_\ell] = x_i + \sum_{i' \ne i} \Pr[h_\ell(i') = h_\ell(i)] \, x_{i'}$$
By pairwise independence,
$$E[Z_\ell] = x_i + \sum_{i' \ne i} x_{i'} / w \le x_i + \epsilon \|x\|_1 / 2.$$
Via Markov applied to $Z_\ell - x_i$ (we use strict turnstile here, so that $Z_\ell - x_i \ge 0$),
$$\Pr[Z_\ell \ge x_i + \epsilon \|x\|_1] \le 1/2.$$
Since the $d$ hash functions are independent,
$$\Pr[\min_\ell Z_\ell \ge x_i + \epsilon \|x\|_1] \le 1/2^d \le \delta.$$

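The Markov step and the final product can be spelled out as follows. This only expands the bounds stated on the slide, using $E[Z_\ell] - x_i \le \epsilon \|x\|_1 / 2$ and the independence of the $d$ rows.

```latex
\[
\Pr\bigl[Z_\ell - x_i \ge \epsilon \|x\|_1\bigr]
  \;\le\; \frac{E[Z_\ell - x_i]}{\epsilon \|x\|_1}
  \;\le\; \frac{\epsilon \|x\|_1 / 2}{\epsilon \|x\|_1}
  \;=\; \frac{1}{2},
\]
\[
\Pr\bigl[\min_{\ell} Z_\ell \ge x_i + \epsilon \|x\|_1\bigr]
  \;=\; \prod_{\ell = 1}^{d} \Pr\bigl[Z_\ell \ge x_i + \epsilon \|x\|_1\bigr]
  \;\le\; 2^{-d}.
\]
```

The minimum is large only if every row is large, which is why the failure probabilities multiply.
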
Summarizing
Lemma: Let $d = \Omega(\log \frac{1}{\delta})$ and $w > \frac{2}{\epsilon}$. Then for any fixed $i \in [n]$, $x_i \le \tilde{x}_i$ and
$$\Pr[\tilde{x}_i \ge x_i + \epsilon \|x\|_1] \le \delta.$$
Choose $d = 2 \log_2 n$ and $w = 2/\epsilon$: since each row fails with probability at most $1/2$, we have $\Pr[\tilde{x}_i \ge x_i + \epsilon \|x\|_1] \le 2^{-d} = 1/n^2$.
By the union bound, with probability at least $1 - 1/n$, for all $i \in [n]$, $\tilde{x}_i \le x_i + \epsilon \|x\|_1$.
Total space is $O(\frac{1}{\epsilon} \log n)$ counters and hence $O(\frac{1}{\epsilon} \log n \log m)$ bits.

CountMin as a Linear Sketch
Question: Why is CountMin a linear sketch?
Recall that for $1 \le \ell \le d$ and $1 \le s \le w$:
$$C[\ell, s] = \sum_{i : h_\ell(i) = s} x_i$$
Thus, once hash function $h_\ell$ is fixed, $C[\ell, s] = \langle u, x \rangle$, where $u$ is a row vector in $\{0, 1\}^n$ such that $u_i = 1$ if $h_\ell(i) = s$ and $u_i = 0$ otherwise.
Thus, once the hash functions are fixed, the counter values can be written as $Mx$, where $M \in \{0, 1\}^{wd \times n}$ is the sketch matrix.

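The matrix view can be checked on a toy instance. The sketch below builds $M$ explicitly and verifies linearity; the helper names are hypothetical, and the hash values are drawn fully at random as a stand-in for the pairwise independent $h_\ell$ (an assumption made only to keep the example short).

```python
import random

def sketch_matrix(n, w, d, seed=0):
    """Return M in {0,1}^(wd x n); row l*w + s is the indicator of {i : h_l(i) = s}."""
    rng = random.Random(seed)
    M = [[0] * n for _ in range(w * d)]
    for l in range(d):
        for i in range(n):
            s = rng.randrange(w)  # stand-in for h_l(i), fixed once M is built
            M[l * w + s][i] = 1
    return M

def apply_sketch(M, x):
    """Compute the matrix-vector product Mx, i.e. all wd counter values."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

n, w, d = 10, 4, 3
M = sketch_matrix(n, w, d, seed=7)
x = [2, 0, 5, 1, 0, 0, 3, 0, 0, 4]
y = [1, 1, 0, 0, 2, 0, 0, 0, 6, 0]

sketch_of_sum = apply_sketch(M, [a + b for a, b in zip(x, y)])
sum_of_sketches = [a + b for a, b in zip(apply_sketch(M, x), apply_sketch(M, y))]
```

Because the sketch is linear, the sketch of $x + y$ equals the sum of the two sketches, which is what lets sketches of separate streams be merged when they share hash functions.
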
Part II: Count Sketch

Count Sketch [Charikar-Chen-Farach-Colton]
Count-Sketch($w$, $d$):
    $h_1, h_2, \ldots, h_d$ are pairwise independent hash functions from $[n] \to [w]$.
    $g_1, g_2, \ldots, g_d$ are pairwise independent hash functions from $[n] \to \{-1, 1\}$.
    While (stream is not empty) do
        $e_t = (i_t, \Delta_t)$ is the current item
        for $\ell = 1$ to $d$ do
            $C[\ell, h_\ell(i_t)] \leftarrow C[\ell, h_\ell(i_t)] + g_\ell(i_t) \Delta_t$
    endWhile
    For $i \in [n]$ set $\tilde{x}_i = \mathrm{median}\{g_1(i) C[1, h_1(i)], \ldots, g_d(i) C[d, h_d(i)]\}$.
Like CountMin, Count Sketch has $wd$ counters. Now counter values can become negative even if $x$ is nonnegative.

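A minimal Python sketch of the pseudocode above follows. As an illustrative assumption, both the bucket hashes $h_\ell$ and the sign hashes $g_\ell$ use the family $((a i + b) \bmod p)$ with $p = 2^{31} - 1$, reduced mod $w$ or mod 2 respectively; this is only approximately pairwise independent, and the class and parameter names are not from the lecture.

```python
import random
import statistics

class CountSketch:
    """d rows of w signed counters with bucket hashes h_l and sign hashes g_l."""

    PRIME = 2_147_483_647  # 2^31 - 1, assumed larger than the universe size n

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        draw = lambda: (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
        self.h_params = [draw() for _ in range(d)]  # defines h_l : [n] -> [w]
        self.g_params = [draw() for _ in range(d)]  # defines g_l : [n] -> {-1, 1}
        self.C = [[0] * w for _ in range(d)]

    def _h(self, l, i):
        a, b = self.h_params[l]
        return ((a * i + b) % self.PRIME) % self.w

    def _g(self, l, i):
        a, b = self.g_params[l]
        return 1 if ((a * i + b) % self.PRIME) % 2 == 0 else -1

    def update(self, i, delta):
        """Process one update (i_t, Delta_t): add g_l(i) * delta in every row."""
        for l in range(self.d):
            self.C[l][self._h(l, i)] += self._g(l, i) * delta

    def estimate(self, i):
        """Return the median over rows of g_l(i) * C[l, h_l(i)]."""
        return statistics.median(
            self._g(l, i) * self.C[l][self._h(l, i)] for l in range(self.d))


cs = CountSketch(w=16, d=5, seed=3)
cs.update(7, 5)
cs.update(7, -2)   # x_7 is now 3
cs.update(9, 4)
cs.update(9, -4)   # item 9 is fully deleted, so it contributes nothing
```

Using an odd $d$ keeps the median equal to one of the row estimates. When item 7 is the only item with nonzero frequency, every row returns $g_\ell(7)^2 x_7 = x_7$, so the estimate is exact.
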
Intuition
Each hash function $h_\ell$ spreads the elements across $w$ buckets.
The hash function $g_\ell$ induces cancellations (inspired by the $F_2$ estimation algorithm).
Since a row's estimate may be negative even if $x \ge 0$, we take the median rather than the minimum.
Exercise: Show that Count Sketch is also a linear sketch.

Count Sketch Analysis
Lemma: Let $d \ge 4 \log \frac{1}{\delta}$ and $w > \frac{3}{\epsilon^2}$. Then for any fixed $i \in [n]$, $E[\tilde{x}_i] = x_i$ and
$$\Pr[|\tilde{x}_i - x_i| \ge \epsilon \|x\|_2] \le \delta.$$
Comparison to CountMin:
The error guarantee is with respect to $\|x\|_2$ instead of $\|x\|_1$. For $x \ge 0$, $\|x\|_2 \le \|x\|_1$, and in some cases $\|x\|_2 \ll \|x\|_1$.
Space increases to $O(\frac{1}{\epsilon^2} \log n)$ counters from $O(\frac{1}{\epsilon} \log n)$ counters.

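Two extreme vectors show how far apart the two norms can be. The vectors and helper names below are made up for illustration.

```python
import math

def l1_norm(x):
    return sum(abs(v) for v in x)

def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))

flat = [1] * 10_000              # mass spread evenly over many items
spiky = [10_000] + [0] * 9_999   # all mass on a single item

flat_norms = (l1_norm(flat), l2_norm(flat))
spiky_norms = (l1_norm(spiky), l2_norm(spiky))
```

For the flat vector $\|x\|_2 = \sqrt{n}$ while $\|x\|_1 = n$, so the Count Sketch error bound $\epsilon \|x\|_2$ is much smaller than the CountMin bound $\epsilon \|x\|_1$; for the spiky vector the two norms coincide.
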
Analysis
Fix an $i \in [n]$. Let $Z_\ell = g_\ell(i) C[\ell, h_\ell(i)]$.
For $i' \in [n]$ let $Y_{i'}$ be the indicator random variable that is 1 if $h_\ell(i) = h_\ell(i')$; that is, $i$ and $i'$ collide in $h_\ell$.
$E[Y_{i'}] = E[Y_{i'}^2] = 1/w$ for $i' \ne i$, from pairwise independence of $h_\ell$ (and $Y_i = 1$).
$$Z_\ell = g_\ell(i) C[\ell, h_\ell(i)] = g_\ell(i) \sum_{i'} g_\ell(i') x_{i'} Y_{i'}$$

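From this expression the unbiasedness of a single row follows in one step. The derivation below assumes that $g_\ell$ is chosen independently of $h_\ell$ and that each $g_\ell(i')$ is uniform on $\{-1, 1\}$, so that $E[g_\ell(i) g_\ell(i')] = 0$ for $i' \ne i$ by pairwise independence; it also uses $g_\ell(i)^2 = 1$ and $Y_i = 1$.

```latex
\[
E[Z_\ell]
  \;=\; E\bigl[g_\ell(i)^2\, x_i\, Y_i\bigr]
        + \sum_{i' \ne i} E\bigl[g_\ell(i)\, g_\ell(i')\bigr]\, E[Y_{i'}]\, x_{i'}
  \;=\; x_i + \sum_{i' \ne i} 0 \cdot \frac{1}{w} \cdot x_{i'}
  \;=\; x_i .
\]
```

The colliding items therefore cancel in expectation, in contrast to CountMin, where they always add a nonnegative bias.
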