Sublinear Algorithms
LECTURE 9
Last time
• Approximate counting
• Estimation of the 2nd moment
• Linear sketching
Today
• Multipurpose sketches
• Count-min and count-sketch
• Range queries, heavy hitters, quantiles
10/1/2020
Sofya Raskhodnikova; Boston University
Multipurpose Sketches: Problems
Input: a stream $(a_1, a_2, \dots, a_m) \in [n]^m$
• The frequency vector of the stream is $f = (f_1, \dots, f_n)$, where $f_i$ is the number of times $i$ appears in the stream
Goal: to maintain data structures that can answer the following queries
• Point Query: For $i \in [n]$, estimate $f_i$
• Range Query: For $i, j \in [n]$, estimate $f_i + f_{i+1} + \dots + f_j$
• Quantile Query: For $\phi \in [0, 1]$, find $j$ with $f_1 + \dots + f_j \approx \phi m$
• Heavy Hitters Query: For $\phi \in [0, 1]$, find all $i$ with $f_i \ge \phi m$
Desired accuracy: $\pm \varepsilon m$ with error probability $\delta$
Based on Andrew McGregor’s slides: https://people.cs.umass.edu/~mcgregor/711S18/vectors-3.pdf
Initial Solution to Point Queries
• We could maintain the whole frequency vector $f_1, \dots, f_n$ (this takes $n$ counters)
• Then, on query $i$, we can output $f_i$
Idea: Group counts for some numbers together
[Figure: the entries of the frequency vector grouped into buckets with counters $c_1, c_2, c_3, \dots, c_b$]
If $i$ falls into bucket $j$, then $f_i \le c_j$.
Point Query Algorithm (initial version)
1. Sample a hash function $h : [n] \to [b]$ from a 2-wise independent family
2. Initialize counters $c_1, \dots, c_b$ to 0
3. For each element $a$ of the stream, increment $c_{h(a)}$ by 1
4. To answer a point query $i$, return $c_{h(i)}$
The answer never underestimates: $c_{h(i)} \ge f_i$.
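The initial point query algorithm can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the class name `SingleRowSketch`, the prime `P`, and the hash family $h(x) = ((\alpha x + \beta) \bmod P) \bmod b$ are illustrative choices that are not in the slides, and this family is only approximately 2-wise independent after the final reduction mod $b$.

```python
import random
from collections import Counter

P = (1 << 61) - 1  # a Mersenne prime, assumed larger than the universe size n


class SingleRowSketch:
    """One row of b counters; a point query returns the counter of the item's bucket."""

    def __init__(self, b, seed=None):
        rng = random.Random(seed)
        self.b = b
        # Hypothetical hash family: h(x) = ((alpha*x + beta) mod P) mod b
        self.alpha = rng.randrange(1, P)
        self.beta = rng.randrange(P)
        self.c = [0] * b  # counters c_1..c_b, stored 0-indexed

    def _h(self, x):
        return ((self.alpha * x + self.beta) % P) % self.b

    def update(self, a):
        # Step 3: increment the counter of the bucket that a hashes to
        self.c[self._h(a)] += 1

    def query(self, i):
        # Step 4: return the counter of i's bucket (never below the true count)
        return self.c[self._h(i)]


stream = [3, 7, 3, 1, 3, 7, 5]
sketch = SingleRowSketch(b=4, seed=0)
for a in stream:
    sketch.update(a)
true_counts = Counter(stream)
for i in sorted(true_counts):
    print(i, true_counts[i], sketch.query(i))
```

Every item in a bucket contributes to the same counter, so the printed estimate is at least the true count, and the counters always sum to the stream length.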
Initial Solution to Point Queries: Analysis
Recall the initial point query algorithm: $h : [n] \to [b]$ is sampled from a 2-wise independent family, each stream element $a$ increments $c_{h(a)}$, and a point query $i$ returns $c_{h(i)}$, which never underestimates $f_i$.
• Fix $i^* \in [n]$.
• Let $Z = c_{h(i^*)} - f_{i^*}$ be the overestimation error.
• For all $i \ne i^*$, let $X_i = 1$ if $h(i) = h(i^*)$ and $X_i = 0$ otherwise. By 2-wise independence,
  $\mathbb{E}[X_i] = \Pr[h(i) = h(i^*)] = \frac{1}{b}$
• Then $Z = \sum_{i \ne i^*} X_i \cdot f_i$, and by linearity of expectation,
  $\mathbb{E}[Z] = \sum_{i \ne i^*} \mathbb{E}[X_i] \cdot f_i = \sum_{i \ne i^*} \frac{f_i}{b} \le \frac{m}{b}$
• By Markov’s inequality, if $b = 2/\varepsilon$ then
  $\Pr[Z \ge \varepsilon m] \le \frac{\mathbb{E}[Z]}{\varepsilon m} \le \frac{1}{\varepsilon b} \le \frac{1}{2}$
Count-Min Sketch [Cormode Muthukrishnan 03]
Point Query Algorithm
1. Set $t = \log_2(1/\delta)$ and $b = 2/\varepsilon$
2. Sample $t$ hash functions $h_j : [n] \to [b]$ from a 2-wise independent family
3. Initialize $tb$ counters $c_{j,k}$ to 0
4. For each element $a$ of the stream and each $j \in [t]$, increment $c_{j,h_j(a)}$ by 1
5. To answer a point query $i$, return $\tilde{f}_i = \min_{j \in [t]} c_{j,h_j(i)}$
The answer never underestimates: $\tilde{f}_i \ge f_i$.
• Correctness: since the hash functions are chosen independently,
  $\Pr[f_i \le \tilde{f}_i \le f_i + \varepsilon m] = 1 - \Pr[\text{all } t \text{ hash functions overestimate by more than } \varepsilon m] \ge 1 - \left(\frac{1}{2}\right)^t = 1 - \delta$
• Space: $O(t(\log n + \log b))$ for the hash functions + $O(tb \log m)$ for the counters
  Total: $O\left(\left(\log n + \frac{1}{\varepsilon} \log m\right) \log \frac{1}{\delta}\right)$
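The five steps above translate directly into code. The following is a minimal sketch, assuming the same illustrative hash family as a stand-in for a 2-wise independent one; the class name `CountMin`, the prime `P`, and the rounding of $t$ and $b$ up to integers are assumptions, not part of the slides.

```python
import math
import random
from collections import Counter

P = (1 << 61) - 1  # a Mersenne prime, assumed larger than the universe size n


class CountMin:
    def __init__(self, eps, delta, seed=None):
        # Step 1: t = log2(1/delta) rows and b = 2/eps columns (rounded up)
        self.t = max(1, math.ceil(math.log2(1 / delta)))
        self.b = math.ceil(2 / eps)
        rng = random.Random(seed)
        # Step 2: one (alpha, beta) pair per row defines h_j (hypothetical family)
        self.params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(self.t)]
        # Step 3: t*b counters, all zero
        self.c = [[0] * self.b for _ in range(self.t)]

    def _h(self, j, x):
        alpha, beta = self.params[j]
        return ((alpha * x + beta) % P) % self.b

    def update(self, a, count=1):
        # Step 4: one counter per row is incremented
        for j in range(self.t):
            self.c[j][self._h(j, a)] += count

    def query(self, i):
        # Step 5: the minimum over rows is the tightest overestimate
        return min(self.c[j][self._h(j, i)] for j in range(self.t))


rng = random.Random(1)
stream = [rng.randint(1, 50) for _ in range(1000)]
cm = CountMin(eps=0.25, delta=0.125, seed=2)
for a in stream:
    cm.update(a)
true_counts = Counter(stream)
worst = max(cm.query(i) - true_counts[i] for i in range(1, 51))
print("largest overestimate:", worst, "target eps*m:", 0.25 * len(stream))
```

With `eps=0.25` and `delta=0.125` the sketch has 3 rows of 8 counters. The overestimate of any single query exceeds `eps*m` with probability at most `delta` under the idealized hash family; it is never negative.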
Multipurpose Sketches: Problems (recap)
• Range Query: For $i, j \in [n]$, estimate $f_i + f_{i+1} + \dots + f_j$. Denote this sum by $f_{i,j}$.
• The point, quantile, and heavy hitters queries, and the desired accuracy $\pm \varepsilon m$ with error probability $\delta$, are as stated earlier.
Range Queries
• We could estimate $f_{i,j}$ by $\tilde{f}_i + \tilde{f}_{i+1} + \dots + \tilde{f}_j$
• But errors add up: need too much space to keep accurate enough estimates
Idea: We could estimate counts for some intervals directly by grouping $i, \dots, j$
[Figure: the elements 1, 2, ..., 8 grouped into intervals]
How many intervals do we need to store so that each query interval is a union of $O(\log n)$ stored intervals?
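Dyadic Intervals
The dyadic intervals of $[n]$ form a binary tree with $\lg n + 1$ levels:
$[1, n]$
$[1, \frac{n}{2}]$, $[\frac{n}{2} + 1, n]$
$[1, \frac{n}{4}]$, $[\frac{n}{4} + 1, \frac{n}{2}]$, $[\frac{n}{2} + 1, \frac{3n}{4}]$, $[\frac{3n}{4} + 1, n]$
…
$[1]$, $[2]$, …, $[n - 1]$, $[n]$
• Exercise: Each interval $[i, j]$ is a disjoint union of at most $2 \lg n$ dyadic intervals.
• Such a representation of an interval is its dyadic decomposition.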
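A dyadic decomposition can be computed greedily: starting from the left endpoint, repeatedly take the largest dyadic interval that starts there and still fits inside the query. The sketch below assumes $n$ is a power of two and uses 1-indexed inclusive endpoints; the function name `dyadic_decomposition` is illustrative.

```python
def dyadic_decomposition(i, j):
    """Split [i, j] (1-indexed, inclusive) into disjoint dyadic intervals (lo, hi)."""
    pieces = []
    lo = i
    while lo <= j:
        size = 1
        # Double the block while it stays aligned to the dyadic grid and inside [lo, j]
        while (lo - 1) % (2 * size) == 0 and lo + 2 * size - 1 <= j:
            size *= 2
        pieces.append((lo, lo + size - 1))
        lo += size
    return pieces


print(dyadic_decomposition(2, 7))  # [(2, 2), (3, 4), (5, 6), (7, 7)]
print(dyadic_decomposition(1, 8))  # [(1, 8)]
```

For $n = 8$ the query $[2, 7]$ needs four dyadic intervals, within the bound $2 \lg n = 6$.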
Count-Min Strikes Back
Range Query Algorithm
1. Construct $\lg n + 1$ Count-Min sketches, one for each level, such that for all intervals $I$ at that level, our estimate $\tilde{f}_I$ for $f_I$ satisfies
   $\Pr[f_I \le \tilde{f}_I \le f_I + \varepsilon m] \ge 1 - \delta$
2. To answer a range query $[i, j]$, let $I_1, \dots, I_k$ be its dyadic decomposition. Return $\tilde{f}_{i,j} = \tilde{f}_{I_1} + \dots + \tilde{f}_{I_k}$
• Correctness: by a union bound over the at most $2 \lg n$ intervals of the decomposition,
  $\Pr[f_{i,j} \le \tilde{f}_{i,j} \le f_{i,j} + \varepsilon m (2 \lg n)] \ge 1 - \delta (2 \lg n)$
• Space: Multiply the old space complexity by $\log n$ and divide $\varepsilon$ and $\delta$ by $2 \lg n$:
  $O\left(\log^2 n \left(\log n + \frac{1}{\varepsilon} \log m\right) \log \frac{\log n}{\delta}\right)$
Quantile Query: For $\phi \in [0, 1]$, find $j$ with $f_{1,j} \approx \phi m$
• Approximate Median: Find $j$ such that $f_{1,j} \ge \frac{m}{2} - \varepsilon m$ and $f_{1,j-1} \le \frac{m}{2} + \varepsilon m$
• We can approximate the median via binary search over range queries.
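The range query algorithm can be sketched by keeping one Count-Min sketch per level of the dyadic tree and summing the estimates of the pieces of the decomposition. This is an illustrative sketch, not a tuned implementation: `RangeSketch`, the compact `CountMin` helper, the hash family, and the convention that level $k$ holds the intervals of size $2^k$ are all assumptions, and $n$ is assumed to be a power of two.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed larger than the universe size n


class CountMin:
    """Compact Count-Min sketch with t rows of b counters (hash family is an assumption)."""

    def __init__(self, t, b, rng):
        self.t, self.b = t, b
        self.params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
        self.c = [[0] * b for _ in range(t)]

    def _h(self, j, x):
        alpha, beta = self.params[j]
        return ((alpha * x + beta) % P) % self.b

    def update(self, x):
        for j in range(self.t):
            self.c[j][self._h(j, x)] += 1

    def query(self, x):
        return min(self.c[j][self._h(j, x)] for j in range(self.t))


def dyadic_decomposition(i, j):
    """Split [i, j] (1-indexed, inclusive) into disjoint dyadic intervals (lo, hi)."""
    pieces, lo = [], i
    while lo <= j:
        size = 1
        while (lo - 1) % (2 * size) == 0 and lo + 2 * size - 1 <= j:
            size *= 2
        pieces.append((lo, lo + size - 1))
        lo += size
    return pieces


class RangeSketch:
    def __init__(self, n, t, b, seed=None):
        rng = random.Random(seed)
        # lg n + 1 levels; level k here means intervals of size 2**k
        self.sketches = [CountMin(t, b, rng) for _ in range(n.bit_length())]

    def update(self, a):
        # Item a lies in exactly one interval per level, with index (a - 1) >> level
        for level, sk in enumerate(self.sketches):
            sk.update((a - 1) >> level)

    def query(self, i, j):
        total = 0
        for lo, hi in dyadic_decomposition(i, j):
            level = (hi - lo + 1).bit_length() - 1
            total += self.sketches[level].query((lo - 1) >> level)
        return total


n = 16
rng = random.Random(3)
stream = [rng.randint(1, n) for _ in range(500)]
rs = RangeSketch(n, t=3, b=8, seed=4)
for a in stream:
    rs.update(a)
true_range = sum(1 for a in stream if 5 <= a <= 12)
print("true:", true_range, "estimate:", rs.query(5, 12))
```

Each per-interval estimate is an overestimate, so the range estimate is never below the true count. An approximate median could then be found by binary search on `rs.query(1, j)`.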
Count-Min Strikes Back (Part 2)
Heavy Hitters Query: For $\phi \in (\varepsilon, 1)$, find a set $S$ that
– includes all $i$ with $f_i \ge \phi m$
– excludes all $j$ with $f_j \le (\phi - \varepsilon) m$
Heavy Hitters Algorithm
1. Construct $\lg n + 1$ Count-Min sketches for levels of the dyadic tree, as before
2. To answer query $\phi$, mark the root. Going level-by-level from the root, mark children $I$ of marked nodes if $\tilde{f}_I \ge \phi m$
3. Return all marked leaves
Correctness:
• If $f_i \ge \phi m$, then for all ancestors $I$ of the leaf $i$, $\tilde{f}_I \ge f_I \ge \phi m$, so every heavy hitter is returned
• If we ensure that $\Pr[\text{point query overestimates by} > \varepsilon m] \le \delta / n$, then, by a union bound, all point queries are correct w.p. $\ge 1 - \delta$
• There are at most $1/\phi$ indices $i$ with $f_i \ge \phi m$. Thus, $O(\phi^{-1} \log n)$ time suffices for post-processing
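The level-by-level marking can be sketched independently of how the interval counts are estimated. In the illustrative code below, `estimate(level, idx)` is a hypothetical callback returning the estimated count of the dyadic interval of size $2^{\text{level}}$ with 0-based index `idx`. The demo plugs in exact counts as a stand-in for Count-Min estimates (an assumption made to keep the example short), and $n$ is assumed to be a power of two.

```python
from collections import Counter


def heavy_hitters(estimate, n, threshold):
    """Return the 1-indexed items whose dyadic ancestors all have estimate >= threshold."""
    top = n.bit_length() - 1  # level of the root, lg n
    marked = [0]              # the root is always marked
    for level in range(top - 1, -1, -1):
        next_marked = []
        for idx in marked:
            # Children of interval idx at the level below are 2*idx and 2*idx + 1
            for child in (2 * idx, 2 * idx + 1):
                if estimate(level, child) >= threshold:
                    next_marked.append(child)
        marked = next_marked
    return [idx + 1 for idx in marked]  # marked leaves, converted to 1-indexed items


n = 8
stream = [3] * 5 + [6] * 4 + [1, 2, 4, 5, 7, 8]
# Exact per-level interval counts, standing in for the lg n + 1 Count-Min sketches
levels = [Counter((a - 1) >> level for a in stream) for level in range(n.bit_length())]
result = heavy_hitters(lambda level, idx: levels[level][idx], n, 0.25 * len(stream))
print(result)  # [3, 6]
```

Here $m = 15$ and $\phi = 0.25$, so the threshold is $3.75$; only items 3 and 6 (with counts 5 and 4) survive the descent.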
CR-Precis: Deterministic Count-Min [Ganguly Majumder 07]
Use deterministic hash functions: $h_j(a) = a \bmod p_j$, where $p_j$ is the $j$-th prime, for $j \in [t]$
Analysis: Fix $i^* \in [n]$. Define $z_1, \dots, z_t$ such that $c_{j,h_j(i^*)} = f_{i^*} + z_j$, that is,
  $z_j = \sum_{i \ne i^* :\, h_j(i) = h_j(i^*)} f_i$
• Claim: For each $i \ne i^*$, we have $h_j(i) = h_j(i^*)$ for at most $\log n$ primes $p_j$ (by the Chinese Remainder Theorem)
• Thus, $\sum_{j \in [t]} z_j = \sum_j \sum_{i \ne i^* :\, h_j(i) = h_j(i^*)} f_i = \sum_{i \ne i^*} \sum_{j :\, h_j(i) = h_j(i^*)} f_i \le \sum_i f_i \log n = m \log n$
• Hence $\tilde{f}_{i^*} = \min_{j \in [t]} c_{j,h_j(i^*)} = \min_{j \in [t]} (f_{i^*} + z_j) = f_{i^*} + \min_{j \in [t]} z_j \le f_{i^*} + \frac{m \log n}{t}$
• We set $t = \frac{\log n}{\varepsilon}$ to get $f_i \le \tilde{f}_i \le f_i + \varepsilon m$
• Requires keeping at most $t \cdot p_t = \tilde{O}\left(\frac{\log^2 n}{\varepsilon^2}\right)$ counters since $p_t = O(t \log t)$
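Because the guarantee is deterministic, it can be checked directly on a small example. The following minimal sketch uses hypothetical names (`CRPrecis`, `first_primes`) and a simple trial-division prime generator; row $j$ keeps $p_j$ counters, one per residue.

```python
import math
from collections import Counter


def first_primes(t):
    """Return the first t primes by trial division (fine for small t)."""
    primes, cand = [], 2
    while len(primes) < t:
        if all(cand % p for p in primes):
            primes.append(cand)
        cand += 1
    return primes


class CRPrecis:
    def __init__(self, t):
        self.primes = first_primes(t)
        self.c = [[0] * p for p in self.primes]  # row j has p_j counters

    def update(self, a):
        # h_j(a) = a mod p_j for every row j
        for row, p in zip(self.c, self.primes):
            row[a % p] += 1

    def query(self, i):
        return min(row[i % p] for row, p in zip(self.c, self.primes))


n, t = 64, 12
stream = [(7 * k * k + 3) % n + 1 for k in range(500)]  # arbitrary items in [1, n]
sk = CRPrecis(t)
for a in stream:
    sk.update(a)
true_counts = Counter(stream)
bound = len(stream) * math.log2(n) / t  # m log n / t from the analysis
worst = max(sk.query(i) - true_counts[i] for i in range(1, n + 1))
print("worst overestimate:", worst, "guaranteed bound:", bound)
```

With $n = 64$ and $t = 12$ the analysis guarantees an overestimate of at most $m \log_2 n / t = 250$ for every item, with no randomness involved.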
Count-Sketch: Count-Min+AMS combined
Count-Sketch
1. In addition to $h_j : [n] \to [b]$, use hash functions $r_j : [n] \to \{-1, 1\}$
2. Maintain $tb$ counters $c_{j,k} = \sum_{i :\, h_j(i) = k} r_j(i) f_i$
3. To answer a point query $i$, return $\hat{f}_i = \mathrm{median}\left(r_1(i)\, c_{1,h_1(i)}, \dots, r_t(i)\, c_{t,h_t(i)}\right)$
Claim. For all $j \in [t]$, $\mathbb{E}[r_j(i)\, c_{j,h_j(i)}] = f_i$ and $\mathrm{Var}[r_j(i)\, c_{j,h_j(i)}] \le \frac{F_2}{b}$, where $F_2 = \sum_i f_i^2$ is the second moment.
• By Chebyshev, for $b = 3/\varepsilon^2$,
  $\Pr\left[\left|f_i - r_j(i)\, c_{j,h_j(i)}\right| \ge \varepsilon \sqrt{F_2}\right] \le \frac{F_2}{\varepsilon^2 b F_2} = \frac{1}{3}$
• By Chernoff, for $t = \Theta(\log 1/\delta)$,
  $\Pr\left[\left|f_i - \hat{f}_i\right| \ge \varepsilon \sqrt{F_2}\right] \le \delta$
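The three steps of Count-Sketch can be sketched as follows. The class name `CountSketch`, the prime `P`, and both hash families are illustrative assumptions; in particular, taking a linear hash mod 2 gives signs that are only approximately unbiased and pairwise independent.

```python
import random
import statistics

P = (1 << 61) - 1  # a Mersenne prime, assumed larger than the universe size n


class CountSketch:
    def __init__(self, t, b, seed=None):
        rng = random.Random(seed)
        self.t, self.b = t, b
        # Hypothetical families: one (alpha, beta) pair per row for h_j and for r_j
        self.hp = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
        self.sp = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
        self.c = [[0] * b for _ in range(t)]

    def _h(self, j, x):
        alpha, beta = self.hp[j]
        return ((alpha * x + beta) % P) % self.b

    def _r(self, j, x):
        alpha, beta = self.sp[j]
        return 1 if ((alpha * x + beta) % P) % 2 else -1

    def update(self, a, count=1):
        # Step 2: c[j][h_j(a)] accumulates r_j(a) times the count of a
        for j in range(self.t):
            self.c[j][self._h(j, a)] += self._r(j, a) * count

    def query(self, i):
        # Step 3: median over rows of the sign-corrected counters
        return statistics.median(
            self._r(j, i) * self.c[j][self._h(j, i)] for j in range(self.t)
        )


cs = CountSketch(t=5, b=16, seed=0)
for _ in range(10):
    cs.update(7)
print(cs.query(7))  # 10, since no other item collides with 7

rng = random.Random(5)
noisy = CountSketch(t=5, b=16, seed=1)
for _ in range(300):
    noisy.update(rng.randint(1, 40))
for _ in range(100):
    noisy.update(7)
print("estimate for the frequent item 7:", noisy.query(7))
```

Unlike Count-Min, the estimate can fall on either side of the true count, because colliding items enter with random signs; the sketch is also linear, so negative updates cancel earlier insertions.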