  1. Sublinear Algorithms, Lecture 7
     Last time: communication complexity; other models of computation.
     Today: streaming.
     9/24/2020, Sofya Raskhodnikova, Boston University

  2. Data Stream Model [Alon Matias Szegedy 96]
     B L A - B L A - B L A - B L A - ...  (elements arrive one at a time)
     A streaming algorithm must (1) quickly process each element, (2) use limited working memory, and (3) quickly produce output.
     Motivation: internet traffic analysis.
     Model the stream as m elements from [n], e.g., a_1, a_2, ..., a_m = 3, 5, 3, 7, 5, 4, ...
     Goal: compute a function of the stream, e.g., median, number of distinct elements, longest increasing sequence.
     Based on Andrew McGregor's slides: http://www.cs.umass.edu/~mcgregor/slides/10-jhu1.pdf
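A minimal sketch (not from the slides) of what the model asks of an algorithm: one pass over the stream, a constant number of counters as working memory, an answer at the end. The stream and the statistics computed here are illustrative only.

def stream_max_and_count(stream):
    # Single pass; the working memory is just two numbers.
    count, largest = 0, None
    for a in stream:                      # (1) process each element as it arrives
        count += 1                        # running length of the stream
        if largest is None or a > largest:
            largest = a
    return count, largest                 # (3) produce output at the end

print(stream_max_and_count(iter([3, 5, 3, 7, 5, 4])))   # -> (6, 7)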

  3. Streaming Puzzle
     A stream contains n - 1 distinct elements of [n] in arbitrary order.
     Problem: find the missing element, using O(log n) space.
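The slide leaves this as a puzzle; the sketch below shows one standard solution, the running-sum trick, which fits the O(log n) space budget. The demo stream is made up.

def find_missing(stream, n):
    # The stream contains every element of {1, ..., n} except one.
    total = 0
    for a in stream:                 # single pass; the running sum uses O(log n) bits
        total += a
    return n * (n + 1) // 2 - total  # full sum minus observed sum

print(find_missing(iter([2, 4, 1, 5]), 5))   # -> 3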

  4. Sampling from a Stream of Unknown Length
     Warm-up: find a uniform sample s from a stream a_1, a_2, ..., a_m of known length m.
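For the warm-up, knowing m lets us fix a uniformly random position in advance and keep only the element that arrives there; a short sketch, with an illustrative stream:

import random

def sample_known_length(stream, m):
    # Choose a position uniformly at random before reading the stream,
    # then remember only the element at that position.
    idx = random.randrange(m)
    for t, a in enumerate(stream):
        if t == idx:
            return a

print(sample_known_length(iter([3, 5, 3, 7, 5, 4]), 6))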

  5. Sampling from a Stream of Unknown Length
     Problem: find a uniform sample s from a stream a_1, a_2, ..., a_m of unknown length m.
     Algorithm (Reservoir Sampling)
     1. Initially, s <- a_1.
     2. On seeing the t-th element, s <- a_t with probability 1/t.
     Analysis: what is the probability that s = a_j at some time t >= j?
     Pr[s = a_j] = (1/j) · (1 - 1/(j+1)) · ... · (1 - 1/t) = (1/j) · (j/(j+1)) · ... · ((t-1)/t) = 1/t.
     Space: O(k log n + log m) bits to get k samples.
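A direct transcription of the reservoir-sampling algorithm on the slide; the demo stream is illustrative.

import random

def reservoir_sample(stream):
    s = None
    for t, a in enumerate(stream, start=1):
        if random.random() < 1.0 / t:   # replace with probability 1/t; always true at t = 1
            s = a
    return s                             # uniform over the stream, whatever its length

print(reservoir_sample(iter([3, 5, 3, 7, 5, 4])))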

  6. Counting Distinct Elements
     Input: a stream a_1, a_2, ..., a_m in [n]^m.
     Warm-up: output the number of distinct elements in the stream.
     Exact solutions:
     • Store n bits, indicating whether each domain element has appeared (sketched below).
     • Store the stream: O(m log n) bits.
     Known lower bounds:
     • Every deterministic algorithm requires Ω(n) bits (even for a constant-factor approximation).
     • Every exact algorithm (even randomized) requires Ω(n) bits.
     Need to use both randomization and approximation to get polylog(m, n) space.
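A sketch of the first exact solution (the n-bit table), mainly to make the Θ(n)-bit cost concrete; a bytearray stands in for the bit vector.

def exact_distinct(stream, n):
    seen = bytearray(n + 1)   # one flag per domain element (a bit vector in principle)
    for a in stream:          # elements are assumed to come from {1, ..., n}
        seen[a] = 1
    return sum(seen)

print(exact_distinct(iter([3, 5, 3, 7, 5, 4]), 10))   # -> 4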

  7. Counting Distinct Elements
     Input: a stream a_1, a_2, ..., a_m in [n]^m.
     Goal: estimate the number of distinct elements in the stream up to a multiplicative factor (1 + ε) with probability >= 2/3.
     • Studied by [Flajolet Martin 83, Alon Matias Szegedy 96, ...]
     • Today: O(ε^-2 log n)-space algorithm [Bar-Yossef Jayram Kumar Sivakumar Trevisan 02]
     • Optimal: O(ε^-2 + log n)-space algorithm [Kane Nelson Woodruff 10]

  8. Counting Distinct Elements
     Input: a stream a_1, a_2, ..., a_m in [n]^m.
     Goal: estimate the number of distinct elements up to a multiplicative factor (1 + ε) with probability >= 2/3.
     Algorithm
     1. Apply a random hash function h : [n] -> [n] to each element.
     2. Compute Y, the t-th smallest hash value seen, where t = 10/ε².
     3. Return r̃ = t·n/Y as an estimate for r, the number of distinct elements.
     Analysis:
     • The algorithm uses O(ε^-2 log n) bits of space (not accounting for storing h).
     • We'll show: the estimate r̃ has good accuracy with reasonable probability.
     Claim. Pr[|r̃ - r| <= εr] >= 2/3.
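A sketch of this idealized algorithm, with the "random hash function" simulated by a lazily filled random table and the t smallest distinct hash values kept in a bounded heap; the default ε and the demo stream are illustrative choices, not part of the slide.

import heapq, math, random

def estimate_distinct(stream, n, eps=0.5):
    t = math.ceil(10 / eps ** 2)
    h = {}                # lazily built random function [n] -> [n]; stands in for a random hash
    heap = []             # max-heap (values negated) of the t smallest distinct hash values
    in_heap = set()
    for a in stream:
        ha = h.setdefault(a, random.randint(1, n))
        if ha in in_heap:
            continue
        if len(heap) < t:
            heapq.heappush(heap, -ha)
            in_heap.add(ha)
        elif ha < -heap[0]:                      # smaller than the current t-th smallest
            evicted = -heapq.heapreplace(heap, -ha)
            in_heap.discard(evicted)
            in_heap.add(ha)
    if len(heap) < t:          # fewer than t distinct hash values: report their number directly
        return len(heap)
    Y = -heap[0]               # the t-th smallest hash value
    return t * n / Y

stream = [random.randint(1, 10 ** 6) for _ in range(5000)]
print(len(set(stream)), round(estimate_distinct(iter(stream), 10 ** 6)))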

  9. Counting Distinct Elements: Analysis
     Recall: Y is the t-th smallest hash value, t = 10/ε², r̃ = t·n/Y.
     Claim. Pr[|r̃ - r| <= εr] >= 2/3.
     Proof: Suppose the distinct elements are e_1, ..., e_r.
     • Overestimation:
       Pr[r̃ >= (1 + ε)r] = Pr[t·n/Y >= (1 + ε)r] = Pr[Y <= t·n/((1 + ε)r)].
     • Let Z_j = 𝟙[h(e_j) <= t·n/((1 + ε)r)] and Z = Σ_{j=1}^{r} Z_j. Then
       E[Z_j] = t/((1 + ε)r),
       E[Z] = r · t/((1 + ε)r) = t/(1 + ε),
       Var[Z] = Σ_{j=1}^{r} Var[Z_j] <= Σ_{j=1}^{r} E[Z_j²] = Σ_{j=1}^{r} E[Z_j] = E[Z].

  10. Counting Distinct Elements: Analysis (continued)
      Recall: Y is the t-th smallest hash value, t = 10/ε², r̃ = t·n/Y, Z = Σ_{j=1}^{r} Z_j, E[Z] = t/(1 + ε), Var[Z] <= E[Z].
      Claim. Pr[|r̃ - r| <= εr] >= 2/3.
      • Overestimation (continued):
        Pr[Y <= t·n/((1 + ε)r)] = Pr[Z >= t] = Pr[Z >= (1 + ε)·E[Z]].
      • By Chebyshev's inequality, for ε <= 2/3,
        Pr[Z >= (1 + ε)·E[Z]] <= Var[Z]/(ε·E[Z])² <= 1/(ε²·E[Z]) = (1 + ε)/(ε²·t) = (1 + ε)/10 <= 1/6.
      • Underestimation: a similar analysis shows Pr[r̃ <= (1 - ε)r] <= 1/6.
      By a union bound, Pr[|r̃ - r| > εr] <= 1/3, proving the claim.

  11. Removing the Random Hashing Assumption
      Idea: use limited independence.
      • A family ℋ = {h : [a] -> [b]} of hash functions is k-wise independent if for all distinct x_1, ..., x_k in [a] and all y_1, ..., y_k in [b],
        Pr_{h in ℋ}[h(x_1) = y_1, ..., h(x_k) = y_k] = 1/b^k.
      Note: a uniformly random family is k-wise independent for all k.
      Observations: for x_1, ..., x_k as above,
      1. h(x_1) is uniform over [b].
      2. h(x_1), ..., h(x_k) are mutually independent.
      Based on Sepehr Assadi's lecture notes for CS 514 (Lecture 7, 03/20/20) at Rutgers.

  12. Construction of a k-wise Independent Family
      Idea: use limited independence.
      • A family ℋ = {h : [a] -> [b]} of hash functions is k-wise independent if for all distinct x_1, ..., x_k in [a] and all y_1, ..., y_k in [b],
        Pr_{h in ℋ}[h(x_1) = y_1, ..., h(x_k) = y_k] = 1/b^k.
      Construction of a k-wise independent family of hash functions:
      1. Let p be a prime. Consider the set of polynomials of degree k - 1 over 𝔽_p.
      2. ℋ = {h : {0, ..., p - 1} -> {0, ..., p - 1} | h(x) = c_{k-1}·x^{k-1} + ... + c_1·x + c_0, with c_0, ..., c_{k-1} in 𝔽_p}.
      3. To sample h in ℋ, sample c_0, ..., c_{k-1} in 𝔽_p uniformly and independently at random.
      • Space to store h is O(k log p).
      • For arbitrary [a], [b], need O(k · (log a + log b)) space.
      Based on Sepehr Assadi's lecture notes for CS 514 (Lecture 7, 03/20/20) at Rutgers.
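A sketch of this polynomial construction; the choices of k and of the prime p in the demo are illustrative.

import random

class KWiseHash:
    # A random polynomial of degree k-1 over GF(p), stored as its k coefficients.
    def __init__(self, k, p):
        self.p = p
        self.coeffs = [random.randrange(p) for _ in range(k)]    # c_0, ..., c_{k-1}

    def __call__(self, x):
        # Horner's rule: c_{k-1} x^{k-1} + ... + c_1 x + c_0 (mod p)
        value = 0
        for c in reversed(self.coeffs):
            value = (value * x + c) % self.p
        return value

h = KWiseHash(k=2, p=2 ** 31 - 1)    # k = 2 gives a pairwise independent hash function
print(h(3), h(5), h(7), h(3))        # equal inputs always hash to equal values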

  13. Counting Distinct Elements: Final Algorithm
      Input: a stream a_1, a_2, ..., a_m in [n]^m.
      Goal: estimate the number of distinct elements up to a multiplicative factor (1 + ε) with probability >= 2/3.
      Algorithm
      1. Sample a hash function h : [n] -> [n] from a 2-wise independent family and apply h to each element.
      2. Compute Y, the t-th smallest hash value seen, where t = 10/ε².
      3. Return r̃ = t·n/Y as an estimate for r, the number of distinct elements.
      Analysis:
      • The algorithm uses O(ε^-2 log n) bits of space.
      • Our correctness analysis applies: it only used the expectation of each Z_j and the fact that the variance of the sum is the sum of the variances, which pairwise independence already guarantees.
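The only change from the idealized sketch earlier is step 1: sample h from a pairwise independent family (degree-1 polynomials over 𝔽_p, the k = 2 case of the construction above) instead of assuming a truly random function. A minimal sketch of that sampling step, with an illustrative prime p >= n; with hash range {0, ..., p - 1}, step 3 would use p in place of n in the estimate.

import random

def sample_pairwise_hash(p=2 ** 31 - 1):
    # h(x) = (c1*x + c0) mod p: a degree-1 polynomial over GF(p), i.e. a member of a
    # 2-wise independent family; storing it takes only two coefficients, O(log p) bits.
    c0, c1 = random.randrange(p), random.randrange(p)
    return lambda x: (c1 * x + c0) % p

h = sample_pairwise_hash()
print([h(a) for a in [3, 5, 3, 7, 5, 4]])   # repeated elements get the same hash value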

  14. Frequency Moments Estimation
      Input: a stream a_1, a_2, ..., a_m in [n]^m.
      • The frequency vector of the stream is f = (f_1, ..., f_n), where f_i is the number of times i appears in the stream.
      • The p-th frequency moment is F_p = Σ_{i=1}^{n} f_i^p.
        F_0 is the number of nonzero entries of f (the number of distinct elements).
        F_1 = m (the number of elements in the stream).
        F_2 = ‖f‖_2² is a measure of non-uniformity, used e.g. for anomaly detection in network analysis.
        F_∞ = max_i f_i is the frequency of the most frequent element.
      Goal: estimate F_p up to a multiplicative factor (1 + ε) with probability >= 2/3.
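A small illustration of these definitions, computing the moments exactly from the frequency vector with no space constraint; the demo stream matches the example from slide 2.

from collections import Counter

def frequency_moment(stream, p):
    f = Counter(stream)                    # frequency vector: f[i] = number of times i appears
    if p == 0:
        return sum(1 for c in f.values() if c > 0)     # number of distinct elements
    return sum(c ** p for c in f.values())

stream = [3, 5, 3, 7, 5, 4]
print(frequency_moment(stream, 0),         # F_0 = 4 distinct elements
      frequency_moment(stream, 1),         # F_1 = 6 = m, the stream length
      frequency_moment(stream, 2),         # F_2 = 10, the squared 2-norm of f
      max(Counter(stream).values()))       # F_infinity = 2, the top frequency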

  15. Summary
      • Streaming model
      • Reservoir sampling
      • Distinct elements (approximating F_0)
      • k-wise independent hashing
