
Algorithms for data streams

Irene Finocchi (finocchi@di.uniroma1.it)

Outline: Intro, Puzzles, Sampling, Sketches, Graphs, Lower bounds, Summing up, Thanks!

http://www.dsi.uniroma1.it/ finocchi/, May 9, 2012


  6. Data stream model: Performance metrics

     Minimize space, passes, and processing time upon token arrivals:

     1. Use a sublinear amount of space $s$: $s = o(\min\{n, m\})$, where $s$ is the number of bits of random-access working memory.
     2. Make $p$ passes over the data, for some small integer $p$ (no random access to tokens).
     3. Use a small per-item processing time $t$.

     Happy if $s = O(\log m + \log n)$ or $s = O(\mathrm{polylog}(\min\{n, m\}))$, $p = 1$, and $t = O(1)$.

  9. Data stream model: Token frequencies

     A data stream is a sequence $\sigma = \langle a_1, a_2, \ldots, a_m \rangle$ of tokens drawn from the universe $[n] = \{1, 2, \ldots, n\}$.

     $\sigma$ represents a multiset of items and implicitly defines a frequency vector $f = \langle f_1, f_2, \ldots, f_n \rangle$, where $f_i$ is the number of occurrences of item $i \in [n]$ in $\sigma$.

     Example: if $\sigma = \langle 2, 1, 2, 1, 5, 2, 3, 2 \rangle$ and $n = 5$, then $f = \langle 2, 4, 1, 0, 1 \rangle$.

     In many streaming problems, we wish to compute statistical properties of the multiset, e.g., the majority token (if any), the most frequent items, or the number of distinct items.
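
As a small illustration of the frequency vector, the following Python sketch stores all $n$ counters explicitly. The function name `frequency_vector` is an assumption made for this example, and the approach uses linear space, which is exactly what a streaming algorithm tries to avoid.

```python
def frequency_vector(stream, n):
    """Return [f_1, ..., f_n] for a stream of tokens drawn from {1, ..., n}."""
    f = [0] * n
    for token in stream:
        f[token - 1] += 1  # tokens are 1-based, list indices are 0-based
    return f


print(frequency_vector([2, 1, 2, 1, 5, 2, 3, 2], 5))  # [2, 4, 1, 0, 1]
```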

  13. Data stream model: Variations of the basic setup

      The data stream is a sequence of tuples $\sigma = \langle (a_1, c_1), (a_2, c_2), \ldots \rangle$, where $(a_i, c_i) \in [n] \times \{-F, \ldots, F\}$.

      - Upon arrival of $(a_i, c_i)$, update the frequency: $f_{a_i} = f_{a_i} + c_i$.
      - New role for $m$: $m = \sum_{j=1}^{n} f_j$.

      Three models, depending on the allowed values of $c_i$:

      - Basic data stream model: $c_i = 1$ ($m$ = stream length).
      - Cash register model: $c_i > 0$ (items can only arrive; their frequencies can be incremented by variable amounts).
      - Turnstile model: generic $c_i$ (items can arrive and depart from the multiset).
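
The three update models can be sketched in Python as follows. The function `apply_updates` and its `model` switch are hypothetical names introduced only for this illustration; the frequency vector is stored explicitly, so this shows the semantics of the updates rather than a small-space algorithm.

```python
def apply_updates(updates, n, model="turnstile"):
    """Apply (a_i, c_i) updates to a frequency vector over {1, ..., n}.

    `model` is a hypothetical switch: "basic", "cash_register", or "turnstile".
    Returns the frequency vector f and m, the sum of all frequencies.
    """
    f = [0] * n
    for a, c in updates:
        if model == "basic" and c != 1:
            raise ValueError("basic model requires c_i = 1")
        if model == "cash_register" and c <= 0:
            raise ValueError("cash register model requires c_i > 0")
        f[a - 1] += c  # f_{a_i} = f_{a_i} + c_i
    m = sum(f)  # the "new role for m"
    return f, m


# Turnstile stream: item 1 arrives three times, then two copies depart.
print(apply_updates([(1, 3), (2, 1), (1, -2)], n=3))  # ([1, 1, 0], 2)
```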

  17. Data stream model: Historical remarks

      - Origin in the 70s (seminal paper by Munro & Paterson, STOC’78).
      - Gained popularity in the last fifteen years:
        - theoretical interest: easy-to-state but hard-to-solve problems; links to other theory areas and to novel computing paradigms (MapReduce);
        - practical appeal: fast and effective solutions, wide applicability.
      - Alon, Matias & Szegedy: Gödel Prize (2005) for their paper on frequency moments approximation (STOC’96, JCSS’99), foundational work for streaming and sketching algorithms.

  18. Three puzzles: Data stream challenges (Missing number, Fishing, Pointer & chaser, Recap)

  22. The missing number puzzle

      $\pi = \langle \pi_1, \pi_2, \ldots, \pi_{n-1} \rangle$ is a permutation of $[1, n]$ with one number missing. What's the missing number?

      Constraint: Carole has limited memory; she can only use $O(\log n)$ bits.

      Solution: keep a running sum of the stream and return $\frac{n(n+1)}{2} - \sum_{i=1}^{n-1} \pi_i$.
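
A minimal Python sketch of the running-sum idea, assuming the stream contains every value of $\{1, \ldots, n\}$ except one. The name `missing_number` is an assumption made for this example.

```python
def missing_number(stream, n):
    """Return the one value of {1, ..., n} that is absent from `stream`."""
    remaining = n * (n + 1) // 2  # 1 + 2 + ... + n
    for value in stream:
        remaining -= value  # one pass, one accumulator of O(log n) bits
    return remaining


print(missing_number([3, 1, 5, 2], n=5))  # 4
```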

  25. Two missing numbers

      Now $\pi$ has two missing numbers, $x$ and $y$: find them, but use only $O(\log n)$ bits!

      Track $S = \frac{n(n+1)}{2} - \sum_{i=1}^{n-2} \pi_i$ and $P = \frac{n!}{\prod_{i=1}^{n-2} \pi_i}$, then solve the equations $x + y = S$ and $x y = P$.

      How many bits? $\Omega(\log n!) = \Omega(n \log n)$, which exceeds the $O(\log n)$ budget.

  28. Two missing numbers

      Now $\pi$ has two missing numbers, $x$ and $y$: find them, but use only $O(\log n)$ bits!

      Track $S_1 = \frac{n(n+1)}{2} - \sum_{i=1}^{n-2} \pi_i$ and $S_2 = \frac{n(n+1)(2n+1)}{6} - \sum_{i=1}^{n-2} \pi_i^2$, then solve the equations $x + y = S_1$ and $x^2 + y^2 = S_2$.

      How many bits? $O(\log n^3) = O(\log n)$.
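
The sum and sum-of-squares approach can be sketched in Python as below. The function name and the closed-form step are assumptions of this example: from $x + y = S_1$ and $x^2 + y^2 = S_2$ one gets $(x - y)^2 = 2 S_2 - S_1^2$, which is solved with an integer square root.

```python
import math


def two_missing_numbers(stream, n):
    """Return the two values (x, y), x < y, of {1, ..., n} absent from `stream`."""
    s1 = n * (n + 1) // 2                # sum of 1..n
    s2 = n * (n + 1) * (2 * n + 1) // 6  # sum of squares of 1..n
    for value in stream:
        s1 -= value
        s2 -= value * value
    # Now s1 = x + y and s2 = x^2 + y^2, hence (x - y)^2 = 2*s2 - s1^2.
    diff = math.isqrt(2 * s2 - s1 * s1)
    return (s1 - diff) // 2, (s1 + diff) // 2


print(two_missing_numbers([1, 2, 4, 6, 7], n=7))  # (3, 5)
```

Both accumulators stay below $n^3$, which matches the $O(\log n^3)$ bit count.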

  30. Lesson 1

      Some problems can be solved deterministically in logarithmic space and one pass.

      Most of the time, we're not so lucky.

  37. Fishing

      - $U = \{1, \ldots, u\}$: fish species in the universe.
      - $a_t \in U$: fish species caught at time $t$.
      - $f_t[j] = |\{ i \le t : a_i = j \}|$: frequency of species $j$ up to time $t$.
      - $j$ is rare iff $f_t[j] = 1$.
      - Rarity of the catch at time $t$: $\rho_t = \frac{|\{ j \mid f_t[j] = 1 \}|}{u} = \frac{R_t}{u}$.

      George is curious and wants to compute rarity. A $2u$-bit vector would suffice, but George's suitcase has $o(u)$ size.

  43. Deterministic fish rarity

      Claim: George cannot compute $\rho_t$ precisely with a deterministic algorithm using only $o(u)$ bits.

      Proof, by contradiction:

      - Let $S \subseteq U$ be a set of species: no duplicates, $|S| = \Theta(u)$.
      - $\Omega(|S|) = \Omega(u)$ bits are needed to represent $S$.
      - If the claim were false, we could break this information-theoretic lower bound: George's $o(u)$-bit memory after reading $S$ would be enough to retrieve $S$.
      - To retrieve $S$, for each $i \in U$, stream $\langle S, i \rangle$ to George and compare $\rho_t$ and $\rho_{t+1}$, where $t = |S|$:
        - if $i \notin S$, then $R_{t+1} = R_t + 1$ and $\rho_{t+1} > \rho_t$;
        - if $i \in S$, then $R_{t+1} = R_t - 1$ and $\rho_{t+1} < \rho_t$.
      - Hence $\rho$ decreases $\Leftrightarrow i \in S$.
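
The retrieval step of the proof can be made concrete with a small Python sketch. The class `TwoBitRarity` is a hypothetical stand-in for any exact deterministic rarity algorithm (here it keeps 2 bits per species, so it uses linear space), and `recover_set` shows how its memory alone reveals $S$.

```python
import copy


class TwoBitRarity:
    """Exact rarity tracker; a stand-in for George's hypothetical algorithm."""

    def __init__(self, u):
        self.u = u
        # state[j] is 0 (unseen), 1 (seen once) or 2 (seen at least twice)
        self.state = [0] * (u + 1)

    def add(self, species):
        if self.state[species] < 2:
            self.state[species] += 1

    def rarity(self):
        return sum(1 for s in self.state[1:] if s == 1) / self.u


def recover_set(memory, u):
    """Rebuild S from the tracker's memory alone, as in the proof."""
    recovered = set()
    for i in range(1, u + 1):
        probe = copy.deepcopy(memory)  # restart from the state reached after S
        before = probe.rarity()
        probe.add(i)
        if probe.rarity() < before:  # rarity decreases iff i is in S
            recovered.add(i)
    return recovered


S = {2, 3, 7}
george = TwoBitRarity(u=8)
for species in sorted(S):
    george.add(species)
print(recover_set(george, 8) == S)  # True
```

Since the memory state determines $S$, it cannot be shorter than an encoding of $S$, which is the contradiction the slide refers to.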

  46. Randomized fish rarity (1/2)

      George can approximate $\rho_t$ using $2k = o(u)$ bits.

      Sampling:

      - pick $k$ random fish species;
      - maintain a rarity counter $c_1[t], \ldots, c_k[t]$ for each sampled species (2 bits each);
      - let $\hat{R}_t = |\{ i \in [1, k] \mid c_i[t] = 1 \}|$ and return $\hat{\rho}_t = \frac{\hat{R}_t}{k}$.

      Claim: $E[\hat{\rho}_t] = \rho_t$.

      If $\rho_t$ is large enough, $\hat{\rho}_t$ is a good estimate of $\rho_t$, with arbitrarily small error and good probability. Proving this requires more advanced probabilistic tools: examples later.

  52. Randomized fish rarity (2/2)

      Recall that $\hat{\rho}_t = \frac{|\{ i \in [1, k] \mid c_i[t] = 1 \}|}{k} = \frac{\hat{R}_t}{k}$. Proof that $E[\hat{\rho}_t] = \rho_t$:

      - Let $Y_i$ be an indicator variable: $Y_i = 1$ if $c_i[t] = 1$, and $Y_i = 0$ otherwise.
      - $\Pr\{Y_i = 1\} = \Pr\{\text{the } i\text{-th sampled species is rare}\} = \frac{R_t}{u} = \rho_t$
      - $\Rightarrow E[Y_i] = \rho_t$
      - $\Rightarrow E[\hat{R}_t] = \sum_{i=1}^{k} E[Y_i] = k \rho_t$
      - $\Rightarrow E[\hat{\rho}_t] = \frac{E[\hat{R}_t]}{k} = \rho_t$
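
A hedged Python sketch of the sampling estimator follows. It assumes the $k$ species are drawn uniformly and independently (with replacement) before the stream starts; the function name, the `seed` parameter, and the dictionary of counters are choices made for this example only.

```python
import random


def estimate_rarity(stream, u, k, seed=0):
    """Estimate the rarity of `stream` over species {1, ..., u} from k samples."""
    rng = random.Random(seed)
    sampled = [rng.randint(1, u) for _ in range(k)]  # uniform, with replacement
    counters = {j: 0 for j in sampled}  # each counter only takes values 0, 1, 2
    for species in stream:
        if species in counters and counters[species] < 2:
            counters[species] += 1
    rare_hits = sum(1 for j in sampled if counters[j] == 1)  # estimate of R_t, scaled by k/u
    return rare_hits / k


# Every species is caught exactly once, so every sampled species is rare.
print(estimate_rarity(range(1, 11), u=10, k=5))  # 1.0
```

On less extreme streams the returned value varies with the random sample; only its expectation equals $\rho_t$.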

  53. Lesson 2

      It is often impossible to solve problems precisely and deterministically in small (sublinear) space. Randomization and approximation greatly help:

      - find an answer correct within some factor (e.g., guarantee that $\hat{\rho}$ is within 10% of $\rho$);
      - allow a small probability of failure (e.g., the answer is correct, except with probability 1 in 10,000).

  55. Pointer and chaser

      Paul has $n + 1$ pointers. Each pointer $i$ points to a position $P[i] \in [1, n]$, so at least two pointers must point to the same position.

      Example ($n = 7$): $P = \langle 3, 6, 5, 2, 1, 3, 4, 1 \rangle$.

      Carole has to find any duplicate pointer.

      Constraints: $O(\log n)$ bits, $O(n)$ queries, and items cannot be moved.

  57. Repeated scans

      Example ($n = 7$): $P = \langle 3, 6, 5, 2, 1, 3, 4, 1 \rangle$.

      1. Trivial solution: for each $i$, count how many $j$ are such that $P[j] = i$. This takes $O(\log n)$ bits, but $O(n^2)$ queries.
      2. Better solution: if the number of items below $n/2$ exceeds the number of items above $n/2$, search for duplicates $< n/2$; otherwise search for duplicates $\ge n/2$; then repeat on the chosen half. This takes $O(\log n)$ bits and passes, and $O(n \log n)$ queries.
      3. With $O(\log n)$ bits, $\Omega(\log n / \log \log n)$ passes are needed.
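
The halving strategy can be sketched in Python as follows. This version uses a pigeonhole test on the current value range (a half that receives more pointers than it has values must contain a duplicate), which is one way to make the recursion precise; the function name and the returned pass counter are assumptions of this example.

```python
def find_duplicate_by_halving(P, n):
    """P holds n + 1 values in [1, n]; return (a duplicated value, passes used)."""
    lo, hi = 1, n
    passes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        # One sequential pass: count the pointers that fall in the lower half.
        in_lower = sum(1 for value in P if lo <= value <= mid)
        passes += 1
        if in_lower > mid - lo + 1:  # more pointers than values: duplicate here
            hi = mid
        else:                        # otherwise the upper half is overfull
            lo = mid + 1
    return lo, passes


print(find_duplicate_by_halving([3, 6, 5, 2, 1, 3, 4, 1], n=7))  # (1, 3)
```

Only `lo`, `hi`, `mid`, and one counter are stored, so the working memory is $O(\log n)$ bits, at the price of one pass per halving step.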

  67. Random access helps

      Chase pointers, starting from position $n + 1$. The problem is equivalent to finding a loop in a linked list, and it can be solved in $O(n)$ time with just 2 pointers, $r_1$ and $r_2$!

      Example ($n = 7$): $P = \langle 3, 6, 5, 2, 1, 3, 4, 1 \rangle$.

      [Figure: a linked list with a loop and two pointers $r_1$ and $r_2$, labelled $a = 9$, $b = 3$, $c = 3$.] Reading the equations below, $a$ is the number of steps before the loop, $b$ the number of steps from the loop entry to the point where the pointers meet, and $c$ the remaining steps around the loop.

      When the pointers meet, $r_1$ has taken $t$ steps and $r_2$ has taken $2t$ steps, going around the loop $k$ times ($t$ and $k$ known):

      $a + b = t$ and $a + k(b + c) + b = 2t$ $\Rightarrow$ $a + b = t$ and $b + c = t/k$ $\Rightarrow$ $a = c + \frac{k-1}{k} t$.

      In the figure, $t(k-1)/k = 6$.
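
The two-pointer technique described here is commonly known as Floyd's cycle detection. The Python sketch below applies it to the pointer array, assuming 1-based positions and starting the chase at position $n + 1$; the function and helper names are assumptions of this example.

```python
def find_duplicate_by_chasing(P):
    """P holds n + 1 values in [1, n]; position i (1-based) points to P[i - 1]."""

    def follow(i):
        return P[i - 1]

    start = len(P)  # position n + 1, which no pointer can point to
    # Phase 1: the slow pointer moves 1 step, the fast pointer moves 2 steps.
    slow, fast = follow(start), follow(follow(start))
    while slow != fast:
        slow, fast = follow(slow), follow(follow(fast))
    # Phase 2: restart one pointer; moving both 1 step at a time,
    # they meet at the loop entry, which two different pointers point to.
    slow = start
    while slow != fast:
        slow, fast = follow(slow), follow(fast)
    return slow


print(find_duplicate_by_chasing([3, 6, 5, 2, 1, 3, 4, 1]))  # 1
```

Phase 2 relies on the relation $a = c + \frac{k-1}{k} t$: walking $a$ steps from the start and $a$ steps from the meeting point both end at the loop entry.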

  68. Lesson 3

      Tokens come as a stream: no random access. It is sometimes impossible to achieve the same bounds as in the RAM model.

  69. Recap on lessons

      - It is typically impossible to solve problems precisely and deterministically in small (sublinear) space.
      - Randomize and approximate!
      - Sequential data access makes things harder.

  70. Sampling: Working with less (Reservoir sampling, Heavy hitters)

  72. Why sampling?

      - Basic problem: sample $s$ items uniformly from a stream.
      - Answer queries (e.g., compute fish species rarity) on the sample.
      - Utility depends on the problem: in some cases, sampling-based approaches are not effective unless they take large (almost linear) samples.

      How can we sample uniformly if we don't know in advance how long the stream is? When do we sample a stream token?

  74. Reservoir sampling

      1. Add to $S$ the first $s$ stream items.
      2. Upon seeing $x_i$ at time $i > s$, sample $x_i$ with probability $s / i$.
      3. If $x_i$ is added to $S$, evict a random item from $S$ (other than $x_i$).

      Sample is uniform: at any time $t \ge s$ and for each $i \le t$, it holds that $\Pr\{x_i \in_t S\} = \frac{s}{t}$, where $x_i \in_t S$ means that $x_i$ is in $S$ at time $t$.

      Warmup analysis ($s = 1$):

      $\Pr\{x_i \in_t S\} = \Pr\{x_i \text{ sampled at time } i\} \times \Pr\{x_i \text{ survives up to time } t\} = \frac{1}{i} \times \frac{i}{i+1} \times \frac{i+1}{i+2} \times \cdots \times \frac{t-2}{t-1} \times \frac{t-1}{t} = \frac{1}{t}$
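
The three steps of reservoir sampling translate almost directly into Python. This is a minimal sketch; the function name and the optional `rng` parameter (used only to make runs reproducible) are assumptions of this example.

```python
import random


def reservoir_sample(stream, s, rng=None):
    """Return a uniform sample of s items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            sample.append(item)              # step 1: keep the first s items
        elif rng.random() < s / i:           # step 2: keep x_i with probability s / i
            sample[rng.randrange(s)] = item  # step 3: evict a uniformly random resident
    return sample


print(reservoir_sample(range(1000), s=5, rng=random.Random(42)))
```

The stream is read once and its length is never needed, which answers the question of how to sample without knowing the stream length in advance.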

  80. Arbitrary sample size $s$: analysis

      Sample is uniform: $\Pr\{x_i \in_t S\} = \frac{s}{t}$.

      By induction on $t$ (base step: $t \le s$). How does $S$ change at time $t$ when $x_t$ arrives?

      1. $\Pr\{x_t \text{ added to } S\} = \frac{s}{t}$
      2. Inductive hypothesis: $\Pr\{x_i \in_{t-1} S\} = \frac{s}{t-1}$
      3. $\Pr\{x_i \in_t S \mid x_t \text{ added to } S\} = \Pr\{x_i \in_{t-1} S \text{ and not evicted}\} = \frac{s}{t-1}\left(1 - \frac{1}{s}\right)$
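
One way to complete the inductive step for $i < t$ from facts 1 to 3 is sketched below (a sketch under the assumption that the decision to add $x_t$ is independent of the current content of $S$; not necessarily the author's presentation). The case $i = t$ is fact 1 itself.

```latex
\begin{align*}
\Pr\{x_i \in_t S\}
  &= \Pr\{x_t \text{ not added}\}\cdot\Pr\{x_i \in_{t-1} S\}
   + \Pr\{x_t \text{ added}\}\cdot\Pr\{x_i \in_t S \mid x_t \text{ added}\} \\
  &= \left(1 - \frac{s}{t}\right)\frac{s}{t-1}
   + \frac{s}{t}\cdot\frac{s}{t-1}\left(1 - \frac{1}{s}\right) \\
  &= \frac{s}{t-1}\cdot\frac{(t - s) + (s - 1)}{t}
   = \frac{s}{t-1}\cdot\frac{t-1}{t}
   = \frac{s}{t}.
\end{align*}
```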
