Consistent subset sampling
Konstantin Kutzkov and Rasmus Pagh
Work supported by:

Agenda
• Consistent sampling
• Applications of consistent sampling of subsets
• New sampling algorithm

Sampling
• Fundamental tool in (streaming) algorithms, e.g., to estimate the number of distinct items or the Jaccard similarity of sets.

Consistent sampling
• Setting: Data stream of items x_1, x_2, … ∈ U.
• Problem: Create a substream containing only the items in a random sample U_2 ⊆ U.
• Idea: Use a hash function h to determine whether to sample:
  • Include x_i in the sample iff h(x_i) = 0.
  • U_2 = h^{-1}(0).

• Property: If the hash function is 2-independent, then the fraction sampled is concentrated around its expectation.
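The sampling rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: items are non-negative integers smaller than the prime `P`, and the hash is the textbook construction `((a*x + b) mod P) mod r`, which is pairwise independent over `[P]` and only approximately uniform after the final `mod r`. The names `make_hash` and `consistent_sample` are hypothetical and do not come from the slides.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; assumed larger than any item id


def make_hash(r, seed):
    """Return h(x) = ((a*x + b) mod P) mod r with seeded random a, b."""
    rng = random.Random(seed)
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % r


def consistent_sample(stream, h):
    """Keep exactly the items x with h(x) == 0, in stream order."""
    return [x for x in stream if h(x) == 0]


# Two different streams filtered with the same h: an item that occurs
# in both is either sampled in both or in neither.
h = make_hash(r=4, seed=1)
stream_a = [3, 17, 42, 17, 99, 3]
stream_b = [99, 42, 8, 3]
sample_a = consistent_sample(stream_a, h)
sample_b = consistent_sample(stream_b, h)
```

Because the decision depends only on `h(x)`, repeated occurrences of an item and occurrences in other streams are always treated the same way, which is what makes the sample consistent.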

Consistent subset sampling
• Setting: Data stream of sets S_1, S_2, … ⊆ U.
• Problem: For each set S_i, generate every K ⊆ S_i with |K| = k and h(K) = 0.
• Hash function h should be 2-independent with Pr[h(K) = 0] = 1/r.
• Goal: Time and space efficiency.

• In this talk: k = 4.

Motivating example
• Set of shopping baskets (sets of products).
• Generate a sample of 4-sets of products bought together to, for example, estimate:
  • The number of frequent 4-sets.
  • Similarity in buying habits of two customers.

• Use to fine-tune the parameters of frequent itemset mining algorithms (Apriori, FP-growth) in order to avoid generating a huge number of itemsets that are considered frequent.

Naïve solutions
1. Explicitly generate all 4-subsets and apply a 2-independent hash function to each of them.
  • Slow: Time O(b^4) for a set of size b.
2. Sample a set of items S ⊆ S_i, each item independently with probability r^{-1/4}, and output the 4-subsets of S.
  • Not 2-independent: Positive correlation among overlapping 4-subsets.
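The positive correlation in the second approach can be made explicit with a short calculation. The symbols q, K, K' and j are introduced here for illustration and do not appear on the slides; the calculation assumes a 4-subset counts as sampled exactly when all four of its items are sampled.

```latex
% Each item is kept independently with probability q = r^{-1/4}.
% K and K' are distinct 4-subsets sharing j = |K \cap K'| items, 0 \le j \le 3.
\begin{align*}
\Pr[K \text{ sampled}] &= q^{4} = \frac{1}{r}, \\
\Pr[K \text{ and } K' \text{ sampled}] &= q^{|K \cup K'|} = q^{8-j}
  = \frac{r^{j/4}}{r^{2}}.
\end{align*}
% Pairwise independence would require the joint probability to equal 1/r^2.
% For j \ge 1 it is larger by a factor r^{j/4} > 1.
```

Only disjoint subsets (j = 0) behave independently, so the resulting sample of 4-subsets is not 2-independent.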

Decomposable hashing
We choose h to be “decomposable”:
h({i_1, i_2, i_3, i_4}) = (g(i_1) + g(i_2) + g(i_3) + g(i_4)) mod r
where g: U → [r] is an 8-independent hash function.

• h is 2-independent.

• Our main reduction: Finding 4-subsets with hash value zero is equivalent to solving 4SUM on {g(x) | x ∈ S}.
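A minimal sketch of the decomposable hash, checked by brute force, may help make the definition concrete. It is not the fast algorithm from the talk: it enumerates all 4-subsets, as in the first naïve solution. The function `make_g` is a hypothetical stand-in for an 8-independent hash function (a seeded table of random values filled on first use), and all names are assumptions made for this example.

```python
import random
from itertools import combinations


def make_g(r, seed):
    """Stand-in for an 8-independent g: U -> [r].

    Values are drawn at random on first use and then remembered,
    so g is consistent within one run.
    """
    rng = random.Random(seed)
    table = {}

    def g(x):
        if x not in table:
            table[x] = rng.randrange(r)
        return table[x]

    return g


def h(subset, g, r):
    """Decomposable hash: sum of g over the items, modulo r."""
    return sum(g(x) for x in subset) % r


def sample_4_subsets_bruteforce(S, g, r):
    """Return every 4-subset K of S (as a sorted tuple) with h(K) == 0."""
    return [K for K in combinations(sorted(S), 4) if h(K, g, r) == 0]


r = 5
g = make_g(r, seed=7)
basket_1 = {1, 2, 3, 4, 5, 6, 7}
basket_2 = {3, 4, 5, 6, 7, 8, 9}
sample_1 = sample_4_subsets_bruteforce(basket_1, g, r)
sample_2 = sample_4_subsets_bruteforce(basket_2, g, r)
```

A 4-subset contained in both baskets is sampled from both or from neither, since h depends only on the subset itself. Finding the sampled subsets is the same as finding four items whose g-values sum to 0 modulo r, which is the 4SUM view stated above.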

Schroeppel/Shamir technique
[Figure: four lists of sorted values. Pair sums from two of the lists are kept in a minheap, pair sums from the other two lists in a maxheap.]
• In paper: How to handle modular arithmetic.
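The heap-based idea in the figure can be sketched as follows. This is an illustrative version under simplifying assumptions, not the algorithm from the paper: it works on plain integers with an exact target and does not handle the modular arithmetic mentioned above. The function name `four_sum_heaps` is hypothetical. Each heap holds one entry per element of its first list, so the heaps use linear space, although the groups of equal pair sums collected below can be larger in the worst case.

```python
import heapq


def four_sum_heaps(A, B, C, D, target):
    """Yield all (a, b, c, d) in A x B x C x D with a + b + c + d == target."""
    A, B, C, D = sorted(A), sorted(B), sorted(C), sorted(D)
    if not (A and B and C and D):
        return

    # Min-heap of pair sums a + b: one entry per a, starting at the smallest b.
    lo = [(a + B[0], i, 0) for i, a in enumerate(A)]
    heapq.heapify(lo)
    # Max-heap of pair sums c + d (keys negated): one entry per c,
    # starting at the largest d.
    hi = [(-(c + D[-1]), k, len(D) - 1) for k, c in enumerate(C)]
    heapq.heapify(hi)

    def next_lo(i, j):
        # Next larger partner for A[i], if any.
        return (A[i] + B[j + 1], i, j + 1) if j + 1 < len(B) else None

    def next_hi(k, l):
        # Next smaller partner for C[k], if any.
        return (-(C[k] + D[l - 1]), k, l - 1) if l > 0 else None

    def pop_group(heap, advance):
        """Pop every entry sharing the current top key; return index pairs."""
        key = heap[0][0]
        group = []
        while heap and heap[0][0] == key:
            _, i, j = heapq.heappop(heap)
            group.append((i, j))
            nxt = advance(i, j)
            if nxt is not None:
                heapq.heappush(heap, nxt)
        return group

    # Two-pointer walk: left sums ascend, right sums descend.
    while lo and hi:
        total = lo[0][0] - hi[0][0]
        if total < target:
            pop_group(lo, next_lo)
        elif total > target:
            pop_group(hi, next_hi)
        else:
            left = pop_group(lo, next_lo)
            right = pop_group(hi, next_hi)
            for i, j in left:
                for k, l in right:
                    yield (A[i], B[j], C[k], D[l])


A, B, C, D = [1, 5, 3], [2, 2, 7], [4, 0], [6, 1, 1]
solutions = sorted(four_sum_heaps(A, B, C, D, 10))
```

The pair sums are never materialized in full: each heap entry is replaced by its successor when it is popped, which is what keeps the heap space linear in the input size rather than quadratic.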

Main result
• Theorem (vanilla version): We can compute a consistent, 2-independent sample of the 4-subsets of a set of size b in expected time O(b^2 log log b + p b^4) and space O(b), for sampling probability p.

• In paper:
  • Generalized to k-subsets.
  • Time/space trade-off.
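One way to read the p b^4 term in the time bound is as the expected size of the output. The short calculation below uses only the sampling probability p and the set size b from the theorem; the interpretation of the term as output cost is an assumption of this note, not a statement from the slides.

```latex
% Each of the \binom{b}{4} subsets is sampled with probability p, so by
% linearity of expectation the expected number of sampled 4-subsets is
\[
\mathbb{E}[\#\text{sampled}] = p \binom{b}{4}
  = p \cdot \frac{b(b-1)(b-2)(b-3)}{24}
  = \Theta(p\, b^{4}) \quad \text{for } b \ge 4 .
\]
```

Any algorithm that lists the sample must spend at least this much time, so the remaining O(b^2 log log b) term is the overhead on top of writing the output.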

Thank you!
