Consistent subset sampling
Konstantin Kutzkov and Rasmus Pagh
Work supported by: [sponsor logos]
Agenda
• Consistent sampling
• Applications of consistent sampling of subsets
• New sampling algorithm
Sampling
• Fundamental tool in (streaming) algorithms, e.g., to estimate the number of distinct items or the Jaccard similarity of sets.
Consistent sampling
• Setting: Data stream of items x_1, x_2, … ∈ U.
• Problem: Create a substream with only the items in a random sample U_2 ⊆ U.
• Idea: Use a hash function h to determine whether to sample:
  • Include x_i in the sample iff h(x_i) = 0.
  • U_2 = h^-1(0).
• Property: If the hash function is 2-independent, the fraction sampled is concentrated around its expectation.
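The idea above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the names `make_hash` and `consistent_sample` are hypothetical, items are assumed to be non-negative integers below the prime `P`, and the classic `((a*x + b) mod P) mod r` construction is only approximately 2-independent once reduced mod r.

```python
import random

P = (1 << 61) - 1  # Mersenne prime; item ids are assumed to lie in [0, P)

def make_hash(r, seed):
    """Hypothetical helper: draw h(x) = ((a*x + b) mod P) mod r at random."""
    rng = random.Random(seed)
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % r

def consistent_sample(stream, h):
    """Keep exactly the stream items x with h(x) == 0."""
    for x in stream:
        if h(x) == 0:
            yield x

# The same h gives the same decision for an item in every stream.
h = make_hash(r=4, seed=42)
stream_a = [5, 9, 12, 9, 31]
stream_b = [31, 12, 77, 5]
sample_a = set(consistent_sample(stream_a, h))
sample_b = set(consistent_sample(stream_b, h))
```

Because the decision depends only on the item, `sample_a` and `sample_b` agree on every item the two streams share, which is what makes the sample "consistent".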
Consistent subset sampling
• Setting: Data stream of sets S_1, S_2, … ⊆ U.
• Problem: For each set S_i, generate every K ⊆ S_i with |K| = k and h(K) = 0.
• Hash function h should be 2-independent with Pr[h(S) = 0] = 1/r.
• Goal: Time and space efficiency.
• In talk: k = 4.
Motivating example
• Set of shopping baskets (sets of products).
• Generate a sample of 4-sets of products bought together to, for example, estimate:
  • The number of frequent 4-sets.
  • Similarity in buying habits of two customers.
• Use to fine-tune parameters of frequent itemset mining algorithms (Apriori, FP-growth) in order to avoid the generation of a huge number of itemsets that are considered frequent.
Naïve solutions
1. Explicitly generate all 4-subsets and apply a 2-independent hash function to each of them.
  • Slow: Time O(b^4) for a set of size b.
2. Sample items S ⊆ S_i independently w.p. r^(-1/4).
  • Not 2-independent: Positive correlation among overlapping 4-subsets.
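A short calculation shows why the second naïve solution is not 2-independent. The symbols q, K, K' and j are introduced here for illustration only: q = r^{-1/4} is the per-item sampling probability, and K, K' are two distinct 4-subsets sharing j items.

```latex
\Pr[K \text{ sampled}] = q^4 = \frac{1}{r},
\qquad
\Pr[K \text{ and } K' \text{ sampled}] = q^{|K \cup K'|} = q^{8-j} = r^{-(8-j)/4}.

% Pairwise independence would require r^{-2}; for j >= 1 and r > 1 the joint
% probability is strictly larger:
r^{-(8-j)/4} > r^{-2} \quad \text{for } 1 \le j \le 3.

% Example: r = 16, q = 1/2, j = 3 gives 2^{-5} = 1/32, compared with 1/256 under independence.
```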
Decomposable hashing
We choose h to be “decomposable”:
h({i_1, i_2, i_3, i_4}) = (g(i_1) + g(i_2) + g(i_3) + g(i_4)) mod r
where g: U → [r] is an 8-independent hash function.
• h is 2-independent.
• Our main reduction: Finding 4-subsets with hash value zero is equivalent to solving 4SUM on {g(x) | x ∈ S}.
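To make the definition concrete, here is a minimal Python sketch of a decomposable hash together with the O(b^4) brute-force sampler it is meant to replace. Everything named here (`make_g`, `h`, `brute_force_sample`, the prime `P`) is an assumption for illustration: g is drawn as a random degree-7 polynomial over a prime field, a standard way to obtain 8-independence, and the final reduction mod r makes it only approximately uniform.

```python
import random
from itertools import combinations

P = (1 << 61) - 1  # prime modulus; items are assumed to be integers in [0, P)

def make_g(r, seed=0, independence=8):
    """Hypothetical g: random polynomial of degree independence-1 over Z_P, mod r."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(P) for _ in range(independence)]
    def g(x):
        acc = 0
        for c in coeffs:              # Horner evaluation of the polynomial
            acc = (acc * x + c) % P
        return acc % r
    return g

def h(subset, g, r):
    """Decomposable hash: sum of the item hashes modulo r."""
    return sum(g(x) for x in subset) % r

def brute_force_sample(items, g, r, k=4):
    """Naive baseline: test every k-subset, O(b^k) time for b items."""
    return [K for K in combinations(sorted(items), k) if h(K, g, r) == 0]

g = make_g(r=5, seed=3)
sample = brute_force_sample(range(12), g, r=5)
```

Since h depends on K only through the sum of g over its items, asking for h(K) = 0 is the same as asking for four values of g that sum to a multiple of r, which is the 4SUM view stated above.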
Schroeppel/Shamir technique
[Figure: four lists of sorted values feeding a min-heap and a max-heap of pair sums]
• In paper: How to handle modular arithmetic.
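The heap-based search can be sketched as follows. This is an illustrative variant under stated assumptions, not the algorithm from the paper: it works on index pairs of a single sorted list rather than four separate lists, uses Python's binary heaps (so roughly O(b^2 log b) time rather than the bound in the talk), buffers groups of tied pair sums (which can exceed O(b) space), and handles the modulus by the simple observation that four values in [0, r) sum to a multiple of r only if the sum is 0, r, 2r or 3r. All function names are hypothetical.

```python
import heapq

def pair_sums(vals, descending=False):
    """Yield (sum, i, j) with i < j in sorted order of sum.
    vals must be sorted ascending; the heap holds at most one entry per index."""
    b = len(vals)
    if descending:
        heap = [(-(vals[j - 1] + vals[j]), j - 1, j) for j in range(1, b)]
    else:
        heap = [(vals[i] + vals[i + 1], i, i + 1) for i in range(b - 1)]
    heapq.heapify(heap)
    while heap:
        key, i, j = heapq.heappop(heap)
        yield (-key if descending else key), i, j
        if descending and i > 0:            # next smaller partner for j
            heapq.heappush(heap, (-(vals[i - 1] + vals[j]), i - 1, j))
        elif not descending and j + 1 < b:  # next larger partner for i
            heapq.heappush(heap, (vals[i] + vals[j + 1], i, j + 1))

def four_subsets_with_sum(vals, target):
    """Index 4-subsets of sorted vals whose values sum exactly to target."""
    lo, hi = pair_sums(vals), pair_sums(vals, descending=True)
    low, high = next(lo, None), next(hi, None)
    out = set()
    while low is not None and high is not None:
        s = low[0] + high[0]
        if s < target:
            low = next(lo, None)
        elif s > target:
            high = next(hi, None)
        else:
            lsum, hsum, lows, highs = low[0], high[0], [], []
            while low is not None and low[0] == lsum:    # all ties on the low side
                lows.append(low[1:])
                low = next(lo, None)
            while high is not None and high[0] == hsum:  # all ties on the high side
                highs.append(high[1:])
                high = next(hi, None)
            for p in lows:
                for q in highs:
                    idx = set(p) | set(q)
                    if len(idx) == 4:                    # pairs must be disjoint
                        out.add(tuple(sorted(idx)))
    return out

def sample_4_subsets(items, g, r):
    """All 4-subsets K of distinct items with sum of g over K equal to 0 mod r."""
    order = sorted(items, key=g)
    vals = [g(x) for x in order]
    result = set()
    for t in range(4):  # values lie in [0, r), so the sum is 0, r, 2r or 3r
        for idx in four_subsets_with_sum(vals, t * r):
            result.add(frozenset(order[i] for i in idx))
    return result
```

The min-heap side produces pair sums in increasing order and the max-heap side in decreasing order, so the two can be advanced like the two pointers of a sorted 2SUM scan without ever materialising all b^2 pair sums.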
Main result
• Theorem (vanilla version): We can compute a consistent 2-independent sample of 4-subsets of a set of size b in expected time O(b^2 log log b + p b^4) and space O(b) for sampling probability p.
• In paper:
  • Generalized to k-subsets
  • Time/space trade-off
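To get a feel for the bound, the following sketch plugs numbers into the expressions from the theorem. It is only indicative: constants hidden by the O-notation are ignored, the logarithms are taken base 2 as an assumption, and the function names are hypothetical.

```python
import math

def expected_sample_size(b, p):
    """Expected number of sampled 4-subsets: p times the number of 4-subsets."""
    return p * math.comb(b, 4)

def rough_costs(b, p):
    """Leading terms only, constants ignored; requires b >= 3 for the log log term."""
    brute_force = b ** 4
    theorem = b ** 2 * math.log2(math.log2(b)) + p * b ** 4
    return brute_force, theorem

size = expected_sample_size(100, 1e-4)   # about 392 of the 3,921,225 possible 4-subsets
brute, bound = rough_costs(100, 1e-4)
```

For small p the p·b^4 output term is minor and the b^2 log log b term dominates, so the gain over the O(b^4) enumeration grows roughly like b^2.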
Thank you!