Sampling from Databases
CompSci 590.02
Instructor: Ashwin Machanavajjhala
Lecture 2 : 590.02 Spring 13

Recap
• Given a set of elements, random sampling is easy when the number of elements N is known and you have random access to any arbitrary element
  – Pick n indexes at random from 1 … N
  – Read the corresponding n elements
• Reservoir Sampling: use when N is unknown, or when you are only allowed sequential access to the data
  – Read elements one at a time. Include the t-th element in a reservoir of size n with probability n/t, replacing a uniformly chosen reservoir element once the reservoir is full.
  – In expectation, only about n(1 + ln(N/n)) elements ever enter the reservoir, so an implementation that skips ahead needs to process only about that many elements to get a sample of size n
  – Optimal for any reservoir-based algorithm
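The reservoir step above can be sketched in a few lines of Python. This is a minimal illustration of the basic one-pass variant (it reads every element and does not skip ahead); the function name and the `rng` parameter are illustrative choices, not part of the lecture.

```python
import random

def reservoir_sample(stream, n, rng=random):
    """Return a uniform sample of up to n items from an iterable of unknown length."""
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= n:
            # The first n elements fill the reservoir.
            reservoir.append(item)
        elif rng.random() < n / t:
            # The t-th element enters with probability n/t and evicts
            # a uniformly chosen current member.
            reservoir[rng.randrange(n)] = item
    return reservoir

sample = reservoir_sample(range(1000), 10, random.Random(0))
print(len(sample))  # 10
```

If the stream has fewer than n elements, the sketch simply returns all of them.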

Today’s Class
• In general, sampling from a database where elements are only accessed using indexes.
  – B+ Trees
  – Nearest neighbor indexes
• Estimating the number of restaurants in Google Places.

B+ Tree
• Data values only appear in the leaves
• Internal nodes only contain keys
• Each node has between f_max/2 and f_max children
  – f_max = maximum fan-out of the tree
• Root has 2 or more children

Problem
• How to pick an element uniformly at random from the B+ Tree?

Attempt 1: Random Path
Choose a random path
• Start from the root
• At each internal node, choose a child uniformly at random
• Uniformly sample from the resulting leaf node
• Will this result in a uniform random sample? NO. Elements reachable from internal nodes with low fan-out are more likely.
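The bias can be checked exactly on a toy tree. The sketch below is only an illustration: the nested-list encoding (a leaf is a list of records, an internal node is a list of children) and the example tree are assumptions made for this example, and real B+ tree nodes would also carry keys.

```python
from fractions import Fraction

def is_leaf(node):
    # Toy encoding (assumption): leaves hold records, internal nodes hold lists.
    return not isinstance(node[0], list)

def random_path_probabilities(node, prob=Fraction(1)):
    """Exact probability that the naive random-path walk returns each record."""
    share = prob / len(node)  # each child (or record) is chosen uniformly
    if is_leaf(node):
        return {record: share for record in node}
    result = {}
    for child in node:
        result.update(random_path_probabilities(child, share))
    return result

tree = [
    [["a", "b"], ["c", "d"]],              # internal node with fan-out 2
    [["e", "f"], ["g", "h"], ["i", "j"]],  # internal node with fan-out 3
]
probs = random_path_probabilities(tree)
print(probs["a"], probs["e"])  # 1/8 1/12
```

Records under the fan-out-2 node are returned with probability 1/8, those under the fan-out-3 node with probability 1/12, so the walk is not uniform.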

Attempt 2: Random Path with Rejection
• Attempt 1 will work if all internal nodes have the same fan-out
• Choose a random path
  – Start from the root
  – Choose a child uniformly at random
  – Uniformly sample from the resulting leaf node
• Accept the sample with probability: [formula not preserved in this transcript]
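The acceptance probability is not recoverable from the slide text. The sketch below assumes a common choice for this kind of acceptance/rejection scheme: the product of f_i / f_max over every node on the path, leaf included, where f_i is the node's fan-out (or record count). That formula, the nested-list tree encoding, and all names are assumptions made for illustration.

```python
import random

def is_leaf(node):
    # Toy encoding (assumption): leaves hold records, internal nodes hold lists.
    return not isinstance(node[0], list)

def try_sample(root, f_max, rng=random):
    """One random root-to-leaf walk; returns a record, or None if rejected."""
    node, accept = root, 1.0
    while True:
        accept *= len(node) / f_max  # assumed factor f_i / f_max per level
        pick = node[rng.randrange(len(node))]
        if is_leaf(node):
            return pick if rng.random() < accept else None
        node = pick

def sample_uniform(root, f_max, rng=random):
    """Repeat walks until one is accepted."""
    while True:
        record = try_sample(root, f_max, rng)
        if record is not None:
            return record

TREE = [
    [["a", "b"], ["c", "d"]],
    [["e", "f"], ["g", "h"], ["i", "j"]],
]
print(sample_uniform(TREE, f_max=3, rng=random.Random(1)))
```

Under this assumption every record in the toy tree is returned by a single walk with probability 1/27, at the cost of rejecting most walks.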

Attempt 2: Correctness
• Any root-to-leaf path is picked with probability: [formula not preserved in this transcript]
• The probability of including a record given the path: [formula not preserved in this transcript]
• The probability of including a record: [formula not preserved in this transcript]
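The three formulas did not survive extraction. A reconstruction consistent with the rest of the deck is sketched below. It rests on assumptions: f_1, …, f_h denote the fan-outs of the nodes on the path (with f_h the number of records in the leaf), the first line folds the uniform choice of a record within the leaf into the path probability, and acceptance uses one factor f_i / f_max per node.

```latex
% Assumed notation: f_1, ..., f_h are the fan-outs along the path,
% f_h is the number of records in the leaf, f_max is the maximum fan-out.
\begin{align*}
\Pr[\text{path and record chosen}] &= \prod_{i=1}^{h} \frac{1}{f_i} \\
\Pr[\text{accept} \mid \text{path}] &= \prod_{i=1}^{h} \frac{f_i}{f_{\max}} \\
\Pr[\text{record returned}] &= \prod_{i=1}^{h} \frac{1}{f_i} \cdot \prod_{i=1}^{h} \frac{f_i}{f_{\max}}
  = \frac{1}{f_{\max}^{\,h}}
\end{align*}
```

The last line does not depend on the path, which is what makes the accepted sample uniform.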

Attempt 3: Early Abort
Idea: Perform the acceptance/rejection test at each node.
• Start from the root
• Choose a child uniformly at random
• Continue the traversal with probability: [formula not preserved in this transcript]
• At the leaf, pick an element uniformly at random, and accept it with probability: [formula not preserved in this transcript]
Proof of correctness: same as for the previous algorithm
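A minimal sketch of the early-abort idea follows. The per-node probabilities are missing from the slide text, so the code assumes the walk continues past a node with probability f_i / f_max, where f_i is that node's fan-out (or record count at the leaf). The tree encoding and names are also assumptions for illustration.

```python
import random

def is_leaf(node):
    # Toy encoding (assumption): leaves hold records, internal nodes hold lists.
    return not isinstance(node[0], list)

def try_sample_early_abort(root, f_max, rng=random):
    """One walk that may abort at any node; returns a record or None."""
    node = root
    while True:
        # Assumed test: continue with probability f_i / f_max, else abort now.
        if rng.random() >= len(node) / f_max:
            return None
        pick = node[rng.randrange(len(node))]
        if is_leaf(node):
            return pick
        node = pick

TREE = [
    [["a", "b"], ["c", "d"]],
    [["e", "f"], ["g", "h"], ["i", "j"]],
]
rng = random.Random(2)
walks = [try_sample_early_abort(TREE, 3, rng) for _ in range(10)]
print(walks)
```

Compared with testing once at the leaf, a rejected walk stops as soon as it fails, so fewer nodes are read per rejected attempt.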

Attempt 4: Batch Sampling
• Sampling n elements one at a time requires accessing the internal nodes many times.
Perform random walks simultaneously:
• At the root node, assign each of the n samples to one of its children uniformly at random
  – n → (n_1, n_2, …, n_k)
• At each internal node,
  – Divide incoming samples uniformly across children.
• Each leaf node receives s samples. Include each sample with acceptance probability: [formula not preserved in this transcript]
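The simultaneous walks can be sketched recursively so that each node is visited once per batch. As before, the acceptance probability is an assumption (the product of f_i / f_max along the path, leaf included), as are the toy tree encoding and the names.

```python
import random
from collections import Counter

def is_leaf(node):
    # Toy encoding (assumption): leaves hold records, internal nodes hold lists.
    return not isinstance(node[0], list)

def batch_sample(node, n, f_max, accept=1.0, rng=random):
    """Push n walks through the subtree at once; returns the accepted records."""
    accept *= len(node) / f_max  # assumed factor f_i / f_max for this node
    if is_leaf(node):
        # Each arriving walk is kept with the accumulated acceptance probability
        # and, if kept, picks a record uniformly from the leaf.
        return [rng.choice(node) for _ in range(n) if rng.random() < accept]
    # Assign every incoming walk to a child uniformly at random.
    counts = Counter(rng.randrange(len(node)) for _ in range(n))
    accepted = []
    for child_index, n_child in counts.items():
        accepted.extend(batch_sample(node[child_index], n_child, f_max, accept, rng))
    return accepted

TREE = [
    [["a", "b"], ["c", "d"]],
    [["e", "f"], ["g", "h"], ["i", "j"]],
]
out = batch_sample(TREE, 100, f_max=3, rng=random.Random(3))
print(len(out))  # typically well below 100 because of rejection
```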

Attempt 4: Batch Sampling
• Problem: If we start the algorithm with n, we might end up with fewer than n samples (due to rejection)
• Solution: Start with a larger set
• n' = n / β^(h-1), where β is the ratio of the average fan-out to f_max
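As a quick worked example of the inflation formula, the helper below computes n' and rounds up. It treats h as the number of levels in the tree, which is an assumption since the slide does not define h; the function name and the example numbers are illustrative.

```python
import math

def oversample_size(n, avg_fanout, f_max, h):
    """n' = n / beta^(h-1), rounded up, with beta = avg_fanout / f_max."""
    beta = avg_fanout / f_max
    return math.ceil(n / beta ** (h - 1))

# Example: target 100 samples, average fan-out 75, f_max 100, h = 3.
print(oversample_size(100, 75, 100, 3))  # 178
```

With β = 0.75 and h = 3, roughly 44% of the walks are expected to be rejected, so about 178 walks are started to end up with 100 samples on average.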

Summary of B+ tree sampling
• Randomly choosing a path weights elements differently
  – Elements in the subtree rooted at nodes with lower fan-out are more likely to be picked than those under higher fan-out internal nodes
• Accept/Reject sampling helps remove this bias.

Nearest Neighbor indexes

Problem Statement
Input:
• A database D that can’t be accessed directly, and where each element is associated with a geo location.
• A nearest neighbor index (elements in D near <x, y>)
  – Assumption: index returns k elements closest to the point <x, y>
Output:
• Estimate [quantity not preserved in this transcript]
Applications:
• Estimate the size of a population in a region
• Estimate the size of a competing business’ database
• Estimate the prevalence of a disease in a region

Attempt 1: Naïve geo sampling
For i = 1 to N
• Pick a random point p_i = <x, y>
• Find the element d_i in D that is closest to p_i
• Return [expression not preserved in this transcript]

Problem?
Voronoi Cell: Points for which d_4 is the closest element
Elements d_7 and d_8 are much more likely to be picked than d_1

Voronoi Decomposition
Perpendicular bisector of d_4, d_3

Voronoi decomposition of Restaurants in US

Attempt 2: Weighted sampling
For i = 1 to N
• Pick a random point p_i = <x, y>
• Find the element d_i in D that is closest to p_i
• Return [expression not preserved in this transcript]
Problem: We need to compute the area of the Voronoi cell. We do not have access to other elements in the database.
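The returned expression is missing from the slide text. One natural reading, stated here as an assumption, is an inverse-area weighting: a uniformly random point lands in the Voronoi cell of d with probability area(cell of d) / area(region), so averaging area(region) / area(cell of d_i) over the queries gives an unbiased estimate of |D|. The sketch below illustrates this on a one-dimensional toy where cells are intervals of [0, 1]; it computes cell lengths from the full point set, which the real setting does not allow.

```python
import bisect
import random

def voronoi_cells_1d(points):
    """For sorted points in [0, 1], return (cell lengths, interior boundaries)."""
    mids = [(a + b) / 2 for a, b in zip(points, points[1:])]
    bounds = [0.0] + mids + [1.0]
    lengths = [hi - lo for lo, hi in zip(bounds, bounds[1:])]
    return lengths, mids

def estimate_count(points, num_queries, rng=random):
    """Estimate len(points) using only 'nearest element' lookups plus cell sizes."""
    lengths, mids = voronoi_cells_1d(points)
    total = 0.0
    for _ in range(num_queries):
        p = rng.random()             # random query point in the region [0, 1]
        i = bisect.bisect(mids, p)   # index of the element nearest to p
        total += 1.0 / lengths[i]    # weight by inverse cell size (region size is 1)
    return total / num_queries

points = [0.05, 0.1, 0.15, 0.2, 0.9]  # clustered on purpose
print(estimate_count(points, 20000, random.Random(7)))  # close to 5
```

Without the weights, the isolated element at 0.9 would be hit about 45% of the time and each clustered element far less often; the weights cancel that imbalance.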

Using index to estimate Voronoi cell
• Find nearest point
• Compute perpendicular bisector
• a_0 is a point on the Voronoi cell.
(Figure labels: d, e_0, a_0)

Using index to estimate Voronoi cell
• Find a point on (a_0, b_0) which is just inside the Voronoi cell.
  – Use binary search
  – Recursively check whether the midpoint is in the Voronoi cell
(Figure labels: d, e_0, a_0, a_1, b_0)
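The binary search can be sketched with a nearest-neighbor oracle as the only access to the data. In the example below, `nearest` is a hypothetical stand-in for the index (a brute-force scan over a small point list), and `boundary_point`, its parameters, and the sample coordinates are illustrative assumptions.

```python
import math

def nearest(points, p):
    # Stand-in for the nearest neighbor index (assumption): brute-force scan.
    return min(points, key=lambda q: math.dist(p, q))

def boundary_point(points, d, inside, outside, eps=1e-6):
    """Binary search on the segment inside -> outside for the last point
    whose nearest element is still d. Returns (point, number of index calls)."""
    calls = 0
    while math.dist(inside, outside) > eps:
        mid = ((inside[0] + outside[0]) / 2, (inside[1] + outside[1]) / 2)
        calls += 1
        if nearest(points, mid) == d:
            inside = mid    # midpoint is still in d's Voronoi cell
        else:
            outside = mid   # midpoint belongs to another element's cell
    return inside, calls

pts = [(0, 0), (2, 0), (0, 2)]
edge, calls = boundary_point(pts, (0, 0), (0.0, 0.0), (2.0, 0.0))
print(edge, calls)  # a point within eps of (1.0, 0.0), found in about 21 calls
```

Each iteration halves the segment, so a segment of length L is resolved to within ε after roughly log2(L/ε) index calls.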

Using index to estimate Voronoi cell
• Find nearest points to a_1
  – a_1 has to be equidistant to one point other than e_0 and d
• Next direction is perpendicular to (e_1, d)
• Find next point …
• … and so on …
(Figure labels: d, e_0 … e_4, a_0 … a_4, b_0 … b_2)

Number of samples
• Identifying each a_i requires a binary search
  – If L is the max length of (a_i, b_i), then a_(i+1) can be computed with ε error in O(log(L/ε)) calls to the index
• Identifying the next direction requires another call to the index
• If the number of edges of the Voronoi cell = k, the total number of calls to the index = O(k log(L/ε))
• Average number of edges of a Voronoi cell < 6
  – Assuming general position …

Summary
• Many web services allow access to databases using nearest neighbor indexes.
• Showed a method to sample uniformly from such databases.
• Next class: Monte Carlo Estimation for #P-hard problems.

References
• F. Olken, “Random Sampling from Databases”, PhD Thesis, UC Berkeley, 1993
• N. Dalvi, R. Kumar, A. Machanavajjhala, V. Rastogi, “Sampling Hidden Objects using Nearest Neighbor Oracles”, KDD 2011
