Distinct Value Estimators for Zipfian Distributions
Sergei Vassilvitskii, Rajeev Motwani
Stanford University
Problem Statement
Given a large multiset X with n elements, count the number of distinct elements in X.
X = {a, b, c, a, a, c, b, a} ⇒ Distinct(X) = 3
Alternatively, given samples from a distribution P, estimate the 0-th frequency moment.
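A minimal sketch of the exact version of the problem: with the full multiset in hand, Distinct(X) is just the size of the set of its elements. The estimation problem arises when we only get to see a sample of X.

```python
# Exact distinct count over the full multiset from the slide.
X = ["a", "b", "c", "a", "a", "c", "b", "a"]
print(len(set(X)))  # -> 3
```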
Why Do We Care?
Good planning for SQL queries. Consider:
select * from R, S where R.A = S.B and f(S.C) > k
where f is expensive to compute.
- If S.C has few distinct elements, compute f first, cache the results, then join.
- If S.C has many distinct elements, compute the join first, then check the condition f.
Orders of magnitude improvements.
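A hedged sketch of the "few distinct values" plan: evaluate the expensive predicate f once per distinct value of S.C, cache the results, then filter cheaply during the join. The relation layout (lists of dicts) and the helper name are illustrative, not from the talk.

```python
def plan_few_distinct(R, S, f, k):
    """Join plan that pays for f only once per distinct S.C value."""
    cache = {c: f(c) for c in {row["C"] for row in S}}  # one f call per distinct S.C
    s_pass = [s for s in S if cache[s["C"]] > k]        # cheap cache lookup per row
    return [(r, s) for r in R for s in s_pass if r["A"] == s["B"]]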
Classical Problem
Different approaches:
- Streaming input: minimize space used.
- Sample from input: guarantees on approximations?
Given a sample of size r from X, find an approximation D̂ to Distinct(X).
Previous Work
Good-Turing Estimator: "The Population Frequencies of Species and the Estimation of Population Parameters," 1953.
Other heuristic estimators:
- Smoothed Jackknife Estimator (Haas et al.)
- Adaptive Estimator (Charikar et al.)
- Many others.
Previous Work - Theory
Given r samples from a set of size n:
Guaranteed Error Estimator (GEE) [CCMN]
Approximation ratio: O(√(n/r))
Lower bound: there exist inputs such that, with constant probability, any estimator has approximation ratio at least √((n − r)/(2r)).
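For concreteness, a sketch of GEE assuming its standard form from [CCMN]: scale the count of values seen exactly once in the sample by √(n/r), and count every value seen two or more times once.

```python
from collections import Counter
from math import sqrt

def gee(sample, n):
    """GEE sketch: f_j = # values seen exactly j times in the sample;
    the estimate is sqrt(n/r) * f_1 + sum_{j>=2} f_j."""
    r = len(sample)
    f = Counter(Counter(sample).values())  # f[j] = # values appearing exactly j times
    return sqrt(n / r) * f.get(1, 0) + sum(c for j, c in f.items() if j >= 2)
```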
Lower Bound Detail
Scenario 1: S = {x, x, ..., x}, |S| = n
Scenario 2: S′ = {x, x, ..., x, y_1, ..., y_k}, |S′| = n
With k = ((n − r)/(2r)) ln(1/δ), after r samples one cannot distinguish between the two scenarios with probability at least δ.
So Why Are We Here?
Many large datasets are not worst-case. In fact, many follow Zipfian distributions:
Zipf_θ(i) ∝ 1/i^θ
Examples:
- In/out-degrees of the web graph
- Word frequencies in many languages
- Many, many more.
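A minimal sketch (function and parameter names are mine, not the talk's) of drawing samples from a truncated Zipfian distribution, as used for the synthetic data later:

```python
import numpy as np

def sample_zipf(D, theta, r, seed=0):
    """Draw r samples from Zipf_theta on {1, ..., D}:
    Pr[i] proportional to i^(-theta)."""
    rng = np.random.default_rng(seed)
    w = np.arange(1, D + 1, dtype=float) ** -theta
    return rng.choice(np.arange(1, D + 1), size=r, p=w / w.sum())
```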
Problem Definition
Suppose X ∼ Zipf_θ on D elements.
θ is known, D is unknown.
Estimate D by sampling from X.
Two kinds of results:
- Adaptive sampling: sample from X until a stopping condition is met.
- Best-you-can estimation: given a sample from X, return the best estimate of D.
Results
Let p* be the probability of the least likely element.
Adaptive sampling will return D after at most O((log D)/p*) samples with constant probability.
Given r = (1 + 2ε)^(1+θ)/p* samples, can return a (1 + ε) estimate to D with probability at least 1 − exp(−Ω(Dε²)).
Outline
Introduction
Techniques
Experimental Results
Conclusion
Approximation Techniques
For a sample of size r, let f_r be the number of distinct values in the sample.
Suppose D and θ are known; then we can compute E_{D,θ}[f_r], the expected number of distinct values in the sample.
If f*_r is the number of distinct values observed, the estimator returns D̂ such that E_{D̂,θ}[f_r] = f*_r (see the sketch below).
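A sketch of this inversion. The expectation formula follows from linearity: each element i survives in the sample with probability 1 − (1 − p_i)^r. The talk does not specify a search procedure, so binary search over D is my assumption (it relies on the expectation being increasing in D, which holds empirically); the function names are hypothetical.

```python
import numpy as np

def expected_distinct(D, theta, r):
    """E_{D,theta}[f_r] = sum_i (1 - (1 - p_i)^r), p_i = i^-theta / H_{D,theta}."""
    w = np.arange(1, D + 1, dtype=float) ** -theta
    p = w / w.sum()
    return float(np.sum(1.0 - (1.0 - p) ** r))

def invert_expectation(f_star, theta, r, D_max=10**6):
    """Return D_hat with E_{D_hat,theta}[f_r] ~= f_star, by binary search."""
    lo, hi = 1, D_max
    while lo < hi:
        mid = (lo + hi) // 2
        if expected_distinct(mid, theta, r) < f_star:
            lo = mid + 1
        else:
            hi = mid
    return lo
```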
Analysis
Lemma (tight distribution of f_r): For large enough r,
Pr[|E[f_r] − f_r| ≥ εE[f_r]] ≤ exp(−ε²Ω(D))
Proof: parallels the sharp-threshold coupon-collector arguments for uniform distributions.
Analysis (2)
Lemma (MLE preserves approximation):
Given f*_r observed distinct elements with f*_r ≤ (1 + ε)E_{D,θ}[f_r],
let D̂ be such that E_{D̂,θ}[f_r] = f*_r, and let r ≥ 1/p*.
Then: (1 − 2ε)D̂ ≤ D ≤ (1 + 2ε)D̂.
Outline
Introduction
Techniques
Experimental Results
Conclusion
The Competition
- Zipfian Estimator (ZE): performance guarantees only for Zipfian distributions.
- Guaranteed Error Estimator (GEE): O(√(n/r)) error guarantee (works for all distributions).
- Analytic Estimator (AE): best-performing heuristic; no theoretical guarantees.
Datasets
Synthetic data:
- Vary the number of distinct elements: D ∈ {10k, 50k, 100k}
- Vary the database size: n ∈ {100k, 500k, 1000k}
- Vary the skew of the distribution: θ ∈ {0, 0.5, 1}
Real datasets:
- "Router" dataset: packet trace from the Internet Traffic Archive. n ≈ 4M, D ≈ 250k, θ ≈ 1.6
Estimating θ
Recall: Zipf_θ(i) ∝ 1/i^θ
Let f_i be the frequency of the i-th element.
E[f_i] = cr·i^(−θ) ⇒ log E[f_i] = log(cr) − θ log i
Estimate θ by linear regression on the log f_i vs. log i plot.
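A sketch of that regression. Restricting the fit to the highest-frequency ranks (where the counts are least noisy) is my choice; the talk does not specify a cutoff.

```python
import numpy as np
from collections import Counter

def estimate_theta(sample, top=100):
    """Fit log f_i = log(cr) - theta * log(i) by least squares over the
    top-ranked frequencies; the negated slope is the estimate of theta."""
    freqs = sorted(Counter(sample).values(), reverse=True)
    k = min(top, len(freqs))
    ranks = np.arange(1, k + 1, dtype=float)
    slope, _intercept = np.polyfit(np.log(ranks),
                                   np.log(np.array(freqs[:k], dtype=float)), 1)
    return -slope
```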
Experimental Results
[Plot: ratio error vs. number of samples (×1000) for ZE, AE, and GEE; θ = 0.5, D = 50000, n = 1M.]
Experimental Results (2)
[Plot: ratio error vs. % of DB sampled for ZE, AE, and GEE on the Router dataset.]
Outline
Introduction
Techniques
Experimental Results
Conclusion
Conclusion
Can have error guarantees if the family of distributions is known ahead of time.
How does the approximation of θ affect the error guarantees?
Subtle problem: disk reads occur in blocks, so the time to sample 10% is equivalent to reading the whole DB.
Thank You