Distinct Value Estimators for Zipfian Distributions
Sergei Vassilvitskii, Rajeev Motwani
Stanford University
Problem Statement
Given a large multiset X with n elements, count the number of distinct elements in X.
X = {a, b, c, a, a, c, b, a} ⇒ Distinct(X) = 3
Alternatively, given samples from a distribution P, estimate the 0-th frequency moment.
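A minimal sketch of the exact version of the problem: with the full multiset in hand, Distinct(X) is just the size of the set of its elements. The estimation problem arises when we only get to see a sample of X.

```python
# Exact distinct count over the full multiset from the slide.
X = ["a", "b", "c", "a", "a", "c", "b", "a"]
print(len(set(X)))  # -> 3
```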
Why Do We Care?
Good planning for SQL queries. Consider:
select * from R, S where R.A = S.B and f(S.C) > k
where f is expensive to compute.
- If S.C has few distinct elements, compute f first, cache the results, then join.
- If S.C has many distinct elements, compute the join first, then check the condition f.
Orders of magnitude improvements.
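A hedged sketch of the "few distinct values" plan: evaluate the expensive predicate f once per distinct value of S.C, cache the results, then filter cheaply during the join. The relation layout (lists of dicts) and the helper name are illustrative, not from the talk.

```python
def plan_few_distinct(R, S, f, k):
    """Join plan that pays for f only once per distinct S.C value."""
    cache = {c: f(c) for c in {row["C"] for row in S}}  # one f call per distinct S.C
    s_pass = [s for s in S if cache[s["C"]] > k]        # cheap cache lookup per row
    return [(r, s) for r in R for s in s_pass if r["A"] == s["B"]]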
Classical Problem
Different approaches:
- Streaming input: minimize space used.
- Sample from input: guarantees on approximations?
Given a sample of size r from X, find an approximation D̂ to Distinct(X).
Previous Work
Good-Turing Estimator: "The Population Frequencies of Species and the Estimation of Population Parameters," 1953.
Other heuristic estimators:
- Smoothed Jackknife Estimator (Haas et al.)
- Adaptive Estimator (Charikar et al.)
- Many others.
Previous Work - Theory
Given r samples from a set of size n:
Guaranteed Error Estimator (GEE) [CCMN]
Approximation ratio: O(√(n/r))
Lower bound: there exist inputs such that, with constant probability, any estimator has approximation ratio at least √((n − r)/(2r)).
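For concreteness, a sketch of GEE assuming its standard form from [CCMN]: scale the count of values seen exactly once in the sample by √(n/r), and count every value seen two or more times once.

```python
from collections import Counter
from math import sqrt

def gee(sample, n):
    """GEE sketch: f_j = # values seen exactly j times in the sample;
    the estimate is sqrt(n/r) * f_1 + sum_{j>=2} f_j."""
    r = len(sample)
    f = Counter(Counter(sample).values())  # f[j] = # values appearing exactly j times
    return sqrt(n / r) * f.get(1, 0) + sum(c for j, c in f.items() if j >= 2)
```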
Lower Bound Detail
Scenario 1: S = {x, x, ..., x}, |S| = n
Scenario 2: S′ = {x, x, ..., x, y_1, ..., y_k}, |S′| = n
With k = ((n − r)/(2r)) ln(1/δ), after r samples one cannot distinguish between the two scenarios with probability at least δ.
So Why Are We Here?
Many large datasets are not worst-case. In fact, many follow Zipfian distributions:
Zipf_θ(i) ∝ 1/i^θ
Examples:
- In/out-degrees of the web graph
- Word frequencies in many languages
- Many, many more.
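A minimal sketch (function and parameter names are mine, not the talk's) of drawing samples from a truncated Zipfian distribution, as used for the synthetic data later:

```python
import numpy as np

def sample_zipf(D, theta, r, seed=0):
    """Draw r samples from Zipf_theta on {1, ..., D}:
    Pr[i] proportional to i^(-theta)."""
    rng = np.random.default_rng(seed)
    w = np.arange(1, D + 1, dtype=float) ** -theta
    return rng.choice(np.arange(1, D + 1), size=r, p=w / w.sum())
```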
Problem Definition
Suppose X ∼ Zipf_θ on D elements.
θ is known, D is unknown.
Estimate D by sampling from X.
Two kinds of results:
- Adaptive sampling: sample from X until a stopping condition is met.
- Best-you-can estimation: given a sample from X, return the best estimate of D.
Results
Let p* be the probability of the least likely element.
Adaptive sampling will return D after at most O((log D)/p*) samples with constant probability.
Given r = (1 + 2ε)^(1+θ)/p* samples, can return a (1 + ε) estimate to D with probability at least 1 − exp(−Ω(Dε²)).
Outline
Introduction
Techniques
Experimental Results
Conclusion
Approximation Techniques
For a sample of size r, let f_r be the number of distinct values in the sample.
Suppose D and θ are known; then we can compute E_{D,θ}[f_r], the expected number of distinct values in the sample.
If f*_r is the number of distinct values observed, the estimator returns D̂ such that E_{D̂,θ}[f_r] = f*_r (see the sketch below).
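A sketch of this inversion. The expectation formula follows from linearity: each element i survives in the sample with probability 1 − (1 − p_i)^r. The talk does not specify a search procedure, so binary search over D is my assumption (it relies on the expectation being increasing in D, which holds empirically); the function names are hypothetical.

```python
import numpy as np

def expected_distinct(D, theta, r):
    """E_{D,theta}[f_r] = sum_i (1 - (1 - p_i)^r), p_i = i^-theta / H_{D,theta}."""
    w = np.arange(1, D + 1, dtype=float) ** -theta
    p = w / w.sum()
    return float(np.sum(1.0 - (1.0 - p) ** r))

def invert_expectation(f_star, theta, r, D_max=10**6):
    """Return D_hat with E_{D_hat,theta}[f_r] ~= f_star, by binary search."""
    lo, hi = 1, D_max
    while lo < hi:
        mid = (lo + hi) // 2
        if expected_distinct(mid, theta, r) < f_star:
            lo = mid + 1
        else:
            hi = mid
    return lo
```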
Analysis
Lemma (tight distribution of f_r): For large enough r,
Pr[|E[f_r] − f_r| ≥ εE[f_r]] ≤ exp(−ε²Ω(D))
Proof: parallels the sharp-threshold coupon-collector arguments for uniform distributions.
Analysis (2)
Lemma (MLE preserves approximation):
Given f*_r observed distinct elements with f*_r ≤ (1 + ε)E_{D,θ}[f_r],
let D̂ be such that E_{D̂,θ}[f_r] = f*_r, and let r ≥ 1/p*.
Then: (1 − 2ε)D̂ ≤ D ≤ (1 + 2ε)D̂.
Outline
Introduction
Techniques
Experimental Results
Conclusion
The Competition
- Zipfian Estimator (ZE): performance guarantees only for Zipfian distributions.
- Guaranteed Error Estimator (GEE): O(√(n/r)) error guarantee (works for all distributions).
- Analytic Estimator (AE): best-performing heuristic; no theoretical guarantees.
Datasets
Synthetic data:
- Vary the number of distinct elements: D ∈ {10k, 50k, 100k}
- Vary the database size: n ∈ {100k, 500k, 1000k}
- Vary the skew of the distribution: θ ∈ {0, 0.5, 1}
Real datasets:
- "Router" dataset: packet trace from the Internet Traffic Archive. n ≈ 4M, D ≈ 250k, θ ≈ 1.6
Estimating θ
Recall: Zipf_θ(i) ∝ 1/i^θ
Let f_i be the frequency of the i-th element.
E[f_i] = cr·i^(−θ) ⇒ log E[f_i] = log(cr) − θ log i
Estimate θ by linear regression on the log f_i vs. log i plot.
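A sketch of that regression. Restricting the fit to the highest-frequency ranks (where the counts are least noisy) is my choice; the talk does not specify a cutoff.

```python
import numpy as np
from collections import Counter

def estimate_theta(sample, top=100):
    """Fit log f_i = log(cr) - theta * log(i) by least squares over the
    top-ranked frequencies; the negated slope is the estimate of theta."""
    freqs = sorted(Counter(sample).values(), reverse=True)
    k = min(top, len(freqs))
    ranks = np.arange(1, k + 1, dtype=float)
    slope, _intercept = np.polyfit(np.log(ranks),
                                   np.log(np.array(freqs[:k], dtype=float)), 1)
    return -slope
```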
Experimental Results
[Plot: ratio error vs. number of samples (×1000) for ZE, AE, and GEE; θ = 0.5, D = 50000, n = 1M.]
Experimental Results (2)
[Plot: ratio error vs. % of DB sampled for ZE, AE, and GEE on the Router dataset.]
Outline
Introduction
Techniques
Experimental Results
Conclusion
Conclusion
Can have error guarantees if the family of distributions is known ahead of time.
How does the approximation of θ affect the error guarantees?
Subtle problem: disk reads occur in blocks, so the time to sample 10% is equivalent to reading the whole DB.
Thank You