  1. A STORY OF DISTINCT ELEMENTS
     Ravi Kumar
     Yahoo! Research, Sunnyvale, CA
     ravikumar@yahoo-inc.com

  2. Results about F₀
     (This talk presents joint work with Bar-Yossef, Jayram, Sivakumar, and Trevisan)

  3. Data stream model
     Modeling efficient computation on massive data
     Compute a function of the input X = x₁, …, xₙ, which arrives one element at a time
     Approximate, randomize, and be space-efficient!

  4. Finding distinct elements
     - Given X = x₁, …, xₙ, compute F₀(X), the number of distinct elements in X, in the data stream model; assume xᵢ ∈ [m]
     - (ε, δ)-approximation: output F'₀(X) such that, with probability at least 1 − δ, F'₀(X) = (1 ± ε) F₀(X)
     - F₀ is the zeroth frequency moment
     - Assume log m = O(log n); otherwise hash the input
     - Sampling needs lots of space
     - Without randomization and approximation, this problem is uninteresting (linear space is required)
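To make the target concrete, here is a minimal Python sketch (names mine, not from the talk) of the exact baseline and of the guarantee the streaming algorithms are measured against:

```python
# Exact F0 needs space proportional to the number of distinct elements;
# the point of the streaming algorithms below is to avoid exactly this.
def f0_exact(stream):
    return len(set(stream))

# The (eps, delta) guarantee: F0'(X) = (1 +/- eps) * F0(X) must hold
# with probability at least 1 - delta over the algorithm's randomness.
def within_guarantee(f0_approx, f0_true, eps):
    return (1 - eps) * f0_true <= f0_approx <= (1 + eps) * f0_true
```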

  5. Some applications
     - Web analysis
       - How many different queries were processed by the search engine in the last 48 hours?
       - How many non-duplicate pages have been crawled from a given web site?
       - How many unique ads has a user clicked on, or how many unique users ever clicked a given ad?
     - Databases
       - Query selectivity
       - Query planning and execution
     - Networks
       - Smart traffic routing

  6. Some previous work
     - [Flajolet, Martin]: assumed ideal hash functions
     - [Alon, Matias, Szegedy]: pairwise-independent hashing; (2+ε)-approximation using O(log m) space
     - [Cohen]: similar to FM and AMS
     - [Gibbons, Tirthapura]: hashing-based ε-approximation using O(1/ε² · log m) space
     - [Bar-Yossef, Kumar, Sivakumar]: hashing-based, range-summable ε-approximation using O(1/ε³ · log m) space
     - [Cormode, Datar, Indyk, Muthukrishnan]: stable distributions; ε-approximation using O(1/ε² · log m) space

  7. The rest of the talk
     - Upper bounds
     - Lower bounds

  8. Upper bounds
     What is the goal beyond O(1/ε² · log m) space?
     Can we get upper bounds of the form Õ(1/ε² + log m), where Õ hides factors of the form log 1/ε and log log m?
     Three algorithms with improved upper bounds follow.

  9. Summary of the bounds
     - ALG I: space O(1/ε² · log m), time Õ(log m) per element
     - ALG II: space Õ(1/ε² + log m), time Õ(1/ε² · log m) per element
     - ALG III: space Õ(1/ε² + log m), time Õ(log m) amortized per element

  10. ALG I: Basic idea
     Suppose h: [m] → (0, 1) is truly random
     Then min_i h(xᵢ) is roughly 1/F₀(X), so the reciprocal of this value estimates F₀(X) [FM, AMS]
     More robust: keep the t-th smallest value v_t
     v_t is roughly t/F₀, so a good estimator of F₀ is t/v_t

  11. ALG I: Details
     t = 1/ε²; h: [m] → [m³], pairwise independent; T = ∅
     for i = 1, …, n do: T ← the t smallest values in T ∪ {h(xᵢ)}
     v_t = t-th smallest value in T; output F'₀(X) = t·m³/v_t
     - Space: O(log m) for h and O(1/ε² · log m) for T
     - Time: maintain T as a balanced binary search tree
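A minimal Python sketch of ALG I (all names mine). It swaps the slide's range [m³] for the fixed prime 2⁶¹ − 1, which plays the same role of making h injective whp, and uses a bounded heap in place of the balanced search tree:

```python
import heapq, math, random

P = (1 << 61) - 1  # prime; stands in for the [m^3] hash range

def alg1_f0(stream, eps):
    t = math.ceil(1.0 / eps ** 2)
    a, b = random.randrange(1, P), random.randrange(P)
    h = lambda x: (a * x + b) % P          # pairwise-independent hash
    heap, in_heap = [], set()              # max-heap (negated) of the t smallest hash values
    for x in stream:
        v = h(x)
        if v in in_heap:
            continue                       # duplicate element (or collision): ignore
        if len(heap) < t:
            heapq.heappush(heap, -v)
            in_heap.add(v)
        elif v < -heap[0]:                 # smaller than the current t-th smallest
            in_heap.discard(-heapq.heappop(heap))
            heapq.heappush(heap, -v)
            in_heap.add(v)
    if len(heap) < t:                      # fewer than t distinct values: exact count
        return len(heap)
    v_t = -heap[0]                         # the t-th smallest hash value
    return t * P / v_t                     # v_t ~ t*P/F0, so t*P/v_t ~ F0
```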

  12. ALG I: Analysis
     h is pairwise independent and injective whp
     Let Y = {y₁, …, y_k} be the distinct values, so F₀ = k
     The event F'₀ > (1+ε)F₀ means h(y₁), …, h(y_k) has at least t values smaller than t·m³/((1+ε)F₀)
     Pr[this event] < 1/6 by Chebyshev
     A similar analysis handles F'₀ < (1−ε)F₀
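The Chebyshev step, reconstructed (the choice t ≈ 6/ε² is my reading of where the constant 1/6 comes from; the slide does not spell out constants):

```latex
% Threshold and indicator variables for the overestimation event:
\tau = \frac{t\,m^3}{(1+\varepsilon)F_0}, \qquad
Z_j = \mathbf{1}\bigl[h(y_j) < \tau\bigr], \qquad Z = \sum_{j=1}^{k} Z_j .
% F'_0 > (1+eps) F_0 iff at least t hash values fall below tau, i.e. Z >= t.
% Each E[Z_j] ~ tau/m^3, and pairwise independence bounds the variance:
\mathbb{E}[Z] \approx \frac{t}{1+\varepsilon}, \qquad
\operatorname{Var}[Z] \le \mathbb{E}[Z].
% Chebyshev on the deviation t - t/(1+eps) = t*eps/(1+eps):
\Pr[Z \ge t]
\le \frac{\operatorname{Var}[Z]}{\bigl(t\varepsilon/(1+\varepsilon)\bigr)^{2}}
\le \frac{1+\varepsilon}{t\,\varepsilon^{2}}
< \frac{1}{6}
\quad \text{for } t \ge 6(1+\varepsilon)/\varepsilon^{2}.
```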

  13. ALG II: Basic idea
     Suppose we know a rough value R of F₀
     Suppose h: [m] → [R] is truly random
     Define r = Pr_h[h maps some xᵢ to 0] = 1 − (1 − 1/R)^F₀
     If R and F₀ are close, then r is all we need: F₀ = ln(1 − r) / ln(1 − 1/R)
     Estimate R using [AMS]
     Estimate r using sufficiently independent hash functions

  14. ALG II: Some details
     Let H be a (log 1/ε)-wise independent hash family
     Estimator: p = Pr_{h ∈ H}[h maps some xᵢ to 0]
     By inclusion-exclusion, p matches the first log 1/ε terms in the expansion of r
     By Chebyshev, p and r will be close if 1/ε² estimators (hash functions) are deployed
     Create these hash functions from a single master hash
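A rough Python sketch of ALG II's estimation step, under simplifying assumptions (names mine): R is given rather than obtained from an [AMS]-style rough estimator, d-wise independence is realized with random degree-(d−1) polynomials over a prime field, and the 1/ε² hash functions are sampled independently instead of being derived from a master hash as the slide suggests:

```python
import math, random

P = (1 << 61) - 1

def make_hash(d):
    """A d-wise independent hash: a random degree-(d-1) polynomial mod P."""
    coeffs = [random.randrange(P) for _ in range(d)]
    def h(x):
        v = 0
        for c in coeffs:                 # Horner evaluation
            v = (v * x + c) % P
        return v
    return h

def alg2_f0(stream, eps, R):
    d = max(2, math.ceil(math.log2(1.0 / eps)))   # (log 1/eps)-wise independence
    k = math.ceil(1.0 / eps ** 2)                 # number of estimators
    hashes = [make_hash(d) for _ in range(k)]
    hit = [False] * k                             # did h_j map some x_i to 0?
    for x in stream:
        for j, h in enumerate(hashes):
            if h(x) % R == 0:                     # project h's output onto [R]
                hit[j] = True
    p_hat = sum(hit) / k                          # empirical estimate of r
    if p_hat >= 1.0:                              # every estimator hit 0:
        p_hat = 1.0 - 1.0 / (2 * k)               # R is far below F0; clamp
    # Invert r = 1 - (1 - 1/R)^F0 to read off F0:
    return math.log(1.0 - p_hat) / math.log(1.0 - 1.0 / R)
```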

  15. ALG III: Basic idea
     Overview of the algorithms of [GT] and [BKS]
     Suppose h: [m] → [m] is pairwise independent
     Let h_t = the projection of h onto its last t bits
     Find the minimum t for which r = #{xᵢ | h_t(xᵢ) = 0} < 1/ε²
     Output r·2^t
     This can be done space-efficiently, since h_{t+1}(xᵢ) = 0 implies h_t(xᵢ) = 0, so surviving elements can be re-filtered as t grows (see the sketch below)
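A minimal Python sketch of this idea (names mine). It stores the surviving elements explicitly, which is exactly the 1/ε² · log m cost the next slide removes with the secondary hash g; the last t bits of a pairwise-independent hash stand in for h_t:

```python
import math, random

P = (1 << 61) - 1

def alg3_f0(stream, eps):
    cap = math.ceil(1.0 / eps ** 2)
    a, b = random.randrange(1, P), random.randrange(P)
    h = lambda x: (a * x + b) % P        # pairwise-independent hash
    t = 0
    kept = set()                         # distinct elements with h_t(x) = 0
    for x in stream:
        if h(x) % (1 << t) == 0:         # h_t(x) = 0: last t bits of h(x) are zero
            kept.add(x)
            while len(kept) > cap:       # too many survivors: refine the filter;
                t += 1                   # h_{t+1}(x) = 0 implies h_t(x) = 0,
                kept = {y for y in kept  # so re-filtering the kept set is sound
                        if h(y) % (1 << t) == 0}
    return len(kept) * (1 << t)          # r * 2^t
```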

  16. ALG III: Some details
     - Storing survivors explicitly costs 1/ε² · log m space
     - Observation: the elements need not be stored explicitly
     - Use a secondary hash function g, succinct and injective
     - Storing g-values and trailing-zero counts suffices
     - Space: log m + 1/ε² · (log 1/ε + log log m)
     - Amortized time per element: Õ(log m + log 1/ε)

  17. Lower bounds
     The general paradigm:
     - Consider the communication complexity of a suitable problem, in the one-way or multi-round model
     - Reduce it to computing F₀ in the data stream model
     - Obtain a one-pass or multi-pass space lower bound

  18. Ω(log m) lower bound [AMS]
     Reduction from the set-equality problem: Alice is given X, Bob is given Y, both m-bit vectors, and the question is whether X = Y
     - Randomized communication bound of Ω(log m)
     - Let X' = C(X) and Y' = C(Y), where C is an error-correcting code with codeword length n'
     - YES case: if X = Y, then F₀(X' ∪ Y') = n'
     - NO case: if X ≠ Y, then F₀(X' ∪ Y') ≈ 2n', since the codewords differ in most positions

  19. One-pass Ω(1/ε) lower bound
     Reduction from set disjointness with special instances
     Alice has a bit vector X with |X| = m/2, Bob has a bit vector Y with |Y| = εm; both are treated as sets
     - YES instance: X contains Y
     - NO instance: X ∩ Y = ∅
     - One-pass lower bound [BJKS]: Ω(1/ε)
     Let Z be the stream consisting of the members of X followed by the members of Y
     - YES case: if X contains Y, then F₀(Z) = m/2
     - NO case: if X and Y are disjoint, then F₀(Z) = m/2 + εm = (m/2)(1 + 2ε)
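A small numeric check of the gap (all names mine): stream the members of X and then the members of Y, and count distinct values.

```python
def f0_union(X, Y):
    """F0 of the stream: members of X followed by members of Y."""
    return len(set(X) | set(Y))

m, eps = 1000, 0.01
X     = set(range(m // 2))                             # |X| = m/2
Y_yes = set(range(int(eps * m)))                       # Y contained in X
Y_no  = set(range(m // 2, m // 2 + int(eps * m)))      # Y disjoint from X

print(f0_union(X, Y_yes))   # 500 = m/2
print(f0_union(X, Y_no))    # 510 = m/2 + eps*m = (m/2)(1 + 2*eps)
```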

  20. The gap-hamming problem [IW]
     Alice is given X, Bob is given Y, both m-bit vectors; let h(X, Y) denote their Hamming distance
     Promise:
     - YES instance: h(X, Y) ≥ m/2
     - NO instance: h(X, Y) ≤ m/2 − √m
     Gap-hamming problem: distinguish the two cases in the one-pass or multi-round communication model

  21. Gap-hamming captures F₀
     - Z = (1, x₁) … (m, x_m) (1, y₁) … (m, y_m)
     - F₀(Z) = 2h(X, Y) + (m − h(X, Y)) = m + h(X, Y)
     - YES case: if h(X, Y) ≥ m/2, then F₀(Z) ≥ 3m/2
     - NO case: if h(X, Y) ≤ m/2 − √m, then F₀(Z) ≤ 3m/2 − √m = (3m/2)(1 − 2/(3√m))
     So approximating F₀ within a (1 ± Θ(1/√m)) factor solves gap-hamming; it can be shown that an Ω((√m)^c) lower bound for gap-hamming yields an Ω(1/ε^c) lower bound for F₀, with ε = Θ(1/√m)
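A quick check of the identity F₀(Z) = m + h(X, Y) behind the reduction (names mine): positions where X and Y agree contribute one distinct pair, positions where they differ contribute two.

```python
import random

def f0_pairs(X, Y):
    """F0 of the pair stream (1,x_1)...(m,x_m)(1,y_1)...(m,y_m)."""
    return len({(i, b) for i, b in enumerate(X)} |
               {(i, b) for i, b in enumerate(Y)})

def hamming(X, Y):
    return sum(x != y for x, y in zip(X, Y))

m = 64
X = [random.randint(0, 1) for _ in range(m)]
Y = [random.randint(0, 1) for _ in range(m)]
assert f0_pairs(X, Y) == m + hamming(X, Y)   # (m - h) agreeing + 2h differing
```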
