Advanced Algorithms
Count Distinct Elements

Input: a sequence x₁, x₂, ..., xₙ ∈ Ω
Output: an estimation of z = |{x₁, x₂, ..., xₙ}|

• data stream: the input comes one item at a time
• naive algorithm: store everything, with O(n) space
• the algorithm outputs Ẑ, an estimation of z = f(x₁, ..., xₙ)
• (ε, δ)-estimator: Pr[(1−ε)z ≤ Ẑ ≤ (1+ε)z] ≥ 1 − δ

"Using only memory equivalent to 5 lines of printed text, you can estimate with a typical accuracy of 5% and in a single pass the total vocabulary of Shakespeare." -- Flajolet
Input: a sequence x₁, x₂, ..., xₙ ∈ Ω
Output: an estimation of z = |{x₁, x₂, ..., xₙ}|
• (ε, δ)-estimator: Pr[(1−ε)z ≤ Ẑ ≤ (1+ε)z] ≥ 1 − δ

uniform hash function h: Ω → [0,1]
h(x₁), ..., h(xₙ): z uniform independent values in [0,1]
(they partition [0,1] into z+1 subintervals)
E[min_{1≤i≤n} h(xᵢ)] = 1/(z+1) = E[length of a subinterval]  (by symmetry)
estimator: Ẑ = 1/min_i h(xᵢ) − 1?
But Var[min_i h(xᵢ)] is too large! (think of z = 1)
Input: a sequence x₁, x₂, ..., xₙ ∈ Ω
Output: an estimation of z = |{x₁, x₂, ..., xₙ}|
• (ε, δ)-estimator: Pr[(1−ε)z ≤ Ẑ ≤ (1+ε)z] ≥ 1 − δ

uniform independent hash functions h₁, h₂, ..., h_k: Ω → [0,1]  (UHA: Uniform Hash Assumption)
Y_j = min_{1≤i≤n} h_j(xᵢ)
average-min: Y = (1/k) Σ_{j=1}^k Y_j
Flajolet-Martin estimator: Ẑ = 1/Y − 1
unbiased in the average: E[Y] = E[Y_j] = 1/(z+1)
• Deviation: Pr[Ẑ < (1−ε)z or Ẑ > (1+ε)z] < ?
z = |{x₁, x₂, ..., xₙ}|
For j = 1, 2, ..., k, the hash values of h_j on the z distinct items are uniform independent X_{j1}, X_{j2}, ..., X_{jz} ∈ [0,1], and Y_j = min_{1≤i≤z} X_{ji}
By symmetry, E[Y] = E[Y_j] = 1/(z+1), where Y = (1/k) Σ_{j=1}^k Y_j
F-M estimator: let Ẑ = 1/Y − 1

Goal: Pr[Ẑ > (1+ε)z or Ẑ < (1−ε)z] < δ
For ε ≤ 1/2, the bad event implies |Y − 1/(z+1)| > (ε/2)/(z+1), i.e. |Y − E[Y]| > (ε/2)/(z+1), so
Pr[Ẑ > (1+ε)z or Ẑ < (1−ε)z] ≤ Pr[|Y − E[Y]| > (ε/2)/(z+1)]
Chebyshev: ≤ (4(z+1)²/ε²) · Var[Y]
Markov's Inequality

Markov's Inequality: For nonnegative X, for any t > 0,
  Pr[X ≥ t] ≤ E[X]/t.

Proof: Let Y = 1 if X ≥ t, and Y = 0 otherwise. Then Y ≤ X/t, so
  Pr[X ≥ t] = E[Y] ≤ E[X/t] = E[X]/t.  ∎

The bound is tight if we only know the expectation of X.
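A quick numerical sanity check of the inequality (not from the slides; the exponential distribution and sample sizes are only an illustration):

```python
import random

# Empirical check of Markov's inequality: for nonnegative X and t > 0,
# Pr[X >= t] <= E[X]/t. Here X ~ Exponential(1), so E[X] = 1.
random.seed(0)
samples = [random.expovariate(1.0) for _ in range(100_000)]

mean = sum(samples) / len(samples)
for t in (1.0, 2.0, 4.0):
    tail = sum(1 for x in samples if x >= t) / len(samples)
    bound = mean / t
    assert tail <= bound
    print(f"t={t}: Pr[X>=t] ~= {tail:.4f} <= E[X]/t ~= {bound:.4f}")
```

For the exponential the true tail is e^{−t}, noticeably below E[X]/t; Markov is loose here because it uses only the expectation.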
A Generalization of Markov's Inequality

Theorem: For any random variable X and any nonnegative function h of X, for any t > 0,
  Pr[h(X) ≥ t] ≤ E[h(X)]/t.
Chebyshev's Inequality

Chebyshev's Inequality: For any t > 0,
  Pr[|X − E[X]| ≥ t] ≤ Var[X]/t².

Variance: Var[X] = E[(X − E[X])²] = E[X²] − (E[X])²
• Var[cX] = c² · Var[X]
• Var[Σᵢ Xᵢ] = Σᵢ Var[Xᵢ] for pairwise independent Xᵢ
Chebyshev's Inequality

Proof: Apply Markov's inequality to the nonnegative variable (X − E[X])²:
  Pr[|X − E[X]| ≥ t] = Pr[(X − E[X])² ≥ t²] ≤ E[(X − E[X])²]/t² = Var[X]/t².  ∎
z = |{x₁, x₂, ..., xₙ}|
For j = 1, 2, ..., k, the hash values of h_j are uniform independent X_{j1}, ..., X_{jz} ∈ [0,1], and Y_j = min_{1≤i≤z} X_{ji}
By symmetry, E[Y] = E[Y_j] = 1/(z+1), where Y = (1/k) Σ_{j=1}^k Y_j

By geometric probability: Pr[Y_j ≥ y] = (1−y)^z, so Y_j has pdf z(1−y)^{z−1}
E[Y_j²] = ∫₀¹ y² · z(1−y)^{z−1} dy = 2/((z+1)(z+2))
Var[Y_j] = E[Y_j²] − E[Y_j]² ≤ 1/(z+1)²
Var[Y] = (1/k²) Σ_{j=1}^k Var[Y_j] = Var[Y_j]/k ≤ 1/(k(z+1)²)  (only 2-wise independence is needed)
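The two moments of Y_j can be checked by simulation (an illustration, not from the slides; z and the trial count are arbitrary):

```python
import random

# Monte-Carlo check of the moments of Y_j = min of z independent
# Uniform[0,1] values:  E[Y_j] = 1/(z+1),  E[Y_j^2] = 2/((z+1)(z+2)).
random.seed(1)
z, trials = 10, 200_000
mins = [min(random.random() for _ in range(z)) for _ in range(trials)]

m1 = sum(mins) / trials
m2 = sum(y * y for y in mins) / trials
print(m1, 1 / (z + 1))               # both close to 1/11
print(m2, 2 / ((z + 1) * (z + 2)))   # both close to 2/132
```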
z = |{x₁, x₂, ..., xₙ}|
For j = 1, 2, ..., k, the hash values of h_j are uniform independent X_{j1}, ..., X_{jz} ∈ [0,1], and Y_j = min_{1≤i≤z} X_{ji}
By symmetry, E[Y] = E[Y_j] = 1/(z+1), where Y = (1/k) Σ_{j=1}^k Y_j
F-M estimator: let Ẑ = 1/Y − 1

Pr[Ẑ > (1+ε)z or Ẑ < (1−ε)z]
  ≤ Pr[|Y − E[Y]| > (ε/2)/(z+1)]   (for ε ≤ 1/2)
  ≤ (4(z+1)²/ε²) · Var[Y]          (Chebyshev)
  ≤ 4/(ε²k)                        (since Var[Y] ≤ 1/(k(z+1)²))
Input: a sequence x₁, x₂, ..., xₙ ∈ Ω
Output: an estimation of z = |{x₁, x₂, ..., xₙ}|

uniform independent hash functions h₁, h₂, ..., h_k: Ω → [0,1]  (UHA: Uniform Hash Assumption)
Y_j = min_{1≤i≤n} h_j(xᵢ)
average-min: Y = (1/k) Σ_{j=1}^k Y_j
Flajolet-Martin estimator: Ẑ = 1/Y − 1

Pr[Ẑ > (1+ε)z or Ẑ < (1−ε)z] ≤ 4/(ε²k) ≤ δ
choose k = 4/(ε²δ)
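The whole average-min estimator fits in a few lines. A minimal sketch: true uniform hashes into [0,1] are an idealization (the UHA), so a salted SHA-256 is used here as a stand-in; the stream, z, and k below are arbitrary illustration values.

```python
import hashlib

def uniform_hash(x: str, salt: int) -> float:
    """Stand-in for an ideal uniform hash h_j : Ω -> [0,1] (salted SHA-256
    approximates the Uniform Hash Assumption)."""
    digest = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def fm_estimate(stream, k: int) -> float:
    """Average-min Flajolet-Martin estimator: Z = 1/Y - 1, where Y averages
    min_i h_j(x_i) over k hash functions. One pass, O(k) memory."""
    mins = [1.0] * k
    for x in stream:
        for j in range(k):
            mins[j] = min(mins[j], uniform_hash(x, j))
    y = sum(mins) / k
    return 1 / y - 1

stream = [f"item{i % 200}" for i in range(2000)]   # z = 200 distinct items
est = fm_estimate(stream, k=256)
print(est)   # typically within ~15% of 200
```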
Frequency Estimation

Data: a sequence x₁, x₂, ..., xₙ ∈ Ω
Query: an item x ∈ Ω
Estimate the frequency f_x = |{i : xᵢ = x}| of item x within additive error εn.

• data stream: the input comes one item at a time
• on query x, the algorithm returns f̂_x, an estimation of the frequency f_x, with Pr[|f̂_x − f_x| ≥ εn] ≤ δ
• heavy hitters: items that appear > εn times
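The slide only poses the problem. One standard data structure achieving this kind of additive-error guarantee (not derived here) is the Count-Min sketch of Cormode and Muthukrishnan: k rows of m counters, each item increments one counter per row, and a query returns the minimum counter. A minimal sketch, again using salted SHA-256 as a stand-in for the hash functions, with illustrative parameters:

```python
import hashlib

class CountMin:
    """Count-Min sketch: queries never underestimate f_x, and overestimate
    by more than eps*n only with small probability (roughly m ~ e/eps rows
    of counters and k ~ ln(1/delta) hash functions)."""
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.table = [[0] * m for _ in range(k)]

    def _bucket(self, x: str, j: int) -> int:
        d = hashlib.sha256(f"{j}:{x}".encode()).digest()
        return int.from_bytes(d[:8], "big") % self.m

    def add(self, x: str):
        for j in range(self.k):
            self.table[j][self._bucket(x, j)] += 1

    def query(self, x: str) -> int:
        return min(self.table[j][self._bucket(x, j)] for j in range(self.k))

cm = CountMin(m=1000, k=5)
for i in range(10_000):
    cm.add(f"item{i % 100}")        # each of 100 items appears 100 times
q = cm.query("item7")
print(q)   # at least 100, and typically equal or very close to it
```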
Data Structure for Set

Data: a set S of n items x₁, x₂, ..., xₙ ∈ Ω
Query: an item x ∈ Ω. Determine whether x ∈ S.

• space cost: size of the data structure (in bits); the entropy of a set is O(n log|Ω|) bits
• time cost: time to answer a query
• balanced tree: O(n log|Ω|) space, O(log n) time
• perfect hashing: O(n log|Ω|) space, O(1) time
• using less than the entropy space? → a sketch of the set (an approximate representation)
Approximate a Set

Data: a set S of n items x₁, x₂, ..., xₙ ∈ Ω
Query: an item x ∈ Ω. Determine whether x ∈ S.

uniform hash function h: Ω → [m]
data structure: an m-bit vector v ∈ {0,1}^m
  initially v is all-0; set v[h(xᵢ)] = 1 for each xᵢ ∈ S;
  query x: answer "yes" iff v[h(x)] = 1;
• x ∈ S: always correct
• x ∉ S: false positive, with Pr[v[h(x)] = 1] = 1 − (1 − 1/m)^n ≈ 1 − e^{−n/m}
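The single-hash sketch above can be checked empirically (salted SHA-256 again stands in for the uniform hash; n and m are illustration values):

```python
import hashlib
import math

def h(x: str, m: int) -> int:
    """Stand-in for a uniform hash h : Ω -> [m]."""
    return int.from_bytes(hashlib.sha256(x.encode()).digest()[:8], "big") % m

def make_sketch(items, m: int):
    v = [0] * m                  # m-bit vector, initially all-0
    for x in items:
        v[h(x, m)] = 1           # set v[h(x_i)] = 1 for each x_i in S
    return v

def query(v, x: str) -> bool:
    return v[h(x, len(v))] == 1  # "yes" iff the bit is set

n, m = 1000, 10_000
v = make_sketch((f"in{i}" for i in range(n)), m)
assert all(query(v, f"in{i}") for i in range(n))   # x in S: always correct

# x not in S: false-positive rate should be near 1 - e^{-n/m} ~ 0.095
fp = sum(query(v, f"out{i}") for i in range(10_000)) / 10_000
print(fp, 1 - math.exp(-n / m))
```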
Bloom Filters (Bloom 1970)

Data: a set S of n items x₁, x₂, ..., xₙ ∈ Ω
Query: an item x ∈ Ω. Determine whether x ∈ S.

uniform independent hash functions h₁, h₂, ..., h_k: Ω → [m]
data structure: an m-bit vector v ∈ {0,1}^m
  initially v is all-0;
  for each xᵢ ∈ S: set v[h_j(xᵢ)] = 1 for all j = 1, ..., k;
  query x: answer "yes" iff v[h_j(x)] = 1 for all j = 1, ..., k;
Bloom Filters

[Figure: items x, y, z are inserted by setting the bits v[h₁(·)], v[h₂(·)], v[h₃(·)] in the vector v. A queried item w ∉ S whose k bits were all set by other items is a false positive.]
data: a set S ⊆ Ω of size |S| = n; query: x ∈ Ω
(UHA: Uniform Hash Assumption)

x ∉ S: false positive. Choose k = (m/n)·ln 2 and m = cn:
Pr[∀ 1 ≤ j ≤ k: v[h_j(x)] = 1]
  ≈ (Pr[v[h_j(x)] = 1])^k
  = (1 − Pr[v[h_j(x)] = 0])^k
  = (1 − (1 − 1/m)^{kn})^k
  ≈ (1 − e^{−kn/m})^k
  = 2^{−k} ≈ (0.6185)^c
Bloom Filters

data: a set S ⊆ Ω of size |S| = n; query: x ∈ Ω
uniform independent hash functions h₁, h₂, ..., h_k: Ω → [m]
data structure: an m-bit vector v ∈ {0,1}^m
  initially v is all-0;
  for each xᵢ ∈ S: set v[h_j(xᵢ)] = 1 for all j = 1, ..., k;
  query x: answer "yes" iff v[h_j(x)] = 1 for all j = 1, ..., k;

choose k = (m/n)·ln 2 = c·ln 2, where m = cn
• space cost: cn bits; time cost: c·ln 2 hash evaluations per query
• false positive probability: < (0.6185)^c
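The construction above can be sketched directly (salted SHA-256 stands in for the k uniform hash functions; n = 1000 and c = 8 are illustration values, giving k = round(c·ln 2) = 6):

```python
import hashlib
import math

class BloomFilter:
    """Bloom filter sketch under the Uniform Hash Assumption."""
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.v = [0] * m                  # m-bit vector, initially all-0

    def _h(self, x: str, j: int) -> int:
        d = hashlib.sha256(f"{j}:{x}".encode()).digest()
        return int.from_bytes(d[:8], "big") % self.m

    def add(self, x: str):
        for j in range(self.k):
            self.v[self._h(x, j)] = 1     # set v[h_j(x)] = 1 for all j

    def query(self, x: str) -> bool:      # "yes" iff all k bits are set
        return all(self.v[self._h(x, j)] == 1 for j in range(self.k))

n, c = 1000, 8                            # m = c*n bits, k = c*ln2 hashes
m, k = c * n, round(c * math.log(2))
bf = BloomFilter(m, k)
for i in range(n):
    bf.add(f"in{i}")

assert all(bf.query(f"in{i}") for i in range(n))   # no false negatives
fp = sum(bf.query(f"out{i}") for i in range(10_000)) / 10_000
print(fp)   # should be near (0.6185)^8 ~ 0.02
```

With c = 8 the filter spends 8 bits per item, far below the O(log|Ω|) bits an exact representation needs, at the price of a ~2% false-positive rate.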