Lecture 4 Barna Saha AT&T-Labs Research September 19, 2013
Outline
◮ Heavy Hitter Continued
◮ Frequency Moment Estimation
◮ Dimensionality Reduction
Heavy Hitter
◮ Heavy Hitter Problem: For $0 < \epsilon < \phi < 1$, find a set of elements $S$ that includes every $i$ with $f_i > \phi m$ and contains no element with frequency $\le (\phi - \epsilon) m$.
◮ Count-Min sketch guarantees: $f_i \le \hat{f}_i \le f_i + \epsilon m$ with probability $\ge 1 - \delta$ in space $\frac{e}{\epsilon} \log \frac{1}{(\phi - \epsilon)\delta}$.
◮ Insert only: Maintain a min-heap of size $k = \frac{1}{\phi - \epsilon}$. When an item arrives, estimate its frequency and, if the estimate is above $\phi m$, include the item in the heap. If the heap size exceeds $k$, discard the minimum-frequency element in the heap.
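The insert-only procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the lecture's implementation: items are assumed to be non-negative integers, the hash family `(a*x + b) mod PRIME mod width` is an illustrative choice, the sketch is sized with width $e/\epsilon$ and depth $\ln\frac{1}{(\phi-\epsilon)\delta}$, a dict with minimum-eviction stands in for the min-heap, and the final re-check against $\phi m$ is an added step. The names `CountMinSketch` and `heavy_hitters_insert_only` are hypothetical.

```python
import math
import random


class CountMinSketch:
    """Count-Min sketch over non-negative integer items (assumed universe)."""

    PRIME = (1 << 61) - 1  # modulus of the illustrative hash family

    def __init__(self, eps, delta, seed=0):
        self.width = math.ceil(math.e / eps)                  # e / eps columns
        self.depth = max(1, math.ceil(math.log(1 / delta)))   # ln(1 / delta) rows
        rng = random.Random(seed)
        self.coeffs = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(self.depth)]
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _column(self, row, item):
        a, b = self.coeffs[row]
        return ((a * item + b) % self.PRIME) % self.width

    def update(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._column(row, item)] += count

    def estimate(self, item):
        # Minimum over rows; never below the true count when all updates are >= 0.
        return min(self.table[row][self._column(row, item)]
                   for row in range(self.depth))


def heavy_hitters_insert_only(stream, phi, eps, delta, seed=0):
    # Per-item failure probability delta * (phi - eps), matching the space bound.
    sketch = CountMinSketch(eps, delta * (phi - eps), seed)
    k = math.ceil(1 / (phi - eps))   # capacity of the candidate set
    candidates = {}                  # item -> estimate at its latest qualifying arrival
    m = 0                            # stream length so far
    for item in stream:
        sketch.update(item)
        m += 1
        est = sketch.estimate(item)
        if est > phi * m:
            candidates[item] = est
            if len(candidates) > k:
                # Evict the candidate with the smallest stored estimate.
                del candidates[min(candidates, key=candidates.get)]
    # Added step (assumption): re-check candidates against the final threshold.
    return {i for i in candidates if sketch.estimate(i) > phi * m}
```

Because the sketch never underestimates on an insert-only stream, an item with $f_i > \phi m$ always passes the threshold test at its last arrival; the probabilistic part of the guarantee concerns only which light items may slip in.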
Heavy Hitter
◮ Turnstile model:
  ◮ Maintain dyadic intervals over a binary search tree and maintain $\log n$ Count-Min sketches, one for each level, each using space $\frac{e}{\epsilon} \log \frac{2 \log n}{\delta(\phi - \epsilon)}$.
  ◮ At every level there are at most $\frac{1}{\phi}$ heavy hitters.
  ◮ Estimate the frequency of the children of the heavy hitter nodes until the leaf level is reached.
  ◮ Return all the leaves with estimated frequency above $\phi m$.
◮ Analysis
  ◮ At most $\frac{2}{\phi - \epsilon}$ nodes are examined at every level.
  ◮ Each node with estimated frequency above $\phi m$ has true frequency $> (\phi - \epsilon) m$ with probability at least $1 - \frac{\delta(\phi - \epsilon)}{2 \log n}$.
  ◮ By the union bound, all true frequencies are above $(\phi - \epsilon) m$ with probability at least $1 - \delta$.
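The dyadic search can be sketched as below. This is a rough illustration with several assumptions: the universe is $\{0, \dots, 2^{\texttt{log\_n}} - 1\}$, a node at level $\ell$ is identified by the top $\ell$ bits of the item, final frequencies are non-negative (strict turnstile), and each level's sketch is sized for failure probability $\frac{\delta(\phi-\epsilon)}{2\log n}$. The class names `CountMin` and `DyadicHeavyHitters` and the hash family are hypothetical choices.

```python
import math
import random


class CountMin:
    """Minimal Count-Min sketch over non-negative integer keys."""

    PRIME = (1 << 61) - 1  # modulus of the illustrative hash family

    def __init__(self, eps, delta, rng):
        self.width = math.ceil(math.e / eps)
        self.depth = max(1, math.ceil(math.log(1 / delta)))
        self.coeffs = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(self.depth)]
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _column(self, row, key):
        a, b = self.coeffs[row]
        return ((a * key + b) % self.PRIME) % self.width

    def update(self, key, amount):
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += amount

    def estimate(self, key):
        return min(self.table[row][self._column(row, key)]
                   for row in range(self.depth))


class DyadicHeavyHitters:
    def __init__(self, log_n, phi, eps, delta, seed=0):
        rng = random.Random(seed)
        self.log_n, self.phi = log_n, phi
        self.m = 0  # running sum of all updates
        per_sketch_delta = delta * (phi - eps) / (2 * log_n)
        # One sketch per tree level 1..log_n (the root needs none).
        self.levels = [CountMin(eps, per_sketch_delta, rng) for _ in range(log_n)]

    def update(self, item, amount):
        """Process a turnstile update (item, amount); amount may be negative."""
        self.m += amount
        for level in range(1, self.log_n + 1):
            node = item >> (self.log_n - level)  # ancestor of item at this level
            self.levels[level - 1].update(node, amount)

    def query(self):
        """Descend from the root, expanding only nodes estimated above phi * m."""
        threshold = self.phi * self.m
        frontier = [0]  # the root
        for level in range(1, self.log_n + 1):
            sketch = self.levels[level - 1]
            frontier = [child
                        for node in frontier
                        for child in (2 * node, 2 * node + 1)
                        if sketch.estimate(child) > threshold]
        return frontier  # leaves (items) with estimated frequency above phi * m
```

For example, after the updates `(5, +60), (9, +30), (9, -20), (3, +15), (12, +15)` on a universe of 16 items, item 5 holds 60 of the 100 total and every ancestor of leaf 5 carries at least that much weight, so the descent with $\phi = 0.3$ reaches it.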
$l_2$ frequency estimation
◮ $|f_i - \hat{f}_i| \le \epsilon \sqrt{f_1^2 + f_2^2 + \dots + f_n^2}$ [Count-sketch]
◮ $F_2 = f_1^2 + f_2^2 + \dots + f_n^2$
◮ How do we estimate $F_2$ in small space?
AMS $F_2$ Estimation
◮ $H = \{ h : [n] \to \{+1, -1\} \}$, a family of four-wise independent hash functions
◮ Maintain $Z_j = Z_j + a\, h_j(i)$ on arrival of $(i, a)$, for $j = 1, \dots, t$ with $t = \frac{c}{\epsilon^2}$
◮ Return $Y = \frac{1}{t} \sum_{j=1}^{t} Z_j^2$
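A minimal Python sketch of this estimator follows. The four-wise independent family is realized here, as an assumption, by a random degree-3 polynomial over a prime field with the low bit mapped to $\pm 1$ (which introduces a bias of about $1/\texttt{PRIME}$); the class name `AMSSketch` is hypothetical, and the number of counters `t` is left to the caller rather than derived from a particular constant $c$.

```python
import random

PRIME = (1 << 61) - 1  # field size for the polynomial hashes (illustrative choice)


class AMSSketch:
    """AMS F2 sketch with t counters; items are assumed to be ints in [0, PRIME)."""

    def __init__(self, t, seed=0):
        rng = random.Random(seed)
        # One random degree-3 polynomial per counter: a four-wise independent family.
        self.polys = [[rng.randrange(PRIME) for _ in range(4)] for _ in range(t)]
        self.z = [0] * t

    def _sign(self, j, i):
        c0, c1, c2, c3 = self.polys[j]
        value = (((c3 * i + c2) * i + c1) * i + c0) % PRIME  # Horner evaluation
        return 1 if value & 1 else -1                         # low bit -> +1 / -1

    def update(self, i, a):
        """Process stream element (i, a): Z_j += a * h_j(i) for every j."""
        for j in range(len(self.z)):
            self.z[j] += a * self._sign(j, i)

    def estimate(self):
        """Return Y = (1/t) * sum_j Z_j^2."""
        return sum(z * z for z in self.z) / len(self.z)
```

A quick sanity check: if the stream touches a single item with net count 7, every $Z_j$ equals $\pm 7$, so `estimate()` returns exactly 49.0, which is $F_2$.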
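Analysis
◮ $Z_j = \sum_{i=1}^{n} f_i h_j(i)$
◮ $E[Z_j] = 0$, $E[Z_j^2] = F_2$.
◮ $\mathrm{Var}[Z_j^2] = E[Z_j^4] - (E[Z_j^2])^2 \le 4 F_2^2$.
◮ $E[Y] = F_2$. Since the $Z_j$ are independent, $\mathrm{Var}[Y] = \frac{1}{t^2} \sum_{j=1}^{t} \mathrm{Var}[Z_j^2] \le \frac{4 F_2^2}{t} = \frac{4 \epsilon^2}{c} F_2^2$.
◮ By Chebyshev's inequality, $\Pr\left[\, |Y - E[Y]| > \epsilon F_2 \,\right] \le \frac{\mathrm{Var}[Y]}{\epsilon^2 F_2^2} \le \frac{4}{c}$.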
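The moment calculations behind these bounds can be written out as follows. This is a worked derivation under the slide's assumptions (four-wise independence, $E[h_j(i)] = 0$, $h_j(i)^2 = 1$); sums over $i \ne k$ range over ordered pairs, and the index $j$ is dropped on $h$ for readability.

```latex
\begin{align*}
E[Z_j^2] &= \sum_{i,k} f_i f_k \, E[h(i) h(k)] = \sum_{i} f_i^2 = F_2
  && \text{(cross terms vanish)} \\
E[Z_j^4] &= \sum_{i} f_i^4 + 3 \sum_{i \ne k} f_i^2 f_k^2
  && \text{(only terms whose indices pair up survive)} \\
F_2^2 &= \sum_{i} f_i^4 + \sum_{i \ne k} f_i^2 f_k^2 \\
\mathrm{Var}[Z_j^2] &= E[Z_j^4] - F_2^2 = 2 \sum_{i \ne k} f_i^2 f_k^2
  \le 2 F_2^2 \le 4 F_2^2
\end{align*}
```

The factor 3 counts the ways to split four index positions into two pairs. The derivation gives the slightly tighter constant 2, which is consistent with the bound $4F_2^2$ used above.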
Boosting by Median
◮ Keep $Y_1, Y_2, \dots, Y_s$, with $s = O(\log \frac{1}{\delta})$
◮ Return $A = \mathrm{median}(Y_1, Y_2, \dots, Y_s)$
◮ By the Chernoff bound, $\Pr\left[\, |A - F_2| > \epsilon F_2 \,\right] < \delta$
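The median step is generic and can be sketched independently of the estimator being boosted. In the snippet below, `run_once` is a hypothetical callable that returns one independent estimate $Y$ per seed, and the constant 8 in front of $\ln\frac{1}{\delta}$ is an arbitrary illustrative choice; the slide only specifies $s = O(\log\frac{1}{\delta})$.

```python
import math
import statistics


def boosted_estimate(run_once, delta, repetitions_constant=8.0):
    """Median of s = O(log(1/delta)) independent estimates.

    run_once(seed) is assumed to return one estimate that is within the
    desired error with constant probability greater than 1/2.
    """
    s = max(1, math.ceil(repetitions_constant * math.log(1 / delta)))
    estimates = [run_once(seed) for seed in range(s)]
    return statistics.median(estimates)
```

The median is unaffected by a minority of wild outliers: if fewer than half of the $s$ runs fall outside the target interval, the median lies inside it, which is the event the Chernoff bound controls.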
Linear Sketch
◮ The algorithm maintains a linear sketch $[Z_1, Z_2, \dots, Z_t] = Rx$ of the frequency vector $x$, where $R$ is a $t \times n$ random matrix with entries in $\{+1, -1\}$.
◮ Use $Y = \|Rx\|_2^2$ to estimate $t \|x\|_2^2$, with $t = O(\frac{1}{\epsilon^2})$.
◮ A streaming algorithm operating in the sketch model can be viewed as a dimensionality reduction technique.
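Linearity is what makes the sketch usable on a stream: an update $(i, a)$ adds $a$ times column $i$ of $R$ to the sketch, and sketches of two vectors add. The snippet below is a small illustration that stores $R$ explicitly as a fully random sign matrix, which is an assumption made for clarity; a streaming implementation would generate the entries from hash functions instead. The function names are hypothetical.

```python
import random


def sign_matrix(t, n, seed=0):
    """t x n matrix with independent uniform +1/-1 entries (stored explicitly)."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(t)]


def sketch(R, x):
    """Compute Rx for a dense vector x."""
    return [sum(r_ji * x_i for r_ji, x_i in zip(row, x)) for row in R]


def stream_update(R, z, i, a):
    """Apply stream element (i, a) to the sketch z in place: z += a * R[:, i]."""
    for j, row in enumerate(R):
        z[j] += a * row[i]
```

Processing the updates of a stream one by one with `stream_update` yields the same vector as calling `sketch` on the final frequency vector, and `sketch(R, x + y)` equals the entrywise sum of `sketch(R, x)` and `sketch(R, y)`.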
Dimensionality Reduction
◮ A streaming algorithm operating in the sketch model can be viewed as a dimensionality reduction technique.
◮ Stream $S$: a point in $n$-dimensional space; we want to compute $l_2(S)$.
◮ The sketch operator can be viewed as an approximate embedding of $l_2^n$ into a sketch space $C$ such that
  1. each point in $C$ can be described using only a small number (say $m$) of numbers, so $C \subset \mathbb{R}^m$, and
  2. the value of $l_2(S)$ is approximately equal to $F(C(S))$.
◮ $F(Y_1, Y_2, \dots, Y_t) = \mathrm{median}(Y_1, Y_2, \dots, Y_t)$
Dimensionality Reduction
◮ $F(Y_1, Y_2, \dots, Y_t) = \mathrm{median}(Y_1, Y_2, \dots, Y_t)$
◮ Disadvantage: $F$ is not a norm, so performing any nontrivial operation in the sketch space (e.g. clustering, similarity search, regression) becomes difficult.
◮ Can we embed from $l_2^n$ to $l_2^m$, $m \ll n$, approximately preserving distances?
Johnson-Lindenstrauss Lemma
Interlude to Normal Distribution
Normal distribution $N(0, 1)$:
◮ Range $(-\infty, \infty)$
◮ Density $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
◮ Mean = 0, Variance = 1
Basic facts
◮ If $X$ and $Y$ are independent random variables with normal distribution, then so is $X + Y$.
◮ If $X$ and $Y$ are independent with mean 0, then $E[(X + Y)^2] = E[X^2] + E[Y^2]$.
◮ $E[cX] = c\,E[X]$, $\mathrm{Var}[cX] = c^2\, \mathrm{Var}[X]$
A Different Linear Sketch
Instead of $\pm 1$, let the $r_i$ be i.i.d. random variables drawn from $N(0, 1)$.
◮ Consider $Z = \sum_i r_i x_i$
◮ $E[Z^2] = E\left[\left(\sum_i r_i x_i\right)^2\right] = \sum_i E[r_i^2]\, x_i^2 = \sum_i \mathrm{Var}[r_i]\, x_i^2 = \sum_i x_i^2 = \|x\|_2^2$.
◮ As before, we maintain $Z = [Z_1, Z_2, \dots, Z_t]$ and define $Y = \|Z\|_2^2$.
◮ $E[Y] = t \|x\|_2^2$
◮ We show that there exists a constant $C > 0$ such that for small enough $\epsilon > 0$, $\Pr\left[\, |Y - t\|x\|_2^2| > \epsilon t \|x\|_2^2 \,\right] \le e^{-C \epsilon^2 t}$ (JL lemma).
◮ Set $t = O(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$.
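The Gaussian sketch can be tried out numerically with a few lines of Python. This is a rough illustration only: it draws the Gaussian coefficients directly with `random.gauss` on a dense vector `x`, which sidesteps how a streaming algorithm would regenerate them, and the function names are hypothetical. Dividing $Y$ by $t$ turns it into an estimate of $\|x\|_2^2$.

```python
import random


def gaussian_sketch(x, t, seed=0):
    """Return Z = [Z_1, ..., Z_t] with Z_j = sum_i r_ji * x_i, r_ji ~ N(0, 1)."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) * x_i for x_i in x) for _ in range(t)]


def estimate_squared_norm(x, t, seed=0):
    """Estimate ||x||_2^2 as Y / t, where Y = ||Z||_2^2."""
    z = gaussian_sketch(x, t, seed)
    return sum(z_j * z_j for z_j in z) / t
```

Since $Y / \|x\|_2^2$ follows a chi-squared distribution with $t$ degrees of freedom, the relative standard deviation of the estimate is $\sqrt{2/t}$; with $t = 2000$ that is roughly 3%.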
Johnson-Lindenstrauss Lemma
Lemma. For any $0 < \epsilon < 1$ and any integer $m$, let $t$ be a positive integer such that
$t > \frac{4 \ln m}{\epsilon^2/2 - \epsilon^3/3}$.
Then for any set $V$ of $m$ points in $\mathbb{R}^n$, there is a map $f : \mathbb{R}^n \to \mathbb{R}^t$ such that for all $u, v \in V$,
$(1 - \epsilon) \|u - v\|_2^2 \le \|f(u) - f(v)\|_2^2 \le (1 + \epsilon) \|u - v\|_2^2$.
Furthermore, this map can be found in randomized polynomial time.
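One way to experiment with the lemma is to take $f(u) = \frac{1}{\sqrt{t}} R u$ with $R$ a $t \times n$ matrix of i.i.d. $N(0,1)$ entries and measure the worst pairwise distortion. The snippet below is an illustrative sketch under that assumption; the lemma itself only asserts that a suitable map exists, and the target dimension `t` is supplied by the caller rather than computed from the bound. The function names are hypothetical, and the input points are assumed to be pairwise distinct.

```python
import math
import random


def random_projection(points, t, seed=0):
    """Map each point u in R^n to (1 / sqrt(t)) * R u, with R ~ N(0, 1)^(t x n)."""
    rng = random.Random(seed)
    n = len(points[0])
    R = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(t)]
    scale = 1.0 / math.sqrt(t)
    return [[scale * sum(r * p_i for r, p_i in zip(row, p)) for row in R]
            for p in points]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def max_distortion(points, images):
    """Largest |ratio - 1| over all pairs, ratio = projected / original squared distance."""
    worst = 0.0
    for a in range(len(points)):
        for b in range(a + 1, len(points)):
            ratio = (squared_distance(images[a], images[b])
                     / squared_distance(points[a], points[b]))
            worst = max(worst, abs(ratio - 1.0))
    return worst
```

If `max_distortion` returns a value at most $\epsilon$, the sampled map satisfies the two-sided inequality of the lemma on that point set; otherwise one can redraw $R$ with a new seed, which is the sense in which the map is found in randomized polynomial time.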