Lecture 4 Barna Saha AT&T-Labs Research September 19, 2013

Outline
◮ Heavy Hitter Continued
◮ Frequency Moment Estimation
◮ Dimensionality Reduction

Heavy Hitter
◮ Heavy Hitter Problem: For $0 < \epsilon < \phi < 1$, find a set of elements $S$ that includes every $i$ with $f_i > \phi m$ and contains no element with frequency $\le (\phi - \epsilon) m$.
◮ Count-Min sketch guarantees: $f_i \le \hat{f}_i \le f_i + \epsilon m$ with probability $\ge 1 - \delta$ in space $\frac{e}{\epsilon} \log \frac{1}{(\phi - \epsilon)\delta}$.
◮ Insert only: Maintain a min-heap of size $k = \frac{1}{\phi - \epsilon}$. When an item arrives, estimate its frequency and, if the estimate is above $\phi m$, include the item in the heap. If the heap size exceeds $k$, discard the minimum-frequency element in the heap.
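The insert-only procedure can be sketched in Python as follows. This is a minimal illustration rather than the lecture's own code: the class `CountMin`, the function `insert_only_heavy_hitters`, the hash family $((ax + b) \bmod P) \bmod w$, and the lazy-deletion heap are all assumptions made for the example, and items are assumed to be non-negative integers.

```python
import heapq
import math
import random


class CountMin:
    """Count-Min sketch: width ceil(e/eps), depth ceil(ln(1/delta))."""

    P = (1 << 61) - 1  # prime modulus for the illustrative hash family

    def __init__(self, eps, delta, seed=0):
        rng = random.Random(seed)
        self.width = math.ceil(math.e / eps)
        self.depth = max(1, math.ceil(math.log(1.0 / delta)))
        self.rows = [[0] * self.width for _ in range(self.depth)]
        # One hash h(x) = ((a*x + b) mod P) mod width per row.
        self.coef = [(rng.randrange(1, self.P), rng.randrange(self.P))
                     for _ in range(self.depth)]

    def _cell(self, a, b, item):
        return ((a * item + b) % self.P) % self.width

    def update(self, item, count=1):
        for row, (a, b) in zip(self.rows, self.coef):
            row[self._cell(a, b, item)] += count

    def estimate(self, item):
        # The minimum over rows never underestimates a non-negative count.
        return min(row[self._cell(a, b, item)]
                   for row, (a, b) in zip(self.rows, self.coef))


def insert_only_heavy_hitters(stream, phi, eps, delta=0.01, seed=0):
    cm = CountMin(eps, delta * (phi - eps), seed)
    k = math.ceil(1.0 / (phi - eps))
    tracked = {}  # item -> estimate recorded at its latest arrival
    heap = []     # (estimate, item); stale entries are skipped when popped
    m = 0         # stream length so far
    for item in stream:
        m += 1
        cm.update(item)
        est = cm.estimate(item)
        if est > phi * m:
            tracked[item] = est
            heapq.heappush(heap, (est, item))
            while len(tracked) > k:
                old_est, old_item = heapq.heappop(heap)
                if tracked.get(old_item) == old_est:  # entry is current
                    del tracked[old_item]
    return {item for item, est in tracked.items() if est > phi * m}
```

The heap here may temporarily hold stale entries for an item that arrives repeatedly; a tighter implementation would update the entry in place so that the heap never exceeds $k$ elements.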

Heavy Hitter
◮ Turnstile model:
  ◮ Maintain dyadic intervals over a binary search tree, and maintain $\log n$ Count-Min sketches, one for each level, each using space $\frac{e}{\epsilon} \log \frac{2 \log n}{\delta(\phi - \epsilon)}$.
  ◮ At every level there are at most $\frac{1}{\phi}$ heavy hitters.
  ◮ Estimate the frequency of the children of the heavy hitter nodes until the leaf level is reached.
  ◮ Return all the leaves with estimated frequency above $\phi m$.
◮ Analysis
  ◮ At most $\frac{2}{\phi - \epsilon}$ nodes are examined at every level.
  ◮ Each node that passes the test has true frequency $> (\phi - \epsilon) m$ with probability at least $1 - \frac{\delta(\phi - \epsilon)}{2 \log n}$.
  ◮ By the union bound, all true frequencies are above $(\phi - \epsilon) m$ with probability at least $1 - \delta$.
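The dyadic search can be sketched as below, assuming a universe $\{0, \dots, 2^{L} - 1\}$ and a strict turnstile stream (frequencies are non-negative at query time). The names `CountMin`, `DyadicHeavyHitters`, and `log_n`, as well as the hash family, are illustrative assumptions and not part of the slides.

```python
import math
import random


class CountMin:
    """Count-Min sketch supporting positive and negative updates."""

    P = (1 << 61) - 1  # prime modulus for the illustrative hash family

    def __init__(self, eps, delta, rng):
        self.width = math.ceil(math.e / eps)
        self.depth = max(1, math.ceil(math.log(1.0 / delta)))
        self.rows = [[0] * self.width for _ in range(self.depth)]
        self.coef = [(rng.randrange(1, self.P), rng.randrange(self.P))
                     for _ in range(self.depth)]

    def update(self, key, count):
        for row, (a, b) in zip(self.rows, self.coef):
            row[((a * key + b) % self.P) % self.width] += count

    def estimate(self, key):
        return min(row[((a * key + b) % self.P) % self.width]
                   for row, (a, b) in zip(self.rows, self.coef))


class DyadicHeavyHitters:
    def __init__(self, log_n, phi, eps, delta=0.05, seed=0):
        rng = random.Random(seed)
        self.log_n, self.phi, self.m = log_n, phi, 0
        # Per-sketch failure probability delta * (phi - eps) / (2 log n).
        per_level = delta * (phi - eps) / (2 * log_n)
        # sketches[l - 1] summarises the 2^l dyadic intervals at depth l.
        self.sketches = [CountMin(eps, per_level, rng) for _ in range(log_n)]

    def update(self, item, count):
        """Process stream element (item, count); count may be negative."""
        self.m += count
        for level, cm in enumerate(self.sketches, start=1):
            cm.update(item >> (self.log_n - level), count)

    def query(self):
        """Descend from the root, expanding only nodes estimated above phi*m."""
        nodes = [0]
        for cm in self.sketches:
            children = [c for v in nodes for c in (2 * v, 2 * v + 1)]
            nodes = [c for c in children if cm.estimate(c) > self.phi * self.m]
        return nodes  # surviving leaves are the reported heavy hitters
```

Because every ancestor of a heavy leaf is itself heavy and Count-Min never underestimates in this setting, a true heavy hitter is never pruned on the way down.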

$l_2$ frequency estimation
◮ $|f_i - \hat{f}_i| \le \epsilon \sqrt{f_1^2 + f_2^2 + \dots + f_n^2}$ [Count-sketch]
◮ $F_2 = f_1^2 + f_2^2 + \dots + f_n^2$
◮ How do we estimate $F_2$ in small space?

AMS $F_2$ Estimation
◮ $H = \{ h : [n] \to \{+1, -1\} \}$: four-wise independent hash functions
◮ Maintain $Z_j = Z_j + a\, h_j(i)$ on arrival of $(i, a)$, for $j = 1, \dots, t$ with $t = \frac{c}{\epsilon^2}$
◮ Return $Y = \frac{1}{t} \sum_{j=1}^{t} Z_j^2$
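A minimal Python sketch of this estimator is given below. The four-wise independent family is realised here, as an assumption of the example, by a random degree-3 polynomial over the prime field of size $2^{61} - 1$ whose low bit is mapped to $\pm 1$ (this is very slightly biased because the modulus is odd). The names `make_sign_hash` and `AMSF2` are hypothetical, and items are assumed to be non-negative integers.

```python
import random

P = (1 << 61) - 1  # prime modulus for the polynomial hash (assumption)


def make_sign_hash(rng):
    """Random cubic polynomial mod P; its low bit gives a +1/-1 value."""
    coef = [rng.randrange(P) for _ in range(4)]

    def h(i):
        v = 0
        for c in coef:  # Horner evaluation of the cubic at i
            v = (v * i + c) % P
        return 1 if v & 1 else -1

    return h


class AMSF2:
    def __init__(self, t, seed=0):
        rng = random.Random(seed)
        self.hashes = [make_sign_hash(rng) for _ in range(t)]
        self.z = [0] * t

    def update(self, i, a=1):
        """Stream element (i, a): Z_j <- Z_j + a * h_j(i) for every j."""
        for j, h in enumerate(self.hashes):
            self.z[j] += a * h(i)

    def estimate(self):
        """Y = (1/t) * sum_j Z_j^2."""
        return sum(z * z for z in self.z) / len(self.z)
```

With a single distinct item of frequency 5, every counter equals $\pm 5$, so the estimate is exactly $25 = F_2$; the randomness only matters once several items interact.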

Analysis
◮ $Z_j = \sum_{i=1}^{n} f_i h_j(i)$
◮ $E[Z_j] = 0$, $E[Z_j^2] = F_2$.
◮ $\mathrm{Var}[Z_j^2] = E[Z_j^4] - (E[Z_j^2])^2 \le 4 F_2^2$.
◮ $E[Y] = F_2$. $\mathrm{Var}[Y] = \frac{1}{t^2} \sum_{j=1}^{t} \mathrm{Var}(Z_j^2) \le \frac{4 F_2^2}{t} = \frac{4 \epsilon^2}{c} F_2^2$.
◮ By Chebyshev's inequality, $\Pr\left[\, |Y - E[Y]| > \epsilon F_2 \,\right] \le \frac{4}{c}$.
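The moment bounds can be checked by expanding the powers of $Z_j$. The following is a sketch of the standard computation, where sums over $i \ne k$ range over ordered pairs; it uses $h_j(i)^2 = 1$, pairwise independence for the second moment, and four-wise independence for the fourth moment.

```latex
\begin{align*}
\mathbb{E}[Z_j^2]
  &= \sum_{i} f_i^2\,\mathbb{E}[h_j(i)^2]
   + \sum_{i \ne k} f_i f_k\,\mathbb{E}[h_j(i)]\,\mathbb{E}[h_j(k)]
   = \sum_{i} f_i^2 = F_2, \\
\mathbb{E}[Z_j^4]
  &= \sum_{i} f_i^4 + 3 \sum_{i \ne k} f_i^2 f_k^2, \\
\operatorname{Var}[Z_j^2]
  &= \mathbb{E}[Z_j^4] - F_2^2
   = 2 \sum_{i \ne k} f_i^2 f_k^2 \;\le\; 2 F_2^2 \;\le\; 4 F_2^2 .
\end{align*}
```

Only terms in which every index appears an even number of times survive in the fourth moment, which is where four-wise independence is needed.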

Boosting by Median
◮ Keep $Y_1, Y_2, \dots, Y_s$, with $s = O(\log \frac{1}{\delta})$
◮ Return $A = \mathrm{median}(Y_1, Y_2, \dots, Y_s)$
◮ By the Chernoff bound, $\Pr\left[\, |A - F_2| > \epsilon F_2 \,\right] < \delta$
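The median step itself is short. The sketch below assumes the $s$ independent basic estimates have already been computed; the constant `c` inside `num_repetitions` is an illustrative placeholder, since the slides only state $s = O(\log \frac{1}{\delta})$.

```python
import math
import statistics


def num_repetitions(delta, c=8.0):
    """s = ceil(c * ln(1/delta)); c is a placeholder constant, not from the slides."""
    return math.ceil(c * math.log(1.0 / delta))


def boost_by_median(estimates):
    """Return the median of independent basic estimates Y_1, ..., Y_s."""
    return statistics.median(estimates)
```

For example, `boost_by_median([98, 101, 5000, 99, 102])` returns `101`: a single wildly wrong estimate does not move the median, whereas it would dominate the mean.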

Linear Sketch
◮ The algorithm maintains a linear sketch $[Z_1, Z_2, \dots, Z_t] = Rx$, where $R$ is a $t \times n$ random matrix with entries in $\{+1, -1\}$.
◮ Use $Y = \|Rx\|_2^2$ to estimate $t \|x\|_2^2$, with $t = O(\frac{1}{\epsilon^2})$.
◮ A streaming algorithm operating in the sketch model can be viewed as a dimensionality reduction technique.
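Linearity is what makes the sketch maintainable under stream updates. The following sketch stores $R$ explicitly with fully random signs, which is an assumption made for readability (a streaming implementation would generate entries from hash functions instead). The function names are hypothetical.

```python
import random


def random_sign_matrix(t, n, seed=0):
    """t x n matrix with independent uniform +1/-1 entries."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(t)]


def sketch(R, x):
    """Compute the sketch R x as a list of t numbers."""
    return [sum(r * v for r, v in zip(row, x)) for row in R]


def apply_update(R, z, i, a):
    """Stream element (i, a): add a times column i of R to the sketch z."""
    for j, row in enumerate(R):
        z[j] += a * row[i]
```

Since the map is linear, `sketch(R, x + y)` equals the coordinate-wise sum of `sketch(R, x)` and `sketch(R, y)`, and replaying a stream through `apply_update` yields the same vector as sketching the final frequency vector.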

Dimensionality Reduction
◮ A streaming algorithm operating in the sketch model can be viewed as a dimensionality reduction technique.
◮ Stream $S$: a point in $n$-dimensional space; we want to compute $l_2(S)$.
◮ The sketch operator can be viewed as an approximate embedding of $l_2^n$ into a sketch space $C$ such that
  1. each point in $C$ can be described using only a small number (say $m$) of numbers, so $C \subset \mathbb{R}^m$, and
  2. the value of $l_2(S)$ is approximately equal to $F(C(S))$.
◮ $F(Y_1, Y_2, \dots, Y_t) = \mathrm{median}(Y_1, Y_2, \dots, Y_t)$

Dimensionality Reduction
◮ $F(Y_1, Y_2, \dots, Y_t) = \mathrm{median}(Y_1, Y_2, \dots, Y_t)$
◮ Disadvantage: $F$ is not a norm, so performing any nontrivial operation in the sketch space (e.g. clustering, similarity search, regression) becomes difficult.
◮ Can we embed from $l_2^n$ to $l_2^m$, $m \ll n$, approximately preserving distances? Johnson-Lindenstrauss Lemma

Interlude to Normal Distribution
Normal distribution $N(0, 1)$:
◮ Range $(-\infty, \infty)$
◮ Density $f(x) = \frac{e^{-x^2/2}}{\sqrt{2\pi}}$
◮ Mean $= 0$, Variance $= 1$
Basic facts
◮ If $X$ and $Y$ are independent random variables with normal distribution, then so is $X + Y$.
◮ If $X$ and $Y$ are independent with mean 0, then $E[(X + Y)^2] = E[X^2] + E[Y^2]$.
◮ $E[cX] = c\,E[X]$, $\mathrm{Var}[cX] = c^2\,\mathrm{Var}[X]$.
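The second basic fact follows in one line from independence; the expansion is spelled out here only as a reminder of where the cross term goes.

```latex
\begin{align*}
\mathbb{E}\big[(X+Y)^2\big]
  &= \mathbb{E}[X^2] + 2\,\mathbb{E}[XY] + \mathbb{E}[Y^2] \\
  &= \mathbb{E}[X^2] + 2\,\mathbb{E}[X]\,\mathbb{E}[Y] + \mathbb{E}[Y^2]
   = \mathbb{E}[X^2] + \mathbb{E}[Y^2],
\end{align*}
```

since $\mathbb{E}[X] = \mathbb{E}[Y] = 0$.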

A Different Linear Sketch
Instead of $\pm 1$, let the $r_i$ be i.i.d. random variables drawn from $N(0, 1)$.
◮ Consider $Z = \sum_i r_i x_i$
◮ $E[Z^2] = E\left[\left(\sum_i r_i x_i\right)^2\right] = \sum_i E[r_i^2 x_i^2] = \sum_i \mathrm{Var}[r_i]\, x_i^2 = \sum_i x_i^2 = \|x\|_2^2$.
◮ As before, we maintain $Z = [Z_1, Z_2, \dots, Z_t]$ and define $Y = \|Z\|_2^2$
◮ $E[Y] = t \|x\|_2^2$
◮ We show that there exists a constant $C > 0$ such that, for small enough $\epsilon > 0$, $\Pr\left[\, |Y - t\|x\|_2^2| > \epsilon\, t \|x\|_2^2 \,\right] \le e^{-C \epsilon^2 t}$ (JL lemma)
◮ Set $t = O(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$
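A small simulation makes the Gaussian sketch concrete. The code below is an illustrative sketch under the assumption that the vector $x$ is available in memory; the function names are hypothetical, and the Gaussian entries are regenerated from a seed so that two vectors sketched with the same seed share the same random matrix.

```python
import random


def gaussian_sketch(x, t, seed=0):
    """Return Z = [Z_1, ..., Z_t] with Z_j = sum_i r_ji * x_i, r_ji ~ N(0, 1)."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) * xi for xi in x) for _ in range(t)]


def estimate_sq_norm(z):
    """Y / t = ||Z||_2^2 / t, an estimate of ||x||_2^2."""
    return sum(v * v for v in z) / len(z)
```

For $x = (3, 4)$ the true value is $\|x\|_2^2 = 25$, and with $t$ in the thousands the estimate concentrates tightly around it, as the tail bound predicts.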

Johnson-Lindenstrauss Lemma
Lemma. For any $0 < \epsilon < 1$ and any integer $m$, let $t$ be a positive integer such that
$t > \frac{4 \ln m}{\epsilon^2/2 - \epsilon^3/3}$.
Then for any set $V$ of $m$ points in $\mathbb{R}^n$, there is a map $f : \mathbb{R}^n \to \mathbb{R}^t$ such that for all $u, v \in V$,
$(1 - \epsilon) \|u - v\|_2^2 \le \|f(u) - f(v)\|_2^2 \le (1 + \epsilon) \|u - v\|_2^2$.
Furthermore, this map can be found in randomized polynomial time.
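One common way to realise such a map is a scaled Gaussian random projection, $f(u) = \frac{1}{\sqrt{t}} R u$. The sketch below assumes that construction and computes the target dimension from the bound with denominator $\epsilon^2/2 - \epsilon^3/3$, the form in which the lemma is usually stated; `jl_dimension`, `gaussian_map`, and `sq_dist` are hypothetical names.

```python
import math
import random


def jl_dimension(m, eps):
    """Smallest integer t with t > 4 ln(m) / (eps^2/2 - eps^3/3)."""
    return math.floor(4.0 * math.log(m) / (eps ** 2 / 2 - eps ** 3 / 3)) + 1


def gaussian_map(n, t, seed=0):
    """Return f(u) = (1/sqrt(t)) * R u for a t x n matrix R of N(0, 1) entries."""
    rng = random.Random(seed)
    R = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(t)]
    scale = 1.0 / math.sqrt(t)

    def f(u):
        return [scale * sum(r * ui for r, ui in zip(row, u)) for row in R]

    return f


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))
```

A single draw of $R$ is only guaranteed to work with some positive probability, so in practice one would check the distortion on $V$ with `sq_dist` and redraw the matrix if any pair falls outside the $(1 \pm \epsilon)$ range.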
