B669 Sublinear Algorithms for Big Data
Qin Zhang
Part 1: Sublinear in Space
The model and challenge
The data stream model (Alon, Matias and Szegedy 1996).
[Figure: a stream of items a_m, ..., a_2, a_1 flowing past a CPU with a small RAM.]
Why hard? We cannot store everything.
Applications: Internet routers, stock data, ad auctions, flight logs on tape, etc.
§1.1 Point Queries (part I)
[Figure: a stream 9, 7, 6, 3, 3, 9, ... flowing past a CPU with a small RAM.]
Which items are most frequent? Approximation is allowed.
More about the streaming model
Denote the stream by A = a_1, ..., a_m, where m = n^O(1) is the length of the stream, which is unknown at the beginning. Let [n] be the item universe, and let x_j be the frequency of item j in the stream. Each a_i = (j, Δ) denotes the update x_j ← x_j + Δ.
We call an algorithm insertion-only if it only works for Δ = 1.
We can represent the stream as a vector x = (x_1, ..., x_n): when a_i = (j, Δ) arrives, x_j ← x_j + Δ.
– For insertion-only streams, m = ‖x‖_1.
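To make the model concrete, here is a minimal Python sketch (not from the slides) of maintaining the frequency vector x under a turnstile stream of (j, Δ) updates; the example updates are made up for illustration.

```python
from collections import defaultdict

def process_stream(updates):
    """Maintain the frequency vector x under turnstile updates (j, delta).

    Each update (j, delta) performs x_j <- x_j + delta; insertion-only
    streams are the special case delta = 1 for every update.
    """
    x = defaultdict(int)      # sparse view of x = (x_1, ..., x_n)
    for j, delta in updates:
        x[j] += delta
    return x

# Hypothetical example: item 3 appears twice; item 9 is inserted then deleted.
stream = [(3, 1), (9, 1), (3, 1), (9, -1)]
print(dict(process_stream(stream)))   # {3: 2, 9: 0}
```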
The MAJORITY problem
MAJORITY: if ∃ j : x_j > m/2, then output j; otherwise, output ⊥.
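A minimal sketch of one standard way to solve MAJORITY in a single pass over an insertion-only stream, the Boyer-Moore majority vote (not stated on this slide; it coincides with the Misra-Gries algorithm on the next slide restricted to a single counter):

```python
def majority_candidate(stream):
    """One-pass Boyer-Moore majority vote.

    Returns the only item that can satisfy x_j > m/2. A second pass
    (or exact counting) is needed to verify the candidate really is
    a majority; otherwise the correct answer is ⊥ (no majority).
    """
    candidate, count = None, 0
    for e in stream:
        if count == 0:
            candidate, count = e, 1
        elif e == candidate:
            count += 1
        else:
            count -= 1
    return candidate

# Example: 3 occurs 4 times out of 7, so it is a majority.
print(majority_candidate([3, 9, 3, 7, 3, 3, 9]))   # 3
```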
Heavy hitters and point queries
L_p heavy hitter set: HH^p_φ(x) = { i : |x_i| ≥ φ‖x‖_p }.
L_p Heavy Hitter Problem: given φ, φ′ (often φ′ = φ − ε), return a set S such that HH^p_φ(x) ⊆ S ⊆ HH^p_{φ′}(x).
L_p Point Query Problem: given ε, after reading the whole stream and given a query i, report x̃_i = x_i ± ε‖x‖_p.
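A tiny illustration of these definitions in Python (my own example, not from the slides):

```python
def heavy_hitters(x, phi, p=1):
    """Return HH^p_phi(x) = { i : |x_i| >= phi * ||x||_p } (items 1-indexed)."""
    norm_p = sum(abs(v) ** p for v in x) ** (1.0 / p)
    return {i + 1 for i, v in enumerate(x) if abs(v) >= phi * norm_p}

x = (6, 1, 1, 2)                      # ||x||_1 = 10
print(heavy_hitters(x, phi=0.5))      # {1}
print(heavy_hitters(x, phi=0.2))      # {1, 4}
```

For the same x, an L_1 point query with ε = 0.1 may report any x̃_i within ±1 of the true x_i.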
The Misra-Gries algorithm
The algorithm (Misra-Gries '82)
1. Maintain a set A; each entry is an item-counter pair (i, x̃_i). Initially A ← ∅.
2. For each new item e:
   (a) if e ∈ A, then set (e, x̃_e) ← (e, x̃_e + 1);
   (b) else if |A| < 1/ε, add (e, 1) to A;
   (c) else, for each e′ ∈ A, set (e′, x̃_{e′}) ← (e′, x̃_{e′} − 1), and if x̃_{e′} becomes 0, remove (e′, 0) from A.
3. On query i, if i ∈ A, return x̃_i; otherwise return 0.
Analysis (on board)
Theorem. Misra-Gries uses O(1/ε · log n) bits, and for any j produces an estimate x̃_j satisfying x_j − εm ≤ x̃_j ≤ x_j.
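A minimal Python sketch of the Misra-Gries algorithm as stated above (the parameter eps and the query interface mirror the slide; this is an illustration, not reference code):

```python
class MisraGries:
    """Deterministic point queries over an insertion-only stream.

    Keeps at most k = floor(1/eps) counters; every estimate satisfies
    x_j - eps*m <= query(j) <= x_j, where m is the stream length.
    """
    def __init__(self, eps):
        self.k = int(1 / eps)
        self.counters = {}          # the set A: item -> counter

    def update(self, e):
        if e in self.counters:
            self.counters[e] += 1
        elif len(self.counters) < self.k:
            self.counters[e] = 1
        else:
            # decrement every counter; drop counters that reach 0
            for item in list(self.counters):
                self.counters[item] -= 1
                if self.counters[item] == 0:
                    del self.counters[item]

    def query(self, i):
        return self.counters.get(i, 0)

mg = MisraGries(eps=0.25)
for e in [9, 7, 6, 3, 3, 9, 9, 9]:
    mg.update(e)
print(mg.query(9), mg.query(7))   # 4 1 here; estimates never exceed true counts
```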
Space-saving: an algorithm for insertion-only streams
Algorithm Space-saving [Metwally et al. '05]
We keep an array of at most 1/ε tuples (e, x̃_e), sorted by x̃_e. When a new item e arrives, there are two cases.
1. If e is already in the array, increment x̃_e by 1 and reinsert (e, x̃_e) into the array.
2. If e is not in the array, create a new tuple (e, MIN + 1), where MIN = min{ x̃_{e′} : e′ is in the array } (MIN = 0 if the array is not yet full). Since the array is kept sorted by x̃, MIN is just the estimated frequency of the last item. If the length of the array exceeds 1/ε, delete the last tuple.
On a query for e, report x̃_e if e is in the array; otherwise report MIN.
Theorem (analysis on board). Space-saving uses O(1/ε · log n) bits, and for any j produces an estimate x̃_j satisfying x_j ≤ x̃_j ≤ x_j + εm.
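A minimal Python sketch of Space-saving as described above. For simplicity it uses a plain dict instead of a sorted array (the sorted array only matters for finding MIN quickly), and when the structure is not yet full a new item enters with count 1, i.e. MIN is treated as 0.

```python
class SpaceSaving:
    """Insertion-only point queries with x_j <= query(j) <= x_j + eps*m."""
    def __init__(self, eps):
        self.k = int(1 / eps)       # keep at most k counters
        self.counters = {}          # item -> estimated frequency

    def update(self, e):
        if e in self.counters:
            self.counters[e] += 1
        elif len(self.counters) < self.k:
            self.counters[e] = 1    # structure not yet full: MIN is 0
        else:
            # evict an item with the minimum count and give the new item
            # that count plus one (the slide's "create (e, MIN+1), drop last")
            victim = min(self.counters, key=self.counters.get)
            min_val = self.counters.pop(victim)
            self.counters[e] = min_val + 1

    def query(self, e):
        if e in self.counters:
            return self.counters[e]
        return min(self.counters.values()) if self.counters else 0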
§1.2 Distinct Elements
[Figure: a stream 9, 7, 6, 3, 3, 9, ... flowing past a CPU with a small RAM.]
How many distinct elements? Approximation is needed.
Universal hash function
A family H ⊆ { h : X → Y } is said to be 2-universal if the following property holds when h ∈_R H is picked uniformly at random:
∀ x, x′ ∈ X, ∀ y, y′ ∈ Y:  x ≠ x′ ⇒ Pr_h[ h(x) = y ∧ h(x′) = y′ ] = 1/|Y|².
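A minimal sketch of the standard construction h_{a,b}(x) = ((a·x + b) mod p) mod |Y|, with p a prime larger than the universe. This construction is not on the slide and is included only as an illustration; the final reduction mod |Y| makes the family only approximately 2-universal.

```python
import random

def make_hash(universe_size, range_size, p=2_147_483_647):
    """Sample h from the family h_{a,b}(x) = ((a*x + b) mod p) mod range_size.

    p is a prime larger than the universe; a, b are chosen uniformly at
    random, giving (approximately) the 2-universal property above.
    """
    assert universe_size < p
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % range_size

h = make_hash(universe_size=10**6, range_size=2**20)
print(h(42), h(43))   # two (likely different) buckets
```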
The Flajolet-Martin algorithm
The algorithm (Flajolet and Martin '83)
1. Choose a random hash function h : [n] → [n] from a 2-universal family. Set z = 0. Let zeros(h(e)) be the number of trailing zeros in the binary representation of h(e).
2. For each new item e, if zeros(h(e)) > z, then set z = zeros(h(e)).
3. Output 2^{z+0.5}.
Analysis (on board)
Theorem. The number of distinct elements can be O(1)-approximated with probability 2/3 using O(log n) bits.
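A minimal Python sketch of the estimator above, using the same (a·x + b) mod p style hash as in the earlier hash-family example (my choice, not specified on the slide); the output is only a constant-factor estimate, with the 2/3-probability guarantee of the theorem.

```python
import random

def trailing_zeros(v):
    """Number of trailing zero bits of v (zeros(0) treated as a large value)."""
    if v == 0:
        return 64
    return (v & -v).bit_length() - 1

def flajolet_martin(stream, n, p=2_147_483_647):
    """One-pass O(1)-factor estimate of the number of distinct elements."""
    a, b = random.randrange(1, p), random.randrange(0, p)
    h = lambda e: ((a * e + b) % p) % n      # 2-universal hash into [n]
    z = 0
    for e in stream:
        z = max(z, trailing_zeros(h(e)))
    return 2 ** (z + 0.5)

stream = [random.randrange(1000) for _ in range(10_000)]
print(flajolet_martin(stream, n=2**20), len(set(stream)))   # rough vs. exact
```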
Probability amplification
Can we boost the success probability to 1 − δ? The idea is to run k = Θ(log(1/δ)) copies of this algorithm in parallel, using mutually independent random hash functions, and output the median of the k answers.
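A sketch of this median trick as a generic wrapper (the estimator interface with .update and .estimate is hypothetical, used only for this illustration): by a Chernoff bound, if each copy succeeds with probability at least 2/3, the median of k = Θ(log(1/δ)) independent copies fails with probability at most δ.

```python
import math
import statistics

def median_of_copies(make_estimator, stream, delta):
    """Boost a success-probability-2/3 estimator to probability 1 - delta.

    make_estimator() must return a fresh, independently seeded one-pass
    estimator with .update(e) and .estimate() methods (hypothetical API).
    """
    k = max(1, math.ceil(18 * math.log(1 / delta)))   # constant is illustrative
    copies = [make_estimator() for _ in range(k)]
    for e in stream:
        for c in copies:
            c.update(e)
    return statistics.median(c.estimate() for c in copies)
```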
An improved algorithm
Idea: two-level hashing.
The algorithm (Bar-Yossef et al. '02)
1. Choose a random hash function h : [n] → [n] from a 2-universal family. Set z = 0, B = ∅. Choose a secondary 2-universal hash function g : [n] → [(log n/ε)^{O(1)}].
2. For each new item e, if zeros(h(e)) ≥ z, then
   (a) set B ← B ∪ { (g(e), zeros(h(e))) };
   (b) if |B| > c/ε², set z ← z + 1 and remove all (α, β) in B with β < z.
3. Output |B| · 2^z.
Analysis (on board)
Theorem. The number of distinct elements can be (1 + ε)-approximated with probability 2/3 using O(log n + 1/ε² · (log(1/ε) + log log n)) bits.
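A minimal Python sketch of the two-level algorithm above. The hash construction, the range chosen for g, and the constant c are my own illustrative choices; B is stored as a Python set of pairs, exactly as on the slide.

```python
import math
import random

def trailing_zeros(v):
    """Number of trailing zero bits of v (zeros(0) treated as a large value)."""
    if v == 0:
        return 64
    return (v & -v).bit_length() - 1

def two_universal(range_size, p=2_147_483_647):
    """Sample h(x) = ((a*x + b) mod p) mod range_size from a 2-universal family."""
    a, b = random.randrange(1, p), random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % range_size

def bjkst(stream, n, eps, c=24):
    """(1 + eps)-approximation of the number of distinct elements (sketch)."""
    h = two_universal(n)                                   # primary hash into [n]
    g = two_universal(int((math.log2(n) / eps) ** 3) + 1)  # secondary, small range
    z, B = 0, set()                                        # B holds pairs (g(e), zeros(h(e)))
    for e in stream:
        ze = trailing_zeros(h(e))
        if ze >= z:
            B.add((g(e), ze))
            if len(B) > c / eps ** 2:
                z += 1
                B = {(alpha, beta) for (alpha, beta) in B if beta >= z}
    return len(B) * 2 ** z

stream = [random.randrange(5000) for _ in range(50_000)]
print(bjkst(stream, n=2**20, eps=0.1), len(set(stream)))   # estimate vs. exact
```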
§1.3 Linear Sketches
We have seen Misra-Gries, Space-saving, Flajolet-Martin and its improvement.
Nice algorithms, but they only work for insertion-only streams ... Can we handle deletions?
A popular way is to use linear sketches.
Linear sketch
A random linear projection M : R^n → R^k, with k ≪ n, that preserves the properties of any v ∈ R^n that we care about with high probability.
[Figure: the matrix M applied to the vector v gives the sketch Mv, from which the answer is extracted.]
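A minimal sketch (my own illustration, not a specific algorithm from the slides) of why linear sketches handle deletions: since the stored summary is M·x, an update (j, Δ) just adds Δ times column j of M, and sketches of separate streams can simply be added.

```python
import random

class LinearSketch:
    """Maintain Mx for a random dense +/-1 matrix M under turnstile updates.

    This only demonstrates linearity; a practical linear sketch (e.g.
    Count-Sketch or CountMin) uses hash functions instead of an explicit
    k x n matrix.
    """
    def __init__(self, n, k, seed=0):
        rng = random.Random(seed)
        self.M = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]
        self.sketch = [0] * k

    def update(self, j, delta):
        # (j, delta) contributes delta * (column j of M) to the sketch Mx
        for r in range(len(self.sketch)):
            self.sketch[r] += delta * self.M[r][j]

s = LinearSketch(n=1000, k=20)
s.update(3, +1)
s.update(9, +1)
s.update(3, +1)
s.update(9, -1)
# The sketch now equals M x for x with x_3 = 2 and x_9 = 0: deletions are
# absorbed automatically, and sketches of two streams add coordinate-wise.
print(s.sketch)
```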