FPGA Acceleration for the Frequent Item Problem
Jens Teubner, René Müller, Gustavo Alonso
ETH Zurich, Systems Group
not (only) about FPGAs
not about a new solution to the frequent item problem
2 / 17
Frequent Item Problem: Given a stream S of items x_i, which items occur most often in S?

Solution [Metwally et al. 2006]:
foreach stream item x ∈ S do
    find bin b_x with b_x.item = x;              ◁ lookup by item
    if such a bin was found then
        b_x.count ← b_x.count + 1;
    else
        b_min ← bin with minimum count value;    ◁ lookup by count
        b_min.count ← b_min.count + 1;
        b_min.item ← x;
3 / 17
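The algorithm above (Space-Saving) can be sketched in plain Python. This is an illustrative software model, not the authors' implementation; the dict-based bin store and the parameter `k` (number of bins monitored) are assumptions for the sketch.

```python
def space_saving(stream, k):
    """Track the (approximately) most frequent items of `stream`
    using at most `k` bins [Metwally et al. 2006]."""
    bins = {}  # item -> count; at most k entries
    for x in stream:
        if x in bins:                        # lookup by item
            bins[x] += 1
        elif len(bins) < k:                  # a free bin is available
            bins[x] = 1
        else:                                # lookup by count
            b_min = min(bins, key=bins.get)  # bin with minimum count
            count = bins.pop(b_min)
            bins[x] = count + 1              # replace item, inherit count
    return bins
```

Note the Space-Saving invariant: the counts always sum to the stream length, and every item occurring more than (stream length)/k times is guaranteed to occupy a bin.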
[Figure: software throughput (million items/sec) vs. number of items monitored (16–1024), for Zipf skew z = ∞, 2, 1.5, 1, 0; throughput decreases as the number of monitored items grows]
(Intel T9550 @ 2.66 GHz; code by Cormode and Hadjieleftheriou, VLDB 2008)
4 / 17
Tricks on FPGAs:
content-addressable memory ("hash table on steroids")
dual-ported memory (min-heap maintenance speed-up)
5 / 17
[Figure: throughput (million items/sec) vs. number of items monitored (16–1024), hardware vs. software]
→ data dependent, not scalable
6 / 17
Lesson 1: FPGAs are not a silver bullet. 7 / 17
Idea: Parallelize

[Diagram: a coordinator broadcasts item x_i to bins 1…k; the bins report b_x and b_min back (reduction over item match and count); the coordinator issues the update]

1 Broadcast input item x_i to all bins.
2 Reduce to determine b_x and b_min.
3 Update b_x / b_min.
8 / 17
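The broadcast/reduce scheme can be modeled sequentially in software. A minimal sketch, assuming bins are `[item, count]` pairs; the function name and list representation are illustrative, not taken from the talk:

```python
def process_item(bins, x):
    """Data-parallel variant, modeled sequentially: broadcast x to
    all bins, reduce to find b_x (matching bin) and b_min (bin with
    the minimum count), then update exactly one bin."""
    # reduction 1: lookup by item (in hardware: all bins compare at once)
    b_x = next((b for b in bins if b[0] == x), None)
    if b_x is not None:
        b_x[1] += 1                              # update b_x
    else:
        # reduction 2: lookup by count (minimum over all bins)
        b_min = min(bins, key=lambda b: b[1])
        b_min[0] = x                             # update b_min
        b_min[1] += 1
    return bins
```

In hardware, both reductions require a comparator tree spanning all k bins, which is exactly the long-distance communication the next slides identify as the bottleneck.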
[Figure: FPGA data-parallel throughput (million items/sec) vs. number of items monitored (16–1024), for z = ∞, 1.5, 0]
→ still not scalable
9 / 17
What went wrong?
[Diagram: the coordinator is wired to every bin 1…k, requiring long signal paths]
Lesson 2: Avoid long-distance communication.
10 / 17
Can we keep processing local? (avoid long-distance communication)
11 / 17
Pipeline-Style Processing:
[Diagram: item x_1 travels along the bin array b_{i−1}, b_i, b_{i+1}, b_{i+2}; tests b_i.item = x_1 and b_i.count < b_{i+1}.count]

1 Compare input item x_1 to the content of bin b_i (and increment the count value if a match was found).
2 Order bins b_i and b_{i+1} according to their count values.
3 Move x_1 forward in the array and repeat.
→ Drop x_1 into the last bin if no match was found.
12 / 17
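The steps above can be sketched as a sequential software model of one item's pass through the bin array. The function name and `[item, count]` lists are illustrative; in the real hardware, compare and reorder happen concurrently in one clock tick per stage, so this sketch orders neighbors before the compare so a bin swapped forward is not skipped:

```python
def pipeline_pass(bins, x):
    """One pass of item x through the bin array (software model of
    the pipeline): compare against each bin in turn while keeping
    adjacent bins ordered by count (larger counts first)."""
    matched = False
    for i in range(len(bins)):
        # keep bins b_i and b_{i+1} ordered by their count values;
        # one such bubble pass pushes the minimum count to the end
        if i + 1 < len(bins) and bins[i][1] < bins[i + 1][1]:
            bins[i], bins[i + 1] = bins[i + 1], bins[i]
        # compare x to bin b_i; increment its count on a match
        if not matched and bins[i][0] == x:
            bins[i][1] += 1
            matched = True
    if not matched:
        # no match anywhere: drop x into the last bin,
        # which now holds the minimum count
        bins[-1] = [x, bins[-1][1] + 1]
    return bins
```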
Per-item work grows from O(1) to O(#bins)? But: the pipeline can be parallelized well.
13 / 17
Pipeline Parallelism:
[Diagram: items x_2 and x_1 are processed simultaneously at bins b_{i−1} and b_{i+1}; tests b_{i−1}.item = x_2 and b_{i+1}.item = x_1, orderings b_{i−1}.count < b_i.count and b_{i+1}.count < b_{i+2}.count]
→ O(#bins) work per item, but #bins items in flight at once: throughput stays independent of the number of bins.
14 / 17
[Figure: throughput (million items/sec) vs. number of items monitored (16–1024); FPGA (pipeline parallel) reaches up to ≈100 million items/sec, above both FPGA (data parallel) and software]
15 / 17
Lesson 3: Pipelining → scalability, performance. 16 / 17
Lessons learned:
1. FPGAs are not a silver bullet. A straightforward s/w → h/w mapping will not do the job.
2. Avoid long-distance communication. Signal propagation delays will limit scalability.
3. Pipelining → scalability, performance. Keep communication and synchronization cheap.
Frequent item solution: three times faster than software, data independent.
This work was supported by the Swiss National Science Foundation.
17 / 17