High Performance Combinatorial Algorithm Design on the Cell/B.E.
David A. Bader, Virat Agarwal, Kamesh Madduri, Seunghwa Kang, Sulabh Patel
Cell System Features
• Heterogeneous multi-core system architecture consists of:
  – Power Processor Element for control tasks
  – Synergistic Processor Elements for data-intensive processing
• Synergistic Processor Element (SPE):
  – Synergistic Processor Unit (SPU)
  – Synergistic Memory Flow Control (MFC): data movement & synchronization; interface to the high-performance Element Interconnect Bus
Cellbuzz @ Georgia Tech
• List ranking
• Fast Fourier Transform
• Zlib Compression/Decompression
• RC5 Encryption/Decryption
• MPEG2
Open source, available at: http://sourceforge.net/projects/cellbuzz
List Ranking on Cell
• Cell performs well for applications with predictable memory access patterns [Williams et al. 2006]
• Conjecture: Can the Cell architecture also perform well for applications that exhibit irregular memory access patterns?
  – Non-contiguous accesses to global data structures with low degrees of locality
• List ranking is a special case of parallel prefix where the values are initially set to 1 (except for the head) and addition is used as the operator.
A Parallel Algorithm for List Ranking
• SMP algorithm [Helman & JáJá, 1999] — a sequential C sketch of these four steps appears below
  1. Partition the input list into s sublists by randomly choosing s sublist head nodes, one from each memory block of n/(s − 1) nodes.
  2. Traverse each sublist, computing the prefix sum of each node within the sublist.
  3. Calculate prefix sums of the sublist head nodes.
  4. Traverse the sublists again, summing the prefix sum value of each node with the value of its sublist head node.
• Design issues
  – Frequent DMA transfers are required to fetch successor elements.
  – There is no significant computation in the algorithm, so communication becomes the bottleneck.
  – DMA latency must be hidden by overlapping computation with communication.
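A minimal sequential sketch of the four phases above, not the authors' Cell code: splitter choice is made deterministic here for clarity (the paper chooses splitters randomly), and all names (`succ`, `splat`, `id_of`, etc.) are illustrative.

```c
#include <stdlib.h>

/* succ[i]: successor index (-1 at the tail); head: list head node;
   s: number of sublists; rank[i]: output distance from the head.  */
void list_rank(int n, const int *succ, int head, int s, int *rank)
{
    /* Phase 1: choose s sublist heads, the true head among them. */
    int  *splat = malloc(s * sizeof(int));
    char *mark  = calloc(n, 1);
    splat[0] = head; mark[head] = 1;
    for (int i = 1; i < s; i++) {
        int v = i * (n / s);              /* deterministic for clarity */
        while (mark[v]) v = (v + 1) % n;  /* avoid duplicate splitters */
        splat[i] = v; mark[v] = 1;
    }
    int *id_of = malloc(n * sizeof(int)); /* node -> splitter index */
    for (int i = 0; i < n; i++) id_of[i] = -1;
    for (int i = 0; i < s; i++) id_of[splat[i]] = i;

    /* Phase 2: walk each sublist, writing local prefix sums and
       recording the sublist's total and its successor sublist. */
    int *next_sub = malloc(s * sizeof(int));
    int *total    = malloc(s * sizeof(int));
    for (int i = 0; i < s; i++) {
        int sum = 0, v = splat[i];
        do {
            rank[v] = sum;                /* local rank, fixed in phase 4 */
            sum += 1;                     /* all values are 1 for ranking */
            v = succ[v];
        } while (v != -1 && id_of[v] == -1);
        total[i]    = sum;
        next_sub[i] = (v == -1) ? -1 : id_of[v];
    }

    /* Phase 3: prefix-sum the sublist totals in list order. */
    int *prefix = malloc(s * sizeof(int));
    for (int i = id_of[head], acc = 0; i != -1; i = next_sub[i]) {
        prefix[i] = acc;
        acc += total[i];
    }

    /* Phase 4: add each sublist head's prefix to its nodes. */
    for (int i = 0; i < s; i++) {
        int v = splat[i];
        do { rank[v] += prefix[i]; v = succ[v]; }
        while (v != -1 && id_of[v] == -1);
    }
    free(splat); free(mark); free(id_of);
    free(next_sub); free(total); free(prefix);
}
```

On the Cell, the `v = succ[v]` step in phases 2 and 4 becomes a DMA fetch from main memory, which is exactly where the latency-hiding technique on the next slide applies.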
A Generic Latency-Hiding Technique
• Cell supports non-blocking memory transfers.
• Requires identifying another level of parallelism within each SPE.
• Concept of software-managed threads (SM-Threads), sketched below:
  – SPE computation is distributed among these threads.
  – SM-Threads are scheduled according to a round-robin policy.
• Instruction-level profiling determines the minimum number of SM-Threads needed to hide latency.
  – Tradeoff between latency and the number of SM-Threads.
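An illustrative sketch of the SM-Thread idea (our own reconstruction, not the paper's code): each software thread owns one in-flight DMA, tagged with its thread id, and the round-robin scheduler waits only on the current thread's tag, so one thread's computation overlaps the other threads' transfers. The `node_t`/`smthread_t` layouts are assumptions.

```c
#include <spu_mfcio.h>
#include <stdint.h>

#define NTHREADS 8               /* minimum count found via profiling */

typedef struct {                 /* list node as stored in main memory */
    uint64_t next_ea;            /* effective address of successor (0 = tail) */
    int32_t  value;
    char     pad[116];           /* pad to a full 128-byte line */
} node_t;

typedef struct {
    node_t   buf __attribute__((aligned(128)));  /* DMA landing buffer */
    uint64_t next_ea;            /* target of the pending mfc_get */
    int32_t  rank;               /* running prefix for this sublist */
    int32_t  done;
} smthread_t;

void traverse_sublists(smthread_t th[NTHREADS])
{
    int live = NTHREADS;

    /* Prime one non-blocking fetch per SM-Thread; DMA tag == thread id. */
    for (int t = 0; t < NTHREADS; t++)
        mfc_get(&th[t].buf, th[t].next_ea, sizeof(node_t), t, 0, 0);

    while (live > 0) {
        for (int t = 0; t < NTHREADS; t++) {     /* round-robin schedule */
            if (th[t].done) continue;
            mfc_write_tag_mask(1 << t);          /* wait on this tag only; */
            mfc_read_tag_status_all();           /* other DMAs keep flying */
            th[t].rank   += th[t].buf.value;     /* the (small) computation */
            th[t].next_ea = th[t].buf.next_ea;
            if (th[t].next_ea == 0) { th[t].done = 1; live--; continue; }
            mfc_get(&th[t].buf, th[t].next_ea,   /* next non-blocking get */
                    sizeof(node_t), t, 0, 0);
        }
    }
}
```

With enough threads, by the time the scheduler returns to thread t, its DMA has completed and `mfc_read_tag_status_all()` returns without stalling — this is the latency/thread-count tradeoff the profiling measures.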
List Ranking: Performance Analysis
[Figures: tuning the DMA parameter — running time (msec) and improvement factor vs. number of DMA buffers (1, 2, 4, 8) for random and ordered lists; running time of PPE-only vs. Cell-optimized list ranking on ordered and random lists for list sizes 2^16 to 2^23.]
List Ranking: Performance Analysis — Comparison with Other Architectures
• 2.5 times faster than an optimized parallel implementation on a dual-core Woodcrest (Intel Xeon 5150)
  – 4.6 times faster than a single core.
ZLIB: Data Compression/Decompression
• LZ77 algorithm [J. Ziv & A. Lempel, 1977]
  – Identify the longest repeating string from the previous data.
  – Replace the duplicated string with a reference to its previous occurrence.
  – A reference is represented by a length-distance pair.
  – Length-distance pairs and literals produced by the LZ77 algorithm are Huffman coded to enhance the compression ratio.
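A minimal greedy LZ77 sketch of the bullet points above (illustrative only; zlib's deflate layers hash chains, lazy matching, and Huffman coding on top of this idea — the window size and minimum match length below are assumptions):

```c
#include <stdio.h>

#define WINDOW    4096   /* how far back we search for matches */
#define MIN_MATCH 3      /* shorter matches are emitted as literals */

/* Scan input, emitting (length, distance) pairs or literal bytes. */
void lz77(const unsigned char *in, int n)
{
    int i = 0;
    while (i < n) {
        int best_len = 0, best_dist = 0;
        int start = (i > WINDOW) ? i - WINDOW : 0;
        for (int j = start; j < i; j++) {         /* longest match search */
            int len = 0;
            while (i + len < n && in[j + len] == in[i + len]) len++;
            if (len > best_len) { best_len = len; best_dist = i - j; }
        }
        if (best_len >= MIN_MATCH) {
            printf("(len=%d, dist=%d)\n", best_len, best_dist);
            i += best_len;
        } else {
            printf("literal 0x%02x\n", in[i]);
            i++;
        }
    }
}
```

The inner longest-match search is exactly the compression hot spot the next slide moves onto the SPE.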
ZLIB: Optimization for the Cell
• Optimizing on the SPE (the most compute-intensive parts)
  – Compression: finding longest matches in the LZ77 algorithm
  – Decompression: converting Huffman-coded data to length-distance pairs
  – Reduce memory requirement
• Parallelizing for the SPEs (a worker sketch follows below)
  – Full flushing to break data dependencies
  – Work queue to achieve load balancing
  – Extending the gzip header format to enable faster decompression
    • Include information on flush points
    • Keep it compatible with legacy gzip decompressors
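A sketch of the work-queue idea (our own illustration): full flushes make each block independently decompressible, so SPEs can pull blocks from a shared queue until it drains. `atomic_fetch_inc` and `inflate_block` are assumed helpers, not library calls — on the Cell an atomic counter would typically be built from lock-line reservation primitives, and the per-block sizes come from the extended header's flush-point table.

```c
#include <stdint.h>

typedef struct {
    uint64_t ea_in;      /* main-memory address of the compressed block */
    uint32_t in_bytes;   /* block size, from the flush-point table     */
    uint64_t ea_out;     /* where this block's output goes             */
} work_item_t;

extern uint32_t atomic_fetch_inc(uint64_t ea_counter);     /* assumed */
extern void     inflate_block(const work_item_t *w);       /* assumed:
                                     DMA in, inflate on-SPE, DMA out */

/* Each SPE loops, grabbing the next unclaimed block; faster SPEs
   simply take more blocks, which is the load balancing. The queue
   itself is assumed to have been copied into the local store. */
void spe_worker(uint64_t ea_counter, const work_item_t *queue,
                uint32_t n_items)
{
    for (;;) {
        uint32_t idx = atomic_fetch_inc(ea_counter);
        if (idx >= n_items) break;    /* queue drained */
        inflate_block(&queue[idx]);
    }
}
```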
GZIP Performance Results
[Figures: speedup of gzip compression and decompression, Cell-optimized vs. sequential gzip on a single SPE, at compression levels 1, 5, and 9, for a compressed file, a text file, and three bitmap files.]
GZIP Performance Results
[Figures: speedup of Cell-optimized gzip compression and decompression with 1–8 SPEs; running-time comparison of Cell-optimized gzip against a 3.2 GHz Intel processor.]
FFT on Cell
• Williams et al. analyzed the peak performance of FFTs of various types.
• Green and Cooper showed impressive results for an FFT of size 64K.
• Chow et al. developed a design for 16 million complex samples.
• FFTW supports FFTs of various sizes, types, and accuracies.
• None exhibit good performance for small input sizes.
FFT Algorithm Used: Cooley-Tukey
• Butterflies of ordered DIF FFT algorithm
• Out-of-place 1D FFT requires two arrays A & B for computation at each stage
• Saves the bit-reversal stage (see the sketch below)
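A scalar sketch of an ordered out-of-place radix-2 DIF FFT in the spirit of this slide: data ping-pongs between the two arrays each stage, and the outputs of every stage are written in sorted order, so no bit-reversal pass is needed. This is the textbook self-sorting (Stockham-style) formulation, not the authors' vectorized SPE code.

```c
#include <complex.h>
#include <math.h>

/* n must be a power of two; input in a, b is scratch of size n.
   On return, the DFT of a is back in a. */
void fft_ordered_dif(int n, double complex *a, double complex *b)
{
    double complex *src = a, *dst = b;

    /* l = half-length of the current sub-transforms, m = stride. */
    for (int l = n >> 1, m = 1; l >= 1; l >>= 1, m <<= 1) {
        for (int j = 0; j < l; j++) {
            /* twiddle factor for butterfly group j */
            double complex w =
                cexp(-2.0 * M_PI * I * (double)j / (double)(2 * l));
            for (int k = 0; k < m; k++) {
                double complex t0 = src[k + m * j];
                double complex t1 = src[k + m * (j + l)];
                dst[k + m * (2 * j)]     =  t0 + t1;       /* sum leg  */
                dst[k + m * (2 * j + 1)] = (t0 - t1) * w;  /* diff leg */
            }
        }
        /* swap roles of the two arrays for the next stage */
        double complex *tmp = src; src = dst; dst = tmp;
    }
    if (src != a)                       /* odd stage count: copy back */
        for (int i = 0; i < n; i++) a[i] = src[i];
}
```

The inner k-loop is contiguous in memory, which is what makes each stage amenable to SIMD and to chunked DMA on the SPEs.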
FFT Algorithm for the Cell/B.E.
• Parallelization
  – Number of chunks = 2·p, where p is the number of SPEs
  – Chunks i and i+p are allocated to SPE i
  – Each chunk is fetched using DMA get with multibuffering
• Tree synchronization (sketched below)
  – Synchronization after every stage using inter-SPE DMA communication, achieved in (2·log n) stages
  – Each synchronization stage takes 1 microsecond; PPU-coordinated synchronization takes 20 microseconds.
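A sketch of a tree barrier with an up-sweep (children report to parents) followed by a down-sweep (parents release children). `signal_send`/`signal_wait` are assumed helpers standing in for inter-SPE signal-notification writes; a real implementation must make signals from different children distinguishable (e.g., counted signals with the SPU signal registers in logical-OR mode).

```c
extern void signal_send(int spe_id);   /* assumed: notify that SPE */
extern void signal_wait(void);         /* assumed: block for one signal */

/* me: this SPE's id in [0, p); p: number of SPEs, a power of two. */
void tree_barrier(int me, int p)
{
    /* Up-sweep: at each level, the right child signals its parent. */
    for (int step = 1; step < p; step <<= 1) {
        if ((me & (2 * step - 1)) == step)
            signal_send(me - step);                       /* report up  */
        else if ((me & (2 * step - 1)) == 0 && me + step < p)
            signal_wait();                                /* hear child */
    }
    /* Down-sweep: parents release children in reverse order. */
    for (int step = p >> 1; step >= 1; step >>= 1) {
        if ((me & (2 * step - 1)) == 0 && me + step < p)
            signal_send(me + step);                       /* release    */
        else if ((me & (2 * step - 1)) == step)
            signal_wait();                                /* be released */
    }
}
```

Each SPE participates in at most 2·log p signal exchanges, which is why the inter-SPE tree beats the 20-microsecond PPU-coordinated barrier.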
FFT: Optimization for the SPE
• Loop duplication for stages 1 & 2
  – Vectorizing these stages requires spu_shuffle on the output vectors (example below).
• Loop duplication for NP < buffer size and otherwise
  – Need to stall for the DMA get at different places within the loop.
• Code size increases, which limits the size of FFT that can be computed.
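A small illustration of why the first two stages need shuffling (our example, not the authors' kernel): with four floats per quadword, the butterfly partners of stages 1 and 2 fall inside the same vector, so results must be permuted into place. The byte-granularity control pattern below interleaves words 0 and 1 of `a` with words 0 and 1 of `b` (pattern bytes 0x00–0x0F select from `a`, 0x10–0x1F from `b`):

```c
#include <spu_intrinsics.h>

/* Returns { a[0], b[0], a[1], b[1] }. */
vector float interleave_lo(vector float a, vector float b)
{
    const vector unsigned char pat = (vector unsigned char) {
        0x00, 0x01, 0x02, 0x03,   /* word 0 of a */
        0x10, 0x11, 0x12, 0x13,   /* word 0 of b */
        0x04, 0x05, 0x06, 0x07,   /* word 1 of a */
        0x14, 0x15, 0x16, 0x17    /* word 1 of b */
    };
    return spu_shuffle(a, b, pat);
}
```

From stage 3 onward the butterfly partners are at least a full quadword apart, so the vectorized loop needs no shuffles — hence the duplicated loop bodies and the code-size growth noted above.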
FFT: Design Challenges
• The synchronization step after every stage (log N stages) leads to significant overhead.
  – Minimize the synchronization time with a tree-based approach using inter-SPE communication.
• Limited local store
  – Requires space for twiddle factors and input data.
  – Loop unrolling and duplication increase the size of the code.
• The algorithm is memory-bound.
  – Use multibuffering, which further increases the required space in an already limited local store.
• The code is branchy, with a doubly nested for loop within the outer while loop; the lack of a branch predictor compromises performance.
FFT: Performance Analysis
[Figures: running time (microseconds) and performance improvement vs. number of SPEs (1–8) for 1K- and 8K-point FFTs; GigaFlop/s comparison of our optimized FFT implementation against IBM Power5, AMD Opteron, Intel Pentium 4, FFTW on Cell, and Intel Core Duo for input sizes 1024 to 16384.]
Operation count: 5·N·log N floating-point operations.