High Performance Combinatorial Algorithm Design on the Cell/B.E.
David A. Bader, Virat Agarwal, Kamesh Madduri, Seunghwa Kang, Sulabh Patel
Cell System Features
• Heterogeneous multi-core system architecture consists of:
  – Power Processor Element for control tasks
  – Synergistic Processor Elements for data-intensive processing
• Synergistic Processor Element (SPE):
  – Synergistic Processor Unit (SPU)
  – Synergistic Memory Flow Control (MFC): data movement & synchronization; interface to the high-performance Element Interconnect Bus
Cellbuzz @ Georgia Tech
• List ranking
• Fast Fourier Transform
• Zlib Compression/Decompression
• RC5 Encryption/Decryption
• MPEG2
Open source, available at: http://sourceforge.net/projects/cellbuzz
List Ranking on Cell
• Cell performs well for applications with predictable memory access patterns [Williams et al. 2006]
• Conjecture: Can the Cell architecture also perform well for applications that exhibit irregular memory access patterns?
  – Non-contiguous accesses to global data structures with low degrees of locality
• List ranking is a special case of parallel prefix where the values are initially set to 1 (except for the head) and addition is used as the operator.
A Parallel Algorithm for List Ranking
• SMP algorithm [Helman & JáJá, 1999] — a sequential C sketch of these four steps appears below
  1. Partition the input list into s sublists by randomly choosing s sublist head nodes, one from each memory block of n/(s − 1) nodes.
  2. Traverse each sublist, computing the prefix sum of each node within the sublist.
  3. Calculate prefix sums of the sublist head nodes.
  4. Traverse the sublists again, summing the prefix sum value of each node with the value of its sublist head node.
• Design issues
  – Frequent DMA transfers are required to fetch successor elements.
  – There is no significant computation in the algorithm, so communication becomes the bottleneck.
  – DMA latency must be hidden by overlapping computation with communication.
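A minimal sequential sketch of the four phases above, not the authors' Cell code: splitter choice is made deterministic here for clarity (the paper chooses splitters randomly), and all names (`succ`, `splat`, `id_of`, etc.) are illustrative.

```c
#include <stdlib.h>

/* succ[i]: successor index (-1 at the tail); head: list head node;
   s: number of sublists; rank[i]: output distance from the head.  */
void list_rank(int n, const int *succ, int head, int s, int *rank)
{
    /* Phase 1: choose s sublist heads, the true head among them. */
    int  *splat = malloc(s * sizeof(int));
    char *mark  = calloc(n, 1);
    splat[0] = head; mark[head] = 1;
    for (int i = 1; i < s; i++) {
        int v = i * (n / s);              /* deterministic for clarity */
        while (mark[v]) v = (v + 1) % n;  /* avoid duplicate splitters */
        splat[i] = v; mark[v] = 1;
    }
    int *id_of = malloc(n * sizeof(int)); /* node -> splitter index */
    for (int i = 0; i < n; i++) id_of[i] = -1;
    for (int i = 0; i < s; i++) id_of[splat[i]] = i;

    /* Phase 2: walk each sublist, writing local prefix sums and
       recording the sublist's total and its successor sublist. */
    int *next_sub = malloc(s * sizeof(int));
    int *total    = malloc(s * sizeof(int));
    for (int i = 0; i < s; i++) {
        int sum = 0, v = splat[i];
        do {
            rank[v] = sum;                /* local rank, fixed in phase 4 */
            sum += 1;                     /* all values are 1 for ranking */
            v = succ[v];
        } while (v != -1 && id_of[v] == -1);
        total[i]    = sum;
        next_sub[i] = (v == -1) ? -1 : id_of[v];
    }

    /* Phase 3: prefix-sum the sublist totals in list order. */
    int *prefix = malloc(s * sizeof(int));
    for (int i = id_of[head], acc = 0; i != -1; i = next_sub[i]) {
        prefix[i] = acc;
        acc += total[i];
    }

    /* Phase 4: add each sublist head's prefix to its nodes. */
    for (int i = 0; i < s; i++) {
        int v = splat[i];
        do { rank[v] += prefix[i]; v = succ[v]; }
        while (v != -1 && id_of[v] == -1);
    }
    free(splat); free(mark); free(id_of);
    free(next_sub); free(total); free(prefix);
}
```

On the Cell, the `v = succ[v]` step in phases 2 and 4 becomes a DMA fetch from main memory, which is exactly where the latency-hiding technique on the next slide applies.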
A Generic Latency-Hiding Technique
• Cell supports non-blocking memory transfers.
• Requires identifying another level of parallelism within each SPE.
• Concept of software-managed threads (SM-Threads), sketched below:
  – SPE computation is distributed among these threads.
  – SM-Threads are scheduled according to a round-robin policy.
• Instruction-level profiling determines the minimum number of SM-Threads needed to hide latency.
  – Tradeoff between latency and the number of SM-Threads.
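An illustrative sketch of the SM-Thread idea (our own reconstruction, not the paper's code): each software thread owns one in-flight DMA, tagged with its thread id, and the round-robin scheduler waits only on the current thread's tag, so one thread's computation overlaps the other threads' transfers. The `node_t`/`smthread_t` layouts are assumptions.

```c
#include <spu_mfcio.h>
#include <stdint.h>

#define NTHREADS 8               /* minimum count found via profiling */

typedef struct {                 /* list node as stored in main memory */
    uint64_t next_ea;            /* effective address of successor (0 = tail) */
    int32_t  value;
    char     pad[116];           /* pad to a full 128-byte line */
} node_t;

typedef struct {
    node_t   buf __attribute__((aligned(128)));  /* DMA landing buffer */
    uint64_t next_ea;            /* target of the pending mfc_get */
    int32_t  rank;               /* running prefix for this sublist */
    int32_t  done;
} smthread_t;

void traverse_sublists(smthread_t th[NTHREADS])
{
    int live = NTHREADS;

    /* Prime one non-blocking fetch per SM-Thread; DMA tag == thread id. */
    for (int t = 0; t < NTHREADS; t++)
        mfc_get(&th[t].buf, th[t].next_ea, sizeof(node_t), t, 0, 0);

    while (live > 0) {
        for (int t = 0; t < NTHREADS; t++) {     /* round-robin schedule */
            if (th[t].done) continue;
            mfc_write_tag_mask(1 << t);          /* wait on this tag only; */
            mfc_read_tag_status_all();           /* other DMAs keep flying */
            th[t].rank   += th[t].buf.value;     /* the (small) computation */
            th[t].next_ea = th[t].buf.next_ea;
            if (th[t].next_ea == 0) { th[t].done = 1; live--; continue; }
            mfc_get(&th[t].buf, th[t].next_ea,   /* next non-blocking get */
                    sizeof(node_t), t, 0, 0);
        }
    }
}
```

With enough threads, by the time the scheduler returns to thread t, its DMA has completed and `mfc_read_tag_status_all()` returns without stalling — this is the latency/thread-count tradeoff the profiling measures.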
List Ranking: Performance Analysis
[Figures: tuning the DMA parameter — running time (msec) and improvement factor vs. number of DMA buffers (1, 2, 4, 8) for random and ordered lists; running time of PPE-only vs. Cell-optimized list ranking on ordered and random lists for list sizes 2^16 to 2^23.]
List Ranking: Performance Analysis — Comparison with Other Architectures
• 2.5 times faster than an optimized parallel implementation on a dual-core Woodcrest (Intel Xeon 5150)
  – 4.6 times faster than a single core.
ZLIB: Data Compression/Decompression
• LZ77 algorithm [J. Ziv & A. Lempel, 1977]
  – Identify the longest repeating string from the previous data.
  – Replace the duplicated string with a reference to its previous occurrence.
  – A reference is represented by a length-distance pair.
  – Length-distance pairs and literals produced by the LZ77 algorithm are Huffman coded to enhance the compression ratio.
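A minimal greedy LZ77 sketch of the bullet points above (illustrative only; zlib's deflate layers hash chains, lazy matching, and Huffman coding on top of this idea — the window size and minimum match length below are assumptions):

```c
#include <stdio.h>

#define WINDOW    4096   /* how far back we search for matches */
#define MIN_MATCH 3      /* shorter matches are emitted as literals */

/* Scan input, emitting (length, distance) pairs or literal bytes. */
void lz77(const unsigned char *in, int n)
{
    int i = 0;
    while (i < n) {
        int best_len = 0, best_dist = 0;
        int start = (i > WINDOW) ? i - WINDOW : 0;
        for (int j = start; j < i; j++) {         /* longest match search */
            int len = 0;
            while (i + len < n && in[j + len] == in[i + len]) len++;
            if (len > best_len) { best_len = len; best_dist = i - j; }
        }
        if (best_len >= MIN_MATCH) {
            printf("(len=%d, dist=%d)\n", best_len, best_dist);
            i += best_len;
        } else {
            printf("literal 0x%02x\n", in[i]);
            i++;
        }
    }
}
```

The inner longest-match search is exactly the compression hot spot the next slide moves onto the SPE.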
ZLIB: Optimization for the Cell
• Optimizing on the SPE (the most compute-intensive parts)
  – Compression: finding longest matches in the LZ77 algorithm
  – Decompression: converting Huffman-coded data to length-distance pairs
  – Reduce memory requirement
• Parallelizing for the SPEs (a worker sketch follows below)
  – Full flushing to break data dependencies
  – Work queue to achieve load balancing
  – Extending the gzip header format to enable faster decompression
    • Include information on flush points
    • Keep it compatible with legacy gzip decompressors
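A sketch of the work-queue idea (our own illustration): full flushes make each block independently decompressible, so SPEs can pull blocks from a shared queue until it drains. `atomic_fetch_inc` and `inflate_block` are assumed helpers, not library calls — on the Cell an atomic counter would typically be built from lock-line reservation primitives, and the per-block sizes come from the extended header's flush-point table.

```c
#include <stdint.h>

typedef struct {
    uint64_t ea_in;      /* main-memory address of the compressed block */
    uint32_t in_bytes;   /* block size, from the flush-point table     */
    uint64_t ea_out;     /* where this block's output goes             */
} work_item_t;

extern uint32_t atomic_fetch_inc(uint64_t ea_counter);     /* assumed */
extern void     inflate_block(const work_item_t *w);       /* assumed:
                                     DMA in, inflate on-SPE, DMA out */

/* Each SPE loops, grabbing the next unclaimed block; faster SPEs
   simply take more blocks, which is the load balancing. The queue
   itself is assumed to have been copied into the local store. */
void spe_worker(uint64_t ea_counter, const work_item_t *queue,
                uint32_t n_items)
{
    for (;;) {
        uint32_t idx = atomic_fetch_inc(ea_counter);
        if (idx >= n_items) break;    /* queue drained */
        inflate_block(&queue[idx]);
    }
}
```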
GZIP Performance Results
[Figures: speedup of gzip compression and decompression, Cell-optimized vs. sequential gzip on a single SPE, at compression levels 1, 5, and 9, for a compressed file, a text file, and three bitmap files.]
GZIP Performance Results
[Figures: speedup of Cell-optimized gzip compression and decompression with 1–8 SPEs; running-time comparison of Cell-optimized gzip against a 3.2 GHz Intel processor.]
FFT on Cell
• Williams et al. analyzed the peak performance of FFTs of various types.
• Green and Cooper showed impressive results for an FFT of size 64K.
• Chow et al. developed a design for 16 million complex samples.
• FFTW supports FFTs of various sizes, types, and accuracies.
• None exhibit good performance for small input sizes.
FFT Algorithm Used: Cooley-Tukey
• Butterflies of ordered DIF FFT algorithm
• Out-of-place 1D FFT requires two arrays A & B for computation at each stage
• Saves the bit-reversal stage (see the sketch below)
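A scalar sketch of an ordered out-of-place radix-2 DIF FFT in the spirit of this slide: data ping-pongs between the two arrays each stage, and the outputs of every stage are written in sorted order, so no bit-reversal pass is needed. This is the textbook self-sorting (Stockham-style) formulation, not the authors' vectorized SPE code.

```c
#include <complex.h>
#include <math.h>

/* n must be a power of two; input in a, b is scratch of size n.
   On return, the DFT of a is back in a. */
void fft_ordered_dif(int n, double complex *a, double complex *b)
{
    double complex *src = a, *dst = b;

    /* l = half-length of the current sub-transforms, m = stride. */
    for (int l = n >> 1, m = 1; l >= 1; l >>= 1, m <<= 1) {
        for (int j = 0; j < l; j++) {
            /* twiddle factor for butterfly group j */
            double complex w =
                cexp(-2.0 * M_PI * I * (double)j / (double)(2 * l));
            for (int k = 0; k < m; k++) {
                double complex t0 = src[k + m * j];
                double complex t1 = src[k + m * (j + l)];
                dst[k + m * (2 * j)]     =  t0 + t1;       /* sum leg  */
                dst[k + m * (2 * j + 1)] = (t0 - t1) * w;  /* diff leg */
            }
        }
        /* swap roles of the two arrays for the next stage */
        double complex *tmp = src; src = dst; dst = tmp;
    }
    if (src != a)                       /* odd stage count: copy back */
        for (int i = 0; i < n; i++) a[i] = src[i];
}
```

The inner k-loop is contiguous in memory, which is what makes each stage amenable to SIMD and to chunked DMA on the SPEs.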
FFT Algorithm for the Cell/B.E.
• Parallelization
  – Number of chunks = 2·p, where p is the number of SPEs
  – Chunks i and i+p are allocated to SPE i
  – Each chunk is fetched using DMA get with multibuffering
• Tree synchronization (sketched below)
  – Synchronization after every stage using inter-SPE DMA communication, achieved in (2·log n) stages
  – Each synchronization stage takes 1 microsecond; PPU-coordinated synchronization takes 20 microseconds.
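A sketch of a tree barrier with an up-sweep (children report to parents) followed by a down-sweep (parents release children). `signal_send`/`signal_wait` are assumed helpers standing in for inter-SPE signal-notification writes; a real implementation must make signals from different children distinguishable (e.g., counted signals with the SPU signal registers in logical-OR mode).

```c
extern void signal_send(int spe_id);   /* assumed: notify that SPE */
extern void signal_wait(void);         /* assumed: block for one signal */

/* me: this SPE's id in [0, p); p: number of SPEs, a power of two. */
void tree_barrier(int me, int p)
{
    /* Up-sweep: at each level, the right child signals its parent. */
    for (int step = 1; step < p; step <<= 1) {
        if ((me & (2 * step - 1)) == step)
            signal_send(me - step);                       /* report up  */
        else if ((me & (2 * step - 1)) == 0 && me + step < p)
            signal_wait();                                /* hear child */
    }
    /* Down-sweep: parents release children in reverse order. */
    for (int step = p >> 1; step >= 1; step >>= 1) {
        if ((me & (2 * step - 1)) == 0 && me + step < p)
            signal_send(me + step);                       /* release    */
        else if ((me & (2 * step - 1)) == step)
            signal_wait();                                /* be released */
    }
}
```

Each SPE participates in at most 2·log p signal exchanges, which is why the inter-SPE tree beats the 20-microsecond PPU-coordinated barrier.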
FFT: Optimization for the SPE
• Loop duplication for stages 1 & 2
  – Vectorizing these stages requires spu_shuffle on the output vectors (example below).
• Loop duplication for NP < buffer size and otherwise
  – Need to stall for the DMA get at different places within the loop.
• Code size increases, which limits the size of FFT that can be computed.
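A small illustration of why the first two stages need shuffling (our example, not the authors' kernel): with four floats per quadword, the butterfly partners of stages 1 and 2 fall inside the same vector, so results must be permuted into place. The byte-granularity control pattern below interleaves words 0 and 1 of `a` with words 0 and 1 of `b` (pattern bytes 0x00–0x0F select from `a`, 0x10–0x1F from `b`):

```c
#include <spu_intrinsics.h>

/* Returns { a[0], b[0], a[1], b[1] }. */
vector float interleave_lo(vector float a, vector float b)
{
    const vector unsigned char pat = (vector unsigned char) {
        0x00, 0x01, 0x02, 0x03,   /* word 0 of a */
        0x10, 0x11, 0x12, 0x13,   /* word 0 of b */
        0x04, 0x05, 0x06, 0x07,   /* word 1 of a */
        0x14, 0x15, 0x16, 0x17    /* word 1 of b */
    };
    return spu_shuffle(a, b, pat);
}
```

From stage 3 onward the butterfly partners are at least a full quadword apart, so the vectorized loop needs no shuffles — hence the duplicated loop bodies and the code-size growth noted above.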
FFT: Design Challenges
• The synchronization step after every stage (log N stages) leads to significant overhead.
  – Minimize the synchronization time with a tree-based approach using inter-SPE communication.
• Limited local store
  – Requires space for twiddle factors and input data.
  – Loop unrolling and duplication increase the size of the code.
• The algorithm is memory-bound.
  – Use multibuffering, which further increases the required space in an already limited local store.
• The code is branchy, with a doubly nested for loop within the outer while loop; the lack of a branch predictor compromises performance.
FFT: Performance Analysis
[Figures: running time (microseconds) and performance improvement vs. number of SPEs (1–8) for 1K- and 8K-point FFTs; GigaFlop/s comparison of our optimized FFT implementation against IBM Power5, AMD Opteron, Intel Pentium 4, FFTW on Cell, and Intel Core Duo for input sizes 1024 to 16384.]
Operation count: 5·N·log N floating-point operations.