A High-Level Signal Processing Library for Multicore Processors

Sharon Sacco, Nadya Bliss, Ryan Haney, Jeremy Kepner, Hahn Kim, Sanjeev Mohindra, Glenn Schrader and Edward Rutledge

This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

MIT Lincoln Laboratory
Outline

• Overview
• HPEC Challenge Benchmarks
• Parallel Vector Tile Optimizing Library
• Summary
Embedded Digital Systems

[Figure: Lincoln Laboratory embedded digital system activities — open system technologies, rapid system prototyping, and the next-generation warfighting vision — spanning programs such as the High Performance Embedded Computing Software Initiative, widebody airborne sensor platform, Greenbank, next-gen radar, Triple Canopy, foliage penetration, open systems architecture, advanced hardware implementations (receiver-on-chip, Very Large Scale Integration / Field Programmable Gate Array hybrids), network and decision support initiatives, integrated sensing and decision support, and LLGrid.]
Embedded Processor Evolution

[Chart: MFLOPS / Watt of high performance embedded processors vs. year, 1990–2010, showing the i860 and i860 XR, SHARC, PowerPC 603e and 750, PowerPC with AltiVec (MPC7400, MPC7410, MPC7447A), and the Cell processor (estimated).]

• 20 years of exponential growth in FLOPS / Watt
• Requires switching architectures every ~5 years
• Cell processor is current high performance architecture
Outline

• Overview
• HPEC Challenge Benchmarks
  – Time-Domain FIR Filter
  – Results
• Parallel Vector Tile Optimizing Library
• Summary
HPEC Challenge: Information and Knowledge Processing Kernels

Genetic Algorithm
• Evaluation: evaluate each chromosome
• Selection: select chromosomes for the next generation
• Crossover: randomly pair up chromosomes and exchange portions
• Mutation: randomly change each chromosome

Pattern Match
• Compute the best match for a pattern out of a set of candidate patterns (pattern under test compared against candidate patterns 1 … N, magnitude vs. range)
  – Uses weighted mean-square error

Database Operations
• Three generic database operations:
  – search: find all items in a given range
  – insert: add items to the database
  – delete: remove items from the database
• Uses red-black tree and linked list data structures

Corner Turn
• Memory rearrangement of matrix contents
  – Switch from row-major to column-major layout (a corner turn sketch follows below)
  – Example (numbers denote memory content): the 3x4 matrix 0 1 2 3 / 4 5 6 7 / 8 9 10 11 becomes 0 4 8 / 1 5 9 / 2 6 10 / 3 7 11
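The corner turn amounts to an out-of-place transpose of the memory image. Below is a minimal single-threaded C sketch; the function and parameter names are illustrative, not the HPEC reference code.

    /* Corner turn: copy an nrows x ncols row-major matrix into
     * column-major order (equivalently, transpose it into an
     * ncols x nrows row-major matrix). */
    void corner_turn(const float *in, float *out, int nrows, int ncols)
    {
        for (int r = 0; r < nrows; r++)
            for (int c = 0; c < ncols; c++)
                out[c * nrows + r] = in[r * ncols + c];
    }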
HPEC Challenge: Signal and Image Processing Kernels

FIR
• Bank of M filters applied to input data across M channels, in both a short-filter (~10 coefficients) and a long-filter (>100 coefficients) configuration
• FIR filters implemented in time and frequency domain

QR
• Computes the factorization of an input matrix, A = QR, with A (MxN), Q (MxM), and R (MxN)
• Implementation uses the Fast Givens algorithm

SVD
• Produces the decomposition of an input matrix, X = U Σ V^H (input matrix reduced to bidiagonal, then diagonal form)
• Classic Golub-Kahan SVD implementation

CFAR
• Creates a target list given a data cube C(i,j,k) over beams, range, and Dopplers
• Calculates normalized power for each cell and thresholds for target detection, producing T(i,j,k) (a cell-averaging sketch follows below)
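The per-cell normalization in CFAR can be illustrated with a simple cell-averaging sketch in C. The function name, the symmetric training/guard-cell windowing, and the parameter names are assumptions for illustration, not the HPEC reference implementation.

    /* Cell-averaging CFAR along the range dimension for one
     * (beam, Doppler) vector of cell powers.  ncfar training cells and
     * g guard cells on each side of the cell under test; mu is the
     * threshold factor. */
    void cfar_1d(const float *power, int *detect, int nrange,
                 int ncfar, int g, float mu)
    {
        for (int i = 0; i < nrange; i++) {
            float noise = 0.0f;
            int count = 0;
            /* average the training cells on both sides, skipping guard cells */
            for (int j = g + 1; j <= g + ncfar; j++) {
                if (i - j >= 0)     { noise += power[i - j]; count++; }
                if (i + j < nrange) { noise += power[i + j]; count++; }
            }
            if (count > 0)
                noise /= (float)count;
            /* declare a detection when the cell exceeds the scaled noise estimate */
            detect[i] = (power[i] > mu * noise);
        }
    }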
Time Domain FIR Algorithm

• Number of operations:
  – k: filter size
  – n: input size
  – nf: number of filters
  – Total FLOPs: ~ 8 x nf x n x k
• Output size: n + k - 1
• A single filter (example size 4) slides along the reference input data; each position forms a dot product that contributes to one output point (see the sketch below)

HPEC Challenge TDFIR parameters:

  Set   k     n      nf
  1     128   4096   64
  2     12    1024   20

• TDFIR uses complex data
• TDFIR uses a bank of filters
  – Each filter is used in a tapered convolution
  – A convolution is a series of dot products
• FIR is one of the best ways to demonstrate FLOPS
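A minimal split-complex time-domain FIR for a single filter follows directly from the operation count above. This is an illustrative sketch (the function name and interface are assumed), not the HPEC reference code.

    /* Time-domain FIR of one length-k complex filter over a length-n
     * complex input, split-complex storage (separate real/imag arrays).
     * Output length is n + k - 1; output arrays are assumed zeroed.
     * Roughly 8*n*k floating-point operations. */
    void tdfir_split(const float *xr, const float *xi,   /* input, length n   */
                     const float *hr, const float *hi,   /* filter, length k  */
                     float *yr, float *yi,               /* output, n + k - 1 */
                     int n, int k)
    {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < k; j++) {
                /* complex multiply-accumulate: y[i+j] += x[i] * h[j] */
                yr[i + j] += xr[i] * hr[j] - xi[i] * hi[j];
                yi[i + j] += xr[i] * hi[j] + xi[i] * hr[j];
            }
        }
    }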
Performance Challenges

• Efficiently partition the application
• Dual-issue instructions: keep the pipelines full
• Can the buses be controlled efficiently?
• Maximize the use of tiles and SIMD registers
• Don't let the cache slow things down
• Access memory efficiently; keep the data flowing
• Can the exact processor be selected?
• Watch out for race conditions
• Cover memory transfers with computations (double-buffering sketch below)
• Can information on disk be preloaded before needed?

[Figure: memory hierarchy — registers exchange instructions and operands with cache, cache exchanges blocks with local memory, local memory exchanges messages with remote memory, and remote memory exchanges pages with disk — annotated with the challenges above.]

• Price of performance is increased programming complexity
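One standard way to cover memory transfers with computations is double buffering. The sketch below is a generic C illustration; dma_get_async(), dma_wait(), and process_block() are hypothetical stand-ins for a platform's asynchronous transfer API and the user's compute kernel (on Cell these would map onto MFC DMA operations).

    /* Double buffering: fetch block i+1 while computing on block i. */
    #define BLOCK_SIZE 4096   /* floats per block (illustrative) */

    extern void dma_get_async(void *local, const void *remote, int bytes, int tag);
    extern void dma_wait(int tag);
    extern void process_block(float *data, int nfloats);

    void process_stream(const float *remote, int nblocks)
    {
        static float buf[2][BLOCK_SIZE];
        int cur = 0;

        /* prime the pipeline with the first block */
        dma_get_async(buf[cur], remote, sizeof(buf[cur]), cur);

        for (int i = 0; i < nblocks; i++) {
            int nxt = cur ^ 1;
            /* start fetching the next block before touching the current one */
            if (i + 1 < nblocks)
                dma_get_async(buf[nxt], remote + (i + 1) * BLOCK_SIZE,
                              sizeof(buf[nxt]), nxt);
            dma_wait(cur);                        /* current block has arrived    */
            process_block(buf[cur], BLOCK_SIZE);  /* compute overlaps the fetch   */
            cur = nxt;
        }
    }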
Reference C Implementation

• Computations take 2 lines per output point
• Mostly loop control, pointers, and initialization
• Output initialization assumed
• SPE needs split complex
  – Separate real and imaginary vectors

    for (i = K; i > 0; i--) {
        /* set accumulators and pointers for dot product */
        r1 = Rin;
        r2 = Iin;
        o1 = Rout;
        o2 = Iout;
        /* calculate contributions from a single kernel point */
        for (j = 0; j < N; j++) {
            *o1 += *k1 * *r1 - *k2 * *r2;
            *o2 += *k2 * *r1 + *k1 * *r2;
            r1++; r2++;
            o1++; o2++;
        }
        /* update kernel and output pointers */
        k1++;
        k2++;
        Rout++;
        Iout++;
    }

Reference C FIR is easy to understand
C with SIMD Extensions

• Inner loop contributes to 4 output points per pass
• SIMD registers in use
• Shuffling of values in registers is a requirement
  – Compilers are unlikely to recognize this type of code
• Can rival assembly code with more effort
• SIMD C extensions increase code complexity
  – Hardware needs consideration

Contents of inner loop of convolution:

    /* load reference data and shift */
    ir0 = *Rin++;  ii0 = *Iin++;
    ir1 = (vector float) spu_shuffle(irOld, ir0, shift1);
    ii1 = (vector float) spu_shuffle(iiOld, ii0, shift1);
    ir2 = (vector float) spu_shuffle(irOld, ir0, shift2);
    ii2 = (vector float) spu_shuffle(iiOld, ii0, shift2);
    ir3 = (vector float) spu_shuffle(irOld, ir0, shift3);
    ii3 = (vector float) spu_shuffle(iiOld, ii0, shift3);

    Rtemp = kr0 * ir0 + Rtemp;        Itemp = kr0 * ii0 + Itemp;
    Rtemp = -(ki0 * ii0 - Rtemp);     Itemp = ki0 * ir0 + Itemp;
    Rtemp = kr1 * ir1 + Rtemp;        Itemp = kr1 * ii1 + Itemp;
    Rtemp = -(ki1 * ii1 - Rtemp);     Itemp = ki1 * ir1 + Itemp;
    Rtemp = kr2 * ir2 + Rtemp;        Itemp = kr2 * ii2 + Itemp;
    Rtemp = -(ki2 * ii2 - Rtemp);     Itemp = ki2 * ir2 + Itemp;
    Rtemp = kr3 * ir3 + Rtemp;        Itemp = kr3 * ii3 + Itemp;
    Rtemp = -(ki3 * ii3 - Rtemp);     Itemp = ki3 * ir3 + Itemp;

    *Rout++ = Rtemp;  *Iout++ = Itemp;

    /* update old values */
    irOld = ir0;  iiOld = ii0;