A High-Level Signal Processing Library for Multicore Processors

Sharon Sacco, Nadya Bliss, Ryan Haney, Jeremy Kepner, Hahn Kim, Sanjeev Mohindra, Glenn Schrader and Edward Rutledge

This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

MIT Lincoln Laboratory
Outline

• Overview
• HPEC Challenge Benchmarks
• Parallel Vector Tile Optimizing Library
• Summary
Embedded Digital Systems

[Figure: Lincoln Laboratory embedded digital system activities — open system technologies, rapid system prototyping, and the next-generation warfighting vision — spanning programs such as the High Performance Embedded Computing Software Initiative, widebody airborne sensor platform, Greenbank, next-gen radar, Triple Canopy, foliage penetration, open systems architecture, advanced hardware implementations (receiver-on-chip, Very Large Scale Integration / Field Programmable Gate Array hybrids), network and decision support initiatives, integrated sensing and decision support, and LLGrid.]
Embedded Processor Evolution

[Chart: MFLOPS / Watt of high performance embedded processors vs. year, 1990–2010, showing the i860 and i860 XR, SHARC, PowerPC 603e and 750, PowerPC with AltiVec (MPC7400, MPC7410, MPC7447A), and the Cell processor (estimated).]

• 20 years of exponential growth in FLOPS / Watt
• Requires switching architectures every ~5 years
• Cell processor is current high performance architecture
Outline

• Overview
• HPEC Challenge Benchmarks
  – Time-Domain FIR Filter
  – Results
• Parallel Vector Tile Optimizing Library
• Summary
HPEC Challenge: Information and Knowledge Processing Kernels

Genetic Algorithm
• Evaluation: evaluate each chromosome
• Selection: select chromosomes for the next generation
• Crossover: randomly pair up chromosomes and exchange portions
• Mutation: randomly change each chromosome

Pattern Match
• Compute the best match for a pattern out of a set of candidate patterns (pattern under test compared against candidate patterns 1 … N, magnitude vs. range)
  – Uses weighted mean-square error

Database Operations
• Three generic database operations:
  – search: find all items in a given range
  – insert: add items to the database
  – delete: remove items from the database
• Uses red-black tree and linked list data structures

Corner Turn
• Memory rearrangement of matrix contents
  – Switch from row-major to column-major layout (a corner turn sketch follows below)
  – Example (numbers denote memory content): the 3x4 matrix 0 1 2 3 / 4 5 6 7 / 8 9 10 11 becomes 0 4 8 / 1 5 9 / 2 6 10 / 3 7 11
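The corner turn amounts to an out-of-place transpose of the memory image. Below is a minimal single-threaded C sketch; the function and parameter names are illustrative, not the HPEC reference code.

    /* Corner turn: copy an nrows x ncols row-major matrix into
     * column-major order (equivalently, transpose it into an
     * ncols x nrows row-major matrix). */
    void corner_turn(const float *in, float *out, int nrows, int ncols)
    {
        for (int r = 0; r < nrows; r++)
            for (int c = 0; c < ncols; c++)
                out[c * nrows + r] = in[r * ncols + c];
    }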
HPEC Challenge: Signal and Image Processing Kernels

FIR
• Bank of M filters applied to input data across M channels, in both a short-filter (~10 coefficients) and a long-filter (>100 coefficients) configuration
• FIR filters implemented in time and frequency domain

QR
• Computes the factorization of an input matrix, A = QR, with A (MxN), Q (MxM), and R (MxN)
• Implementation uses the Fast Givens algorithm

SVD
• Produces the decomposition of an input matrix, X = U Σ V^H (input matrix reduced to bidiagonal, then diagonal form)
• Classic Golub-Kahan SVD implementation

CFAR
• Creates a target list given a data cube C(i,j,k) over beams, range, and Dopplers
• Calculates normalized power for each cell and thresholds for target detection, producing T(i,j,k) (a cell-averaging sketch follows below)
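The per-cell normalization in CFAR can be illustrated with a simple cell-averaging sketch in C. The function name, the symmetric training/guard-cell windowing, and the parameter names are assumptions for illustration, not the HPEC reference implementation.

    /* Cell-averaging CFAR along the range dimension for one
     * (beam, Doppler) vector of cell powers.  ncfar training cells and
     * g guard cells on each side of the cell under test; mu is the
     * threshold factor. */
    void cfar_1d(const float *power, int *detect, int nrange,
                 int ncfar, int g, float mu)
    {
        for (int i = 0; i < nrange; i++) {
            float noise = 0.0f;
            int count = 0;
            /* average the training cells on both sides, skipping guard cells */
            for (int j = g + 1; j <= g + ncfar; j++) {
                if (i - j >= 0)     { noise += power[i - j]; count++; }
                if (i + j < nrange) { noise += power[i + j]; count++; }
            }
            if (count > 0)
                noise /= (float)count;
            /* declare a detection when the cell exceeds the scaled noise estimate */
            detect[i] = (power[i] > mu * noise);
        }
    }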
Time Domain FIR Algorithm

• Number of operations:
  – k: filter size
  – n: input size
  – nf: number of filters
  – Total FLOPs: ~ 8 x nf x n x k
• Output size: n + k - 1
• A single filter (example size 4) slides along the reference input data; each position forms a dot product that contributes to one output point (see the sketch below)

HPEC Challenge TDFIR parameters:

  Set   k     n      nf
  1     128   4096   64
  2     12    1024   20

• TDFIR uses complex data
• TDFIR uses a bank of filters
  – Each filter is used in a tapered convolution
  – A convolution is a series of dot products
• FIR is one of the best ways to demonstrate FLOPS
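A minimal split-complex time-domain FIR for a single filter follows directly from the operation count above. This is an illustrative sketch (the function name and interface are assumed), not the HPEC reference code.

    /* Time-domain FIR of one length-k complex filter over a length-n
     * complex input, split-complex storage (separate real/imag arrays).
     * Output length is n + k - 1; output arrays are assumed zeroed.
     * Roughly 8*n*k floating-point operations. */
    void tdfir_split(const float *xr, const float *xi,   /* input, length n   */
                     const float *hr, const float *hi,   /* filter, length k  */
                     float *yr, float *yi,               /* output, n + k - 1 */
                     int n, int k)
    {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < k; j++) {
                /* complex multiply-accumulate: y[i+j] += x[i] * h[j] */
                yr[i + j] += xr[i] * hr[j] - xi[i] * hi[j];
                yi[i + j] += xr[i] * hi[j] + xi[i] * hr[j];
            }
        }
    }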
Performance Challenges

• Efficiently partition the application
• Dual-issue instructions: keep the pipelines full
• Can the buses be controlled efficiently?
• Maximize the use of tiles and SIMD registers
• Don't let the cache slow things down
• Access memory efficiently; keep the data flowing
• Can the exact processor be selected?
• Watch out for race conditions
• Cover memory transfers with computations (double-buffering sketch below)
• Can information on disk be preloaded before needed?

[Figure: memory hierarchy — registers exchange instructions and operands with cache, cache exchanges blocks with local memory, local memory exchanges messages with remote memory, and remote memory exchanges pages with disk — annotated with the challenges above.]

• Price of performance is increased programming complexity
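One standard way to cover memory transfers with computations is double buffering. The sketch below is a generic C illustration; dma_get_async(), dma_wait(), and process_block() are hypothetical stand-ins for a platform's asynchronous transfer API and the user's compute kernel (on Cell these would map onto MFC DMA operations).

    /* Double buffering: fetch block i+1 while computing on block i. */
    #define BLOCK_SIZE 4096   /* floats per block (illustrative) */

    extern void dma_get_async(void *local, const void *remote, int bytes, int tag);
    extern void dma_wait(int tag);
    extern void process_block(float *data, int nfloats);

    void process_stream(const float *remote, int nblocks)
    {
        static float buf[2][BLOCK_SIZE];
        int cur = 0;

        /* prime the pipeline with the first block */
        dma_get_async(buf[cur], remote, sizeof(buf[cur]), cur);

        for (int i = 0; i < nblocks; i++) {
            int nxt = cur ^ 1;
            /* start fetching the next block before touching the current one */
            if (i + 1 < nblocks)
                dma_get_async(buf[nxt], remote + (i + 1) * BLOCK_SIZE,
                              sizeof(buf[nxt]), nxt);
            dma_wait(cur);                        /* current block has arrived    */
            process_block(buf[cur], BLOCK_SIZE);  /* compute overlaps the fetch   */
            cur = nxt;
        }
    }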
Reference C Implementation

• Computations take 2 lines per output point
• Mostly loop control, pointers, and initialization
• Output initialization assumed
• SPE needs split complex
  – Separate real and imaginary vectors

    for (i = K; i > 0; i--) {
        /* set accumulators and pointers for dot product */
        r1 = Rin;
        r2 = Iin;
        o1 = Rout;
        o2 = Iout;
        /* calculate contributions from a single kernel point */
        for (j = 0; j < N; j++) {
            *o1 += *k1 * *r1 - *k2 * *r2;
            *o2 += *k2 * *r1 + *k1 * *r2;
            r1++; r2++;
            o1++; o2++;
        }
        /* update kernel and output pointers */
        k1++;
        k2++;
        Rout++;
        Iout++;
    }

Reference C FIR is easy to understand
C with SIMD Extensions

• Inner loop contributes to 4 output points per pass
• SIMD registers in use
• Shuffling of values in registers is a requirement
  – Compilers are unlikely to recognize this type of code
• Can rival assembly code with more effort
• SIMD C extensions increase code complexity
  – Hardware needs consideration

Contents of inner loop of convolution:

    /* load reference data and shift */
    ir0 = *Rin++;  ii0 = *Iin++;
    ir1 = (vector float) spu_shuffle(irOld, ir0, shift1);
    ii1 = (vector float) spu_shuffle(iiOld, ii0, shift1);
    ir2 = (vector float) spu_shuffle(irOld, ir0, shift2);
    ii2 = (vector float) spu_shuffle(iiOld, ii0, shift2);
    ir3 = (vector float) spu_shuffle(irOld, ir0, shift3);
    ii3 = (vector float) spu_shuffle(iiOld, ii0, shift3);

    Rtemp = kr0 * ir0 + Rtemp;        Itemp = kr0 * ii0 + Itemp;
    Rtemp = -(ki0 * ii0 - Rtemp);     Itemp = ki0 * ir0 + Itemp;
    Rtemp = kr1 * ir1 + Rtemp;        Itemp = kr1 * ii1 + Itemp;
    Rtemp = -(ki1 * ii1 - Rtemp);     Itemp = ki1 * ir1 + Itemp;
    Rtemp = kr2 * ir2 + Rtemp;        Itemp = kr2 * ii2 + Itemp;
    Rtemp = -(ki2 * ii2 - Rtemp);     Itemp = ki2 * ir2 + Itemp;
    Rtemp = kr3 * ir3 + Rtemp;        Itemp = kr3 * ii3 + Itemp;
    Rtemp = -(ki3 * ii3 - Rtemp);     Itemp = ki3 * ir3 + Itemp;

    *Rout++ = Rtemp;  *Iout++ = Itemp;

    /* update old values */
    irOld = ir0;  iiOld = ii0;