Real-Time High-Throughput Sonar Beamforming Kernels Using Native Signal Processing and Memory Latency Hiding Techniques

Gregory E. Allen, Brian L. Evans, Lizy K. John
Department of Electrical and Computer Engineering
The University of Texas at Austin
http://www.ece.utexas.edu/~allen/
Introduction
• Sonar beamforming is computationally intensive
  • GFLOPS of computation
  • 100 MB/s of data input/output
• Current real-time implementation technologies
  • Custom hardware
  • Custom integration using commercial-off-the-shelf (COTS) processors (e.g., 100 digital signal processors in a VME chassis)
  • Low production volume (50 units), high development cost
• Examine performance of commodity computers
  • Native signal processing, multimedia instruction sets
  • Memory latency hiding techniques
Native Signal Processing
• Single-cycle multiply-accumulate (MAC) operation
  • Vector dot products, digital filters, and correlation:
      Σ_{i=1}^{N} α_i x_i
  • Missing extended-precision accumulation
• Single-instruction multiple-data (SIMD) processing
  • UltraSPARC Visual Instruction Set (VIS) and Pentium MMX: 64-bit registers, 8-bit and 16-bit fixed-point arithmetic
  • Pentium III, K6-2 3DNow!: 64-bit registers, 32-bit floating point
  • PowerPC AltiVec: 128-bit registers, 4x32-bit floating-point MACs
• Must hand-code using intrinsics and assembly code
Visual Instruction Set
• 50 new CPU instructions for UltraSPARC
  • Optimized for video and image processing
  • Partitioned data types in 32-bit or 64-bit FP registers
  • Includes arithmetic and logic, packing and unpacking, alignment and data conversion, etc.
• Independent operation on each data cell (SIMD)
  [Diagram: vis_fpadd16 adds two vis_d64 operands, each holding four 16-bit cells (bits 63-48, 47-32, 31-16, 15-0), producing four independent sums A1+B1 ... A4+B4]
• Inline function library provided for use from C/C++
Memory Latency Hiding
• A fast processor stalls when accessing slow memory
  • Cache memories can help to alleviate this problem
  • High-throughput streams of data amplify this problem
  • Software techniques can reduce the penalty
• Technique: Loop unrolling
  • Enlarges basic-block size and reduces looping overhead
  • Can increase the time between data request and consumption
  • Low risk and no overhead, commonly used by compilers
• Technique: Software pipelining
  • Data load and usage overlapped from different loop iterations
  • Increases register usage and lifetimes, hard for a compiler
Software Data Prefetching
• Non-blocking prefetch CPU instruction
  • Issued some time before the data is needed
  • Data at the effective address is brought into cache
  • At a later load instruction, the data is already cached
• Problems: overhead and "prefetch distance"
  • Uses extra cache and issues extra instructions
  • Prefetch too far ahead: excessive cache usage, spillage
  • Not far enough ahead: stall at the load instruction
• Can be generated by a compiler
• Implemented in the UltraSPARC-II CPU
Sonar Beamforming
• We evaluate two key kernels for 3-D beamforming
• Typically the computational bottleneck in sonar
• High-throughput streams of data
• Goal: best performance using any means
Time-Domain Beamforming
• Delay-and-sum weighted sensor outputs:
      b(t) = Σ_{i=1}^{M} α_i x_i(t − τ_i)
  where b(t) is the beam output, x_i(t) the i-th sensor output, τ_i the i-th sensor delay, and α_i the i-th sensor weight
• Geometrically project the sensor elements onto a line to compute the time delays
  [Figure: sensor elements and their projected elements onto a line for a beam pointing 20° off axis; x position in inches]
Horizontal Beamformer
• Sample at just above the Nyquist rate, then interpolate to obtain the desired time-delay resolution
  [Block diagram: stave data at sample interval ∆ is interpolated up to interval δ = ∆/L, delayed by z^{-N_i}, weighted by α_i, and summed over channels 1..M to form a single beam output b[n] — a digital interpolation beamformer]
• Modeled as a sparse FIR filter
  • Forming 61 beams from 80 elements with 2-point interpolation
  • 3000 index lookups plus 6000 floating-point MACs per sample
  • At each sample: 12 Kbytes of data, coefficient size of 36 Kbytes
Vertical Beamformer
• Multiple vertical transducers for every horizontal position (a stave)
• Vertical columns combined into 3 stave outputs
  • Multiple dot products (30 MACs per stave per sample)
  • Convert integer to floating point for the following stages
• Ideal candidate for the Visual Instruction Set (VIS)
  • Use integer dot products (16x16-bit multiply, 32-bit add)
  • Highest-precision (and slowest) VIS mode
  • Coefficients must be scaled for best dynamic range
Tools Utilized
• Sun's SPARCompiler 5.0
  • Automated prefetch instruction generation?
  • Inline assembly macros for VIS instructions
  • Wrote assembly macros for prefetch and fitos instructions
• Shade: pficount (prefetch instruction counter)
• INCAS (It's a Nearly Cycle-Accurate Simulator)
• perf-monitor: hardware performance counters
• Benchmarks on a 336 MHz UltraSPARC-II
Horizontal Kernel Performance
  [Chart: MFLOPS (150-450) vs. outer-loop unrolling factor (1-7) for multiple-pass, single-pass, and inline-PREFETCH versions; maximum 444 MFLOPS (1.32 FLOPC, 2.19 IPC), 66% of peak]
• Hand loop unrolling gives a speedup of 2.4
• Multiple passes improve cache usage (93% / 97%)
• Inline PREFETCH "breaks" compiler optimization
Vertical Kernel Performance
  [Bar chart: MFLOPS (or MIOPS) for nine versions, slowest to fastest: floating point; floating point in asm; floating point with VIS loading; int (no VIS); VIS baseline; VIS, unrolled inner loop; VIS, add double-loading; VIS, reschedule and pipeline; VIS, add software prefetching — best case 313 MIOPS (0.93 IOPC, 1.41 IPC), 93% of peak]
• VIS offers a 46% boost over floating point
• Software prefetching gives an additional 34%
• 104 MB/s data input, 62.7 MB/s data output
Vertical Prefetch Statistics
  [Stacked bar chart: execution time (0-3 sec) split into no-stall, load-stall, and store-stall cycles for four trials: no prefetching; write prefetching only; read prefetching only; read/write prefetching]
• Breakdown of execution time
• Execution cycles (no stall) are constant across trials
• Internal cache statistics do not change
Conclusion
• Beamforming kernel results:
  • Horizontal beamformer kernel: 444 MFLOPS, 66% of peak
  • Vertical beamformer kernel: 313 MIOPS, 93% of peak
  • Loop unrolling: 2.4x speedup in the horizontal kernel
  • VIS: 1.46x speedup in the vertical kernel
  • Prefetching: 1.34x speedup in the vertical kernel
• Near-peak performance can be achieved, but
  • Kernel optimization is difficult and time consuming
  • The compiler did not generate prefetch instructions
• For high-throughput real-time signal processing, general-purpose CPUs can be an attractive target