Real-Time High-Throughput Sonar Beamforming Kernels Using Native Signal Processing and Memory Latency Hiding Techniques

Gregory E. Allen, Brian L. Evans, Lizy K. John
Department of Electrical and Computer Engineering
The University of Texas at Austin
http://www.ece.utexas.edu/~allen/
Introduction
• Sonar beamforming is computationally intensive
  • GFLOPS of computation
  • 100 MB/s of data input/output
• Current real-time implementation technologies
  • Custom hardware
  • Custom integration using commercial-off-the-shelf (COTS) processors (e.g., 100 digital signal processors in a VME chassis)
  • Low production volume (50 units), high development cost
• Examine performance of commodity computers
  • Native signal processing, multimedia instruction sets
  • Memory latency hiding techniques
Native Signal Processing
• Single-cycle multiply-accumulate (MAC) operation
  • Vector dot products, digital filters, and correlation:
      Σ_{i=1}^{N} α_i x_i
  • Missing extended-precision accumulation
• Single-instruction multiple-data (SIMD) processing
  • UltraSPARC Visual Instruction Set (VIS) and Pentium MMX: 64-bit registers, 8-bit and 16-bit fixed-point arithmetic
  • Pentium III, K6-2 3DNow!: 64-bit registers, 32-bit floating point
  • PowerPC AltiVec: 128-bit registers, 4x32-bit floating-point MACs
• Must hand-code using intrinsics and assembly code
Visual Instruction Set
• 50 new CPU instructions for UltraSPARC
  • Optimized for video and image processing
  • Partitioned data types in 32-bit or 64-bit FP registers
  • Includes arithmetic and logic, packing and unpacking, alignment and data conversion, etc.
• Independent operation on each data cell (SIMD)
  [Diagram: vis_fpadd16 adds two vis_d64 operands, each holding four 16-bit cells (bits 63-48, 47-32, 31-16, 15-0), producing four independent sums A1+B1 ... A4+B4]
• Inline function library provided for use from C/C++
Memory Latency Hiding
• A fast processor stalls when accessing slow memory
  • Cache memories can help to alleviate this problem
  • High-throughput streams of data amplify this problem
  • Software techniques can reduce the penalty
• Technique: Loop unrolling
  • Enlarges basic-block size and reduces looping overhead
  • Can increase the time between data request and consumption
  • Low risk and no overhead, commonly used by compilers
• Technique: Software pipelining
  • Data load and usage overlapped from different loop iterations
  • Increases register usage and lifetimes, hard for a compiler
Software Data Prefetching
• Non-blocking prefetch CPU instruction
  • Issued some time before the data is needed
  • Data at the effective address is brought into cache
  • At a later load instruction, the data is already cached
• Problems: overhead and "prefetch distance"
  • Uses extra cache and issues extra instructions
  • Prefetch too far ahead: excessive cache usage, spillage
  • Not far enough ahead: stall at the load instruction
• Can be generated by a compiler
• Implemented in the UltraSPARC-II CPU
Sonar Beamforming
• We evaluate two key kernels for 3-D beamforming
• Typically the computational bottleneck in sonar
• High-throughput streams of data
• Goal: best performance using any means
Time-Domain Beamforming
• Delay-and-sum weighted sensor outputs:
      b(t) = Σ_{i=1}^{M} α_i x_i(t − τ_i)
  where b(t) is the beam output, x_i(t) the i-th sensor output, τ_i the i-th sensor delay, and α_i the i-th sensor weight
• Geometrically project the sensor elements onto a line to compute the time delays
  [Figure: sensor elements and their projected elements onto a line for a beam pointing 20° off axis; x position in inches]
Horizontal Beamformer
• Sample at just above the Nyquist rate, then interpolate to obtain the desired time-delay resolution
  [Block diagram: stave data at sample interval ∆ is interpolated up to interval δ = ∆/L, delayed by z^{-N_i}, weighted by α_i, and summed over channels 1..M to form a single beam output b[n] — a digital interpolation beamformer]
• Modeled as a sparse FIR filter
  • Forming 61 beams from 80 elements with 2-point interpolation
  • 3000 index lookups plus 6000 floating-point MACs per sample
  • At each sample: 12 Kbytes of data, coefficient size of 36 Kbytes
Vertical Beamformer
• Multiple vertical transducers for every horizontal position (a stave)
• Vertical columns combined into 3 stave outputs
  • Multiple dot products (30 MACs per stave per sample)
  • Convert integer to floating point for the following stages
• Ideal candidate for the Visual Instruction Set (VIS)
  • Use integer dot products (16x16-bit multiply, 32-bit add)
  • Highest-precision (and slowest) VIS mode
  • Coefficients must be scaled for best dynamic range
Tools Utilized
• Sun's SPARCompiler 5.0
  • Automated prefetch instruction generation?
  • Inline assembly macros for VIS instructions
  • Wrote assembly macros for prefetch and fitos instructions
• Shade: pficount (prefetch instruction counter)
• INCAS (It's a Nearly Cycle-Accurate Simulator)
• perf-monitor: hardware performance counters
• Benchmarks on a 336 MHz UltraSPARC-II
Horizontal Kernel Performance
  [Chart: MFLOPS (150-450) vs. outer-loop unrolling factor (1-7) for multiple-pass, single-pass, and inline-PREFETCH versions; maximum 444 MFLOPS (1.32 FLOPC, 2.19 IPC), 66% of peak]
• Hand loop unrolling gives a speedup of 2.4
• Multiple passes improve cache usage (93% / 97%)
• Inline PREFETCH "breaks" compiler optimization
Vertical Kernel Performance
  [Bar chart: MFLOPS (or MIOPS) for nine versions, slowest to fastest: floating point; floating point in asm; floating point with VIS loading; int (no VIS); VIS baseline; VIS, unrolled inner loop; VIS, add double-loading; VIS, reschedule and pipeline; VIS, add software prefetching — best case 313 MIOPS (0.93 IOPC, 1.41 IPC), 93% of peak]
• VIS offers a 46% boost over floating point
• Software prefetching gives an additional 34%
• 104 MB/s data input, 62.7 MB/s data output
Vertical Prefetch Statistics
  [Stacked bar chart: execution time (0-3 sec) split into no-stall, load-stall, and store-stall cycles for four trials: no prefetching; write prefetching only; read prefetching only; read/write prefetching]
• Breakdown of execution time
• Execution cycles (no stall) are constant across trials
• Internal cache statistics do not change
Conclusion
• Beamforming kernel results:
  • Horizontal beamformer kernel: 444 MFLOPS, 66% of peak
  • Vertical beamformer kernel: 313 MIOPS, 93% of peak
  • Loop unrolling: 2.4x speedup in the horizontal kernel
  • VIS: 1.46x speedup in the vertical kernel
  • Prefetching: 1.34x speedup in the vertical kernel
• Near-peak performance can be achieved, but
  • Kernel optimization is difficult and time consuming
  • The compiler did not generate prefetch instructions
• For high-throughput real-time signal processing, general-purpose CPUs can be an attractive target