Pruned FFT Implementations - Franz Franchetti, Markus Püschel

Carnegie Mellon
Generating High Performance Pruned FFT Implementations
Franz Franchetti, Markus Püschel
Electrical and Computer Engineering, Carnegie Mellon University
Sponsors: DARPA DESA program, NSF-NGS/ITR, NSF-ACR, Mercury, and Intel

The Idea: Pruned FFT
• Input pruning: e.g., the center ¾ of the inputs are known to be zero
• Output pruning: e.g., only the low ½ of the frequencies are used
• Simultaneous input and output pruning: some inputs are known zeros and some outputs are discarded
Pruned DFT: 5% to 30% operations reduction in application settings
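As a minimal sketch of what pruning means, the following Python snippet evaluates the DFT sum only over the inputs that may be nonzero and only for the outputs that are wanted. It is a direct summation, not a fast algorithm and not Spiral-generated code; the function names `dft` and `pruned_dft` and the example size `n = 16` are assumptions made for illustration.

```python
import cmath

def dft(x):
    """Reference DFT by direct summation (O(n^2))."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def pruned_dft(x, nonzero_inputs, wanted_outputs):
    """Direct pruned DFT: skip inputs known to be zero, compute only wanted outputs.

    Returns a dict mapping output index k to X[k].
    """
    n = len(x)
    return {k: sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                   for j in nonzero_inputs)
            for k in wanted_outputs}

n = 16
nonzero = [0, 1, 14, 15]            # center 3/4 of the inputs (indices 2..13) are zero
x = [0j] * n
for j in nonzero:
    x[j] = complex(j + 1, -j)       # arbitrary test values
low_half = range(n // 2)            # only the low 1/2 of the frequencies are used

full = dft(x)
pruned = pruned_dft(x, nonzero, low_half)
max_err = max(abs(pruned[k] - full[k]) for k in low_half)
print(max_err)                      # difference to the unpruned DFT on the kept outputs
```

The pruned version touches 4 of 16 inputs and 8 of 16 outputs, which shows where the savings come from; a pruned FFT aims for the same effect inside a fast algorithm.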

The Problem
Discrete Fourier Transform (single precision): 2 x Core2 Extreme 3 GHz
[Plot: performance vs. problem size (16 to 262144). Series: best code (Spiral generated), Numerical Recipes. Annotation: 30x difference at the same operations count.]
Can we turn 5% to 30% operations savings into speed-up?

Organization
• Spiral overview
• Pruned FFT
• Results
• Concluding remarks

Spiral
• Library generator for linear transforms (DFT, DCT, DWT, filters, ...) and recently more
• Wide range of platforms supported: scalar, fixed point, vector, parallel, Verilog, GPU
• Research goal: "teach" computers to write fast libraries
• Complete automation of implementation and optimization
• Conquer the "high" algorithm level for automation
• When a new platform comes out: regenerate a retuned library
• When a new platform paradigm comes out (e.g., CPU+GPU): update the tool rather than rewriting the library
Intel uses Spiral to generate parts of their MKL and IPP libraries

How Spiral Works
Spiral: complete automation of the implementation and optimization task
Basic idea:
• Declarative representation of algorithms
• Rewriting systems to generate and optimize algorithms
[Diagram: Problem specification (transform) -> Algorithm Generation and Algorithm Optimization -> algorithm -> Implementation and Code Optimization -> C code -> Compilation and Compiler Optimizations -> Fast executable. A Search block, driven by measured performance, controls the algorithm and implementation stages.]

Fast Algorithms, Example: 4-point FFT
• Fast algorithms = matrix factorizations
[Equation: the 4-point Fourier transform matrix (12 adds, 4 mults) written as a product of sparse factors (4 adds; 1 mult; 4 adds). Factor labels: Fourier transform, Identity, Kronecker product, Permutation.]
• SPL = mathematical, declarative specification
• An SPL formula can be translated into a program
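The equation on this slide did not survive extraction. As a hedged illustration, the sketch below checks the standard Cooley-Tukey factorization DFT_4 = (DFT_2 ⊗ I_2) T (I_2 ⊗ DFT_2) L, which matches the labels and operation counts above; the assumption is that this is the factorization the slide showed. The names `T42` (twiddle diagonal) and `L42` (stride permutation), and the helper functions, are chosen here for illustration.

```python
import cmath

def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def kron(a, b):
    """Kronecker (tensor) product a ⊗ b."""
    return [[a[i][j] * b[k][l]
             for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

def dft_matrix(n):
    """n x n DFT matrix with entries w^(j*k), w = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (j * k) for k in range(n)] for j in range(n)]

I2 = [[1, 0], [0, 1]]                  # identity
F2 = [[1, 1], [1, -1]]                 # DFT_2 (one butterfly: 2 adds)
T42 = [[1, 0, 0, 0],                   # twiddle diagonal: a single
       [0, 1, 0, 0],                   # nontrivial multiplication (by -i)
       [0, 0, 1, 0],
       [0, 0, 0, -1j]]
L42 = [[1, 0, 0, 0],                   # stride permutation:
       [0, 0, 1, 0],                   # (x0, x1, x2, x3) -> (x0, x2, x1, x3)
       [0, 1, 0, 0],
       [0, 0, 0, 1]]

factored = matmul(matmul(matmul(kron(F2, I2), T42), kron(I2, F2)), L42)
direct = dft_matrix(4)
max_err = max(abs(factored[i][j] - direct[i][j])
              for i in range(4) for j in range(4))
print(max_err)                         # deviation between the product and DFT_4
```

Each sparse factor corresponds to one cheap stage (4 adds, 1 mult, 4 adds), whereas multiplying by the dense matrix directly costs 12 adds and 4 mults.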

Transforms and Breakdown Rules
• Breakdown rules "teach" Spiral about existing algorithm knowledge (~200 journal papers)
• Base case rules
Goal: derive a Cooley-Tukey pruned FFT rule

Data Sparseness: Block Sequences
• Sequence
• Block sequence
• Example

Zero-Padding: Scatter Matrix
• Definition
• Example

Cooley-Tukey Pruned FFT Rule
• Recursive input pruning rule
• Base case
• Similar rules for output pruning and simultaneous pruning
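The rule itself is not recoverable from the slide text. As a rough sketch of recursive input pruning in general, the following radix-2 decimation-in-time FFT takes a mask of inputs that may be nonzero and skips every sub-transform whose inputs are all known to be zero. This is a plain textbook recursion written for illustration, not the Spiral rule; the names `pruned_fft`, `mask`, and `stats` are hypothetical, and the input length is assumed to be a power of two.

```python
import cmath

def pruned_fft(x, mask, stats):
    """Radix-2 FFT that skips work on inputs known to be zero.

    mask[i] is True if x[i] may be nonzero; stats[0] counts twiddle multiplications.
    """
    n = len(x)
    if not any(mask):
        return [0j] * n                      # whole sub-transform is zero
    if n == 1:
        return [complex(x[0])]
    even = pruned_fft(x[0::2], mask[0::2], stats)
    if not any(mask[1::2]):
        return even + even                   # odd half is zero: outputs repeat
    odd = pruned_fft(x[1::2], mask[1::2], stats)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        stats[0] += 1
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

n = 64
mask = [i < n // 16 for i in range(n)]       # first 1/16 of the inputs non-zero
x = [complex(i + 1, 2 * i) if mask[i] else 0j for i in range(n)]

pruned_stats, full_stats = [0], [0]
y_pruned = pruned_fft(x, mask, pruned_stats)
y_full = pruned_fft(x, [True] * n, full_stats)   # same code, no pruning

# Reference result by direct summation.
ref = [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
       for k in range(n)]
max_err = max(abs(a - b) for a, b in zip(y_pruned, ref))
pruned_mults, full_mults = pruned_stats[0], full_stats[0]
print(pruned_mults, full_mults, max_err)
```

Because the zero pattern is passed as a mask rather than detected from the data, the pruning decisions depend only on the pattern, which is the situation a code generator can exploit ahead of time.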

Derivation: Cooley-Tukey Pruned FFT Rule
Cooley-Tukey FFT rule + Kronecker product identities

DFT: Spiral vs. FFTW and MKL (2 cores, 4-way SSE)
Performance [Gflop/s], single-precision, Intel C++ 10.1, SSSE3, Windows XP 32-bit
[Plot: performance vs. input size (256 to 16,384). Series: Spiral DFT SSE+SMP, Spiral DFT SSE, Intel MKL 9.0, FFTW DFT, Numerical Recipes in C.]
The Spiral-generated DFT is a good baseline

Spiral: Pruned DFT vs. DFT (4-way SSE)
Performance [Gflop/s], single-precision, Intel C++ 10.1, SSSE3, Windows XP 32-bit
[Plot: performance vs. input size (256 to 16,384). Series: Pruned DFT (first 1/16 non-zero), Pruned DFT (center 7/8 zero), Pruned DFT (center 1/4 non-zero), Pruned DFT (second half zero), Spiral DFT.]
FFT input pruning: speed-up for sequential vector DFT

Spiral: Pruned DFT vs. DFT (2 cores, 4-way SSE)
Performance [Gflop/s], single-precision, Intel C++ 10.1, SSSE3, Windows XP 32-bit
[Plot: performance vs. input size (256 to 16,384). Series: Pruned DFT (first 1/16 non-zero), Pruned DFT (center 7/8 zero), Pruned DFT (center 1/4 non-zero), Pruned DFT (second half zero), Spiral DFT.]
FFT input pruning: speed-up for parallel vector DFT

Spiral: I/O Pruned DFT vs. DFT (4-way SSE)
Performance [Gflop/s], single-precision, Intel C++ 10.1, SSSE3, Windows XP 32-bit
[Plot: performance vs. input size (256 to 16,384). Series: I/O Pruned DFT (output: first 1/16 non-zero, input: center 3/4 zero), I/O Pruned DFT (output: center 7/8 zero, input: first 1/4 non-zero), Spiral DFT.]
I/O pruning: better speed-up than input pruning only

Summary
• Spiral's goal: "teach" computers to write fast libraries. From problem specification to very fast code, automatically (click button)
• Optimization at a high level of abstraction: memory hierarchy, vector SIMD, multicore, ...
• The generated programs are very fast, often better than human-written code
• Pruned FFT: the lower operations count translates into a speed-up of up to 30% over the best vector SIMD and multicore code for input pruning
