A Rewriting System for the Vectorization of Signal Transforms

A Rewriting System for the Vectorization of Signal Transforms
Franz Franchetti, Yevgen Voronenko, Markus Püschel
Department of Electrical & Computer Engineering, Carnegie Mellon University
http://www.spiral.net
Supported by: NSF ACR-0234293, ITR/NGS-0325687, DARPA NBCH-105000, Intel, Austrian FWF

The Problem (Example: FFT Performance)
• Best available implementations: FFTW, Intel IPP, Spiral
• Reasonable implementations: Numerical Recipes, GNU Scientific Library
• The performance gap between the two is about 10x, at roughly the same operations count
• Solution: program generators like ATLAS and Spiral, adaptive libraries like FFTW

Organization
• Spiral overview
• SIMD vector instructions
• Vectorization by rewriting
• Extension to SMP and multicore
• Experimental results
• Summary

Spiral
• Program generation from a problem specification for linear digital signal processing (DSP) transforms (DFT, DCT, DWT, filters, ...)
• Goal 1: A flexible push-button program generation framework for an entire domain of algorithms
• Goal 2: With new architectures, update the tool rather than the individual programs in the library
• Spiral generates DSP programs for SIMD vector, shared memory, multicore, distributed memory, FPGAs, embedded CPUs
• Principle 1: Domain knowledge in the system. Knowledge of the platform: by evaluating runtime
• Principle 2: Optimization at a high level of abstraction
Reference: Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo, SPIRAL: Code Generation for DSP Transforms, Proceedings of the IEEE 93(2), 2005

What is a DSP Transform?
• Mathematically: matrix-vector multiplication: output vector (signal) = transform (matrix) · input vector (signal)
• Example: Discrete Fourier transform (DFT)
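The matrix-vector view can be sketched in a few lines of Python. This is a minimal illustration, not Spiral code: the helper names `dft_matrix` and `matvec` are hypothetical, and the sign convention w = exp(-2*pi*i/n) is an assumption, since the slide's formula did not survive extraction.

```python
import cmath

def dft_matrix(n):
    """Return the n x n DFT matrix [w^(k*l)], assuming w = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (k * l) for l in range(n)] for k in range(n)]

def matvec(matrix, x):
    """Dense matrix-vector product: O(n^2) operations."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

signal = [1, 2, 3, 4]                      # input vector (signal)
spectrum = matvec(dft_matrix(4), signal)   # output vector (signal)
print(spectrum)
```

Computing the transform this way costs O(n^2) operations, which is the baseline that fast algorithms improve on.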

DSP Algorithms: Example 4-point DFT
• Algorithm = sparse matrix factorization
• Reduces computation cost from O(n^2) to O(n log n)
• For every transform there are many fast algorithms
• Operation counts (when multiplied with input vector x): 12 adds and 4 mults for the dense matrix; 4 adds, 1 mult, and 4 adds for the factors of the factorization
• SPIRAL generates the space of algorithms using breakdown rules in the domain-specific Signal Processing Language (SPL)
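The factorization on the slide was an image and is not preserved. As a hedged sketch, the following checks the standard Cooley-Tukey factorization DFT_4 = (DFT_2 ⊗ I_2) · T · (I_2 ⊗ DFT_2) · L, which is assumed here to be the one shown. The helper names `matmul` and `kron` are hypothetical, and w = -i is the assumed sign convention.

```python
def matmul(a, b):
    """Dense matrix product a * b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def kron(a, b):
    """Kronecker (tensor) product a ⊗ b."""
    return [[a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

I2 = [[1, 0], [0, 1]]
F2 = [[1, 1], [1, -1]]                       # DFT_2: one butterfly, 2 adds
T = [[1, 0, 0, 0], [0, 1, 0, 0],
     [0, 0, 1, 0], [0, 0, 0, -1j]]           # twiddle diagonal: 1 nontrivial mult
L = [[1, 0, 0, 0], [0, 0, 1, 0],
     [0, 1, 0, 0], [0, 0, 0, 1]]             # stride permutation: (x0, x2, x1, x3)

# Product of four sparse factors.
factored = matmul(matmul(matmul(kron(F2, I2), T), kron(I2, F2)), L)

# Dense 4-point DFT matrix for comparison.
dense = [[(-1j) ** (k * l) for l in range(4)] for k in range(4)]

max_error = max(abs(factored[i][j] - dense[i][j])
                for i in range(4) for j in range(4))
print(max_error)
```

Each Kronecker factor contributes 4 additions and the diagonal contributes one nontrivial multiplication, which matches the counts listed on the slide.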

Some Transforms Spiral currently contains 45 transforms

Some Breakdown Rules
• Base case rules
• Spiral currently contains 165 rules

SPL (Signal Processing Language)
• SPL expresses transform algorithms as structured sparse matrix factorizations
• Examples:
• Kronecker product = loop (parallel, vector)
    for i = 0:n-1
        y[im:im+m-1] = B·x[im:im+m-1]
    endfor
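The loop above corresponds to multiplying by I_n ⊗ B, where B is an m x m matrix. A minimal Python sketch makes the equivalence concrete by comparing the loop against the explicit Kronecker product matrix. The function names are hypothetical and chosen only for this illustration.

```python
def apply_blockwise(B, x):
    """Compute y = (I_n ⊗ B) x as a loop over n contiguous blocks of size m."""
    m = len(B)
    n = len(x) // m
    y = [0] * len(x)
    for i in range(n):
        block = x[i * m:(i + 1) * m]
        y[i * m:(i + 1) * m] = [sum(B[r][c] * block[c] for c in range(m))
                                for r in range(m)]
    return y

def kron(a, b):
    """Kronecker product a ⊗ b as a dense matrix (reference only)."""
    return [[a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

def matvec(matrix, x):
    """Dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

B = [[1, 1], [1, -1]]
x = [1, 2, 3, 4, 5, 6]
I3 = [[1 if r == c else 0 for c in range(3)] for r in range(3)]

y_loop = apply_blockwise(B, x)
y_dense = matvec(kron(I3, B), x)
print(y_loop, y_dense)
```

The loop iterations are independent of each other, which is why this construct maps naturally to parallel execution.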

Formula Level Optimization: Idea
• Move optimizations to a higher abstraction level: domain knowledge overcomes compiler limitations
• Code level: traditionally, optimizations by C/Fortran compilers
• Formula level optimizations in Spiral, implemented through rewriting systems:
  • Loop merging
  • Vectorization
  • Parallelization

SIMD (Single Instruction Multiple Data) Vector Instructions in a Nutshell
• What are these instructions?
  • Extension of the ISA. Data types and instructions for parallel computation on short (2-way to 16-way) vectors of integers and floats
  • Example vector operation: addps xmm0, xmm1 with vector registers xmm0 = (1, 2, 4, 4) and xmm1 = (5, 1, 1, 3) leaves xmm0 = (6, 3, 5, 7)
• Instruction sets: Intel MMX, AMD 3DNow!, Intel SSE, AMD Enhanced 3DNow!, Motorola AltiVec, AMD 3DNow! Professional, Itanium, Intel XScale, Intel SSE2, AMD-64, IBM BlueGene/L PPC440FP2, Intel Wireless MMX, Intel SSE3, ...
• Problems:
  • Not standardized
  • Compiler vectorization limited
  • Low-level issues (data alignment, ...)
  • Reordering data kills runtime
• One can easily slow down a program by vectorizing it
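The semantics of the 4-way add on this slide can be modeled in plain Python. This only illustrates what the instruction computes, using the register values from the slide. The function `addps` below is a hypothetical stand-in for the hardware instruction, not a real binding to it.

```python
def addps(dst, src):
    """Model of a 4-way packed float add: one instruction, four lane-wise sums."""
    if len(dst) != 4 or len(src) != 4:
        raise ValueError("a 4-way vector register holds exactly 4 floats")
    return [a + b for a, b in zip(dst, src)]

xmm0 = [1.0, 2.0, 4.0, 4.0]
xmm1 = [5.0, 1.0, 1.0, 3.0]
xmm0 = addps(xmm0, xmm1)   # destination register is overwritten with the sum
print(xmm0)
```

On real hardware the four additions happen in a single instruction, provided the operands already sit in registers in the right order. Getting them there is where the data reordering cost mentioned above comes from.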

Vectorization of Formulas by Rewriting
• Naturally vectorizable construct (Franchetti and Püschel, IPDPS 2002/2003): A ⊗ I_4, which operates on 4-way vectors. In general A ⊗ I_ν, where ν is the vector length (any two-power)
• Rewriting rules to vectorize formulas
  • Introduce data reorganization (permutations)
  • The right-hand side of a rule consists of vector constructs, terms for further rewriting, and base cases
• Definition: Vectorized formula := vector constructs and base cases, and A · B and I ⊗ A of vectorized formulas
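Why A ⊗ I_ν is naturally vectorizable can be sketched as follows: the input splits into contiguous vectors of length ν, and every scalar operation of A becomes one operation on whole vectors, with no shuffling inside a vector. The helpers `vadd`, `vscale`, and `apply_a_kron_i` are hypothetical names for this illustration, and ν = 4 is an assumption that matches the 4-way example.

```python
def vadd(a, b):
    """Lane-wise vector add (stands in for one SIMD add)."""
    return [p + q for p, q in zip(a, b)]

def vscale(c, a):
    """Multiply every lane by the scalar c (stands in for one SIMD multiply)."""
    return [c * p for p in a]

def apply_a_kron_i(A, x, v=4):
    """Compute y = (A ⊗ I_v) x using only operations on whole v-way vectors."""
    n = len(A)
    vectors = [x[j * v:(j + 1) * v] for j in range(n)]   # n aligned v-way vectors
    out = []
    for row in A:
        acc = [0] * v
        for coeff, vec in zip(row, vectors):
            acc = vadd(acc, vscale(coeff, vec))
        out.extend(acc)
    return out

A = [[1, 1], [1, -1]]
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = apply_a_kron_i(A, x)
print(y)
```

The scalar program for A carries over unchanged; only the data type of each operand changes from a scalar to a ν-way vector.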

Example: DFT
• The formula consists of vector constructs and base cases
• The formula is vectorized w.r.t. the definition

Some Vectorization Rules
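The rules on this slide were images and are not preserved. As one hedged illustration of the kind of identity such rules build on, the sketch below checks the standard commutation identity I_m ⊗ A_n = L^{mn}_m (A_n ⊗ I_m) L^{mn}_n, which turns a loop construct into a vector construct at the price of two stride permutations. It is not claimed to be one of the slide's rules, and all function names are hypothetical.

```python
def stride_perm(x, s):
    """Apply the stride permutation L^N_s: read x at stride s."""
    n = len(x)
    return [x[j + i * s] for j in range(s) for i in range(n // s)]

def apply_i_kron_a(A, x):
    """y = (I_m ⊗ A) x: apply A to each contiguous block of len(A) elements."""
    n = len(A)
    y = []
    for i in range(len(x) // n):
        block = x[i * n:(i + 1) * n]
        y.extend(sum(a * b for a, b in zip(row, block)) for row in A)
    return y

def apply_a_kron_i(A, x, m):
    """y = (A ⊗ I_m) x: combine len(A) vectors of length m lane by lane."""
    vectors = [x[j * m:(j + 1) * m] for j in range(len(A))]
    y = []
    for row in A:
        y.extend(sum(c * vec[k] for c, vec in zip(row, vectors)) for k in range(m))
    return y

A = [[1, 1], [1, -1]]          # n = 2
m = 4                          # plays the role of the vector length
x = [1, 2, 3, 4, 5, 6, 7, 8]

lhs = apply_i_kron_a(A, x)
rhs = stride_perm(apply_a_kron_i(A, stride_perm(x, len(A)), m), m)
print(lhs, rhs)
```

The permutations on the right-hand side are the data reorganization that rewriting introduces. They still have to be implemented with vector shuffles, which is the role of base cases.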

Shared Memory Parallelization by Rewriting
• Load balanced, contiguous blocks
• No false sharing (entire cache lines are swapped)
Reference: F. Franchetti, Y. Voronenko, and M. Püschel: "FFT Program Generation for Shared Memory: SMP and Multicore," to appear in SC|06
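The two properties can be sketched as a partitioning scheme: each of p workers receives one contiguous block of equal size, and the block size is a multiple of the cache line length mu, so no cache line is shared between workers. This is an illustrative sketch only. The names `partition` and `parallel_blocks` and the parameters `p` and `mu` are assumptions, and Python threads do not deliver real speed-up for this workload because of the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(n, p, mu):
    """Split indices 0..n-1 into p equal contiguous blocks, each a multiple of mu."""
    if n % (p * mu) != 0:
        raise ValueError("n must be a multiple of p * mu")
    size = n // p
    return [range(k * size, (k + 1) * size) for k in range(p)]

def parallel_blocks(kernel, x, p, mu):
    """Apply kernel to each worker's contiguous block and concatenate the results."""
    blocks = partition(len(x), p, mu)
    with ThreadPoolExecutor(max_workers=p) as pool:
        parts = list(pool.map(lambda r: kernel(x[r.start:r.stop]), blocks))
    return [value for part in parts for value in part]

x = list(range(16))
doubled = parallel_blocks(lambda block: [2 * v for v in block], x, p=2, mu=4)
print(partition(16, 2, 4), doubled)
```

Equal block sizes give load balance, and alignment of block boundaries to cache lines is what rules out false sharing.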

How Good is Our Generated Vector Code?
Complex 1D DFT on Intel Pentium 4, 3.6 GHz, 4-way SSE (float)
[Plot: pseudo Mflop/s (higher is better) vs. problem size (log2 N, 4 to 17). Curves: Spiral vector code (automatically generated); Intel MKL 8.1 (handcoded); FFTW 3.1 SSE (adapted, but hand-vectorized); scalar Spiral code + vectorizing compiler; scalar (x87) Spiral code (automatically generated). Annotation: 3.5x.]
Spiral generated code performs comparably to expertly hand-tuned code
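The slides do not define "pseudo Mflop/s". A common convention in FFT benchmarking, assumed here, charges every complex DFT of size N with 5 N log2 N floating-point operations regardless of the algorithm's real operation count, and divides by the runtime in microseconds. The function name below is hypothetical.

```python
import math

def pseudo_mflops(n, runtime_us):
    """Pseudo Mflop/s for a complex DFT of size n, assuming 5*n*log2(n) nominal flops."""
    if n < 2 or runtime_us <= 0:
        raise ValueError("need n >= 2 and a positive runtime")
    return 5 * n * math.log2(n) / runtime_us

# A hypothetical measurement: a 1024-point DFT that takes 10 microseconds.
rate = pseudo_mflops(1024, 10.0)
print(rate)
```

Because the nominal operation count is fixed per problem size, a higher value always means a shorter runtime, which makes implementations with different real operation counts directly comparable.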

What About 8-way Vector Code?
Complex 1D DFT on Intel Pentium 4, 3.6 GHz, 8-way SSE2 (16-bit int)
[Plot: MIPS (higher is better) vs. problem size N (64 to 8192). Curves: Spiral vector code (automatically generated); Intel IPP 5.0 (handcoded).]
Spiral generated code clearly outperforms expertly hand-tuned code

Combined Multicore and Vector Code
Pentium D 3.6 GHz (dual core, 2-way SIMD), double precision 1-D DFT
[Plot: pseudo Mflop/s (higher is better) vs. problem size (log2 N, 7 to 20). Curves: parallel + vector; parallel; sequential. Annotation: 2.5x.]
• 2.5x speed-up from parallel + vector
• Parallelization speed-up for small problems

Summary
• Parallelization and vectorization in Spiral
  • Entirely automatic
  • Principled approach
  • Rewriting system
  • Generated code is very fast
• Works for other hardware as well
  • Distributed memory: MPI, with C.W. Ueberhuber, A. Bonelli, and J. Lorenz, Vienna University of Technology
  • Hardware: FPGAs, with J.C. Hoe and Peter Milder, Carnegie Mellon University

(Part of the) Spiral Team www.spiral.net
