Operator Language: A Program Generation Framework for Fast Kernels
Franz Franchetti, Frédéric de Mesmay, Daniel McFarlin, Markus Püschel
Electrical and Computer Engineering, Carnegie Mellon University
Sponsors: DARPA DESA program, NSF-NGS/ITR, NSF-ACR, and Intel
The Problem: Example MMM
[Plot: matrix-matrix multiplication (MMM) on a 2x Core2 Duo, 3 GHz, double precision; performance in Gflop/s vs. matrix size (0 to 9,000). The best code (K. Goto) outperforms a naive triple loop by 160x.]
Similar plots can be shown for all numerical kernels in linear algebra, signal processing, coding, crypto, ...
What's going on? Hardware is becoming increasingly complex.
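For reference, the "triple loop" baseline from the plot can be sketched as follows; tuned libraries perform this same computation but restructure it for caches and SIMD, which is where the 160x gap comes from. This is an illustrative sketch, not the measured code.

```python
import numpy as np

# Naive triple-loop MMM: the slow baseline in the performance plot.
def mmm_triple_loop(A, B):
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for i in range(m):          # rows of A
        for j in range(n):      # columns of B
            for p in range(k):  # inner (reduction) dimension
                C[i, j] += A[i, p] * B[p, j]
    return C

# Usage: a small 2x2 example.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
C = mmm_triple_loop(A, B)
```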
Automatic Performance Tuning
Current vicious circle: whenever a new platform comes out, the same functionality needs to be rewritten and reoptimized.
Automatic performance tuning efforts:
- BLAS: ATLAS, PHiPAC
- Linear algebra: Sparsity/OSKI, FLAME
- Sorting
- Fourier transform: FFTW
- Linear transforms (and beyond): Spiral
- ... others
How do we build an extensible system? For more problem classes? For yet-uninvented platforms?
Proceedings of the IEEE special issue, Feb. 2005
What is Spiral?
Traditionally: a high-performance library is hand-optimized for a given platform.
Spiral approach: a high-performance library optimized for the given platform is generated automatically, with comparable performance.
Idea: Common Abstraction and Rewriting
Model: common abstraction = spaces of matching formulas = domain-specific language.
Rewriting connects the architecture space and the algorithm space; search picks the algorithm.
Architectural parameters: vector length, #processors, ...
Kernel parameters: problem size, algorithm choice.
Some Kernels as OL Formulas
- Linear transforms
- Viterbi decoding (convolutional encoder, Viterbi decoder)
- Matrix-matrix multiplication
- Synthetic Aperture Radar (SAR): preprocessing, matched filtering, interpolation, 2D iFFT
How Spiral Works
Pipeline: problem specification (transform) → Algorithm Generation → Algorithm Optimization → Implementation → Code Optimization → Compilation → Compiler Optimizations → fast executable. Search controls the algorithm and implementation choices, with performance fed back.
Spiral: complete automation of the implementation and optimization task.
Basic ideas: declarative representation of algorithms; rewriting systems to generate and optimize algorithms at a high level of abstraction.
Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005.
Organization
- Operator language and algorithms
- Optimizing algorithms for platforms
- Performance results
- Summary
Operators
Definition: an operator maps multiple complex vectors to multiple complex vectors.
Higher-dimensional data is linearized. Operators are potentially nonlinear.
Example: matrix-matrix multiplication (MMM) maps the pair (A, B) to C = A·B.
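The operator view of MMM can be sketched concretely: both inputs and the output are linearized vectors, and the operator un-linearizes, multiplies, and re-linearizes. This is an illustrative sketch of the definition, not Spiral code; the function name and row-major layout are assumptions.

```python
import numpy as np

# MMM(m, k, n) as an operator: C^(m*k) x C^(k*n) -> C^(m*n),
# with all matrices linearized (row-major) into flat vectors.
def mmm_operator(m, k, n, a_vec, b_vec):
    A = np.asarray(a_vec).reshape(m, k)   # un-linearize inputs
    B = np.asarray(b_vec).reshape(k, n)
    return (A @ B).reshape(-1)            # linearize the output

# Usage: a 2x2 times 2x2 product, all data flattened.
a = [1, 2, 3, 4]   # [[1, 2], [3, 4]]
b = [5, 6, 7, 8]   # [[5, 6], [7, 8]]
c = mmm_operator(2, 2, 2, a, b)
```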
Operator Language
OL Tensor Product: Repetitive Structure
- Kronecker product (structured matrices)
- OL tensor product (structured operators): the definition extends the Kronecker product to non-linear operators.
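The repetitive structure can be sketched for the common case I_n ⊗ A: as an operator it means "apply A independently to each of n contiguous chunks of the input", and for a linear A this agrees with the Kronecker product of matrices. The helper name and chunking convention are illustrative assumptions.

```python
import numpy as np

# I_n (x) A as a structured operator: apply A_op to each of n
# contiguous chunks of x (length `chunk` each) and concatenate.
def tensor_I_A(n, A_op, chunk, x):
    x = np.asarray(x)
    return np.concatenate(
        [A_op(x[i * chunk:(i + 1) * chunk]) for i in range(n)])

# For a linear A given as a matrix, this agrees with the
# Kronecker-product matrix np.kron(I_n, A) applied to x.
A = np.array([[0.0, 1.0], [1.0, 0.0]])   # swap two entries
x = np.arange(8.0)
y1 = tensor_I_A(4, lambda v: A @ v, 2, x)
y2 = np.kron(np.eye(4), A) @ x
```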
Translating OL Formulas Into Programs
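The idea of translating formulas into programs can be illustrated with a toy recursive code generator (this is not Spiral's actual compiler): each OL construct has a fixed code template, and nested constructs nest the templates. Construct names and templates here are invented for illustration.

```python
# Toy OL-to-code translator: formulas are tuples, output is a list
# of Python source lines. Templates are illustrative, not Spiral's.
def codegen(formula, inp="x", out="y"):
    op = formula[0]
    if op == "compose":          # (A . B): t = B(x); y = A(t)
        _, A, B = formula
        return codegen(B, inp, "t") + codegen(A, "t", out)
    if op == "tensor_I":         # I_n (x) A: loop over n chunks
        _, n, A, chunk = formula
        body = codegen(A, f"{inp}[i*{chunk}:(i+1)*{chunk}]",
                          f"{out}[i*{chunk}:(i+1)*{chunk}]")
        return [f"for i in range({n}):"] + ["    " + s for s in body]
    if op == "scale":            # pointwise scaling base case
        _, c = formula
        return [f"{out} = {c} * {inp}"]
    raise ValueError(f"unknown construct: {op}")

# Usage: generate code for I_4 (x) scale(2) on chunks of length 2.
lines = codegen(("tensor_I", 4, ("scale", 2), 2))
```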
Example: Matrix Multiplication (MMM)
Breakdown rules capture various forms of blocking.
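What a blocking breakdown rule expresses can be sketched directly: a large MMM decomposes into many small MMMs (the base case) over blocks. This sketch assumes block sizes that divide the matrix dimensions; it illustrates the rule, not Spiral's generated code.

```python
import numpy as np

# Blocked MMM: the loop nest materializes one choice of blocking
# that a breakdown rule would express at the formula level.
def blocked_mmm(A, B, mb, kb, nb):
    m, k = A.shape
    k2, n = B.shape
    assert k == k2 and m % mb == 0 and k % kb == 0 and n % nb == 0
    C = np.zeros((m, n))
    for i in range(0, m, mb):
        for j in range(0, n, nb):
            for p in range(0, k, kb):
                # small MMM base case on one block of C
                C[i:i+mb, j:j+nb] += A[i:i+mb, p:p+kb] @ B[p:p+kb, j:j+nb]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
B = rng.standard_normal((6, 4))
```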
Example: SAR Computation as OL Rules
Pipeline: grid compute → range interpolation → azimuth interpolation → 2D FFT.
Organization
- Operator language and algorithms
- Optimizing algorithms for platforms
- Performance results
- Summary
Modeling Multicore: Base Cases
Hardware abstraction: shared cache with cache lines.
- Tensor product: embarrassingly parallel operator; each of the processors (e.g., processors 0-3) applies A to its own chunk of x.
- Permutation: problematic; may produce false sharing.
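The embarrassingly parallel base case can be sketched with threads: each worker applies A to its own chunk of x and writes a disjoint chunk of y, so no two workers touch the same data (avoiding false sharing when chunks align with cache lines). The function name and chunking are illustrative assumptions.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# I_p (x) A on p processors: worker i reads chunk i of x and
# writes chunk i of y; the chunks are disjoint.
def parallel_tensor_I_A(p, A_op, x):
    x = np.asarray(x, dtype=float)
    chunk = len(x) // p
    y = np.empty_like(x)
    def work(i):
        y[i * chunk:(i + 1) * chunk] = A_op(x[i * chunk:(i + 1) * chunk])
    with ThreadPoolExecutor(max_workers=p) as pool:
        list(pool.map(work, range(p)))
    return y

# Usage: 4 workers, each doubling its own chunk.
y = parallel_tensor_I_A(4, lambda v: 2.0 * v, np.arange(8.0))
```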
Parallelization: OL Rewriting Rules
- Tags encode hardware constraints.
- Rules are algorithm-independent.
- Rules encode program transformations.
The Joint Rule Set: MMM
- Algorithm rules: breakdown rules
- Hardware constraints: base cases
- Program transformations: manipulation rules
The combined rule set spans the search space for empirical optimization.
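The empirical-optimization step over the spanned space can be sketched as "measure each candidate, keep the fastest". In this toy sketch, candidate block sizes stand in for alternative rule applications; the function, candidates, and timing loop are illustrative, not Spiral's search.

```python
import time
import numpy as np

# Toy empirical search: time each candidate variant of the same
# computation and return the fastest one.
def autotune_block_size(A, B, candidates=(1, 2, 4, 8)):
    best, best_t = None, float("inf")
    for b in candidates:
        if A.shape[0] % b:        # skip blockings that don't divide
            continue
        t0 = time.perf_counter()
        for i in range(0, A.shape[0], b):   # stand-in for generated code
            _ = A[i:i + b] @ B
        t = time.perf_counter() - t0
        if t < best_t:
            best, best_t = b, t
    return best

# Usage: pick a row-block size for an 8x8 product.
best = autotune_block_size(np.ones((8, 8)), np.ones((8, 8)))
```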
Parallelization Through Rewriting: MMM
The rewritten formula is load-balanced and exhibits no false sharing.
Same Approach for Different Paradigms
The same tagging-and-rewriting approach applies to threading, vectorization, GPUs, and Verilog for FPGAs.
Organization
- Operator language and algorithms
- Optimizing algorithms for platforms
- Performance results
- Summary
Matrix Multiplication Library
[Plots: rank-k update, k = 4, in single and double precision on a dual Intel Xeon 5160, 3 GHz; performance in Gflop/s vs. input size (2 to 512). The Spiral-generated library is compared against MKL 10.0 and GotoBLAS 1.26.]
Result: Spiral-Generated PFA SAR on Core2 Quad
[Plot: SAR image formation on Intel platforms; performance in Gflop/s for 16-megapixel and 100-megapixel images on a 3.0 GHz Core 2 (65 nm), a 3.0 GHz Core 2 (45 nm), a 2.66 GHz Core i7, and a (virtual) 3.0 GHz Core i7; the newer platforms reach 43-44 Gflop/s.]
Algorithm by J. Rudin (best paper award, HPEC 2007): 30 Gflop/s on Cell.
Each implementation: vectorized, threaded, cache-tuned, ~13 MB of code.
Organization
- Operator language and algorithms
- Optimizing algorithms for platforms
- Performance results
- Summary
Summary
- Platforms are powerful yet complicated; optimization will stay a hard problem.
- OL: a unified mathematical framework that captures platforms and algorithms.
- Spiral: program generation and autotuning can provide full automation.
- Performance of supported kernels is competitive with expert tuning.