Carnegie Mellon
Spiral: Computer Generation of Performance Libraries
José M. F. Moura, Markus Püschel, Franz Franchetti & the Spiral Team
[Figure labels: Applications, Platforms, Performance]

What is Spiral?
Traditionally: a high-performance library optimized for a given platform.
Spiral approach: Spiral generates a high-performance library optimized for a given platform.
Both approaches reach comparable performance.

Main Idea: Program Generation
Model: common abstraction = spaces of matching formulas
• Architecture space (ν): architectural parameters such as vector length, #processors, …
• Algorithm space (μ): kernel, given by problem size and algorithm choice
• Both spaces are mapped into the model by abstraction
[Figure labels connecting the spaces: defines, rewriting, search, pick, optimization]

How Spiral Works
Input: problem specification (“DFT 1024” or “DFT”)
• Algorithm Generation and Algorithm Optimization produce an algorithm
• Implementation and Code Optimization produce C code
• Compilation and Compiler Optimizations produce a fast executable
• Search measures performance and controls the algorithm and implementation stages
Output: Spiral delivers a fast executable
Complete automation of the implementation and optimization task
Basic ideas:
• Declarative representation of algorithms
• Rewriting systems to generate and optimize algorithms at a high level of abstraction

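The idea of generating algorithms by rewriting can be illustrated with a toy sketch. It assumes a single rule, the well-known Cooley-Tukey breakdown DFT_n → (DFT_k ⊗ I_m) T^n_m (I_k ⊗ DFT_m) L^n_k for n = km, with DFT_2 (written F_2) as the only base case. The function names `factor_pairs`, `expand`, and `search` and the string representation are hypothetical and are not Spiral's actual interface; the `cost` argument is a placeholder for the measured runtime that drives a real search.

```python
from functools import lru_cache


def factor_pairs(n):
    """All ways to write n = k * m with 1 < k < n."""
    return [(k, n // k) for k in range(2, n) if n % k == 0]


@lru_cache(maxsize=None)
def expand(n):
    """Enumerate all fully expanded formulas for DFT_n as strings.

    Toy rewriting system: apply the Cooley-Tukey rule recursively until
    only the base case F_2 remains. Sizes that cannot be reduced to
    F_2 (for example odd primes) yield no formulas in this sketch.
    """
    if n == 2:
        return ("F_2",)
    formulas = []
    for k, m in factor_pairs(n):
        for left in expand(k):
            for right in expand(m):
                formulas.append(
                    f"(({left}) ⊗ I_{m}) T^{n}_{m} (I_{k} ⊗ ({right})) L^{n}_{k}"
                )
    return tuple(formulas)


def search(n, cost):
    """Pick the formula with the lowest cost (assumption: cost is any callable)."""
    return min(expand(n), key=cost)


if __name__ == "__main__":
    for size in (4, 8, 16):
        print(size, len(expand(size)), "formulas")
    # Formula length is used here only as a stand-in for a measured runtime.
    print(search(8, cost=len))
```

Even this single rule produces several candidate algorithms per size, which is why a search step is needed to choose among them.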
Algorithms: Rules in Domain Specific Language
• Linear Transforms
• Viterbi Decoding
  [Figure: the bit string 010001 passes through a convolutional encoder and a Viterbi decoder, which recovers 010001; code-word sequences shown: 11 10 00 01 10 01 11 00 and 11 10 01 01 10 10 11 00]
• Matrix-Matrix Multiplication
  [Figure: matrix product]
• Synthetic Aperture Radar (SAR)
  [Figure: processing chain of preprocessing, matched filtering, interpolation, 2D iFFT]

One Approach for all Types of Parallelism
• Multithreading (Multicore)
• Vector SIMD (SSE, VMX/Altivec, …)
• Message Passing (Clusters, MPP)
• Streaming/multibuffering (Cell)
• Graphics Processors (GPUs)
• Gate-level parallelism (FPGA)
• HW/SW partitioning (CPU + FPGA)

Example: Code Generation for Multicore CPUs
Hardware abstraction: shared cache with cache lines
Tensor product: embarrassingly parallel operator
[Figure: x is mapped to y by four independent copies of A, one each on Processor 0 to Processor 3]
Permutation: problematic; may produce false sharing
[Figure: x is mapped to y by a permutation]

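The contrast between the two operators can be sketched in a few lines. The sketch assumes the tensor product on the slide has the form I_p ⊗ A, so that each of p workers applies A to its own contiguous chunk, and it models false sharing simply by counting cache lines written by more than one worker. All function names, the cache-line length of 4 elements, and the choice of a stride permutation are illustrative assumptions. Python threads are used only to show that the chunks are independent, not to obtain a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def apply_matrix(a, x):
    """Dense matrix-vector product on plain lists."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def apply_identity_tensor(p, a, x):
    """y = (I_p ⊗ A) x: apply A to p contiguous chunks of x independently."""
    n = len(a[0])
    assert len(x) == p * n
    chunks = [x[i * n:(i + 1) * n] for i in range(p)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        blocks = list(pool.map(lambda chunk: apply_matrix(a, chunk), chunks))
    return [value for block in blocks for value in block]


def stride_permutation_target(j, size, stride):
    """Output position of input element j when the output is
    x[0::stride] followed by x[1::stride], and so on."""
    return (j % stride) * (size // stride) + j // stride


def shared_cache_lines(writes_per_worker, line_len):
    """Count cache lines that are written by more than one worker."""
    writers = {}
    for worker, positions in enumerate(writes_per_worker):
        for pos in positions:
            writers.setdefault(pos // line_len, set()).add(worker)
    return sum(1 for w in writers.values() if len(w) > 1)


P, N, LINE = 4, 4, 4  # workers, chunk size, elements per cache line (assumed)
SIZE = P * N

# Tensor product: worker w writes only its own contiguous block of y.
block_writes = [range(w * N, (w + 1) * N) for w in range(P)]

# Permutation: worker w reads a contiguous block of x but scatters it into y.
perm_writes = [
    [stride_permutation_target(j, SIZE, 4) for j in range(w * N, (w + 1) * N)]
    for w in range(P)
]

if __name__ == "__main__":
    print(shared_cache_lines(block_writes, LINE))
    print(shared_cache_lines(perm_writes, LINE))
```

In this model the tensor product leaves every cache line with a single writer, while the stride permutation makes every worker write into every cache line, which is the false-sharing pattern the slide warns about.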
Spiral: Meta-Tool to Autotuning Libraries
Input to the Spiral Library Generator:
• Transform: [formula not preserved]
• Algorithms: [formulas not preserved]
• Vectorization: 2-way SSE
• Threading: Yes
Output: optimized library (10,000 lines of C++)
• For general input size (not a collection of fixed sizes)
• Vectorized
• Multithreaded
• With runtime adaptation mechanism
• Performance competitive with hand-written code
Result: a high-performance, “FFTW-like” library

Verification and Testing
• Verify algorithms symbolically [Figure: formula = ? formula]
• Check rules through verification of instances [Figure: formula = ? formula]
• Check code empirically:
  DFT4([0,1,0,0]) = ?
  DFT4([0.1,1.77,2.28,-55.3]) = ? DFT4_rnd([0.1,1.77,2.28,-55.3])

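An empirical check of this kind can be sketched as follows. The sketch compares a hand-unrolled size-4 DFT, standing in for generated code, against the DFT definition on basis vectors and on the test vector from the slide. The function names and the tolerance of 1e-9 are assumptions made for illustration, and `dft4_fast` is not Spiral output.

```python
import cmath


def dft_by_definition(x):
    """Reference DFT: y[k] = sum over l of exp(-2*pi*i*k*l/n) * x[l]."""
    n = len(x)
    return [
        sum(cmath.exp(-2j * cmath.pi * k * l / n) * x[l] for l in range(n))
        for k in range(n)
    ]


def dft4_fast(x):
    """Hand-unrolled radix-2 DFT of size 4 (stand-in for generated code)."""
    a0, a1 = x[0] + x[2], x[0] - x[2]
    b0, b1 = x[1] + x[3], x[1] - x[3]
    return [a0 + b0, a1 - 1j * b1, a0 - b0, a1 + 1j * b1]


def max_error(u, v):
    """Largest absolute difference between two equally long vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


def check(candidate, reference, inputs, tol=1e-9):
    """Return True if candidate matches reference on every test input."""
    return all(max_error(candidate(x), reference(x)) < tol for x in inputs)


# Basis vectors probe each column of the transform; the last vector is the
# numeric example from the slide.
test_inputs = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0.1, 1.77, 2.28, -55.3],
]

if __name__ == "__main__":
    print(check(dft4_fast, dft_by_definition, test_inputs))
```

Applying the code to all basis vectors reconstructs the full matrix it implements, so for a linear transform that part of the test is exhaustive up to rounding error.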
Range: Cell Phone To Supercomputer
Samsung i9100 Galaxy S II
• Dual-core ARM at 1.2 GHz with NEON ISA
• SIMD vectorization + multi-threading
BlueGene/P at Argonne National Laboratory
• 128k cores (quad-core CPUs) at 850 MHz
• SIMD vectorization + multi-threading + MPI
• Global FFT (1D FFT, HPC Challenge): 6.4 Tflop/s
[Figure: performance in Gflop/s]
G. Almási, B. Dalton, L. L. Hu, F. Franchetti, Y. Liu, A. Sidelnik, T. Spelce, I. G. Tānase, E. Tiotto, Y. Voronenko, X. Xue: 2010 IBM HPC Challenge Class II Submission. Winner of the 2010 HPC Challenge Class II Award (Most Productive System).

More Results: Spiral Outperforms Humans
[Figure panels: FFT on Multicore, SAR, SDR, FFT on FPGA; one axis is labeled "improvement"]

More Information: www.spiral.net, www.spiralgen.com