Automated Empirical Optimizations of Software and the ATLAS project: Software Engineering Seminar


  1. Automated Empirical Optimizations of Software and the ATLAS project*
     Software Engineering Seminar, Pascal Spörri
     * R. Clint Whaley, Antoine Petitet and Jack Dongarra. Parallel Computing, 27(1-2):3-35, 2001.
     Is Search Really Necessary to Generate High-Performance BLAS? Kamen Yotov, Xiaoming Li, Gang Ren, Maria Garzaran, David Padua, Keshav Pingali and Paul Stodghill. Proceedings of the IEEE, Vol. 93, No. 2, February 2005.

  2. INTRODUCTION

  3. BLAS (Basic Linear Algebra Subprograms)
     • Level 1: vector operations, $y \leftarrow \alpha x + y$
     • Level 2: matrix-vector operations, $y \leftarrow \alpha A x + y$
     • Level 3: matrix-matrix operations, $C \leftarrow \alpha A B + \beta C$
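
To make the three levels concrete, here is a minimal C sketch of one representative operation per level. The function names, the row-major layout and the argument lists are assumptions made for illustration; the reference BLAS routines take additional arguments (strides, leading dimensions, transposition flags) that are left out here.

```c
#include <assert.h>
#include <stddef.h>

/* Level 1 (vector-vector): y <- alpha * x + y, vectors of length n. */
void axpy_sketch(size_t n, double alpha, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Level 2 (matrix-vector): y <- alpha * A * x + y.
 * A is m x n and stored row-major (an assumption of this sketch). */
void gemv_sketch(size_t m, size_t n, double alpha,
                 const double *A, const double *x, double *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < m; i++) {
        double acc = 0.0;
        for (size_t j = 0; j < n; j++)
            acc += A[i * n + j] * x[j];
        y[i] += alpha * acc;
    }
}

/* Level 3 (matrix-matrix): C <- alpha * A * B + beta * C.
 * A is m x k, B is k x n, C is m x n, all row-major. */
void gemm_sketch(size_t m, size_t n, size_t k, double alpha,
                 const double *A, const double *B, double beta, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}
```

The amount of arithmetic per loaded element grows from Level 1 to Level 3, which is why the matrix-matrix case is the one that benefits most from blocking.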

  4. ATLAS (Automatically Tuned Linear Algebra Software)
     • Implements BLAS
     • Applies empirical optimization techniques to source code to generate an optimized library
     • Fully automatic
     • Produces ANSI C code

  5. ATLAS Matrix-Matrix Multiplication
     [Figure, labels: MMM] Source: "Automated Empirical Optimization of Software and the ATLAS project" by R. Clint Whaley, Antoine Petitet and Jack Dongarra. Parallel Computing, 27(1-2):3-35, 2001.

  6. ARCHITECTURE

  7. ATLAS Architecture
     [Figure: Detect Hardware Parameters (L1 cache, CPU parameters) → ATLAS Search Engine → Parameters → ATLAS Code Generator → Source Code (multiple versions) → Execute and Measure → MFLOPS, fed back to the ATLAS Search Engine]
     Source: Kamen Yotov, Xiaoming Li, Gang Ren, Maria Garzaran, David Padua, Keshav Pingali and Paul Stodghill. Proceedings of the IEEE, Vol. 93, No. 2, February 2005.

  8. ATLAS CODE GENERATOR

  9. ATLAS Optimizations
     • Case: matrix-matrix multiplication
     [Figure: A × B = C, with dimension labels M, N, K]
         for $i \in [0:1:N-1]$
           for $j \in [0:1:M-1]$
             for $k \in [0:1:K-1]$
               $C_{i,j} = C_{i,j} + A_{i,k} \times B_{k,j}$
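
The triple loop on this slide can be written directly in C. This is a minimal sketch of the unoptimized starting point, not ATLAS output; the function name and the row-major layout (A is N x K, B is K x M, C is N x M, following the loop bounds) are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Naive matrix-matrix multiplication: C = C + A * B.
 * Assumed layout (not stated on the slide): row-major arrays,
 * A is N x K, B is K x M, C is N x M. */
void mmm_naive(size_t N, size_t M, size_t K,
               const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < M; j++)
            for (size_t k = 0; k < K; k++)
                C[i * M + j] += A[i * K + k] * B[k * M + j];
}
```

Every optimization on the following slides is a transformation of this loop nest that leaves the computed result unchanged.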

  10. Loop Ordering
      [Figure: A × B = C, with row index i and column index j]
      • Store $A$ in cache:
          for $j \in [0:1:M-1]$
            for $i \in [0:1:N-1]$
              for $k \in [0:1:K-1]$
                $C_{i,j} = C_{i,j} + A_{i,k} \times B_{k,j}$
      • Store $B$ in cache:
          for $i \in [0:1:N-1]$
            for $j \in [0:1:M-1]$
              for $k \in [0:1:K-1]$
                $C_{i,j} = C_{i,j} + A_{i,k} \times B_{k,j}$
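
The two loop orders differ only in which index is outermost. The sketch below, with hypothetical function names and an assumed row-major layout (A is N x K, B is K x M, C is N x M), shows both side by side; they compute the same result but traverse memory differently.

```c
#include <assert.h>
#include <stddef.h>

/* JIK order: j is outermost. Each iteration of j sweeps over all of A,
 * so A is the operand that profits from staying in cache. */
void mmm_jik(size_t N, size_t M, size_t K,
             const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t j = 0; j < M; j++)
        for (size_t i = 0; i < N; i++)
            for (size_t k = 0; k < K; k++)
                C[i * M + j] += A[i * K + k] * B[k * M + j];
}

/* IJK order: i is outermost. Each iteration of i sweeps over all of B,
 * so B is the operand that profits from staying in cache. */
void mmm_ijk(size_t N, size_t M, size_t K,
             const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < M; j++)
            for (size_t k = 0; k < K; k++)
                C[i * M + j] += A[i * K + k] * B[k * M + j];
}
```

Which order is preferable therefore depends on which of the two input matrices is small enough to remain cached.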

  11. 1st Level Blocking
      [Figure: A × B = C partitioned into $N_B \times N_B$ blocks, with dimension labels M, N, K]
      • $N_B$ is chosen such that the working set fits into the L1 cache
          for $i \in [0:N_B:N-1]$
            for $j \in [0:N_B:M-1]$
              for $k \in [0:N_B:K-1]$
                for $j' \in [j:1:j+N_B-1]$
                  for $i' \in [i:1:i+N_B-1]$
                    for $k' \in [k:1:k+N_B-1]$
                      $C_{i',j'} = C_{i',j'} + A_{i',k'} \times B_{k',j'}$
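
The blocked loop nest translates to C as follows. This is a sketch under stated assumptions: row-major storage (A is N x K, B is K x M, C is N x M), a hypothetical function name, and clamping of the tile bounds so that dimensions need not be multiples of the block size, which the slide's loops do not handle.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Cache-blocked C = C + A * B with square tiles of side NB.
 * The three outer loops step over tiles, the three inner loops
 * multiply one tile of A with one tile of B (a "mini-MMM"). */
void mmm_blocked(size_t N, size_t M, size_t K, size_t NB,
                 const double *A, const double *B, double *C)
{
    assert(NB > 0);
    for (size_t i = 0; i < N; i += NB)
        for (size_t j = 0; j < M; j += NB)
            for (size_t k = 0; k < K; k += NB) {
                /* Clamp tile ends at the matrix borders (added here). */
                size_t i_end = min_sz(i + NB, N);
                size_t j_end = min_sz(j + NB, M);
                size_t k_end = min_sz(k + NB, K);
                for (size_t jj = j; jj < j_end; jj++)
                    for (size_t ii = i; ii < i_end; ii++)
                        for (size_t kk = k; kk < k_end; kk++)
                            C[ii * M + jj] += A[ii * K + kk] * B[kk * M + jj];
            }
}
```

With a suitable NB, the three tiles touched by the inner loops stay in the L1 cache while they are reused.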

  12. 2nd Level Blocking
          for $i \in [0:N_B:N-1]$
            for $j \in [0:N_B:M-1]$
              for $k \in [0:N_B:K-1]$
                for $j' \in [j:N_U:j+N_B-1]$
                  for $i' \in [i:M_U:i+N_B-1]$
                    for $k' \in [k:K_U:k+N_B-1]$
                      for $k'' \in [k':1:k'+K_U-1]$
                        for $j'' \in [j':1:j'+N_U-1]$
                          for $i'' \in [i':1:i'+M_U-1]$
                            $C_{i'',j''} = C_{i'',j''} + A_{i'',k''} \times B_{k'',j''}$
      • $M_U + N_U + M_U \times N_U \le N_R$, where $N_R$ is the number of available registers
      • Unroll loop
      [Figure: an $M_U \times N_U$ tile of C updated from a column of A and a row of B at step k]
      Graphic from “How To Write Fast Numerical Code: A Small Introduction”, Srinivas Chellappa, Franz Franchetti, and Markus Püschel
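
As an illustration of register-level blocking, the sketch below multiplies one NB x NB tile using a fixed 2 x 2 register tile (M_U = N_U = 2, K_U = 1), so it needs 2 + 2 + 2 × 2 = 8 scalar values at a time. The tile sizes, the function name and the row-major layout are choices made for this example; ATLAS selects them by search.

```c
#include <assert.h>
#include <stddef.h>

/* Mini-MMM on one NB x NB tile: C = C + A * B, all row-major, NB even.
 * The j''/i'' loops over the 2 x 2 register tile are fully unrolled and
 * the four C entries are kept in scalars during the whole k loop. */
void mini_mmm_2x2(size_t NB, const double *A, const double *B, double *C)
{
    assert(NB % 2 == 0);
    for (size_t j = 0; j < NB; j += 2)
        for (size_t i = 0; i < NB; i += 2) {
            double c00 = C[i * NB + j];
            double c01 = C[i * NB + j + 1];
            double c10 = C[(i + 1) * NB + j];
            double c11 = C[(i + 1) * NB + j + 1];
            for (size_t k = 0; k < NB; k++) {
                double a0 = A[i * NB + k];        /* M_U = 2 values of A */
                double a1 = A[(i + 1) * NB + k];
                double b0 = B[k * NB + j];        /* N_U = 2 values of B */
                double b1 = B[k * NB + j + 1];
                c00 += a0 * b0;
                c01 += a0 * b1;
                c10 += a1 * b0;
                c11 += a1 * b1;
            }
            C[i * NB + j] = c00;
            C[i * NB + j + 1] = c01;
            C[(i + 1) * NB + j] = c10;
            C[(i + 1) * NB + j + 1] = c11;
        }
}
```

Each loaded value of A and B is used twice here, which halves the number of loads per multiply-add compared with the plain inner loop.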

  13. Scalar Replacement
      • Replace array accesses with scalars
      Stored in memory:
          double t[2];
          for (i=0; i<8; i++) {
              t[0] = x[2*i] + x[2*i+1];
              t[1] = x[2*i] - x[2*i+1];
              y[2*i] = t[0] * D[2*i];
              y[2*i+1] = t[1] * D[2*i];
          }
      Store intermediate results in registers:
          double t0, t1, x0, x1, D0;
          for (i=0; i<8; i++) {
              x0 = x[2*i]; x1 = x[2*i+1]; D0 = D[2*i];  /* store for reuse */
              t0 = x0 + x1;
              t1 = x0 - x1;
              y[2*i] = t0 * D0;
              y[2*i+1] = t1 * D0;
          }
      Source: How To Write Fast Numerical Code: A Small Introduction, Srinivas Chellappa, Franz Franchetti, and Markus Püschel

  14. Scalar Replacement
          a11 = A[1][1]
          a12 = A[1][2]
          a13 = A[1][3]
          a14 = A[1][4]
          …
          b11 = B[1][1]
          b12 = B[1][2]
          b13 = B[1][3]
          b14 = B[1][4]
          …
          c11 = a11*b11
          c11 += a12*b21
          c11 += a13*b31
          …
          c12 = a11*b12
          c12 += a12*b22
          c12 += a13*b32
          …
          C[1][1] = c11
          C[1][2] = c12
          C[1][3] = c13

  15. Data Hazards
      [Figure: pipeline diagram (stages IF, ID, EX, MEM, WB) for the following instruction sequence, in which every instruction after the load reads R1]
          LD   R1, 0(R2)
          DSUB R4, R1, R5
          AND  R6, R1, R7
          OR   R8, R1, R9
          XOR  R10, R1, R11
      • Skewing factor
      Source: Jens Teubner · Data Processing on Modern Hardware · Fall 2010

  16. Pipeline Scheduling
      • Interleave $\mathit{mul}$ and $\mathit{add}$ sequences
      • Skewing factor $L_S$
      • Resulting order: $\mathit{mul}_1, \mathit{mul}_2, \ldots, \mathit{mul}_{L_S}, \mathit{add}_1, \mathit{mul}_{L_S+1}, \mathit{add}_2, \mathit{mul}_{L_S+2}, \mathit{add}_3, \ldots$
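
The interleaving pattern on this slide can be generated mechanically. The sketch below builds the instruction order for n multiply/add pairs and a skewing factor Ls; the `slot_t` type and the function name are hypothetical, and appending the leftover adds at the end is an assumption about how the sequence drains.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    char op;   /* 'm' = mul, 'a' = add */
    int  idx;  /* 1-based index within its sequence */
} slot_t;

/* Writes the skewed schedule for n mul/add pairs into out (room for
 * 2 * n slots) and returns the number of slots written. */
size_t skewed_schedule(int n, int Ls, slot_t *out)
{
    assert(n >= 0 && Ls >= 1 && out != NULL);
    size_t pos = 0;
    int next_mul = 1, next_add = 1;

    /* Prologue: the first Ls multiplies are issued back to back. */
    while (next_mul <= Ls && next_mul <= n) {
        out[pos].op = 'm';
        out[pos].idx = next_mul++;
        pos++;
    }
    /* Steady state: add_t is followed by mul_{Ls+t}; once the multiplies
     * run out, the remaining adds are issued consecutively. */
    while (next_add <= n) {
        out[pos].op = 'a';
        out[pos].idx = next_add++;
        pos++;
        if (next_mul <= n) {
            out[pos].op = 'm';
            out[pos].idx = next_mul++;
            pos++;
        }
    }
    return pos;
}
```

For n = 4 and Ls = 2 this yields mul 1, mul 2, add 1, mul 3, add 2, mul 4, add 3, add 4, so each add is separated from the multiply it depends on.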

  17. Pipeline Scheduling
          a11 = A[1][1]
          a12 = A[1][2]
          a13 = A[1][3]
          a14 = A[1][4]
          …
          b11 = B[1][1]
          b12 = B[1][2]
          b13 = B[1][3]
          b14 = B[1][4]
          …
          c11 = a11*b11
          c12 = a11*b12
          …
          c11 += a12*b21
          c12 += a12*b22
          …
          c11 += a13*b31
          c12 += a13*b32
          …
          C[1][1] = c11
          C[1][2] = c12
          C[1][3] = c13

  18. EMPIRICAL OPTIMIZATION IN ATLAS

  19. ATLAS Architecture
      [Figure: Detect Hardware Parameters (L1 cache, CPU parameters) → ATLAS Search Engine, which optimizes $f(x_1, x_2, x_3, \ldots, x_n)$ → Parameters → ATLAS Code Generator → Source Code (multiple versions) → Execute and Measure → MFLOPS, fed back to the ATLAS Search Engine]
      Source: Kamen Yotov, Xiaoming Li, Gang Ren, Maria Garzaran, David Padua, Keshav Pingali and Paul Stodghill. Proceedings of the IEEE, Vol. 93, No. 2, February 2005.

  20. Optimization Order
      1. Find best block size for outer loop
      2. Find best block sizes for inner loop
      3. Find best skewing factor
      4. Find best parameters for scheduling of loads
      5. Additional parameters
      [Figure: outer blocking with $N_B$ and inner blocking with $M_U$, $N_U$; dimension labels M, N, K]
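
For step 2, one plausible way to bound the candidates is the register constraint from the 2nd level blocking slide, $M_U + N_U + M_U \times N_U \le N_R$. The sketch below enumerates every pair that satisfies it; the type and function names are hypothetical, and whether ATLAS enumerates candidates in exactly this way is an assumption.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned mu;  /* rows of the register tile */
    unsigned nu;  /* columns of the register tile */
} reg_tile_t;

/* Enumerates all (MU, NU) with MU, NU >= 1 and
 * MU + NU + MU * NU <= NR. Stores at most cap pairs in out and
 * returns the total number of feasible pairs. */
size_t list_register_tiles(unsigned NR, reg_tile_t *out, size_t cap)
{
    assert(out != NULL || cap == 0);
    size_t count = 0;
    /* With NU = 1 the constraint reads 2 * MU + 1 <= NR. */
    for (unsigned mu = 1; 2 * mu + 1 <= NR; mu++)
        for (unsigned nu = 1; mu + nu + mu * nu <= NR; nu++) {
            if (count < cap) {
                out[count].mu = mu;
                out[count].nu = nu;
            }
            count++;
        }
    return count;
}
```

Each candidate would then be generated, executed and timed, and the fastest one kept, in the same way as for the outer block size.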

  21. Search for best Outer Loop Size
      • Restrict search space: $16 \le N_B \le \min(80, \sqrt{\mathit{L1Size}})$
      • $N_B$ must be a multiple of 4
      • Use fastest version
      • Try with and without unrolling the inner loop
      [Figure: A × B = C partitioned into $N_B \times N_B$ blocks, with dimension labels M, N, K]
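
The restricted search over the outer block size amounts to a short loop. In the sketch below, the measurement step is passed in as a callback; `measure_fn`, `search_nb` and `toy_mflops_model` are hypothetical names, the L1 size is assumed to be given in matrix elements, and the toy model is a made-up stand-in for actually generating, running and timing code.

```c
#include <assert.h>
#include <stddef.h>

/* Callback returning the measured MFLOPS for a given block size. */
typedef double (*measure_fn)(size_t nb);

/* Tries every multiple of 4 in [16, min(80, sqrt(l1_size))] and returns
 * the block size with the highest measurement, or 0 if the range is
 * empty. l1_size is the L1 data cache capacity in elements (assumed). */
size_t search_nb(size_t l1_size, measure_fn measure_mflops)
{
    assert(measure_mflops != NULL);
    size_t root = 0;                       /* integer square root */
    while ((root + 1) * (root + 1) <= l1_size)
        root++;
    size_t upper = root < 80 ? root : 80;

    size_t best_nb = 0;
    double best_mflops = -1.0;
    for (size_t nb = 16; nb <= upper; nb += 4) {
        double mflops = measure_mflops(nb);
        if (mflops > best_mflops) {
            best_mflops = mflops;
            best_nb = nb;
        }
    }
    return best_nb;
}

/* Made-up performance model that peaks at NB = 44; it only serves to
 * exercise the search and does not describe any real machine. */
double toy_mflops_model(size_t nb)
{
    double d = (double)nb - 44.0;
    return 1000.0 - d * d;
}
```

With a 32 KB L1 cache holding 4096 doubles, the upper bound is 64, so `search_nb(4096, toy_mflops_model)` examines 13 candidates instead of every possible block size.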

  22. DISCUSSION

  23. Comparison to PhiPAC
      PhiPAC
      • Coding methodology to write fast code
      • Precursor for ATLAS
      • Specialized code generator for BLAS matrix-matrix multiplication
      • Optimizes parameters for inner and outer loop
      ATLAS
      • Library generator
      • Automatic generation of optimized BLAS
      • Support for hand-coded routines

  24. ATLAS Matrix-Matrix Multiplication
      [Figure, labels: MMM] Source: "Automated Empirical Optimization of Software and the ATLAS project" by R. Clint Whaley, Antoine Petitet and Jack Dongarra. Parallel Computing, 27(1-2):3-35, 2001.

  25. Comparison to Eigen
      • Benchmark: http://eigen.tuxfamily.org/index.php?title=Benchmark
      • Platform: Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz (x86_64)

  26. Conclusion
      Pro
      • Fast method to generate an optimized library for a new platform
      • Supports hand-optimized code
      • Implements BLAS
      Contra
      • Needs constant adjustment to support new architectures
      • Outdated

  27. Further Information
      • ATLAS Project: http://math-atlas.sourceforge.net/
      • BLAS: http://netlib.org/blas/
