Automatic Performance Tuning and Machine Learning
Markus Püschel (presentation)


  1. Carnegie Mellon. Automatic Performance Tuning and Machine Learning. Markus Püschel, Computer Science, ETH Zürich; with Frédéric de Mesmay, PhD, Electrical and Computer Engineering, Carnegie Mellon. Markus Püschel, ETH Zurich, 2011

  2. Carnegie Mellon. PhD and Postdoc openings: high-performance computing, compilers, theory, programming languages/generative programming.

  3. Carnegie Mellon. Why Autotuning? Matrix-matrix multiplication (MMM) on a quadcore Intel platform. [Plot: performance in Gflop/s vs. matrix size (0 to 9,000); the best implementation reaches ~45 Gflop/s while the triple loop stays near the bottom, a 160x gap.] Same (mathematical) operation count (2n^3), yet the compiled triple loop underperforms by 160x.

  4. Carnegie Mellon. Same for all critical compute functions: WiFi receiver (physical layer) on one Intel core. [Plot: throughput in Mbit/s vs. data rate in Mbit/s; optimized implementations are 30x to 35x faster.]

  5. Carnegie Mellon. Solution: Autotuning. Definition: searching over alternative implementations or parameters to find the fastest. Broader definition: automating performance optimization with tools that complement/aid the compiler or programmer. However: search is an important tool, but expensive. Solution: machine learning.

  6. Carnegie Mellon. Organization: autotuning examples; an example use of machine learning.

  7. Carnegie Mellon. [Diagram: a timeline with three stages: time of implementation, then time of installation (platform known), then time of use (problem parameters known).]

  8. Carnegie Mellon. PhiPAC/ATLAS: MMM Generator (Whaley, Bilmes, Demmel, Dongarra, …). c = a * b; blocking improves locality:

  /* Multiply n x n matrices a and b, accumulating into c;
     caller allocates c zeroed, e.g. c = (double *) calloc(n*n, sizeof(double)); */
  void mmm(double *a, double *b, double *c, int n) {
    int i, j, k, i1, j1, k1;
    for (i = 0; i < n; i += B)
      for (j = 0; j < n; j += B)
        for (k = 0; k < n; k += B)
          /* B x B mini matrix multiplications */
          for (i1 = i; i1 < i+B; i1++)
            for (j1 = j; j1 < j+B; j1++)
              for (k1 = k; k1 < k+B; k1++)
                c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
  }

  9. Carnegie Mellon. PhiPAC/ATLAS: MMM Generator. [Diagram: detected hardware parameters (L1 size, NR, MulAdd, L* latency) feed the ATLAS search engine, which chooses NB, MU/NU/KU, xFetch, MulAdd, and latency; the ATLAS code generator emits mini-MMM source, which is compiled, executed, and measured in Mflop/s, closing the search loop. Source: Pingali, Yotov, Cornell U.]

  10. Carnegie Mellon. [Timeline as before: the ATLAS MMM generator runs at time of installation, when the platform is known.]

  11. Carnegie Mellon. FFTW: Discrete Fourier Transform (DFT), Frigo and Johnson. Installation: configure/make. Usage: d = dft(n) searches for the fastest computation strategy; d(x,y) then executes it. [Diagram: for n = 1024, the planner tries, e.g., radix 16 (splitting into 16 x 64), then radix 8 down to base cases (8 x 8), with twiddle factors in between.]

  12. Carnegie Mellon. FFTW: Codelet Generator (Frigo). [Diagram: n → DFT codelet generator → dft_n(*x, *y, …), a fixed-size DFT function in straight-line code.]

  13. Carnegie Mellon. [Timeline: the FFTW codelet generator runs at time of implementation; the ATLAS MMM generator at time of installation; the FFTW adaptive library at time of use.]

  14. Carnegie Mellon. OSKI: Sparse Matrix-Vector Multiplication (Vuduc, Im, Yelick, Demmel). Blocking for registers improves locality (reuse of the input vector) but creates overhead (explicit zeros stored in the blocks).

  15. Carnegie Mellon. OSKI: Sparse Matrix-Vector Multiplication. Example trade-off: blocking fills blocks with 16 entries of which only 9 are nonzero, an overhead factor of 16/9 = 1.77; the gain from blocking (measured on dense MVM) is a factor of 1.4; net effect 1.4/1.77 = 0.79, i.e., no gain.

  16. Carnegie Mellon. [Timeline: FFTW codelet generator at time of implementation; ATLAS MMM generator at time of installation; FFTW adaptive library at time of use; OSKI sparse MVM spans installation and use, since tuning depends on the sparsity structure of the actual matrix.]

  17. Carnegie Mellon. Spiral: Linear Transforms & More. [Diagram: algorithm knowledge + platform description → Spiral → optimized implementation, regenerated for every new platform.]

  18. Carnegie Mellon. Program Generation in Spiral (Sketched). [Diagram: a user-specified transform is expanded via algorithm rules into a fast algorithm in SPL (many choices), lowered to Σ-SPL (loop optimizations), and compiled into an optimized implementation. Optimization happens at all abstraction levels (parallelization, vectorization, constant folding, scheduling, …), combined with search.]

  19. Carnegie Mellon. [Timeline as before, now adding Spiral: Spiral-generated code for fixed input sizes is produced when the platform is known, while Spiral-generated libraries for general input sizes adapt at time of use; machine learning shifts this adaptation toward installation time.]

  20. Carnegie Mellon. Organization: autotuning examples; an example use of machine learning.

  21. Carnegie Mellon. Online tuning (time of use) vs. offline tuning (time of installation). Online: installation is just configure/make; at use, d = dft(n) searches for the fastest computation strategy (including twiddles) before d(x,y) runs. Offline (the goal): at installation, run the search for a few sizes n and learn decision trees; at use, d = dft(n) requires no search before d(x,y).

  22. Carnegie Mellon. Integration with Spiral-Generated Libraries (Voronenko 2008). [Diagram: Spiral + some platform information → online-tunable library.]

  23. Carnegie Mellon. Organization: autotuning examples; an example use of machine learning: anatomy of an adaptive discrete Fourier transform library; decision tree generation using C4.5; results.

  24. Carnegie Mellon. Discrete/Fast Fourier Transform. Discrete Fourier transform (DFT): y = DFT_n · x with DFT_n = [ω_n^{kl}], ω_n = exp(-2πi/n). Cooley-Tukey fast Fourier transform (FFT): for n = km, DFT_n = (DFT_k ⊗ I_m) T (I_k ⊗ DFT_m) L, where T is a diagonal of twiddle factors and L a stride permutation. [Dataflow figure (right to left) for 16 = 4 x 4 omitted in this transcript.]

  25. Carnegie Mellon. Adaptive Scalar Implementation (FFTW 2.x). The choices used for adaptation are the base-case test and the radix:

  void dft(int n, cpx *y, cpx *x) {
    if (use_dft_base_case(n))
      dft_bc(n, y, x);
    else {
      int k = choose_dft_radix(n);
      int m = n / k;
      for (int i = 0; i < k; ++i)
        dft_strided(m, k, t + m*i, x + m*i);
      for (int i = 0; i < m; ++i)
        dft_scaled(k, m, precomp_d[i], y + i, t + i);
    }
  }
  void dft_strided(int n, int istr, cpx *y, cpx *x) { ... }
  void dft_scaled(int n, int str, cpx *d, cpx *y, cpx *x) { ... }

  26. Carnegie Mellon. Decision Graph of Library. [Same code as the previous slide: the two choice points, use_dft_base_case(n) and choose_dft_radix(n), form the library's decision graph.]

  27. Carnegie Mellon. Spiral-Generated Libraries. Beyond the standard scalar version, Spiral generates vectorized, threaded, and buffered variants: 20 mutually recursive functions; 10 different choices (occurring recursively); the choices are heterogeneous (radix, threading, buffering, …).

  28. Carnegie Mellon. Our Work: upon installation, generate decision trees for each choice. [Example decision tree figure omitted in this transcript.]

  29. Carnegie Mellon. Statistical Classification: C4.5. Features (events) are fed to C4.5, which scores them by entropy. Example: P(play | windy=false) = 6/8 and P(don't play | windy=false) = 2/8 give H(windy=false) = 0.81; P(play | windy=true) = 1/2 and P(don't play | windy=true) = 1/2 give H(windy=true) = 1.0; combined, H(windy) = 0.89; similarly H(outlook) = 0.69, H(humidity) = …

  30. Carnegie Mellon. Application to Libraries. Features = the arguments of the functions (except variable pointers): dft(int n, cpx *y, cpx *x); dft_strided(int n, int istr, cpx *y, cpx *x); dft_scaled(int n, int str, cpx *d, cpx *y, cpx *x). At installation time: run the search for a few input sizes n; this yields a training set of features and associated decisions (several for each size); generate decision trees using C4.5 and insert them into the library.

  31. Carnegie Mellon. Issues: correctness of the generated decision trees. Issue: learning from sizes n in {12, 18, 24, 48} may yield a tree that picks radix 6, which is invalid for other sizes. Solution: a correction pass through the decision tree. Also, the prime factor structure matters: for n = 2^i 3^j (n = 2, 3, 4, 6, 9, 12, 16, 18, 24, 27, 32, …), compute i and j and add them to the features.

  32. Carnegie Mellon. Experimental Setup: 3 GHz Intel Xeon 5160 (2 Core 2 Duos = 4 cores); Linux 64-bit, icc 10.1. Libraries: IPP 5.3, FFTW 3.2 alpha 2, Spiral-generated library.

  33. Carnegie Mellon. Learning works as expected. [Results figure omitted in this transcript.]

  34. Carnegie Mellon. “All” Sizes: all sizes n ≤ 2^18 with prime factors ≤ 19. [Results figure omitted in this transcript.]
