Introduction PetaBricks OpenTuner Conclusions
Autotuning Programs with Algorithmic Choice
Jason Ansel
MIT - CSAIL
December 18, 2013
High Performance Search Problem
• Parallelism ≠ Performance
• Exploiting parallelism is necessary but not sufficient
• Performance is a multi-dimensional search problem
• Normally done by expert programmers
• Optimization decisions often change program results
[Figure: performance search space with three axes: parallelism choices, algorithmic choices, accuracy choices]
High Performance Search Problem
Goal of this work
To automate the process of program optimization to create programs that can adapt to changing environments and goals.
• Language level solutions for concisely representing algorithmic choice spaces.
• Processes and compilation techniques to manage and explore these spaces.
• Autotuning techniques to efficiently solve these search problems.
Research Covered in This Talk
• The PetaBricks programming language: algorithmic choice at the language level [PLDI’09]
• Language level support for variable accuracy [CGO’11]
• Automated construction of multigrid V-cycles [SC’09]
• Code generation and autotuning for heterogeneous CPU/GPU mix of parallel processing units [ASPLOS’13]
• Solution for input sensitivity based on adaptive overhead-aware classifiers [Under review]
• OpenTuner: an extensible framework for program autotuning [Under review]
• Won’t be talking about work in: ASPLOS’09, ASPLOS’12, GECCO’11, IPDPS’09, PLDI’11, and many others
A Motivating Example for Algorithmic Choice
• How would you write a fast sorting algorithm?
  • Insertion sort
  • Quick sort
  • Merge sort
  • Radix sort
  • Poly-algorithms
std::stable_sort
/usr/include/c++/4.5.2/bits/stl_algo.h lines 3350-3367
Why 15?
• Dates back to at least 2000 (June 2000 SGI release)
• Still in current C++ STL shipped with GCC
• cutoff = 15 survived 10+ years
• In the source code for millions¹ of C++ programs
• There is nothing the compiler can do about it
¹ Any C++ program with “#include <algorithm>”, conservative estimate based on: http://c2.com/cgi/wiki?ProgrammingLanguageUsageStatistics
Is 15 The Right Number?
• The best cutoff (CO) changes
• Depends on competing costs:
  • Cost of computation (< operator, call overhead, etc.)
  • Cost of communication (swaps)
  • Cache behavior (misses, prefetcher, locality)
• Sorting 100000 doubles with std::stable_sort:
  • CO ≈ 200 optimal on a Phenom 905e (15% speedup)
  • CO ≈ 400 optimal on an Opteron 6168 (15% speedup)
  • CO ≈ 500 optimal on a Xeon E5320 (34% speedup)
  • CO ≈ 700 optimal on a Xeon X5460 (25% speedup)
• If the best cutoff has changed, perhaps the best algorithm has also changed
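The cutoff experiment above can be approximated with a small sketch. This is illustrative Python, not the libstdc++ code and not the measurement harness behind the numbers on the slide: `hybrid_merge_sort`, `time_cutoff`, and `best_cutoff` are hypothetical names, and the merge sort is a plain top-down variant that hands ranges shorter than the cutoff to insertion sort.

```python
import time


def insertion_sort(xs, lo, hi):
    """Sort xs[lo:hi] in place; cheap for short ranges."""
    for i in range(lo + 1, hi):
        item = xs[i]
        j = i
        while j > lo and item < xs[j - 1]:
            xs[j] = xs[j - 1]
            j -= 1
        xs[j] = item


def hybrid_merge_sort(xs, cutoff=15, lo=0, hi=None):
    """Merge sort that switches to insertion sort below `cutoff` elements."""
    if hi is None:
        hi = len(xs)
    # Ranges of length 0 or 1 always stop here, whatever the cutoff is.
    if hi - lo < max(cutoff, 2):
        insertion_sort(xs, lo, hi)
        return
    mid = (lo + hi) // 2
    hybrid_merge_sort(xs, cutoff, lo, mid)
    hybrid_merge_sort(xs, cutoff, mid, hi)
    left, right = xs[lo:mid], xs[mid:hi]
    i = j = 0
    k = lo
    while i < len(left) and j < len(right):
        # Take from the right half only when strictly smaller (keeps stability).
        if right[j] < left[i]:
            xs[k] = right[j]
            j += 1
        else:
            xs[k] = left[i]
            i += 1
        k += 1
    xs[k:hi] = left[i:] + right[j:]


def time_cutoff(cutoff, data, repeats=3):
    """Best wall-clock time over `repeats` runs for one cutoff value."""
    best = float("inf")
    for _ in range(repeats):
        work = list(data)
        start = time.perf_counter()
        hybrid_merge_sort(work, cutoff)
        best = min(best, time.perf_counter() - start)
    return best


def best_cutoff(candidates, data):
    """Return the candidate cutoff with the lowest measured time."""
    return min(candidates, key=lambda c: time_cutoff(c, data))
```

Calling `best_cutoff([15, 200, 400, 700], data)` on a list of floats gives whichever cutoff was fastest on the current machine. The result is expected to differ between machines and interpreters, which is the point the slide makes about a hard-coded constant.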
Algorithmic Choice
• The compiler’s hands are tied: it is stuck with 15
• Need a better way to represent algorithmic choices
• PetaBricks is the first language with support for algorithmic choice
Sort in PetaBricks Language
    function Sort
    to out[n]
    from in[n]
    {
      either {
        InsertionSort(out, in);
      } or {
        QuickSort(out, in);
      } or {
        MergeSort(out, in);
      } or {
        RadixSort(out, in);
      }
    }
Representation ⇒ Decision tree synthesized by our autotuner
Decision Trees
Optimized for a Xeon E7340 (8 cores):
[Decision tree figure: tests N < 600 and N < 1420; leaves Insertion Sort, Merge Sort, Quick Sort (2-way)]
Decision Trees
Optimized for Sun Fire T200 Niagara (8 cores):
[Decision tree figure: tests N < 1461, N < 75, N < 2400; leaves, left to right: Merge Sort (16-way), Merge Sort (8-way), Merge Sort (4-way), Merge Sort (2-way)]
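One way to read the Niagara tree is as a function from input size to merge fan-out. The sketch below assumes the root test is N < 1461, with N < 75 and N < 2400 beneath it and the leaves ordered left to right; that reading is inferred from the flattened figure. It is illustrative Python, not PetaBricks output: `choose_ways` and `tuned_sort` are hypothetical names, and re-entering the selector on every recursive call is an assumption about how the choice is applied.

```python
import heapq


def choose_ways(n):
    """Walk the assumed decision tree and return the merge fan-out for size n."""
    if n < 1461:
        return 16 if n < 75 else 8
    return 4 if n < 2400 else 2


def tuned_sort(xs):
    """k-way merge sort where k is re-selected from the tree at each level."""
    n = len(xs)
    if n <= 1:
        return list(xs)
    # Never split into more pieces than there are elements.
    k = min(choose_ways(n), n)
    step = -(-n // k)  # ceiling division, so each chunk is shorter than n
    chunks = [tuned_sort(xs[i:i + step]) for i in range(0, n, step)]
    return list(heapq.merge(*chunks))
```

Because the selector runs again on each sub-problem, a large input starts with 2-way merges and moves to wider merges as the pieces shrink. A single tree therefore describes a poly-algorithm instead of one fixed choice.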
Sort Algorithm Timings²
[Figure: time (s) versus input size (0 to 1750) for InsertionSort, QuickSort, MergeSort, RadixSort, and Autotuned]
² On an 8-way Xeon E7340 system
Iteration Order Choices
• Many other choices related to execution order
  • By rows?
  • By columns?
  • Diagonal? Reverse order? Blocked?
  • Parallel?
• Choices both within a single (possibly parallel) task and between different tasks
• This is the main motivation for a new language as opposed to a library
Synthesized Outer Control Flow
• PetaBricks programs have synthesized outer control flow
  • Declarative (data flow like) outer syntax
  • Imperative inner code
• Programs start as completely parallel
• Added dependencies restrict the space of legal executions
• May only access data explicitly depended on
Parallel loop:
    X.cell(i) from () { ... }
Sequential loop:
    X.cell(i) from (X.cell(i - 1) left) { ... }
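The two loop forms can be mimicked in ordinary code to show how a declared dependency narrows the legal schedules. This is a rough Python analogy, not how the PetaBricks compiler works: `run_parallel_rule` and `run_sequential_rule` are hypothetical helpers, and the `first` argument standing in for the value to the left of cell 0 is an assumption, since the slide does not show how the boundary is handled.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel_rule(n, body):
    """X.cell(i) from (): no dependencies, so cells may run in any order."""
    with ThreadPoolExecutor() as pool:
        # pool.map returns results in index order even if execution interleaves.
        return list(pool.map(body, range(n)))


def run_sequential_rule(n, body, first):
    """X.cell(i) from (X.cell(i - 1) left): each cell needs its left neighbor."""
    out = []
    left = first
    for i in range(n):
        left = body(i, left)  # cell i cannot start before cell i - 1 is done
        out.append(left)
    return out
```

For example, `run_parallel_rule(5, lambda i: i * i)` squares each index independently, while `run_sequential_rule(5, lambda i, left: left + i, 0)` computes a running sum that only a left-to-right order can produce.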
Matrix Multiply
    transform MatrixMultiply
    to AB[w, h]
    from A[c, h], B[w, c]
    {
      AB.cell(x, y) from (A.row(y) a, B.column(x) b) {
        return dot(a, b);
      }

      to (AB.region(x, y, x + 4, y + 4) out)
      from (A.region(0, y, c, y + 4) a,
            B.region(x, 0, x + 4, c) b) {
        // ... compute 4 x 4 block ...
      }
    }
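The two rules describe the same output in different granularities, which a plain Python sketch can make concrete. The names `matmul_by_cell` and `matmul_by_block` are hypothetical, matrices are assumed to be row-major lists of lists (so `A[y][k]` is row `y` of `A`), and the block version clips blocks at the matrix edge, which is a simplification; the slide does not say how partial blocks are covered.

```python
def matmul_by_cell(A, B):
    """Rule 1: each output cell is the dot product of a row of A and a column of B."""
    h, c, w = len(A), len(B), len(B[0])
    return [
        [sum(A[y][k] * B[k][x] for k in range(c)) for x in range(w)]
        for y in range(h)
    ]


def matmul_by_block(A, B, block=4):
    """Rule 2: produce the output one block x block region at a time."""
    h, c, w = len(A), len(B), len(B[0])
    AB = [[0] * w for _ in range(h)]
    for y0 in range(0, h, block):
        for x0 in range(0, w, block):
            # This region needs rows y0..y0+block of A and columns x0..x0+block of B.
            for y in range(y0, min(y0 + block, h)):
                for x in range(x0, min(x0 + block, w)):
                    AB[y][x] = sum(A[y][k] * B[k][x] for k in range(c))
    return AB
```

Both functions return the same matrix; only the traversal differs. Which one runs faster depends on sizes and cache behavior, which is the kind of decision the slides leave to the autotuner.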