
Online Tuning of Stream Programs or How To Get The Most Out Of Your Multicore



  1. Online Tuning of Stream Programs or How To Get The Most Out Of Your Multicore. Walter F. Tichy, Institute for Program Structures and Data Organization, KIT – University of Baden-Württemberg and National Research Center of the Helmholtz Association

  2. Where is Karlsruhe? University of Karlsruhe - KIT, Germany. Faculty of Computer Science: one of the leading CS departments in Europe, with >40 faculty and >400 PhD students in CS.

  3. The changing parallel computing landscape: Cray vector computer (1976); multicore transformation.

  4. The first five-core mobile phone: HTC One X, Feb. 2012, powered by Nvidia Tegra 3.

  5. Nvidia Tegra 3

  6. Nvidia Tegra 3 schematic
     - 1 core at 500 MHz (battery saver)
     - 4 cores at 1.5 GHz
     - 1 GPU

  7. Multicore chips, with transistor count and die area:
     - AMD Opteron, 12 cores: ~1.8 billion transistors on 2 × 3.46 cm²
     - Sun Niagara3, 16 cores: ~1 billion transistors on 3.7 cm²
     - Intel SCC, 48 cores: ~1.3 billion transistors on 5.6 cm²
     - Intel, 8 cores: ~2.3 billion transistors on 6.8 cm²
     - Intel, 4 cores: ~582 million transistors on 2.86 cm²
     - Intel Sandy Bridge, 4+6 cores: ~1 billion transistors on 2.2 cm²
     - Intel, 2 cores: ~167 million transistors on 1.1 cm²

  8. The 2011 Intel Sandy Bridge
     - Currently: 4 CPUs, 6 graphics execution units
     - Later: 8 CPUs, 12 graphics execution units

  9. Same multicore chip overview as slide 7 (AMD Opteron, Sun Niagara3, Intel SCC, Intel Sandy Bridge and further Intel chips). Victor Pankratius

  10. Fixing Parallel Performance Problems
      - Parallelization is complex and error-prone
      - Parallel programs contain a number of tuning parameters
      - Manual optimization is difficult and time-consuming
      - Each target platform may require re-tuning
      - Auto-tuning: let the computer do the tuning!
      - Examples of tuning parameters:
        - Number of pipeline stages
        - Choice of best algorithm implementation
        - Order of execution
        - Size of data partitions
        - Number of workers
        - Type of core
        - Load balancing strategy
      - [Figure: tuning parameters a, b, c with different value assignments (a=1, b=2, c=3 and a=4, b=5, c=6) for target platforms A and B]

  11. Online Auto-Tuning
      - Auto-tuning cycle:
        - Input: parallel program with tuning parameters
        - Optimize (calculate new parameter values); this yields a parameter configuration
        - Apply the configuration to the program; this yields the program executable
        - Execute and measure the program; the result of the measurement is a performance value, which feeds the next optimization step
      - Example (pseudo code):

            // tuning parameters
            TuningParameter numthreads(3, 64);
            TuningParameter blocksize(100, 900, 100);

            for (int i = 0; i < numfiles; ++i) {
                startMeasurement();   // measurement section begins
                compress(files[i], blocksize, numthreads);
                stopMeasurement();    // measurement section ends
            }
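
The cycle on this slide can be sketched in Python as a loop that applies a configuration, measures one run, and keeps a change only if it helped. This is a minimal sketch under stated assumptions, not the tuner used in the talk: the `TuningParameter` class, the random hill-climbing search and the `fake_compress` cost model are all hypothetical stand-ins. It also reads the slide's `numthreads(3, 64)` as a value range and `blocksize(100, 900, 100)` as a range with step 100, which the slide does not spell out. The starting values (3 threads, block size 700) are taken from the BZip2 slide.

```python
import random


class TuningParameter:
    """Hypothetical integer tuning parameter with a range and a step."""

    def __init__(self, name, lo, hi, step=1, initial=None):
        self.name, self.lo, self.hi, self.step = name, lo, hi, step
        self.value = lo if initial is None else initial

    def neighbours(self):
        """Values one step below and above the current one, kept inside the range."""
        return [v for v in (self.value - self.step, self.value + self.step)
                if self.lo <= v <= self.hi]


def fake_compress(blocksize, numthreads):
    """Stand-in cost model (assumption): simulated runtime in seconds."""
    return 1.0 + abs(numthreads - 8) * 0.5 + abs(blocksize - 500) / 100.0


def tune_online(params, measure, iterations, rng):
    """Each iteration changes one parameter and keeps it only if the run was faster."""
    best_time = measure(**{p.name: p.value for p in params})
    for _ in range(iterations - 1):
        p = rng.choice(params)
        candidates = p.neighbours()
        if not candidates:
            continue
        old = p.value
        p.value = rng.choice(candidates)                   # apply new configuration
        t = measure(**{q.name: q.value for q in params})   # execute and measure
        if t < best_time:
            best_time = t                                  # keep the improvement
        else:
            p.value = old                                  # revert to the previous value
    return best_time


numthreads = TuningParameter("numthreads", 3, 64, initial=3)
blocksize = TuningParameter("blocksize", 100, 900, step=100, initial=700)
best = tune_online([numthreads, blocksize], fake_compress,
                   iterations=300, rng=random.Random(0))
print(numthreads.value, blocksize.value, best)
```

In a real online setting each iteration would process a different input (one file per iteration on the slide), so measured times are noisy and a tuner would likely need to average or normalize them before comparing.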

  12. Auto-Tuning: BZip2 example
      - Parallelized BZip2, compressing 50 files on a machine with 8 cores
      - Initial tuning parameter values: 3 threads, block size 700 kB
      - Runtime without tuning: 22.9 s
      - Runtime with auto-tuner: 8 s
      - Best possible time (start with best configuration): 6.5 s
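
The three runtimes on this slide can be put in relation with a few lines of arithmetic. The variable names are illustrative, and reading the gap between the tuned and the best run as the cost of searching online is an interpretation, not something the slide states.

```python
untuned_s = 22.9  # runtime without tuning (from the slide)
tuned_s = 8.0     # runtime with the online auto-tuner
best_s = 6.5      # runtime when started with the best configuration

speedup = untuned_s / tuned_s        # how much faster the tuned run was
fraction_of_best = best_s / tuned_s  # tuned performance relative to the best run
gap_s = tuned_s - best_s             # presumably time lost while searching

print(f"speedup from tuning: {speedup:.2f}x")                   # 2.86x
print(f"fraction of best performance: {fraction_of_best:.0%}")  # 81%
print(f"gap to best configuration: {gap_s:.1f} s")              # 1.5 s
```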

  13. Stream Programming Paradigm
      - A stream of elements flows through a graph of processing modules called filters.
      - Task parallelism (figure: filters F1, F2, F3, F4, with F2 and F3 side by side)
      - Pipeline parallelism: F1, F2, F3, F4, F5 in sequence
      - Data parallelism (by filter replication): F1, Split, several copies of F2, Join, F3
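
Pipeline parallelism and data parallelism by replication can be sketched in Python with threads and queues. This is an illustrative sketch, not the stream language used in the talk: `start_filter`, `run_pipeline` and the `DONE` marker are hypothetical names. A shared input queue plays the role of the split (first come, first served) and a shared output queue the role of the join, so output order is not preserved when a filter is replicated.

```python
import queue
import threading

DONE = object()  # end-of-stream marker passed through every queue


def start_filter(func, inbox, outbox, replicas=1):
    """Start `replicas` threads that apply `func` to elements taken from inbox."""
    state = {"running": replicas}
    lock = threading.Lock()

    def worker():
        while True:
            item = inbox.get()
            if item is DONE:
                inbox.put(DONE)           # pass the marker on to sibling replicas
                with lock:
                    state["running"] -= 1
                    last = state["running"] == 0
                if last:
                    outbox.put(DONE)      # close the output once all replicas are done
                return
            outbox.put(func(item))

    threads = [threading.Thread(target=worker) for _ in range(replicas)]
    for t in threads:
        t.start()
    return threads


def run_pipeline(items, stages):
    """Run items through `stages`, a list of (func, replicas) pairs."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = []
    for (func, replicas), inbox, outbox in zip(stages, queues, queues[1:]):
        threads += start_filter(func, inbox, outbox, replicas)
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)
    results = []
    while True:
        item = queues[-1].get()
        if item is DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


stages = [(lambda x: x + 1, 1),  # F1
          (lambda x: x * x, 3),  # F2, replicated three times
          (str, 1)]              # F3
print(sorted(run_pipeline(range(10), stages), key=int))
```

In standard CPython, threads show the structure of the paradigm rather than a multicore speedup for CPU-bound filters. The replica count passed per stage corresponds to the replication factor that an auto-tuner could adjust.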

  14. (Some) Implicit Tuning Parameters
      - Replication factor (figure: a filter F replicated n times between split S and join J)
      - Cut-off depth
      - Alternative algorithms/cores (figure: choice among AL1, AL2, ..., ALn)

  15. Measurement Sections in Stream Programs
      - "Classic" fork/join pattern: each measurement section covers a sequential part followed by the parallel parts (parallel 1, parallel 2, parallel 3), and the sections repeat one after another.
      - Stream program: measurement section(s)? (Figure: executions of Filter 1 to Filter 4 interleave after the sequential part, with no obvious section boundaries.)
      - Solution:
        - Count "heart beats" (events triggered by stream elements)
        - Use heart beats to evaluate performance

  16. Using Heartbeats for Online Tuning
      - Heartbeats are emitted by sink filters
      - The faster the heartbeat, the better the performance
      - Heartbeats serve as an input signal for online auto-tuners
      - [Figure, illustrating example: executions of Filter 1 to Filter 4 over time; the auto-tuner applies a new parameter configuration three times, with the values 50, 70 and 80 shown for the successive phases]
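
The heartbeat idea can be sketched in Python as a counter that the sink filter increments and that a tuner reads once per interval. Everything named here is an assumption made for illustration: `HeartbeatCounter`, `tune_by_heartbeat` and the simulated stream are hypothetical, the search is a simple hill climb over one integer parameter (think of it as a replication factor), and the simulated heartbeat rates are invented numbers, not measurements from the talk.

```python
import threading


class HeartbeatCounter:
    """Counts heartbeats emitted by sink filters (thread-safe)."""

    def __init__(self):
        self._count = 0
        self._lock = threading.Lock()

    def beat(self):
        with self._lock:
            self._count += 1

    def take(self):
        """Return the number of beats since the last call and reset the counter."""
        with self._lock:
            n, self._count = self._count, 0
            return n


def make_simulated_stream(counter, cores=4):
    """Stand-in for a running stream program (assumption): the sink beats more
    often as replication grows, until the cores are used up."""
    def run_interval(replication):
        beats = 20 * min(replication, cores) - 2 * max(0, replication - cores)
        for _ in range(beats):
            counter.beat()
    return run_interval


def tune_by_heartbeat(counter, run_interval, start, lo, hi, max_rounds=20):
    """Hill climbing on one integer parameter, scored by beats per interval."""
    config = start
    run_interval(config)
    best_rate = counter.take()
    for _ in range(max_rounds):
        improved = False
        for candidate in (config - 1, config + 1):
            if lo <= candidate <= hi:
                run_interval(candidate)   # process the stream for one interval
                rate = counter.take()     # heartbeats seen in that interval
                if rate > best_rate:
                    config, best_rate, improved = candidate, rate, True
                    break
        if not improved:
            break                         # neither neighbour beats faster
    return config, best_rate


counter = HeartbeatCounter()
print(tune_by_heartbeat(counter, make_simulated_stream(counter), start=1, lo=1, hi=16))
# prints (4, 80)
```

Because the score is a rate of events rather than the duration of a delimited section, the tuner does not need the program to have fork/join style section boundaries. Real heartbeat rates fluctuate, so a practical tuner would likely smooth them over several intervals.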

  17. Benchmark 1: Video zoom
      - Filters: Read, Cut, split (S), Scale I, Scale II, join (J), Write
      - Figure legend: * = replicable, ? = first come/first serve
      - [Bar chart: execution time (axis 0 to 1200) for pre, tun and best on Quadcore, Dell and Niagara]
        - pre: statically predicted
        - tun: on-line auto-tuned
        - best: started with best known configuration, without auto-tuning

  18. Benchmark 2: Electric (placement of circuits on a die)
      - Part of VLSI design application
      - 5 filters with feedback loop and teleports
      - 4 tuning parameters
      - Filters: Producer, Calculate forces, Movement, Repair overlaps, Finish (* in the figure marks replicable filters)

  19. Electric: Results. [Bar chart: execution time (axis 0 to 1400) for pre, tun and best on Quadcore, Dell and Niagara]

  20. Benchmarks on 4 cores. [Bar chart: fractions of best parallel performance (= 100%) for seq, pre and tun on DS, Electric, Series, Vscale and Vzoom]

  21. Benchmarks on 64 cores (Niagara). [Bar chart: seq, pre and tun on DS, Electric, Series, Vscale and Vzoom, axis 0% to 120%]

  22. Related Work (Selection)
      - ATLAS/AEOS (Whaley et al., 2000)
        - Auto-tuning system for algebraic operations and algorithms
        - Domain-specific approach
        - No support for parallel programs
      - Active Harmony (Tapus et al., 2002)
        - Search-based auto-tuning system for library optimization
        - Comprehensive analysis of search algorithms
        - Not applicable to parallel programs
      - MATE (Morajko et al., 2007)
        - Model-based tuning system for distributed PVM programs
        - Provides good performance predictions
        - Limited to special program structures
      - ATUNE (Schaefer, Tichy, 2010)
        - General-purpose auto-tuner
        - Offline tuner (trial runs)
        - Pattern language for expressing parallel patterns (TADL)

  23. Benchmark 3: Desktop search
      - Filters: Read directory, Read file, Write to index (* in the figure marks replicable filters)
      - [Bar chart: execution time (axis 0 to 120) for pre, tun and best on Quadcore, Dell and Niagara]

  24. Summary
      - Computers are not the bottleneck. Programmers are!
      - Stream programming simplifies parallel programming
        - Typical parallel patterns are easy to write
      - Auto-tuning finds optimal operating conditions
        - Saves lots of tuning work
      - Further research
        - Improved online search algorithms
        - Use static model to predict good starting values
        - Use auto-tuning to distribute work over heterogeneous cores

  25. THANK YOU! QUESTIONS? With many thanks to Frank Otto, Thomas Karcher, Jonas Thedering, Victor Pankratius. For more information, see: http://www.ipd.kit.edu/Tichy/

  26. BACKUP SLIDES

  27. Benchmarks on 8 cores. [Bar chart: seq, pre and tun on DS, Electric, Series, Vscale and Vzoom, axis 0% to 120%]
