Online Tuning of Stream Programs, or How To Get The Most Out Of Your Multicore
Walter F. Tichy
Institute for Program Structures and Data Organization
KIT – University of Baden-Württemberg and National Research Center of the Helmholtz Association
Where is Karlsruhe?
University of Karlsruhe - KIT, Germany
Faculty of Computer Science
One of the leading CS departments in Europe
>40 faculty, >400 PhD students in CS
The changing parallel computing landscape
Cray vector computer, 1976
Multicore transformation
The first five-core mobile phone
HTC One X, Feb. 2012, powered by Nvidia Tegra 3
Nvidia Tegra 3
Nvidia Tegra 3 schematic
• 1 core at 500 MHz (battery saver)
• 4 cores at 1.5 GHz
• 1 GPU
• AMD Opteron, 12 cores: ~1.8 billion transistors on 2 × 3.46 cm²
• Sun Niagara 3, 16 cores: ~1 billion transistors on 3.7 cm²
• Intel SCC, 48 cores: ~1.3 billion transistors on 5.6 cm²
• Intel, 8 cores: ~2.3 billion transistors on 6.8 cm²
• Intel, 4 cores: ~582 million transistors on 2.86 cm²
• Intel Sandy Bridge, 4+6 cores: ~1 billion transistors on 2.2 cm²
• Intel, 2 cores: ~167 million transistors on 1.1 cm²
The 2011 Intel Sandy Bridge
• Currently: 4 CPUs, 6 graphics execution units
• Later: 8 CPUs, 12 graphics execution units
Fixing Parallel Performance Problems
• Parallelization is complex and error-prone
• Parallel programs contain a number of tuning parameters
• Manual optimization is difficult and time-consuming
• Each target platform may require re-tuning
• Auto-tuning: let the computer do the tuning!
Examples of tuning parameters:
• Number of pipeline stages
• Choice of best algorithm implementation
• Order of execution
• Size of data partitions
• Number of workers
• Type of core
• Load balancing strategy
[Figure: tuning parameters a, b, c with differing values (1, 2, 3 and 4, 5, 6) on two platforms, A and B]
Online Auto-Tuning
• Auto-tuning cycle:
  1. Optimize: calculate new parameter values, which yields a parameter configuration
  2. Apply the configuration to the parallel program with tuning parameters, which yields an executable program
  3. Execute and measure the program; the result of the measurement is a performance value that goes back to the optimizer
• Example (pseudo code):
    TuningParameter numthreads(3, 64);          // tuning parameter
    TuningParameter blocksize(100, 900, 100);   // tuning parameter
    for (int i = 0; i < numfiles; ++i) {
        startMeasurement();                     // measurement section begins
        compress(files[i], blocksize, numthreads);
        stopMeasurement();                      // measurement section ends
    }
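The cycle above can be sketched in Python as follows. This is a minimal illustration, not the tuner used in the talk: the `TuningParameter(name, lo, hi, step)` signature, the `Measurement` helper, and `tune_online` are assumptions, and plain enumeration of all configurations stands in for whatever search algorithm a real online tuner would use.

```python
import itertools
import time
from contextlib import contextmanager


class TuningParameter:
    """Integer parameter with an inclusive range and a step (hypothetical API)."""

    def __init__(self, name, lo, hi, step=1):
        self.name = name
        self.values = list(range(lo, hi + 1, step))
        if not self.values:
            raise ValueError(f"empty range for {name}")


class Measurement:
    """Plays the role of startMeasurement()/stopMeasurement()."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.elapsed = None

    @contextmanager
    def section(self):
        start = self.clock()
        try:
            yield
        finally:
            self.elapsed = self.clock() - start


def tune_online(params, run_once, work_items, clock=time.perf_counter):
    """Process every work item, trying a new configuration on each one.

    Once all candidates have been tried, the best one seen so far is reused.
    Returns (best configuration, its measured time).
    """
    names = [p.name for p in params]
    candidates = itertools.product(*(p.values for p in params))
    timer = Measurement(clock)
    best_cfg, best_time = None, float("inf")
    for item in work_items:
        values = next(candidates, None)
        # Optimize: next untested configuration, or the best known one.
        cfg = dict(zip(names, values)) if values is not None else best_cfg
        with timer.section():          # execute and measure
            run_once(item, **cfg)      # apply the configuration
        if timer.elapsed < best_time:  # feed the result back
            best_cfg, best_time = cfg, timer.elapsed
    return best_cfg, best_time
```

With a hypothetical `compress(file, numthreads, blocksize)` function, the slide's example would correspond to `tune_online([TuningParameter("numthreads", 3, 64), TuningParameter("blocksize", 100, 900, 100)], compress, files)`. Because tuning happens during the production run, every file is still compressed exactly once.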
Auto-Tuning: BZip2 example
Parallelized BZip2, compressing 50 files on a machine with 8 cores
• Initial tuning parameter values: 3 threads, block size 700 kB
• Runtime without tuning: 22.9 s
• Runtime with auto-tuner: 8 s
• Best possible time (start with best configuration): 6.5 s
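To put the three runtimes in relation, the short calculation below derives speedups from the figures on the slide. The variable names are illustrative, and reading the gap between the tuned and the best run as search cost is an interpretation, not something the slide states.

```python
untuned_s = 22.9  # runtime without tuning (from the slide)
tuned_s = 8.0     # runtime with the auto-tuner (from the slide)
best_s = 6.5      # runtime when started with the best configuration (from the slide)

speedup_tuned = untuned_s / tuned_s  # gain from online tuning
speedup_best = untuned_s / best_s    # gain with the best fixed configuration
gap_s = tuned_s - best_s             # plausibly time spent in poorer configurations while searching
fraction_of_best = best_s / tuned_s  # tuned performance relative to the best run

print(f"{speedup_tuned:.2f}x, {speedup_best:.2f}x, {gap_s:.1f} s, {fraction_of_best:.0%}")
# prints "2.86x, 3.52x, 1.5 s, 81%"
```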
Stream Programming Paradigm
• A stream of elements flows through a graph of processing modules called filters.
• Task parallelism: F1 feeds F2 and F3, which run side by side and feed F4
• Pipeline parallelism: F1, F2, F3, F4, F5 in sequence
• Data parallelism (by filter replication): F1, then Split, several copies of F2, Join, then F3
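A minimal Python sketch of two of these graph shapes follows. The names `lift`, `replicate`, and `connect` are hypothetical. The sketch shows structure rather than performance: the chained filters are evaluated lazily in one thread, and only the replicated stage runs concurrently, with `Executor.map` standing in for split and join.

```python
from concurrent.futures import ThreadPoolExecutor


def lift(f):
    """Turn an element-wise function into a filter (stream -> stream)."""
    return lambda stream: map(f, stream)


def replicate(f, copies):
    """Data parallelism: `copies` replicas of a stateless filter behind a split/join.

    Executor.map distributes the elements (split) and returns the results in
    input order (join). It reads the whole input stream up front.
    """
    def stage(stream):
        with ThreadPoolExecutor(max_workers=copies) as pool:
            yield from pool.map(f, stream)
    return stage


def connect(source, *filters):
    """Pipeline: feed the output stream of each filter into the next one."""
    stream = iter(source)
    for flt in filters:
        stream = flt(stream)
    return stream


# F1 -> Split -> 3 x F2 -> Join -> F3
result = list(connect(range(5),
                      lift(lambda x: x + 1),
                      replicate(lambda x: x * x, copies=3),
                      lift(str)))
print(result)  # ['1', '4', '9', '16', '25']
```

The `copies` argument is exactly the kind of value a tuner would want to adjust per platform.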
(Some) Implicit Tuning Parameters
• Replication factor: number of copies F1, F2, ..., Fn of a filter F between split (S) and join (J)
• Cut-off depth
• Alternative algorithms/cores: choice among AL1, AL2, ..., ALn
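Even these three parameters span a search space that grows multiplicatively, which is one reason to automate the search. The sketch below enumerates such a space; the value ranges and the dictionary keys are made up for illustration, and the cut-off depth is treated as an opaque integer because the slide does not define it further.

```python
from itertools import product


def configurations(replication_factors, cutoff_depths, algorithms):
    """Enumerate every combination of the three implicit tuning parameters."""
    for r, d, a in product(replication_factors, cutoff_depths, algorithms):
        yield {"replication": r, "cutoff_depth": d, "algorithm": a}


# Hypothetical ranges: 8 replication factors, 4 cut-off depths, 3 algorithms.
space = list(configurations(range(1, 9), range(0, 4), ["AL1", "AL2", "AL3"]))
print(len(space))  # 8 * 4 * 3 = 96
```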
Measurement Sections in Stream Programs
• "Classic" fork/join pattern: each measurement section covers one sequential part followed by its parallel parts (parallel 1, parallel 2, parallel 3), and the sections repeat one after another
• Stream program: measurement section(s)? Filters 1 to 4 execute interleaved and concurrently, so there are no obvious section boundaries
• Solution:
  - Count "heartbeats" (events triggered by stream elements)
  - Use heartbeats to evaluate performance
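Counting heartbeats can be sketched as a small thread-safe counter that reports a rate per observation window. The class and method names are hypothetical, and the injectable `clock` is an assumption made so the behavior can be checked deterministically.

```python
import threading
import time


class HeartbeatMonitor:
    """Counts heartbeats and reports the rate since the previous reading."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._lock = threading.Lock()
        self._count = 0
        self._window_start = clock()

    def beat(self):
        """Record one heartbeat, e.g. one stream element leaving the graph."""
        with self._lock:
            self._count += 1

    def read_rate(self):
        """Return heartbeats per second for the current window and start a new one."""
        with self._lock:
            now = self._clock()
            elapsed = now - self._window_start
            rate = self._count / elapsed if elapsed > 0 else 0.0
            self._count = 0
            self._window_start = now
            return rate
```

A tuner would call `read_rate()` periodically and treat a higher value as better performance, without needing any start or stop markers in the program.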
Using Heartbeats for Online Tuning
• Heartbeats are emitted by sink filters
• The faster the heartbeat, the better the performance
• Heartbeats serve as an input signal for online auto-tuners
Illustrating example:
[Figure: timeline of interleaved executions of Filters 1 to 4; heartbeat counts of 50, 70 and 80 in successive intervals go to the auto-tuner, which issues a new parameter configuration after each interval]
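One simple way to turn heartbeat counts into new configurations is a one-parameter hill climber, sketched below. This is an illustrative stand-in, not the search algorithm from the talk. The class name and the simulated workload are assumptions: the counts 50, 70 and 80 come from the slide, while 85, 75 and 60 are invented to give the curve a peak.

```python
class HillClimber:
    """One-parameter online tuner driven by heartbeat counts per interval.

    Keeps moving the parameter in the current direction while the count does
    not drop, and reverses direction when it does.
    """

    def __init__(self, value, lo, hi, step=1):
        self.value, self.lo, self.hi, self.step = value, lo, hi, step
        self.direction = 1
        self.last_beats = None

    def update(self, beats):
        """Take the heartbeat count of the last interval; return the next value."""
        if self.last_beats is not None and beats < self.last_beats:
            self.direction = -self.direction  # performance dropped: turn around
        self.last_beats = beats
        moved = self.value + self.direction * self.step
        self.value = min(self.hi, max(self.lo, moved))
        return self.value


# Simulated workload: heartbeats per interval as a function of a
# replication factor (values after 80 are hypothetical).
beats_for = {1: 50, 2: 70, 3: 80, 4: 85, 5: 75, 6: 60}
tuner = HillClimber(value=1, lo=1, hi=6)
trace = []
for _ in range(8):
    trace.append(tuner.value)            # configuration used in this interval
    tuner.update(beats_for[tuner.value])
print(trace)  # [1, 2, 3, 4, 5, 4, 3, 4]
```

The trace climbs to the peak at 4 and then oscillates around it, which is typical for this kind of search and keeps the tuner responsive if the workload changes.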
Benchmark 1: Video zoom
Filters: Read, Cut, split (S) into Scale I and Scale II, join (J), Write (filters marked * in the figure are replicable); first come/first serve
[Chart: execution time (axis 0 to 1200) for pre, tun and best on Quadcore, Dell and Niagara]
• pre: statically predicted
• tun: online auto-tuned
• best: started with best known configuration, without auto-tuning
Benchmark 2: Electric (placement of circuits on a die)
• Part of a VLSI design application
• 5 filters with feedback loop and teleports
• 4 tuning parameters
Filters: Producer, Calculate forces, Movement, Repair overlaps, Finish (filters marked * in the figure are replicable)
Electric: Results
[Chart: execution time (axis 0 to 1400) for pre, tun and best on Quadcore, Dell and Niagara]
Benchmarks on 4 cores
[Chart: fractions of best parallel performance (= 100%) for seq, pre and tun on the benchmarks DS, Electric, Series, Vscale and Vzoom]
Benchmarks on 64 cores (Niagara)
[Chart: seq, pre and tun in percent (axis 0% to 120%) on the benchmarks DS, Electric, Series, Vscale and Vzoom]
Related Work (Selection)
• ATLAS/AEOS (Whaley et al., 2000)
  - Auto-tuning system for algebraic operations and algorithms
  - Domain-specific approach
  - No support for parallel programs
• Active Harmony (Tapus et al., 2002)
  - Search-based auto-tuning system for library optimization
  - Comprehensive analysis of search algorithms
  - Not applicable to parallel programs
• MATE (Morajko et al., 2007)
  - Model-based tuning system for distributed PVM programs
  - Provides good performance predictions
  - Limited to special program structures
• ATUNE (Schaefer, Tichy, 2010)
  - General-purpose auto-tuner
  - Offline tuner (trial runs)
  - Pattern language for expressing parallel patterns (TADL)
Benchmark 3: Desktop search
Filters: Read directory, Read file, Write to index (filters marked * in the figure are replicable)
[Chart: execution time (axis 0 to 120) for pre, tun and best on Quadcore, Dell and Niagara]
Summary
• Computers are not the bottleneck.
• Programmers are!
• Stream programming simplifies parallel programming
  - Typical parallel patterns are easy to write
• Auto-tuning finds optimal operating conditions
  - Saves lots of tuning work
• Further research
  - Improved online search algorithms
  - Use static model to predict good starting values
  - Use auto-tuning to distribute work over heterogeneous cores
THANK YOU! QUESTIONS?
With many thanks to Frank Otto, Thomas Karcher, Jonas Thedering, Victor Pankratius
For more information, see: http://www.ipd.kit.edu/Tichy/
BACKUP SLIDES
Benchmarks on 8 cores
[Chart: seq, pre and tun in percent (axis 0% to 120%) on the benchmarks DS, Electric, Series, Vscale and Vzoom]