Portable, Scalable, per-Core Power Estimation
Sally A. McKee
Chalmers University of Technology

Slide 2: Why Care about Power?
- Packaging/cooling
- Operating costs
- Performance
- Reliability
- Battery lifetime
- Device lifetime
- Ergonomics
Slide 3: What Can We Do with Power Info?
- Optimize thread allocation
- Manage workloads for
  - Power constraints
  - Temperature constraints
  - Data locality
- Budget power per core, process, or thread
- Adapt frequencies for performance requirements
- Resize/turn off structures

Slide 4: More Observations
- Energy efficiency essential at all scales
- Component power consumption difficult to measure
  - Processors share same power plane
  - External meters give total node power
  - Meter per node impossible for large-scale systems
  - Embedded measurement devices financially infeasible
  - Even invasive hardware still suffers inaccuracy
- But dynamic power estimation possible using performance monitoring counters (PMCs)
Slide 5: Approach
- Analytic models based on PMCs
  - Gather performance data from microbenchmarks
  - Collect power measurements
  - Categorize counters
  - Choose counters most strongly correlated with power
- Advantages
  - Easy
  - Portable
  - Dynamic
  - Application-independent

Slide 6: Approach (cont.)
- Microbenchmarks stress PMCs
- Four categories sufficient:
  - FP ops
  - Memory
  - Stalls
  - Instructions retired
- Future applications also described by model
[Figure: Blowup of Core, AMD Phenom 9500 core (source: www.amd.com); labeled blocks: 128-bit FPU, 512kB L2 Cache, Load/Store, L1 Data Cache, Execution, Fetch/Decode/Branch, L1 Instr Cache]
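The counter-selection step of the approach ("choose counters most strongly correlated with power") can be sketched roughly as follows. This is a minimal illustration, not the method from the slides: the slides do not say which correlation measure was used, so Pearson correlation is an assumption here, and the function name `rank_counters_by_correlation`, the counter names, and the sample values are all hypothetical.

```python
import numpy as np

def rank_counters_by_correlation(counter_rates, power, top_k=4):
    """Rank counters by the strength of their correlation with measured power.

    counter_rates: dict mapping a counter name to a sequence of per-sample rates.
    power: sequence of power readings taken over the same samples.
    Returns the top_k (name, r) pairs ordered by |r|, where r is the Pearson
    coefficient (an assumption; the slides do not name the measure).
    """
    scores = {}
    for name, rates in counter_rates.items():
        scores[name] = float(np.corrcoef(rates, power)[0, 1])
    ranked = sorted(scores, key=lambda n: abs(scores[n]), reverse=True)
    return [(name, scores[name]) for name in ranked[:top_k]]

# Hypothetical microbenchmark samples: five power readings in watts.
power = [10.0, 12.0, 14.0, 16.0, 18.0]
counter_rates = {
    "uops": [1.0, 2.0, 3.0, 4.0, 5.0],       # rises with power
    "stalls": [5.0, 4.0, 3.0, 2.0, 1.0],     # falls as power rises
    "unrelated": [1.0, 3.0, 2.0, 3.0, 1.0],  # no linear relation to power
}
selected = rank_counters_by_correlation(counter_rates, power, top_k=2)
```

Ranking by absolute value keeps counters that correlate negatively with power, since a strong negative relationship is as informative to a model as a positive one.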
Slide 7: Initial Setup
- Measurement: pfmon, Watts Up Pro meter
- Benchmarks: SPEC 2006, SPEC OMP, NAS (gcc 4.2 -O3 [-OpenMP])

Slide 8: Forming the Model
- Counters with highest correlation become model inputs
- Counters e_i normalized to cycle count to give r_i
- Piece-wise linear model for per-core power
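The model-forming steps above (normalize each counter e_i by the cycle count to get r_i, then feed the rates into a piece-wise linear function) might look like the sketch below. The slides do not give the model equation or its coefficients, so the two-segment form, the breakpoints, the slopes, and the base power are all assumed values chosen only to make the example concrete; the function names are hypothetical.

```python
def normalize(events, cycles):
    """Convert raw event counts e_i into per-cycle rates r_i = e_i / cycles."""
    return {name: count / cycles for name, count in events.items()}

def piecewise_linear(rate, breakpoint, low_slope, high_slope):
    """Continuous two-segment linear function: one slope below the
    breakpoint, another above it (assumed form, for illustration only)."""
    if rate < breakpoint:
        return low_slope * rate
    return low_slope * breakpoint + high_slope * (rate - breakpoint)

def estimate_core_power(events, cycles, params, base_watts):
    """Sum a piece-wise linear contribution per counter on top of a base power.

    params maps a counter name to (breakpoint, low_slope, high_slope).
    """
    rates = normalize(events, cycles)
    return base_watts + sum(
        piecewise_linear(rates[name], *params[name]) for name in params
    )

# Hypothetical counts for one core over one sampling interval.
events = {"L2_CACHE_MISS": 1_000, "RETIRED_UOPS": 2_000_000}
cycles = 1_000_000
params = {
    "L2_CACHE_MISS": (0.01, 100.0, 20.0),  # made-up coefficients
    "RETIRED_UOPS": (1.0, 2.0, 1.0),       # made-up coefficients
}
watts = estimate_core_power(events, cycles, params, base_watts=10.0)
```

In practice the coefficients would be fitted against the collected power measurements rather than written by hand.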
Slide 9: Forming the Model: AMD Phenom
- Function behavior differs for very low values of L2 counter
- All except FP correlate positively with power
- Including temperature increases accuracy
- Model inputs:
  - e_1: L2_CACHE_MISS
  - e_2: RETIRED_UOPS
  - e_3: RETIRED_MMX_AND_FP_INSTRUCTIONS
  - e_4: DISPATCH_STALLS

Slide 10: Model Validation
- Comparison of estimated and measured power
  - At wall socket
  - At ATX rails
  - On motherboard
- Three benchmark suites (45 benchmarks)
  - Single- and multi-threaded
  - Floating point and integer
- Six platforms (2-8 cores from Intel/AMD)
Slide 11: Maximum Estimation Errors

Quad Core
Benchmark   AMD Phenom 9500   Intel Q6600   Intel Core i7
SPEC2006    3.51 %            1.05 %        1.61 %
NAS         4.52 %            1.59 %        3.11 %
SPEC OMP    5.16 %            1.59 %        4.14 %

Dual Core / 8-Core
Benchmark   Intel Core Duo    Intel E5430   AMD Opteron 8212
SPEC2006    4.01 %            2.76 %        4.80 %
NAS         3.73 %            3.90 %        2.55 %
SPEC OMP    4.36 %            3.53 %        3.35 %

Median Errors

Slide 12: Median Estimation Errors

Quad Core
Benchmark   AMD Phenom 9500   Intel Q6600   Intel Core i7
SPEC2006    3.51 %            1.61 %        1.05 %
NAS         4.52 %            1.59 %        3.11 %
SPEC OMP    1.59 %            4.14 %        5.16 %

Dual Core / 8-Core
Benchmark   Intel Core Duo    Intel E5430   AMD Opteron 8212
SPEC2006    4.01 %            2.76 %        4.80 %
NAS         3.73 %            3.90 %        2.55 %
SPEC OMP    4.36 %            3.53 %        3.35 %

Median Errors
Slide 13: Estimation Results: Intel Q6600
[Figure: panels for NAS, SPEC OMP, SPEC 2006]

Slide 14: Estimation Results: Intel Q6600
[Figure: panels for NAS, SPEC OMP, SPEC 2006]
Slide 15: Estimation Results: Intel Q6600
- Best: 0.2% (lbm)
- Worst: 8.4% (cg)
- 98% of estimations < 10% error
- 85% of estimations < 5% error
- Overall: SPEC 2006 2.4%, NAS 3.5%, SPEC-OMP 2.0%

Slide 16: Estimation Results: Intel E5430
[Figure: panels for NAS, SPEC OMP, SPEC 2006]
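Summary figures of the kind reported on the results slides (best and worst error, share of estimations under 5% or 10% error) could be computed along the lines below. The slides do not define the error formula, so absolute error relative to measured power is an assumption, and the function names and sample values are hypothetical.

```python
from statistics import median

def percent_errors(estimated, measured):
    """Absolute estimation error as a percentage of the measured power
    (assumed definition; the slides do not state the formula)."""
    return [abs(e - m) / m * 100.0 for e, m in zip(estimated, measured)]

def summarize(errors):
    """Collapse per-sample percentage errors into summary statistics."""
    n = len(errors)
    return {
        "best": min(errors),
        "worst": max(errors),
        "median": median(errors),
        "pct_under_5": 100.0 * sum(e < 5.0 for e in errors) / n,
        "pct_under_10": 100.0 * sum(e < 10.0 for e in errors) / n,
    }

# Hypothetical readings in watts, not data from the slides.
measured = [100.0, 100.0, 100.0, 100.0]
estimated = [101.0, 96.0, 108.0, 88.0]
summary = summarize(percent_errors(estimated, measured))
```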
Slide 17: Estimation Results: Intel E5430
[Figure: panels for NAS, SPEC OMP, SPEC 2006]

Slide 18: Estimation Results: Intel E5430 8-Core
- Best: 0.3% (ua)
- Worst: 7.0% (hmmer)
- 98% of estimations < 10% error
- 85% of estimations < 5% error
- Overall: SPEC 2006 3.5%, NAS 3.9%, SPEC-OMP 2.8%
Slide 19: Standard Deviation of Error: E5430
[Figure: bar chart of % SD of error (y-axis 0 to 10) per NAS benchmark: bt, cg, ep, ft, lu, lu-hp, mg, sp, ua]

Slide 20: Standard Deviation of Error: E5430
[Figure: bar chart of % SD of error (y-axis 0 to 10) per SPEC OMP benchmark; benchmark labels not recoverable]
Slide 21: Estimation Results: AMD Phenom 9500
[Figure: panels for NAS, SPEC OMP, SPEC 2006]

Slide 22: Estimation Results: AMD Phenom 9500
[Figure: panels for NAS, SPEC OMP, SPEC 2006]
Slide 23: Estimation Results: AMD Phenom 9500
- Best: 0.9% (libquantum)
- Worst: 9.3% (xalancbmk)
- 92% of estimations < 10% error
- 73% of estimations < 5% error
- Overall: SPEC 2006 4.5%, NAS 3.5%, SPEC-OMP 5.2%

Slide 24: Estimation Results: AMD Opteron 8212
[Figure: panels for NAS, SPEC OMP, SPEC 2006]
Slide 25: Estimation Results: AMD Opteron 8212
[Figure: panels for NAS, SPEC OMP, SPEC 2006]

Slide 26: Estimation Results: AMD Opteron 8212
- Best: 1.0% (cactusADM)
- Worst: 10.6% (leslie3d)
- 92% of estimations < 10% error
- 73% of estimations < 5% error
- Overall: SPEC 2006 4.5%, NAS 3.5%, SPEC-OMP 5.2%
Slide 27: Estimation Results: Intel Core i7
[Figure: panels for NAS, SPEC OMP, SPEC 2006]

Slide 28: Factors Affecting Model Accuracy
- Availability of representative PMCs
- PMCs available for simultaneous sampling
- Sampling rate of power measurement
- Accuracy of thermal sensors
These look pretty good, but what are we missing?
Could we do better w/ a different meter?
Slide 29: Power Measurement Infrastructures
- Wall outlet (Watts Up Pro)
  - Least intrusive
  - Low sampling rate
- PSU output on the ATX power rails
  - Moderately intrusive
  - Requires custom hardware
- Processor socket
  - Most intrusive
  - Requires soldering on motherboard

Slide 30: Comparative Power Measurement Setup
- Power measured at three points simultaneously
- Test machine used to collect samples different from target Core i7
- Custom sense hardware placed inside target machine cabinet
Slide 31: PSU Output Measurement
[Figure]

Slide 32: Measurement at PSU Output
[Figure]
Slide 33: Measurement at Processor Socket
- V_CPU = core voltage
- IMON = voltage proportional to regulator current output

Slide 34: Estimation Results (PSU Output)
[Figure: panels for NAS, SPEC OMP, SPEC 2006]
Slide 35: Estimation Results (Socket)
[Figure: panels for NAS, SPEC OMP, SPEC 2006]

Slide 36: Comparative Results: SPEC OMP/Core i7
[Figure: panels for Wall Socket, PSU (ATX Rails), Processor Socket (Motherboard)]
Slide 37: Power Measurement Experiments
- Sampling frequency (samples per second)
  - At wall outlet: 1
  - At ATX power rails and on MB: 50000
- Measurements averaged over 50 samples
- Test workload: 32x32 matmul in infinite loop
- Theoretical measurement sensitivity
  - Current measurement at ATX rails: 2 mA
  - CPU voltage measurement on motherboard: 47.2 uV
  - CPU current measurement on motherboard: 7 mA

Slide 38: Power Measurement Results
[Figure; annotations: idle power, activating 1-4 cores]
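The high-rate sampling described above (50000 samples per second, averaged over 50 samples) can be illustrated with a short sketch. Whether the original setup averaged voltage and current separately or averaged their product, and whether the windows overlapped, is not stated in the slides; the version below assumes non-overlapping windows over instantaneous power P = V x I. The function name and the sample values are hypothetical.

```python
def averaged_power(voltages, currents, window=50):
    """Turn paired voltage (V) and current (A) samples into averaged power (W).

    Instantaneous power is the product of each voltage/current pair; the
    result holds one mean per complete, non-overlapping window of samples.
    Trailing samples that do not fill a window are dropped.
    """
    powers = [v * i for v, i in zip(voltages, currents)]
    return [
        sum(powers[start:start + window]) / window
        for start in range(0, len(powers) - window + 1, window)
    ]

# Hypothetical capture: 100 samples at a steady 1.2 V, with the current
# stepping from 10 A to 20 A halfway through.
voltages = [1.2] * 100
currents = [10.0] * 50 + [20.0] * 50
readings = averaged_power(voltages, currents)
```

With a 50-sample window, a 50000 samples-per-second stream yields 1000 averaged readings per second.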
Slide 39: CPU versus Memory-Bound Applications
[Figure; annotation: memory]

Slide 40: DVFS + Throttling
[Figure]
Slide 41: Power Measurement Results – Efficiency
[Figure]

Slide 42: So What?
- Our models work pretty well
- More accurate measurement → more accurate models
- All measurement methods incur some error
- Intel Shady Brook uses similar approach to implement “digital power meter”
- So we must be doing something right!
Slide 43: Live Power Management
- Proof-of-concept
- Goal
  - Schedule tasks under strict power budget
  - Minimal overhead
- Methodology
  - User-level meta scheduler
  - DVFS + process suspension to maintain power envelope
  - Two sample policies for process selection

Slide 44: Live Power Management
- Three categories of benchmarks
  - CPU bound
  - Memory bound
  - Mixed
- Power envelope set to 95%, 90%, 85%
- Results for both with/without DVFS
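To make the process-suspension idea concrete, here is a rough sketch of one possible selection step for a user-level meta scheduler working under a power envelope. It is not the scheduler from the talk: the slides do not describe the two policies' logic, so the greedy "suspend the least efficient process first" rule, the function name `choose_suspensions`, and the per-process figures are all assumptions. DVFS is left out entirely.

```python
def choose_suspensions(procs, budget_watts):
    """Pick processes to suspend so the estimated total power fits the budget.

    procs maps a process name to (estimated_watts, instructions_per_second).
    Processes are considered in order of increasing instructions per watt,
    and suspended one by one until the remaining total is within budget.
    Returns the list of names to suspend, in suspension order.
    """
    total = sum(watts for watts, _ in procs.values())
    by_efficiency = sorted(procs, key=lambda name: procs[name][1] / procs[name][0])
    suspended = []
    for name in by_efficiency:
        if total <= budget_watts:
            break
        total -= procs[name][0]
        suspended.append(name)
    return suspended

# Hypothetical per-process estimates, e.g. from a PMC-based power model.
procs = {
    "a": (20.0, 4e9),
    "b": (25.0, 2e9),
    "c": (15.0, 3e9),
}
peak = sum(watts for watts, _ in procs.values())
to_suspend = choose_suspensions(procs, budget_watts=0.85 * peak)
```

A real scheduler would rerun this decision every sampling interval and resume suspended processes once headroom returns; that loop is omitted here.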
Slide 45: Workloads with Different Intensities
- CPU bound
  - ep, gamess, namd, povray
  - calculix, ep, gamess, gromacs, h264ref, namd, perlbench, povray
- Moderate
  - art, lu, wupwise, xalancbmk
  - bwaves, cactusADM, fma3d, gcc, leslie3d, sp, ua, xalancbmk
- Memory bound
  - astar, mcf, milc, soplex
  - applu, astar, lbm, mcf, milc, omnetpp, soplex, swim

Slide 46: Meta-Scheduler Results: Intel Q6600
[Figure: Max Instructions/Watt and Per-core Fair policies; panels for 90% Power Envelope (Moderate Computational Intensity) and 95% Power Envelope (CPU-bound Workload)]
Slide 47: Meta-Scheduler Results: AMD Phenom
[Figure: Max Instructions/Watt and Per-core Fair policies; panels for 90% Power Envelope (Moderate Computational Intensity) and 95% Power Envelope (CPU-bound Workload)]

Slide 48: Performance Results: Intel Q6600
[Figure: panels for CPU-Bound, Memory-Bound, Moderate]
Slide 49: Performance Results: AMD Phenom
[Figure: panels for CPU-Bound, Memory-Bound, Moderate]

Slide 50: Performance Results: AMD Phenom
[Figure: panels for CPU-Bound, Memory-Bound, Moderate]