
Portable, Scalable, per-Core Power Estimation



  1. Portable, Scalable, per-Core Power Estimation
     Sally A. McKee, Chalmers University of Technology

     Slide 2: Why Care about Power?
     - Packaging/cooling
     - Operating costs
     - Performance
     - Reliability
     - Battery lifetime
     - Device lifetime
     - Ergonomics

  2. Slide 3: What Can We Do with Power Info?
     - Optimize thread allocation
     - Manage workloads for
       - Power constraints
       - Temperature constraints
       - Data locality
     - Budget power per core, process, or thread
     - Adapt frequencies for performance requirements
     - Resize/turn off structures

     Slide 4: More Observations
     - Energy efficiency essential at all scales
     - Component power consumption difficult to measure
     - Processors share same power plane
     - External meters give total node power
     - Meter per node impossible for large-scale systems
     - Embedded measurement devices financially infeasible
     - Even invasive hardware still suffers inaccuracy
     - But dynamic power estimation possible using performance monitoring counters (PMCs)

  3. Slide 5: Approach
     - Analytic models based on PMCs
     - Gather performance data from microbenchmarks
     - Collect power measurements
     - Categorize counters
     - Choose counters most strongly correlated with power
     - Advantages
       - Easy
       - Portable
       - Dynamic
       - Application-independent

     Slide 6: Approach (cont.)
     - Microbenchmarks stress PMCs
     - Four categories sufficient:
       - FP ops
       - Memory
       - Stalls
       - Instructions retired
     - Future applications also described by model
     [Figure: Blowup of Core, AMD Phenom 9500 core (source: www.amd.com). Labeled blocks: 128-bit FPU, 512kB L2 Cache, Load/Store, L1 Data Cache, Execution, Fetch/Decode/Branch, L1 Instr Cache]
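The counter-selection step on Slide 5 can be sketched as follows. This is a minimal illustration, not the authors' tooling: it assumes Pearson correlation as the ranking metric (the slides do not name one), and the category names and all sample values are made up. Only the counter names L2_CACHE_MISS and DISPATCH_STALLS come from the deck; `hyp_mem_other` and `hyp_stall_other` are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def pick_counters(rates, power, categories):
    """For each category, keep the counter whose rate correlates most
    strongly (by absolute value) with measured power."""
    return {
        cat: max(names, key=lambda name: abs(pearson(rates[name], power)))
        for cat, names in categories.items()
    }

# Made-up microbenchmark data: one power reading (watts) per run,
# and one per-cycle event rate per counter per run.
power = [50.0, 55.0, 60.0, 65.0]
rates = {
    "L2_CACHE_MISS":   [0.01, 0.02, 0.03, 0.04],
    "hyp_mem_other":   [0.30, 0.10, 0.40, 0.20],  # hypothetical counter
    "DISPATCH_STALLS": [0.20, 0.25, 0.28, 0.35],
    "hyp_stall_other": [0.50, 0.50, 0.50, 0.50],  # hypothetical counter
}
categories = {
    "memory": ["L2_CACHE_MISS", "hyp_mem_other"],
    "stalls": ["DISPATCH_STALLS", "hyp_stall_other"],
}
chosen = pick_counters(rates, power, categories)
```

Ranking by absolute value keeps counters that correlate negatively with power, which matters because the deck notes that the FP counter does so on the AMD Phenom.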

  4. Slide 7: Initial Setup
     - Measurement: pfmon, Watts Up Pro meter
     - Benchmarks: SPEC 2006, SPEC OMP, NAS (gcc 4.2 -O3 [-OpenMP])

     Slide 8: Forming the Model
     - Counters with highest correlation become model inputs
     - Counters e_i normalized to cycle count to give r_i
     - Piece-wise linear model for per-core power
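The model structure described on Slide 8 can be sketched as below. The slide's equation did not survive extraction, so this is only one plausible shape under stated assumptions: two linear pieces selected by comparing one counter's rate against a threshold. Every coefficient, the threshold, and the raw counts are hypothetical.

```python
def normalize(event_counts, cycles):
    """Turn raw event counts e_i into per-cycle rates r_i."""
    return [e / cycles for e in event_counts]

def estimate_core_power(rates, low_coeffs, high_coeffs, threshold, split_index=0):
    """Piece-wise linear estimate: pick a coefficient set based on whether
    rates[split_index] is below the threshold, then evaluate
    intercept + sum(weight_i * r_i)."""
    coeffs = low_coeffs if rates[split_index] < threshold else high_coeffs
    intercept, weights = coeffs[0], coeffs[1:]
    return intercept + sum(w * r for w, r in zip(weights, rates))

# Hypothetical raw counts for four counters over one sampling interval.
cycles = 1_000_000
rates = normalize([1_000, 2_000_000, 500_000, 250_000], cycles)

# Hypothetical coefficients: (intercept, w_1, w_2, w_3, w_4).
LOW = (10.0, 100.0, 2.0, -1.0, 1.0)
HIGH = (12.0, 50.0, 2.0, -1.0, 1.0)

watts = estimate_core_power(rates, LOW, HIGH, threshold=0.01)
```

In practice the coefficients would be fitted per platform from the microbenchmark and power-meter data; the negative third weight here only mirrors the kind of negative correlation a counter can show.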

  5. Slide 9: Forming the Model: AMD Phenom
     - Function behavior differs for very low values of L2 counter
     - All except FP correlate positively with power
     - Including temperature increases accuracy
     - Model inputs:
       - e_1: L2_CACHE_MISS
       - e_2: RETIRED_UOPS
       - e_3: RETIRED_MMX_AND_FP_INSTRUCTIONS
       - e_4: DISPATCH_STALLS

     Slide 10: Model Validation
     - Comparison of estimated and measured power
       - At wall socket
       - At ATX rails
       - On motherboard
     - Three benchmark suites (45 benchmarks)
       - Single- and multi-threaded
       - Floating point and integer
     - Six platforms (2-8 cores from Intel/AMD)

  6. Slide 11: Maximum Estimation Errors

     Quad core
     Benchmark   AMD Phenom 9500   Intel Q6600   Intel Core i7
     SPEC2006    3.51%             1.05%         1.61%
     NAS         4.52%             1.59%         3.11%
     SPEC OMP    5.16%             1.59%         4.14%

     Dual core and 8-core
     Benchmark   Intel Core Duo (dual core)   Intel E5430 (8-core)   AMD Opteron 8212 (8-core)
     SPEC2006    4.01%                        2.76%                  4.80%
     NAS         3.73%                        3.90%                  2.55%
     SPEC OMP    4.36%                        3.53%                  3.35%

     Table caption on slide: Median Errors

     Slide 12: Median Estimation Errors

     Quad core
     Benchmark   AMD Phenom 9500   Intel Q6600   Intel Core i7
     SPEC2006    3.51%             1.61%         1.05%
     NAS         4.52%             1.59%         3.11%
     SPEC OMP    1.59%             4.14%         5.16%

     Dual core and 8-core
     Benchmark   Intel Core Duo (dual core)   Intel E5430 (8-core)   AMD Opteron 8212 (8-core)
     SPEC2006    4.01%                        2.76%                  4.80%
     NAS         3.73%                        3.90%                  2.55%
     SPEC OMP    4.36%                        3.53%                  3.35%

  7. Slides 13-14: Estimation Results: Intel Q6600
     [Figures: estimated vs. measured power, panels NAS, SPEC OMP, SPEC 2006]

  8. Slide 15: Estimation Results: Intel Q6600
     - Best: 0.2% (lbm)
     - Worst: 8.4% (cg)
     - 98% of estimations < 10% error
     - 85% of estimations < 5% error
     - Overall: SPEC 2006 2.4%, NAS 3.5%, SPEC-OMP 2.0%

     Slide 16: Estimation Results: Intel E5430
     [Figure: panels NAS, SPEC OMP, SPEC 2006]
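Summary figures like those on Slide 15 (best, worst, share of estimations under 5% and 10% error) can be derived from paired estimated and measured power values. The sketch below assumes error is the absolute difference relative to measured power, which the slides do not spell out; the sample pairs are made up and do not reproduce the deck's results.

```python
from statistics import median

def percent_error(estimated, measured):
    """Absolute estimation error as a percentage of measured power."""
    return abs(estimated - measured) / measured * 100.0

def summarize(pairs):
    """Summarize a list of (estimated_watts, measured_watts) pairs."""
    errors = [percent_error(est, meas) for est, meas in pairs]
    n = len(errors)
    return {
        "best": min(errors),
        "worst": max(errors),
        "median": median(errors),
        "pct_under_5": sum(e < 5.0 for e in errors) / n * 100.0,
        "pct_under_10": sum(e < 10.0 for e in errors) / n * 100.0,
    }

# Made-up (estimated, measured) pairs in watts, one per benchmark run.
pairs = [(101.0, 100.0), (97.0, 100.0), (108.0, 100.0), (100.0, 100.0)]
summary = summarize(pairs)
```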

  9. Slide 17: Estimation Results: Intel E5430
     [Figure: panels NAS, SPEC OMP, SPEC 2006]

     Slide 18: Estimation Results: Intel E5430 8-Core
     - Best: 0.3% (ua)
     - Worst: 7.0% (hmmer)
     - 98% of estimations < 10% error
     - 85% of estimations < 5% error
     - Overall: SPEC 2006 3.5%, NAS 3.9%, SPEC-OMP 2.8%

  10. Slide 19: Standard Deviation of Error: E5430
      [Figure: % SD (axis 0-10) per NAS benchmark: bt, cg, ep, ft, lu, lu-hp, mg, sp, ua]

      Slide 20: Standard Deviation of Error: E5430
      [Figure: % SD (axis 0-10) per SPEC OMP benchmark]

  11. Slides 21-22: Estimation Results: AMD Phenom 9500
      [Figures: panels NAS, SPEC OMP, SPEC 2006]

  12. Slide 23: Estimation Results: AMD Phenom 9500
      - Best: 0.9% (libquantum)
      - Worst: 9.3% (xalancbmk)
      - 92% of estimations < 10% error
      - 73% of estimations < 5% error
      - Overall: SPEC 2006 4.5%, NAS 3.5%, SPEC-OMP 5.2%

      Slide 24: Estimation Results: AMD Opteron 8212
      [Figure: panels NAS, SPEC OMP, SPEC 2006]

  13. Slide 25: Estimation Results: AMD Opteron 8212
      [Figure: panels NAS, SPEC OMP, SPEC 2006]

      Slide 26: Estimation Results: AMD Opteron 8212
      - Best: 1.0% (cactusADM)
      - Worst: 10.6% (leslie3d)
      - 92% of estimations < 10% error
      - 73% of estimations < 5% error
      - Overall: SPEC 2006 4.5%, NAS 3.5%, SPEC-OMP 5.2%

  14. Slide 27: Estimation Results: Intel Core i7
      [Figure: panels NAS, SPEC OMP, SPEC 2006]

      Slide 28: Factors Affecting Model Accuracy
      - Availability of representative PMCs
      - PMCs available for simultaneous sampling
      - Sampling rate of power measurement
      - Accuracy of thermal sensors

      These look pretty good, but what are we missing?
      Could we do better w/ a different meter?

  15. Slide 29: Power Measurement Infrastructures
      - Wall outlet (Watts Up Pro)
        - Least intrusive
        - Low sampling rate
      - PSU output on the ATX power rails
        - Moderately intrusive
        - Requires custom hardware
      - Processor socket
        - Most intrusive
        - Requires soldering on motherboard

      Slide 30: Comparative Power Measurement Setup
      - Power measured at three points simultaneously
      - Test machine used to collect samples different from target Core i7
      - Custom sense hardware placed inside target machine cabinet

  16. Slide 31: PSU Output Measurement
      [Figure]

      Slide 32: Measurement at PSU Output
      [Figure]

  17. Slide 33: Measurement at Processor Socket
      - V_CPU = Core voltage
      - IMON = Voltage proportional to regulator current output

      Slide 34: Estimation Results (PSU Output)
      [Figure: panels NAS, SPEC OMP, SPEC 2006]
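The two socket signals on Slide 33 combine into a power reading roughly as follows. This is a sketch under one assumption: IMON maps linearly to regulator output current. The scale factor `amps_per_volt` is hypothetical and depends on the specific voltage regulator; the deck does not give it.

```python
def socket_power(v_cpu, v_imon, amps_per_volt):
    """CPU power in watts from core voltage and the IMON sense voltage.

    amps_per_volt is a hypothetical, regulator-specific scale factor that
    converts the IMON voltage into regulator output current.
    """
    current_amps = v_imon * amps_per_volt
    return v_cpu * current_amps

# Made-up readings: 1.2 V core voltage, 0.5 V on IMON, 100 A per volt.
example_watts = socket_power(v_cpu=1.2, v_imon=0.5, amps_per_volt=100.0)
```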

  18. Slide 35: Estimation Results (Socket)
      [Figure: panels NAS, SPEC OMP, SPEC 2006]

      Slide 36: Comparative Results: SPEC OMP/Core i7
      [Figure: panels Wall Socket, PSU (ATX Rails), Processor Socket (Motherboard)]

  19. Slide 37: Power Measurement Experiments
      - Sampling frequency (samples per second)
        - At wall outlet: 1
        - At ATX power rails and on MB: 50000
      - Measurements averaged over 50 samples
      - Test workload: 32x32 matmul in infinite loop
      - Theoretical measurement sensitivity
        - Current measurement at ATX rails: 2 mA
        - CPU voltage measurement on motherboard: 47.2 µV
        - CPU current measurement on motherboard: 7 mA

      Slide 38: Power Measurement Results
      [Figure: power trace, annotated "idle power" and "activating 1-4 cores"]
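The averaging step on Slide 37 can be illustrated with a short sketch. It assumes non-overlapping blocks of 50 samples, which turns a 50000 samples-per-second stream into 1000 averaged readings per second; the slides do not say whether the authors used block averaging or a moving average, and the input data here is synthetic.

```python
def block_average(samples, block=50):
    """Average consecutive, non-overlapping blocks of `block` samples.
    A trailing partial block is dropped."""
    usable = len(samples) - len(samples) % block
    return [sum(samples[i:i + block]) / block for i in range(0, usable, block)]

raw_rate_hz = 50_000                    # ATX rail / motherboard rate from the slide
effective_rate_hz = raw_rate_hz // 50   # readings per second after averaging

# Synthetic input: the integers 0..99 standing in for raw power samples.
averaged = block_average(list(range(100)), block=50)
```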

  20. Slide 39: CPU versus Memory-Bound Applications
      [Figure, annotated "memory"]

      Slide 40: DVFS + Throttling
      [Figure]

  21. Slide 41: Power Measurement Results - Efficiency
      [Figure]

      Slide 42: So What?
      - Our models work pretty well
      - More accurate measurement → more accurate models
      - All measurement methods incur some error
      - Intel Shady Brook uses similar approach to implement "digital power meter"
        - So we must be doing something right!

  22. Slide 43: Live Power Management
      - Proof-of-concept
      - Goal
        - Schedule tasks under strict power budget
        - Minimal overhead
      - Methodology
        - User-level meta scheduler
        - DVFS + process suspension to maintain power envelope
        - Two sample policies for process selection

      Slide 44: Live Power Management
      - Three categories of benchmarks
        - CPU bound
        - Memory bound
        - Mixed
      - Power envelope set to 95%, 90%, 85%
      - Results for both with/without DVFS
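The process-suspension half of the methodology on Slide 43 can be sketched as a greedy loop. This is an illustration only, not the authors' meta-scheduler: it omits DVFS and idle power, it assumes per-task power comes from the PMC-based estimate, and it uses a hypothetical selection rule that suspends the task with the lowest instructions per watt first. All task names and numbers are made up.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    power_w: float       # estimated per-core power (assumed to come from the PMC model)
    instr_per_s: float   # observed instruction throughput
    suspended: bool = False

def enforce_budget(tasks, budget_w):
    """Suspend running tasks, least instructions/watt first, until the
    summed estimated power fits the budget. Returns the remaining power."""
    running = sorted(
        (t for t in tasks if not t.suspended),
        key=lambda t: t.instr_per_s / t.power_w,
    )
    total = sum(t.power_w for t in running)
    for task in running:
        if total <= budget_w:
            break
        task.suspended = True
        total -= task.power_w
    return total

# Made-up workload; the envelope is 90% of unconstrained estimated power.
tasks = [
    Task("a", power_w=20.0, instr_per_s=2.0e9),
    Task("b", power_w=25.0, instr_per_s=1.0e9),
    Task("c", power_w=15.0, instr_per_s=3.0e9),
]
budget = 0.90 * sum(t.power_w for t in tasks)
remaining = enforce_budget(tasks, budget)
```

A real scheduler would likely try lowering frequency before suspending anything, and would resume suspended tasks once headroom returns; both are left out to keep the sketch short.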

  23. Slide 45: Workloads with Different Intensities
      - CPU bound
        - ep, gamess, namd, povray
        - calculix, ep, gamess, gromacs, h264ref, namd, perlbench, povray
      - Moderate
        - art, lu, wupwise, xalancbmk
        - bwaves, cactusADM, fma3d, gcc, leslie3d, sp, ua, xalancbmk
      - Memory bound
        - astar, mcf, milc, soplex
        - applu, astar, lbm, mcf, milc, omnetpp, soplex, swim

      Slide 46: Meta-Scheduler Results: Intel Q6600
      [Figure: panel labels Max Instructions/Watt, Per-core Fair, 90% Power Envelope with Moderate Computational Intensity, 95% Power Envelope with CPU-bound Workload]

  24. Slide 47: Meta-Scheduler Results: AMD Phenom
      [Figure: panel labels Max Instructions/Watt, Per-core Fair, 90% Power Envelope with Moderate Computational Intensity, 95% Power Envelope with CPU-bound Workload]

      Slide 48: Performance Results: Intel Q6600
      [Figure: panels CPU-Bound, Memory-Bound, Moderate]

  25. Slides 49-50: Performance Results: AMD Phenom
      [Figures: panels CPU-Bound, Memory-Bound, Moderate]
