Application-controlled Frequency Scaling Jons-Tobias Wamhoff - PowerPoint PPT Presentation


  1. Application-controlled Frequency Scaling
     Jons-Tobias Wamhoff, Stephan Diestelhorst, Christof Fetzer (Technische Universität Dresden, Germany)
     Patrick Marlier, Pascal Felber (Université de Neuchâtel, Switzerland)
     Dave Dice (Oracle Labs, USA)

  2. Overview
     • Dynamic voltage and frequency scaling (DVFS)
       • traditionally: used to save energy or boost sequential bottlenecks/serial peak loads
       • today: improve performance by exposing asymmetric properties of applications
     • Outline
       • Recap of DVFS features on current x86 multicores
       • DVFS properties: latency and power
       • Applying DVFS at the application level

  3. P- and C-states
     • P-states: performance states
       • predefined frequency/voltage pairs
       • controlled through machine-specific registers (MSRs, privileged rdmsr / wrmsr)
     • C-states: power states
       • trade entry/wakeup latency for higher power savings
       • entered by hlt or monitor / mwait
     [Diagram: frequency/voltage axis with P turbo, …, P base, …, P slow in C0; halted states C1-Cn]
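
To make the MSR access path on the slide above concrete, here is a minimal sketch of reading one register from user space on Linux. It assumes the `msr` kernel module is loaded and the process runs as root, which is how the privileged `rdmsr` instruction is usually exposed through `/dev/cpu/<n>/msr`. The register constant is an assumption for illustration (Intel's `IA32_PERF_STATUS`); the slides do not name a specific register, and `read_msr` is a hypothetical helper, not part of any library mentioned in the talk.

```python
import os
import struct

# Assumption: Intel's IA32_PERF_STATUS register; the slides do not name a register.
IA32_PERF_STATUS = 0x198


def read_msr(cpu: int, reg: int, path_template: str = "/dev/cpu/{cpu}/msr") -> int:
    """Read one 64-bit MSR through the Linux msr driver (needs root and `modprobe msr`)."""
    fd = os.open(path_template.format(cpu=cpu), os.O_RDONLY)
    try:
        # The msr device interprets the file offset as the register address.
        data = os.pread(fd, 8, reg)
    finally:
        os.close(fd)
    if len(data) != 8:
        raise OSError(f"short read from MSR {reg:#x}")
    # The driver returns the register value as 8 little-endian bytes.
    return struct.unpack("<Q", data)[0]
```

On a machine where the device node exists, `read_msr(0, IA32_PERF_STATUS)` returns the raw register value for core 0; decoding the bit fields is vendor- and model-specific and is left out here.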

  4. AMD Turbo CORE & Intel Turbo Boost
     • Voltage and frequency domain: module vs. package
     • Boosting: deterministic vs. thermal
     • AMD only: asymmetric frequencies with manual boost
     [Diagram: AMD (x86, FPU, x86) and Intel (HT, HT) units with states P turbo, P base, P slow, ≥ C1]

  5. Evaluation Setup
     [Timeline: Acquire entry, t wait, Acquire exit, t CS, Release; frequency f P base over time]
     • Critical sections (CS) protected by MCS queue lock
     • Decorations on acquire/release → trigger DVFS
     • Variable size of CS → amortize DVFS cost
     • Effective CS frequency: $f_{CS} = f_{base} \cdot \frac{t_{CS}}{t_{A+CS+R}}$
     • Energy for 1 hour at P base: $E_{NORM} = E_{sample} \cdot \frac{t_{A+CS+R}}{t_{CS}}$
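
The two metrics on the evaluation slide can be checked with a small worked example. The sketch below reads them as f_CS = f_base · t_CS / t_(A+CS+R) and E_NORM = E_sample · t_(A+CS+R) / t_CS, where t_(A+CS+R) is the time for acquire, critical section, and release together. All numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from the talk.

```python
def effective_cs_frequency(f_base_ghz: float, t_cs: float, t_total: float) -> float:
    """Base frequency scaled by the fraction of time spent inside the CS."""
    return f_base_ghz * t_cs / t_total


def normalized_energy(e_sample: float, t_cs: float, t_total: float) -> float:
    """Sampled energy scaled up by the lock overhead around the CS."""
    return e_sample * t_total / t_cs


# Hypothetical inputs: a 3.0 GHz base frequency, 8000 cycles in the CS,
# and 10000 cycles for acquire + CS + release.
f_cs = effective_cs_frequency(3.0, 8000, 10000)
e_norm = normalized_energy(0.2, 8000, 10000)
print(f"f_CS = {f_cs:.2f} GHz, E_NORM = {e_norm:.3f} kWh")
```

With 20% of the time lost to acquire and release, the effective frequency drops below the base frequency and the normalized energy rises above the sampled energy, so the two metrics move in opposite directions as lock overhead grows.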

  6. Automatic Frequency Scaling
     [Timeline: t wait, t ramp, t CS at f P base and f P turbo; transitions t P turbo → P base, t P base → C halt, t C halt → P base; OS halt: entry, wakeup; CPU: deeper C-state, boosted P-state]
     • Decoration: spinning vs. blocking
     • P-state transitions triggered by hardware

  7. Blocking vs. Spinning Locks
     [Plots: Frequency AMD and Frequency Intel (y-axis: f CS in GHz), Energy AMD and Energy Intel (y-axis: E NORM in kWh), each over Size CS (cycles, log); series: spin, futex; annotations: 1.5M, 4M, 10k, 1M, t wait = 7M, t wait = 70k]

  8. Manual Frequency Scaling
     [Timeline: t wait, t ramp, t CS at f P slow, f P base, and f P turbo; transitions t P turbo → P base, t P base → P slow, t P slow → P turbo]
     [Cost annotations, three values per row in slide order: ioctl 1k, 1k, 1k; wrmsr 28k, 2k, 23k; transition 2k, 225k, 1k]
     • Decoration: spin and application-level DVFS control
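
Since a manual P-state change has a fixed cost, a critical section must be long enough to pay it back. The sketch below estimates that break-even size under a deliberately simple model that is an assumption here, not the talk's: the CS needs the same number of cycles at either frequency (IPC independent of frequency), and the switching overhead is paid once at the base frequency. The function name and the example numbers are hypothetical.

```python
def break_even_cs_cycles(overhead_cycles: float, f_base_ghz: float, f_boost_ghz: float) -> float:
    """CS size (in cycles) at which boosting takes as long as staying at base.

    Model (assumed): unboosted time = n / f_base,
    boosted time = overhead / f_base + n / f_boost.
    Setting both equal gives n = overhead * f_boost / (f_boost - f_base).
    """
    if f_boost_ghz <= f_base_ghz:
        raise ValueError("boost frequency must exceed base frequency")
    return overhead_cycles * f_boost_ghz / (f_boost_ghz - f_base_ghz)


# Hypothetical numbers: 1000 cycles of switching overhead, 3.0 GHz base, 4.0 GHz boost.
print(break_even_cs_cycles(1000, 3.0, 4.0))
```

Critical sections shorter than the returned size lose time by boosting under this model; longer ones gain. The closer the boost frequency is to the base frequency, the larger the break-even size becomes.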

  9. Manual Lock Boosting
     [Plots: Frequency AMD (y-axis: f CS in GHz) and Energy AMD (y-axis: E NORM in kWh), each over Size CS (cycles, log); series: spin, ownr, dlgt, mgrt; annotations: 200k, 400k, 600k, futex: 1.5M]
     • spin: static P base
     • owner: dynamically boost
     • delegate: dedicated wrmsr core
     • migrate: statically boosted core
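
A lock decoration of the kind compared on the slide above can be sketched as a wrapper that runs hooks around the critical section. This is one possible reading of the "owner" strategy, where the thread holding the lock raises its own frequency after acquiring and lowers it before releasing. `DecoratedLock` and its hook parameters are hypothetical names, `threading.Lock` stands in for the MCS queue lock used in the evaluation, and the hooks only record events instead of writing P-state registers.

```python
import threading


class DecoratedLock:
    """Lock wrapper that calls hooks after acquire and before release."""

    def __init__(self, on_acquired=None, on_release=None):
        self._lock = threading.Lock()
        # Default hooks do nothing, which corresponds to a plain lock.
        self._on_acquired = on_acquired or (lambda: None)
        self._on_release = on_release or (lambda: None)

    def __enter__(self):
        self._lock.acquire()
        try:
            self._on_acquired()  # e.g. request a boosted P-state
        except BaseException:
            self._lock.release()  # do not leave the lock held if the hook fails
            raise
        return self

    def __exit__(self, exc_type, exc, tb):
        try:
            self._on_release()  # e.g. return to the base P-state
        finally:
            self._lock.release()
        return False  # never swallow exceptions from the critical section


# Usage: the hooks here just log, so the example runs without privileges.
events = []
lock = DecoratedLock(lambda: events.append("boost"), lambda: events.append("unboost"))
with lock:
    events.append("critical section")
print(events)
```

Keeping the hooks separate from the lock makes it possible to swap strategies (no-op for spin, a register write for owner, a message to another core for delegate) without touching the code inside the critical section.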

  10. TURBO Library
     • Convenient programmatic application-level DVFS control
     • Testbed to explore challenges of future heterogeneous cores
     [Architecture diagram, top to bottom:
       Execution control: ThreadRegistry (Create/Register), ThreadControl (Decorate lock, barriers, …: boosting/profiling)
       Performance configuration: Thread (Migrate to core), P-States (Setting & configuration), PerformanceMonitor (Low-level profiling)
       Hardware abstraction: Topology, PCI-Configuration, MSR-Interface, PerfEvent (P-states, HW counters)
       Linux kernel and hardware interfaces]
     https://bitbucket.org/donjonsn/turbo

  11. Boosting Applications
     • Expose application knowledge
       • Asymmetric software transactional memory: up to 50% speedup with only 2% more energy
     • Tradeoffs when IPC depends on core frequency
       • Hash table resize in memcached: 9% speedup but 22% higher frequency
     • Outweigh P-state latency by delegating CS
       • High cross-module round-trip delay (2k cycles)
       • Intra-module delay scales with P-state (P boost: 280 cycles)

  12. Next Steps
     • Intel Haswell-EP supports per-core P-states
       • Allows giving hints
     • Application domains
       • Real-time scheduling
       • Fork-join benchmarks
       • …?
