
Dynamic Fine-Grained Scheduling for Energy-Efficient Main-Memory Queries - PowerPoint PPT Presentation



  1. Dynamic Fine-Grained Scheduling for Energy-Efficient Main-Memory Queries
     Iraklis Psaroudakis (EPFL, SAP AG), Thomas Kissinger (TU Dresden), Danica Porobic (EPFL), Thomas Ilsche (TU Dresden), Erietta Liarou (EPFL), Pinar Tözün (EPFL), Anastasia Ailamaki (EPFL), Wolfgang Lehner (TU Dresden)

  2. Why care about power?
     [Pie chart: monthly datacenter costs [J. R. Hamilton]: servers, networking equipment, power distribution & cooling, power, other; roughly 30% of the cost is power-related, and the dynamic fraction is increasing]
     [Plot: energy proportionality, power vs. utilization, today vs. ideal]
     Getting there:
     • Power management features
     • Power-aware software
     We need to make DBMS power-aware

  3. Power management features
     • Dynamic voltage and frequency scaling (DVFS): 1.2GHz to 2.9GHz
     • Turbo boost: above 2.9GHz
     • Idle states (C-states)
     • Power-related H/W counters
     We can exploit these to improve energy efficiency
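The power-related hardware counters above are exposed on Linux through the intel_rapl powercap driver. Below is a minimal sketch, not part of the talk, that samples the package energy counter and derives average power; the exact sysfs paths and the need for root privileges are assumptions that vary by kernel and machine. DRAM energy, when exposed, appears as a subdomain under the same tree and can be sampled the same way.

```python
# Sketch: sample CPU package energy via the Linux powercap (intel_rapl) sysfs
# interface and derive average power over a window. Paths are assumptions.
import time

RAPL_PKG = "/sys/class/powercap/intel-rapl:0/energy_uj"            # package 0 energy counter
RAPL_MAX = "/sys/class/powercap/intel-rapl:0/max_energy_range_uj"  # wrap-around limit

def read_uj(path):
    with open(path) as f:
        return int(f.read())

def average_power_watts(seconds=5.0):
    """Average package power in Watts over `seconds`, handling counter wrap-around."""
    start = read_uj(RAPL_PKG)
    time.sleep(seconds)
    end = read_uj(RAPL_PKG)
    if end < start:                       # the energy counter wrapped during the window
        end += read_uj(RAPL_MAX)
    return (end - start) / 1e6 / seconds  # microjoules -> Joules -> Watts

if __name__ == "__main__":
    print(f"average package power: {average_power_watts(1.0):.1f} W")
```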

  4. Current approaches
     • Black box, e.g. dynamic concurrency throttling [TPDS13]: treats the DBMS as a black box, leading to unpredictable behavior
     • Query optimizer with power costs [ICDE10]: coarse-grained, without low-level tuning
     We need fine-grained energy-awareness in the database

  5. Fine-grained energy-aware scheduling
     How do you schedule this query plan? [Diagram: query plan with an aggregation (Σ) over a scan (S)]
     • Parameters:
       – parallelism
       – thread placement
       – data placement
       – dynamic voltage and frequency scaling (DVFS)
     Calibration of operators under different parameters
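A minimal sketch of what such a calibration pass could look like, assuming hypothetical run_operator and measure_energy helpers; the parameter grid mirrors the ranges used later in the slides (up to 32 threads, four placements, 1.2 to 2.9 GHz). It illustrates the idea, not the system described in the paper.

```python
# Sketch: calibrate one operator over a grid of scheduling parameters and record
# its energy efficiency. run_operator and measure_energy are hypothetical helpers.
from itertools import product

PARALLELISM   = [1, 2, 4, 8, 16, 32]
PLACEMENTS    = ["socket-fill", "socket-fill-HT", "socket-wise", "socket-wise-HT"]
FREQUENCY_GHZ = [1.2, 2.0, 2.9]

def calibrate(run_operator, measure_energy):
    profile = {}
    for threads, placement, freq in product(PARALLELISM, PLACEMENTS, FREQUENCY_GHZ):
        # measure_energy runs the operator and returns (Joules, seconds, work done)
        joules, seconds, work = measure_energy(run_operator, threads, placement, freq)
        throughput = work / seconds          # e.g. bytes or tuples per second
        power = joules / seconds             # average Watts
        profile[(threads, placement, freq)] = throughput / power
    return profile                           # consulted later at runtime
```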

  6. Concurrent partitioned scans
     • Each thread scans 128MB of integers for 5 secs
     • Maximize: performance per power = throughput / power
       – under different parallelism, scheduling, and frequency settings
     • Machine: two 8-core Intel Xeon E5-2690, HT enabled, 64GB RAM, frequencies from 1.2GHz to 2.9GHz
     • Power measurements: hardware performance counters RAPL (CPU & DRAM), and external equipment
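For illustration, the sketch below has a single worker scan a 128 MB array of integers for a fixed interval and reports its scan throughput; dividing by the average power (from the RAPL helper above or an external meter) gives the throughput-per-Watt metric being maximized. The 8-byte element size and the plain Python scan loop are illustrative assumptions, not the paper's benchmark code.

```python
# Sketch: one worker of the partitioned-scan micro-benchmark and its efficiency metric.
import array
import time

def scan_partition(seconds=5.0, size_bytes=128 * 1024 * 1024):
    """Repeatedly scan a partition of 8-byte integers for `seconds`; return bytes/s."""
    data = array.array("q", range(size_bytes // 8))
    total_bytes = 0
    deadline = time.time() + seconds
    while time.time() < deadline:
        _ = sum(data)                         # sequential read over the whole partition
        total_bytes += len(data) * data.itemsize
    return total_bytes / seconds

def throughput_per_watt(bytes_per_sec, avg_watts):
    """The metric maximized on this slide: throughput divided by power."""
    return (bytes_per_sec / 1e9) / avg_watts  # GB/s per Watt
```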

  7. Socket-fill scheduling
     [Diagram: threads 1-8 fill the physical cores of socket 1, threads 9-16 their hyper-thread siblings, threads 17-24 the cores of socket 2, threads 25-32 its hyper-threads]
     [Plot: throughput per Watt vs. # threads for the Auto (RAPL) setting; efficiency flattens at the memory bandwidth saturation point]
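A minimal sketch of the socket-fill placement in the diagram, assuming the common Linux CPU numbering (CPUs 0-7 and 8-15 are the physical cores of sockets 1 and 2, CPUs 16-31 their hyper-thread siblings); that numbering is an assumption and differs across machines.

```python
# Sketch: CPU order for socket-fill scheduling, as read off the slide's diagram.
import os

CORES_PER_SOCKET, SOCKETS = 8, 2
TOTAL_CORES = CORES_PER_SOCKET * SOCKETS       # 16 physical cores

def socket_fill_order():
    order = []
    for socket in range(SOCKETS):
        cores = [socket * CORES_PER_SOCKET + c for c in range(CORES_PER_SOCKET)]
        order += cores                             # physical cores of this socket first
        order += [c + TOTAL_CORES for c in cores]  # then their hyper-thread siblings
    return order

def pin_worker(worker_index, order=None):
    """Pin the calling worker to the CPU chosen for it (Linux only)."""
    order = order or socket_fill_order()
    os.sched_setaffinity(0, {order[worker_index]})
```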

  8. Socket-fill scheduling
     [Same placement diagram as the previous slide]
     [Plot: throughput per Watt vs. # threads for the Auto setting, measured both via RAPL and via external equipment; the two curves differ by a constant offset]

  9. Socket-fill scheduling
     [Same placement diagram as before]
     [Plot: throughput per Watt vs. # threads at 1.2GHz, 2.0GHz, 2.9GHz, and Auto; an intermediate frequency is best, and the frequencies have different bandwidth saturation points]
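The fixed-frequency configurations compared here can be approximated on Linux through the cpufreq sysfs interface. The sketch below merely caps the maximum frequency of every core; truly fixing the frequency, as in the experiments, would also require raising scaling_min_freq or using the userspace governor. Paths, driver behaviour, and the need for root are machine-dependent assumptions.

```python
# Sketch: cap every core's frequency through cpufreq sysfs (requires root).
import glob

def cap_frequency_khz(khz):
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_max_freq"):
        with open(path, "w") as f:
            f.write(str(khz))

# cap_frequency_khz(1_200_000)  # 1.2 GHz
# cap_frequency_khz(2_900_000)  # 2.9 GHz
```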

  10. Socket-fill HT scheduling
     [Diagram: both hardware threads of each core are filled before moving to the next core, socket by socket (threads 1-2 on core 1, threads 3-4 on core 2, ..., threads 17-18 on core 9)]
     [Plot: throughput per Watt vs. # threads at 1.2GHz, 2.0GHz, 2.9GHz, and Auto; hyper-threading draws negligible power]

  11. Socket-wise scheduling
     [Diagram: threads are dealt round-robin across the two sockets (thread 1 on socket 1, thread 2 on socket 2, and so on), with hyper-threads used last]
     [Plot: throughput per Watt vs. # threads at 1.2GHz, 2.0GHz, 2.9GHz, and Auto; spreading across sockets avoids socket-specific bandwidth saturation]

  12. Socket-wise HT scheduling
     [Diagram: threads alternate between the sockets and fill both hardware threads of a core before moving on (threads 1-2 on core 1 of socket 1, threads 3-4 on core 9 of socket 2, and so on)]
     [Plot: throughput per Watt vs. # threads at 1.2GHz, 2.0GHz, 2.9GHz, and Auto; this placement reaches the best energy efficiency, about 1.3x better]
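The two socket-wise diagrams can be captured with the same kind of placement function as the socket-fill sketch above, under the same assumed CPU numbering; ht_first=True corresponds to the socket-wise HT variant on this slide.

```python
# Sketch: CPU order for socket-wise scheduling (round-robin across sockets),
# optionally filling both hardware threads of a core before moving on.
CORES_PER_SOCKET, SOCKETS = 8, 2
TOTAL_CORES = CORES_PER_SOCKET * SOCKETS

def socket_wise_order(ht_first=False):
    order = []
    for core in range(CORES_PER_SOCKET):
        for socket in range(SOCKETS):
            cpu = socket * CORES_PER_SOCKET + core
            if ht_first:
                order += [cpu, cpu + TOTAL_CORES]  # core, then its hyper-thread sibling
            else:
                order.append(cpu)
    if not ht_first:
        order += [c + TOTAL_CORES for c in order[:TOTAL_CORES]]  # hyper-threads last
    return order

# socket_wise_order()              -> [0, 8, 1, 9, ..., 7, 15, 16, 24, ...]
# socket_wise_order(ht_first=True) -> [0, 16, 8, 24, 1, 17, 9, 25, ...]
```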

  13. Parallel aggregation
     • a = Σ (b_i + c_i), 4GB arrays
     • Minimize: energy delay product (EDP) = response time (sec) * energy (J)
       – under different parallelism, scheduling, and memory placement
     • Machine: two 8-core Intel Xeon E5-2640, HT disabled, 256GB of RAM
     • Memory placement: on first socket, or interleaved
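A small sketch of the workload and the metric on this slide: the aggregation sums the element-wise totals of two arrays, and the quantity minimized is the energy-delay product. The pure-Python aggregation is only illustrative (the paper's arrays are 4 GB); the energy value is assumed to come from RAPL or an external meter.

```python
# Sketch: the aggregation a = sum(b_i + c_i) and the energy-delay product metric.
def aggregate(b, c):
    return sum(bi + ci for bi, ci in zip(b, c))

def energy_delay_product(response_time_sec, energy_joules):
    """EDP = response time (s) * energy (J); lower is better."""
    return response_time_sec * energy_joules

# Example: a run taking 2.0 s at an average 150 W consumes 300 J, so EDP = 600 J*s.
```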

  14. Parallel aggregation
     [Two plots: EDP (kJ x sec) vs. # threads (0-16) for socket-fill and socket-wise scheduling, one with memory on the first socket and one with memory interleaved; annotations: socket-wise better, bandwidth constrained]

  15. Main-memory memory-bound operations
     • An intermediate frequency has the best efficiency
       – different saturation points
     • Avoid memory bandwidth saturation
       – by data and thread placement
     • Up to 4x energy efficiency

  16. Fine-grained energy awareness
     [Diagram: calibration analysis of operators and parameters -> measurements with hardware counters and/or external equipment -> runtime decisions on scheduling, resource allocation, and power management; part of the workflow is labeled THIS PAPER]
     [Plots: energy efficiency vs. # threads, and power over time; annotated with parallelism, CPU utilization, memory utilization, data & thread placement, DVFS]
     Thank you!
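As a very rough sketch of the runtime-decision box, assuming a calibrated profile per operator (as produced by the calibration sketch earlier) and hypothetical read_counters and apply_config helpers; the policy in a real system would be considerably more involved.

```python
# Sketch: a runtime loop that feeds measurements and calibrated profiles into
# scheduling decisions. read_counters and apply_config are hypothetical helpers.
import time

def scheduler_loop(profiles, active_operators, read_counters, apply_config, interval=1.0):
    while True:
        counters = read_counters()             # e.g. power, CPU and memory-bandwidth utilization
        for op in active_operators():          # operators currently running in query plans
            # pick the calibrated (threads, placement, frequency) with the best efficiency
            best = max(profiles[op], key=profiles[op].get)
            apply_config(op, best, counters)   # adjust parallelism, placement, DVFS
        time.sleep(interval)
```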

  17. References
     • [J. R. Hamilton] J. R. Hamilton. Internet-Scale Datacenter Economics: Where the Costs and Opportunities Lie. HPTS, 2011.
     • [TPDS13] D. Li, B. R. de Supinski, M. Schulz, D. S. Nikolopoulos, and K. W. Cameron. Strategies for energy-efficient resource management of hybrid programming models. IEEE TPDS, 24(1):144-157, 2013.
     • [ICDE10] Z. Xu, Y.-C. Tu, and X. Wang. Exploring power-performance tradeoffs in database systems. In ICDE, pages 485-496, 2010.
