Dynamic Fine-Grained Scheduling for Energy-Efficient Main-Memory Queries
Iraklis Psaroudakis (EPFL, SAP AG), Thomas Kissinger (TU Dresden), Danica Porobic (EPFL), Thomas Ilsche (TU Dresden), Erietta Liarou (EPFL), Pinar Tözün (EPFL), Anastasia Ailamaki (EPFL), Wolfgang Lehner (TU Dresden)
Why care about power?
Monthly datacenter costs [J. R. Hamilton]
• Servers: 57%
• Power Distribution & Cooling: 18%
• Power: 13%
• Networking Equipment: 8%
• Other: 4%
About 30% of costs are power-related, and the fraction is increasing
Energy proportionality [plot: power vs. utilization; curves: Today, Ideal; label: Dynamic]
Getting there:
• Power management features
• Power-aware software
We need to make DBMS power-aware
Power management features
• Dynamic voltage and frequency scaling (DVFS): 1.2GHz to 2.9GHz
• Turbo boost: > 2.9GHz
• Idle states (C-states)
• Power-related H/W counters
We can exploit these to improve energy efficiency
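As a rough illustration of how power-related hardware counters can be used, the sketch below turns two readings of a cumulative energy counter into average power. It assumes the Linux powercap sysfs layout (`/sys/class/powercap/intel-rapl:0` with `energy_uj` and `max_energy_range_uj`); the domain path and the helper names are assumptions for illustration and are not taken from the slides.

```python
from pathlib import Path

# Assumed location of the package-0 RAPL domain in the Linux powercap interface.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_counter(name: str, domain: Path = RAPL_DOMAIN) -> int:
    """Read one integer counter file (e.g. 'energy_uj') from a RAPL domain."""
    return int((domain / name).read_text().strip())


def average_power_watts(start_uj: int, end_uj: int,
                        elapsed_s: float, max_range_uj: int) -> float:
    """Average power between two cumulative energy readings (microjoules).

    The counter wraps around at max_range_uj, so a negative difference is
    treated as exactly one wraparound.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta_uj = end_uj - start_uj
    if delta_uj < 0:
        delta_uj += max_range_uj
    return delta_uj / 1e6 / elapsed_s
```

On a machine that exposes this interface, one would sample `read_counter("energy_uj")` before and after a run and pass `read_counter("max_energy_range_uj")` as the wrap limit; reading these files may require elevated privileges.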
Current approaches
• DBMS as a black box
  – e.g. dynamic concurrency throttling [TPDS13]
  – unpredictable behavior
• Query optimizer + power costs [ICDE10]
  – coarse-grained, without low-level tuning
We need fine-grained energy-awareness in the database
Fine-grained energy-aware scheduling
How do you schedule this query plan? [diagram: query plan with operators Σ and S]
• Parameters:
  – parallelism
  – thread placement
  – data placement
  – dynamic voltage and frequency scaling (DVFS)
Calibration of operators under different parameters
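A minimal sketch of what calibrating an operator over these parameters could look like: enumerate the parameter grid, measure each configuration, and keep the one with the best throughput per watt. The `Config` fields, the function names, and the `measure` callback are hypothetical; the slides do not prescribe an implementation.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Config:
    """One point in the scheduling parameter space (hypothetical structure)."""
    threads: int
    thread_placement: str
    data_placement: str
    freq_ghz: float


def parameter_grid(threads: Iterable[int], thread_placements: Iterable[str],
                   data_placements: Iterable[str],
                   freqs_ghz: Iterable[float]) -> List[Config]:
    """All combinations of the four scheduling parameters."""
    return [Config(*combo) for combo in
            product(threads, thread_placements, data_placements, freqs_ghz)]


def calibrate(configs: Iterable[Config],
              measure: Callable[[Config], Tuple[float, float]]
              ) -> Dict[Config, Tuple[float, float]]:
    """Run measure(config) -> (throughput, watts) for every configuration."""
    return {cfg: measure(cfg) for cfg in configs}


def most_efficient(calibration: Dict[Config, Tuple[float, float]]) -> Config:
    """Configuration with the highest throughput per watt."""
    return max(calibration,
               key=lambda cfg: calibration[cfg][0] / calibration[cfg][1])
```

Here `measure` stands in for actually running the operator and reading power from hardware counters or external equipment.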
Concurrent partitioned scans
• Each thread scans 128MB of integers for 5 secs
• Maximize $\text{performance per power} = \frac{\text{throughput}}{\text{power}}$
  – under different parallelism, scheduling, and frequency settings
• Machine
  – Two 8-core Intel Xeon E5-2690, HT enabled, 64GB RAM, frequencies from 1.2GHz to 2.9GHz
• Power measurements
  – Hardware performance counters RAPL (CPU & DRAM)
  – External equipment
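The metric can be made concrete with a small sketch. Because throughput is work over time and power is energy over time, throughput per watt reduces to work per joule. The function name and the choice of GB as the unit of work are assumptions for illustration.

```python
def throughput_per_watt(scanned_gb: float, elapsed_s: float,
                        energy_j: float) -> float:
    """Throughput (GB/s) divided by average power (W).

    Algebraically this equals scanned_gb / energy_j, i.e. GB per joule.
    """
    if elapsed_s <= 0 or energy_j <= 0:
        raise ValueError("elapsed_s and energy_j must be positive")
    throughput_gb_s = scanned_gb / elapsed_s
    power_w = energy_j / elapsed_s
    return throughput_gb_s / power_w
```

A configuration that finishes the same scan volume with less total energy therefore scores higher, regardless of whether it got there by running faster or by drawing less power.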
Socket-fill scheduling
Thread placement (Socket 1: Cores 1 to 8, Socket 2: Cores 9 to 16, each core with its hyper-thread):
  – Socket 1: threads 1 to 8 on the cores, then threads 9 to 16 on their hyper-threads
  – Socket 2: threads 17 to 24 on the cores, then threads 25 to 32 on their hyper-threads
[Plot: throughput per watt vs. # threads (0 to 32); series: Auto (RAPL); annotation: bandwidth saturation]
Socket-fill scheduling
[Plot: throughput per watt vs. # threads (0 to 32); series: Auto (RAPL), Auto (external equipment); annotation: constant difference]
Socket-fill scheduling
[Plot: throughput per watt vs. # threads (0 to 32); series: 1.2GHz, 2.0GHz, 2.9GHz, Auto; annotations: best frequency, different saturation points]
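Socket-fill HT scheduling
Thread placement (Socket 1: Cores 1 to 8, Socket 2: Cores 9 to 16, each core with its hyper-thread):
  – Socket 1: threads 1, 2 on Core 1 and its hyper-thread, threads 3, 4 on Core 2, up to threads 15, 16 on Core 8
  – Socket 2: threads 17, 18 on Core 9 and its hyper-thread, threads 19, 20 on Core 10, up to threads 31, 32 on Core 16
[Plot: throughput per watt vs. # threads (0 to 32); series: 1.2GHz, 2.0GHz, 2.9GHz, Auto; annotation: HT draws negligible power]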
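Socket-wise scheduling
Thread placement (Socket 1: Cores 1 to 8, Socket 2: Cores 9 to 16, each core with its hyper-thread):
  – Threads 1 to 16 go to the cores, alternating between sockets: thread 1 on Core 1, thread 2 on Core 9, thread 3 on Core 2, thread 4 on Core 10, up to thread 15 on Core 8 and thread 16 on Core 16
  – Threads 17 to 32 go to the hyper-threads in the same alternating order: thread 17 on Core 1, thread 18 on Core 9, up to thread 31 on Core 8 and thread 32 on Core 16
[Plot: throughput per watt vs. # threads (0 to 32); series: 1.2GHz, 2.0GHz, 2.9GHz, Auto; annotation: avoids socket-specific bandwidth saturation]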
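Socket-wise HT scheduling
Thread placement (Socket 1: Cores 1 to 8, Socket 2: Cores 9 to 16, each core with its hyper-thread):
  – Pairs of threads fill a core and its hyper-thread, alternating between sockets: threads 1, 2 on Core 1, threads 3, 4 on Core 9, threads 5, 6 on Core 2, threads 7, 8 on Core 10, up to threads 29, 30 on Core 8 and threads 31, 32 on Core 16
[Plot: throughput per watt vs. # threads (0 to 32); series: 1.2GHz, 2.0GHz, 2.9GHz, Auto; annotations: best energy efficiency, 1.3x]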
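The four placement strategies differ only in the order in which the (socket, core, hyper-thread) slots are handed out to threads. The sketch below reproduces the thread numbering shown in the slide diagrams; the function names and the zero-based tuple encoding are assumptions for illustration, and mapping a slot to an OS logical CPU id is machine-specific and not shown.

```python
from itertools import product
from typing import List, Tuple

Slot = Tuple[int, int, int]  # (socket, core within socket, hardware thread)


def placement_order(strategy: str, sockets: int = 2, cores: int = 8,
                    hw_threads: int = 2) -> List[Slot]:
    """Slots in the order threads 1, 2, 3, ... are assigned to them.

    itertools.product varies its rightmost argument fastest, so the argument
    order encodes which dimension is filled first.
    """
    S, C, H = range(sockets), range(cores), range(hw_threads)
    if strategy == "socket-fill":      # cores of a socket, then its HTs, then next socket
        return [(s, c, h) for s, h, c in product(S, H, C)]
    if strategy == "socket-fill-ht":   # core together with its HT, socket by socket
        return [(s, c, h) for s, c, h in product(S, C, H)]
    if strategy == "socket-wise":      # alternate sockets, all cores before any HT
        return [(s, c, h) for h, c, s in product(H, C, S)]
    if strategy == "socket-wise-ht":   # alternate sockets, core together with its HT
        return [(s, c, h) for c, s, h in product(C, S, H)]
    raise ValueError(f"unknown strategy: {strategy}")


def threads_per_socket(order: List[Slot], n: int, sockets: int = 2) -> List[int]:
    """How many of the first n threads land on each socket."""
    counts = [0] * sockets
    for socket, _, _ in order[:n]:
        counts[socket] += 1
    return counts
```

With 8 threads, `threads_per_socket` shows all of them on one socket under socket-fill but four per socket under socket-wise, which is why the socket-wise variants reach each socket's memory bandwidth limit later.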
Parallel aggregation
• $a = b_i + c_i$, 4GB arrays
• Minimize $\text{energy delay product (EDP)} = \text{response time (sec)} \times \text{energy (J)}$
  – under different parallelism, scheduling, and memory placement
• Machine
  – Two 8-core Intel Xeon E5-2640, HT disabled, 256GB of RAM
• Memory placement
  – On first socket
  – Interleaved
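The energy delay product can be computed directly from a run's response time and energy. The sketch below also picks the configuration with the lowest EDP from a set of runs; the function names and the dictionary layout are assumptions for illustration.

```python
from typing import Dict, Tuple


def energy_delay_product(response_time_s: float, energy_j: float) -> float:
    """EDP in joule-seconds: response time (sec) times energy (J)."""
    return response_time_s * energy_j


def best_by_edp(runs: Dict[str, Tuple[float, float]]) -> str:
    """Name of the run with the lowest EDP.

    runs maps a configuration name to (response_time_s, energy_j).
    """
    return min(runs, key=lambda name: energy_delay_product(*runs[name]))
```

Because EDP multiplies the two quantities, a configuration that uses somewhat more energy can still win if it shortens the response time enough. Dividing the result by 1000 gives kJ x sec, the unit used on the plots.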
Parallel aggregation
[Plots, two panels: "Memory on first socket" and "Memory interleaved"; EDP (kJ x sec, logarithmic scale from 0.1 to 100) vs. # threads (0 to 16); series: Socket-fill, Socket-wise; annotations: socket-wise better, bandwidth constrained]
Main-memory memory-bound operations
• Intermediate frequency has best efficiency
  – Different saturation points
• Avoid memory bandwidth saturation
  – by data and thread placement
• Up to 4x energy efficiency
Fine-grained energy awareness
• Calibration analysis of operators and parameters (THIS PAPER) [plot: energy efficiency vs. # threads]
• Measurements: hardware counters and/or external equipment [plot: power vs. time]
  – power
  – CPU utilization
  – memory utilization
• Runtime decisions: scheduling, resource allocation, power management
  – parallelism
  – data & thread placement
  – DVFS
Thank you!
References
• [J. R. Hamilton] Internet-Scale Datacenter Economics: Where the Costs and Opportunities Lie. HPTS, 2011.
• [TPDS13] D. Li, B. R. de Supinski, M. Schulz, D. S. Nikolopoulos, and K. W. Cameron. Strategies for energy-efficient resource management of hybrid programming models. IEEE TPDS, 24(1):144-157, 2013.
• [ICDE10] Z. Xu, Y.-C. Tu, and X. Wang. Exploring power-performance tradeoffs in database systems. In ICDE, pages 485-496, 2010.