A High-Precision GPU, CPU and Memory Power Model for the Tegra K1


  1. A High-Precision GPU, CPU and Memory Power Model for the Tegra K1 SoC
     Kristoffer Robin Stokke, krisrst@ifi.uio.no

  2. Learning Outcome
     • Deep, low-level knowledge of the Tegra K1
       – GK20A GPU, ARM Cortex-A15 CPU, DDR3 RAM
     • Accurate, generic power modelling for the Tegra K1
       – Method, model training and evaluation
     • Hardware-software codesign for power-aware computing
       – Analysing power usage of joint GPU-CPU execution
       – Optimising kernels for power
     3/24/2016

  3. Motivating Example: Detailed Power Breakdown

  4. Tegra K1: Heterogeneous Multicore 28 nm SoC
     • Tegra family of mobile Systems-on-Chip (SoC), < 12 W power usage
     • (Tegra 2, 3, 4..)
     • Tegra K1 & Tegra X1
     • Programmable GPU (CUDA)
     • Power management capabilities
                              Tegra K1              Tegra X1
     High Performance CPU     4 x ARM Cortex-A15    4 x ARM Cortex-A57
     Low Power CPU            1 x ARM Cortex-A15    4 x ARM Cortex-A53
     GPU                      192-Core Kepler       256-Core Maxwell
     Memory                   2 GB (Jetson-TK1)     4 GB (Jetson-TX1)

  5. GPU-Accelerated Mobile Systems
     • Drones, cars, smart phones, space exploration
     • Video processing, vehicular applications, neural networks, object tracking
     • Energy
       – Battery limitation
       – Environmental aspect
       – Device failure

  6. Energy-Efficient Video Processing
     • Consider an HD video processing pipeline
     • E.g. a Tegra-enabled drone live-streaming a football stadium
     • Raw video is lens-distorted and shaky
     • We implement several video filters to compensate for these effects
     • «Goal»: Reach 60 FPS with as little energy as possible by using hardware capabilities
     • How can we understand the relationship between software activity, power management capabilities and power usage?
     [Figure: «Shaky video» frame stream, debarrel filter, rotation filter, 60 FPS]

  7. Measuring Power
     • Surprisingly hard
     • Few tools to measure power
     • We use an external power source and measurement unit
       – Keithley K2280-S
       – 100 nA precision, high sampling rate
     • For details and code, check our paper [2] and blog [3]
     • VGA BIOS dumps [1] reveal rail measurement sensors (I²C) on most NVIDIA GPUs
       – Reading them breaks GPUs and hangs Linux
     [1] Peres, M. Reverse engineering power management on NVIDIA GPUs - A detailed overview.
     [2] Stokke, K.R. et al., 2015. Why Race-to-Finish is Energy-Inefficient for Continuous Multimedia Workloads.
     [3] http://mlab.no/blog/2015/08/a-peek-in-the-lab-tegra-k1-power-and-voltage-measurements/

  8. Tegra K1 SoC Architecture: Rails and Clocks
     • Power on a rail can be described using the standard CMOS equations [1][2]
       $P_{rail} = V_{rail} I_{leak} + \alpha C \, V_{rail}^2 f$
       – $I_{leak}$: transistor leakage
       – $f$: cycles per second
       – $\alpha C$: capacitance load per cycle
     • Rail voltage $V_{rail}$
       – Increases with clock frequency
     • Total power
       – ..is the sum of power of all rails
     [1] Kim, N.S. et al., 2003. Leakage Current: Moore’s Law Meets Static Power.
     [2] Castagnetti, A. et al., 2010. Power Consumption Modeling for DVFS Exploitation.
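
As a worked illustration of the rail equation on this slide, the following Python sketch evaluates the leakage and switching terms for one rail and sums over rails. The function names and every numeric value are hypothetical, chosen only to make the arithmetic easy to follow; none of them are Tegra K1 measurements.

```python
def rail_power(v_rail, i_leak, alpha_c, freq_hz):
    """Power on one rail in watts: leakage term plus switching term."""
    p_static = v_rail * i_leak                    # V_rail * I_leak
    p_dynamic = alpha_c * v_rail ** 2 * freq_hz   # alpha*C * V_rail^2 * f
    return p_static + p_dynamic


def total_power(rails):
    """Total power is the sum of the power of all rails."""
    return sum(rail_power(*rail) for rail in rails)


# Hypothetical rails: (voltage [V], leakage [A], alpha*C [F], frequency [Hz])
example_rails = [
    (1.0, 0.10, 1.0e-9, 800e6),  # made-up rail, not a measured value
    (0.9, 0.05, 0.5e-9, 400e6),  # made-up rail, not a measured value
]

print(f"total: {total_power(example_rails):.3f} W")
```

Because voltage enters the switching term squared and itself rises with clock frequency, raising a clock costs more power than the frequency factor alone suggests.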

  9. Tegra K1 SoC Architecture: Rails and Clocks
     • Clock frequency, rail voltage and power usage are deeply coupled
       – Increasing clock frequency increases voltage, and vice versa
       – From previous slide: $P_{rail} = V_{rail} I_{leak} + \alpha C \, V_{rail}^2 f$
     [Figures: GPU Rail Voltage vs. GPU Frequency; Measured (Idle) GPU Power]
     Important clocks for software power optimisation:
     Clock     Rail         Description    Frequency Steps    Range [MHz]
     cpu_g     HP Rail      HP Cluster     20                 204 -> 2320
     cpu_lp    Core Rail    LP Core        9                  51 -> 1092
     emc       Core Rail    Memory         10                 40 -> 924
     gpu       GPU Rail     GPU            15                 72 -> 852

  10. Tegra K1 SoC Architecture: Rails and Clocks
      • Core rail voltage depends on two clocks
        – Memory and LP core frequency
      • HP rail voltage depends on HP core frequency

  11. Related Work: Rate-Based Power Models
      • Have achieved extremely widespread use since 1997 [1]
        – Advanced uses: On-line power models for smart phones [2][3]
      • Main advantage: concept is simple
        – Power is correlated with utilisation levels (events per second)
          • E.g. rate at which instructions are executed, or rate of cache misses
        – Cost of events per second estimated with multivariable, linear regression
        – A typical model for total power:
          $P_{total} = \beta_0 + \sum_{i=1}^{N} \beta_i \rho_i$
          where $\beta_0$ is the constant base power, $\rho_i$ is the events per second of predictor $i$, and $\beta_i$ is its cost per event per second
      [1] Feeney, L.M., 1997. An Energy Consumption Model for Performance Analysis of Routing Protocols for Mobile Ad Hoc Networks.
      [2] Xiao, Y. et al., 2010. A System-Level Model for Runtime Power Estimation on Mobile Devices.
      [3] Dong, M. and Zhong, L., 2011. Self-Constructive High-Rate System Energy Modeling for Battery-Powered Mobile Systems.
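
To make the regression step concrete, here is a minimal sketch of fitting a rate-based model with ordinary least squares, written in plain Python so it has no dependencies. The helper names (`fit_rate_model`, `predict`) and the training data are hypothetical; the data is synthetic, generated from a known base power and known costs so the fit can be checked by eye.

```python
def fit_rate_model(rates, power):
    """Least-squares fit of power = b0 + sum(b_i * rate_i).

    Solves the normal equations (X^T X) b = X^T y with Gauss-Jordan
    elimination. Returns [b0, b1, ..., bN].
    """
    X = [[1.0] + [float(r) for r in row] for row in rates]  # leading 1 = base power
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, power))]
        for i in range(n)
    ]
    for col in range(n):
        # Partial pivoting for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(coeffs, rate_row):
    """Evaluate the fitted model for one set of event rates."""
    return coeffs[0] + sum(b * r for b, r in zip(coeffs[1:], rate_row))


# Synthetic samples (hypothetical): two predictors in giga-events per second,
# generated from base power 2.0 W and costs 0.5 and 1.5 W per giga-event/s.
rates = [(1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
power = [2.5, 3.5, 4.0, 4.5, 7.0]

coeffs = fit_rate_model(rates, power)
print([round(c, 6) for c in coeffs])
```

Rates are expressed in giga-events per second here only to keep the normal equations well conditioned; with raw rates of the order of 10^9, a scaled or QR-based solver would be a safer choice.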

  12. A Rate-Based Power Model for the Tegra K1
      Device    Predictor (CUPTI and PERF)              Coefficient
      GPU       L2 32B read transactions per second     -18.6 nW per eps
      GPU       L1 4B read transactions per second      0.0 nW per eps
      GPU       L1 4B write transactions per second     -3.7 nW per eps
      GPU       Integer instructions per second         6.2 pW per eps
      GPU       Float 32 instructions per second        6.6 pW per eps
      GPU       Float 64 instructions per second        279 pW per eps
      GPU       Misc. instructions per second           -300 pW per eps
      GPU       Conversion instructions per second      236 pW per eps
      CPU       Active CPU cycles per second            887 pW per eps
      CPU       CPU instructions per second             1.47 nW per eps
      • Disadvantages
        – Ignores important factors
          • Clock-gating
          • Power-gating
          • Voltage variations
          • Frequency scaling
          • Hardware contention
        – Tends to yield negative coefficients (we «gain» power per event per second)
          • Illogical and confusing
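
The effect of a negative coefficient can be seen with a few lines of arithmetic. In the sketch below, the coefficients are copied from the table on this slide and converted to watts per event per second; the dictionary keys, the function name and the event rates are hypothetical.

```python
# Coefficients from the slide's table, in watts per event per second (eps).
COEFFS_W_PER_EPS = {
    "l2_read_32b": -18.6e-9,   # -18.6 nW per eps
    "l1_write_4b": -3.7e-9,    # -3.7 nW per eps
    "float64_instr": 279e-12,  # 279 pW per eps
    "cpu_instr": 1.47e-9,      # 1.47 nW per eps
}


def contribution(predictor, events_per_second):
    """Power the rate-based model attributes to one predictor, in watts."""
    return COEFFS_W_PER_EPS[predictor] * events_per_second


# Hypothetical event rates, chosen only for illustration.
l2_term = contribution("l2_read_32b", 1e8)
fp64_term = contribution("float64_instr", 1e9)
print(f"L2 reads: {l2_term:+.3f} W, Float 64: {fp64_term:+.3f} W")
```

With these made-up rates the model credits the L2 read traffic with a negative power contribution, which is the physically meaningless «gain» the slide warns about.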

  13. A Rate-Based Power Model for the Tegra K1
      • Estimating power of a motion estimation GPU kernel
        – Model performs poorly at different memory and GPU frequency levels
        – Estimation error can be as high as 80 %, and for some areas (green) it is near perfect at 0 %
      [Figure: Estimation error for a motion estimation CUDA kernel]
      Point: Rate-based models should be used with care over frequency ranges

  14. Related Work: CMOS-Based Power Models
      • Some authors [1][2][3] attempt to model switching capacitance $\alpha C$ directly for rails using the CMOS equations
        – Slightly more complicated: $P_{rail} = V_{rail} I_{leak} + \alpha C \, V_{rail}^2 f$
      • Run a workload on several CPU-GPU-memory frequencies, log rail voltages and power
        – Estimate $I_{leak}$ and $\alpha C$ using multivariable, linear regression
      • Advantages
        – Voltages and leakage currents considered
      [1] Castagnetti, A. et al., 2010. Power Consumption Modeling for DVFS Exploitation.
      [2] Pathania, A. et al., 2015. Power-Performance Modelling of Mobile Gaming Workloads on Heterogeneous MPSoCs.
      [3] Stokke, K.R. et al., 2015. Why Race-to-Finish is Energy-Inefficient for Continuous Multimedia Workloads.
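
The estimation step can be sketched for a single rail as a two-parameter least-squares problem: the regressors are $V_{rail}$ and $V_{rail}^2 f$, and the unknowns are $I_{leak}$ and $\alpha C$. The code below is an illustration under stated assumptions, not the cited authors' implementation; the function name, the units (GHz and nanofarads, so the product is in watts) and the synthetic operating points are all hypothetical.

```python
def fit_cmos(samples):
    """Fit P = I_leak * V + alpha_C * V^2 * f by least squares.

    samples: iterable of (v_rail [V], f [GHz], power [W]).
    Returns (i_leak [A], alpha_c [nF]); GHz times nF yields watts.
    """
    s11 = s12 = s22 = b1 = b2 = 0.0
    for v, f, p in samples:
        x1, x2 = v, v * v * f          # the two regressors
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        b1 += x1 * p
        b2 += x2 * p
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("operating points do not separate the two terms")
    # Cramer's rule on the 2x2 normal equations
    return (b1 * s22 - b2 * s12) / det, (s11 * b2 - s12 * b1) / det


# Synthetic log (hypothetical): power generated from known parameters.
TRUE_I_LEAK, TRUE_ALPHA_C = 0.2, 1.5   # amperes, nanofarads
operating_points = [(0.8, 0.2), (0.9, 0.5), (1.0, 0.8), (1.1, 1.0)]  # (V, GHz)
samples = [
    (v, f, TRUE_I_LEAK * v + TRUE_ALPHA_C * v * v * f)
    for v, f in operating_points
]

i_leak, alpha_c = fit_cmos(samples)
print(f"I_leak = {i_leak:.3f} A, alpha*C = {alpha_c:.3f} nF")
```

The fit returns one $\alpha C$ per rail for the whole log, so it cannot represent switching activity that changes with the workload or with frequencies in other domains.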

  15. Modelling Switching Capacitance
      • So how does such a model perform on the Tegra K1?
        – Better than the rate-based one
        – Accuracy generally > 85 %, but only about 50 % accurate on high frequencies
      • Disadvantages / reasons
        – $\alpha C$ varies depending on workload
        – Switching activity in one domain (memory) varies depending on frequency in another (CPU)
        – ..but model assumes independent relationship between $\alpha C$ and frequency in other domains
      [Figure: Estimation error for a motion estimation CUDA kernel]

  16. Building High-Precision Power Models
      • Rate- and CMOS-based models are complementary
                         Rate-based                                              CMOS-based
        Advantages       Considers detailed utilisation through HPCs             Considers rail voltages and leakage currents
        Disadvantages    Does not consider rail voltages and leakage currents    Does not consider detailed hardware utilisation
        – They «solve each other’s problems»
      • We need the physical insight from CMOS-based models, and the statistical insight into hardware utilisation from rate-based models

  17. Building High-Precision Power Models
      • The problem is in the dynamic part of the CMOS equation: $P_{dyn} = \alpha C \, V_{rail}^2 f$
        – ..which doesn’t consider that $\alpha C$ on a rail actually depends on frequencies in other domains (e.g. memory rail $\alpha C$ depends on CPU and GPU frequency)
      • We now want to express switching activity in terms of measurable hardware activity similarly to rate-based models:
        $P_{dyn,R} = V_R^2 \sum_{i=1}^{N_R} C_{R,i} \, \rho_{R,i}$
        – $N_R$: number of utilisation predictors on rail $R$
        – $\rho_{R,i}$: hardware utilisation predictor (events per second)
        – $C_{R,i}$: capacitive load per event per second
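
A small sketch shows how the utilisation-based dynamic term on this slide would be evaluated once the per-event loads are known. The function name, the load values and the event rates are hypothetical and are not fitted Tegra K1 coefficients.

```python
def hybrid_dynamic_power(v_rail, loads, rates):
    """Dynamic rail power in watts: V^2 * sum(C_i * rho_i).

    loads: capacitive load per event per second for each predictor
    rates: matching hardware utilisation predictors (events per second)
    """
    if len(loads) != len(rates):
        raise ValueError("need one load per utilisation predictor")
    return v_rail ** 2 * sum(c * rho for c, rho in zip(loads, rates))


# Hypothetical predictors on one rail (made-up values).
loads = [2.0e-9, 0.5e-9]   # load per event per second
rates = [1.0e8, 4.0e8]     # events per second

low = hybrid_dynamic_power(0.9, loads, rates)
high = hybrid_dynamic_power(1.1, loads, rates)
print(f"{low:.3f} W at 0.9 V, {high:.3f} W at 1.1 V")
```

The same hardware activity costs more at the higher rail voltage, a dependence a purely rate-based model cannot express, while the per-predictor sum captures the workload dependence that a single $\alpha C$ per rail misses.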
