

  1. In-network Monitoring and Control Policy for DVFS of CMP Networks-on-Chip and Last Level Caches
     Xi Chen¹, Zheng Xu¹, Hyungjun Kim¹, Paul V. Gratz¹, Jiang Hu¹, Michael Kishinevsky² and Umit Ogras²
     ¹ Computer Engineering and Systems Group, Department of ECE, Texas A&M University
     ² Strategic CAD Labs, Intel Corp.

  2. Introduction – The Power/Performance Challenge
     • VLSI Technology Trends
       – Continued transistor scaling: more transistors
       – Traditional VLSI gains stop: power increasing and transistor performance stagnant
     • Achieving performance in modern VLSI
       – Multi-core/CMP for performance; NoCs for communication
       – CMP power management to permit further performance gains, and new challenges

  3. Core Power Management
     Typically, power management covers only the core and lower-level caches.
     • Simpler problem (relatively speaking)
       – All performance information locally available
         • Instructions per cycle
         • Lower-level cache miss rates
         • Idle time
       – Each core can act independently
       – Performance scales approximately linearly with frequency
     • Cores are only part of the problem
       – Power management in the uncore is a different domain…
     [Figure: core with private L1i/L1d and L2 caches]

  4. Typical Chip-Multiprocessors
     • Chip-multiprocessors (CMPs): complexity moves from the cores up the memory system hierarchy.
     • Multi-level hierarchies
       – Private lower levels
       – Shared last level
     • Networks-on-chip for:
       – Cache block transfers
       – Cache coherence
     [Figure: CMP tile with core, L1i/L1d, L2, directory, router (R), and L3 cache slice]

  5. CMP Power Management Challenge
     • Chip-multiprocessors (CMPs): complexity moves from the cores up the memory system hierarchy.
     • Multi-level hierarchies
       – Private lower levels
       – Shared last level
     • Networks-on-chip for:
       – Cache block transfers
       – Cache coherence
     • Large fraction of the power outside of cores
       – LLC shared among many cores (distributed!)
       – Network-on-chip interconnects cores
         • 12 W on the Single-Chip Cloud Computer!
     • Indirect impact on system performance
       – Depends upon lower-level cache miss rates
     [Figure: CMP tile with core, L1i/L1d, L2, directory, router (R), and L3 cache slice]

  6. CMP DVFS Partitioning
     [Figure: DVFS domain partitioning with domains per tile]

  7. CMP DVFS Partitioning
     [Figure: DVFS domain partitioning options – domains per core, domains per tile, and a separate domain for the uncore]

  8. Project Goals
     Develop a power management policy for a CMP uncore.
     • Maximum savings with minimal impact on performance (< 5% IPC loss).
       – What to monitor?
       – How to propagate information to the central controller?
       – What policy to implement?

  9. Outline
     • Introduction
     • Design Description
       – Uncore Power Management
       – Metrics
       – Information Propagation
       – PID Control
     • Evaluation
     • Conclusions and Future Work

  10. Uncore Power Management
     • Effective uncore power management
       – Inputs:
         • Current performance demand
         • Current power state (DVFS level)
       – Outputs:
         • Next power state
     • Classic control problem (sketched below)
       – Constraints:
         • High-speed decisions
         • Low hardware overhead
         • Low impact on the system from management overheads
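As a rough illustration of the control problem framed on this slide, here is a minimal sketch of the per-window decision loop. The DVFS level set, the pluggable policy interface, and all names are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the uncore power-management loop: inputs are the current
# performance demand and the current DVFS level; output is the next DVFS level.
# The level set and all names are illustrative assumptions.
from typing import Callable

DVFS_LEVELS = [0.5, 0.625, 0.75, 0.875, 1.0]   # assumed normalized uncore frequencies

def control_step(demand_metric: float,
                 current_level: int,
                 policy: Callable[[float, int], int]) -> int:
    """One decision per time window: the pluggable policy maps
    (performance demand, current power state) -> next power state."""
    next_level = policy(demand_metric, current_level)
    return max(0, min(len(DVFS_LEVELS) - 1, next_level))  # clamp to valid states

# Example policy: a trivial threshold rule (the paper's actual policy is PID, slide 20).
def threshold_policy(demand, level):
    return level + 1 if demand > 1.0 else level - 1

new_level = control_step(demand_metric=1.2, current_level=2, policy=threshold_policy)
print(DVFS_LEVELS[new_level])   # -> 0.875
```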

  11. Design Outline
     Three major components to uncore power management:
     • Uncore performance metric
       – Average memory access time (AMAT)
     • Status propagation
       – In-network, in the unused portion of packet headers
     • Control policy
       – PID control over a fixed time window

  12. Performance Metrics
     Uncore: LLC + NoC
     • Which performance metric?
       – NoC-centric?
         • Credits
         • Free VCs
         • Per-hop latency
       – LLC-centric?
         • LLC access rate
         • LLC miss rate

  13. Performance Metrics
     Uncore: LLC + NoC
     • Which performance metric?
       – NoC-centric? Credits, free VCs, per-hop latency
       – LLC-centric? LLC access rate, LLC miss rate
     • Ultimately, who cares about uncore performance?
       – Need a metric that quantifies the memory system's effect on system performance!
       – Average memory access time (AMAT)

  14. Average Memory Access Time
     AMAT = HitRateL1 * AccTimeL1 + (1 - HitRateL1) * (HitRateL2 * AccTimeL2 + (1 - HitRateL2) * LatencyUncore)
     • Direct measurement of memory system performance
     • An AMAT increase of X yields an IPC loss of ~X/2 for small X
       – Experimentally determined
     [Figure: AMAT vs. uncore clock rate for two cases: f0 – no private hits; f1 – all private hits]

  15. Average Memory Access Time
     AMAT = HitRateL1 * AccTimeL1 + (1 - HitRateL1) * (HitRateL2 * AccTimeL2 + (1 - HitRateL2) * LatencyUncore)
     • Direct measurement of memory system performance
     • An AMAT increase of X yields an IPC loss of ~X/2 for small X
       – Experimentally determined
     • Note: HitRateL1, HitRateL2, and LatencyUncore require information from each core to calculate weighted averages! (See the sketch below.)
     [Figure: AMAT vs. uncore clock rate for two cases: f0 – no private hits; f1 – all private hits]
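A minimal sketch of the chip-wide AMAT estimate from the formula above, built as access-weighted averages of per-core counters. The counter names, the CoreStats container, and the L1/L2 latency constants are illustrative assumptions, not the paper's actual hardware counters.

```python
# Chip-wide AMAT from per-core counters, per the slide's formula.
# Names, container, and latency constants are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class CoreStats:
    l1_accesses: int         # private L1 accesses in the time window
    l1_hits: int
    l2_accesses: int         # L1 misses that reach the private L2
    l2_hits: int
    uncore_latency_sum: int  # total cycles spent in the uncore by L2 misses
    uncore_requests: int     # number of L2 misses (uncore requests)

ACC_TIME_L1 = 2    # assumed L1 hit latency (cycles)
ACC_TIME_L2 = 10   # assumed L2 hit latency (cycles)

def chip_amat(cores):
    """Access-weighted average AMAT across all cores."""
    hit_l1 = sum(c.l1_hits for c in cores) / max(1, sum(c.l1_accesses for c in cores))
    hit_l2 = sum(c.l2_hits for c in cores) / max(1, sum(c.l2_accesses for c in cores))
    lat_uncore = (sum(c.uncore_latency_sum for c in cores) /
                  max(1, sum(c.uncore_requests for c in cores)))
    return (hit_l1 * ACC_TIME_L1 +
            (1 - hit_l1) * (hit_l2 * ACC_TIME_L2 + (1 - hit_l2) * lat_uncore))
```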

  16. Information Propagation
     ● In-network status packets would be too costly
       – Bursts of status would impact performance
       – Increased dynamic energy
     ● Dedicated status network would be overkill
       – Somewhat low data rate: ~8 bytes per core per 50,000-cycle time window (a rough worked estimate follows below)
       – Constant power drain
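To put the "somewhat low data rate" claim in perspective, a back-of-the-envelope estimate of the aggregate status bandwidth; the 16-core count is an assumption for illustration, not from the slides.

```latex
% Aggregate status bandwidth, assuming (for illustration) a 16-core CMP:
\[
  \frac{16 \text{ cores} \times 8 \text{ B/core}}{50{,}000 \text{ cycles}}
  \approx 2.6 \times 10^{-3} \text{ B/cycle}
\]
% i.e., orders of magnitude below even a narrow dedicated link's capacity, so a
% dedicated status network would sit mostly idle while drawing constant power.
```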

  17. Information Propagation
     ● In-network status packets would be too costly
       – Bursts of status would impact performance
       – Increased dynamic energy
     ● Dedicated status network would be overkill
       – Somewhat low data rate: ~8 bytes per core per 50,000-cycle time window
       – Constant power drain
     ● Instead: "piggyback" info in packet headers (see the sketch below)
       – Link width is often an even divisor of the cache line size, leaving unused space in the header
       – No congestion or power impact
       – Status info timeliness?
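A minimal sketch of piggybacking per-core status in unused packet-header bits. The 16-bit spare region, the field widths, and all names are illustrative assumptions, not the paper's actual header format.

```python
# Pack per-core status into spare header bits of an ordinary data/coherence packet.
# Field widths and names are illustrative assumptions.

STATUS_BITS = 16          # assumed spare bits in each packet header

def pack_status(l2_miss_count, window_id):
    """Squeeze a saturating 12-bit L2-miss count and a 4-bit window tag
    into the spare header field."""
    miss = min(l2_miss_count, (1 << 12) - 1)   # saturate rather than wrap
    return (window_id & 0xF) << 12 | miss

def unpack_status(header_field):
    """Controller-side decode when the packet passes through the monitor node."""
    window_id = (header_field >> 12) & 0xF
    miss = header_field & 0xFFF
    return window_id, miss

# Example: a core that has seen 300 L2 misses so far in window 7
field = pack_status(300, window_id=7)
assert unpack_status(field) == (7, 300)
```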

  18. Information Propagation
     • One power controller node
       – Node 6 in the figure
     • Status opportunistically sent
     • Info harvested as packets pass through the controller node
     • However, per-core info is not received at the end of every window…
     [Figure: uncore NoC; the grey tile contains the performance monitor. Dashed arrows represent packet paths.]

  19. Extrapolation
     • AMAT calculation requires information from all nodes at the end of each time window
     • Opportunistic piggy-backing provides no guarantees on information timeliness
       – Naïvely using the last packet received leads to bias in the weighted average of AMAT
     • Extrapolate packet counts to the end of the time window (see the sketch below)
       – More accurate weights for the AMAT calculation
       – Nodes for which no data is received are excluded from AMAT
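A minimal sketch of extrapolating a core's counters to the end of the time window. The linear scaling and the names are illustrative assumptions about the technique the slide describes, not the paper's exact implementation.

```python
# Extrapolate a count observed partway through the window to a full-window
# estimate, so stale updates don't bias the AMAT weights. Illustrative sketch.

WINDOW_CYCLES = 50_000  # status/control time window length (from the slides)

def extrapolate(count_at_last_update, cycles_elapsed_at_update):
    """Linearly scale a partial-window count up to the full window."""
    if cycles_elapsed_at_update <= 0:
        return None  # no data received this window: exclude this node from AMAT
    return count_at_last_update * WINDOW_CYCLES / cycles_elapsed_at_update

# Example: an update seen at cycle 30,000 reporting 600 L2 misses is
# extrapolated to ~1,000 misses for the full 50,000-cycle window.
print(extrapolate(600, 30_000))   # -> 1000.0
```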

  20. Power Management Controller
     • PID (Proportional-Integral-Derivative) control (see the sketch below)
       – Computationally simpler than machine learning techniques
       – Adapts more readily and quickly to many different workloads than rule-based approaches
       – Theoretical grounds for stability (proof in the paper)
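A minimal sketch of a discrete PID controller that picks the next uncore DVFS level from the estimated-AMAT error each time window. The gains, the reference AMAT, and the level quantization are illustrative assumptions, not the paper's tuned controller.

```python
# Discrete PID control of the uncore DVFS level, driven by estimated AMAT.
# Gains, reference, and level set are illustrative assumptions.

DVFS_RATIOS = [0.5, 0.625, 0.75, 0.875, 1.0]   # assumed uncore frequency ratios

class PidDvfs:
    def __init__(self, target_amat, kp=0.05, ki=0.01, kd=0.02):
        self.target = target_amat        # reference AMAT (cycles)
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0
        self.ratio = 1.0                 # current normalized uncore frequency

    def next_level(self, measured_amat):
        """One control step per time window: error > 0 means the uncore is too slow."""
        error = measured_amat - self.target
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        # PID output adjusts the continuous frequency ratio...
        self.ratio += self.kp * error + self.ki * self.integral + self.kd * derivative
        self.ratio = min(1.0, max(DVFS_RATIOS[0], self.ratio))
        # ...which is then quantized up to the nearest supported DVFS level.
        return min(r for r in DVFS_RATIOS if r >= self.ratio)
```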

  21. Outline
     • Introduction
     • Design Description
     • Evaluation
       – Methodology
       – Power and Performance
         • Estimated AMAT + PID
         • vs. perfect AMAT + PID
         • vs. rule-based
       – Analysis
         • Tracking ideal DVFS ratio selection
     • Conclusions and Future Work

  22. Methodology
     • Memory system traces
       – PARSEC applications
       – M5 trace generation
       – First 250M memory operations
     • Custom simulator: L1 + L2 + NoC + LLC + directory
     • Energy savings calculated based on dynamic power
       – Some benefit to static power as well; future work

  23. Power and Performance
     [Figures: normalized dynamic energy consumption; normalized performance loss]
     • Average of 33% energy savings versus baseline
     • Average of ~5% AMAT loss (< 2.5% IPC)

  24. Comparison vs. Perfect AMAT
     [Figures: normalized dynamic energy consumption; normalized performance loss]
     • Virtually identical power savings vs. perfect AMAT
     • Slight loss in performance vs. perfect AMAT

  25. Comparison vs. Rule-Based
     [Figures: normalized dynamic energy consumption; normalized performance loss]
     • Virtually identical power savings vs. rule-based
     • 50% less performance loss

  26. Analysis: PID Tracking vs. Ideal
     • Generally, PID is slightly conservative
     • Reacts quickly and accurately to spikes in need

  27. Conclusions and Future Work
     • We introduce a power management system for the CMP uncore
       – Performance metric: estimated AMAT
       – Information propagation: in-network, piggy-backed
       – Control algorithm: PID
     • 33% energy savings with insignificant performance loss
       – Near-ideal AMAT estimation
       – Outperforms rule-based techniques
