Memory Hierarchy Optimizations with Compilers/Software

Jaejin Lee
Advanced Compiler Research Laboratory
School of Computer Science and Engineering
Seoul National University
jlee@cse.snu.ac.kr
http://aces.snu.ac.kr/~jlee
12-Aug-04

Outline
■ Heterogeneous Multithreading (Helper Threading)
  - Intelligent Memory
  - Coexecution
  - Prefetching
■ Compiler-Assisted Demand Paging
■ Wrap-up

Memory Wall Problem
■ The performance gap between processors and memory
  - Microprocessor performance has been improving at a rate of 60% per year.
  - The access time to DRAM has been improving at a rate of less than 10% per year.
■ The performance of applications is dominated by memory.
■ Thousands of papers.

The Intelligent Memory Architecture
[Figure: a processor chip containing P.host (runs the main thread) with L1 and L2 caches, connected through an interconnection to an off-the-shelf memory chip containing P.mem (runs the helper thread) with an L1 cache and DRAM. The memory side could be a DIMM module or a memory controller.]

Co-execution
■ Using a compiler,
  - Partition code into compute-/memory-intensive sections (so-called modules).
    ▪ Performance prediction
  - The memory-intensive sections are wrapped into a helper thread.
  - Statically/dynamically map the sections to the best processor.

Overview of the Co-execution Algorithm
[Figure: two flows. Numerical applications: Basic Partitioning, Affinity Estimation (performance model), Advanced Partitioning, Mapping (static/dynamic). Non-numerical applications: Basic Partitioning, Affinity Estimation (profiling), Advanced Partitioning, Mapping (static/dynamic). The figure also contains an Overlapping stage.]

Static Mapping
■ Performance model (numerical apps)
  - Execution time = $T_{comp} + T_{memstall}$
  - Stack distance model for the number of misses

  $$T_{comp} = \max\left(\frac{T_{int}}{N_{int}}, \frac{T_{fp}}{N_{fp}}, \frac{T_{ldst}}{N_{ldst}}\right) + T_{other}$$

  $$T_{memstall} = \sum_{i \in caches} miss_i \cdot penalty_i$$
■ Profiling (non-numerical apps)
  - Gather execution time and the number of invocations for all modules and subroutines.

Dynamic Mapping
■ Decision runs at runtime to determine affinity
■ Coarse and CoarseR
  - Decision runs are module invocations
[Figure: timeline of module invocations 1, 2, 3, 4, 5, ... showing which invocations run on P.host and which on P.mem under Coarse and under CoarseR.]
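The performance model above can be turned into a small calculation. The sketch below is illustrative only: it assumes that N_int, N_fp, and N_ldst are the numbers of functional units of each type on a processor, and every name and number in it (ModuleWork, Processor, estimate_time, static_map, the unit counts, miss counts, and penalties) is hypothetical rather than taken from the slides.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ModuleWork:
    # Work terms of one module, all in the same time unit (hypothetical values).
    t_int: float
    t_fp: float
    t_ldst: float
    t_other: float


@dataclass
class Processor:
    name: str
    n_int: int    # assumed meaning: number of integer units
    n_fp: int     # assumed meaning: number of floating-point units
    n_ldst: int   # assumed meaning: number of load/store units
    penalty: Dict[str, float]  # cache level -> miss penalty


def t_comp(m: ModuleWork, p: Processor) -> float:
    # T_comp = max(T_int/N_int, T_fp/N_fp, T_ldst/N_ldst) + T_other
    return max(m.t_int / p.n_int, m.t_fp / p.n_fp, m.t_ldst / p.n_ldst) + m.t_other


def t_memstall(misses: Dict[str, float], p: Processor) -> float:
    # T_memstall = sum over cache levels of miss_i * penalty_i
    return sum(count * p.penalty[level] for level, count in misses.items())


def estimate_time(m: ModuleWork, p: Processor, misses: Dict[str, float]) -> float:
    return t_comp(m, p) + t_memstall(misses, p)


def static_map(m: ModuleWork, candidates) -> str:
    # candidates: list of (processor, predicted misses on that processor).
    best, _ = min(candidates, key=lambda c: estimate_time(m, c[0], c[1]))
    return best.name


module = ModuleWork(t_int=400, t_fp=100, t_ldst=600, t_other=50)
p_host = Processor("P.host", 2, 2, 2, {"L1": 10, "L2": 100})
p_mem = Processor("P.mem", 1, 1, 1, {"L1": 20})
host_misses = {"L1": 50, "L2": 20}
mem_misses = {"L1": 50}

print(estimate_time(module, p_host, host_misses))
print(estimate_time(module, p_mem, mem_misses))
print(static_map(module, [(p_host, host_misses), (p_mem, mem_misses)]))
```

With these made-up inputs the memory stall term dominates on P.host, so the module is mapped to P.mem even though P.mem is slower at computation. In the slides the miss counts would come from the stack distance model; here they are simply given.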

Overall Speedups for Co-execution
■ Our co-execution algorithm delivers speedups that are comparable to the ideal speedup.

Apps.      P.host (alone)   P.host (alone)   Amdahl's     2-processor
           / AdvCoarseR     / OverDyn        2 P.hosts    SGI
Swim       1.67             2.71             2.00         1.85
Tomcatv    1.17             1.60             1.67         1.44
LU         1.26             1.22             1.04         0.99
TFFT2      1.42             1.22             1.91         0.80
Mgrid      1.05             1.55             1.94         1.47
Average    1.31             1.66             1.71         1.31
Bzip2      1.37             -                1.01         0.99
Mcf        1.37             -                1.01         1.00
Go         0.97             -                1.01         0.57
M88ksim    1.01             -                1.03         1.00
Average    1.18             -                1.02         0.89

Correlation Prefetching in Software
■ New correlation prefetching in software using the memory thread.
■ Records sequences of miss addresses in a correlation table.
■ When the head of a sequence is seen, prefetch the rest.
[Figure: code with accesses such as a[4*(i++)] ... a[foo(i)] and the resulting miss address sequence A, B, C, ..., Z.]
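The record-then-prefetch idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes one table row per miss address holding only its immediate successors in most-recent-first order, and the names CorrelationTable, on_miss, and num_succ are hypothetical.

```python
class CorrelationTable:
    """Toy correlation table: miss address -> recently seen successor misses."""

    def __init__(self, num_succ=2):
        self.num_succ = num_succ   # successors kept per row (assumed parameter)
        self.rows = {}             # miss address -> list of successors, newest first
        self.last_miss = None

    def on_miss(self, addr):
        # Prefetching: return the successors recorded for this miss address.
        prefetches = list(self.rows.get(addr, []))

        # Learning: record this miss as a successor of the previous miss.
        if self.last_miss is not None:
            succ = self.rows.setdefault(self.last_miss, [])
            if addr in succ:
                succ.remove(addr)
            succ.insert(0, addr)
            del succ[self.num_succ:]
        self.last_miss = addr
        return prefetches


table = CorrelationTable()
for address in ["A", "B", "C"]:
    table.on_miss(address)      # first pass only trains the table

print(table.on_miss("A"))       # the head of the sequence is seen again
print(table.on_miss("B"))
```

Because each row here stores only immediate successors, seeing A again yields a prefetch for B alone; prefetching further down the sequence would need either repeated lookups or rows that store more than one level of successors.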

Correlation Table
■ Basic organization (Joseph & Grunwald)
  - Each row holds a tag and the addresses of its immediate successors.
■ Advanced organization
  - Each row holds a tag and several levels of successors: Level 1 (immediate successors), Level 2 (next successors), …

Our Scheme
[Figure: a processor chip with L1 and L2 caches; a North Bridge chip with the memory controller; a DRAM chip or DIMM module with a memory processor, its L1 cache, and the DRAM cells. Numbered arrows 1 to 5 mark the steps between these components.]

The Mechanism of the Memory (Helper) Thread
■ Requirements:
  - Low response time
  - Occupancy time < miss distance
[Figure: timeline of the helper thread. Events: miss address observed, prefetch addresses generated, table updated, then wait. The prefetching step comes first and is followed by the learning step. Response time and occupancy time are marked on the timeline.]

Miss Distance
[Figure: per-application distribution of miss distances, from 0% to 100%, binned as [0,40), [40,80), [80,120), [120,160), [160,200), [200,240), [240,280), [280,320), [320,360), [360,400).]
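The requirement that occupancy time stay below the miss distance can be illustrated with a toy timeline. The sketch below assumes, purely for illustration, that a miss arriving while the helper thread is still occupied with the previous one is not processed; the slides do not say how such misses are handled, and the function name and numbers are hypothetical.

```python
def processed_misses(miss_times, occupancy):
    """Return the miss timestamps the helper thread gets to process.

    Assumption (not from the slides): the thread is occupied for `occupancy`
    time units after it observes a miss, and any miss arriving during that
    window is skipped.
    """
    busy_until = 0
    processed = []
    for t in miss_times:
        if t >= busy_until:
            processed.append(t)
            busy_until = t + occupancy
    return processed


miss_times = [0, 100, 150, 300]   # made-up timestamps; miss distances 100, 50, 150

print(processed_misses(miss_times, occupancy=40))    # occupancy below every miss distance
print(processed_misses(miss_times, occupancy=120))   # occupancy above some miss distances
```

When the occupancy is shorter than every miss distance, all misses are observed; once it exceeds some of them, the thread starts to fall behind the miss stream.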

Execution Time in DRAM
[Figure: normalized execution time (0 to 1.2) for CG, Equake, FT, Gap, Mcf, MST, Parser, Sparse, Tree, and Average under NoPref, Conven4, Base, Chain, Repl, and Conven4+Repl, plus Custom for some applications. Bars are broken into Busy, UptoL2, and BeyondL2.]

Response and Occupancy Time
[Figure: response time and occupancy time in processor cycles (0 to 250) for Base, Chain, Repl, and ReplMC. Bars are broken into Busy and Mem.]

Execution Time in MC
[Figure: normalized execution time (0 to 1.2) for CG, Equake, FT, Gap, Mcf, MST, Parser, Sparse, Tree, and Average under NoPref, Conven4, BaseMC, ChainMC, ReplMC, and Conven4+ReplMC. Bars are broken into Busy, UptoL2, and BeyondL2.]

Active Prefetching
■ The helper thread runs the skeleton of the original code
  - Address computation
  - Prefetch instructions
■ More accurate prefetches
■ The helper thread should be faster than the original code
■ Synchronization overhead
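A toy model can make the active prefetching bullets concrete. In the sketch below the helper "thread" is a generator that keeps only the address computation of a hypothetical indirect-access loop, and a Python set stands in for the cache; the interleaving, the lookahead parameter, and all names are assumptions made for illustration, and synchronization is ignored entirely.

```python
def skeleton(idx):
    """Helper-thread skeleton: address computation only, no work on the data."""
    for i in range(len(idx)):
        yield idx[i]            # address that would be passed to a prefetch


def run(a, idx, lookahead):
    """Run the original loop with the helper `lookahead` iterations ahead.

    The cache is modeled as an unbounded set (assumption), so a prefetched
    address always stays available.
    """
    cache = set()
    helper = skeleton(idx)

    # Head start: the helper issues its first `lookahead` prefetches.
    for _ in range(lookahead):
        addr = next(helper, None)
        if addr is not None:
            cache.add(addr)

    total, misses = 0, 0
    for i in range(len(idx)):
        addr = idx[i]           # original loop: compute the address ...
        if addr not in cache:
            misses += 1
            cache.add(addr)
        total += a[addr]        # ... and also do the real work on the data
        nxt = next(helper, None)
        if nxt is not None:     # helper advances one step per main iteration
            cache.add(nxt)
    return total, misses


a = list(range(10))
idx = [3, 7, 1, 7, 9, 0]

print(run(a, idx, lookahead=0))   # helper never gets ahead of the main loop
print(run(a, idx, lookahead=2))   # helper stays two iterations ahead
```

With no lead the helper only touches addresses the main loop has just missed on, so it removes no misses; once it runs ahead, the main loop finds its data already fetched. This mirrors the point that the helper thread should be faster than the original code.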

Outline
■ Heterogeneous Multithreading (Helper Threading)
  - Intelligent Memory
  - Coexecution
  - Prefetching
■ Compiler-Assisted Demand Paging
  - Motivation
  - Framework
  - Example
  - Performance Results
■ Wrap-up

Seoul National university Advanced Compiler tool Kit (SNACK)

SNACK Components
■ SNACK-cc: a C compiler for embedded systems
■ SNACK-c2c: C-to-C translator
■ SNACK-asm: assembler
■ SNACK-link: linker
■ SNACK-pop: post-pass optimizer
■ SNACK-jvm: embedded Java VM (planned)

Goals
■ High performance
■ Small code size
■ Low power/energy
