
Memory Hierarchy Optimizations with Compilers/Software

Jaejin Lee, Advanced Compiler Research Laboratory, School of Computer Science and Engineering, Seoul National University, jlee@cse.snu.ac.kr, http://aces.snu.ac.kr/~jlee


  1. Memory Hierarchy Optimizations with Compilers/Software
     Jaejin Lee
     Advanced Compiler Research Laboratory, School of Computer Science and Engineering, Seoul National University
     jlee@cse.snu.ac.kr, http://aces.snu.ac.kr/~jlee
     12-Aug-04

     Outline
     ■ Heterogeneous Multithreading (Helper Threading)
       - Intelligent Memory
       - Co-execution
       - Prefetching
     ■ Compiler-Assisted Demand Paging
     ■ Wrap-up

  2. Memory Wall Problem
     ■ The performance gap between processors and memory
       - Microprocessor performance has been improving at a rate of 60% per year.
       - The access time to DRAM has been improving at a rate of less than 10% per year.
     ■ The performance of applications is dominated by memory.
     ■ Thousands of papers.

     The Intelligent Memory Architecture
     [Figure: block diagram. Processor Chip: P.host running the main thread, L1 $, L2 $. Off-the-shelf interconnection. Memory Chip: P.mem running the helper thread, L1 $, DRAM. Note on the memory side: could be a DIMM module or a memory controller.]

  3. Co-execution
     ■ Using a compiler,
       - Partition code into compute-intensive and memory-intensive sections (so-called modules).
         ▪ Performance prediction
       - The memory-intensive sections are wrapped into a helper thread.
       - Statically or dynamically map the sections to the best processor.

     Overview of the Co-execution Algorithm
     [Figure: flow chart with one path per application class.
      Numerical applications: Basic Partitioning, Affinity Estimation (performance model), Advanced Partitioning, Mapping (static or dynamic).
      Non-numerical applications: Basic Partitioning, Affinity Estimation (profiling), Advanced Partitioning, Mapping (static or dynamic).
      The chart also contains an Overlapping stage.]

  4. Static Mapping
     ■ Performance model (numerical apps)
       - Execution time = T_{comp} + T_{memstall}
       - Stack distance model for the number of misses

       T_{comp} = \max(T_{int}/N_{int}, T_{fp}/N_{fp}, T_{ldst}/N_{ldst}) + T_{other}
       T_{memstall} = \sum_{i \in caches} miss_i \cdot penalty_i

     ■ Profiling (non-numerical apps)
       - Gather execution time and the number of invocations for all modules and subroutines.

     Dynamic Mapping
     ■ Decision runs at run time to determine affinity
     ■ Coarse and CoarseR
       - Decision runs are module invocations
     [Figure: timeline of module invocations 1, 2, 3, 4, 5, ... showing which of P.host and P.mem executes each invocation under Coarse and under CoarseR.]
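The static-mapping performance model can be sketched in a few lines of Python. This is only an illustration of the two formulas on the Static Mapping slide: the class names, field names, and every number below are hypothetical, and the per-processor miss counts are assumed to come from some external predictor such as the stack distance model.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    name: str
    n_int: int          # N_int: integer units (assumed meaning)
    n_fp: int           # N_fp: floating-point units
    n_ldst: int         # N_ldst: load/store units
    miss_penalty: dict  # cache level -> penalty per miss, in cycles


@dataclass
class Module:
    t_int: float    # T_int
    t_fp: float     # T_fp
    t_ldst: float   # T_ldst
    t_other: float  # T_other
    misses: dict    # processor name -> {cache level -> predicted misses}


def t_comp(m, p):
    # T_comp = max(T_int/N_int, T_fp/N_fp, T_ldst/N_ldst) + T_other
    return max(m.t_int / p.n_int, m.t_fp / p.n_fp, m.t_ldst / p.n_ldst) + m.t_other


def t_memstall(m, p):
    # T_memstall = sum over the processor's caches of miss_i * penalty_i
    return sum(m.misses[p.name][lvl] * pen for lvl, pen in p.miss_penalty.items())


def estimate(m, p):
    # Execution time = T_comp + T_memstall
    return t_comp(m, p) + t_memstall(m, p)


def best_processor(m, processors):
    # Static mapping: pick the processor with the lowest estimated time.
    return min(processors, key=lambda p: estimate(m, p))


# Hypothetical machine and modules, for illustration only.
host = Processor("P.host", n_int=4, n_fp=2, n_ldst=2, miss_penalty={"L1": 10, "L2": 200})
mem = Processor("P.mem", n_int=2, n_fp=1, n_ldst=1, miss_penalty={"L1": 40})

memory_bound = Module(4000, 2000, 6000, 500,
                      {"P.host": {"L1": 1000, "L2": 800}, "P.mem": {"L1": 1000}})
compute_bound = Module(40000, 80000, 8000, 500,
                       {"P.host": {"L1": 100, "L2": 10}, "P.mem": {"L1": 100}})

print(best_processor(memory_bound, [host, mem]).name)
print(best_processor(compute_bound, [host, mem]).name)
```

With these made-up inputs the memory-bound module is estimated to run faster on P.mem and the compute-bound module on P.host, which is the kind of affinity decision the model is meant to support.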

  5. Overall Speedups for Co-execution
     ■ Our co-execution algorithm delivers speedups that are comparable to the ideal speedup.

     Apps.      P.host (alone)   P.host (alone)   Amdahl's     2-processor
                / AdvCoarseR     / OverDyn        2 P.hosts    SGI
     Swim       1.67             2.71             2.00         1.85
     Tomcatv    1.17             1.60             1.67         1.44
     LU         1.26             1.22             1.04         0.99
     TFFT2      1.42             1.22             1.91         0.80
     Mgrid      1.05             1.55             1.94         1.47
     Average    1.31             1.66             1.71         1.31

     Bzip2      1.37             -                1.01         0.99
     Mcf        1.37             -                1.01         1.00
     Go         0.97             -                1.01         0.57
     M88ksim    1.01             -                1.03         1.00
     Average    1.18             -                1.02         0.89

     Correlation Prefetching in Software
     ■ New correlation prefetching in software using the memory thread.
     ■ Records sequences of miss addresses in a correlation table.
     ■ When the head of a sequence is seen, prefetch the rest.
     [Figure: irregular accesses such as a[4*(i++)] ... a[foo(i)] ... produce the miss sequence A B C Z ...]
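The record-and-replay idea behind correlation prefetching can be illustrated with a minimal Python sketch. It assumes a table that keeps only the most recent immediate successors of each miss address; the class name, the `num_succ` parameter, and the addresses are hypothetical, and a real memory thread would issue prefetches rather than return a list.

```python
class CorrelationTable:
    """Maps a miss address (tag) to its most recently seen immediate successors."""

    def __init__(self, num_succ=2):
        self.num_succ = num_succ  # successors kept per row (assumed parameter)
        self.rows = {}            # tag -> successor addresses, most recent first
        self.last_miss = None

    def on_miss(self, addr):
        # Prefetching step: addresses that followed this miss in the past.
        prefetches = list(self.rows.get(addr, []))

        # Learning step: record addr as a successor of the previous miss.
        if self.last_miss is not None:
            succ = self.rows.setdefault(self.last_miss, [])
            if addr in succ:
                succ.remove(addr)
            succ.insert(0, addr)      # most recently seen successor goes first
            del succ[self.num_succ:]  # keep the row bounded
        self.last_miss = addr
        return prefetches


table = CorrelationTable(num_succ=2)
for miss in (0xA0, 0xB0, 0xC0):
    table.on_miss(miss)           # first pass: nothing to prefetch yet
print(table.on_miss(0xA0))        # A was followed by B last time
```

Because each row stores immediate successors only, one lookup here yields just the next address in the sequence; prefetching further ahead needs either repeated lookups or a richer row layout.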

  6. Correlation Table
     ■ Basic Organization (Joseph & Grunwald)
       - Each row holds a tag and the addresses of its immediate successors (Level 1).
     ■ Advanced Organization
       - Each row holds a tag, the addresses of its immediate successors (Level 1), the addresses of the next immediate successors (Level 2), …

     Our Scheme
     [Figure: block diagram with numbered steps 1 to 5. Processor Chip: L1$, L2$. North Bridge Chip: Memory Controller. DRAM Chip or DIMM module: Mem Proc, L1$, DRAM Cells.]
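A row layout with several successor levels, as in the advanced organization, can be sketched as follows. This is an assumption-laden illustration rather than the scheme's actual implementation: the class name, the `num_levels` and `num_succ` parameters, and the update policy (each new miss is recorded under the last `num_levels` misses, one level each) are all hypothetical.

```python
from collections import deque


class MultiLevelTable:
    """Each row keeps successors at several levels: Level 1 is the next miss,
    Level 2 the miss after that, and so on."""

    def __init__(self, num_levels=2, num_succ=2):
        self.num_levels = num_levels
        self.num_succ = num_succ
        self.rows = {}                          # tag -> one successor list per level
        self.recent = deque(maxlen=num_levels)  # latest misses, newest first

    def on_miss(self, addr):
        # Prefetching step: a single lookup returns successors from all levels.
        prefetches = []
        for succ in self.rows.get(addr, []):
            for a in succ:
                if a not in prefetches:
                    prefetches.append(a)

        # Learning step: addr is the Level-(k+1) successor of the k-th most
        # recent miss, so update one level in each of those rows.
        for level, tag in enumerate(self.recent):
            row = self.rows.setdefault(tag, [[] for _ in range(self.num_levels)])
            succ = row[level]
            if addr in succ:
                succ.remove(addr)
            succ.insert(0, addr)
            del succ[self.num_succ:]
        self.recent.appendleft(addr)
        return prefetches


table = MultiLevelTable(num_levels=2)
for miss in (0xA0, 0xB0, 0xC0):
    table.on_miss(miss)
print(table.on_miss(0xA0))   # one lookup yields both B (Level 1) and C (Level 2)
```

The trade-off in this sketch is that the learning step touches `num_levels` rows per miss, while the prefetching step needs only one lookup to cover several future misses.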

  7. The Mechanism of the Memory (Helper) Thread
     ■ Requirements:
       - Low response time
       - Occupancy time < miss distance
     [Figure: timeline of the helper thread. A miss address is observed, the prefetching step generates the prefetch addresses, the learning step updates the table, and the thread then waits. Response time spans from the observed miss to the generated prefetch addresses; occupancy time spans from the observed miss to the table update.]

     Miss Distance
     [Figure: stacked bars from 0% to 100% showing the distribution of miss distances over the bins [0,40), [40,80), [80,120), [120,160), [160,200), [200,240), [240,280), [280,320), [320,360), [360,400) for CG, Equake, FT, Gap, Mcf, MST, Parser, Sparse, Tree, and Average.]
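The requirement that occupancy time stay below the miss distance can be checked against a binned distribution like the one in the Miss Distance figure. The sketch below is a rough illustration only: the function name is hypothetical, the histogram values are invented rather than read from the figure, and the check is deliberately conservative in that it counts only bins lying entirely above the occupancy time.

```python
def percent_meeting_requirement(histogram, occupancy):
    """histogram maps (lo, hi) bins to the percentage of misses whose distance
    falls in [lo, hi). Returns the percentage of misses that are guaranteed to
    satisfy occupancy < miss distance."""
    return sum(pct for (lo, hi), pct in histogram.items() if lo > occupancy)


# Invented distribution, for illustration only.
histogram = {(0, 40): 10, (40, 80): 20, (80, 120): 30, (120, 160): 40}

print(percent_meeting_requirement(histogram, occupancy=70))
print(percent_meeting_requirement(histogram, occupancy=130))
```

A shorter occupancy time moves more of the distribution above the threshold, which is why the learning step has to stay cheap.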

  8. Execution Time in DRAM
     [Figure: normalized execution time (0 to 1.2), split into Busy, UptoL2, and BeyondL2, for NoPref, Conven4, Base, Chain, Repl, and Conven4+Repl on CG, Equake, FT, Gap, Mcf, MST, Parser, Sparse, Tree, and Average. Some applications also have a Custom bar.]

     Response and Occupancy Time
     [Figure: response time and occupancy time in processor cycles (0 to 250), split into Busy and Mem, for Base, Chain, Repl, and ReplMC.]

  9. Execution Time in MC
     [Figure: normalized execution time (0 to 1.2), split into Busy, UptoL2, and BeyondL2, for NoPref, Conven4, BaseMC, ChainMC, ReplMC, and Conven4+ReplMC on CG, Equake, FT, Gap, Mcf, MST, Parser, Sparse, Tree, and Average.]

     Active Prefetching
     ■ The helper thread runs the skeleton of the original code
       - Address computation
       - Prefetch instructions
     ■ More accurate prefetches
     ■ The helper thread should be faster than the original code
     ■ Synchronization overhead
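The skeleton idea behind active prefetching can be imitated with a small simulation. Everything here is a hypothetical stand-in: `index_of` plays the role of an irregular index function, a Python set models an unbounded cache filled by prefetches, and the `lookahead` bound models the synchronization that keeps the helper a fixed number of iterations ahead of the main thread. Real code would issue prefetch instructions instead.

```python
def index_of(i, n):
    """Hypothetical irregular index function, standing in for a[foo(i)]."""
    return (i * 7 + 3) % n


def original_loop(a, iters):
    total = 0
    for i in range(iters):
        idx = index_of(i, len(a))  # address computation (kept in the skeleton)
        total += a[idx] * 2        # actual work (dropped from the skeleton)
    return total


def skeleton(n, iters):
    """Helper-thread skeleton: address computation only, yielding the
    addresses it would prefetch."""
    for i in range(iters):
        yield index_of(i, n)


def run_with_helper(a, iters, lookahead):
    """Run the original loop while a simulated helper stays at most
    `lookahead` iterations ahead. Returns (result, misses)."""
    prefetched = set()  # models lines brought in by the helper
    helper = skeleton(len(a), iters)
    issued = 0
    misses = 0
    total = 0
    for i in range(iters):
        # Synchronization: let the helper advance up to iteration i + lookahead.
        while issued < min(i + lookahead, iters):
            prefetched.add(next(helper))
            issued += 1
        idx = index_of(i, len(a))
        if idx not in prefetched:
            misses += 1
        total += a[idx] * 2
    return total, misses


data = list(range(1000))
print(run_with_helper(data, 100, lookahead=4))  # helper runs ahead
print(run_with_helper(data, 100, lookahead=0))  # helper lags behind
```

In this toy setting a helper that runs ahead removes every miss, while one that lags behind removes none, which mirrors the slide's point that the helper thread should be faster than the original code.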

  10. Outline
      ■ Heterogeneous Multithreading (Helper Threading)
        - Intelligent Memory
        - Co-execution
        - Prefetching
      ■ Compiler-Assisted Demand Paging
        - Motivation
        - Framework
        - Example
        - Performance Results
      ■ Wrap-up

      Seoul National university Advanced Compiler tool Kit (SNACK)

  11. SNACK Components
      ■ SNACK-cc: a C compiler for embedded systems
      ■ SNACK-c2c: C-to-C translator
      ■ SNACK-asm: assembler
      ■ SNACK-link: linker
      ■ SNACK-pop: post-pass optimizer
      ■ SNACK-jvm: embedded Java VM (planned)

      Goals
      ■ High performance
      ■ Small code size
      ■ Low power/energy
