
The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing



  1. The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing
     www.multi2sim.org
     Rafael Ubal, David R. Kaeli
     Northeastern University, Boston, MA

  2. Outline
     1. Introduction
     First Block – The x86 CPU Simulation
     2. The x86 CPU Emulation
     3. The x86 CPU Architectural Simulation
     4. The Memory Hierarchy
     5. Benchmarks and Simulations
     Second Block – The AMD Evergreen GPU Simulation
     6. The OpenCL Programming Model
     7. The AMD Evergreen GPU Emulation
     8. The AMD Evergreen GPU Architectural Simulation
     9. Benchmarks and Simulations
     10. Conclusions and Future Work
     The Multi2Sim Simulation Framework, PACT 2011 Tutorial

  3. 1. Introduction: Motivation
     • Limitations of existing CPU simulators
       – Such as SimpleScalar, Simics, SSMT, M-Sim, SMTSim, M5, ...
       – Full-system vs. application-only simulation.
       – Free, open-source.
       – Architectural simulation accuracy.
       – Alpha/PISA architectures → cross-compilers.
       – Integrated system.
     • Current simulation needs
       – Based on current processor market.
       – Heterogeneous CPU-GPU environments.
       – Tool for evaluation of new architectural proposals.
       – Simulation of a GPU ISA.
     • Existing GPU simulation approaches
       – Barra: NVIDIA Tesla ISA.
       – Ocelot: PTX intermediate language simulator.
       – No architectural simulation.
       – No emulation of AMD ISAs.
       – Not capable of heterogeneous simulation.

  4. 1. Introduction: Multi2Sim Background
     • Multi2Sim 1.x version series, 2007 (MIPS-based)
       – Superscalar pipeline: out-of-order execution, branch prediction, trace cache, etc.
       – Multithreading: fine-grain, coarse-grain and simultaneous (SMT).
     • Multi2Sim 2.x version series, 2008 (x86-based)
       – Multicore architecture: configurable memory hierarchy, cache coherence, interconnection networks.
       – State-of-the-art benchmarks: tested support for common research benchmarks, available for download.
     • Multi2Sim 3.x version series, 2011 (x86+Evergreen)
       – GPU model: model for Evergreen ISA.
       – Support for OpenCL benchmarks.

  5. 1. Introduction: Getting Started
     • User-friendly installation and test
         $ tar -xzf multi2sim-3.1.tar.gz
         $ cd multi2sim-3.1
         $ ./configure
         $ make
         $ sudo make install
     • Application-only simulator
       Original execution:
         $ ./test-args hola que tal
         arg[0] = 'hola'
         arg[1] = 'que'
         arg[2] = 'tal'
       Simulated execution:
         $ m2s ./test-args hola que tal
         <... Simulator output ...>
         arg[0] = 'hola'
         arg[1] = 'que'
         arg[2] = 'tal'
         <... Simulator statistics ...>

  6. 1. Introduction: The IniFile Format
     • Example of IniFile
         ; This is a comment.
         [ Section 0 ]
         Color = Red
         Height = 40
         [ OtherSection ]
         Variable = Value
     • Multi2Sim uses IniFile for
       – Configuration files.
       – Output statistic files.
       – Standard error output.
     Demo 1
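
The IniFile example on this slide can be read with a few lines of Python. This is a minimal sketch, not Multi2Sim's own parser: the function name `parse_inifile` is hypothetical, and it assumes only the three constructs the slide shows (`;` comments, `[ Section ]` headers, `Variable = Value` pairs).

```python
def parse_inifile(text):
    """Parse IniFile-style text into {section: {variable: value}}.

    Hypothetical helper for illustration; handles only ';' comments,
    '[ Section ]' headers and 'Variable = Value' lines.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and ';' comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()  # drop brackets and padding spaces
            sections.setdefault(current, {})
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            sections[current][key.strip()] = value.strip()
        else:
            raise ValueError(f"unrecognized line: {raw!r}")
    return sections


EXAMPLE = """\
; This is a comment.
[ Section 0 ]
Color = Red
Height = 40
[ OtherSection ]
Variable = Value
"""

config = parse_inifile(EXAMPLE)
print(config["Section 0"]["Height"])
```

All values are kept as strings; a caller that needs `Height` as a number would convert it explicitly.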

  7. Block 1: The x86 CPU Simulation

  8. 2. The CPU Emulation: Definition
     • Emulation (a.k.a. functional simulation)
       – Just mimic original behavior of a program.
       – … as opposed to timing/detailed/architectural simulation.
     • Steps
       1) Program loading.
       2) Simulation loop.

  9. 2. The CPU Emulation: Program Loading
     • Initialization of a process state
       – Virtual memory map.
       – Value of x86 registers.
     1) Parse ELF executable
       – ELF sections.
       – Initialized code and data.
     2) Initialize stack
       – Program headers.
       – Arguments.
       – Environment variables.
     3) Initialize registers
       – Program entry point → eip
       – Stack pointer → esp
     [Figure: virtual memory map with labels 0x08000000 (Text), 0x08xxxxxx (Initialized data), Heap, 0x40000000 (mmap region, not initialized) and 0xc0000000 (Stack, holding program arguments and environment variables); registers eax, ebx, ecx, esp (top of stack) and eip (instruction pointer).]

  10. 2. The CPU Emulation: Simulation Loop
      • Emulation of x86 instructions
        – Update memory map (if needed).
        – Update x86 registers.
        – Example: add [bp+16], 0x5
      • Emulation of Linux system calls
        – Analyze system call code and args.
        – Update memory map.
        – Update eax with return value.
        – Example: read(fd, buf, count);
      [Figure: flowchart. Read instr. at eip → (instr. bytes) → Decode instruction → (instr. fields) → Instr. is int 0x80? No: emulate x86 instr. Yes: emulate system call. → Move eip to next instr.]
      Demo 2
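
The loop on this slide can be sketched in Python as follows. This is an illustration under loud assumptions, not Multi2Sim code: instructions are pre-decoded tuples rather than x86 bytes, `eip` counts instructions instead of byte addresses, and the opcode names (`mov`, `add_mem`) and the `toy_read` handler are hypothetical. The only real-world detail relied on is the 32-bit Linux convention that `int 0x80` takes the system call code in `eax`, arguments in `ebx`, `ecx`, `edx`, and returns its result in `eax` (code 3 is `read`).

```python
def toy_read(regs, memory):
    """Stand-in for read(fd, buf, count): copies from a fixed string."""
    data = b"hola"                      # hypothetical input source
    buf, count = regs["ecx"], regs["edx"]
    chunk = data[:count]
    for i, byte in enumerate(chunk):
        memory[buf + i] = byte          # system call updates the memory map
    return len(chunk)


SYSCALLS = {3: toy_read}                # 3 = read on 32-bit x86 Linux


def emulate(program, regs, memory):
    """Emulate until eip leaves the program; return instructions executed."""
    executed = 0
    while 0 <= regs["eip"] < len(program):
        opcode, *operands = program[regs["eip"]]    # read + "decode"
        if opcode == "int" and operands == [0x80]:
            # Emulate system call: code in eax, return value back in eax.
            regs["eax"] = SYSCALLS[regs["eax"]](regs, memory)
        elif opcode == "mov":                       # mov reg, imm
            reg, imm = operands
            regs[reg] = imm
        elif opcode == "add_mem":                   # add [base+offset], imm
            base, offset, imm = operands
            addr = regs[base] + offset
            memory[addr] = memory.get(addr, 0) + imm
        else:
            raise ValueError(f"unknown opcode: {opcode!r}")
        regs["eip"] += 1                            # move eip to next instr.
        executed += 1
    return executed


program = [
    ("mov", "eax", 3),          # system call code: read
    ("mov", "ebx", 0),          # fd
    ("mov", "ecx", 116),        # buf
    ("mov", "edx", 2),          # count
    ("int", 0x80),
    ("mov", "ebp", 100),
    ("add_mem", "ebp", 16, 0x5),  # add [bp+16], 0x5
]
regs = {"eip": 0}
memory = {}
executed = emulate(program, regs, memory)
```

After the run, `regs["eax"]` holds the return value of the emulated `read`, and the byte at address 116 has been incremented by 5 by the `add` instruction.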

  11. 3. The CPU Architectural Simulation: Definition
      • Architectural simulation (a.k.a. detailed/timing simulation)
        – Provides performance results from executing a program on a configurable CPU model.
        – Main performance metric: execution time. But also structure occupancy, cache hit rates, contention points...
      [Figure: the architectural simulator (cycle counter, CPU cores model, memory hierarchy model) asks the CPU functional simulator to "run a new x86 instruction"; the functional simulator answers "this is the instr. that was run".]
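
The split between functional and architectural simulation can be sketched as two cooperating pieces: one that says which instruction ran, and one that charges cycles for it. The code below is a deliberately simplified assumption, not the Multi2Sim timing model: instructions execute one at a time with no overlap, the cache never evicts, and the latencies (`hit_latency`, `miss_latency`) and trace format are hypothetical.

```python
def functional_simulator(trace):
    """Stand-in emulator: hands over one executed instruction per request."""
    for instr in trace:
        yield instr


def architectural_simulation(trace, line_size=64, hit_latency=2, miss_latency=100):
    """Charge cycles for each instruction the functional simulator ran."""
    cached_lines = set()    # unbounded cache: a line misses only on first touch
    stats = {"cycles": 0, "instructions": 0, "accesses": 0, "hits": 0}
    for op, addr in functional_simulator(trace):    # "run a new instruction"
        stats["instructions"] += 1
        if op == "load":
            stats["accesses"] += 1
            line = addr // line_size
            if line in cached_lines:
                stats["hits"] += 1
                stats["cycles"] += hit_latency
            else:
                cached_lines.add(line)
                stats["cycles"] += miss_latency
        else:
            stats["cycles"] += 1                    # non-memory instruction
    return stats


trace = [("load", 0), ("alu", None), ("load", 8), ("load", 64), ("alu", None)]
stats = architectural_simulation(trace)
print(stats)
```

The returned dictionary carries the kinds of metrics the slide lists: execution time in cycles and the data needed for a cache hit rate (`hits / accesses`).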

  12. 3. The CPU Architectural Simulation: The Superscalar Pipeline
      [Figure: superscalar pipeline with Fetch, Decode, Dispatch, Issue, Writeback and Commit stages; fetch queue, μop queue, trace queue, instruction queue, load/store queue and reorder buffer; functional units (FU), register file, instruction cache, trace cache and data cache.]
      • Characteristics
        – Speculative execution.
        – Branch prediction.
        – Out-of-order execution.
        – Trace cache.
      Demo 3

  13. 3. The CPU Architectural Simulation: Multithreaded Processor Model
      [Figure: three replicated pipelines (Fetch, Decode, Dispatch, Issue, Writeback, Commit), each with its own instruction cache, trace cache, data cache and register file, on top of a shared functional unit pool.]
      • Multithreading Paradigms
        – Coarse-grain multithreading: thread switch upon long-latency events.
        – Fine-grain multithreading: thread switch at a cycle granularity.
        – Simultaneous multithreading: multiple-thread issuing of instructions.
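
The difference between the three paradigms comes down to which hardware threads may issue in a given cycle. The sketch below makes that concrete; it is an assumption-laden illustration rather than the simulator's policy. The function `issuing_threads`, the `long_latency` set of `(thread, cycle)` events, the round-robin order and the `issue_width` parameter are all hypothetical.

```python
def issuing_threads(paradigm, num_threads, num_cycles,
                    long_latency=frozenset(), issue_width=2):
    """Return, for each cycle, the list of threads allowed to issue."""
    timeline = []
    current = 0
    for cycle in range(num_cycles):
        if paradigm == "coarse":
            # Stay on one thread; switch only after a long-latency event.
            timeline.append([current])
            if (current, cycle) in long_latency:
                current = (current + 1) % num_threads
        elif paradigm == "fine":
            # Switch threads every cycle, round-robin.
            timeline.append([cycle % num_threads])
        elif paradigm == "smt":
            # Several threads issue in the same cycle.
            width = min(issue_width, num_threads)
            timeline.append([(cycle + i) % num_threads for i in range(width)])
        else:
            raise ValueError(f"unknown paradigm: {paradigm!r}")
    return timeline


coarse = issuing_threads("coarse", 2, 4, long_latency={(0, 1)})
fine = issuing_threads("fine", 2, 4)
smt = issuing_threads("smt", 2, 2)
```

With two threads, `coarse` stays on thread 0 until its long-latency event in cycle 1, `fine` alternates every cycle, and `smt` lets both threads issue in every cycle.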

  14. 3. The CPU Architectural Simulation: Multicore Processor Model
      [Figure: Core 0 and Core 1, each a complete pipeline (Fetch, Decode, Dispatch, Issue, Writeback, Commit) with its own instruction cache, trace cache, data cache and register file, both connected to the memory hierarchy.]
      • Multicore Processor
        – Multiple independent superscalar pipelines.
        – Communication only through memory hierarchy.
      • What can we run on it?
        – Multiple single-threaded programs.
        – One (or more) programs spawning child threads.
      Demo 4

  15. 3. The CPU Architectural Simulation: Definitions
      • Core (c-0, c-1, ...)
        – Hardware component with an independent set of superscalar pipelines.
        – Each core may contain several threads.
      • Thread (t-0, t-1, ...)
        – Hardware component with a partially independent set of pipeline stages.
      • Context (ctx-0, ctx-1, ...)
        – Software thread with independent value for registers (incl. eip).
        – Can be a sequential program or a spawned child context.
      • Node
        – Hardware component running a context.
        – Multicore proc.: c0, c1, …
        – Multithreaded proc.: t0, t1, …
        – Multicore-multithreaded proc.: c0-t0, c0-t1, ...
      Demo 4
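
The node naming scheme listed under "Node" can be generated mechanically. The helper below is a hypothetical sketch (the name `node_names` is not from Multi2Sim), and its choice for a single-core, single-threaded processor is an assumption, since the slide does not cover that case.

```python
def node_names(num_cores, num_threads):
    """List node names following the slide's naming convention."""
    if num_cores > 1 and num_threads > 1:
        # Multicore-multithreaded processor: one node per (core, thread) pair.
        return [f"c{c}-t{t}"
                for c in range(num_cores)
                for t in range(num_threads)]
    if num_cores > 1:
        # Multicore processor: one node per core.
        return [f"c{c}" for c in range(num_cores)]
    # Multithreaded processor. Assumption: a 1-core, 1-thread processor
    # also lands here and gets the single name "t0".
    return [f"t{t}" for t in range(num_threads)]


print(node_names(2, 2))
```

Each name identifies one hardware slot that can run a single context at a time.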

  16. 4. Memory Hierarchy: Configuration
      • Configuring memory hierarchy
        – Any number of caches organized in any number of levels.
        – Connected through any number of interconnects.
        – A set of one or more caches connects to an interconnect from "above"; only one cache (or main memory) is connected "below".
      [Figure: several caches attached above an interconnect, with a single cache or main memory attached below it.]
      • Memory hierarchy entries
        – Each node has two entries to the memory hierarchy: an instruction entry and a data entry.
        – Several node entries can converge to the same cache (or main memory).

  17. 4. Memory Hierarchy: Configuration
      • Example
        – 2-core, 2-threaded processor (4 nodes).
        – Each thread has its own private data and instruction L1 caches.
        – L2 caches: shared among threads, private per core, unified for data/instr.
      [Figure: Core 0 holds nodes c0-t0 and c0-t1, Core 1 holds nodes c1-t0 and c1-t1; each node has a data L1 and an instruction L1; the four L1 caches of each core connect to that core's L2 cache; both L2 caches connect to main memory.]
      Demo 5
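
The example hierarchy can be modeled as a mapping from each cache to the single module "below" it. This is a sketch of the topology only, not Multi2Sim's configuration file syntax: the module names (`dl1-…`, `il1-…`, `l2-…`, `main-memory`) and both helper functions are hypothetical, and interconnects are left implicit.

```python
def build_hierarchy(num_cores=2, num_threads=2):
    """Build the slide's example: private L1s per thread, one L2 per core."""
    below = {}      # module name -> the one module connected "below" it
    entries = {}    # node name -> (data entry, instruction entry)
    for c in range(num_cores):
        l2 = f"l2-{c}"
        below[l2] = "main-memory"           # every L2 sits on main memory
        for t in range(num_threads):
            node = f"c{c}-t{t}"
            data_l1, instr_l1 = f"dl1-{node}", f"il1-{node}"
            below[data_l1] = l2             # both L1s converge on the
            below[instr_l1] = l2            # core's unified L2
            entries[node] = (data_l1, instr_l1)
    return entries, below


def path_to_memory(module, below):
    """Follow the 'below' links from a module down to main memory."""
    path = [module]
    while path[-1] in below:
        path.append(below[path[-1]])
    return path


entries, below = build_hierarchy()
data_entry, instr_entry = entries["c1-t0"]
print(path_to_memory(data_entry, below))
```

Using a plain dictionary for `below` enforces the rule that each cache has exactly one lower-level module, while several caches may still share the same one.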
