Multi2Sim 4.1
Multi-Architecture ISA-Level Simulation of OpenCL

Dana Schaa, Rafael Ubal
Northeastern University, Boston, MA
Outline

• Introduction
• Simulation methodology
• Part 1 – Simulation of an x86 CPU
  ─ Emulation
  ─ Timing simulation
  ─ Memory hierarchy
  ─ Visualization tool
  ─ OpenCL on the host
• Part 2 – Simulation of a Southern Islands GPU
  ─ OpenCL on the device
  ─ The Southern Islands ISA
  ─ The GPU architecture
  ─ Southern Islands simulation
  ─ Validation results
  ─ Improving heterogeneity
• Concluding remarks

IWOCL Tutorial, May 2013
Introduction
Getting Started

Follow our demos!

• User accounts for demos

    $ ssh iwocl<N>@fusion1.ece.neu.edu -X
    Password: iwocl2013

• Installation of Multi2Sim

    $ wget http://www.multi2sim.org/files/multi2sim-4.1.tar.gz
    $ tar -xzf multi2sim-4.1.tar.gz
    $ cd multi2sim-4.1
    $ ./configure && make
Introduction
First Execution

• Source code

    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int i;

        /* argc counts the program name plus its arguments. */
        printf("Number of arguments: %d\n", argc);
        for (i = 0; i < argc; i++)
            printf("\targv[%d] = %s\n", i, argv[i]);
        return 0;
    }

• Native execution

    $ test-args hello there
    Number of arguments: 3
        argv[0] = test-args
        argv[1] = hello
        argv[2] = there

• Execution on Multi2Sim

    $ m2s test-args hello there
    < Simulator message in stderr >
    Number of arguments: 3
        argv[0] = test-args
        argv[1] = hello
        argv[2] = there
    < Simulator statistics >

Demo 1
Introduction
Simulator Input/Output Files

• Example of INI file format

    ; This is a comment.
    [ Section 0 ]
    Color = Red
    Height = 40

    [ OtherSection ]
    Variable = Value

• Multi2Sim uses the INI file format for
  ─ Configuration files.
  ─ Output statistics.
  ─ Statistic summary in standard error output.
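As a rough illustration of the format above, the following C sketch classifies a single INI line as a comment, a section header, or a variable assignment. The names `ini_kind`, `ini_copy_trimmed`, and `ini_parse_line` are hypothetical and are not taken from the Multi2Sim sources; the simulator's own configuration parser may behave differently.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical line kinds; not taken from the Multi2Sim sources. */
enum ini_kind { INI_BLANK, INI_COMMENT, INI_SECTION, INI_VARIABLE, INI_ERROR };

/* Copy [begin, end) into out, with surrounding whitespace removed. */
static void ini_copy_trimmed(const char *begin, const char *end,
                             char *out, size_t size)
{
    size_t len;

    while (begin < end && isspace((unsigned char)*begin))
        begin++;
    while (end > begin && isspace((unsigned char)end[-1]))
        end--;
    len = (size_t)(end - begin);
    if (len >= size)
        len = size - 1;          /* truncate to fit the output buffer */
    memcpy(out, begin, len);
    out[len] = '\0';
}

/* Classify one line. A section or variable name is stored in 'name',
 * a variable value in 'value'. Both buffers hold 'size' >= 1 bytes. */
enum ini_kind ini_parse_line(const char *line, char *name, char *value,
                             size_t size)
{
    const char *end = line + strlen(line);
    const char *eq;

    name[0] = value[0] = '\0';
    while (line < end && isspace((unsigned char)*line))
        line++;
    while (end > line && isspace((unsigned char)end[-1]))
        end--;
    if (line == end)
        return INI_BLANK;
    if (*line == ';')
        return INI_COMMENT;
    if (*line == '[') {
        if (end[-1] != ']')
            return INI_ERROR;    /* unterminated section header */
        ini_copy_trimmed(line + 1, end - 1, name, size);
        return INI_SECTION;
    }
    eq = memchr(line, '=', (size_t)(end - line));
    if (eq == NULL)
        return INI_ERROR;        /* neither a section nor "name = value" */
    ini_copy_trimmed(line, eq, name, size);
    ini_copy_trimmed(eq + 1, end, value, size);
    return INI_VARIABLE;
}
```

Calling `ini_parse_line("Height = 40", name, value, sizeof name)` on two equally sized buffers would return `INI_VARIABLE` with `name` set to `Height` and `value` set to `40`.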
Simulation Methodology
Application-Only vs. Full-System

[Figure: two software stacks. Full-system: guest programs 1, 2, ... run on a full O.S., on top of a full-system simulator core that virtualizes the complete processor ISA and the I/O hardware. Application-only: guest programs 1, 2, ... run directly on an application-only simulator core that virtualizes a user-space subset of the ISA and the system call interface.]

• Full-system simulation
  An entire OS runs on top of the simulator. The simulator models the entire ISA and virtualizes native hardware devices, similar to a virtual machine. Very accurate simulations, but extremely slow.

• Application-only simulation
  Only an application runs on top of the simulator. The simulator implements a subset of the ISA and needs to virtualize the system call interface (ABI). Multi2Sim falls in this category.
Simulation Methodology
Four-Stage Simulation Process

[Figure: four modules in a chain, with their inputs and outputs.
  Disassembler. Input: executable. Output: ISA disassembly.
  Functional simulator (or emulator). Inputs: executable, arguments. Output: program output.
  Detailed simulator (or timing/architectural simulator). Inputs: executable, arguments, configuration. Output: performance statistics.
  Visual tool. Input: user. Output: timing diagrams.
Labels on the connections between modules: instruction bytes, instruction fields, "Run one instruction!", instruction fields, pipeline trace.]

• Modular implementation
  ─ Four clearly different software modules per architecture (x86, MIPS, ...).
  ─ Each module has a standard interface for stand-alone execution, or interaction with other modules.
Simulation Methodology
Current Architecture Support

                        Disasm.   Emulation     Timing Simulation   Graphic Pipelines
ARM                     X         In progress   –                   –
MIPS                    X         –             –                   –
x86                     X         X             X                   X
AMD Evergreen           X         X             X                   X
AMD Southern Islands    X         X             X                   X
NVIDIA Fermi            X         In progress   –                   –

• Available in Multi2Sim 4.1
  ─ Evergreen, Southern Islands, and x86 fully supported.
  ─ Three other CPU/GPU architectures in progress.
  ─ This tutorial will focus on x86 and Southern Islands.
Part 1 – Simulation of an x86 CPU
Emulation of an x86 CPU
Program Loading

• Initialization of a process state
  1) Parse ELF executable
     ─ Read ELF sections and symbols.
     ─ Initialize code and data.
  2) Initialize stack
     ─ Program headers.
     ─ Arguments.
     ─ Environment variables.
  3) Initialize registers
     ─ Program entry → eip
     ─ Stack pointer → esp

[Figure: initial virtual memory image for x86 and initial values of the context registers. Memory regions, from high to low addresses: stack (program arguments and environment variables) below 0xc0000000; mmap region (not initialized) at 0x40000000; heap; initialized data at 0x08xxxxxx; text starting at 0x08000000. Registers shown: eax, ebx, ecx, esp (pointing to the top of the stack), and eip (the instruction pointer).]
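Step 3 above can be sketched in C as follows. This is a simplified illustration, not Multi2Sim's loader: the `x86_regs` structure and the `x86_load_regs` function are hypothetical, and the sketch only checks the ELF header and reads the entry point, assuming the executable image is already in memory and the stack has already been built. The header offsets (class at byte 4, data encoding at byte 5, machine at byte 18, entry point at byte 24) come from the 32-bit ELF format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical context holding only the two registers set by the loader. */
struct x86_regs {
    uint32_t eip;
    uint32_t esp;
};

/* Read little-endian values from a byte buffer. */
static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | p[1] << 8);
}

static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Set eip to the ELF entry point and esp to the given top of stack.
 * Returns 0 on success, or -1 if the image is not a 32-bit,
 * little-endian x86 ELF file. */
int x86_load_regs(const unsigned char *image, size_t size,
                  uint32_t stack_top, struct x86_regs *regs)
{
    if (size < 52)                                /* ELF32 header size */
        return -1;
    if (memcmp(image, "\x7f" "ELF", 4) != 0)      /* magic number */
        return -1;
    if (image[4] != 1 || image[5] != 1)           /* ELFCLASS32, ELFDATA2LSB */
        return -1;
    if (read_le16(image + 18) != 3)               /* e_machine == EM_386 */
        return -1;
    regs->eip = read_le32(image + 24);            /* e_entry → eip */
    regs->esp = stack_top;                        /* stack pointer → esp */
    return 0;
}
```

A full loader would also walk the program headers to copy the text and data segments into the guest memory image, and would push the arguments and environment variables onto the stack before setting `esp`.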
Emulation of an x86 CPU
Emulation Loop

• Emulation of x86 instructions
  ─ Update x86 registers.
  ─ Update memory map if needed.
  ─ Example: add [bp+16], 0x5
• Emulation of Linux system calls
  ─ Analyze system call code and arguments.
  ─ Update memory map.
  ─ Update register eax with return value.
  ─ Example: read(fd, buf, count)

[Figure: flowchart of the emulation loop. Read the instruction at eip (instruction bytes) → decode the instruction (instruction fields) → is the instruction int 0x80? If yes, emulate the system call; if no, emulate the x86 instruction → move eip to the next instruction → repeat.]

Demo 2
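The loop in the flowchart can be sketched in C with a toy instruction set. The two-byte encoding, the opcode names, and the `context` structure below are invented for this illustration and are not x86 or Multi2Sim code; `OP_SYSCALL` plays the role of `int 0x80`, and the stand-in system call simply writes a return value of 0 into `eax`.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy 2-byte instruction encoding (opcode, operand), invented for this
 * sketch. It is not x86. */
enum { OP_HALT = 0, OP_ADD_EAX = 1, OP_SYSCALL = 2 };

struct context {
    uint32_t eax;        /* accumulator and system call return value */
    uint32_t eip;        /* instruction pointer (byte offset into mem) */
    unsigned syscalls;   /* number of system calls emulated so far */
};

/* Stand-in for system call emulation: a real emulator would inspect the
 * call code and arguments; here the return value is always 0. */
static void emulate_syscall(struct context *ctx)
{
    ctx->syscalls++;
    ctx->eax = 0;
}

/* Run until OP_HALT or the end of memory.
 * Returns the number of instructions emulated. */
unsigned emulate(struct context *ctx, const uint8_t *mem, size_t size)
{
    unsigned count = 0;

    while ((size_t)ctx->eip + 2 <= size) {
        uint8_t opcode = mem[ctx->eip];        /* read instr. at eip */
        uint8_t operand = mem[ctx->eip + 1];   /* decode its fields */

        if (opcode == OP_HALT)
            break;
        if (opcode == OP_SYSCALL)
            emulate_syscall(ctx);              /* the "int 0x80" branch */
        else if (opcode == OP_ADD_EAX)
            ctx->eax += operand;               /* update registers */
        /* Unknown opcodes are treated as no-ops in this sketch. */
        ctx->eip += 2;                         /* move eip to next instr. */
        count++;
    }
    return count;
}
```

Running the program `{1, 5, 1, 7, 2, 0, 1, 1, 0, 0}` from a zeroed context emulates four instructions: two additions, one system call that resets `eax` to 0, and a final addition that leaves `eax` at 1.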
Timing Simulation of an x86 CPU
Superscalar Processor

[Figure: superscalar pipeline with Fetch, Decode, Dispatch, Issue, Writeback, and Commit stages. Structures shown: instruction cache, trace cache, fetch queue, μop queue, trace queue, instruction queue, load/store queue, reorder buffer, functional units (FU), data cache, and register file.]

• Superscalar x86 pipelines
  ─ 6-stage pipeline with configurable latencies.
  ─ Supported features include speculative execution, branch prediction, micro-instruction generation, trace caches, out-of-order execution, ...
  ─ Modeled structures include fetch queues, reorder buffer, load-store queues, register files, register mapping tables, ...
Timing Simulation of an x86 CPU
Multithreaded and Multicore Processors

[Figure: several copies of the superscalar pipeline (Fetch, Decode, Dispatch, Issue, FU, Writeback, Commit), each with its instruction cache, trace cache, data cache, and register file.]

• Multithreading
  ─ Replicated superscalar pipelines with partially shared resources.
  ─ Fine-grain, coarse-grain, and simultaneous multithreading.
• Multicore
  ─ Fully replicated superscalar pipelines, communicating through the memory hierarchy.
  ─ Parallel architectures can run multiple programs concurrently, or one program spawning child threads (using OpenMP, pthread, etc.).

Demo 3
The Memory Hierarchy
Configuration

• Flexible hierarchies
  ─ Any number of caches organized in any number of levels.
  ─ Cache levels connected through default cross-bar interconnects, or complex custom interconnect configurations.
  ─ Each architecture undergoing a timing simulation specifies its own entry point (cache memory) in the memory hierarchy, for data or instructions.
  ─ Cache coherence is guaranteed with an implementation of the 5-state MOESI protocol.
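For readers unfamiliar with MOESI, the sketch below encodes the textbook state transitions for a single cache block, seen from one cache. It is an illustration of the generic protocol under simplifying assumptions (no transient states, no messages or timing), not a description of how Multi2Sim implements coherence; all names are hypothetical.

```c
/* The five MOESI states of a cache block. */
enum moesi_state { MOESI_I, MOESI_S, MOESI_E, MOESI_O, MOESI_M };

/* Events seen by one cache: accesses by its own core (local) and
 * accesses by other caches observed through the hierarchy (remote). */
enum moesi_event {
    EV_LOCAL_READ, EV_LOCAL_WRITE, EV_REMOTE_READ, EV_REMOTE_WRITE
};

/* Next state of the block in this cache. 'others_have_copy' only matters
 * for a local read miss: it selects Shared versus Exclusive. */
enum moesi_state moesi_next(enum moesi_state state, enum moesi_event event,
                            int others_have_copy)
{
    switch (event) {
    case EV_LOCAL_READ:
        if (state == MOESI_I)                 /* read miss */
            return others_have_copy ? MOESI_S : MOESI_E;
        return state;                         /* read hit keeps the state */
    case EV_LOCAL_WRITE:
        return MOESI_M;                       /* other copies are invalidated */
    case EV_REMOTE_READ:
        if (state == MOESI_M)
            return MOESI_O;                   /* keep dirty data, now shared */
        if (state == MOESI_E)
            return MOESI_S;
        return state;                         /* O, S, and I are unchanged */
    case EV_REMOTE_WRITE:
        return MOESI_I;                       /* another cache takes ownership */
    }
    return state;
}
```

The Owned state is what distinguishes MOESI from MESI: a modified block that another cache reads can stay dirty in the owner's cache instead of being written back to the next level first.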
The Memory Hierarchy
Configuration Examples

Example 1
Three CPU cores with private L1 caches, two L2 caches, and default cross-bar based interconnects. Cache L2-0 serves physical address range [0, 7ff...ff], and cache L2-1 serves [80...00, ff...ff].

[Figure: Core 0, Core 1, and Core 2, each with a private L1 cache (L1-0, L1-1, L1-2), connected through a switch to L2-0 and L2-1, which connect through a second switch to main memory.]

Demo 4
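Splitting the address space into a lower and an upper half means that the most significant address bit alone selects the L2 cache. The sketch below shows that selection in C. The function name is hypothetical, and the address width is passed as a parameter because the slide does not state it.

```c
#include <stdint.h>

/* Index (0 or 1) of the L2 cache serving 'addr' when the physical address
 * space is split into two equal halves, as in Example 1.
 * 'addr_bits' is the physical address width, assumed to be in [1, 64]. */
int l2_for_address(uint64_t addr, unsigned addr_bits)
{
    /* The top address bit is 0 in the lower half and 1 in the upper half. */
    return (int)((addr >> (addr_bits - 1)) & 1u);
}
```

Assuming 32-bit physical addresses, `l2_for_address(0x7fffffff, 32)` selects L2-0 and `l2_for_address(0x80000000, 32)` selects L2-1.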
The Memory Hierarchy
Configuration Examples

Example 2
Four CPU cores with private L1 data caches, with L1 instruction caches and L2 caches each shared by two cores (every L2 serving the whole address space), and four main memory modules, connected through a custom network with a ring topology.

[Figure: Core 0 to Core 3; data caches L1-0 to L1-3 and instruction caches L1-0 and L1-1; two switches connecting the L1 caches to L2-0 and L2-1; network end nodes n0 to n5; ring switches sw0 to sw3; main memory modules MM-0 to MM-3.]
The Memory Hierarchy
Configuration Examples

Example 3
Ring connection between four switches associated with end nodes, with routing tables calculated automatically based on shortest paths. The resulting routing algorithm can contain cycles, potentially leading to routing deadlocks at runtime.

[Figure: end nodes n0 to n3, each attached to one of the switches s0 to s3, which are connected in a ring.]
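To see where the cycle comes from, the following C sketch computes shortest-path next hops on a ring and checks whether the routes form a closed chain of link dependencies. It assumes that ties between the two directions are broken clockwise, which the slide does not specify, and the function names are hypothetical rather than part of Multi2Sim.

```c
/* Next switch on a shortest path from switch 'src' to switch 'dst' on a
 * ring of n switches numbered clockwise. Ties (possible when n is even)
 * are broken clockwise, which is an assumption of this sketch. */
int ring_next_hop(int src, int dst, int n)
{
    int cw = (dst - src + n) % n;    /* hops when going clockwise */

    if (cw == 0)
        return src;                  /* already at the destination */
    if (cw <= n - cw)
        return (src + 1) % n;        /* clockwise is at least as short */
    return (src + n - 1) % n;        /* counterclockwise is shorter */
}

/* Returns 1 if, for every switch i, some route uses link i -> i+1 and then
 * link i+1 -> i+2. Each clockwise link can then wait on the next one,
 * which closes a cycle of dependencies around the ring. */
int ring_has_clockwise_cycle(int n)
{
    int i;

    for (i = 0; i < n; i++) {
        int mid = (i + 1) % n;
        int after = (i + 2) % n;

        /* Check that the route from i to i+2 goes through i+1. */
        if (ring_next_hop(i, after, n) != mid ||
            ring_next_hop(mid, after, n) != after)
            return 0;
    }
    return 1;
}
```

Under the clockwise tie-breaking assumption, a ring of four switches has such a cycle, because each two-hop route holds one clockwise link while requesting the next. A ring of three switches does not, since every shortest path is a single hop.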
The Memory Hierarchy
Configuration Examples

Example 4
Ring connection between four switches associated with end nodes, where a routing cycle has been removed by adding an additional virtual channel.

[Figure: end nodes n0 to n3, each attached to one of the switches s0 to s3, which are connected in a ring. Links are labeled Virtual Channel 0 and Virtual Channel 1.]
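The slide does not say how packets are assigned to the two virtual channels. One common technique, used here purely as an assumption, is a dateline scheme: a packet travels on virtual channel 0 until it crosses a designated link (here the link from the last switch back to switch 0) and on virtual channel 1 afterwards. The C sketch below covers clockwise routes only, and its function names are hypothetical.

```c
/* Virtual channel used by a packet injected at switch 'src' when it leaves
 * switch 'cur' on a clockwise link. Switches are numbered clockwise and
 * the dateline is the link from switch n-1 to switch 0. A clockwise
 * packet has crossed the dateline exactly when cur < src. */
int ring_virtual_channel(int src, int cur)
{
    return cur < src ? 1 : 0;
}

/* Returns 1 if every packet that crosses the dateline uses channel 0 on
 * the dateline link and channel 1 on the link after it. Channel 0 on the
 * dateline then never waits for channel 0 on the next link, so the cycle
 * of channel-0 dependencies around the ring is broken. */
int ring_dateline_breaks_cycle(int n)
{
    int src;

    /* A packet from switch 0 travels at most n-1 hops and never crosses
     * the dateline, so only sources 1 .. n-1 need to be checked. */
    for (src = 1; src < n; src++)
        if (ring_virtual_channel(src, n - 1) != 0 ||
            ring_virtual_channel(src, 0) != 1)
            return 0;
    return 1;
}
```

Channel 1 stays free of cycles as well, because a packet that has already crossed the dateline reaches its destination before it could cross it a second time.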
Pipeline Visualization Tool
Pipeline Diagrams

─ Cycle bar on main window for navigation.
─ Panel on main window shows software contexts mapped to hardware cores.
─ Clicking on the Detail button opens a secondary window with a pipeline diagram.
Pipeline Visualization Tool
Memory Hierarchy

─ Panel on main window shows how memory accesses traverse the memory hierarchy.
─ Clicking on a Detail button opens a secondary window with the cache memory representation.
─ Each row is a set, each column is a way.
─ Each cell shows the tag and state (color) of a cache block.
─ Additional columns show the number of sharers and in-flight accesses.

Demo 5
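The set/way layout shown by the tool corresponds to the usual set-associative organization, sketched below in C. The geometry constants, structure names, and the way the tag is derived are assumptions made for this illustration; the tag that Multi2Sim displays may be computed differently.

```c
#include <stdint.h>

/* Hypothetical cache geometry chosen for this sketch. */
enum { BLOCK_SIZE = 64, NUM_SETS = 4, NUM_WAYS = 2 };

struct cache_block {
    uint32_t tag;
    int valid;
};

/* One row per set, one column per way, as in the visualization. */
struct cache {
    struct cache_block blocks[NUM_SETS][NUM_WAYS];
};

/* Set (row) that an address maps to. */
unsigned cache_set(uint32_t addr)
{
    return (addr / BLOCK_SIZE) % NUM_SETS;
}

/* Tag stored in the cell: the address bits above the set index. */
uint32_t cache_tag(uint32_t addr)
{
    return addr / BLOCK_SIZE / NUM_SETS;
}

/* Way (column) holding the block for 'addr', or -1 on a miss. */
int cache_find_way(const struct cache *cache, uint32_t addr)
{
    unsigned set = cache_set(addr);
    int way;

    for (way = 0; way < NUM_WAYS; way++)
        if (cache->blocks[set][way].valid &&
            cache->blocks[set][way].tag == cache_tag(addr))
            return way;
    return -1;
}
```

With this geometry, address 0x1234 falls in block 72, which maps to set 0 with tag 18, so a lookup scans the two cells of row 0 for a valid block with that tag.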