Simulation of OpenCL and APUs on Multi2Sim 4.1


  1. Simulation of OpenCL and APUs on Multi2Sim 4.1
     Rafael Ubal, David Kaeli

  2. Outline
     • Introduction
     • Simulation methodology
     • Part 1: Simulation of an x86 CPU
       ─ Disassembler
       ─ Emulation
       ─ Timing simulation
       ─ Memory hierarchy
       ─ Visualization tool
     • Part 2: Simulation of a Southern Islands GPU
       ─ OpenCL from host to device
       ─ Disassembler
       ─ Emulation
       ─ Timing simulation
       ─ Visualization tool
       ─ Case study: ND-Range virtualization
     • Part 3: Concluding Remarks
       ─ Additional Projects
       ─ The Multi2Sim Community
     ISCA 2013, Tel-Aviv

  3. Introduction: Getting Started
     Follow our demos!
     • User accounts for demos
       ─ Machine: fusion1.ece.neu.edu
       ─ User: isca1, isca2, isca3, ...
       ─ Password: isca2013

  4. Introduction: Getting Started
     • Connect to our server
         $ ssh isca<N>@fusion1.ece.neu.edu -X
       (Note the -X option: X forwarding is needed for later demos that use graphics.)
     • Demo descriptions
         $ ls
         demo1 demo2 demo3 demo4 demo5 demo6 demo7 README
       All files needed for each demo are in its corresponding directory. The README files describe the commands to run and how to interpret their output.
     • Download and compile Multi2Sim
         $ wget http://www.multi2sim.org/files/multi2sim-4.1.tar.gz
         $ tar -xzf multi2sim-4.1.tar.gz
         $ cd multi2sim-4.1
         $ ./configure && make

  5. Introduction: First Execution
     • Source code
         #include <stdio.h>
         int main(int argc, char **argv)
         {
             int i;
             printf("Number of arguments: %d\n", argc);
             for (i = 0; i < argc; i++)
                 printf("\targv[%d] = %s\n", i, argv[i]);
             return 0;
         }
     • Native execution
         $ test-args hello there
         Number of arguments: 3
                 argv[0] = test-args
                 argv[1] = hello
                 argv[2] = there
     • Execution on Multi2Sim
         $ m2s test-args hello there
         < Simulator message in stderr >
         Number of arguments: 3
                 argv[0] = test-args
                 argv[1] = hello
                 argv[2] = there
         < Simulator statistics >
     Demo 1

  6. Introduction: Simulator Input/Output Files
     • Example of INI file format
         ; This is a comment.
         [ Section 0 ]
         Color = Red
         Height = 40
         [ OtherSection ]
         Variable = Value
     • Multi2Sim uses the INI file format for
       ─ Configuration files.
       ─ Output statistics.
       ─ Statistics summary in standard error output.
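
The INI layout on this slide can be read line by line. The following is a minimal sketch of a line classifier, assuming only the three constructs shown above (`;` comments, `[ Section ]` headers, and `Variable = Value` pairs). The names `ini_parse_line` and `ini_copy_trimmed` are hypothetical and are not part of Multi2Sim.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Kinds of line that appear in the INI example on the slide. */
enum ini_line_kind { INI_BLANK, INI_COMMENT, INI_SECTION, INI_VARIABLE, INI_INVALID };

/* Copy the range [begin, end) into out (capacity cap), trimming blanks. */
static void ini_copy_trimmed(const char *begin, const char *end, char *out, size_t cap)
{
    size_t len;

    assert(cap > 0);
    while (begin < end && isspace((unsigned char) *begin))
        begin++;
    while (end > begin && isspace((unsigned char) end[-1]))
        end--;
    len = (size_t) (end - begin);
    if (len >= cap)
        len = cap - 1;  /* truncate silently in this sketch */
    memcpy(out, begin, len);
    out[len] = '\0';
}

/* Classify one line (hypothetical helper, not a Multi2Sim function).
 * INI_SECTION:  'name' receives the section name.
 * INI_VARIABLE: 'name' and 'value' receive the two sides of '='. */
enum ini_line_kind ini_parse_line(const char *line, char *name, char *value, size_t cap)
{
    const char *end = line + strlen(line);
    const char *eq;

    assert(name && value && cap > 0);
    name[0] = value[0] = '\0';
    while (line < end && isspace((unsigned char) *line))
        line++;
    while (end > line && isspace((unsigned char) end[-1]))
        end--;
    if (line == end)
        return INI_BLANK;
    if (*line == ';')
        return INI_COMMENT;
    if (*line == '[') {
        if (end[-1] != ']')
            return INI_INVALID;
        ini_copy_trimmed(line + 1, end - 1, name, cap);
        return name[0] ? INI_SECTION : INI_INVALID;
    }
    eq = memchr(line, '=', (size_t) (end - line));
    if (!eq)
        return INI_INVALID;
    ini_copy_trimmed(line, eq, name, cap);
    ini_copy_trimmed(eq + 1, end, value, cap);
    return name[0] ? INI_VARIABLE : INI_INVALID;
}
```

Calling `ini_parse_line("Height = 40", name, value, sizeof name)` with two equally sized buffers would classify the line as a variable and split it at the `=` sign.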

  7. Simulation Methodology: Application-Only vs. Full-System
     [Figure: two simulator stacks, each running guest program 1, guest program 2, ... The full-system simulator core runs a full O.S. and virtualizes the complete processor ISA and the I/O hardware. The application-only simulator core virtualizes a user-space subset of the ISA and the system call interface.]
     • Full-system simulation
       An entire OS runs on top of the simulator. The simulator models the entire ISA, and virtualizes native hardware devices, similar to a virtual machine. Very accurate simulations, but extremely slow.
     • Application-only simulation
       Only an application runs on top of the simulator. The simulator implements a subset of the ISA, and needs to virtualize the system call interface (ABI). Multi2Sim falls in this category.

  8. Simulation Methodology: Four-Stage Simulation Process
     [Figure: the four software modules and the data exchanged between them]
     ─ Disassembler: takes an executable ELF file; produces an instruction dump.
     ─ Emulator (or functional simulator): takes an executable file and program arguments; produces the program output. It passes instruction bytes to the disassembler and receives instruction fields.
     ─ Timing simulator (or detailed/architectural simulator): takes an executable file, program arguments, and a processor configuration; produces performance statistics. It asks the emulator to run one instruction and receives the instruction information.
     ─ Visual tool: takes a pipeline trace from the timing simulator and user interaction; provides cycle navigation and timing diagrams.
     • Modular implementation
       ─ Four clearly different software modules per architecture (x86, MIPS, ...)
       ─ Each module has a standard interface for stand-alone execution, or interaction with other modules.

  9. Simulation Methodology: Current Architecture Support
     Architecture           Disasm.       Emulation     Timing simulation   Graphic pipelines
     ARM                    X             In progress   –                   –
     MIPS                   X             X             –                   –
     x86                    X             X             X                   X
     AMD Evergreen          X             X             X                   X
     AMD Southern Islands   X             X             X                   X
     NVIDIA Fermi           X             In progress   –                   –
     NVIDIA Kepler          In progress   –             –                   –
     • In our latest Multi2Sim SVN repository
       ─ 4 GPU + 3 CPU architectures supported or in progress.
       ─ This tutorial will focus on x86 and AMD Southern Islands.

  10. Part 1: Simulation of an x86 CPU

  11. The x86 Disassembler
      [Figure: the four simulation modules: disassembler, emulator (or functional simulator), timing simulator (or detailed/architectural simulator), visual tool.]

  12. The x86 Disassembler: Methodology
      ─ Implementation of an efficient instruction decoder based on lookup tables.
      ─ When used as a stand-alone tool, the output is provided in exactly the same format as the GNU x86 disassembler, for automatic verification.
      • GNU x86 disassembler
          $ objdump -S -M intel test-args
      • Multi2Sim x86 disassembler
          $ m2s --x86-disasm test-args
      • Verification of common output
          08048900 <_start>:
           8048900: 31 ed            xor ebp,ebp
           8048902: 5e               pop esi
           8048903: 89 e1            mov ecx,esp
           8048905: 83 e4 f0         and esp,0xfffffff0
           8048908: 50               push eax
           8048909: 54               push esp
           804890a: 52               push edx
           804890b: 68 70 91 04 08   push 0x8049170
           ...
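
To illustrate what a decoder based on lookup tables can look like, here is a minimal sketch that indexes a 256-entry table with the first opcode byte. It covers only a few one-byte opcodes that appear in the listing above (`push r32`, `pop r32`, `push imm32`) plus `nop`. The names `opcode_table` and `toy_disasm` are hypothetical. This is not Multi2Sim's decoder, which handles prefixes, ModR/M bytes, and the full instruction set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* x86 32-bit register names in hardware encoding order. */
static const char *const reg32_name[8] = {
    "eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"
};

enum { OPERAND_NONE, OPERAND_REG32, OPERAND_IMM32 };

/* One entry per possible first opcode byte. */
struct opcode_entry {
    const char *mnemonic;  /* NULL: opcode not covered by this sketch */
    int size;              /* total instruction length in bytes */
    int operand;           /* one of the OPERAND_* kinds */
};

static struct opcode_entry opcode_table[256];
static int opcode_table_ready;

/* Fill the table once; push/pop r32 encode the register in the low 3 bits. */
static void opcode_table_init(void)
{
    int r;

    for (r = 0; r < 8; r++) {
        opcode_table[0x50 + r] = (struct opcode_entry) {"push", 1, OPERAND_REG32};
        opcode_table[0x58 + r] = (struct opcode_entry) {"pop", 1, OPERAND_REG32};
    }
    opcode_table[0x68] = (struct opcode_entry) {"push", 5, OPERAND_IMM32};
    opcode_table[0x90] = (struct opcode_entry) {"nop", 1, OPERAND_NONE};
    opcode_table_ready = 1;
}

/* Decode one instruction into 'out'. Returns its size in bytes,
 * or 0 if the opcode is outside this sketch or the buffer is too short. */
int toy_disasm(const unsigned char *bytes, size_t len, char *out, size_t cap)
{
    const struct opcode_entry *e;
    unsigned long imm;

    assert(bytes && out && cap > 0);
    out[0] = '\0';
    if (!opcode_table_ready)
        opcode_table_init();
    if (len == 0)
        return 0;
    e = &opcode_table[bytes[0]];  /* single table lookup, no if/else chain */
    if (!e->mnemonic || (size_t) e->size > len)
        return 0;
    switch (e->operand) {
    case OPERAND_REG32:
        snprintf(out, cap, "%s %s", e->mnemonic, reg32_name[bytes[0] & 7]);
        break;
    case OPERAND_IMM32:  /* little-endian 32-bit immediate */
        imm = (unsigned long) bytes[1] | ((unsigned long) bytes[2] << 8) |
              ((unsigned long) bytes[3] << 16) | ((unsigned long) bytes[4] << 24);
        snprintf(out, cap, "%s 0x%lx", e->mnemonic, imm);
        break;
    default:
        snprintf(out, cap, "%s", e->mnemonic);
        break;
    }
    return e->size;
}
```

Feeding the bytes `68 70 91 04 08` from the listing to `toy_disasm` would yield the text `push 0x8049170`, in the same operand format that the GNU disassembler prints.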

  13. The x86 Emulator
      [Figure: the four simulation modules: disassembler, emulator (or functional simulator), timing simulator (or detailed/architectural simulator), visual tool.]

  14. The x86 Emulator: Program Loading
      1) Parse ELF executable
        ─ Read ELF sections and symbols.
        ─ Initialize code and data.
      2) Initialize stack
        ─ Program headers.
        ─ Arguments.
        ─ Environment variables.
      3) Initialize registers
        ─ Program entry → eip
        ─ Stack pointer → esp
      [Figure: initial virtual memory image, from high to low addresses: stack with program arguments and environment variables (0xc0000000), mmap region (not initialized, 0x40000000), heap, initialized data, text (0x08xxxxxx), down to 0x08000000.]
      [Figure: initial values for x86 registers eax, ebx, ecx, esp (stack pointer), and eip (instruction pointer).]
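
Steps 1 and 3 above can be sketched for the simplest case: checking that a buffer starts with a 32-bit little-endian ELF header, reading the program entry from the `e_entry` field, and placing it in eip. The header offsets follow the ELF specification. The names `elf32_entry_point`, `toy_regs`, and `toy_init_regs` are hypothetical, and the stack pointer is taken as a parameter because building the stack (step 2) is outside this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Only the two registers that program loading must set (hypothetical type). */
struct toy_regs {
    uint32_t eip;  /* instruction pointer */
    uint32_t esp;  /* stack pointer */
};

/* Returns 1 and stores e_entry if 'buf' starts with a 32-bit
 * little-endian ELF header; returns 0 otherwise. */
int elf32_entry_point(const unsigned char *buf, size_t len, uint32_t *entry)
{
    assert(buf && entry);
    if (len < 52)  /* an ELF32 file header is 52 bytes long */
        return 0;
    if (buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
        return 0;
    if (buf[4] != 1)  /* EI_CLASS: 1 = ELFCLASS32 */
        return 0;
    if (buf[5] != 1)  /* EI_DATA: 1 = ELFDATA2LSB (little endian) */
        return 0;
    /* e_entry is a 4-byte field at offset 24 in the ELF32 header. */
    *entry = (uint32_t) buf[24] | ((uint32_t) buf[25] << 8) |
             ((uint32_t) buf[26] << 16) | ((uint32_t) buf[27] << 24);
    return 1;
}

/* Step 3 of the slide: program entry -> eip, stack pointer -> esp. */
int toy_init_regs(const unsigned char *image, size_t len,
                  uint32_t stack_pointer, struct toy_regs *regs)
{
    uint32_t entry;

    assert(regs);
    if (!elf32_entry_point(image, len, &entry))
        return 0;
    regs->eip = entry;
    regs->esp = stack_pointer;
    return 1;
}
```

For the `test-args` binary disassembled earlier, whose `_start` symbol sits at 0x08048900, one would expect `eip` to start at that address, assuming `_start` is the ELF entry point as is usual.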

  15. The x86 Emulator: Emulation Loop
      • Emulation of x86 instructions
        ─ Update x86 registers.
        ─ Update memory map if needed.
        ─ Example: add [bp+16], 0x5
      • Emulation of Linux system calls
        ─ Analyze system call code and arguments.
        ─ Update memory map.
        ─ Update register eax with return value.
        ─ Example: read(fd, buf, count)
      [Figure: emulation loop flowchart. Read instr. at eip (instr. bytes) → Decode instruction (instr. fields) → Instr. is int 0x80? No: emulate x86 instr. Yes: emulate system call. → Move eip to next instr.]
      Demo 2
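
The loop above can be sketched on a toy scale. The example below assumes a tiny subset: `mov r32, imm32` (opcodes 0xb8 to 0xbf) as the only regular instruction, and `int 0x80` (bytes cd 80) with the Linux i386 `exit` call (number 1, status in ebx) as the only implemented system call. All names (`toy_ctx`, `toy_syscall`, `toy_run`) are hypothetical, and `eip` is an offset into the code buffer rather than a virtual address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register indices in x86 encoding order. */
enum { EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI };

struct toy_ctx {
    uint32_t regs[8];
    uint32_t eip;        /* offset of the next instruction in the code buffer */
    int finished;        /* set when the guest calls exit */
    uint32_t exit_code;
};

static uint32_t read_u32(const unsigned char *p)
{
    return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
           ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/* Linux i386 convention: call number in eax, first argument in ebx,
 * return value written back to eax. */
static void toy_syscall(struct toy_ctx *ctx)
{
    if (ctx->regs[EAX] == 1) {  /* exit(status) */
        ctx->finished = 1;
        ctx->exit_code = ctx->regs[EBX];
    } else {
        ctx->regs[EAX] = (uint32_t) -38;  /* -ENOSYS: not implemented here */
    }
}

/* Emulate until the guest exits or an unsupported instruction is found.
 * Returns the number of instructions emulated. */
long toy_run(struct toy_ctx *ctx, const unsigned char *code, size_t len)
{
    long count = 0;

    assert(ctx && code);
    while (!ctx->finished && ctx->eip < len) {
        const unsigned char *instr = code + ctx->eip;  /* read instr. at eip */
        size_t left = len - ctx->eip;

        if (left >= 2 && instr[0] == 0xcd && instr[1] == 0x80) {
            toy_syscall(ctx);  /* instr. is int 0x80 */
            ctx->eip += 2;     /* move eip to next instr. */
        } else if (left >= 5 && (instr[0] & 0xf8) == 0xb8) {
            ctx->regs[instr[0] & 7] = read_u32(instr + 1);  /* mov r32, imm32 */
            ctx->eip += 5;
        } else {
            break;  /* opcode outside this sketch */
        }
        count++;
    }
    return count;
}
```

Running the bytes `b8 01 00 00 00  bb 2a 00 00 00  cd 80` (mov eax,1; mov ebx,42; int 0x80) through `toy_run` would end with `finished` set and `exit_code` equal to 42.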

  16. The x86 Timing Simulator
      [Figure: the four simulation modules: disassembler, emulator (or functional simulator), timing simulator (or detailed/architectural simulator), visual tool.]

  17. The x86 Timing Simulator: Superscalar Processor
      [Figure: superscalar pipeline diagram. Stages: Fetch, Decode, Dispatch, Issue, Writeback, Commit. Structures: instr. cache, trace cache, fetch queue, uop queue, trace queue, reorder buffer, instruction queue, load/store queue, ALU, data cache, reg. file.]
      • Superscalar x86 pipelines
        ─ 6-stage pipeline with configurable latencies.
        ─ Supported features include speculative execution, branch prediction, micro-instruction generation, trace caches, out-of-order execution, …
        ─ Modeled structures include fetch queues, reorder buffer, load-store queues, register files, register mapping tables, ...
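
One of the modeled structures, the reorder buffer, can be sketched as a circular queue in which micro-instructions complete in any order but commit strictly from the head. This is a generic textbook illustration, not Multi2Sim's implementation. `ROB_SIZE`, `rob_dispatch`, `rob_complete`, and `rob_commit` are hypothetical names, and the commit width is an assumed parameter.

```c
#include <assert.h>

#define ROB_SIZE 8  /* assumed capacity for this sketch */

struct rob {
    int completed[ROB_SIZE];  /* 1 once the uop has finished executing */
    int head;                 /* oldest uop in program order */
    int count;                /* number of occupied entries */
};

/* Dispatch: allocate the tail entry. Returns its slot, or -1 if full. */
int rob_dispatch(struct rob *r)
{
    int slot;

    assert(r);
    if (r->count == ROB_SIZE)
        return -1;
    slot = (r->head + r->count) % ROB_SIZE;
    r->completed[slot] = 0;
    r->count++;
    return slot;
}

/* Writeback: mark a uop as completed, in any order. */
void rob_complete(struct rob *r, int slot)
{
    assert(r && slot >= 0 && slot < ROB_SIZE);
    r->completed[slot] = 1;
}

/* Commit: retire up to 'width' completed uops from the head, in order.
 * Stops at the first uop that has not completed. Returns how many retired. */
int rob_commit(struct rob *r, int width)
{
    int n = 0;

    assert(r);
    while (n < width && r->count > 0 && r->completed[r->head]) {
        r->head = (r->head + 1) % ROB_SIZE;
        r->count--;
        n++;
    }
    return n;
}
```

If three uops are dispatched and the two youngest complete first, `rob_commit` retires nothing until the oldest one completes, which is the in-order commit property that lets speculative work be discarded safely.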

  18. The x86 Timing Simulator: Multithreaded and Multicore Processors
      • Multithreading
        ─ Replicated superscalar pipelines with partially shared resources.
        ─ Fine-grain, coarse-grain, and simultaneous multithreading.
      • Multicore
        ─ Fully replicated superscalar pipelines, connected through caches.
        ─ Running multiple programs concurrently, or one program spawning child threads (using OpenMP, pthread, etc.)
      [Figure: n cores with m threads each, connected to the memory hierarchy. The first core holds nodes 0 to m – 1, the second nodes m to 2m – 1, and the last nodes (n – 1)m to nm – 1. Superscalar core, shared resources: reorder buffer, register file, instruction queue, functional units, load/store queue. Hardware thread, private resources: program counter, register aliasing table, TLB.]
      Demo 3

  19. The x86 Timing Simulator: Benchmark Support
      • Single-threaded applications
        ─ SPEC 2000 and SPEC 2006 benchmarks are fully supported. Pre-compiled x86 binaries are available on the website.
        ─ The Mediabench suite includes program binaries and data files, everything needed to run the benchmarks.
      • Multithreaded applications
        ─ SPLASH-2 benchmark suite with pre-compiled x86 executables and data files available on the website.
        ─ PARSEC-2.1 with pre-compiled x86 executables and data files.

  20. The Memory Hierarchy: Configuration
      • Flexible hierarchies
        ─ Any number of caches organized in any number of levels.
        ─ Cache levels connected through default cross-bar interconnects, or complex custom interconnect configurations.
        ─ Each architecture undergoing a timing simulation specifies its own entry point (cache memory) in the memory hierarchy, for data or instructions.
        ─ Cache coherence is guaranteed with an implementation of the 5-state MOESI protocol.
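
As a reminder of what the five MOESI states imply, the sketch below encodes the textbook state transitions for a single cache block. It is a simplified illustration and not Multi2Sim's coherence implementation. The function `moesi_next`, the event names, and the `other_sharers` flag are hypothetical, and data transfers and write-backs are not modeled.

```c
#include <assert.h>

enum moesi_state {
    MOESI_INVALID,
    MOESI_SHARED,
    MOESI_EXCLUSIVE,
    MOESI_OWNED,
    MOESI_MODIFIED
};

enum moesi_event {
    LOCAL_READ,    /* this cache's processor reads the block */
    LOCAL_WRITE,   /* this cache's processor writes the block */
    REMOTE_READ,   /* another cache reads the block */
    REMOTE_WRITE   /* another cache writes the block */
};

/* Next state of one block in one cache (textbook MOESI).
 * 'other_sharers' only matters for a local read miss: nonzero means
 * some other cache already holds a copy of the block. */
enum moesi_state moesi_next(enum moesi_state s, enum moesi_event ev, int other_sharers)
{
    switch (ev) {
    case LOCAL_READ:
        if (s == MOESI_INVALID)
            return other_sharers ? MOESI_SHARED : MOESI_EXCLUSIVE;
        return s;  /* hit: any valid state can serve a read */
    case LOCAL_WRITE:
        return MOESI_MODIFIED;  /* other copies are invalidated first */
    case REMOTE_READ:
        if (s == MOESI_MODIFIED)
            return MOESI_OWNED;   /* keep the dirty data and supply it */
        if (s == MOESI_EXCLUSIVE)
            return MOESI_SHARED;
        return s;  /* Owned, Shared, and Invalid are unchanged */
    case REMOTE_WRITE:
        return MOESI_INVALID;
    }
    assert(0 && "unknown event");
    return MOESI_INVALID;
}
```

The Owned state is what distinguishes MOESI from MESI: a block in Modified state that is read by another cache moves to Owned and can keep supplying its dirty data without first writing it back to the lower level.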
