DQEMU: A Scalable Emulator with Retargetable DBT on Distributed Platforms
Ziyi Zhao, Zhang Jiang, Ximing Liu, Xiaoli Gong* (Nankai University)
Pen-Chung Yew (University of Minnesota)
Wenwen Wang (University of Georgia)
Introduction
Dynamic Binary Translation (DBT): "a key enabling technology"
• Cross-ISA virtualization
• Dynamic instrumentation
Introduction
The scalability of DBT is limited by the host's computing resources
• QEMU is a popular DBT
• Parallel programs from PARSEC
• On a dual-core x64 machine, speedup saturates around 2.0x
Introduction
Goal: enable DBT to utilize compute resources across nodes
[Figure: one guest application running on a distributed DBT that spans multiple host OS / hardware nodes]
Introduction
Goal: enable DBT to utilize compute resources across nodes
In a distributed emulator...
• How to maintain guest cache coherence transparently?
• How to emulate guest system calls, whose side effects go to the host kernel?
• How to emulate guest atomic operations with equivalent atomic semantics between RISC and CISC?
Introduction
How does DBT work?
Tiny Code Generator (TCG): Guest Code → Intermediate Code → Host Code
Introduction
How does DBT work?
[Figure: within one host OS process, a TCG thread translates the guest application's code and executes it against the guest memory region]
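To make the translate-then-execute flow concrete, here is a minimal sketch of a DBT main loop in C. It is not QEMU's actual code; the TranslationBlock layout and the tb_cache_lookup / tb_translate / cpu_next_pc helpers are hypothetical placeholders for the code-cache lookup, the TCG translation step, and guest PC tracking.

```c
/* Hypothetical sketch of a DBT main loop: look up a translated block for
 * the current guest PC, translate it on a miss, then run the generated
 * host code.  Names are placeholders, not QEMU APIs. */
#include <stdint.h>

typedef struct TranslationBlock {
    uint64_t guest_pc;              /* guest virtual address of the block */
    void   (*host_code)(void *cpu); /* pointer to the generated host code */
} TranslationBlock;

/* Assumed helpers provided elsewhere in the emulator. */
TranslationBlock *tb_cache_lookup(uint64_t guest_pc);
TranslationBlock *tb_translate(uint64_t guest_pc);   /* guest -> IR -> host */
uint64_t          cpu_next_pc(void *cpu);

void dbt_exec_loop(void *cpu)
{
    for (;;) {
        uint64_t pc = cpu_next_pc(cpu);
        TranslationBlock *tb = tb_cache_lookup(pc);
        if (!tb) {
            tb = tb_translate(pc);   /* TCG: guest code -> intermediate code -> host code */
        }
        tb->host_code(cpu);          /* run until the block ends or an exception occurs */
    }
}
```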
Implementation
What should a distributed DBT look like?
[Figure: the guest application's memory region becomes distributed shared memory; TCG threads run on a master node and worker nodes 1 and 2, each with its own host OS, coordinated by a manager and a communicator]
Implementation
How to keep cache coherence for the distributed shared memory region?
• At what granularity? Cache line size? Page size? Larger?
• How to check access privilege? Hardware-based (MMU, host page-level check) or software-based instrumentation (a check on every memory access)?
• Which type of protocol? MSI? Distributed or centralized?
Implementation
How to keep cache coherence?
• Utilize the host MMU to do the state check
• Synchronization granularity = 4 KB (host page size)

State      Page Protection
Modified   RW
Shared     R-
Invalid    --
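Below is a minimal sketch, in C, of how the three MSI states could be mapped onto host page protections with mprotect() so that the MMU traps accesses that require a coherence action. It illustrates the RW / R- / -- mapping in the table above; the enum and function names are assumptions, not DQEMU's actual code.

```c
/* Minimal sketch of mapping MSI states onto host page protections so the
 * MMU faults on accesses that need coherence actions.  Illustration only. */
#include <sys/mman.h>

enum msi_state { MSI_MODIFIED, MSI_SHARED, MSI_INVALID };

/* Set the host protection that matches the MSI state of one guest page. */
static int set_page_state(void *page, enum msi_state s)
{
    switch (s) {
    case MSI_MODIFIED: return mprotect(page, 4096, PROT_READ | PROT_WRITE);
    case MSI_SHARED:   return mprotect(page, 4096, PROT_READ);
    case MSI_INVALID:  return mprotect(page, 4096, PROT_NONE);
    }
    return -1;
}
/* A write to a Shared page then raises SIGSEGV; the fault handler can
 * request ownership from the manager before upgrading to Modified. */
```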
Implementation
The problem of system calls
• Syscalls also affect the host kernel: a user-space file descriptor is backed by a kernel-space resource manager
• E.g., fopen("input.txt") issued by a worker thread at node #2 can fail with "file missing" because the file resides on a different node
Implementation
The problem of system calls: syscall delegation
• Local syscalls (executed on the worker node): gettimeofday, clock_gettime, exit, nanosleep, ...
• Global syscalls (delegated to the master node): read, write, openat, open, fstat, close, stat64, lstat64, fstat64, writev, brk, mmap2, mprotect, madvise, munmap, clone, vfork, futex, and all the rest
• The slave (worker) node forwards the syscall parameters and the guest CPU state to the master node
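A hedged sketch of how a worker thread could dispatch guest syscalls under this scheme: a small set of side-effect-free calls is emulated locally, and everything else is packed together with the guest CPU state and sent to the master. The GUEST_NR_* values, the struct, and the helper functions are illustrative placeholders, not DQEMU's real API.

```c
/* Sketch of syscall delegation on a worker node.  Placeholder names only. */
#include <stdbool.h>

enum {   /* real guest (ARM) syscall numbers come from the guest ABI;
            these values are placeholders for illustration */
    GUEST_NR_gettimeofday = 1,
    GUEST_NR_clock_gettime,
    GUEST_NR_nanosleep,
    GUEST_NR_exit,
};

typedef struct GuestCPUState GuestCPUState;          /* guest registers, PC, ... */

long emulate_syscall_locally(GuestCPUState *cpu, int nr);
long delegate_to_master(GuestCPUState *cpu, int nr); /* send parameters + guest CPU state */

static bool is_local_syscall(int nr)
{
    switch (nr) {
    case GUEST_NR_gettimeofday:
    case GUEST_NR_clock_gettime:
    case GUEST_NR_nanosleep:
    case GUEST_NR_exit:
        return true;       /* no side effects outside this worker node */
    default:
        return false;      /* file, memory, and thread syscalls go to the master */
    }
}

long handle_guest_syscall(GuestCPUState *cpu, int nr)
{
    return is_local_syscall(nr) ? emulate_syscall_locally(cpu, nr)
                                : delegate_to_master(cpu, nr);
}
```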
Implementation
The emulation of atomic operations
• RISC (ARM, MIPS, ...): LL (load-linked) / SC (store-conditional)
• CISC (x86): CAS (compare-and-swap)
• How to translate between them?
Implementation
The emulation of atomic operations: hierarchical locking
1. Intra-node: consistency model translation [ArMOR]
2. Inter-node: MSI coherence protocol (sequential)
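The sketch below shows one way the two levels could combine when emulating a guest LL/SC pair on a CAS-only host: the page is first brought to the Modified state (inter-node serialization through the coherence protocol), then the update is performed with a host compare-and-swap (intra-node atomicity). acquire_page_exclusive / release_page are placeholders, and this is an illustration rather than DQEMU's exact implementation.

```c
/* Sketch of emulating a guest ARM LL/SC pair on an x86-like host. */
#include <stdint.h>
#include <stdbool.h>

void acquire_page_exclusive(void *addr);   /* MSI: bring the page to Modified and pin it */
void release_page(void *addr);

/* Emulate: LL(addr); compute new value; SC(addr).  Returns true if the
 * store-conditional "succeeded". */
bool emulate_ll_sc(uint32_t *addr, uint32_t (*compute)(uint32_t))
{
    acquire_page_exclusive(addr);              /* inter-node: serialize via coherence    */
    uint32_t loaded  = __atomic_load_n(addr, __ATOMIC_ACQUIRE);      /* LL               */
    uint32_t updated = compute(loaded);
    bool ok = __atomic_compare_exchange_n(addr, &loaded, updated,    /* SC via host CAS  */
                                          false, __ATOMIC_ACQ_REL,
                                          __ATOMIC_ACQUIRE);
    release_page(addr);
    return ok;
}
```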
Optimization
Page splitting: the false-sharing overhead
• Probability: the coherence granularity grows from a 64 B cache line to a 4096 B page, so unrelated data is far more likely to collide
• Cost: a cache miss costs ~23 cycles, while a network transfer plus page fault costs >= 120,000 cycles
Optimization
Page splitting: mitigating the false-sharing overhead
• Reduces the possibility of false sharing
• Remains compatible with the cache coherence protocol
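One way to track a split page, sketched below under the assumption of a fixed sub-page block size: each block of the 4 KB page carries its own coherence state and owner, so writes by different nodes to different blocks no longer invalidate each other. The block size and data layout are illustrative, not DQEMU's actual structures.

```c
/* Sketch of per-block state for a split page.  Illustrative layout only. */
#include <stdint.h>

#define PAGE_SIZE   4096
#define BLOCK_SIZE  256                       /* assumed sub-page granularity  */
#define BLOCKS      (PAGE_SIZE / BLOCK_SIZE)  /* 16 blocks per split page      */

enum msi_state { MSI_MODIFIED, MSI_SHARED, MSI_INVALID };

struct split_page {
    uintptr_t      base;                 /* page-aligned guest address           */
    enum msi_state state[BLOCKS];        /* per-block coherence state            */
    int            owner[BLOCKS];        /* node currently holding it Modified   */
};

/* Map a faulting address inside a split page to its block index. */
static int block_index(const struct split_page *p, uintptr_t addr)
{
    return (int)((addr - p->base) / BLOCK_SIZE);
}
```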
Optimization
Hint-based thread scheduling: data sharing among nodes
[Figure: the same architecture as before (master node plus slave nodes 1 and 2 over distributed shared memory), highlighting data sharing between TCG threads placed on different nodes]
Optimization
Hint-based thread scheduling: data sharing among nodes
• A hint in the guest source code means "call the DQEMU scheduler" to the DBT, so threads that share data can be scheduled onto the same node
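As a hedged illustration of what such a source-code hint might look like on the guest side, the macro below expands to a call that the DBT can recognize and turn into an invocation of the DQEMU scheduler. The use of an unused syscall number as the hint channel, the DQEMU_HINT name, and the magic value are assumptions for illustration only, not DQEMU's actual interface.

```c
/* Hypothetical guest-side scheduling hint. */
#define _GNU_SOURCE
#include <unistd.h>

#define DQEMU_HINT_MAGIC 0x0D0E                 /* assumed unused syscall number */
#define DQEMU_HINT(group_id) syscall(DQEMU_HINT_MAGIC, (long)(group_id))

/* Example: threads working on the same data partition announce it, so the
 * scheduler can co-locate them on one node. */
static void *worker(void *arg)
{
    long partition = (long)arg;
    DQEMU_HINT(partition);   /* the DBT intercepts this and calls its scheduler */
    /* ... compute on this partition ... */
    return arg;
}
```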
Optimization
Page forwarding: hiding the network latency
• Record remote page faults over the contiguous virtual memory space
• When a sequential pattern is detected, trigger forwarding / prefetching of the upcoming pages (e.g., 10 pages, then 20 pages) into a local page cache
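A minimal sketch of the idea: the remote page-fault handler records faulting pages, and once consecutive faults hit consecutive virtual pages it requests a batch of upcoming pages so their transfer overlaps with execution. The batch sizes echo the 10 and 20 pages on the slide; the function names and the exact policy are assumptions.

```c
/* Sketch of prefetch-on-sequential-fault for remote pages. */
#include <stdint.h>

#define PAGE_SIZE 4096u

void request_pages(uintptr_t first_page, unsigned count);  /* fetch into the local page cache */

static uintptr_t last_fault_page;
static unsigned  run_length;      /* how many sequential faults in a row */

void on_remote_page_fault(uintptr_t fault_addr)
{
    uintptr_t page = fault_addr & ~(uintptr_t)(PAGE_SIZE - 1);

    if (page == last_fault_page + PAGE_SIZE) {
        run_length++;
    } else {
        run_length = 0;
    }
    last_fault_page = page;

    if (run_length >= 2) {
        /* Sequential pattern detected: forward a batch ahead of demand,
           growing the batch as the pattern persists (e.g., 10 then 20). */
        unsigned batch = run_length < 4 ? 10 : 20;
        request_pages(page + PAGE_SIZE, batch);
    } else {
        request_pages(page, 1);     /* just the faulting page */
    }
}
```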
Results
Experiment Setup

Processor:  Quad-core Intel i5-6500 @ 3.30 GHz (CPU), 12 GB (Memory)
Network:    TP-Link TL-SG1024DT Gigabit Switch
Kernel:     Linux 4.15.0 (Ubuntu 18.04)
Baseline:   QEMU-4.2.0
ISA:        Guest: ARM, Host: x64
Workload:   micro-benchmarks, PARSEC-3.0
Results
Memory Access Performance

System   Access Type               Throughput (MB/s)   Latency (us)
QEMU     Sequential access         173.06              -
DQEMU    Remote sequential access  7.88                410.5
DQEMU    Page forwarding enabled   108.01              83.2
Results
Memory Access Performance (false sharing)

Access Type                Throughput (MB/s)
QEMU access of 128 bytes   20,259
False sharing of 1 page    2,216
Page splitting enabled     75,294
Results
Atomic Operation Performance
[Charts: elapsed time (s) vs. number of slave nodes (1 to 6), comparing QEMU-1 and DQEMU-1]
Results
Scalability: ideal case
[Chart: normalized speedup vs. slave nodes (1 to 6), DQEMU against the ideal line; DQEMU reaches about 1.04, 1.97, 2.97, 3.98, 4.93, and 5.94 on 1 to 6 slave nodes]
Results
Scalability: parallel programs
[Chart: blackscholes, normalized speedup vs. slave nodes (1 to 6); series: origin and qemu-4.2.0]
Results
Scalability: parallel programs
[Chart: blackscholes, normalized speedup vs. slave nodes (1 to 6); series: origin, forwarding, full, and qemu-4.2.0]
Results
Scalability: a program with heavy data sharing
[Charts: x264, normalized time vs. slave nodes (1 to 6), with time broken down into exec, pagefault, and syscall]
Discussion
• A more scalable coherence protocol?
• Random memory access hurts DSM performance
• What kind of program suits DQEMU, and how can it be recognized?
• Support for various host ISAs: heterogeneous computing?
Thank you! Q&A