DQEMU: A Scalable Emulator with Retargetable DBT on Distributed Platforms
Ziyi Zhao, Zhang Jiang, Ximing Liu, Xiaoli Gong* (Nankai University)
Pen-Chung Yew (University of Minnesota)
Wenwen Wang (University of Georgia)
Introduction
Dynamic Binary Translation (DBT): "a key enabling technology"
• Cross-ISA virtualization
• Dynamic instrumentation
Introduction
The scalability of DBT is limited by the host's computing resources
• QEMU is a popular DBT
• Parallel programs from PARSEC
• On a dual-core x64 machine, speedup saturates around 2.0x
Introduction
Goal: enable DBT to utilize compute resources across nodes
[Figure: one guest application running on a distributed DBT that spans multiple host OS / hardware nodes]
Introduction
Goal: enable DBT to utilize compute resources across nodes
In a distributed emulator...
• How to maintain guest cache coherence transparently?
• How to emulate guest system calls, whose side effects go to the host kernel?
• How to emulate guest atomic operations with equivalent atomic semantics between RISC and CISC?
Introduction
How does DBT work?
Tiny Code Generator (TCG): Guest Code → Intermediate Code → Host Code
Introduction
How does DBT work?
[Figure: within one host OS process, a TCG thread translates the guest application's code and executes it against the guest memory region]
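To make the translate-then-execute flow concrete, here is a minimal sketch of a DBT main loop in C. It is not QEMU's actual code; the TranslationBlock layout and the tb_cache_lookup / tb_translate / cpu_next_pc helpers are hypothetical placeholders for the code-cache lookup, the TCG translation step, and guest PC tracking.

```c
/* Hypothetical sketch of a DBT main loop: look up a translated block for
 * the current guest PC, translate it on a miss, then run the generated
 * host code.  Names are placeholders, not QEMU APIs. */
#include <stdint.h>

typedef struct TranslationBlock {
    uint64_t guest_pc;              /* guest virtual address of the block */
    void   (*host_code)(void *cpu); /* pointer to the generated host code */
} TranslationBlock;

/* Assumed helpers provided elsewhere in the emulator. */
TranslationBlock *tb_cache_lookup(uint64_t guest_pc);
TranslationBlock *tb_translate(uint64_t guest_pc);   /* guest -> IR -> host */
uint64_t          cpu_next_pc(void *cpu);

void dbt_exec_loop(void *cpu)
{
    for (;;) {
        uint64_t pc = cpu_next_pc(cpu);
        TranslationBlock *tb = tb_cache_lookup(pc);
        if (!tb) {
            tb = tb_translate(pc);   /* TCG: guest code -> intermediate code -> host code */
        }
        tb->host_code(cpu);          /* run until the block ends or an exception occurs */
    }
}
```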
Implementation
What should a distributed DBT look like?
[Figure: the guest application's memory region becomes distributed shared memory; TCG threads run on a master node and worker nodes 1 and 2, each with its own host OS, coordinated by a manager and a communicator]
Implementation
How to keep cache coherence for the distributed shared memory region?
• At what granularity? Cache line size? Page size? Larger?
• How to check access privilege? Hardware-based (MMU, host page-level check) or software-based instrumentation (a check on every memory access)?
• Which type of protocol? MSI? Distributed or centralized?
Implementation
How to keep cache coherence?
• Utilize the host MMU to do the state check
• Synchronization granularity = 4 KB (host page size)

State      Page Protection
Modified   RW
Shared     R-
Invalid    --
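Below is a minimal sketch, in C, of how the three MSI states could be mapped onto host page protections with mprotect() so that the MMU traps accesses that require a coherence action. It illustrates the RW / R- / -- mapping in the table above; the enum and function names are assumptions, not DQEMU's actual code.

```c
/* Minimal sketch of mapping MSI states onto host page protections so the
 * MMU faults on accesses that need coherence actions.  Illustration only. */
#include <sys/mman.h>

enum msi_state { MSI_MODIFIED, MSI_SHARED, MSI_INVALID };

/* Set the host protection that matches the MSI state of one guest page. */
static int set_page_state(void *page, enum msi_state s)
{
    switch (s) {
    case MSI_MODIFIED: return mprotect(page, 4096, PROT_READ | PROT_WRITE);
    case MSI_SHARED:   return mprotect(page, 4096, PROT_READ);
    case MSI_INVALID:  return mprotect(page, 4096, PROT_NONE);
    }
    return -1;
}
/* A write to a Shared page then raises SIGSEGV; the fault handler can
 * request ownership from the manager before upgrading to Modified. */
```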
Implementation
The problem of system calls
• Syscalls also affect the host kernel: a user-space file descriptor is backed by a kernel-space resource manager
• E.g., fopen("input.txt") issued by a worker thread at node #2 can fail with "file missing" because the file resides on a different node
Implementation
The problem of system calls: syscall delegation
• Local syscalls (executed on the worker node): gettimeofday, clock_gettime, exit, nanosleep, ...
• Global syscalls (delegated to the master node): read, write, openat, open, fstat, close, stat64, lstat64, fstat64, writev, brk, mmap2, mprotect, madvise, munmap, clone, vfork, futex, and all the rest
• The slave (worker) node forwards the syscall parameters and the guest CPU state to the master node
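A hedged sketch of how a worker thread could dispatch guest syscalls under this scheme: a small set of side-effect-free calls is emulated locally, and everything else is packed together with the guest CPU state and sent to the master. The GUEST_NR_* values, the struct, and the helper functions are illustrative placeholders, not DQEMU's real API.

```c
/* Sketch of syscall delegation on a worker node.  Placeholder names only. */
#include <stdbool.h>

enum {   /* real guest (ARM) syscall numbers come from the guest ABI;
            these values are placeholders for illustration */
    GUEST_NR_gettimeofday = 1,
    GUEST_NR_clock_gettime,
    GUEST_NR_nanosleep,
    GUEST_NR_exit,
};

typedef struct GuestCPUState GuestCPUState;          /* guest registers, PC, ... */

long emulate_syscall_locally(GuestCPUState *cpu, int nr);
long delegate_to_master(GuestCPUState *cpu, int nr); /* send parameters + guest CPU state */

static bool is_local_syscall(int nr)
{
    switch (nr) {
    case GUEST_NR_gettimeofday:
    case GUEST_NR_clock_gettime:
    case GUEST_NR_nanosleep:
    case GUEST_NR_exit:
        return true;       /* no side effects outside this worker node */
    default:
        return false;      /* file, memory, and thread syscalls go to the master */
    }
}

long handle_guest_syscall(GuestCPUState *cpu, int nr)
{
    return is_local_syscall(nr) ? emulate_syscall_locally(cpu, nr)
                                : delegate_to_master(cpu, nr);
}
```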
Implementation
The emulation of atomic operations
• RISC (ARM, MIPS, ...): LL (load-linked) / SC (store-conditional)
• CISC (x86): CAS (compare-and-swap)
• How to translate between them?
Implementation
The emulation of atomic operations: hierarchical locking
1. Intra-node: consistency model translation [ArMOR]
2. Inter-node: MSI coherence protocol (sequential)
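The sketch below shows one way the two levels could combine when emulating a guest LL/SC pair on a CAS-only host: the page is first brought to the Modified state (inter-node serialization through the coherence protocol), then the update is performed with a host compare-and-swap (intra-node atomicity). acquire_page_exclusive / release_page are placeholders, and this is an illustration rather than DQEMU's exact implementation.

```c
/* Sketch of emulating a guest ARM LL/SC pair on an x86-like host. */
#include <stdint.h>
#include <stdbool.h>

void acquire_page_exclusive(void *addr);   /* MSI: bring the page to Modified and pin it */
void release_page(void *addr);

/* Emulate: LL(addr); compute new value; SC(addr).  Returns true if the
 * store-conditional "succeeded". */
bool emulate_ll_sc(uint32_t *addr, uint32_t (*compute)(uint32_t))
{
    acquire_page_exclusive(addr);              /* inter-node: serialize via coherence    */
    uint32_t loaded  = __atomic_load_n(addr, __ATOMIC_ACQUIRE);      /* LL               */
    uint32_t updated = compute(loaded);
    bool ok = __atomic_compare_exchange_n(addr, &loaded, updated,    /* SC via host CAS  */
                                          false, __ATOMIC_ACQ_REL,
                                          __ATOMIC_ACQUIRE);
    release_page(addr);
    return ok;
}
```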
Optimization
Page splitting: the false-sharing overhead
• Probability: the coherence granularity grows from a 64 B cache line to a 4096 B page, so unrelated data is far more likely to collide
• Cost: a cache miss costs ~23 cycles, while a network transfer plus page fault costs >= 120,000 cycles
Optimization
Page splitting: mitigating the false-sharing overhead
• Reduces the possibility of false sharing
• Remains compatible with the cache coherence protocol
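One way to track a split page, sketched below under the assumption of a fixed sub-page block size: each block of the 4 KB page carries its own coherence state and owner, so writes by different nodes to different blocks no longer invalidate each other. The block size and data layout are illustrative, not DQEMU's actual structures.

```c
/* Sketch of per-block state for a split page.  Illustrative layout only. */
#include <stdint.h>

#define PAGE_SIZE   4096
#define BLOCK_SIZE  256                       /* assumed sub-page granularity  */
#define BLOCKS      (PAGE_SIZE / BLOCK_SIZE)  /* 16 blocks per split page      */

enum msi_state { MSI_MODIFIED, MSI_SHARED, MSI_INVALID };

struct split_page {
    uintptr_t      base;                 /* page-aligned guest address           */
    enum msi_state state[BLOCKS];        /* per-block coherence state            */
    int            owner[BLOCKS];        /* node currently holding it Modified   */
};

/* Map a faulting address inside a split page to its block index. */
static int block_index(const struct split_page *p, uintptr_t addr)
{
    return (int)((addr - p->base) / BLOCK_SIZE);
}
```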
Optimization
Hint-based thread scheduling: data sharing among nodes
[Figure: the same architecture as before (master node plus slave nodes 1 and 2 over distributed shared memory), highlighting data sharing between TCG threads placed on different nodes]
Optimization
Hint-based thread scheduling: data sharing among nodes
• A hint in the guest source code means "call the DQEMU scheduler" to the DBT, so threads that share data can be scheduled onto the same node
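As a hedged illustration of what such a source-code hint might look like on the guest side, the macro below expands to a call that the DBT can recognize and turn into an invocation of the DQEMU scheduler. The use of an unused syscall number as the hint channel, the DQEMU_HINT name, and the magic value are assumptions for illustration only, not DQEMU's actual interface.

```c
/* Hypothetical guest-side scheduling hint. */
#define _GNU_SOURCE
#include <unistd.h>

#define DQEMU_HINT_MAGIC 0x0D0E                 /* assumed unused syscall number */
#define DQEMU_HINT(group_id) syscall(DQEMU_HINT_MAGIC, (long)(group_id))

/* Example: threads working on the same data partition announce it, so the
 * scheduler can co-locate them on one node. */
static void *worker(void *arg)
{
    long partition = (long)arg;
    DQEMU_HINT(partition);   /* the DBT intercepts this and calls its scheduler */
    /* ... compute on this partition ... */
    return arg;
}
```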
Optimization
Page forwarding: hiding the network latency
• Record remote page faults over the contiguous virtual memory space
• When a sequential pattern is detected, trigger forwarding / prefetching of the upcoming pages (e.g., 10 pages, then 20 pages) into a local page cache
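A minimal sketch of the idea: the remote page-fault handler records faulting pages, and once consecutive faults hit consecutive virtual pages it requests a batch of upcoming pages so their transfer overlaps with execution. The batch sizes echo the 10 and 20 pages on the slide; the function names and the exact policy are assumptions.

```c
/* Sketch of prefetch-on-sequential-fault for remote pages. */
#include <stdint.h>

#define PAGE_SIZE 4096u

void request_pages(uintptr_t first_page, unsigned count);  /* fetch into the local page cache */

static uintptr_t last_fault_page;
static unsigned  run_length;      /* how many sequential faults in a row */

void on_remote_page_fault(uintptr_t fault_addr)
{
    uintptr_t page = fault_addr & ~(uintptr_t)(PAGE_SIZE - 1);

    if (page == last_fault_page + PAGE_SIZE) {
        run_length++;
    } else {
        run_length = 0;
    }
    last_fault_page = page;

    if (run_length >= 2) {
        /* Sequential pattern detected: forward a batch ahead of demand,
           growing the batch as the pattern persists (e.g., 10 then 20). */
        unsigned batch = run_length < 4 ? 10 : 20;
        request_pages(page + PAGE_SIZE, batch);
    } else {
        request_pages(page, 1);     /* just the faulting page */
    }
}
```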
Results
Experiment Setup

Processor:  Quad-core Intel i5-6500 @ 3.30 GHz (CPU), 12 GB (Memory)
Network:    TP-Link TL-SG1024DT Gigabit Switch
Kernel:     Linux 4.15.0 (Ubuntu 18.04)
Baseline:   QEMU-4.2.0
ISA:        Guest: ARM, Host: x64
Workload:   micro-benchmarks, PARSEC-3.0
Results
Memory Access Performance

System   Access Type               Throughput (MB/s)   Latency (us)
QEMU     Sequential access         173.06              -
DQEMU    Remote sequential access  7.88                410.5
DQEMU    Page forwarding enabled   108.01              83.2
Results
Memory Access Performance (false sharing)

Access Type                Throughput (MB/s)
QEMU access of 128 bytes   20,259
False sharing of 1 page    2,216
Page splitting enabled     75,294
Results
Atomic Operation Performance
[Charts: elapsed time (s) vs. number of slave nodes (1 to 6), comparing QEMU-1 and DQEMU-1]
Results
Scalability: ideal case
[Chart: normalized speedup vs. slave nodes (1 to 6), DQEMU against the ideal line; DQEMU reaches about 1.04, 1.97, 2.97, 3.98, 4.93, and 5.94 on 1 to 6 slave nodes]
Results
Scalability: parallel programs
[Chart: blackscholes, normalized speedup vs. slave nodes (1 to 6); series: origin and qemu-4.2.0]
Results
Scalability: parallel programs
[Chart: blackscholes, normalized speedup vs. slave nodes (1 to 6); series: origin, forwarding, full, and qemu-4.2.0]
Results
Scalability: a program with heavy data sharing
[Charts: x264, normalized time vs. slave nodes (1 to 6), with time broken down into exec, pagefault, and syscall]
Discussion
• A more scalable coherence protocol?
• Random memory access hurts DSM performance
• What kind of program suits DQEMU, and how can it be recognized?
• Support for various host ISAs: heterogeneous computing?
Thank you! Q&A