Faculty of Computer Science, Institute for System Architecture, Operating Systems Group
M3: INTEGRATING ARBITRARY COMPUTE UNITS AS FIRST-CLASS CITIZENS
OS: Nils Asmussen, Hermann Härtig, Marcus Völp
EE: Benedikt Nöthen, Gerhard Fettweis
Dagstuhl Seminar, 02/09/2017

Why?
• FPGA-based memcached 16x better in performance per watt than Atom CPU [1]
• Machine learning accelerator is 20% faster than FPGA and requires 128 times less energy [2]
• ...
[1] Thin servers with smart pipes: Designing SoC accelerators for memcached, ISCA’13
[2] PuDianNao: A polyvalent machine learning accelerator, ASPLOS’15
Nils Asmussen Slide 2 of 16

The Problem for OSes
[Figure, built up over four steps: a heterogeneous system with ARM big and LITTLE cores, Intel Xeon cores, DSPs, an audio decoder and an FPGA. Kernels are placed on the ARM and Xeon cores only; the final step shows just those cores with their kernels.]

The Goal
Treat all compute units (CUs) as first-class citizens:
1. Run untrusted code without causing harm
2. Access operating system services
3. Interact as the master with other CUs

First-class Citizenship as Enabler
• Pipe communication between arbitrary CUs
• Use parallelism on GPUs for FS operations
• Direct access to accelerators from the net
• ...

M3 Approach – Hardware
[Figure, built up over three steps: each compute unit (ARM big, ARM LITTLE, Intel Xeon, DSP, audio decoder, FPGA) is paired with local memory (Mem) and a data transfer unit (DTU); each such combination forms a processing element (PE).]
Asmussen et al.: M3: A Hardware/OS Co-Design to Tame Heterogeneous Manycores, ASPLOS’16

M3 Approach – Software
[Figure: the same PEs as on the hardware slide; the kernel occupies one PE and applications (App) occupy the others, each PE with its own Mem and DTU.]
Asmussen et al.: M3: A Hardware/OS Co-Design to Tame Heterogeneous Manycores, ASPLOS’16

Data Transfer Unit
• Supports memory access and message passing
• Provides a number of endpoints
• Each endpoint can be configured for:
  1. Accessing memory (contiguous range, byte granular)
  2. Receiving messages into a receive buffer
  3. Sending messages to a receiving endpoint
• Configuration only by kernel, usage by application
• Credit system to prevent DoS attacks
• Direct reply on received messages

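The credit system listed above can be illustrated with a small host-side model. This is only a sketch under stated assumptions: `RecvEndpoint`, `SendEndpoint`, `ep_send` and `fetch_and_reply` are hypothetical names, not part of the DTU or of M3, and the rule that a reply hands one credit back to the sender is an assumption of this model rather than something the slide states.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical model of a receive endpoint: a bounded buffer of messages.
struct RecvEndpoint {
    std::size_t slots;                // capacity of the receive buffer
    std::deque<std::string> buffer;   // messages waiting to be fetched
};

// Hypothetical model of a send endpoint: a target plus a credit budget.
struct SendEndpoint {
    RecvEndpoint* target;             // receiving endpoint this one sends to
    unsigned credits;                 // each send consumes one credit
};

// Sending fails when no credits are left or the receive buffer is full,
// so a sender cannot flood the receiver.
bool ep_send(SendEndpoint& ep, std::string msg) {
    if (ep.credits == 0 || ep.target->buffer.size() >= ep.target->slots)
        return false;
    --ep.credits;
    ep.target->buffer.push_back(std::move(msg));
    return true;
}

// Fetches the oldest message and replies to it. In this model (assumption),
// the reply returns one credit to the sender.
bool fetch_and_reply(RecvEndpoint& rep, SendEndpoint& sender, std::string& out) {
    if (rep.buffer.empty())
        return false;
    out = std::move(rep.buffer.front());
    rep.buffer.pop_front();
    ++sender.credits;
    return true;
}
```

With two credits, a third send is rejected until the receiver has replied to an earlier message; the sender's rate is therefore bounded by how quickly the receiver answers.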
M3 System Call
[Figure: the kernel and an application, each on a PE with Mem and a DTU; a send endpoint (S) on one DTU is connected to a receive endpoint (R) on the other.]

Prototype Platform: Tomahawk 2
[Figure: a grid of PEs, each attached to a router (R), plus DRAM; one PE is shown in detail with the labels Xtensa LX4, Instr. SPM, Data SPM, Mem Ctrl. and DTU.]
PEs have no OS support:
• No privileged mode
• No MMU
• No caches, but SPM

Prototype Platform: gem5
[Figure: simulated platform with x86 PEs and a hash accelerator PE, each attached to a DTU; further labels include L1, L2, SPM, VM, DRAM and DRAM Ctl.]

M3
• Microkernel-based systeM for het. Manycores
• Mechanisms for PEs, memory and communication
• Drivers, filesystems, ... are implemented on top
• Kernel manages permissions
• DTU enforces permissions (communication, memory access)
• Kernel is independent of other CUs in the system

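The split between the kernel managing permissions and the DTU enforcing them can be sketched as follows. Everything here is a hypothetical host-side model: `MockDTU`, `MemEndpoint`, `PERM_R` and `PERM_W` are illustrative names, and the plain `std::vector` stands in for memory reachable through the DTU. In the model, `configure()` plays the kernel's role, while `read()` and `write()` are what an application may call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

enum Perm : unsigned { PERM_R = 1, PERM_W = 2 };

// Hypothetical memory endpoint: a contiguous, byte-granular window.
struct MemEndpoint {
    std::size_t base;    // start of the window in the backing memory
    std::size_t size;    // length of the window in bytes
    unsigned perms;      // bitwise OR of PERM_R / PERM_W
};

class MockDTU {
public:
    explicit MockDTU(std::vector<std::uint8_t>& mem) : mem_(mem), ep_{0, 0, 0} {}

    // Kernel side (assumption of this model): only the kernel configures.
    // Rejects windows that do not fit into the backing memory.
    bool configure(const MemEndpoint& ep) {
        if (ep.base > mem_.size() || ep.size > mem_.size() - ep.base)
            return false;
        ep_ = ep;
        return true;
    }

    // Application side: every access is checked against the endpoint.
    bool write(std::size_t off, const void* src, std::size_t len) {
        if (!allowed(off, len, PERM_W)) return false;
        std::memcpy(mem_.data() + ep_.base + off, src, len);
        return true;
    }
    bool read(std::size_t off, void* dst, std::size_t len) const {
        if (!allowed(off, len, PERM_R)) return false;
        std::memcpy(dst, mem_.data() + ep_.base + off, len);
        return true;
    }

private:
    // Overflow-safe check that [off, off + len) lies inside the window
    // and that the required permission bit is set.
    bool allowed(std::size_t off, std::size_t len, unsigned need) const {
        return (ep_.perms & need) == need && off <= ep_.size && len <= ep_.size - off;
    }

    std::vector<std::uint8_t>& mem_;
    MemEndpoint ep_;
};
```

Because the check sits in the access path rather than in the application, the application needs no privileged mode or MMU of its own for the restriction to hold, which matches the deck's point that PEs without OS support can still be isolated.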
Virtual PEs
• Comparable to a process with 0/1 threads
• Creating VPE yields a VPE cap. and memory cap.
• Library provides primitives like fork and exec
Execute function on different PE
    VPE vpe;
    // run the lambda on the PE backing this VPE
    vpe.run_async([]() {
        Serial::get() << "Hello World!\n";
        return 0;
    });
    // wait for the lambda to finish and fetch its exit code
    int exitcode = vpe.wait();

Virtual PEs
• VPE with 0 threads for HW accelerators
• Allows direct access for applications
• Time-multiplexed by the kernel
Access an accelerator
    VPE vpe(VPEDesc::HASH_ACCEL);
    SendGate sg(vpe);
    // send the arguments 1, 2, 3 and wait for the reply
    GateIStream reply = send_receive_vmsg(sg, 1, 2, 3);
    int res;
    // unmarshal the result from the reply
    reply >> res;

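The call pattern above (pack a few values into a message, then read the result with `>>`) can be mimicked on a host machine. The following is a sketch only: `MsgStream`, `pack` and `mock_send_receive` are hypothetical names and do not reflect how M3 implements `send_receive_vmsg` or `GateIStream`. The mock "accelerator" simply sums its three arguments; that behavior is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical message buffer with stream-style (un)marshalling.
class MsgStream {
public:
    // Appends the raw bytes of a trivially copyable value.
    template <typename T>
    MsgStream& operator<<(const T& v) {
        static_assert(std::is_trivially_copyable<T>::value,
                      "only trivially copyable types can be marshalled");
        const unsigned char* p = reinterpret_cast<const unsigned char*>(&v);
        bytes_.insert(bytes_.end(), p, p + sizeof(T));
        return *this;
    }

    // Reads the next value in the order it was written.
    template <typename T>
    MsgStream& operator>>(T& v) {
        assert(pos_ + sizeof(T) <= bytes_.size());
        std::memcpy(&v, bytes_.data() + pos_, sizeof(T));
        pos_ += sizeof(T);
        return *this;
    }

private:
    std::vector<unsigned char> bytes_;
    std::size_t pos_ = 0;
};

// Variadic packing: writes all arguments into the stream, left to right.
inline void pack(MsgStream&) {}
template <typename T, typename... Rest>
void pack(MsgStream& s, const T& first, const Rest&... rest) {
    s << first;
    pack(s, rest...);
}

// Stand-in for sending a request and receiving a reply. The "accelerator"
// here just adds its three inputs (made-up behavior).
MsgStream mock_send_receive(int a, int b, int c) {
    MsgStream request;
    pack(request, a, b, c);

    int x, y, z;
    request >> x >> y >> z;   // accelerator side: unmarshal the arguments

    MsgStream reply;
    reply << (x + y + z);     // accelerator side: marshal the result
    return reply;
}
```

A caller would write `MsgStream reply = mock_send_receive(1, 2, 3); int res; reply >> res;`, mirroring the shape of the slide's snippet.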
Filesystem: m3fs
[Figure, built up over four steps: the kernel, an application and the m3fs service, each on a PE with Mem and a DTU, plus DRAM. Send (S) and receive (R) endpoints connect the three; in the final step the application's DTU additionally shows an endpoint labeled M.]

Performance Comparison
[Figure: bar chart of time in M cycles (axis from 0 to 7), broken down into App, Xfers and OS, comparing M3 and Lx for the benchmarks tar, untar, find and sqlite.]

Summary
• M3 uses a HW/SW co-design
• DTU creates a common interface for all CUs
• M3 kernel controls DTUs remotely
• Allows all CUs to be treated as first-class citizens