Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation
Yiying Zhang
Monolithic Computer
[Figure: a monolithic computer managed by a single OS / Hypervisor]
Can monolithic servers continue to meet datacenter needs?
[Figure labels: Application, Hardware; Heterogeneity, Flexibility, Perf / $]
[Figure: TPU, GPU, FPGA, HBM, NVM, ASIC, DNA Storage, NVMe]
Making new hardware work with existing servers is like fitting puzzle pieces
Poor Hardware Elasticity
• Hard to change hardware components
  - Add (hotplug), remove, reconfigure, restart
• No fine-grained failure handling
  - The failure of one device can crash a whole machine
Poor Resource Utilization
• Whole VM/container has to run on one physical machine
  - Move current applications to make room for new ones
[Figure: CPU and memory usage of Job 1 and Job 2 on Server 1 and Server 2, comparing available space with required space; leftover capacity is marked "wasted!"]
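The fragmentation problem on this slide can be illustrated with a small sketch. The server sizes and the new job's demand below are made-up numbers, not values from the talk; the point is only that a job can fail to fit on any single server even though the cluster as a whole has enough free CPU and memory.

```python
from dataclasses import dataclass


@dataclass
class Server:
    free_cpu: int      # free cores (hypothetical numbers)
    free_mem_gb: int   # free memory in GB (hypothetical numbers)


def fits_on_one_server(servers, cpu, mem_gb):
    """Monolithic placement: one server must have enough of BOTH resources."""
    return any(s.free_cpu >= cpu and s.free_mem_gb >= mem_gb for s in servers)


def fits_in_pool(servers, cpu, mem_gb):
    """Disaggregated placement: only the cluster-wide totals matter."""
    total_cpu = sum(s.free_cpu for s in servers)
    total_mem = sum(s.free_mem_gb for s in servers)
    return total_cpu >= cpu and total_mem >= mem_gb


# Assumed leftovers after Job 1 and Job 2 were placed.
servers = [Server(free_cpu=6, free_mem_gb=2), Server(free_cpu=1, free_mem_gb=12)]

print(fits_on_one_server(servers, cpu=4, mem_gb=8))  # False
print(fits_in_pool(servers, cpu=4, mem_gb=8))        # True
```

Server 1 has spare cores but too little memory, Server 2 the reverse, so the job waits even though 7 cores and 14 GB are idle in total.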
Resource Utilization in Production Clusters
Unused Resource + Waiting/Killed Jobs Because of Physical-Node Constraints
* Google Production Cluster Trace Data. https://github.com/google/cluster-data
* Alibaba Production Cluster Trace Data. https://github.com/alibaba/clusterdata
How to achieve better heterogeneity, flexibility, and perf/$?
Go beyond the physical node boundary
Resource Disaggregation: Breaking monolithic servers into network-attached, independent hardware components
[Figure labels: Application, Hardware, Network; Flexibility, Heterogeneity, Perf / $]
Why Possible Now?
• Network is faster
  • InfiniBand (200Gbps, 600ns)
  • Optical Fabric (400Gbps, 100ns)
• More processing power at device
  • SmartNIC, SmartSSD, PIM
• Network interface closer to device
  • Omni-Path, Innova-2
[Figure: Berkeley Firebox, Intel Rack-Scale System, HP The Machine, IBM Composable System]
Disaggregated Datacenter: End-to-End Solution
[Figure: stack of Unmodified Application, Dist Sys, OS, Network, and Hardware; goals: Performance, Heterogeneity, Flexibility, Reliability, $ Cost]
Disaggregated Datacenter: End-to-End Solution
• Physically Disaggregated Resources
• Disaggregated Operating System (OSDI’18)
• New Processor and Memory Architecture
• Networking for Disaggregated Resources
• Kernel-Level RDMA Virtualization (SOSP’17)
• RDMA Network
Can Existing Kernels Fit?
[Figure: Monolithic/Micro-kernel (e.g., Linux, L4) on a monolithic server with CPU, shared main memory, disk, and NIC, with a network across servers; Multikernel (e.g., Barrelfish, Helios, fos) with a kernel on each of Core, GPU, and P-NIC]
Existing Kernels Don’t Fit
• Access remote resources
• Distributed resource mgmt
• Fine-grained failure handling
[Figure label: Network]
When hardware is disaggregated, the OS should be too
[Figure: an OS comprising Process Mgmt, Virtual Memory System, File & Storage System, and Network]
[Figure: the OS split apart, with Process Mgmt, Virtual Memory System, and File & Storage System each attached to the Network]
The Splitkernel Architecture
• Split OS functions into monitors
• Run each monitor at h/w device
• Network messaging across non-coherent components
• Distributed resource mgmt and failure handling
[Figure: Process Monitor on Processor (CPU), GPU Monitor on Processor (GPU), XPU Manager on New h/w (XPU), and Memory, NVM, HDD, and SSD Monitors on Memory, NVM, Hard Disk, and SSD, connected by network messaging across non-coherent components]
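A toy model may help make the monitor idea concrete. Everything below is an illustrative assumption rather than LegoOS code: the class names, the message kinds ("read", "write", "ack", "data"), and the in-process dictionary standing in for the network are all invented for this sketch. What it shows is that monitors share no memory and interact only through explicit messages.

```python
import queue


class Monitor:
    """One monitor per hardware component; it owns only its local state."""

    def __init__(self, name, network):
        self.name = name
        self.network = network          # dict standing in for the real network
        self.inbox = queue.Queue()
        network[name] = self            # register so peers can address us

    def send(self, dest, kind, payload):
        # No shared memory: the only way to reach a peer is a message.
        self.network[dest].inbox.put((self.name, kind, payload))

    def handle_one(self):
        sender, kind, payload = self.inbox.get_nowait()
        return self.on_message(sender, kind, payload)

    def on_message(self, sender, kind, payload):
        raise NotImplementedError


class MemoryMonitor(Monitor):
    def __init__(self, name, network):
        super().__init__(name, network)
        self.store = {}                 # local state, invisible to other monitors

    def on_message(self, sender, kind, payload):
        if kind == "write":
            addr, value = payload
            self.store[addr] = value
            self.send(sender, "ack", addr)
        elif kind == "read":
            self.send(sender, "data", (payload, self.store.get(payload)))


class ProcessMonitor(Monitor):
    def __init__(self, name, network):
        super().__init__(name, network)
        self.replies = []

    def on_message(self, sender, kind, payload):
        self.replies.append((sender, kind, payload))


network = {}
proc = ProcessMonitor("processor-0", network)
mem = MemoryMonitor("memory-0", network)

proc.send("memory-0", "write", (0x1000, 42))
mem.handle_one()
proc.handle_one()
proc.send("memory-0", "read", 0x1000)
mem.handle_one()
proc.handle_one()
print(proc.replies)
```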
LegoOS: The First Disaggregated OS
[Figure labels: Processor, Memory, Storage, NVM]
How Should LegoOS Appear to Users?
As a set of hardware devices? As a giant machine?
• Our answer: as a set of virtual Nodes (vNodes)
  - Similar semantics to virtual machines
  - Unique vID, vIP, storage mount point
  - Can run on multiple processor, memory, and storage components
Abstraction - vNode
[Figure: vNode1 and vNode2 mapped onto Process Monitor / Processor (CPU), GPU Monitor / Processor (GPU), XPU Manager / New h/w (XPU), and Memory, NVM, HDD, and SSD Monitors, connected by network messaging across non-coherent components]
• One vNode can run on multiple hardware components
• One hardware component can run multiple vNodes
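The many-to-many relationship between vNodes and hardware components can be sketched as a small data structure. The field names follow the slides (vID, vIP, storage mount point); the `Cluster` class, its methods, and the component names are hypothetical and only illustrate the mapping, not how LegoOS stores it.

```python
from dataclasses import dataclass, field


@dataclass
class VNode:
    vid: int                  # unique vID
    vip: str                  # vIP
    mount_point: str          # storage mount point
    components: set = field(default_factory=set)  # hardware components in use


class Cluster:
    def __init__(self):
        self.vnodes = {}      # vID -> VNode
        self.tenants = {}     # component name -> set of vIDs running there

    def create_vnode(self, vid, vip, mount_point):
        if vid in self.vnodes:
            raise ValueError(f"vID {vid} already in use")
        self.vnodes[vid] = VNode(vid, vip, mount_point)
        return self.vnodes[vid]

    def attach(self, vid, component):
        # Record both directions of the many-to-many mapping.
        self.vnodes[vid].components.add(component)
        self.tenants.setdefault(component, set()).add(vid)


cluster = Cluster()
cluster.create_vnode(1, "10.0.0.1", "/mnt/vnode1")
cluster.create_vnode(2, "10.0.0.2", "/mnt/vnode2")
for comp in ("cpu-0", "gpu-0", "mem-0"):
    cluster.attach(1, comp)
for comp in ("cpu-0", "mem-1", "ssd-0"):
    cluster.attach(2, comp)

print(sorted(cluster.vnodes[1].components))  # ['cpu-0', 'gpu-0', 'mem-0']
print(sorted(cluster.tenants["cpu-0"]))      # [1, 2]
```

Here vNode 1 spans three components, while `cpu-0` hosts both vNodes.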
Abstraction
• Appear as vNodes to users
• Linux ABI compatible
  • Support unmodified Linux system call interface (common ones)
  • A level of indirection to translate Linux interface to LegoOS interface
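One way to picture the level of indirection is a dispatch table that keeps Linux-style per-process state (such as file descriptors) locally and forwards the actual work to other components. This is a sketch under stated assumptions: `MemoryStub`, `StorageStub`, `SyscallLayer`, and the three syscalls chosen are invented for illustration and do not reflect the real LegoOS interface.

```python
import errno


class MemoryStub:
    """Hypothetical stand-in for a remote memory component."""

    def __init__(self):
        self.next_addr = 0x10000

    def alloc(self, length):
        addr = self.next_addr
        self.next_addr += length
        return addr


class StorageStub:
    """Hypothetical stand-in for a remote storage component."""

    def __init__(self):
        self.files = {"/etc/hostname": b"vnode1\n"}

    def read(self, path, count):
        return self.files[path][:count]


class SyscallLayer:
    """Translates Linux-style calls into requests to other components."""

    def __init__(self, memory, storage):
        self.memory = memory
        self.storage = storage
        self.fd_table = {}    # Linux-visible state kept by the indirection layer
        self.next_fd = 3      # 0-2 are conventionally stdin/stdout/stderr
        self.table = {
            "open": self.sys_open,
            "read": self.sys_read,
            "mmap": self.sys_mmap,
        }

    def syscall(self, name, *args):
        handler = self.table.get(name)
        if handler is None:
            return -errno.ENOSYS      # only the common calls are supported
        return handler(*args)

    def sys_open(self, path):
        fd = self.next_fd
        self.next_fd += 1
        self.fd_table[fd] = path
        return fd

    def sys_read(self, fd, count):
        path = self.fd_table.get(fd)
        if path is None:
            return -errno.EBADF
        return self.storage.read(path, count)

    def sys_mmap(self, length):
        return self.memory.alloc(length)


layer = SyscallLayer(MemoryStub(), StorageStub())
fd = layer.syscall("open", "/etc/hostname")
print(fd, layer.syscall("read", fd, 5))
print(hex(layer.syscall("mmap", 4096)))
print(layer.syscall("fork") == -errno.ENOSYS)
```

The application sees familiar calls and return codes; where each request is served is hidden behind the table.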
LegoOS Design
1. Clean separation of OS and hardware functionalities
2. Build monitor with hardware constraints
3. RDMA-based message passing for both kernel and applications
4. Two-level distributed resource management
5. Memory failure tolerance through replication
Separate Processor and Memory
• Separate and move hardware units (TLB, MMU, DRAM, PT) to the memory component
• Separate and move the virtual memory system to the memory component
• Processor components only see virtual memory addresses
  • All levels of cache are virtual cache
• Memory components manage virtual and physical memory
[Figure: Processor (CPU caches, last-level cache) sending virtual addresses over the network to Memory (TLB, MMU, virtual memory system, DRAM, PT)]
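The split can be sketched as a memory component that accepts (process, virtual address) requests and performs all translation itself. This is a simplified model under assumptions not in the slides: a flat dictionary as the page table, 4 KB pages, and allocate-on-first-touch. The class and method names are hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size


class MemoryComponent:
    """Owns the page table and DRAM; callers only ever pass virtual addresses."""

    def __init__(self, num_frames):
        self.dram = bytearray(num_frames * PAGE_SIZE)
        self.free_frames = list(range(num_frames))
        self.page_table = {}  # (pid, virtual page number) -> physical frame number

    def _translate(self, pid, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        key = (pid, vpn)
        if key not in self.page_table:
            if not self.free_frames:
                raise MemoryError("out of physical frames")
            # Simplification: allocate a frame the first time a page is touched.
            self.page_table[key] = self.free_frames.pop(0)
        return self.page_table[key] * PAGE_SIZE + offset

    def write(self, pid, vaddr, data):
        for i, byte in enumerate(data):
            self.dram[self._translate(pid, vaddr + i)] = byte

    def read(self, pid, vaddr, length):
        return bytes(self.dram[self._translate(pid, vaddr + i)] for i in range(length))


mem = MemoryComponent(num_frames=4)
# Two processes use the SAME virtual address; the write for pid 1 crosses a page boundary.
mem.write(1, 0x7FFF0FFE, b"lego")
mem.write(2, 0x7FFF0FFE, b"OS")
print(mem.read(1, 0x7FFF0FFE, 4))  # b'lego'
print(mem.read(2, 0x7FFF0FFE, 2))  # b'OS'
print(len(mem.page_table))         # 3
```

The processor side never learns which physical frame backs an address, which is what lets the memory component manage physical memory on its own.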
Challenge: network is 2x-4x slower than memory bus
Add Extended Cache at Processor
• Add small DRAM/HBM at processor
• Use it as Extended Cache, or ExCache
  • Software and hardware co-managed
  • Inclusive
  • Virtual cache
[Figure: Processor (CPU caches, last-level cache, ExCache DRAM) connected over the network to Memory (TLB, MMU, virtual memory system, DRAM, PT)]
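A minimal model of a virtually tagged cache in front of remote memory is sketched below. Several details are assumptions made for illustration: the 4 KB line size, the LRU eviction policy, the read-only access path (no dirty lines or write-back), and the `remote_memory` function that stands in for a network fetch.

```python
from collections import OrderedDict

LINE_SIZE = 4096  # assumed cache-line size: one 4 KB page


class ExCache:
    def __init__(self, capacity_lines, fetch_line):
        self.capacity = capacity_lines
        self.fetch_line = fetch_line   # callable(pid, vpn) -> bytes; models the miss path
        self.lines = OrderedDict()     # (pid, vpn) -> bytes, ordered by recency
        self.hits = 0
        self.misses = 0

    def read(self, pid, vaddr):
        vpn, offset = divmod(vaddr, LINE_SIZE)
        key = (pid, vpn)               # tagged by virtual address: no translation here
        if key in self.lines:
            self.hits += 1
            self.lines.move_to_end(key)             # mark as most recently used
        else:
            self.misses += 1
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)      # evict the least recently used line
            self.lines[key] = self.fetch_line(pid, vpn)
        return self.lines[key][offset]


def remote_memory(pid, vpn):
    # Hypothetical remote memory: every byte of a page holds its page number mod 256.
    return bytes([vpn % 256]) * LINE_SIZE


cache = ExCache(capacity_lines=2, fetch_line=remote_memory)
for vaddr in (0x0000, 0x0008, 0x1000, 0x0010, 0x2000, 0x1000):
    cache.read(pid=1, vaddr=vaddr)
print(cache.hits, cache.misses)  # 2 4
```

Only misses cross the network, so the hit ratio of this cache determines how much of the network latency the application actually sees.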
Distributed Resource Management
• Global Resource Mgmt
  • Global Process Manager (GPM)
  • Global Memory Manager (GMM)
  • Global Storage Manager (GSM)
1. Coarse-grain allocation
2. Load-balancing
3. Failure handling
[Figure: global managers above Process Monitor on Processor (CPU), GPU Monitor on Processor (GPU), and Memory, NVM, HDD, and SSD Monitors, connected by network messaging across non-coherent components]
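The two levels can be sketched as a global manager that decides which memory component receives a region, and a per-component monitor that does the local bookkeeping. The placement policy here (pick the component with the most free capacity) and all class and method names are assumptions for illustration; the slides state only that the global level handles coarse-grain allocation, load-balancing, and failure handling.

```python
class MemoryMonitor:
    """Local level: allocation bookkeeping inside one memory component."""

    def __init__(self, name, capacity_mb):
        self.name = name
        self.capacity_mb = capacity_mb
        self.used_mb = 0

    def free_mb(self):
        return self.capacity_mb - self.used_mb

    def allocate(self, size_mb):
        if size_mb > self.free_mb():
            raise MemoryError(f"{self.name} is full")
        self.used_mb += size_mb


class GlobalMemoryManager:
    """Global level: coarse-grained placement across memory components."""

    def __init__(self, monitors):
        self.monitors = {m.name: m for m in monitors}
        self.regions = {}  # (vid, region id) -> name of the component holding it

    def assign_region(self, vid, region_id, size_mb):
        candidates = [m for m in self.monitors.values() if m.free_mb() >= size_mb]
        if not candidates:
            raise MemoryError("no memory component can hold the region")
        # Assumed load-balancing policy: choose the component with the most free space.
        target = max(candidates, key=lambda m: m.free_mb())
        target.allocate(size_mb)   # the local monitor does the actual bookkeeping
        self.regions[(vid, region_id)] = target.name
        return target.name


gmm = GlobalMemoryManager([MemoryMonitor("mem-0", 1024), MemoryMonitor("mem-1", 2048)])
print(gmm.assign_region(vid=1, region_id=0, size_mb=512))  # mem-1
print(gmm.assign_region(vid=1, region_id=1, size_mb=768))  # mem-1
print(gmm.assign_region(vid=2, region_id=0, size_mb=512))  # mem-0
```

The global manager is consulted only for coarse decisions, so it stays off the path of individual memory accesses.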
Implementation and Emulation
• Processor
  • Reserve DRAM as ExCache (4KB page as cache line)
  • h/w only on hit path, s/w managed miss path
  • Indirection layer to store states for 113 Linux syscalls
• Memory
  • Limit number of cores, kernel-space only
• Storage/Global Resource Monitors
  • Implemented as kernel module on Linux
• Network
  • RDMA RPC stack based on LITE [SOSP’17]
[Figure: Processor (Process Monitor; CPUs, LLC, ExCache, disk), Memory (Memory Monitor; CPUs, LLC, DRAM, disk), and Storage (Linux kernel module; CPUs, LLC, DRAM, disk), connected by an RDMA network]
Performance Evaluation
• Unmodified TensorFlow, running CIFAR-10
  • Working set: 0.9G
  • 4 threads
• Systems in comparison
  • Baseline: Linux with unlimited memory
  • Swap to SSD, and ramdisk
  • InfiniSwap [NSDI’17]
[Figure: slowdown vs. ExCache/Memory size (128, 256, 512 MB) for Linux-swap-SSD, Linux-swap-ramdisk, InfiniSwap, and LegoOS; LegoOS config: 1P, 1M, 1S]
Only 1.3x to 1.7x slowdown when disaggregating devices with LegoOS
To gain better resource packing, elasticity, and fault tolerance!
LegoOS Summary
• Resource disaggregation calls for a new system
• LegoOS: a new OS designed and built from scratch for datacenter resource disaggregation
• Split OS into distributed micro-OS services, running at device
• Many challenges and much potential
Disaggregated Datacenter
flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use
• Physically Disaggregated Resources
• Disaggregated Operating System (OSDI’18)
• New Processor and Memory Architecture
• Networking for Disaggregated Resources
• Kernel-Level RDMA Virtualization (SOSP’17)
• RDMA Network
Network Requirements for Resource Disaggregation
• Low latency
• High bandwidth
• Scale
• Reliable
[Figure label: RDMA]