"All problems in computer science can be solved by another level of indirection"
– Butler Lampson, David Wheeler
"except for the problem of too many layers of indirection"
– David Wheeler

Main Challenge: How to preserve the performance benefit of RDMA?

Design Principles
1. Indirection only at the local node for one-sided RDMA
   [Figure: CPU (user, kernel) and memory involvement for Berkeley sockets, native RDMA, and LITE, across the userspace, kernel, and hardware layers]
2. Avoid hardware indirection
   [Figure: address mapping and permission check are performed by LITE in kernel space instead of by the RNIC in hardware]
   – No redundant indirection
   – Scalable performance
3. Hide kernel cost
"except for the problem of too many layers of indirection" – David Wheeler
Great Performance and Scalability

Outline
• Introduction and motivation
• Overall design and abstraction
• LITE internals
• LITE applications
• Conclusion

LITE - Architecture
[Figure: LITE architecture, from top to bottom]
• Users of LITE: user-level apps, user-level RPC function, kernel app, mgmt
• OS, LITE abstraction: LITE APIs (mgmt, mem, synch, msging, RPC)
  – LITE 1-Side RDMA: handles lh1, lh2; permission check; address mapping to addr1, addr2; global lkey, global rkey
  – LITE RPC: RPC client, RPC server; RDMA connections; buffer mgmt; queues (send, poll, recv)
• Verbs abstraction: RNIC driver
• RNIC: global lkey, global rkey

Onload Costly Operations
[Figure: LITE in the OS manages memory space, connections, queues, and keys; permission check and address mapping move from the RNIC into LITE]
• Perform address mapping and protection in kernel

Avoid Hardware Indirection
[Figure: LITE in the OS handles memory space, connections, queues, keys, permission check, and address mapping; on the RNIC, the cached PTEs and per-region keys (lkey 1 … lkey n, rkey 1 … rkey n) are replaced by one global lkey and one global rkey]
Challenge: How to eliminate hardware indirection without changing hardware?
• Register with physical address → no need for any PTEs
• Register whole memory at once → one global key

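The two bullets above can be illustrated with a rough count of the state an RNIC would have to track. This is only a toy model, not LITE code: the 4 KB page size, the function names, and the example workload are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages; not stated in the slides


def per_mr_rnic_state(mr_sizes_bytes):
    """Per-MR registration with virtual addresses:
    one lkey/rkey pair per MR and one PTE per page of every MR."""
    key_pairs = len(mr_sizes_bytes)
    # Ceiling division: a partially used page still needs a PTE.
    ptes = sum(-(-size // PAGE_SIZE) for size in mr_sizes_bytes)
    return {"key_pairs": key_pairs, "ptes": ptes}


def global_rnic_state(mr_sizes_bytes):
    """Whole memory registered once with physical addresses:
    one global key pair and no PTEs, however many regions exist."""
    return {"key_pairs": 1, "ptes": 0}


# Hypothetical workload: 1024 regions of 1 MB each.
regions = [1 << 20] * 1024
native = per_mr_rnic_state(regions)
lite = global_rnic_state(regions)
print(native, lite)
```

In this model the per-MR state grows with both the number and the size of regions, while the global-key state stays constant, which is the property the slide attributes to registering all memory once with physical addresses.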
LITE LMR and RDMA
[Figure: a userspace application holds a handle lh and calls LITE_read(lh, offset, size); LITE in the kernel performs the permission check and QoS, resolves the LMR and offset to a node and physical address, and reaches remote nodes over the network]
LMR mapping shown in the figure:
Node   Phy Addr
1      0x45
4      0x27

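The lookup path in the figure can be sketched as a small software model. This is a minimal sketch under stated assumptions, not the LITE implementation: the class and method names (`LiteModel`, `register`, `lite_read`), the one-entry-per-handle layout, and the region sizes are hypothetical, and QoS is not modeled. Only the call shape `LITE_read(lh, offset, size)` and the node and address values come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LmrEntry:
    node: int        # node that holds the memory
    phys_addr: int   # physical base address on that node
    size: int        # region size in bytes (assumed field)
    can_read: bool = True


class LiteModel:
    """Toy model of the kernel-side handle table (hypothetical)."""

    def __init__(self):
        self._table = {}   # lh -> LmrEntry
        self._next_lh = 1

    def register(self, node, phys_addr, size, can_read=True):
        lh = self._next_lh
        self._next_lh += 1
        self._table[lh] = LmrEntry(node, phys_addr, size, can_read)
        return lh

    def lite_read(self, lh, offset, size):
        """Return the (node, physical address, size) a one-sided read would target."""
        entry = self._table.get(lh)
        if entry is None:
            raise KeyError(f"unknown handle {lh}")
        # Permission check happens in software, before touching the RNIC.
        if not entry.can_read:
            raise PermissionError(f"handle {lh} is not readable")
        if offset < 0 or size <= 0 or offset + size > entry.size:
            raise ValueError("access outside the LMR")
        # Address mapping: handle + offset -> node + physical address.
        return entry.node, entry.phys_addr + offset, size


lite = LiteModel()
lh1 = lite.register(node=1, phys_addr=0x45, size=4096)
lh2 = lite.register(node=4, phys_addr=0x27, size=4096, can_read=False)
target = lite.lite_read(lh1, offset=16, size=64)
print(target)
```

The application only ever sees the opaque handle; the node and physical address stay inside the kernel-side table, which is what lets the RNIC work with a single global key.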
LITE RDMA: Size of MR Scalability
[Figure: requests/us (0 to 6) versus total size (1 MB to 1024 MB) for Write-64B, LITE_write-64B, Write-1K, and LITE_write-1K]
• LITE scales much better than native RDMA with respect to MR size and number

LITE-RDMA Latency
[Figure: latency (us, 0 to 60) versus request size (8 B to 32 KB) for user space and kernel space]
• LITE only adds a very slight overhead even when native RDMA doesn't have scalability issues

LITE RPC
• RPC communication using two RDMA-write-imm
• One global busy poll thread
• Separate LMRs at server for different RPC clients
• Hide syscall cost behind performance critical path
• Benefits
  – Low latency
  – Low memory utilization
  – Low CPU utilization

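The first three bullets can be pictured with a single-process simulation. It is a sketch only: `Node`, `write_imm`, and `rpc_call` are hypothetical names, memory is a Python dictionary rather than registered memory, and using the immediate value to carry the client id is an assumption, not something the slides state.

```python
from collections import deque


class Node:
    """Stand-in for a machine: some LMR-like buffers and a completion queue."""

    def __init__(self, name):
        self.name = name
        self.memory = {}             # buffer id -> bytes
        self.completions = deque()   # (buffer id, immediate), drained by one poll loop


def write_imm(dst, buffer_id, data, imm):
    """Model of RDMA write-with-immediate: data lands in remote memory
    and the immediate value shows up as a completion at the receiver."""
    dst.memory[buffer_id] = data
    dst.completions.append((buffer_id, imm))


def rpc_call(client, server, handler, request, client_id):
    # Write-imm 1: request goes into the server buffer reserved for this client.
    write_imm(server, ("req", client_id), request, imm=client_id)

    # Server side: the single poll loop picks up the completion and runs the handler.
    buffer_id, imm = server.completions.popleft()
    reply = handler(server.memory[buffer_id])

    # Write-imm 2: reply goes straight back into the client's memory.
    write_imm(client, ("reply", imm), reply, imm=imm)

    # Client side: observe the completion and read the reply.
    reply_buffer, _ = client.completions.popleft()
    return client.memory[reply_buffer]


client, server = Node("client"), Node("server")
result = rpc_call(client, server, lambda req: req.upper(), b"ping", client_id=7)
print(result)
```

Each RPC costs exactly two write-with-immediate operations in this model, one per direction, and the per-client request buffer mirrors the "separate LMRs at server for different RPC clients" bullet.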
Outline
• Introduction and motivation
• Overall design and abstraction
• LITE internals
• LITE applications
• Conclusion

LITE Application Effort
Application        LOC     LOC using LITE   Student Days
LITE-Log           330     36               1
LITE-MapReduce     600*    49               4
LITE-Graph         1400    20               7
LITE-Kernel-DSM    3000    45               26
LITE-Graph-DSM     1300    0                5
* LITE-MapReduce ports from the 3000-LOC Phoenix with 600 lines of change or addition
• Simple to use
• Needs no expert knowledge
• Flexible, powerful abstraction
• Easy to achieve optimized performance

MapReduce Results
• LITE-MapReduce adapted from Phoenix [1]
[Figure: runtime (sec) of Hadoop, Phoenix, and LITE; x-axis: Phoenix, 2-node, 4-node, 8-node]
• LITE-MapReduce outperforms Hadoop by 4.3x to 5.3x
[1] Ranger et al., "Evaluating MapReduce for Multi-core and Multiprocessor Systems" (HPCA 07)

Graph Results
• LITE-Graph built directly on LITE using PowerGraph design
• Grappa and PowerGraph
[Figure: runtime (sec, 0 to 10) of LITE-Graph, Grappa, and PowerGraph at 4 nodes x 4 threads and 7 nodes x 4 threads]