
LITE Kernel RDMA Support for Datacenter Applications. Shin-Yeh Tsai, Yiying Zhang. [Title-slide figure: a timeline of network stacks since the 1983 Berkeley socket interface, placing sockets, TCP offload engines, mTCP, Arrakis, and IX across the userspace, kernel, and hardware layers.]


  1. "All problems in computer science can be solved by another level of indirection." (David Wheeler, as quoted by Butler Lampson) "...except for the problem of too many layers of indirection." – David Wheeler

  2. Main Challenge: How to preserve the performance benefit of RDMA?

  3–4. Design Principles. 1. Indirection only at the local side for one-sided RDMA. [Figure: builds contrasting the Berkeley socket stack and native RDMA with LITE, drawing CPU, user/kernel split, and memory per node; LITE keeps its indirection layer in the local kernel only, so remote memory is still reached directly.]

  5–7. Design Principles. 2. Avoid hardware indirection: move the address mapping and permission check out of the RNIC and into LITE in kernel space, leaving no redundant indirection and giving scalable performance. [Figure: builds relocating the address-mapping and permission-check boxes from the hardware RNIC up into the kernel-space LITE layer.]

  8–10. Design Principles. 3. Hide the kernel cost. Together, the three principles (local-only indirection for one-sided RDMA, no hardware indirection, hidden kernel cost) answer Wheeler's "too many layers of indirection" and yield great performance and scalability.

  11. Outline • Introduction and motivation • Overall design and abstraction • LITE internals • LITE applications • Conclusion

  12–16. LITE Architecture (built up step by step). User-level apps, user-level RPC functions, management processes, and kernel apps all sit on the LITE abstraction inside the OS. LITE exposes API families for management, memory, synchronization, messaging, and RPC. Internally it has two halves: one-sided RDMA, where LMR handles (lh1, lh2) go through a permission check and an address mapping (addr1, addr2) onto the global lkey/rkey; and RPC, with client/server RDMA buffer queues (send, poll, recv) plus connection management. Everything runs over the verbs abstraction and the RNIC driver, and the RNIC itself holds only the global lkey and global rkey. [Figure: layered architecture diagram.]
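A rough illustration of that API surface: the slides name the five API families (mgmt, mem, synch, msging, RPC) but show no signatures, so everything below except the LITE_read(lh, offset, size) shape, which appears on the later LMR slides, is a hypothetical sketch in C rather than the actual LITE header.

    /* Hypothetical LITE-style API surface; names and signatures are
     * assumptions, except LITE_read's shape, which follows the slides
     * (a local buffer argument is added to make it complete). */
    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t lite_lh_t;            /* opaque LMR handle ("lh") */

    /* mem: allocate a remote-accessible LMR, get back an opaque handle */
    lite_lh_t LITE_alloc(size_t size, int flags);

    /* one-sided RDMA through the kernel: handle + offset, so userspace
     * never sees raw remote virtual addresses or rkeys */
    int LITE_read(lite_lh_t lh, uint64_t offset, void *buf, size_t size);
    int LITE_write(lite_lh_t lh, uint64_t offset, const void *buf, size_t size);

    /* msging / RPC: two-sided messaging and RPC to a node ID */
    int LITE_send(int node, const void *msg, size_t size);
    int LITE_rpc(int node, int func_id, const void *req, size_t req_len,
                 void *reply, size_t reply_len);

The point of the handle-based shape is that permission checks and address mapping can stay in the kernel, which is exactly what the next slides onload.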

  17–18. Onload Costly Operations. Move memory-space management, connections, queues, and keys off the RNIC and into LITE in the OS: perform address mapping and protection in the kernel.

  19–23. Avoid Hardware Indirection. Without LITE, the RNIC must cache PTEs and a table of per-region keys (lkey 1/rkey 1 ... lkey n/rkey n). Challenge: how to eliminate this hardware indirection without changing the hardware? • Register memory with physical addresses, so the RNIC needs no PTEs at all. • Register the whole memory at once, so one global lkey/rkey replaces the per-region keys. [Figure: builds that progressively empty the RNIC of cached PTEs and key tables, leaving only the global lkey and global rkey.]
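To make the single-global-key bullet concrete, a minimal kernel-side sketch assuming the Linux kernel verbs API: on kernels that support it, allocating a protection domain with IB_PD_UNSAFE_GLOBAL_RKEY yields an lkey/rkey pair covering all of physical memory (older kernels got the same effect from ib_get_dma_mr). This is an illustration of the technique, not LITE's actual initialization code.

    #include <rdma/ib_verbs.h>

    static u32 global_lkey, global_rkey;

    static int lite_setup_global_keys(struct ib_device *dev)
    {
            struct ib_pd *pd;

            /* one PD whose rkey spans all physical memory: no per-region
             * keys and, since registration is by physical address, no
             * PTEs cached on the RNIC */
            pd = ib_alloc_pd(dev, IB_PD_UNSAFE_GLOBAL_RKEY);
            if (IS_ERR(pd))
                    return PTR_ERR(pd);

            global_lkey = pd->local_dma_lkey;     /* for local SGEs       */
            global_rkey = pd->unsafe_global_rkey; /* handed to peer nodes */
            return 0;
    }

With only this one key pair on the NIC, RNIC state no longer grows with the number of regions or processes; the safety the per-region keys used to provide is reimposed in software by LITE's kernel-level permission checks.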

  24–33. LITE LMR and RDMA (walkthrough). A userspace application on the local node holds an opaque handle lh for an LMR whose backing memory lives on remote nodes; LITE in the kernel maps the handle to a table of (node, physical address) entries, in the example node 1 at 0x45 and node 4 at 0x27. When the application calls LITE_read(lh, offset, size), LITE performs the permission and QoS checks, translates the offset against the mapped physical address, and issues the one-sided RDMA to the right remote node over the network. [Figure: step-by-step build of the handle lookup, checks, offset translation, and remote access.]
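A sketch of the kernel-side path this walkthrough describes, from handle to remote physical address. The structure layout and helper names are assumptions reconstructed from the slides, not the actual LITE source.

    #include <linux/errno.h>
    #include <linux/types.h>

    struct lite_lmr {
            u32 node;       /* remote node backing this region      */
            u64 phys_addr;  /* physical address on that node        */
            u64 size;       /* region length, used to bound offsets */
            u32 perm;       /* caller's permission bits             */
    };

    #define LITE_PERM_READ 0x1

    /* assumed helpers: handle-table lookup, and a one-sided read over
     * the kernel-managed connection using the global rkey */
    struct lite_lmr *lite_lookup_lh(u64 lh);
    int lite_rdma_read(u32 node, u64 remote_phys, void *buf, u64 size);

    long lite_read(u64 lh, u64 offset, void *buf, u64 size)
    {
            struct lite_lmr *lmr = lite_lookup_lh(lh);

            if (!lmr)
                    return -EINVAL;
            if (!(lmr->perm & LITE_PERM_READ))   /* permission check       */
                    return -EPERM;
            if (offset + size > lmr->size)       /* offset stays in bounds */
                    return -ERANGE;

            /* address mapping: handle + offset -> node + physical address */
            return lite_rdma_read(lmr->node, lmr->phys_addr + offset,
                                  buf, size);
    }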

  34–36. LITE RDMA: Scalability with MR Size. [Figure: throughput in requests/µs versus total registered size from 1 MB to 1024 MB, comparing native Write and LITE_write at 64 B and 1 KB request sizes.] Takeaway: LITE scales much better than native RDMA with respect to MR size and numbers.

  37–41. LITE-RDMA Latency. [Figure: latency in µs (0 to 60) versus request size from 8 B to 32 KB, for user-space native RDMA and kernel-space LITE.] Takeaway: LITE adds only a very slight overhead even where native RDMA has no scalability issues.

  42. LITE RPC • RPC communication using two RDMA-write-imm operations • One global busy-poll thread • Separate LMRs at the server for different RPC clients • Syscall cost hidden behind the performance-critical path • Benefits: low latency, low memory utilization, low CPU utilization
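A sketch of posting one of those two RDMA-write-imm operations with the Linux kernel verbs API. The idea of carrying an RPC slot ID in the immediate is an assumption consistent with the slide's send/poll/recv queues; all non-verbs names here are hypothetical.

    #include <rdma/ib_verbs.h>

    static int lite_rpc_write_imm(struct ib_qp *qp, u64 local_dma_addr,
                                  u32 lkey, u32 len,
                                  u64 remote_addr, u32 rkey, u32 slot_id)
    {
            const struct ib_send_wr *bad_wr;
            struct ib_sge sge = {
                    .addr   = local_dma_addr,  /* DMA address of message */
                    .length = len,
                    .lkey   = lkey,            /* e.g. the global lkey   */
            };
            struct ib_rdma_wr wr = {
                    .wr = {
                            .opcode      = IB_WR_RDMA_WRITE_WITH_IMM,
                            .send_flags  = IB_SEND_SIGNALED,
                            .sg_list     = &sge,
                            .num_sge     = 1,
                            /* immediate tells the peer's polling thread
                             * which RPC slot just landed */
                            .ex.imm_data = cpu_to_be32(slot_id),
                    },
                    .remote_addr = remote_addr,  /* slot in peer's LMR   */
                    .rkey        = rkey,         /* e.g. the global rkey */
            };

            return ib_post_send(qp, &wr.wr, &bad_wr);
    }

The client's write-imm deposits the request directly into its per-client LMR at the server; the reply is a second write-imm back into the client's buffer. The payload needs no posted receive buffer (the immediate consumes only a tiny receive completion that the single busy-poll thread drains), which is what keeps latency, memory, and CPU use low.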

  43. Outline • Introduction and motivation • Overall design and abstraction • LITE internals • LITE applications • Conclusion

  44. LITE Application Effort

      Application        LOC     LOC using LITE   Student Days
      LITE-Log           330     36               1
      LITE-MapReduce     600*    49               4
      LITE-Graph         1400    20               7
      LITE-Kernel-DSM    3000    45               26
      LITE-Graph-DSM     1300    0                5

      * LITE-MapReduce is a port of the 3000-LOC Phoenix, with 600 lines changed or added.

      • Simple to use • Needs no expert knowledge • Flexible, powerful abstraction • Easy to achieve optimized performance

  45–46. MapReduce Results • LITE-MapReduce adapted from Phoenix [1]. [Figure: runtime in seconds for Hadoop, Phoenix, and LITE; single-machine Phoenix plus 2-node, 4-node, and 8-node runs.] Takeaway: LITE-MapReduce outperforms Hadoop by 4.3x to 5.3x. [1] Ranger et al., "Evaluating MapReduce for Multi-core and Multiprocessor Systems" (HPCA '07).

  47. Graph Results • LITE-Graph built directly on LITE using the PowerGraph design • Compared against Grappa and PowerGraph. [Figure: runtime in seconds for LITE-Graph, Grappa, and PowerGraph at 4 nodes × 4 threads and 7 nodes × 4 threads.]
