
1. DCS: A Fast and Scalable Device-Centric Server Architecture
Jaehyung Ahn, Dongup Kwon, Youngsok Kim, Mohammadamin Ajdari, Jaewon Lee, and Jangwoo Kim
{jh2ekd, nankdu7, elixir, majdari, spiegel0, jangwoo}@postech.ac.kr
High Performance Computing Lab, Pohang University of Science and Technology (POSTECH)

2. Inefficient device utilization
• Host-centric device management
  − Host manages every device invocation
  − Frequent host-involved layer crossings → increases latency and management cost
[Figure: host-centric layering — application in userspace; per-device kernel stacks and drivers A/B/C in the kernel; devices A/B/C in hardware; separate datapath and metadata/command path]

3. Latency: High software overhead
• Single sendfile: storage read & NIC send
  − Faster devices, more software overhead
[Chart: normalized latency decomposition (software / storage / NIC) — software overhead is 7% for HDD + 10Gb NIC, 50% for NVMe + 10Gb NIC, 77% for PCM + 10Gb NIC, and 82% for PCM + 100Gb NIC]
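For context, the host-centric baseline behind these measurements is the ordinary Linux sendfile(2) path, where the kernel mediates both the storage read and the NIC send. A minimal sketch of that path (socket setup and most error handling omitted):

```c
/* Host-centric sendfile: one system call, but the host CPU and the kernel
 * stacks mediate both the storage read and the NIC send. */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

ssize_t host_centric_send(int sock_fd, const char *path)
{
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0)
        return -1;

    struct stat st;
    if (fstat(file_fd, &st) < 0) {
        close(file_fd);
        return -1;
    }

    /* Data is read from storage into the page cache and then pushed
     * through the network stack to the NIC, all driven by the host. */
    off_t offset = 0;
    ssize_t sent = sendfile(sock_fd, file_fd, &offset, st.st_size);

    close(file_fd);
    return sent;
}
```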

4. Cost: High host resource demand
• Sendfile under host resource (CPU) contention
  − Faster devices, more host resource consumption
[Chart: sendfile bandwidth and CPU usage under no contention vs. high contention (bandwidth 100% → 14%, CPU usage 34% → 6%); measured with an NVMe SSD and a 10Gb NIC]

5. Index
• Inefficient device utilization
• Limitations of existing solutions
• DCS: Device-Centric Server architecture
• Experimental results
• Conclusion

6. Limitations of existing work
• Single-device optimization
  − Does not address inter-device communication (e.g., Moneta (SSD), DCA (NIC), mTCP (NIC), Arrakis (generic))
• Inter-device communication
  − Not applicable to unsupported devices (e.g., GPUNet (GPU–NIC), GPUDirect RDMA (GPU–InfiniBand))
• Integrating devices
  − Custom devices and protocols, limited applicability (e.g., QuickSAN (SSD+NIC), BlueDBM (accelerator with SSD+NIC))
• Need for fast, scalable, and generic inter-device communication

7. Index
• Inefficient device utilization
• Limitations of existing solutions
• DCS: Device-Centric Server architecture
  − Key idea and benefits
  − Design considerations
• Experimental results
• Conclusion

8. DCS: Key idea
• Minimize host involvement & data movement
[Figure: DCS layering — application and DCS Library in userspace; DCS Driver beside the existing device drivers and kernel stacks in the kernel; DCS Engine managing devices A/B/C in hardware; datapath and metadata/command path]
• Single command → optimized multi-device invocation
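As a concrete reading of "single command → optimized multi-device invocation", a hypothetical application-level call could look like the sketch below; dcs_sendfile is an assumed name used only for illustration, not the library's documented API.

```c
/* Sketch of the "single command" idea: the application issues one call,
 * and the DCS Library/Driver translate it into a multi-device command
 * executed by the hardware DCS Engine. The dcs_* name is an assumption. */
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical DCS Library entry point (assumed, not the real API). */
ssize_t dcs_sendfile(int out_sock_fd, int in_file_fd, off_t offset, size_t count);

int dcs_send_example(int sock_fd, const char *path, size_t len)
{
    int file_fd = open(path, O_RDONLY);   /* ordinary file descriptor */
    if (file_fd < 0)
        return -1;

    /* One userspace call; the storage read and the NIC send are then
     * orchestrated device-to-device, without further host involvement. */
    ssize_t sent = dcs_sendfile(sock_fd, file_fd, /*offset=*/0, len);

    close(file_fd);
    return sent < 0 ? -1 : 0;
}
```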

9. DCS: Benefits
• Better device performance
  − Faster data delivery, lower total operation latency
• Better host performance/efficiency
  − Resources and time spent on device management become available to other applications
• High applicability
  − Relies on existing drivers, kernel support, and interfaces
  − Easy to extend to cover more devices

10. Index
• Inefficient device utilization
• Limitations of existing solutions
• DCS: Device-Centric Server architecture
  − Key idea and benefits
  − Design considerations → by discussing implementation details
• Experimental results
• Conclusion

11. DCS: Architecture overview
[Figure: DCS alongside the existing system — userspace: application plus the DCS Library exposing sendfile() and encrypted sendfile(); kernel: DCS Driver (kernel communicator, command generator) next to the existing drivers and kernel stacks; hardware: DCS Engine (command interpreter, per-device command manager, command queues) on the NetFPGA NIC, with the NVMe SSD and GPU attached via a PCIe switch]
• Fully compatible with the existing system
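To illustrate what the driver's command generator might hand to the engine's command interpreter, here is a hypothetical command descriptor; the layout and field names are assumptions, not the paper's actual command format.

```c
/* Illustrative (assumed) descriptor the DCS Driver's command generator
 * could pass over PCIe to the DCS Engine's command interpreter, which then
 * enqueues it into the per-device command queues. */
#include <stdint.h>

enum dcs_device { DCS_DEV_SSD, DCS_DEV_NIC, DCS_DEV_GPU };

struct dcs_command {
    enum dcs_device source;   /* device that produces the data            */
    enum dcs_device target;   /* device that consumes the data            */
    uint64_t        src_addr; /* e.g., block address on the SSD           */
    uint64_t        dst_addr; /* e.g., GPU memory or NIC buffer address   */
    uint32_t        length;   /* transfer size in bytes                   */
    uint32_t        flags;    /* e.g., "encrypt on GPU" for encrypted sendfile */
};
```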

12. Communicating with storage
[Figure: storage path — the application's hook/API call enters the DCS Library; the file descriptor is handed to the DCS Driver, which queries the (virtual) filesystem for the block address (in device) or buffer address (if cached); the DCS Engine then accesses the NVMe SSD as source or target device, consulting the VFS cache]
• Data consistency guaranteed
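To ground the block-address lookup, the sketch below shows one way a file-to-block mapping can be observed from userspace with the standard Linux FIBMAP/FIGETBSZ ioctls; this only approximates what the DCS Driver would do in-kernel and is not the paper's code.

```c
/* Rough illustration of mapping a file offset to an on-device block
 * address, similar in spirit to what the driver needs before issuing an
 * NVMe command. FIBMAP requires root and filesystem support. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>     /* FIBMAP, FIGETBSZ */

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;

    int block_size = 0;
    ioctl(fd, FIGETBSZ, &block_size);   /* filesystem block size */

    int block = 0;                      /* logical block 0 of the file... */
    if (ioctl(fd, FIBMAP, &block) == 0) /* ...replaced by its physical block */
        printf("block size %d, first physical block %d\n", block_size, block);

    close(fd);
    return 0;
}
```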

13. Communicating with network interface
[Figure: network path — the application's hook/API call enters the DCS Library; the socket descriptor is handed to the DCS Driver, which obtains connection information from the kernel network stack; the DCS Engine passes the data buffer to the NetFPGA NIC's HW PacketGen for packet generation and send]
• HW-assisted packet generation
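As an illustration of the "connection information" the driver would need to extract from an established kernel socket before the NIC can generate packets on its own, a hypothetical snapshot structure (all field names are assumptions) might be:

```c
/* Assumed snapshot of per-connection state handed to the NIC's HW packet
 * generator so it can build valid TCP/IP packets without the host stack. */
#include <stdint.h>

struct dcs_conn_info {
    uint8_t  dst_mac[6];      /* next-hop MAC address            */
    uint8_t  src_mac[6];      /* local NIC MAC address           */
    uint32_t src_ip, dst_ip;  /* IPv4 addresses (network order)  */
    uint16_t src_port, dst_port;
    uint32_t snd_nxt;         /* next TCP sequence number to use */
    uint32_t rcv_nxt;         /* expected ACK number             */
    uint16_t mss;             /* segment size for HW PacketGen   */
};
```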

14. Communicating with accelerator
[Figure: accelerator path — the application allocates GPU memory through the GPU user library and then calls the DCS Library; the DCS Driver obtains the memory mapping from the GPU kernel driver; the DCS Engine moves data from the source device directly into GPU memory via DMA / NVMe transfer; the GPU then processes the data (kernel launch)]
• Direct data loading without memcpy
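A sketch of the direct-loading idea: cudaMalloc is the real CUDA runtime call, while dcs_read_to_device is a hypothetical DCS Library call assumed for illustration (it would pass the GPU memory mapping to the DCS Engine, which then DMAs data straight from the NVMe SSD into GPU memory).

```c
/* Direct SSD-to-GPU loading without a host-side memcpy (sketch). */
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>

ssize_t dcs_read_to_device(int file_fd, void *dev_ptr, size_t len); /* assumed */

int load_file_to_gpu(const char *path, size_t len, void **out_dev_ptr)
{
    void *dev_ptr = NULL;
    if (cudaMalloc(&dev_ptr, len) != cudaSuccess)   /* allocate GPU memory */
        return -1;

    int fd = open(path, O_RDONLY);                  /* call the DCS library */
    if (fd < 0) {
        cudaFree(dev_ptr);
        return -1;
    }

    /* Device-to-device transfer; data lands in GPU memory, ready for
     * the subsequent kernel launch. */
    ssize_t got = dcs_read_to_device(fd, dev_ptr, len);
    close(fd);

    *out_dev_ptr = dev_ptr;
    return got < 0 ? -1 : 0;
}
```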

15. Index
• Inefficient device utilization
• Limitations of existing solutions
• DCS: Device-Centric Server architecture
• Experimental results
• Conclusion

16. Experimental setup
• Host: power-efficient system
  − Core 2 Duo @ 2.00GHz, 2MB LLC
  − 2GB DDR2 DRAM
• Devices: off-the-shelf emerging devices
  − Storage: Samsung XS1715 NVMe SSD
  − NIC: NetFPGA with Xilinx Virtex 5 (up to 1Gb bandwidth)
  − Accelerator: NVIDIA Tesla K20m
  − Device interconnect: Cyclone Microsystems PCIe2-2707 (Gen 2 switch, 5 slots, up to 80Gbps)

17. DCS prototype implementation
• Our 4-node DCS prototype
  − Can support many devices per host
[Photo: prototype hardware — NetFPGA NIC, GPU, NVMe SSD, PCIe switch]

18. Reducing device utilization latency
• Single sendfile: storage read & NIC send
  − Host-centric: per-device layer crossings
  − DCS: batch management in the HW layer
[Chart: latency (µs) — host-centric: 79 (SW) + 75 (HW); DCS: 39 (DCS) + 75 (HW)]

19. Reducing device utilization latency
• Single sendfile: storage read & NIC send
  − Host-centric: per-device layer crossings
  − DCS: batch management in the HW layer
[Chart: latency (µs) — host-centric: 79 (SW) + 75 (HW); DCS: 39 (DCS) + 75 (HW)]
• 2× latency improvement (with low-latency devices)

20. Host-independent performance
• Sendfile under host resource (CPU) contention
  − Host-centric: host-dependent, high management cost
  − DCS: host-independent, low management cost
[Chart: sendfile bandwidth and CPU busy — no contention: host-centric 100% BW at 70% CPU busy, DCS 100% BW at 29% CPU busy; high contention: host-centric 13% BW at 10% CPU busy, DCS 71% BW at 11% CPU busy]
• High performance even on weak hosts

21. Multi-device invocation
• Encrypted sendfile (SSD → GPU → NIC, 512MB)
  − DCS provides much more efficient data movement to the GPU
  − Current bottleneck is the NIC (1Gbps)
[Chart: normalized processing time (components: GPU data loading, GPU processing, network send, NVIDIA driver) — host-centric: 32 + 6 + 62 = 100; DCS: 6 + 6 + 6 + 68 = 86, a 14% reduction, with the 1Gb network send dominating]

22. Multi-device invocation
• Encrypted sendfile (SSD → GPU → NIC, 512MB)
  − DCS provides much more efficient data movement to the GPU
  − Current bottleneck is the NIC (1Gbps)
[Chart: normalized processing time — with the 1Gb NIC: host-centric 32 + 6 + 62 vs. DCS 6 + 6 + 6 + 68 (14% reduction); with a 10Gb NIC: host-centric 32 + 6 + 12 vs. DCS 6 + 6 + 6 + 13 (38% reduction)]
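A hedged sketch of how an application might issue the encrypted sendfile as one DCS invocation; dcs_encrypted_sendfile is an assumed name, since the slides only state that the DCS Library exposes sendfile() and encrypted sendfile() variants.

```c
/* Hypothetical encrypted sendfile path (SSD -> GPU -> NIC) expressed as a
 * single DCS command chain; the dcs_* name is an assumption. */
#include <fcntl.h>
#include <unistd.h>

ssize_t dcs_encrypted_sendfile(int sock_fd, int file_fd,
                               off_t offset, size_t count);  /* assumed */

int send_encrypted(int sock_fd, const char *path, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* One call: SSD read -> GPU encryption -> NIC send, with the data
     * staying on the device side throughout. */
    ssize_t sent = dcs_encrypted_sendfile(sock_fd, fd, 0, len);

    close(fd);
    return sent < 0 ? -1 : 0;
}
```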

23. Real-world workload: Hadoop-grep
• Hadoop-grep (10GB)
  − Faster input delivery & smaller host resource consumption
[Chart: map and reduce progress (%) over time, host-centric vs. DCS]
• 38% faster processing

24. Scalability: More devices per host
• Doubling the number of devices in a single host
[Chart: normalized total device throughput when going from SSD + NIC to SSD×2 + NIC×2 — host-centric: 1.3× (CPU utilization 60% → 100%); DCS: 2× (CPU utilization 22% → 37%)]
• Scalable many-device support

25. Conclusion
• Device-Centric Server architecture
  − Manages emerging devices on behalf of the host
  − Optimized data transfer and device control
  − Easily extensible, modularized design
• Real hardware prototype evaluation
  − Device latency reduction: ~25%
  − Host resource savings: ~61%
  − Hadoop-grep speed improvement: ~38%

26. Thank you!
[Photo: DCS prototype — NetFPGA NIC, GPU, NVMe SSD, PCIe switch]
• Device latency reduction: ~25%
• Host resource savings: ~61%
• Hadoop-grep speed improvement: ~38%
High Performance Computing Lab, Pohang University of Science and Technology (POSTECH)
