
DCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture



  1. DCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture. Dongup Kwon (1), Jaehyung Ahn (2), Dongju Chae (2), Mohammadamin Ajdari (2), Jaewon Lee (1), Suheon Bae (1), Youngsok Kim (1), and Jangwoo Kim (1). (1) Dept. of Electrical and Computer Engineering, Seoul National University; (2) Dept. of Computer Science and Engineering, POSTECH

  2. Conventional Server Architecture • Primarily rely on “CPU and memory” − CPU-centric computing & in-memory storage − Slow and low-bandwidth peripheral devices [Figure: host- & CPU-centric server with CPU, storage, network, and compute]


  4. Device-centric Server Architecture • Exploit “fast & high-bandwidth devices” − Data processing accelerators (e.g., GPU, FPGA) − Storage (e.g., SSD), network (e.g., 100GbE), PCIe Gen3 [Figure: host- & CPU-centric architecture vs. device-centric architecture, in which NICs, NVM storage, GPUs, and FPGAs are attached over PCIe]

  5. Index • Existing approaches • DCS-ctrl: HW-based device-control mechanism • Experimental results • Conclusion

  6. Existing Approaches • Software optimization − Memory mgmt. optimization, user-level device interface − Do not address multi-device tasks • P2P communication − Transfer data directly through PCI Express ⇒ D2D comm. • Device integration − Integrate heterogeneous devices ⇒ D2D comm.

  7. Limitations of Existing D2D Comm. • P2P communication − Direct data transfers through PCI Express ⇒ D2D comm. − Slow and high-overhead control path [Figure: latency (us) and CPU utilization (%) broken down into device control, data copy, kernel, and others; diagram of the data path and control path among Dev A, Dev B, Dev C, and the CPU]

  8. Limitations of Existing D2D Comm. • Integrated devices − Integrating heterogeneous devices ⇒ D2D comm. − Fast data & control transfers − Fixed and inflexible aggregate implementation [Figure: Dev A, Dev B, and Dev C integrated behind shared controllers; adding a new device is costly]

  9. Limited Performance Potential while (true) { rc_recv = recv(fd_sock, buffer, recv_size, 0); if (rc_recv <= 0) break; processing(&md_ctx, buffer, recv_size); rc_write = write(fd_file, buffer, recv_size); … } • “Intermediate” processing between device ops − Prevent applications from using direct D2D comm. − Cause host-side resource contention (CPU and memory) [Figure: the CPU sits between Dev A and Dev B]
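The host-side loop on this slide can be sketched as a compilable function. This is only an illustration under stated assumptions: `relay` and `md_ctx_t` are hypothetical names, and the running byte sum stands in for whatever digest the slide's `processing(&md_ctx, ...)` computes. Unlike the slide snippet, the sketch processes and writes the byte count that `recv` actually returned, not the requested size.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for the slide's md_ctx: a running byte sum, not MD5. */
typedef struct { uint32_t sum; } md_ctx_t;

/* Intermediate processing performed on the host CPU between the two device ops. */
static void processing(md_ctx_t *ctx, const unsigned char *buf, size_t len) {
    for (size_t i = 0; i < len; i++) ctx->sum += buf[i];
}

/* Receive from a socket, process on the CPU, write to a file descriptor.
 * Returns the total number of bytes relayed, or -1 on a write error. */
ssize_t relay(int fd_sock, int fd_file, md_ctx_t *ctx) {
    unsigned char buffer[4096];
    ssize_t total = 0;
    assert(ctx != NULL);
    for (;;) {
        ssize_t rc_recv = recv(fd_sock, buffer, sizeof buffer, 0);
        if (rc_recv <= 0) break;              /* peer closed or error */
        processing(ctx, buffer, (size_t)rc_recv);
        ssize_t off = 0;
        while (off < rc_recv) {               /* handle short writes */
            ssize_t rc_write = write(fd_file, buffer + off, (size_t)(rc_recv - off));
            if (rc_write <= 0) return -1;
            off += rc_write;
        }
        total += rc_recv;
    }
    return total;
}
```

Every byte crosses host memory twice and is touched by the CPU once in between, which is the contention the slide points at.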

  10. Design Goals • Performance & scalability − Faster inter-device data & control communication − More scalable with CPU-efficient device operations • Flexibility − Support any type of off-the-shelf device • Applicability − Increase the opportunity to apply D2D comm.

  11. Index • Existing approaches • DCS-ctrl: HW-based device-control mechanism − Key ideas and benefits − Architecture • Experimental results • Conclusion

  12. DCS-ctrl: Key Ideas & Benefits • DCS-ctrl: PCIe P2P + “HDC” − Hardware-based device-control (HDC) mechanism − HDC Engine: “FPGA-based” device orchestrator + “near-device” processing unit • Performance & scalability ⇒ HDC, device orchestrator • Flexibility ⇒ FPGA-based, low-cost device controller • Applicability ⇒ near-device processing unit

  13. HDC Engine: Overview [Figure: SW-controlled P2P, where the application drives Dev A, Dev B, and Dev C through device drivers A, B, and C, vs. DCS-ctrl (HW), where the FPGA-based HDC Engine drives them through device controllers A, B, and C plus an NDP unit]

  14. DCS-ctrl: Key Ideas & Benefits void ssd_to_nic() { get_from_ssd(&data); process_in_HDC(&data); write_to_nic(&data); } • Optimized dev. control ⇒ Faster & scalable communication • Generic dev. interfaces ⇒ Higher flexibility • Near-device processing ⇒ Higher applicability [Figure: data path and control path among the CPU, the HDC Engine, Dev A, Dev B, Dev C, and a new device attached through a device controller]

  15. Key Idea #1: Device Orchestrator • Perform multi-device tasks w/o CPU involvement − Offload a multi-device task to HDC Engine − Manage all device operations and their dependencies
      Scoreboard (multi-device task: Dev A → NDP → Dev B)
      Dev | R/W   | Src         | Dst         | Aux  | State
      A   | Read  | Addr(DevA)  | Addr(NDP-A) | -    | Done
      NDP | -     | Addr(NDP-A) | Addr(NDP-B) | Hash | Issue
      B   | Write | Addr(NDP-B) | Addr(DevB)  | -    | Ready
      Fast hardware-level device control
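The scoreboard's dependency tracking can be illustrated with a small software model. This is a sketch under assumptions, not the HDC Engine's actual logic: the type and function names are hypothetical, the three states borrow the slide's labels, and the rule that an entry is issued only once every earlier entry is done is an assumed in-order policy for a simple chain such as Dev A → NDP → Dev B.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry states, named after the slide's "Ready", "Issue", "Done". */
typedef enum { OP_READY, OP_ISSUED, OP_DONE } op_state;
typedef enum { OP_READ, OP_PROCESS, OP_WRITE } op_kind;

/* One scoreboard row: Dev, R/W, Src, Dst, Aux, State. */
typedef struct {
    const char *dev;
    op_kind kind;
    const char *src;
    const char *dst;
    const char *aux;   /* e.g. "Hash", or NULL when unused */
    op_state state;
} sb_entry;

/* Issue the first entry that is still ready, but only if every earlier
 * entry is done. Returns the issued index, or -1 if nothing can be issued. */
int sb_issue_next(sb_entry *sb, size_t n) {
    assert(sb != NULL);
    for (size_t i = 0; i < n; i++) {
        if (sb[i].state == OP_DONE) continue;
        if (sb[i].state == OP_ISSUED) return -1;  /* predecessor still in flight */
        sb[i].state = OP_ISSUED;
        return (int)i;
    }
    return -1;
}

/* Mark an in-flight entry as done (e.g. on a device completion). */
void sb_complete(sb_entry *sb, size_t n, size_t i) {
    assert(sb != NULL && i < n);
    assert(sb[i].state == OP_ISSUED);
    sb[i].state = OP_DONE;
}

/* Returns 1 when the whole multi-device task has finished, else 0. */
int sb_all_done(const sb_entry *sb, size_t n) {
    assert(sb != NULL);
    for (size_t i = 0; i < n; i++)
        if (sb[i].state != OP_DONE) return 0;
    return 1;
}
```

Loading the three rows from the slide and alternating `sb_issue_next` with `sb_complete` walks the task through read, hash, and write without any per-step decision by the application.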

  16. Key Idea #2: Device Controller • Provide interfaces between HDC Engine & devices − Include submission & completion queues − Build standard & vendor-specific device commands • Flexible & low-cost device control [Figure: device controller with a submission queue, a completion queue, and doorbell registers, connected to the device through the PCIe switch]
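The submission queue, completion queue, and doorbell named on this slide can be modeled in plain C to show how the three pieces interact. This is a toy simulation under assumptions: the layout is loosely NVMe-like, every struct and function name is hypothetical, and `sq_doorbell` is an ordinary field standing in for a memory-mapped device register. No real device interface is implied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_DEPTH 8u

/* Hypothetical command and completion formats. */
typedef struct { uint16_t cmd_id; uint8_t opcode; uint64_t addr; uint32_t len; } dev_cmd;
typedef struct { uint16_t cmd_id; uint16_t status; } dev_cpl;

typedef struct {
    dev_cmd sq[QUEUE_DEPTH];  /* submission ring, filled by the controller */
    dev_cpl cq[QUEUE_DEPTH];  /* completion ring, filled by the device */
    uint32_t sq_tail, sq_head;
    uint32_t cq_tail, cq_head;
    uint32_t sq_doorbell;     /* stands in for a doorbell register */
} queue_pair;

/* Controller side: enqueue a command and ring the doorbell.
 * Returns 0 on success, -1 if the submission ring is full. */
int qp_submit(queue_pair *qp, dev_cmd cmd) {
    assert(qp != NULL);
    uint32_t next = (qp->sq_tail + 1u) % QUEUE_DEPTH;
    if (next == qp->sq_head) return -1;
    qp->sq[qp->sq_tail] = cmd;
    qp->sq_tail = next;
    qp->sq_doorbell = next;   /* tell the device a new tail is visible */
    return 0;
}

/* Simulated device: consume one command and post a successful completion.
 * Returns 1 if a command was processed, 0 otherwise. */
int dev_step(queue_pair *qp) {
    assert(qp != NULL);
    if (qp->sq_head == qp->sq_doorbell) return 0;   /* nothing new */
    uint32_t next_cq = (qp->cq_tail + 1u) % QUEUE_DEPTH;
    if (next_cq == qp->cq_head) return 0;           /* completion ring full */
    dev_cmd c = qp->sq[qp->sq_head];
    qp->sq_head = (qp->sq_head + 1u) % QUEUE_DEPTH;
    qp->cq[qp->cq_tail].cmd_id = c.cmd_id;
    qp->cq[qp->cq_tail].status = 0;
    qp->cq_tail = next_cq;
    return 1;
}

/* Controller side: pop one completion if available. Returns 1 or 0. */
int qp_poll(queue_pair *qp, dev_cpl *out) {
    assert(qp != NULL && out != NULL);
    if (qp->cq_head == qp->cq_tail) return 0;
    *out = qp->cq[qp->cq_head];
    qp->cq_head = (qp->cq_head + 1u) % QUEUE_DEPTH;
    return 1;
}
```

Keeping one ring slot empty distinguishes a full ring from an empty one, so a depth of 8 holds at most 7 outstanding commands in this model.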

  17. Key Idea #3: Near-device Processing • Near-device processing units − Execute intermediate processing between device ops − Scale-out storage app ⇒ hash, encryption, compression
      Processing unit | LUTs  | Registers | Applications
      MD5             | 3.0%  | 0.69%     | Swift
      AES256          | 3.52% | 0.99%     | HDFS, Swift
      GZIP            | 5.36% | 2.09%     | HDFS
      Easy to be extended & support other devices & applications; highly applicable to existing applications

  18. Index • Existing approaches • DCS-ctrl: HW-based device-control mechanism − Key ideas and benefits − Architecture • Experimental results • Conclusion

  19. Baseline Architecture • Software-controlled P2P − P2P comm. + indirect device-control path [Figure: in software, the application controls Dev A, Dev B, and Dev C through per-device drivers; in hardware, the devices exchange data through the PCIe switch]

  20. DCS-ctrl: HW-based Device Control (1/3) • Offload device-control path to HDC Engine − Scoreboard: schedule device operations in a multi-dev task [Figure: the application offloads an A – B – C task to the FPGA-based HDC Engine, whose scoreboard (Dev, r/w, Src, Dst) schedules Dev A, Dev B, and Dev C behind the PCIe switch]

  21. DCS-ctrl: Low-cost Integration (2/3) • Implement an FPGA-based device controller − Device controller: directly control devices using P2P [Figure: the FPGA-based HDC Engine with scoreboard and device controller, controlling Dev A, Dev B, Dev C, and a new device through the PCIe switch]

  22. DCS-ctrl: Near-device Processing (3/3) • Provide units for intermediate processing − NDP unit: perform data processing on a data path [Figure: the FPGA-based HDC Engine with scoreboard, device controller, near-device processing units, and intermediate buffers, connected to Dev A, Dev B, Dev C, and a new device through the PCIe switch]

  23. DCS-ctrl Prototype • HDC Engine implemented on Xilinx Virtex-7 VC707 • Supports off-the-shelf devices − Intel 750 SSDs, Broadcom 10GbE NICs, NVIDIA GPUs

  24. Index • Existing approaches • DCS-ctrl: HW-based device-control mechanism • Experimental results • Conclusion

  25. Reducing Device Control Latency • encrypted_sendfile(): SSD → hash → NIC − SW opt (+P2P): frequent boundary crossings, complex software − DCS-ctrl: fewer crossings, hardware-based device control [Figure: latency (us) broken down into HW, kernel, data copy, and device control for SW opt, SW opt + P2P, and DCS-ctrl, without processing and with processing (AES256); annotated latency reductions of 42% and 72%]

  26. Reducing CPU Utilization • Swift & HDFS workloads − Offload device control & data transfers to hardware [Figure: normalized CPU utilization for SW opt, SW opt + P2P, and DCS-ctrl on Swift and HDFS, broken down into kernel, GPU control, and others; annotated values of 52%, 49%, and 50%]

  27. Scalability: More Devices • Swift & HDFS workloads − More CPU-efficient ⇒ support more high-performance devices [Figure: CPU utilization (# cores) vs. throughput (Gbps) for SW opt, SW opt + P2P, and DCS-ctrl on Swift and HDFS]

  28. Conclusion • Fast & flexible device-control mechanism − Hardware-based device-control (HDC) mechanism − FPGA-based standard device controllers − Near-device data processing (NDP) units • Real hardware prototype evaluation − 72% faster inter-device communication − 50% lower CPU utilization for Swift & HDFS

  29. Thank you! We will release our IP & tools soon!
