  1. StRoM: Smart Remote Memory
     David Sidler*‡, Zeke Wang†‡, Monica Chiosa‡, Amit Kulkarni‡, Gustavo Alonso‡
     * Microsoft Corporation
     † Collaborative Innovation Center of Artificial Intelligence, Zhejiang University
     ‡ Systems Group, Department of Computer Science, ETH Zürich

  2. Increasing Compute-Bandwidth Gap
     [Figure: relative speedup of CPU frequency vs. network bandwidth, 1980-2020, log scale; the widening distance between the two curves is the compute-bandwidth gap]
     ▪ An increasing share of CPU cycles is allocated to network processing
     ▪ Context switches between the OS network stack and the application amplify the issue

  3. RDMA (Remote Direct Memory Access)
     [Figure: two hosts, each with CPU, memory, and NIC; each NIC accesses the other host's memory directly]
     Complete hardware offload => bypasses OS and CPU
     Applications: distributed key-value stores [1,2], parallel database systems, distributed graph computation [3]
     [1] C. Mitchell et al., Using One-sided RDMA Reads to Build a Fast, CPU-efficient Key-value Store, ATC'13
     [2] A. Dragojevic et al., FaRM: Fast Remote Memory, NSDI'14
     [3] M. Wu et al., GraM: Scaling Graph Computation to the Trillions, SoCC'15

  4. GET over RDMA: Two-sided vs. One-sided
     Two-sided (Send/Receive):
     ▪ Single round trip: (1) client sends GET, (2) remote CPU reads the hash entry, compares keys, and reads the value, (3) remote side sends the value back
     ▪ Simple client-server model
     ▪ Remote CPU involved
     One-sided (Direct Access):
     ▪ Remote CPU not involved
     ▪ At least 2 round trips necessary: (1) client reads the hash table entry, (2) client compares keys, (3) client reads the value
     ▪ Handling of misses is costly
     No clear winner (a client-side sketch of the one-sided path follows below)
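
     The contrast above can be made concrete with a small client-side sketch. This is a conceptual model, not the paper's code: rdma_read() stands in for a real one-sided RDMA read, and "remote" memory is modeled as local arrays so the example runs stand-alone. It illustrates why a one-sided GET needs at least two round trips, and why a miss costs even more.

```cpp
#include <cstdint>
#include <cstring>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

struct HashEntry { uint64_t key; uint64_t value_addr; uint32_t value_len; };

// "Remote" host memory, modeled locally so the sketch runs stand-alone.
static std::vector<HashEntry> remote_hash_table(1024);
static std::vector<char>      remote_value_store(1 << 20);

// Stand-in for a one-sided RDMA read: copy len bytes from "remote" memory.
// In a real deployment each call is one network round trip.
void rdma_read(const void* remote_addr, void* local_buf, std::size_t len) {
    std::memcpy(local_buf, remote_addr, len);
}

std::optional<std::string> one_sided_get(uint64_t key) {
    // Round trip 1: read the hash entry of the key's bucket.
    HashEntry e{};
    rdma_read(&remote_hash_table[key % remote_hash_table.size()], &e, sizeof(e));

    // Keys are compared on the client; a miss requires further reads (costly).
    if (e.key != key) return std::nullopt;

    // Round trip 2: read the value the entry points to.
    std::string value(e.value_len, '\0');
    rdma_read(&remote_value_store[e.value_addr], value.data(), e.value_len);
    return value;
}

int main() {
    // Populate the modeled remote memory with one key/value pair.
    std::memcpy(remote_value_store.data(), "hello", 5);
    remote_hash_table[42 % remote_hash_table.size()] = {42, 0, 5};

    std::cout << one_sided_get(42).value_or("<miss>") << "\n";  // prints "hello"
}
```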

  5. StRoM: Smart Remote Memory
     StRoM: deployment of acceleration kernels on the NIC
     [Figure: StRoM kernel running on the NIC, next to the host CPU and memory]
     StRoM kernel:
     ▪ Direct access to host memory
     ▪ Able to receive/transmit data over RDMA

  6. GET as StRoM Kernel
     [Figure: (1) client issues an RDMA RPC to the GET kernel on the remote NIC, (2) the kernel reads the hash entry and value from remote memory, (3) the kernel writes the value back to the client]
     ▪ Kernel reads the hash entry, compares keys, and reads the value
     ▪ Single round trip
     ▪ Remote CPU not involved (a NIC-side sketch of the kernel's logic follows below)
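
     To contrast with the client-driven version, the sketch below models the GET kernel's logic on the NIC side. DmaPort and NetPort are illustrative stand-ins for the NIC's DMA and RDMA interfaces (the paper's kernels are HLS pipelines, and these names are not from the paper); host memory is modeled as a byte buffer so the sketch compiles and runs.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

struct HashEntry { uint64_t key; uint64_t value_addr; uint32_t value_len; };

// Illustrative interfaces: on the FPGA these would be DMA and RDMA streams.
struct DmaPort {                         // read access to host memory over PCIe
    const uint8_t* host_mem;
    void read(uint64_t addr, void* buf, std::size_t len) const {
        std::memcpy(buf, host_mem + addr, len);
    }
};

struct NetPort {                         // RDMA write back to the requesting client
    std::vector<uint8_t> last_response;
    void write_to_client(const void* buf, std::size_t len) {
        last_response.assign(static_cast<const uint8_t*>(buf),
                             static_cast<const uint8_t*>(buf) + len);
    }
};

// One GET request, handled entirely on the NIC: the host CPU never runs.
void get_kernel(uint64_t key, uint64_t table_addr, uint64_t num_buckets,
                const DmaPort& dma, NetPort& net) {
    // 1. DMA-read the hash entry of the key's bucket from host memory.
    HashEntry e{};
    dma.read(table_addr + (key % num_buckets) * sizeof(HashEntry), &e, sizeof(e));

    // 2. Compare keys on the NIC; a real kernel would probe further or
    //    report a not-found status to the client on a miss.
    if (e.key != key) return;

    // 3. DMA-read the value and write it straight back to the client over RDMA.
    std::vector<uint8_t> value(e.value_len);
    dma.read(e.value_addr, value.data(), value.size());
    net.write_to_client(value.data(), value.size());
}

int main() {
    // Host memory laid out as [hash table | value store], modeled as one buffer.
    std::vector<uint8_t> host(1 << 16, 0);
    const uint64_t table_addr = 0, num_buckets = 1024;
    const uint64_t value_addr = num_buckets * sizeof(HashEntry);
    std::memcpy(host.data() + value_addr, "hello", 5);
    HashEntry e{42, value_addr, 5};
    std::memcpy(host.data() + (42 % num_buckets) * sizeof(HashEntry), &e, sizeof(e));

    DmaPort dma{host.data()};
    NetPort net;
    get_kernel(42, table_addr, num_buckets, dma, net);  // one round trip end to end
    // net.last_response now holds "hello"
}
```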

  7. Acceleration Capabilities
     Accelerating data access: invoke one-sided RPCs on the remote NIC
     ▪ Traversal of remote data structures
     ▪ Verification of data objects
     ▪ Manipulation of simple data structures
     Accelerating data processing: on-the-fly processing while transmitting/receiving
     ▪ Data shuffling
     ▪ Filtering (see the sketch after this slide)
     ▪ Pattern/event detection
     ▪ Aggregation
     ▪ Compression
     ▪ Statistics gathering
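
     As a concrete example of the data-processing class, here is a minimal software model of a bump-in-the-wire filter kernel. The record layout and the predicate are illustrative, and the actual kernels are hardware pipelines rather than software loops; the point is only that filtered-out data never has to cross PCIe or the network.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

struct Record { uint32_t id; int32_t value; };

// Records stream through the NIC; only those matching the predicate are forwarded.
std::vector<Record> filter_kernel(const std::vector<Record>& in_stream,
                                  int32_t threshold) {
    std::vector<Record> out_stream;
    for (const Record& r : in_stream)   // in hardware: one record per cycle
        if (r.value > threshold)        // predicate evaluated on the wire
            out_stream.push_back(r);
    return out_stream;
}

int main() {
    std::vector<Record> in = {{1, 5}, {2, 42}, {3, 17}};
    for (const Record& r : filter_kernel(in, 10))
        std::cout << r.id << " ";       // prints: 2 3
}
```

     Aggregation or statistics-gathering kernels follow the same pattern, replacing the predicate with a running reduction over the stream.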

  8. Use Case: Gathering Statistics
     HyperLogLog (HLL) kernel to estimate the cardinality of a data set
     • Bump-in-the-wire kernel (see the HLL sketch after this slide)
     • Cardinality estimation can augment the optimizer in data processing systems
     [Figure: Node 2 issues an RDMA WRITE of data to remote memory; the HLL kernel on the NIC (leading zeros, hash buckets, harmonic mean) updates the data statistics on the fly]
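
     For reference, below is a minimal software model of the HyperLogLog estimator that the HLL kernel computes in hardware: hash each item, count leading zeros, keep the maximum per bucket, and combine the buckets with a harmonic mean. The bucket count, hash function, and bias constant are illustrative choices, not taken from the paper.

```cpp
#include <bit>
#include <cmath>
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

class HyperLogLog {
public:
    explicit HyperLogLog(unsigned p = 12) : p_(p), buckets_(1u << p, 0) {}

    void add(const std::string& item) {
        // Assumes std::hash yields 64-bit values (true on typical 64-bit platforms).
        uint64_t h = std::hash<std::string>{}(item);
        uint64_t bucket = h >> (64 - p_);           // top p bits select the bucket
        uint64_t rest = (h << p_) | 1;              // remaining bits; the |1 bounds clz
        uint8_t rank = std::countl_zero(rest) + 1;  // leading zeros + 1
        if (rank > buckets_[bucket]) buckets_[bucket] = rank;
    }

    double estimate() const {                       // harmonic mean of 2^-rank per bucket
        double m = static_cast<double>(buckets_.size());
        double sum = 0.0;
        for (uint8_t b : buckets_) sum += std::ldexp(1.0, -b);
        double alpha = 0.7213 / (1.0 + 1.079 / m);  // standard bias correction (large m)
        return alpha * m * m / sum;                 // small/large-range corrections omitted
    }

private:
    unsigned p_;
    std::vector<uint8_t> buckets_;
};

int main() {
    HyperLogLog hll;
    for (int i = 0; i < 100000; ++i)
        hll.add("key-" + std::to_string(i));
    std::cout << "estimated cardinality: " << hll.estimate() << "\n";  // roughly 100000
}
```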

  9. Evaluation – StRoM NIC
     • FPGA-based prototype RDMA NIC
     • Extended RoCEv2 implementation with support for StRoM
     • StRoM at 10G: Alpha Data ADM-PCIE-7V3
     • StRoM at 100G: Xilinx VCU118

  10. Evaluation – GET kernel
      [Figure: GET kernel latency measurements; plot annotation: 5 μs]

  11. Evaluation – HLL kernel

  12. Conclusion
      StRoM: Smart Remote Memory
      • Deployment of acceleration kernels on the NIC
      • Acceleration of data access and data processing at up to 100G
      • Research platform, open source at github.com/fpgasystems/fpga-network-stack
