Improving Spark Performance with Zero-copy Buffer Management and RDMA


  1. Improving Spark Performance with Zero-copy Buffer Management and RDMA
     Hu Li, Charley Chen and Wei Xu
     Institute for Interdisciplinary Information Sciences, Tsinghua University, China

  2. Latency matters in big data
     [Figure: job latencies on a scale from 10 min through 10 sec and 100 ms down to 1 ms. Systems shown: MapReduce Batch Job [2004], Hive Query [2009], Dremel Query [2010], In-memory Spark Query [2010], Impala Query [2012], Spark Streaming [2013]]
     Big Data: not only capable, but also interactive [Kay@SOSP13]

  3. Overview of our work
     • NetSpark: A reliable Spark package that takes advantage of the RDMA over Converged Ethernet (RoCE) fabric
     • A combination of memory management optimizations for JVM-based applications to take advantage of RDMA more efficiently
     • Improving latency-sensitive task performance, while staying fully compatible with the off-the-shelf Spark

  4. Background: Remote Direct Memory Access (RDMA)
     Lower CPU utilization and lower latency

  5. An overview of the NetSpark transfer model
     [Figure: transfer path between executors on Machine A and Machine B. Sender: Object (JVM heap) -> serialization -> Byte Array (JVM off-heap, user space) -> DMA Read -> RNIC. Network transfer. Receiver: RNIC -> DMA Write -> Byte Array (JVM off-heap, user space) -> deserialization -> Object (JVM heap)]

  6. Zero-copy network transfer
     Traditional way: Object (JVM heap) -> Serialize -> Byte Array (JVM heap) -> Network API (Copy) -> Byte Array (JVM off-heap) -> System call (Copy) -> Byte Array (kernel space)
     Our way: Object (JVM heap) -> Serialize -> Byte Array (JVM off-heap) -> DMA READ -> RNIC

  7. Implementation: Spark executors
     Executor (Spark): Threads 1 … N, BlockManager, BlockTransferService (TCP), SendingConnections, ReceivingConnections
     Executor (NetSpark): Threads 1 … N, BlockManager, BlockTransferService (RDMA), SendingConnections, ReceivingConnections, BufferManager

  8. RDMA buffer management
     • RDMA requires a fixed physical memory address
     • For Java: off-heap
     • Significant allocate/de-allocate cost
     • Need to register the memory with RDMA
     • High overhead
     Simple solution: pre-allocate RDMA buffer space to avoid allocation and registration overhead

  9. RDMA Buffer Management (cont’d)
     • A small number of large-enough fixed-size off-heap buffers
     • Like the Linux kernel buffer, but in user space
     • But … need to copy from heap to off-heap
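
The pre-allocation idea on slides 8 and 9 can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not NetSpark's BufferManager: the class name `OffHeapBufferPool` and its methods are hypothetical, and the sketch only allocates direct (off-heap) buffers with the standard `ByteBuffer.allocateDirect` call. Registering each buffer with the RDMA NIC, which a real implementation would do once at startup, is noted in a comment but not shown.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of a pool of fixed-size off-heap buffers.
// Not NetSpark's actual BufferManager.
final class OffHeapBufferPool {
    private final BlockingQueue<ByteBuffer> free;
    private final int bufferSize;

    OffHeapBufferPool(int bufferCount, int bufferSize) {
        this.bufferSize = bufferSize;
        this.free = new ArrayBlockingQueue<>(bufferCount);
        for (int i = 0; i < bufferCount; i++) {
            // Direct buffers live outside the garbage-collected heap.
            // Assumption: a real RDMA stack would register each buffer
            // with the NIC here, once, instead of on every transfer.
            free.add(ByteBuffer.allocateDirect(bufferSize));
        }
    }

    // Blocks until a buffer is free, so off-heap memory use stays bounded.
    ByteBuffer acquire() {
        try {
            ByteBuffer buf = free.take();
            buf.clear(); // reset position and limit for reuse
            return buf;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a buffer", e);
        }
    }

    // Returns a buffer to the pool; rejects buffers that do not match the pool's shape.
    void release(ByteBuffer buf) {
        if (!buf.isDirect() || buf.capacity() != bufferSize) {
            throw new IllegalArgumentException("buffer does not belong to this pool");
        }
        free.add(buf); // throws IllegalStateException if the pool is already full
    }

    int available() {
        return free.size();
    }
}
```

A sender would call `acquire()`, fill the buffer, hand it to the transport, and call `release()` once the transfer completes, so the allocation cost is paid only at construction time.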

  10. Serializing directly into the off-heap RDMA buffer
      • Rewrite Java InputStream and OutputStream to take advantage of the new buffer manager
      • Details in the paper
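
One way to picture the stream rewrite on slide 10 is a pair of stream classes backed by a direct `ByteBuffer`, so that serialized bytes land in off-heap memory without first filling a full-size heap byte array. The sketch below is an assumption-laden illustration, not the paper's implementation: `DirectBufferOutputStream`, `DirectBufferInputStream`, and `DirectStreamDemo` are hypothetical names, and `DataOutputStream` stands in for Spark's serializer.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Hypothetical: an OutputStream that writes straight into an off-heap buffer.
final class DirectBufferOutputStream extends OutputStream {
    private final ByteBuffer target;

    DirectBufferOutputStream(ByteBuffer target) {
        this.target = target;
    }

    @Override
    public void write(int b) throws IOException {
        ensureRoom(1);
        target.put((byte) b);
    }

    @Override
    public void write(byte[] src, int off, int len) throws IOException {
        ensureRoom(len);
        target.put(src, off, len);
    }

    // Fixed-size buffers cannot grow, so running out of room is an error here.
    private void ensureRoom(int n) throws IOException {
        if (target.remaining() < n) {
            throw new IOException("off-heap buffer full");
        }
    }
}

// Hypothetical: an InputStream that reads straight from an off-heap buffer.
final class DirectBufferInputStream extends InputStream {
    private final ByteBuffer source;

    DirectBufferInputStream(ByteBuffer source) {
        this.source = source;
    }

    @Override
    public int read() {
        return source.hasRemaining() ? (source.get() & 0xFF) : -1;
    }

    @Override
    public int read(byte[] dst, int off, int len) {
        if (len == 0) return 0;
        if (!source.hasRemaining()) return -1;
        int n = Math.min(len, source.remaining());
        source.get(dst, off, n);
        return n;
    }
}

final class DirectStreamDemo {
    // Writes a record into a direct buffer and reads it back from the same buffer.
    static String roundTrip(int id, String payload) {
        ByteBuffer buf = ByteBuffer.allocateDirect(4096);
        try (DataOutputStream out = new DataOutputStream(new DirectBufferOutputStream(buf))) {
            out.writeInt(id);
            out.writeUTF(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        buf.flip(); // switch the buffer from writing to reading
        try (DataInputStream in = new DataInputStream(new DirectBufferInputStream(buf))) {
            return in.readInt() + ":" + in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch the serializer may still stage small chunks in heap memory before each `write` call (for example, `writeUTF` encodes the string into a temporary array), but the complete serialized output exists only in the off-heap buffer.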

  11. Evaluation: Testbed
      1. 3 switches, 34 servers
      2. RoCE, 10GE
      3. Using priority flow control for RDMA to avoid packet loss
      [Figure: Network topology of our testbed. Labels: Switch, 3 x 40Gb Ethernet, Switch, 10Gb Ethernet, Server]

  12. Evaluation: Experiment Setup
      Compared four different executor implementations (Spark version: 1.5.0):
      1. Java NIO
      2. Netty
      3. Naive RDMA
      4. NetSpark
      [Figure legend: latency box plot marking min, 25, 50, 75, and max]

  13. Group-by performance on small dataset
      • Spark example
      • 2.5GB data shuffled
      About 17% improvement over the naive RDMA

  14. Why do we have an improvement?
      • CPU block time
      • Measurements from the Spark log
      • Following Kay@NSDI15

  15. Group by on larger data - entire reduce stage
      A larger dataset: about 107.3GB for shuffle
      ~40% faster than Netty


  16. PageRank on a large graph
      Twitter Graph Dataset [Kwak@www2010]: 41 million nodes, 1.5 billion edges
      20% faster than Netty
      10% faster than naive RDMA

  17. Conclusion
      • NetSpark: A reliable Spark package that takes advantage of the RDMA over Converged Ethernet (RoCE) fabric
      • A combination of memory management optimizations for JVM-based applications to take advantage of RDMA more efficiently
      • Improving latency-sensitive task performance, while staying fully compatible with the off-the-shelf Spark
      Wei Xu, weixu@tsinghua.edu.cn
