Data Processing at the Speed of 100 Gbps using Apache Crail
Patrick Stuedi, IBM Research
Apache Crail (crail.apache.org)

Ephemeral Data (map-reduce job)
Input data (HDFS, S3) -> Broadcast, Map, Shuffle, Reduce -> Output data (HDFS, S3)
Broadcast and shuffle data: Apache Crail
Intermediate data: HDFS, S3

Ephemeral Data (multi-job pipeline)
Input data (HDFS, S3) -> ML pre-processing (map-reduce job) -> normalized images -> ML training (Tensorflow job)
Normalized images passed between the two jobs: HDFS, S3, or Apache Crail

Why/when to use Crail
No Crail needed: 100 MB/s, 10 ms; 10 Gb/s, 20 us
Crail land (100x): 10 GB/s, 10 us; 200 Gb/s, 1 us
Terasort, 12.8 TB data, 128 nodes: Spark/Crail 88.3 s, Spark/Vanilla 527.6 s
[Figure: throughput (Gbit/s) vs. elapsed time (seconds) for Spark/Crail and Spark/Vanilla, with the hardware limit marked]
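
The Terasort numbers above can be turned into a speedup and an average per-node sort rate. The following is a minimal sketch that uses only the figures on the slide; the function names are illustrative, and decimal units (1 TB = 10^12 bytes) are an assumption. The per-node rate is an average over the whole job, so it is not the same quantity as the network throughput curve in the figure.

```python
# Figures taken from the slide: Terasort, 12.8 TB of data, 128 nodes.
DATA_TB = 12.8
NODES = 128
RUNTIME_S = {"Spark/Crail": 88.3, "Spark/Vanilla": 527.6}


def speedup(baseline_s: float, improved_s: float) -> float:
    """How many times faster the improved run is than the baseline."""
    return baseline_s / improved_s


def avg_rate_gbit_per_node(data_tb: float, runtime_s: float, nodes: int) -> float:
    """Average sort rate per node in Gbit/s, assuming 1 TB = 1e12 bytes."""
    total_bits = data_tb * 1e12 * 8
    return total_bits / runtime_s / nodes / 1e9


if __name__ == "__main__":
    s = speedup(RUNTIME_S["Spark/Vanilla"], RUNTIME_S["Spark/Crail"])
    print(f"speedup: {s:.2f}x")
    for name, runtime in RUNTIME_S.items():
        rate = avg_rate_gbit_per_node(DATA_TB, runtime, NODES)
        print(f"{name}: {rate:.2f} Gbit/s per node on average")
```

With these inputs the speedup comes out at roughly 6x.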

Performance Challenge (HotNets’16)
Software overheads are spread over the entire stack
Sorting application: Sorter, Serializer
Data processing framework: Netty, JVM
Network path: sockets, TCP/IP, Ethernet, NIC
Storage path: filesystem, block layer, iSCSI, SSD
Reduce task: fetch chunk over the network, process chunk in reduce task
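
To see why software overhead matters more on fast hardware, a simple model helps: each chunk costs its wire time plus a fixed per-chunk software overhead. This is a rough sketch only; the chunk size and the 10 us overhead below are hypothetical values chosen for illustration and are not measurements from the talk.

```python
def effective_throughput_gbps(chunk_bytes: int, link_gbps: float, overhead_us: float) -> float:
    """Achieved throughput when every chunk pays a fixed software overhead.

    Model (an assumption): time per chunk = wire time + overhead,
    with no overlap between the two.
    """
    chunk_bits = chunk_bytes * 8
    wire_time_s = chunk_bits / (link_gbps * 1e9)
    total_time_s = wire_time_s + overhead_us * 1e-6
    return chunk_bits / total_time_s / 1e9


if __name__ == "__main__":
    chunk = 125_000        # hypothetical chunk size: 10^6 bits
    overhead = 10.0        # hypothetical per-chunk software overhead in us
    for link in (10.0, 100.0):
        achieved = effective_throughput_gbps(chunk, link, overhead)
        print(f"{link:.0f} Gbit/s link: {achieved:.1f} Gbit/s achieved")
```

Under this model the same overhead that costs under a tenth of a 10 Gbit/s link halves the throughput of a 100 Gbit/s link, which is the regime the slide calls "Crail land".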

Crail Overview
Multiple interfaces
Multiple storage backends (pluggable, open interface)
Primary high-performance storage backends

Crail Architecture & API
Node types: MultiFile (optimized for shuffle data), key-value semantics, append-only file
API: non-blocking & asynchronous
[Code listings: Java and C++ API examples]
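
The non-blocking, asynchronous style combined with an append-only file can be sketched as follows. This is a toy in-memory model and not the Crail API: `ToyAppendOnlyFile`, `append`, `poll` and `write_chunks` are hypothetical names. Each call posts an operation and returns a future at once; completions are reaped later by polling in the caller's own thread.

```python
from collections import deque
from concurrent.futures import Future


class ToyAppendOnlyFile:
    """Hypothetical in-memory stand-in for an append-only file (not the Crail API)."""

    def __init__(self) -> None:
        self._data = bytearray()
        self._pending = deque()

    def append(self, payload: bytes) -> Future:
        """Post an append and return immediately; the future later yields the offset."""
        fut = Future()
        self._pending.append((payload, fut))
        return fut

    def poll(self) -> int:
        """Complete all posted operations in the caller's thread; return the count."""
        completed = 0
        while self._pending:
            payload, fut = self._pending.popleft()
            offset = len(self._data)
            self._data.extend(payload)
            fut.set_result(offset)
            completed += 1
        return completed

    def contents(self) -> bytes:
        return bytes(self._data)


def write_chunks(f: ToyAppendOnlyFile, chunks: list) -> list:
    """Issue every append first, then poll once and collect the offsets."""
    futures = [f.append(chunk) for chunk in chunks]  # nothing blocks here
    f.poll()                                         # reap completions
    return [fut.result() for fut in futures]
```

Issuing all operations before waiting lets the caller keep many requests in flight, which is the point of a non-blocking interface.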
Where does the performance come from?

User-Level I/O: Metadata
Crail client library: no threads, no context switches
[Figure: metadata operation, steps 1 and 2]

User-Level I/O: Data
Zero-copy; transfer only data that is requested
[Figure: data operation as seen from the application, steps 1 and 2]
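
The two ideas on this slide, zero-copy and transferring only the requested range, can be illustrated in-process with Python's `memoryview`. This is an analogy under stated assumptions and not how Crail moves data over the network; `read_range` is a hypothetical helper.

```python
def read_range(block: bytearray, offset: int, length: int) -> memoryview:
    """Return a view of exactly the requested bytes without copying them."""
    if offset < 0 or length < 0 or offset + length > len(block):
        raise ValueError("requested range lies outside the block")
    # Slicing a memoryview creates a new view on the same memory, not a copy.
    return memoryview(block)[offset:offset + length]


if __name__ == "__main__":
    block = bytearray(b"0123456789")
    view = read_range(block, 2, 3)
    print(bytes(view))
    block[2] = ord("X")   # a change in the block is visible through the view
    print(bytes(view))
```

Because the view shares memory with the block, a later write to the block shows up in the view, which confirms that no intermediate copy was made.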

Crail Deployment Modes
Compute/storage co-located
Storage disaggregation
Flash storage disaggregation

YCSB KeyValue Workload
[Figure: GET latency [us], two panels: value size 1KB and value size 100KB]
Crail offers GET latencies of ~12 us and 30 us for DRAM and NVM, respectively, for 100-byte KV pairs
Crail offers GET latencies of ~30 us and 40 us for DRAM and NVM, respectively, for 1000-byte KV pairs

Spark GroupBy (80M keys, 4K)
[Figure: throughput (Gbit/s) vs. elapsed time (seconds) for 1, 4 and 8 cores; two panels: Spark/Vanilla and Spark/Crail, the latter annotated 2x, 2.5x and 5x]
Spark shuffling via Crail on a single core is 2x faster than vanilla Spark on 8 cores per executor (8 executors)

DRAM & Flash Disaggregation Crail enables disaggregation of temporary data at no cost

DRAM/Flash Tiering
[Figure: runtime (seconds) of the map and reduce phases vs. memory-to-flash ratio (100/0, 80/20, 60/40, 40/60, 20/80, 0/100), with vanilla Spark (100% memory) for comparison]
Using only flash increases the sorting time by around 48%

Conclusions
● Apache Crail: Fast distributed “tmp”
– User-level I/O
– Storage disaggregation
– Memory/flash convergence
● Applications
– Intra-job scratch space (shuffle, broadcast, etc.)
– Multi-job pipelines
● Coming soon
– Native Crail (C++)
– Tensorflow-Crail