Accelerating Cross-Validation in Spark Using GPU
Minsik Cho, Rajesh Bordawekar
IBM TJW Research

Cross-Validation 101
Popular model validation technique [Wikipedia]
– To avoid overfitting, for better generalization
– Useful when there is not enough data to hold out a separate validation set

Cross-Validation + Elastic Net Regression
Tons of problems to crunch
Cross-validation is popularly used with [Wikipedia]
– Linear/Logistic Regression
– Elastic Net Regularization
A large number of problems to solve
– Number of folds from cross-validation
– Various lambdas to find the best prediction model
– 4 folds x 1000 lambdas = 4000 regressions to fit
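
The slides name elastic net regularization and a sweep over lambdas but do not spell out the penalty. As a minimal sketch, assuming the commonly used form lam * (alpha * L1 + (1 - alpha) / 2 * L2^2), the term each regression adds to its loss could be computed as below. The mixing parameter `alpha` and the function name are assumptions for illustration, not taken from the talk.

```python
def elastic_net_penalty(weights, lam, alpha):
    """Elastic net penalty for one (lam, alpha) setting.

    Assumed form: lam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2).
    alpha = 1 gives a pure L1 (lasso) penalty, alpha = 0 a pure L2 (ridge) penalty.
    """
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


# The same weights are penalized differently as lam changes,
# which is why every lambda is a separate regression to fit.
weights = [1.0, -2.0]
lasso_like = elastic_net_penalty(weights, lam=0.5, alpha=1.0)
ridge_like = elastic_net_penalty(weights, lam=0.5, alpha=0.0)
```

Because the penalty changes with every lambda, each (fold, lambda) pair has to be fitted on its own, which is where the 4000 regressions come from.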

Apache Spark Overview
In-memory engine for large-scale distributed data processing
– Used in database, streaming, machine/deep learning, graph processing
– Supports high-level APIs in Java, Scala, Python and R
RDD: resilient distributed datasets
– Partitioned collection of records
– Spread across the cluster
– Caching dataset in memory

Spark GPU Acceleration
Accelerated compute-intensive workload with GPUs [Rajesh, oreilly.com]

Cross-Validation in Spark
For each problem
– Create RDD
– Distribute RDD
– Call optimizer
– Return model
[Berkeley]

Cross-Validation in Spark
[Figure: the dataset is partitioned into an RDD (Dataset i, Dataset j, Dataset k on workers i, j, k); the workers' results are reduced into one model.]
Is this best for GPU?

Proposed Cross-Validation in Spark Using GPU
Broadcast data
– Cross-validation reuses the same mother dataset
RDD of problem instances, not data
– Tons of problems with different folding/lambdas
Maximize GPU streams to minimize down-time

Cross-Validation in Spark Using GPU
Dataset broadcasted as array
[Figure: workers i, j, k each hold a full copy of the dataset; the problems have not been distributed yet.]

Cross-Validation in Spark Using GPU
Problems distributed as RDD
[Figure: the problems in the RDD are split into Problems i, Problems j, Problems k; each of workers i, j, k holds the full dataset plus its own share of problems.]

Code Snippet
Build a problem set
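
The code from this slide did not survive extraction. A minimal sketch of what building a problem set could look like is given below, assuming each problem is a (fold, lambda) pair and that the lambdas are log-spaced. Both are assumptions: `build_problems`, `log_spaced` and the lambda range are hypothetical names and values, not the authors' code.

```python
from itertools import product


def log_spaced(low, high, count):
    """Return `count` values from `low` to `high`, evenly spaced on a log scale."""
    ratio = (high / low) ** (1.0 / (count - 1))
    return [low * ratio ** i for i in range(count)]


def build_problems(num_folds, lambdas):
    """One problem per (held-out fold, lambda) combination."""
    return [(fold, lam) for fold, lam in product(range(num_folds), lambdas)]


# 4 folds x 1000 lambdas = 4000 problems, matching the count quoted earlier.
lambdas = log_spaced(1e-4, 1.0, 1000)
problems = build_problems(4, lambdas)
```

Each entry is only a small description of work to do, so the problem set stays tiny compared with the dataset it refers to.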

Code Snippet (cont.)
Input: dataset, problems
– Dataset broadcast
– Problem RDD
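
The original snippet is not recoverable from the extracted text. The two steps it names, broadcasting the dataset and turning the problems into an RDD, can be sketched in PySpark roughly as follows. `sc` is expected to be a live `SparkContext`; `solve` is a hypothetical callback standing in for the GPU solver, and the function name `cross_validate` is likewise an assumption.

```python
def cross_validate(sc, dataset, problems, solve, num_slices=None):
    """Broadcast the dataset once and distribute only the problems.

    sc         -- a SparkContext (only broadcast/parallelize are used)
    dataset    -- the full "mother" dataset, e.g. a list or array
    problems   -- list of problem descriptors, e.g. (fold, lambda) pairs
    solve      -- hypothetical function (dataset, problem) -> fitted model
    """
    # Ship one read-only copy of the dataset to every worker up front.
    shared = sc.broadcast(dataset)
    # The RDD holds problem instances, not data.
    problem_rdd = sc.parallelize(problems, numSlices=num_slices)
    # Each task reads the local broadcast copy; workers never exchange data.
    fitted = problem_rdd.map(lambda p: (p, solve(shared.value, p)))
    return fitted.collect()
```

A call might look like `cross_validate(SparkContext.getOrCreate(), data, problems, solve)`. The design choice is that the only per-problem traffic is the small problem descriptor going out and the fitted model coming back.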

Cross-Validation in Spark Using GPU
[Figure: inside worker i, which holds the dataset and Problems i, dataset folds 0 to 3 reside on GPU0 and GPU1; each fold is paired with a cudaStream that runs the corresponding problems (Problem a:0 to a:3, Problem b:0 to b:3).]
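
One way to read the figure is that every problem is queued on the stream that owns its fold, so several problems run concurrently on each GPU. The sketch below models only that bookkeeping in plain Python, with no CUDA calls. The fold-to-stream mapping (fold modulo the total number of streams, two streams per GPU) is an assumption for illustration, as are all names.

```python
from collections import defaultdict


def assign_to_streams(problems, num_gpus=2, streams_per_gpu=2):
    """Group (name, fold) problems into per-(gpu, stream) work queues.

    Assumed mapping: fold f goes to slot f % (num_gpus * streams_per_gpu);
    consecutive slots fill the streams of GPU 0 first, then GPU 1, and so on.
    """
    total_streams = num_gpus * streams_per_gpu
    queues = defaultdict(list)
    for name, fold in problems:
        slot = fold % total_streams
        gpu, stream = divmod(slot, streams_per_gpu)
        queues[(gpu, stream)].append((name, fold))
    return dict(queues)


# Two problems (a, b), each with four folds, as labelled in the figure.
problems = [(name, fold) for name in ("a", "b") for fold in range(4)]
queues = assign_to_streams(problems)
```

With these defaults, folds 0 and 1 land on GPU 0 and folds 2 and 3 on GPU 1, and each stream queues the same fold of problem a followed by problem b.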

Cross-Validation in Spark Using GPU
Problems distributed as RDD
[Figure: workers i, j, k each hold the full dataset and solve Problems i, Problems j, Problems k; the results are reduced into all models.]
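
Once all models have been collected, the usual cross-validation step is to pick the lambda with the lowest average validation error across folds. The slides do not show this step, so the following is only a sketch under that assumption; the `(fold, lambda, error)` result layout and the function name are hypothetical.

```python
from collections import defaultdict


def best_lambda(results):
    """Return the lambda with the lowest mean validation error over folds.

    results -- iterable of (fold, lam, validation_error) tuples,
               one per solved problem.
    """
    errors_by_lambda = defaultdict(list)
    for _fold, lam, error in results:
        errors_by_lambda[lam].append(error)
    return min(
        errors_by_lambda,
        key=lambda lam: sum(errors_by_lambda[lam]) / len(errors_by_lambda[lam]),
    )


# Two folds, two candidate lambdas: mean error 0.25 for 0.1 versus 0.30 for 1.0.
results = [(0, 0.1, 0.30), (1, 0.1, 0.20), (0, 1.0, 0.25), (1, 1.0, 0.35)]
chosen = best_lambda(results)
```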

Cross-Validation in Spark Using GPU (Advantages)
Dataset broadcast
– Efficient p2p protocol in Spark
– One-time upfront overhead
– Data reused within GPUs
Problem RDD
– No communication among workers
– Multiple streams to maximize GPU utilization
Multi-level parallelism
– Functional parallelism from problem RDD
– Multiple GPUs
– Multiple cudaStreams

Experimental Results
System
– 2-node cluster
– Each node with 32 x86 cores
– Each node with two K40m GPUs
Software
– Spark 2.0
– OpenJDK 1.8
Workload
– Real Watson Health dataset
– 5-fold cross-validation
– 1024-lambda exploration
Algorithms
– Logistic regression
– Linear regression
Measured end-to-end runtime, including dataset broadcast

Result: GPU Utilization
Sustained over 97% multi-GPU utilization

Result: Logistic Regression
[Figure annotations]
– No help: 2 problems
– Help: more problems
– Help: enough problems
114x speedup

Result: Linear Regression
94x speedup

Conclusion
Cross-validation on Spark using GPU
– New way of parallelization in Spark
Broadcast dataset, RDDmized problems
– Reduce communication
About 100x speedup for logistic/linear regression + elastic net
Future work
– Support out-of-core execution