Accelerating Cross-Validation in Spark Using GPU
Minsik Cho, Rajesh Bordawekar (IBM T.J. Watson Research) - PowerPoint presentation transcript


  1. Accelerating Cross-Validation in Spark Using GPU
     Minsik Cho, Rajesh Bordawekar (IBM T.J. Watson Research)

  2. Cross-Validation 101 [Wikipedia]
     • Popular model validation technique
       – to avoid overfitting, for better generalization
       – useful when there is not enough data
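
     To make k-fold cross-validation concrete, here is a minimal Scala sketch (not from the talk; the helper name and the modulo-based split are assumptions): each of the k folds is held out once for validation while the remaining folds are used for training.

        object KFoldSketch {
          // For n examples and k folds, return (training indices, validation indices) per fold.
          def kFoldIndices(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] =
            (0 until k).map { fold =>
              val validation = (0 until n).filter(_ % k == fold)    // held-out fold
              val training   = (0 until n).filterNot(_ % k == fold) // remaining folds
              (training, validation)
            }

          def main(args: Array[String]): Unit =
            kFoldIndices(n = 10, k = 5).foreach { case (tr, va) =>
              println(s"train=${tr.mkString(",")}  validate=${va.mkString(",")}")
            }
        }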

  3. Cross-Validation + Elastic Net Regression: tons of problems to crunch [Wikipedia]
     • Cross-validation is popularly used with
       – linear/logistic regression
       – elastic net regularization
     • A large number of problems to solve
       – the number of folds from cross-validation
       – various lambdas to find the best prediction model
       – e.g., 4 folds x 1000 lambdas = 4000 regressions to fit (the objective is sketched below)
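
     For reference, each of those fits minimizes the standard elastic-net-penalized objective (textbook form, not taken from the slides), one instance per (fold, lambda) pair:

        \min_{w}\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i,\, w^{\top}x_i\big)
          \;+\; \lambda\left(\alpha\,\lVert w\rVert_1 + \frac{1-\alpha}{2}\,\lVert w\rVert_2^2\right)

     where the loss \ell is squared error for linear regression or the logistic loss for logistic regression, \alpha mixes the L1 and L2 penalties, and \lambda is swept over the grid of candidate values.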

  4. Apache Spark Overview
     • In-memory engine for large-scale distributed data processing
       – used for databases, streaming, machine/deep learning, and graph processing
       – supports high-level APIs in Java, Scala, Python, and R
     • RDD: Resilient Distributed Dataset
       – a partitioned collection of records
       – spread across the cluster
       – datasets can be cached in memory (a minimal example follows below)
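
     As a minimal illustration of the RDD abstraction (assumed setup, not from the talk), a local collection can be parallelized into partitions, cached in memory, and reduced in a distributed way:

        import org.apache.spark.sql.SparkSession

        object RddDemo {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
            val sc    = spark.sparkContext

            val rdd = sc.parallelize(1 to 1000000, numSlices = 32) // partitioned collection of records
            rdd.cache()                                            // keep the partitions in memory
            println(rdd.map(_.toLong).reduce(_ + _))               // distributed reduction
            spark.stop()
          }
        }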

  5. Spark GPU Acceleration [Rajesh Bordawekar, oreilly.com]
     • Accelerate compute-intensive workloads with GPUs

  6. Cross-Validation in Spark
     • For each problem (baseline flow, sketched below):
       – create an RDD of the training data
       – distribute the RDD across the workers
       – call the optimizer
       – return the model
     [Berkeley]
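
     A minimal sketch of that baseline flow (the row layout and the fitDistributed callback are assumptions standing in for MLlib's solver; nothing here is the talk's actual code):

        import org.apache.spark.SparkContext
        import org.apache.spark.rdd.RDD

        object BaselineCv {
          case class Model(weights: Array[Double])

          // For every (fold, lambda) problem the *data* itself becomes an RDD,
          // is distributed across the workers, and a distributed optimizer
          // reduces it to a single model.
          def crossValidate(sc: SparkContext,
                            rows: Seq[Array[Double]],
                            numFolds: Int,
                            lambdas: Seq[Double],
                            fitDistributed: (RDD[Array[Double]], Double) => Model): Seq[Model] =
            for (fold <- 0 until numFolds; lambda <- lambdas) yield {
              val training = rows.zipWithIndex.collect { case (r, i) if i % numFolds != fold => r }
              val dataRdd  = sc.parallelize(training)   // create + distribute the data RDD
              fitDistributed(dataRdd, lambda)           // distributed optimizer returns one model
            }
        }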

  7. Cross-Validation in Spark
     [Diagram: the dataset is partitioned as an RDD, so each worker (i, j, k) holds only its own slice (Dataset i, j, k); a reduce step then yields one model. Is this the best fit for GPUs?]

  8. Proposed Cross-Validation in Spark Using GPU
     • Broadcast the data
       – cross-validation reuses the same mother dataset
     • Build an RDD of problem instances, not of data
       – tons of problems with different folds/lambdas
     • Maximize GPU streams to minimize idle time

  9. Cross-Validation in Spark Using GPU
     [Diagram: the dataset is broadcast as an array, so every worker (i, j, k) holds a full copy of the dataset.]

  10. Cross-Validation in Spark Using GPU
      [Diagram: the problems are distributed as an RDD, so each worker (i, j, k) holds the full broadcast dataset plus its own partition of problems (Problems i, j, k).]

  11. Code Snippet: build a problem set (a reconstructed sketch follows)
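
      The snippet itself is not reproduced in this transcript; a plausible Scala sketch of building the problem set (the CvProblem name and the log-spaced lambda grid are assumptions): every (fold, lambda) combination becomes one lightweight problem descriptor, with no copy of the heavy dataset.

        // Hypothetical reconstruction of "build a problem set".
        case class CvProblem(fold: Int, lambda: Double)

        object BuildProblems {
          def apply(numFolds: Int, lambdas: Seq[Double]): Seq[CvProblem] =
            for (f <- 0 until numFolds; l <- lambdas) yield CvProblem(f, l)

          def main(args: Array[String]): Unit = {
            val lambdas  = (0 until 1024).map(i => math.pow(10.0, 2.0 - 6.0 * i / 1023)) // assumed grid
            val problems = BuildProblems(numFolds = 5, lambdas = lambdas)
            println(problems.size) // 5 folds x 1024 lambdas = 5120 problem instances
          }
        }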

  12. Code Snippet (cont.)
      Input: dataset, problems
      Broadcast the dataset; distribute the problems as an RDD (a reconstructed sketch follows)
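
      The original code is only described here, not shown; a hedged sketch of the broadcast-plus-problem-RDD pattern, reusing the CvProblem type from the previous sketch (solveOnGpu is a hypothetical per-partition solver, not a real API):

        import org.apache.spark.SparkContext
        import org.apache.spark.broadcast.Broadcast
        import org.apache.spark.rdd.RDD

        object GpuCrossValidation {
          case class Model(fold: Int, lambda: Double, weights: Array[Double])

          // Stub for the per-partition GPU solver: in the talk, this is where native
          // code fans problems out over GPUs and cudaStreams. Here it only returns
          // placeholder models so the sketch compiles.
          def solveOnGpu(data: Array[Array[Double]], problems: Iterator[CvProblem]): Iterator[Model] =
            problems.map(p => Model(p.fold, p.lambda, new Array[Double](data.head.length)))

          def run(sc: SparkContext,
                  dataset: Array[Array[Double]],
                  problems: Seq[CvProblem]): Seq[Model] = {
            val bcData: Broadcast[Array[Array[Double]]] = sc.broadcast(dataset) // one-time upfront broadcast
            val problemRdd: RDD[CvProblem] = sc.parallelize(problems)           // RDD of problems, not data
            problemRdd
              .mapPartitions(iter => solveOnGpu(bcData.value, iter))            // each worker reuses its broadcast copy
              .collect()                                                        // gather all models
              .toSeq
          }
        }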

  13. Cross-Validation in Spark Using GPU
      [Diagram: inside worker i, the broadcast dataset and the worker's problem partition are spread over two GPUs (GPU0, GPU1); each GPU keeps the four dataset folds (fold 0–3) resident and runs one cudaStream per fold, with problems (a:0–a:3, b:0–b:3) dispatched to the stream that matches their fold.]
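
      One way that per-worker fan-out could look in outline (a sketch under stated assumptions: two GPUs per worker, one stream per fold, and a hypothetical fitOnStream native binding; none of this is the talk's actual code):

        object StreamDispatchSketch {
          val gpusPerWorker = 2 // e.g., two K40m per node
          val streamsPerGpu = 4 // one cudaStream per dataset fold

          // Hypothetical native binding: fit one (fold, lambda) problem on the given
          // device and stream so transfers and kernels overlap across streams.
          def fitOnStream(device: Int, stream: Int, p: CvProblem): Unit =
            println(s"device=$device stream=$stream fold=${p.fold} lambda=${p.lambda}")

          // Round-robin problems over the GPUs, and route each problem to the
          // stream that already holds its fold's slice of the broadcast dataset.
          def dispatch(problems: Seq[CvProblem]): Unit =
            problems.zipWithIndex.foreach { case (p, i) =>
              fitOnStream(device = i % gpusPerWorker, stream = p.fold % streamsPerGpu, p = p)
            }
        }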

  14. Cross-Validation in Spark Using GPU
      [Diagram: with the dataset broadcast to every worker (i, j, k) and the problems distributed as an RDD (Problems i, j, k), a final reduce collects all fitted models.]
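
      After the reduce, picking the winning lambda is a plain aggregation over the collected models' held-out errors; a minimal sketch (the Scored shape and its validationError field are assumptions):

        object SelectBestLambda {
          case class Scored(fold: Int, lambda: Double, validationError: Double)

          // Average the held-out error across folds for each lambda and keep the best.
          def bestLambda(results: Seq[Scored]): Double =
            results
              .groupBy(_.lambda)
              .map { case (lambda, rs) => lambda -> rs.map(_.validationError).sum / rs.size }
              .minBy(_._2)
              ._1
        }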

  15. Cross-Validation in Spark Using GPU (Advantages)
      • Dataset broadcast
        – efficient peer-to-peer protocol in Spark
        – one-time upfront overhead
        – data reused within the GPUs
      • Problem RDD
        – no communication among workers
        – multiple streams to maximize GPU utilization
      • Multi-level parallelism
        – functional parallelism from the problem RDD
        – multiple GPUs
        – multiple cudaStreams

  16. Experimental Results
      • System
        – 2-node cluster
        – each node with 32 x86 cores
        – each node with two K40m GPUs
      • Software
        – Spark 2.0
        – OpenJDK 1.8
      • Workload
        – real Watson Health dataset
        – 5-fold cross-validation
        – 1024-lambda exploration
      • Algorithms
        – logistic regression
        – linear regression
      • Measured end-to-end runtime, including the dataset broadcast

  17. Result: GPU Utilization
      • Sustained over 97% multi-GPU utilization

  18. Result: Logistic Regression
      [Chart: no benefit with only 2 problems; the GPU starts to help with more problems and reaches a 114x speedup once there are enough problems.]

  19. Result: Linear Regression
      [Chart: 94x speedup.]

  20. Conclusion
      • Cross-validation on Spark using GPUs: a new way of parallelizing work in Spark
        – broadcast the dataset
        – turn the problems into an RDD
        – reduces communication
      • About 100x speedup for logistic/linear regression with elastic net
      • Future work
        – support out-of-core execution
