Integration of Spark parallelization in TMVA


  1. Integration of Spark parallelization in TMVA ¤ Author: Georgios Douzas ¤ Supervisors: Enric Tejedor, Sergei Gleyzer

  2. Spark engine ¤ A generalized framework for distributed data processing. ¤ Implemented in Scala. ¤ Provides a Python API called PySpark. ¤ Two main concepts: - RDD (Resilient Distributed Dataset) - DAG (Directed Acyclic Graph)

  3. Spark engine ¤ RDD is an immutable parallel data structure. ¤ DAG is a programming model for distributed systems. ¤ RDD operations: Transformations and Actions.
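
The distinction between transformations and actions can be illustrated with a small stand-in class. This is only a sketch: `ToyRDD` is a hypothetical name and not part of PySpark. It mimics two properties of real RDDs, namely that a transformation such as `map` returns a new dataset without computing anything, and that an action such as `collect` triggers the evaluation.

```python
class ToyRDD:
    """Hypothetical stand-in for a Spark RDD; not the PySpark API."""

    def __init__(self, data, ops=()):
        self._data = tuple(data)  # source data is never modified (immutability)
        self._ops = tuple(ops)    # recorded transformations, applied only on demand

    def map(self, func):
        # Transformation: returns a NEW ToyRDD and computes nothing yet.
        return ToyRDD(self._data, self._ops + (func,))

    def collect(self):
        # Action: runs the recorded transformations and returns the results.
        out = list(self._data)
        for func in self._ops:
            out = [func(x) for x in out]
        return out


calls = []  # records every evaluation of square()


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD([1, 2, 3])
squares = numbers.map(square)     # transformation: square() has not run yet
calls_before_action = len(calls)
result = squares.collect()        # action: square() runs now
```

In PySpark the corresponding calls would be `sc.parallelize([1, 2, 3]).map(square).collect()`, with the work distributed over the workers instead of running in a local loop.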

  4. Spark architecture

  5. Parallelization of the TMVA code ¤ Identify opportunities for parallelism. ¤ Examine whether the parallelism improves performance. ¤ Target loops that contain independent calculations. ¤ Use the same interface as the C++ TMVA code.

  6. Parallelization in TMVA ¤ Cross validation. ¤ Optimization of tuning parameters. ¤ Local search for the optimization of tuning parameters.

  7. Cross validation

  8. Cross validation (figures: validation; K-fold cross validation)

  9. Parallelized CrossValidate ¤ RDD = sc.parallelize([fold 0, fold 1, …, fold k-1]). ¤ A map transformation is applied to the RDD. ¤ A new RDD with an AUC value for each fold index is returned. ¤ The average AUC is calculated.
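
The steps on this slide can be sketched in plain Python. Everything here is an assumption made for illustration: the synthetic one-dimensional data set, the toy "classifier" (which only orients the score along the difference of the class means), and the names `make_data`, `fold_auc` and `auc`. The built-in `map` stands in for the RDD map transformation, so the sketch runs without a Spark installation and does not reflect the actual TMVA code.

```python
import random
from statistics import mean


def auc(scores, labels):
    """AUC as the fraction of (signal, background) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_data(n=200, seed=0):
    """Synthetic events (x, label): signal centred at +1, background at -1."""
    rng = random.Random(seed)
    return [(rng.gauss(1.0 if i % 2 else -1.0, 1.0), i % 2) for i in range(n)]


DATA = make_data()
K = 5  # number of folds


def fold_auc(fold):
    """Train on all events outside `fold`, return (fold, AUC on the fold)."""
    train = [e for i, e in enumerate(DATA) if i % K != fold]
    test = [e for i, e in enumerate(DATA) if i % K == fold]
    # Toy "training": sign of the difference between the class means.
    direction = (mean(x for x, y in train if y == 1)
                 - mean(x for x, y in train if y == 0))
    scores = [direction * x for x, _ in test]
    return fold, auc(scores, [y for _, y in test])


folds = list(range(K))                  # [fold 0, fold 1, ..., fold k-1]
fold_aucs = list(map(fold_auc, folds))  # one (fold, AUC) pair per fold
average_auc = mean(a for _, a in fold_aucs)
```

With PySpark, the `map` line would become something like `sc.parallelize(folds).map(fold_auc).collect()`, so that each fold is trained and evaluated as an independent task.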

  10. Diagram: parallelized CrossValidate ¤ Driver program (SparkContext): read input data, DataLoader object, serialized DataLoader, Factory object, CrossValidate parameters, broadcast. ¤ Map: the fold AUC function maps RDD [fold 0, fold 1, …] to the results RDD [(fold 0, AUC 0), …]. ¤ Workers (three shown): each with a serialized DataLoader object, a task-local Factory object, and the input ROOT file. ¤ Distributed file system: input ROOT file.

  11. Optimization of tuning parameters

  12. Parallelized OptimizeTuningParameters (Full search of parameter space) ¤ A default parameter space is defined. ¤ RDD = sc.parallelize([(fold 0, par 0), …, (fold k-1, par 0), …, (fold 0, par p-1), …, (fold k-1, par p-1)]). ¤ A map transformation is applied to the RDD. ¤ A new RDD with an AUC value for each fold and parameter index is returned. ¤ The maximum AUC in each fold is calculated. ¤ The cross validation AUC is calculated for each “fold winner” parameter.
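
A minimal sketch of the full search follows, again in plain Python with the built-in `map` standing in for the RDD map transformation. The function `toy_auc` is a hypothetical placeholder for "train with parameter `par` and evaluate on `fold`", and the parameter indices are invented. The sketch also assumes that the cross validation AUC of a parameter is its mean AUC over all folds, which the slides do not spell out.

```python
from itertools import product
from statistics import mean

K = 3                     # number of folds
PARAMS = [0, 1, 2, 3, 4]  # indices into a hypothetical default parameter space


def toy_auc(fold, par):
    """Placeholder for training with `par` and testing on `fold`."""
    best = 2 + fold % 2   # folds 0 and 2 prefer par 2, fold 1 prefers par 3
    return 0.9 - 0.01 * (par - best) ** 2


def pair_auc(pair):
    fold, par = pair
    return fold, par, toy_auc(fold, par)


# (fold 0, par 0), ..., (fold k-1, par 0), ..., (fold k-1, par p-1)
pairs = [(fold, par) for par, fold in product(PARAMS, range(K))]
results = list(map(pair_auc, pairs))  # one (fold, par, AUC) triple per pair

# Maximum AUC in each fold gives the "fold winner" parameter.
winners = {}
for fold, par, auc in results:
    if fold not in winners or auc > winners[fold][1]:
        winners[fold] = (par, auc)

# Cross validation AUC of each fold winner: mean AUC over all folds.
auc_table = {(fold, par): auc for fold, par, auc in results}
winner_params = {par for par, _ in winners.values()}
cv_auc = {par: mean(auc_table[(fold, par)] for fold in range(K))
          for par in winner_params}
best_par = max(cv_auc, key=cv_auc.get)
```

Because every (fold, parameter) pair is evaluated independently, the number of parallel tasks grows from k in CrossValidate to k × p here.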

  13. Diagram: parallelized OptimizeTuningParameters ¤ Driver program (SparkContext): read input data, DataLoader object, serialized DataLoader, Factory object, OptimizeTuningParameters parameters, broadcast. ¤ Map: the (fold, parameter) AUC function maps RDD [(fold 0, par 0), …] to the results RDD [(fold 0, par 0, AUC (0, 0)), …]. ¤ Workers (three shown): each with a serialized DataLoader object, a task-local Factory object, and the input ROOT file. ¤ Distributed file system: input ROOT file.

  14. Parallelized OptimizeTuningParameters (Local search of parameter space) ¤ For each fold, a hill climbing (H.C.) algorithm is applied. ¤ The calculations within each H.C. iteration are parallelized. ¤ The RDD includes only a subset of all (fold, parameter) pairs.
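
The local search can be sketched as follows. The hill climbing variant shown here (integer parameter indices, neighbours at ±1, stop when no neighbour improves the AUC) is an assumption, as is the placeholder `toy_auc`; the slides do not specify these details. The built-in `map` again stands in for a map over an RDD built in each iteration.

```python
K = 3          # number of folds
N_PARAMS = 20  # size of a hypothetical one-dimensional parameter space


def toy_auc(fold, par):
    """Placeholder for training with `par` and testing on `fold`."""
    best = 6 + fold
    return 0.9 - 0.01 * (par - best) ** 2


def pair_auc(pair):
    fold, par = pair
    return fold, par, toy_auc(fold, par)


def neighbours(par):
    return [p for p in (par - 1, par + 1) if 0 <= p < N_PARAMS]


current = {fold: 0 for fold in range(K)}  # starting point for every fold
current_auc = {fold: toy_auc(fold, 0) for fold in range(K)}
active = set(range(K))                    # folds that are still climbing
evaluated = set()                         # every pair sent to the map step

while active:
    # One batch per iteration: the neighbours of every active fold.
    batch = [(fold, p) for fold in sorted(active)
             for p in neighbours(current[fold])]
    evaluated.update(batch)
    results = list(map(pair_auc, batch))  # with Spark: one RDD per iteration

    improved = set()
    for fold, par, auc in results:
        if auc > current_auc[fold]:       # move to the better neighbour
            current[fold], current_auc[fold] = par, auc
            improved.add(fold)
    active = improved                     # a fold stops once nothing improves
```

In this toy run only part of the (fold, parameter) grid is ever evaluated, which is the trade-off against the full search: fewer tasks overall, but the iterations must run one after another.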

  15. Spark cluster (diagram): two nodes (Node 1, Node 2) hosting the driver program (SparkContext), the master, and two workers (Worker 1, Worker 2) with 4 cores each.

  16. Experimental results

  17. Experimental results

  18. Experimental results (panels: full search; hill climbing)
