Integration of Spark parallelization in TMVA
Georgios Douzas
Supervisors: Enric Tejedor, Sergei Gleyzer

Spark engine
¤ A general-purpose framework for distributed data processing.
¤ Implemented in Scala.
¤ Provides a Python API called PySpark.
¤ Two main concepts:
  - RDD (Resilient Distributed Dataset)
  - DAG (Directed Acyclic Graph)

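Spark engine
¤ An RDD is an immutable, partitioned data structure that is processed in parallel.
¤ The DAG describes a job as a graph of operations and serves as the programming model for distributed execution.
¤ RDD operations: transformations (lazy, each returns a new RDD) and actions (trigger the computation and return a result).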

Spark architecture

Parallelization of the TMVA code
¤ Identify opportunities for parallelism.
¤ Examine whether the parallelism improves performance.
¤ Target loops whose iterations are independent calculations.
¤ Use the same interface as the C++ TMVA code.

Parallelization in TMVA
¤ Cross validation.
¤ Optimization of tuning parameters.
¤ Local search for the optimization of tuning parameters.

Cross validation

Cross validation
[Figure panels: Validation; K-fold cross validation]

Parallelized CrossValidate
¤ RDD = sc.parallelize( [fold_0, fold_1, …, fold_{k-1}] ).
¤ A map transformation is applied to the RDD.
¤ A new RDD with an AUC value for each fold index is returned.
¤ The average AUC is calculated.
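
The four steps above can be sketched in plain Python, without Spark, to show the data flow. This is only an illustration under stated assumptions: the toy dataset, the `fold_auc` helper, and the use of a raw feature as the classifier score are hypothetical and are not TMVA's CrossValidate. In PySpark, the list of folds would be passed to `sc.parallelize` and `fold_auc` to the RDD's `map`.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def make_data(n=200, seed=0):
    """Toy events (feature, label): signal is shifted by one unit."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        y = i % 2
        data.append((rng.gauss(1.0 if y else 0.0, 1.0), y))
    return data

K = 5               # number of folds (assumed)
DATA = make_data()  # stands in for the broadcast DataLoader

def fold_auc(fold):
    """Hypothetical per-fold task: score the held-out fold.
    A real task would first train on the other folds; here the feature
    itself is the score, so no training step is needed."""
    test = DATA[fold::K]  # every K-th event belongs to this fold
    scores = [x for x, _ in test]
    labels = [y for _, y in test]
    return fold, auc(scores, labels)

folds = list(range(K))
# With Spark: sc.parallelize(folds).map(fold_auc).collect()
results = list(map(fold_auc, folds))
mean_auc = sum(a for _, a in results) / K
print(results, mean_auc)
```

Because each fold is evaluated independently of the others, the `map` step is the natural place to distribute the work.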

[Diagram: parallelized CrossValidate]
Driver program (SparkContext): Read input data; DataLoader object; Serialize; Serialized DataLoader; Factory object; CrossValidate parameters; Broadcast.
RDD [fold_0, fold_1, …] → Map (fold AUC function) → RDD [ (fold_0, AUC_0), … ] → Results.
Workers (three shown), each with: Serialized DataLoader object; task-local Factory object; Input root file.
Distributed file system: Input root file.

Optimization of tuning parameters

Parallelized OptimizeTuningParameters (Full search of parameter space)
¤ A default parameter space is defined.
¤ RDD = sc.parallelize( [ (fold_0, par_0), …, (fold_{k-1}, par_0), …, (fold_0, par_{p-1}), …, (fold_{k-1}, par_{p-1}) ] ).
¤ A map transformation is applied to the RDD.
¤ A new RDD with an AUC value for each fold and parameter index is returned.
¤ The maximum AUC in each fold is calculated.
¤ The cross validation AUC is calculated for each “fold winner” parameter.
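
A plain-Python sketch of the full search may help make the last two steps concrete. Everything here is assumed for illustration: the parameter grid, the synthetic `fold_param_auc` score, and the reading of "cross validation AUC" as the mean AUC of a winning parameter over all folds. In PySpark, the list of pairs would go through `sc.parallelize(...).map(...)`.

```python
K = 3                           # number of folds (assumed)
PARAMS = [0.1, 0.5, 1.0, 2.0]   # hypothetical one-dimensional parameter space

def fold_param_auc(pair):
    """Stand-in for training and evaluating one (fold, parameter) pair.
    The value returned is a synthetic score, not a measured AUC."""
    fold, par = pair
    preferred = 0.5 + 0.5 * fold  # each fold prefers a different value
    return fold, par, 0.9 - 0.1 * abs(PARAMS[par] - preferred)

# Same ordering as on the slide: all folds for par 0, then par 1, ...
pairs = [(f, p) for p in range(len(PARAMS)) for f in range(K)]
# With Spark: sc.parallelize(pairs).map(fold_param_auc).collect()
scored = list(map(fold_param_auc, pairs))

# Fold winner: the parameter index with the maximum AUC in each fold.
winners = {}
for fold, par, a in scored:
    if fold not in winners or a > winners[fold][1]:
        winners[fold] = (par, a)

# Cross validation AUC of each fold winner: its mean AUC over all folds.
table = {(f, p): a for f, p, a in scored}
winning_pars = {par for par, _ in winners.values()}
cv_auc = {par: sum(table[(f, par)] for f in range(K)) / K
          for par in winning_pars}
best_par = max(cv_auc, key=cv_auc.get)
print(winners, cv_auc, best_par)
```

The full search evaluates all k × p pairs, which is what the local search on a later slide tries to avoid.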

[Diagram: parallelized OptimizeTuningParameters]
Driver program (SparkContext): Read input data; DataLoader object; Serialize; Serialized DataLoader; Factory object; OptimizeTuningParameters parameters; Broadcast.
RDD [ (fold_0, par_0), … ] → Map ((fold, parameter) AUC function) → RDD [ (fold_0, par_0, AUC_(0,0)), … ] → Results.
Workers (three shown), each with: Serialized DataLoader object; task-local Factory object; Input root file.
Distributed file system: Input root file.

Parallelized OptimizeTuningParameters (Local search of parameter space)
¤ For each fold, a hill climbing (H.C.) algorithm is applied.
¤ The calculations within each H.C. iteration are parallelized.
¤ The RDD contains only a subset of all the (fold, parameter) pairs.
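
As a rough illustration of the local search, the sketch below runs hill climbing over parameter indices for one fold at a time. The neighbourhood (adjacent grid indices), the starting point, and the synthetic score are assumptions made for this example; the slides do not specify them. The evaluation of the neighbours inside each iteration is the step that would be handed to Spark.

```python
PARAMS = [0.1, 0.2, 0.5, 1.0, 2.0, 5.0]  # hypothetical parameter grid

def fold_param_auc(pair):
    """Synthetic stand-in for a (fold, parameter index) AUC evaluation."""
    fold, par = pair
    peak = 2 + fold % 2  # index of the best parameter for this fold
    return fold, par, 0.9 - 0.05 * abs(par - peak)

def hill_climb(fold, start=0):
    """Move to the best neighbouring index until no neighbour improves."""
    current = start
    evaluated = {current: fold_param_auc((fold, current))[2]}
    while True:
        around = [p for p in (current - 1, current + 1)
                  if 0 <= p < len(PARAMS)]
        # Only the not-yet-seen neighbours are evaluated in this iteration.
        # With Spark: sc.parallelize(pairs).map(fold_param_auc).collect()
        pairs = [(fold, p) for p in around if p not in evaluated]
        for _, p, a in map(fold_param_auc, pairs):
            evaluated[p] = a
        best = max(around + [current], key=evaluated.get)
        if evaluated[best] <= evaluated[current]:
            return current, evaluated  # local maximum reached
        current = best

results = {fold: hill_climb(fold) for fold in range(2)}
for fold, (best, evaluated) in results.items():
    print(fold, best, len(evaluated), "of", len(PARAMS), "evaluated")
```

In this toy run each fold stops after visiting only part of the grid, which matches the slide's point that the RDD holds a subset of all (fold, parameter) pairs.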

Spark cluster
[Diagram: two nodes (Node 1, Node 2); driver program with SparkContext; master; Worker 1 and Worker 2, with 4 cores each.]

Experimental results

Experimental results
[Figure panels: Full search; Hill climbing]
