GPU Parallel SubTree Interpreter for Genetic Programming

Alberto Cano and Sebastián Ventura
Knowledge Discovery and Intelligent Systems Research Group
University of Córdoba, Spain

Vancouver, Canada, July 12-16, 2014
Overview

1. Parallelization approaches for GP evaluation
2. Stack-based GP interpreter
3. Parallel SubTree interpreter
4. Experiments
5. Conclusions
6. Future work
1 Parallelization approaches for GP evaluation

"Genetic Programming is embarrassingly parallel"

• Population parallel
  • Multi-core CPUs (acceptable for small population sizes)
  • Many-core GPUs (required for large population sizes)
• Data parallel
  • Each GP program is run on multiple fitness cases (thousands, millions)
  • GPU SIMD viewpoint
1 Parallelization approaches for GP evaluation

• Population and data parallel
  • 2D grid of threads: one dimension for individuals, the other for fitness cases
  [Figure: 2D thread grid pairing each GP individual with each fitness case of the data]
• Performance hints (a kernel sketch follows):
  • Warp: a single GP individual run on 32 fitness cases
  • GP individual in constant memory: a single read is broadcast to the warp
  • Data coalescence: transposed data matrix
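A minimal CUDA sketch of this mapping (all names, sizes, and the token layout are assumptions, not the paper's exact code): blockIdx.y selects the GP individual, threads along x select fitness cases, so each warp of 32 threads runs one individual on 32 consecutive cases; programs live in constant memory, and the transposed data matrix makes each warp's reads coalesced.

    #include <cuda_runtime.h>

    #define MAX_POP       128   // individuals resident in constant memory
    #define MAX_PROG_LEN  128   // tokens per program
                                // 128 * 128 * 4 B = 64 KB of constant memory

    // One constant-memory read per warp is broadcast to all 32 threads,
    // since they all interpret the same program.
    __constant__ int d_programs[MAX_POP * MAX_PROG_LEN];

    // Placeholder interpreter so the sketch compiles stand-alone; the
    // real body is the stack/subtree interpreter of the next slides.
    __device__ float evalProgram(const int *prog, const float *data,
                                 int instance, int numInstances)
    {
        return data[(prog[0] - 1) * numInstances + instance];
    }

    __global__ void evalPopulation(const float *data, float *fitness,
                                   int numInstances)
    {
        int instance   = blockIdx.x * blockDim.x + threadIdx.x; // fitness case
        int individual = blockIdx.y;                            // GP individual
        if (instance >= numInstances) return;

        // Transposed layout data[att * numInstances + instance]: a warp
        // reads 32 consecutive instances of one attribute -> coalesced.
        const int *prog = &d_programs[individual * MAX_PROG_LEN];
        fitness[individual * numInstances + instance] =
            evalProgram(prog, data, instance, numInstances);
    }

    // Host-side launch: a 2D grid covering (fitness cases) x (individuals),
    // e.g. evalPopulation<<<dim3((numInstances + 127) / 128, popSize), 128>>>(...);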
2 Stack-based GP interpreter

• Postfix notation: the expression is evaluated left-to-right
  V6 AT6 < V5 AT5 > OR V4 AT4 < AND V3 AT3 > V2 AT2 < V1 AT1 > AND OR AND
  [Figure: expression tree of the example]
• O(n) complexity
• 23 push and 22 pop operations (a sketch of the interpreter follows)
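A minimal sketch of such a per-thread postfix interpreter, under an assumed token encoding (negative values for operators, positive values indexing the operand array of attribute values and constants): terminals push their value, each binary operator pops two operands and pushes its result, which for the 23-token expression above gives the 23 pushes and 22 pops.

    #define MAX_STACK 32
    enum Token { END = 0, OP_AND = -1, OP_OR = -2, OP_LT = -3, OP_GT = -4 };
    // Positive tokens index the operand array (attributes / constants).

    __device__ float evalPostfix(const int *prog, const float *operands)
    {
        float stack[MAX_STACK];
        int sp = 0;
        for (int i = 0; prog[i] != END; ++i) {
            int tok = prog[i];
            if (tok > 0) {                     // terminal: push its value
                stack[sp++] = operands[tok - 1];
            } else {                           // operator: pop two, push one
                float b = stack[--sp];
                float a = stack[--sp];
                switch (tok) {
                    case OP_LT:  stack[sp++] = (a < b);            break;
                    case OP_GT:  stack[sp++] = (a > b);            break;
                    case OP_AND: stack[sp++] = (a != 0 && b != 0); break;
                    default:     stack[sp++] = (a != 0 || b != 0); break;
                }
            }
        }
        return stack[0];                       // root result
    }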
2 Stack-based GP interpreter

• Mixed prefix and postfix notation: comparisons in prefix form, logical operators in postfix
  < AT6 V6 > AT5 V5 OR < AT4 V4 AND > AT3 V3 < AT2 V2 > AT1 V1 AND OR AND
  [Figure: expression tree of the example]
• O(n) complexity
• 11 push and 10 pop operations (a sketch follows)
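A sketch of the mixed-notation evaluation under the same assumed encoding as above: each comparison appears in prefix form followed by its two terminal arguments, so it is computed inline with a single push and no pops, while the logical operators remain postfix; on the expression above this yields the 11 pushes and 10 pops.

    __device__ float evalMixed(const int *prog, const float *operands)
    {
        float stack[MAX_STACK];
        int sp = 0;
        for (int i = 0; prog[i] != END; ++i) {
            int tok = prog[i];
            if (tok == OP_LT || tok == OP_GT) {
                // Prefix comparison: read both terminals directly, no pops.
                float a = operands[prog[i + 1] - 1];
                float b = operands[prog[i + 2] - 1];
                stack[sp++] = (tok == OP_LT) ? (a < b) : (a > b);
                i += 2;                        // skip the consumed terminals
            } else {
                // Postfix logical operator: pop two results, push one.
                float b = stack[--sp];
                float a = stack[--sp];
                stack[sp++] = (tok == OP_AND) ? (a != 0 && b != 0)
                                              : (a != 0 || b != 0);
            }
        }
        return stack[0];
    }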
3 Parallel SubTree interpreter

• The computation of independent subtrees can be parallelized
  [Figure: expression tree of the example; independent subtrees are evaluated concurrently]
• O(depth) complexity
• No stack depth needed
• Thread cooperation via shared memory (a kernel sketch follows)
• Best performance on balanced trees
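A sketch of the cooperative evaluation under an assumed heap layout (a perfect binary tree stored level by level, node i with children 2i+1 and 2i+2, reusing the token encoding above): one thread per leaf loads its terminal, then at every level half of the remaining threads combine two child results in shared memory, giving O(depth) steps with no evaluation stack.

    #define DEPTH  5                           // 2^5 - 1 = 31 nodes
    #define NODES  ((1 << DEPTH) - 1)
    #define LEAVES (1 << (DEPTH - 1))

    __global__ void evalSubtreeParallel(const int *tree, const float *operands,
                                        float *result)
    {
        __shared__ float value[NODES];         // threads cooperate here
        int tid = threadIdx.x;                 // LEAVES cooperating threads

        // Deepest level: each thread evaluates one terminal.
        int leaf = NODES - LEAVES + tid;
        value[leaf] = operands[tree[leaf] - 1];
        __syncthreads();

        // Inner levels, bottom-up; active threads halve at each level.
        for (int width = LEAVES / 2; width >= 1; width /= 2) {
            if (tid < width) {
                int node = width - 1 + tid;    // first node of this level
                float a = value[2 * node + 1];
                float b = value[2 * node + 2];
                switch (tree[node]) {
                    case OP_LT:  value[node] = (a < b);            break;
                    case OP_GT:  value[node] = (a > b);            break;
                    case OP_AND: value[node] = (a != 0 && b != 0); break;
                    default:     value[node] = (a != 0 || b != 0); break;
                }
            }
            __syncthreads();                   // wait before the next level
        }
        if (tid == 0) *result = value[0];      // root holds the final value
    }
    // Launched with one cooperating thread block per evaluation,
    // e.g. evalSubtreeParallel<<<1, LEAVES>>>(tree, operands, result);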
3 Parallel SubTree interpreter

Full code at: http://www.uco.es/grupos/kdis/wiki/GPevaluation
4 Experiments

• GPU: GTX 780, donated by NVIDIA
• Comparison: population and data parallel vs. subtree parallel
• Datasets: 15
• Population sizes: 32, 64, 128
• Tree sizes: 31, 63, 127
• Performance measure: GPops/s (genetic programming operations per second)
• How do population size, tree size, and dataset size affect performance?
4 Experiments

GPops/s (billions) by tree size and population size

Population and data parallel:
                              Tree size 31            Tree size 63            Tree size 127
Dataset     Instances Atts  Pop32  Pop64 Pop128     Pop32  Pop64 Pop128     Pop32  Pop64 Pop128
fars           100968   29  35.15  35.33  35.53     44.26  42.73  44.57     45.75  45.88  43.80
glass             214    9   8.08  14.49  19.76     10.97  19.15  24.20     13.08  23.36  27.33
ionosphere        351   33  11.55  15.57  23.44     15.18  19.35  27.08     18.36  20.70  29.21
iris              150    4   5.69  10.53  13.84      7.36  14.41  17.50      9.54  17.64  19.69
kddcup         494020   42  34.13  34.61  34.51     43.32  44.49  44.48     45.92  44.66  48.15
pima              768    8  22.26  29.02  34.67     29.70  34.07  42.17     36.84  43.10  48.34
satimage         6435   36  37.07  40.00  41.55     40.31  42.45  48.01     42.29  45.63  51.07
shuttle         58000    9  35.20  35.57  35.69     42.82  44.47  44.67     45.52  45.15  45.90
texture          5500   40  36.44  39.36  41.61     40.05  42.45  43.76     41.77  43.76  43.26
vowel             990   13  21.49  29.83  35.20     27.52  34.81  39.01     30.66  37.62  39.98

Subtree parallel:
Dataset     Instances Atts  Pop32  Pop64 Pop128     Pop32  Pop64 Pop128     Pop32  Pop64 Pop128
fars           100968   29  45.63  45.87  43.59     51.03  51.22  51.28     49.88  49.93  49.99
heart             270   13  10.99  17.81  28.37     20.29  27.94  37.14     27.58  35.19  41.29
ionosphere        351   33  15.56  24.07  32.88     23.40  32.67  39.68     29.95  37.91  43.05
iris              150    4   8.21  14.24  20.93     13.70  20.87  29.18     20.07  28.53  36.19
kddcup         494020   42  45.89  44.92  45.94     50.96  50.92  51.11     49.79  50.78  50.88
pima              768    8  25.80  34.01  46.60     33.65  39.96  49.72     38.42  47.75  51.16
satimage         6435   36  41.03  43.82  45.55     47.28  49.11  55.44     47.90  54.00  54.64
shuttle         58000    9  45.36  43.12  43.16     48.27  51.01  51.20     49.77  49.87  49.88
texture          5500   40  39.90  43.56  45.19     46.50  48.87  45.77     47.40  48.70  49.30
vowel             990   13  28.83  36.22  42.52     35.86  41.93  38.30     40.57  44.55  47.06
4 Experiments

• Performance variation when increasing population and tree size
4 Experiments

• Performance variation when increasing data and tree size
• Performance increases once there are enough individuals, subtrees, or fitness cases to fill the GPU compute units
5 Conclusions

• Positive:
  • Mixed prefix/postfix notation
  • O(depth) complexity
  • No stack depth needed
  • Best for balanced trees: the higher the tree density, the better the performance
• Negative:
  • Inappropriate for extremely unbalanced trees
  • Synchronization at each depth level
  • The number of active threads is reduced at each level
  • Limited by kernel size
  • Limited by shared memory
6 Future work

• Performance analysis: balance, density, and branching factor
• Scalability to bigger trees
• CUDA dynamic parallelism
  • A parent kernel can launch nested, smaller child kernels
• Kepler's shuffle instruction to avoid shared memory (a sketch follows)

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GTX 780 GPU used for this research.
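A speculative sketch of that shuffle idea, not the paper's code: lanes of a warp exchange partial subtree results through registers instead of the shared-memory array. For brevity it assumes all nodes at a given level apply the same logical operator (the hypothetical levelOp array). It is written with the modern __shfl_down_sync intrinsic; on 2014-era toolkits the Kepler equivalent was __shfl_down without the mask argument.

    __device__ float combineWithShuffle(float v, const int *levelOp)
    {
        const unsigned mask = 0xffffffffu;     // all 32 lanes participate
        int lane = threadIdx.x & 31;
        for (int level = 0, offset = 1; offset < 32; ++level, offset <<= 1) {
            // Every lane issues the shuffle; only combining lanes use it.
            float sibling = __shfl_down_sync(mask, v, offset);
            if ((lane & (2 * offset - 1)) == 0)
                v = (levelOp[level] == OP_AND)
                        ? (float)(v != 0 && sibling != 0)
                        : (float)(v != 0 || sibling != 0);
        }
        return v;                              // lane 0 holds the warp result
    }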
GPU Parallel SubTree Interpreter for Genetic Programming

Alberto Cano and Sebastián Ventura
Knowledge Discovery and Intelligent Systems Research Group
University of Córdoba, Spain

Alberto Cano
acano@uco.es
http://www.uco.es/users/i52caroa
http://www.uco.es/grupos/kdis

Vancouver, Canada, July 12-16, 2014
Parallelization approaches for GP evaluation

• Pittsburgh-style encoding [1]
  • Individuals represent variable-length rule sets
  • 3D grid of threads for individuals, rules, and fitness cases
• Multi-instance classification [2]
  • Examples represent sets of instances
• Association rule mining [3]
  • Antecedent and consequent evaluated in parallel
  • Concurrent kernels

[1] A. Cano, A. Zafra, and S. Ventura. Parallel evaluation of Pittsburgh rule-based classifiers on GPUs. Neurocomputing, vol. 126, pages 45-57, 2014.
[2] A. Cano, A. Zafra, and S. Ventura. Speeding up multiple instance learning classification rules on GPUs. Knowledge and Information Systems, in press, 2014.
[3] A. Cano, A. Zafra, and S. Ventura. Parallel evaluation of Pittsburgh rule-based classifiers on GPUs. Neurocomputing, vol. 126, pages 45-57, 2014.