Benefits of the ARM architecture on the even Berkeley Dwarfs
Patric Mai, Pierre Schoonbrood
RWTH Aachen University
patric.mai@gmx.de, pierre.schoonbrood@rwth-aachen.de
February 12, 2015
Overview
1 The road to exaflopic computing
    ARM architecture
    ARM HPC clusters
2 Performance of the ARM architecture
    Sparse linear algebra
    Unstructured grids
    Combinational logic
    Optimizing OpenCL for the ARM architecture
    Graphical models
3 Developing for ARM
4 Conclusions
5 References
6 Who did what?
(1) - Current trends
Source: www.top500.org
(1) - Limitations
- A supercomputer should not exceed a power budget of 20 MW
- Currently: 2 GFlops/Watt
- Required: 50 GFlops/Watt
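The 50 GFlops/Watt target follows from dividing the performance goal by the power budget. The sketch below reproduces that arithmetic, assuming "exaflopic" means 10^18 floating-point operations per second; the variable names are illustrative and not from the slides.

```python
# Assumption: an exaflopic machine sustains 1 EFlops = 1e18 Flops.
EXAFLOPS = 1e18               # target performance in Flops
POWER_BUDGET_W = 20e6         # 20 MW budget from the slide, in watts
CURRENT_GFLOPS_PER_W = 2.0    # current efficiency from the slide

# Efficiency needed to reach the target within the budget.
required_gflops_per_w = EXAFLOPS / POWER_BUDGET_W / 1e9

# How far current technology is from that efficiency.
improvement_factor = required_gflops_per_w / CURRENT_GFLOPS_PER_W

# Power an exaflopic machine would draw at the current efficiency.
power_at_current_efficiency_mw = EXAFLOPS / (CURRENT_GFLOPS_PER_W * 1e9) / 1e6

print(required_gflops_per_w, improvement_factor, power_at_current_efficiency_mw)
```

Under these assumptions the efficiency has to improve by a factor of 25; at the current efficiency the same machine would need 500 MW.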
(1) - ARM architecture
Source: www.arm.com
(1) - ARM HPC clusters
HPC clusters built from low-power SoCs (ARM)
- Tibidabo (Rajovic et al. 2013): 256 nodes, each with
    - A9 dual-core @ 1 GHz
    - 1 GB DDR2 SDRAM
- Mont-Blanc project (http://www.montblanc-project.eu)
    - Pedraforca: 70 nodes, each with
        - A9 quad-core @ 1.4 GHz
        - 4 GB DDR3 SDRAM
(2) - Benchmarking the ARM architecture
How suitable is the ARM architecture for HPC applications?
- Performance per Watt for several dwarfs
- Comparison with other architectures
- Optimizing the OpenCL framework for ARM
(2) - (FEAST) Finite element analysis (1)
- Mesh of points of an object (for example, a fluid)
- Points have properties (for example, viscosity)
- Differential equations describe the behavior (laminar or turbulent?)
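Discretizing differential equations on a mesh typically leads to a sparse linear system, which is why this benchmark sits under the sparse linear algebra dwarf. The sketch below shows the central kernel of iterative solvers, a sparse matrix-vector product. The compressed sparse row (CSR) layout and the function name are assumptions for illustration; the slides do not show FEAST's actual data structures.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form (illustrative helper)."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        # Only the stored nonzeros of this row are visited.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# Hypothetical example: 3x3 tridiagonal matrix [[2,-1,0],[-1,2,-1],[0,-1,2]].
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
col_idx = [0, 1, 0, 1, 2, 1, 2]
row_ptr = [0, 2, 5, 7]
result = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

The indirect access `x[col_idx[k]]` makes this kernel bound by memory bandwidth rather than arithmetic, which is one reason memory resources matter for this dwarf.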
(2) - (FEAST) Finite element analysis (2)
Tibidabo benchmarked against a Xeon cluster (LiDOng)
Tested with four configurations:
1. LiDOng with as many cores as Tibidabo
2. As (1), but with all cores of each LiDOng node active
3. As few LiDOng nodes as possible, given the memory requirement
4. Twice the number of nodes used in (3)
(2) - (FEAST) Finite element analysis (3)
Source: Göddeke et al. 2012
(2) - (FEAST) Finite element analysis (4)
Source: Göddeke et al. 2012
(2) - (HONEI LBM) Computational fluid dynamics (1)
- Mesh constructed for fluids
- Definition of physical bounds
- Definition of physical model
- Simulation of the behavior of liquids and gases
(2) - (HONEI LBM) Computational fluid dynamics (2)
Source: Göddeke et al. 2012
(2) - (HONEI LBM) Computational fluid dynamics (3)
Source: Göddeke et al. 2012
(2) - Rijndael and Bitcount
- Bitcount: counts the set bits in an array
    - Uneven workload
- Rijndael: derives round keys, for example for AES
    - Many XOR operations
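Both benchmarks reduce to simple bitwise operations, which the sketch below illustrates. The bit-clearing loop is one common way to count set bits, and it is an assumption that the benchmark uses a comparable variant. Its iteration count depends on the data, which is one way an uneven workload can arise. The `xor_words` helper is hypothetical and only shows the byte-wise XOR that dominates Rijndael's round key handling.

```python
def bitcount(word: int) -> int:
    """Count set bits of a non-negative integer by clearing the lowest set bit each step."""
    count = 0
    while word:
        word &= word - 1   # clears the lowest set bit
        count += 1
    return count


def bitcount_array(words):
    """Total number of set bits over all entries of an array."""
    return sum(bitcount(w) for w in words)


def xor_words(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally long words (illustrative helper)."""
    return bytes(x ^ y for x, y in zip(a, b))


total = bitcount_array([0, 1, 255])
mixed = xor_words(bytes([0x0F, 0xF0, 0xAA, 0x55]), bytes([0xFF, 0xFF, 0xAA, 0x00]))
```

An entry with all bits set takes many more loop iterations than a zero entry, so equally sized chunks of the array can take different amounts of time.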
(2) - Performance on combinational logic
Source: Maghazeh et al. 2013
(2) - Optimizing OpenCL runtime for ARM (1)
- The ARM processor is both the host processor and the OpenCL device
- Every core is one compute unit
- Every core is a single processing element
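With one processing element per compute unit, the work-items of a work-group cannot run in parallel on a core; they have to be executed one after another, while different work-groups can run on different cores. The sketch below simulates that mapping. The round-robin assignment and all function names are assumptions for illustration, not the scheduling policy of the framework discussed in the talk.

```python
def assign_work_groups(global_size, local_size, num_cores):
    """Distribute work-groups over cores round-robin (assumed policy)."""
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of the work-group size")
    num_groups = global_size // local_size
    schedule = {core: [] for core in range(num_cores)}
    for group_id in range(num_groups):
        schedule[group_id % num_cores].append(group_id)
    return schedule


def run_work_group(kernel, group_id, local_size, data):
    """Run all work-items of one work-group sequentially, as a single core would."""
    for local_id in range(local_size):
        global_id = group_id * local_size + local_id
        kernel(global_id, data)


def scale_kernel(global_id, data):
    """Hypothetical kernel: each work-item doubles one element."""
    data[global_id] *= 2


data = list(range(8))
schedule = assign_work_groups(global_size=8, local_size=2, num_cores=2)
for core, groups in schedule.items():
    for group_id in groups:
        run_work_group(scale_kernel, group_id, 2, data)
```

In this simulation the cores are processed one after another; on real hardware each core would work through its list of work-groups concurrently.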
(2) - Optimizing OpenCL runtime for ARM (2)
Source: Jo et al. 2014
(2) - Optimizing OpenCL runtime for ARM (2)
Source: Jo et al. 2014
(2) - Optimizing OpenCL compilation for ARM
Source: Jo et al. 2014
(2) - Optimizing OpenCL compilation (NEON)
- Auto-vectorization by the GCC compiler
- Vector operations are converted into NEON intrinsic functions
- Binaries contain NEON functions
(2) - Optimizing OpenCL results
- Improvement over PGCL
Source: Jo et al. 2014
(2) - Deep convolutional neural networks
- Feed-forward neural network
- Collections of neurons are each responsible for one part of an image
- Several layers (filters)
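The idea that a group of neurons only looks at one part of the image corresponds to sliding a small filter over the input. The sketch below is a minimal single-channel version of one such layer, assuming no padding, a stride of 1, and a ReLU nonlinearity; the function names are illustrative and the layer details of the network in the talk are not given on the slides.

```python
def conv2d_valid(image, kernel):
    """Slide a filter over a 2D image; each output value sees one image patch."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Element-wise nonlinearity applied after the filter (assumed here)."""
    return [[max(0, v) for v in row] for row in feature_map]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
feature_map = relu(conv2d_valid(image, kernel))
```

Stacking several such layers, each with multiple filters, gives the feed-forward structure described above; the output of one layer is the intermediate result consumed by the next.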
(2) - Performance on neural networks (1)
Source: Jin et al. 2014
(2) - Performance on neural networks (2)
Source: Jin et al. 2014
(2) - Performance on neural networks (3)
- The SoC used has 2 cores with 512 KB of L2 cache
- Intermediate result: 256 KB in size
- Many cache misses
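One plausible reading of these numbers is that a layer's working set does not fit in the cache once input, output, and weights are counted together. The sketch below works through that estimate. The feature-map shape, the weight count, and the assumption that the 512 KB is a single cache shared by both cores are all hypothetical; the slides only give the two sizes.

```python
KIB = 1024


def feature_map_bytes(num_maps, height, width, bytes_per_value=4):
    """Memory footprint of a stack of feature maps (32-bit values assumed)."""
    return num_maps * height * width * bytes_per_value


L2_CACHE_BYTES = 512 * KIB  # from the slide; assumed to be shared by both cores

# Hypothetical shape that yields the 256 KB intermediate result from the slide.
intermediate_bytes = feature_map_bytes(num_maps=16, height=64, width=64)

# Hypothetical weights: 16 x 16 filters of 5 x 5 values, 4 bytes each.
weight_bytes = 16 * 16 * 5 * 5 * 4

# A layer reads one intermediate result and writes another of similar size.
working_set_bytes = 2 * intermediate_bytes + weight_bytes
fits_in_l2 = working_set_bytes <= L2_CACHE_BYTES

print(intermediate_bytes // KIB, working_set_bytes, fits_in_l2)
```

Under these assumptions the working set slightly exceeds the cache, so data is evicted before it is reused.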
(2) - Performance on neural networks (4)
Source: Jin et al. 2014
(3) - Developing for ARM
(4) - Conclusions
- ARM is more energy efficient most of the time
- Mostly used for embedded applications
- Usable for HPC applications
- Limiting factor: resources (memory, caches)
- Frameworks should be optimized for ARM
(5) - References (1)
- Göddeke, Dominik, et al. Energy efficiency vs. performance of the numerical solution of PDEs: an application study on a low-power ARM-based cluster. Journal of Computational Physics, 2013, vol. 237, pp. 132-150.
- Jo, Gangwon, et al. OpenCL framework for ARM processors with NEON support. In: Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing. ACM, 2014, pp. 33-40.
- Maghazeh, Arian, et al. General purpose computing on low-power embedded GPUs: Has it come of age? In: Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIII), 2013 International Conference on. IEEE, 2013, pp. 1-10.
(5) - References (2)
- Jin, Jonghoon, et al. An efficient implementation of deep convolutional neural networks on a mobile coprocessor. In: Circuits and Systems (MWSCAS), 2014 IEEE 57th International Midwest Symposium on. IEEE, 2014, pp. 133-136.
(6) - Who did what?
- Shared Introduction - Patric Mai
- Road to exaflopic computing - Pierre Schoonbrood
- Sparse linear algebra - Pierre Schoonbrood
- Unstructured grids - Pierre Schoonbrood
- Combinational logic - Patric Mai
- Optimizing OpenCL for ARM - Patric Mai
- Graphical models - Pierre Schoonbrood
- Developing for ARM - Patric Mai
- Conclusions - Both