Towards Modeling and Simulation of Exascale Computing Platforms

  1. Towards Modeling and Simulation of Exascale Computing Platforms. Luka Stanisic. Supervised by: A. Legrand, B. Videau and J.F. Méhaut. Laboratoire d’Informatique de Grenoble, MESCAL and NANOSIM teams. June 21, 2012. Running footer on all slides: Luka Stanisic, Modeling of caches, June 21, 2012.

  2. Introduction. Future supercomputer platforms will face major challenges because of their enormous power consumption. This internship was part of two research projects: (1) Mont-Blanc (European): developing a scalable and power-efficient HPC platform based on low-power ARM processors; (2) SONGS (ANR): designing a unified and open simulation framework for performance evaluation of next-generation systems. Adequate models are required. Goal: investigate whether CPU behavior, especially that of ARM processors, can be modeled at coarse grain.

  4. Introduction: Simulation vs. alternative approach. Simulation (cycle-accurate simulation) and emulation are often too slow and of questionable accuracy. We need coarse-grain models. Many projects already exist: LAPSE, MPI-SIM, BigSIM, MPI-NetSim, MicroGrid, PMAC, ... Memory is the bottleneck of most HPC applications. The starting point of this work was two articles by Allan Snavely and his team (PMAC) that seemed very promising: (1) “A Framework for Application Performance Modeling and Prediction”, A. Snavely, L. Carrington, N. Wolter, J. Labarta, R. Badia, A. Purkayastha, in SuperComputing 2002; (2) “A Genetic Algorithms Approach to Modeling the Performance of Memory-bound Computations”, M. Tikir, L. Carrington, E. Strohmaier, A. Snavely, in SuperComputing 2007.

  5. Introduction: Framework for Application Performance Modeling and Prediction. The authors propose a macroscopic approach: characterize the code as a whole with parameters that can later be related to platform characteristics in order to evaluate performance.

  12. Introduction: Kernel from MultiMAPS.
      MultiMAPS(size, stride, nloops)
        allocate buffer of size;
        timer start;
        for (i = 1 : nloops)
          access elements in buffer by stride;
        timer stop;
        bandwidth = #accesses / time;
        deallocate buffer;
      Our first experiments: [figure]
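The kernel pseudocode can be sketched in C roughly as follows. This is a minimal illustration, not the MultiMAPS source or the code used for the slides: the names `multimaps_kernel`, `accesses_per_pass` and `now_seconds` are hypothetical, the buffer is assumed to hold `int` elements, and POSIX `clock_gettime` is assumed for timing. The buffer is filled with ones before timing so that the returned checksum equals the number of accesses.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* needed for clock_gettime on strict compilers */
#endif
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper (not from the slides): monotonic time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Number of elements touched by one strided pass over n_elems elements. */
size_t accesses_per_pass(size_t n_elems, size_t stride)
{
    assert(stride > 0);
    return (n_elems + stride - 1) / stride;
}

/*
 * Assumed signature: returns accesses per second (0.0 if the timer saw no
 * elapsed time, -1.0 if allocation failed) and stores the sum of the
 * accessed elements in *checksum so the loop cannot be optimized away.
 */
double multimaps_kernel(size_t n_elems, size_t stride, size_t nloops,
                        long long *checksum)
{
    assert(n_elems > 0 && stride > 0 && nloops > 0 && checksum != NULL);

    int *buffer = malloc(n_elems * sizeof *buffer);
    if (buffer == NULL)
        return -1.0;
    for (size_t j = 0; j < n_elems; j++)   /* touch every page before timing */
        buffer[j] = 1;

    long long sum = 0;
    double start = now_seconds();
    for (size_t i = 0; i < nloops; i++)
        for (size_t j = 0; j < n_elems; j += stride)
            sum += buffer[j];
    double elapsed = now_seconds() - start;

    free(buffer);
    *checksum = sum;

    double accesses = (double)nloops * (double)accesses_per_pass(n_elems, stride);
    return elapsed > 0.0 ? accesses / elapsed : 0.0;
}
```

A call such as `multimaps_kernel(1024, 8, 100, &cs)` performs 128 accesses per pass over 100 passes; multiplying the result by the element size would give bytes per second instead of accesses per second.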

  13. Introduction: Methodology. The related work is not well documented, is not suited to NUMA and multicore architectures, and its experiments are not reproducible. We wanted to do the measurements in a clean, coherent and systematic way.

  15. Introduction: Outline. (1) Kernel Parameters; (2) Memory Allocation Parameters; (3) Optimization Parameters; (4) Operating System Parameters; (5) Conclusion.

  17. Kernel Parameters: Influence of Stride Parameter. Comparison with the results from an Intel Core i7 Sandy Bridge processor. MultiMAPS: few max values. Observations: (1) clear plateaus; (2) sharp drop when getting out of the L1 cache size; (3) performance is lower for larger strides.

  19. Kernel Parameters: Influence of Stride Parameter. Comparison with the results from an Intel Core i7 Sandy Bridge processor. MultiMAPS: randomization + boxplots. Observations: (1) clear plateaus; (2) sharp drop when getting out of the L1 cache size; (3) performance is lower for larger strides; (4) different bandwidths for strides 8, 16 and 32 inside the L1 cache size; (5) the performance drop for higher memory sizes stops after stride 8. This is the general behavior, but with many exceptions.
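The slides do not say how the randomization was done. One plausible reading, shown here purely as an assumption, is that the order of the (size, stride) measurements is shuffled before each campaign so that slow drifts in machine state do not line up with a particular parameter value. The type `experiment_t` and the function `shuffle_experiments` are hypothetical names introduced for this sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical description of one measurement (not from the slides). */
typedef struct {
    size_t size_bytes;  /* buffer size to measure */
    size_t stride;      /* access stride in elements */
} experiment_t;

/*
 * Fisher-Yates shuffle of the experiment plan, seeded for reproducibility.
 * rand() % i has a slight modulo bias, which is acceptable for a sketch.
 */
void shuffle_experiments(experiment_t *plan, size_t n, unsigned seed)
{
    assert(plan != NULL || n == 0);
    srand(seed);
    for (size_t i = n; i > 1; i--) {
        size_t k = (size_t)rand() % i;   /* pick an index in [0, i) */
        experiment_t tmp = plan[i - 1];
        plan[i - 1] = plan[k];
        plan[k] = tmp;
    }
}
```

Running every configuration several times in shuffled order yields one sample set per configuration, which is the kind of data a boxplot summarizes. Recording the seed alongside the results keeps the order reproducible.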

  22. Kernel Parameters: Unexpected Behavior. Example for Intel Core i7 3.40 GHz Sandy Bridge: irregular behavior inside the L1 cache size! Example for ARM Dual Cortex A9 1 GHz Snowball: strides 10, 12 and 14 have better performance than stride 8?!

  24. Memory Allocation Parameters: Reproducibility Issue on ARM. Same input parameters, consecutive experiments, 42 repetitions for each memory size, NO NOISE! Results from ARM Dual Cortex A9 1 GHz (Snowball): [figure]

  26. Memory Allocation Parameters: Influence of Allocation Strategy on ARM. Different memory allocation technique: [figure]. Performance depends on the actual physical address: [figure]

  28. Optimization Parameters: Influence of Code Optimizations.
      Element type: using long long int (64-bit) instead of regular int (32-bit).
      Vectorized instructions: on Intel, 128-bit SSE and 256-bit AVX; on ARM, 128-bit NEON.
      Loop unrolling.
      Standard execution:
        for (j = 0; j < buffersize; j += STRIDE) {
          sum += buffer[j];
        }
      With loop unrolling:
        for (j = 0; j < buffersize; j += STRIDE*8) {
          sum += buffer[j];
          ...
          sum += buffer[j+7*STRIDE];
        }
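The two loop variants can be written as complete C functions for comparison. This is a sketch under assumptions, not the code behind the slides: the function names are hypothetical, `int` elements are assumed, and a tail loop is added so the unrolled version stays in bounds when the buffer length is not a multiple of 8 times the stride (the slide's snippet does not show how that case is handled).

```c
#include <assert.h>
#include <stddef.h>

/* Standard execution: one access per iteration. */
long long sum_strided(const int *buffer, size_t n, size_t stride)
{
    assert(stride > 0);
    long long sum = 0;
    for (size_t j = 0; j < n; j += stride)
        sum += buffer[j];
    return sum;
}

/* Unrolled by 8: eight strided accesses per iteration, then a tail loop. */
long long sum_strided_unrolled8(const int *buffer, size_t n, size_t stride)
{
    assert(stride > 0);
    long long sum = 0;
    size_t j = 0;
    /* Main loop runs only while all eight accesses are in bounds. */
    for (; j + 7 * stride < n; j += 8 * stride) {
        sum += buffer[j];
        sum += buffer[j + 1 * stride];
        sum += buffer[j + 2 * stride];
        sum += buffer[j + 3 * stride];
        sum += buffer[j + 4 * stride];
        sum += buffer[j + 5 * stride];
        sum += buffer[j + 6 * stride];
        sum += buffer[j + 7 * stride];
    }
    /* Tail: remaining accesses that do not fill a full group of eight. */
    for (; j < n; j += stride)
        sum += buffer[j];
    return sum;
}
```

Both functions visit the same elements and return the same sum, so any bandwidth difference measured between them would come from the loop structure and not from the amount of data accessed. An optimizing compiler may unroll or vectorize the plain loop on its own, so the optimization level should be recorded with the measurements.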

  29. Optimization Parameters: Results from Intel Sandy Bridge: [figure]

  30. Optimization Parameters: Results from ARM Snowball: [figure]
