Tuning space optimization for multi-core architectures
V. Martínez, F. Dupros, M. Castro, H. Aochi and P. Navaux
2 Contents
• Introduction
• HPC applications performance
• Stencil model
• Testbed configuration
• Experiments
• Results
• Conclusion
3 Scientific Applications
• Stencil applications on multi-core architectures: data dependency
4 Contribution
• Find the best configuration of runtime parameters (tuning) for stencil computations, based on the number of available threads and on the reduction of L3 cache misses.
6 Stencil: 7-point Jacobi
• 3D stencil
• Heat equation
• Finite difference method
• Calculate:
  B(i,j,k) = α·A(i,j,k) + β·(A(i−1,j,k) + A(i,j−1,k) + A(i,j,k−1) + A(i+1,j,k) + A(i,j+1,k) + A(i,j,k+1))

Algorithm 1: Pseudocode for the stencil algorithm
1: for each timestep do
2:   for each block in X-direction do (compute in parallel)
3:     for each block in Y-direction do
4:       for each block in Z-direction do
5:         Compute stencil(3D tile)
6:       end for
7:     end for
8:   end for
9: end for
8 Experiments (Testbed)

                      Node 1      Node 2
  Processor           i5-4570     Xeon X7550
  Clock (GHz)         3.2         2.0
  Cores               4           8
  Sockets             1           4
  Threads             4           64
  L3 cache size (MB)  6           18
  Compiler            gcc-4.6.4   gcc-4.6.4
10 Experiments (Setup)

Output:
• Cache misses (PAPI_L3_TCM)
• Cache accesses (PAPI_L3_TCA)
• Time
• GFLOPS

Input vector (total configurations):
               Node 1   Node 2
  Threads      2        6
  Looping      2        2
  Size         3        3
  Chunk        4        4
  Scheduling   3        3
  Total        144      432
11 Algorithms

Naive
• Triple nested loops coming from the three spatial dimensions.

Blocking
• The domain is decomposed into tiles; the size of the tile is a tuning parameter.

Skew
• Dependencies between components are exploited to implement a space-time decomposition: the stencil is decomposed along both the space and the time directions, but in a specific order.

[Figure: tile layers U0–U3 illustrating the decomposition]
13 Results (Algorithms)
[Figure: performance (GFLOPS) and L3 cache misses on Node 1 and Node 2]
Naive (magenta), Blocking (green) and Skew (cyan)
14 Results (Scalability)
[Figure: GFLOPS vs. number of threads — Node 1 (2–4 threads), Node 2 (2–64 threads)]
Naive (magenta), Blocking (green) and Skew (cyan)
15 Results (Code optimization)
[Figure: GFLOPS per algorithm (Naive, Blocking, Skew) on Node 1 and Node 2]
Parallelfor (red), Tasking (blue)
16 Results (Problem size)
[Figure: GFLOPS vs. problem size (128, 256, 512) — Node 1 with tasking, Node 2 with parallelfor]
Naive (magenta), Blocking (green) and Skew (cyan)
17 Results (Scheduling)
[Figure: GFLOPS vs. chunk size (32, 128, 256, 512) on Node 2, for Naive, Blocking and Skew]
Parallelfor (red), Tasking (blue)
18 Results (Scheduling)
[Figure: GFLOPS vs. chunk size (32, 128, 256, 512) on Node 2, for Naive, Blocking and Skew]
Dynamic (green), Guided (orange) and Static (gray)
19 Makespan (Naive parallelfor)
20 Makespan (Naive tasking)
22 Conclusion
• Tasking achieves good performance when the algorithm does not use the cache intensively (Skew).
• Chunk size and scheduling policies (OpenMP) play an important role for the Naive algorithm and can help reach peak performance.
• Fitting: cache misses can be approximated by a linear fit, whereas performance can be predicted by an exponential fit.
• Future work: develop an auto-tuning approach to automate the choice of the input parameters.
Thanks Questions? victor.martinez@inf.ufrgs.br