Locality-Aware Laplacian Mesh Smoothing
Guillaume Aupy, Jeonghyung Park, Padma Raghavan
Laplacian Mesh Smoothing
Iterative process used to improve the quality of 2D meshes.
0. Choose an internal non-visited vertex.
1. Move it to the barycenter of its neighbors.
2. Pick its lowest-quality non-visited neighbor, GOTO 1. If the set is empty, GOTO 0.
GOAL: Mesh quality (edge-length ratio) is measured as: $\frac{1}{|\text{triangles}|} \sum_{\text{triangles}} \frac{\min \text{edge}}{\max \text{edge}}$
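A minimal sketch of the quality measure and of step 1 (the barycenter move), assuming a mesh stored as a dictionary of 2D points, a list of vertex-index triples, and a neighbor dictionary. The names `edge_length_ratio`, `move_to_barycenter`, `points`, `triangles`, and `neighbors` are illustrative and do not come from Mesquite.

```python
import math

def edge_length_ratio(points, triangles):
    """Mean over all triangles of (shortest edge / longest edge)."""
    total = 0.0
    for a, b, c in triangles:
        edges = [
            math.dist(points[a], points[b]),
            math.dist(points[b], points[c]),
            math.dist(points[c], points[a]),
        ]
        total += min(edges) / max(edges)
    return total / len(triangles)

def move_to_barycenter(points, neighbors, v):
    """Step 1 of the slide: move vertex v to the barycenter of its neighbors."""
    nbrs = neighbors[v]
    x = sum(points[n][0] for n in nbrs) / len(nbrs)
    y = sum(points[n][1] for n in nbrs) / len(nbrs)
    points[v] = (x, y)

# Toy mesh (hypothetical): a square with one off-center internal vertex (4).
points = {0: (0.0, 0.0), 1: (2.0, 0.0), 2: (2.0, 2.0), 3: (0.0, 2.0), 4: (0.5, 0.5)}
triangles = [(0, 1, 4), (1, 2, 4), (2, 3, 4), (3, 0, 4)]
neighbors = {4: [0, 1, 2, 3]}

before = edge_length_ratio(points, triangles)
move_to_barycenter(points, neighbors, 4)
after = edge_length_ratio(points, triangles)
```

On this toy mesh the internal vertex moves to the center of the square, and the mean edge-length ratio increases.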
Data Locality
◮ Data for computation is stored in cache
◮ If it is not: cache miss (additional costs)
Caches are governed by the Least Recently Used (LRU) algorithm.
[Figure: high-level view of a socket of an Intel Westmere-EX processor]
→ Measure for data: Reuse Distance
Data Locality
Spatial: reuse within a cache line.
Temporal: reuse of a node already in cache.
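A small sketch of how reuse distance relates to LRU. It assumes the usual definition (the number of distinct other elements accessed since the previous access to the same element) and an idealized, fully associative LRU cache; the slides do not spell out either. The function names and the toy trace are illustrative.

```python
def reuse_distances(trace):
    """Reuse distance of each access; None marks a first (cold) access."""
    stack = []  # LRU stack: most recently used element is last
    distances = []
    for x in trace:
        if x in stack:
            pos = stack.index(x)
            # Elements above x in the stack were touched since x was last used.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(x)
    return distances

def lru_misses(trace, capacity):
    """Misses of an idealized LRU cache holding `capacity` elements."""
    return sum(1 for d in reuse_distances(trace) if d is None or d >= capacity)

trace = ["a", "b", "c", "a", "b", "b"]
dists = reuse_distances(trace)
```

Under these assumptions, an access hits exactly when its reuse distance is smaller than the number of elements the cache can hold, which is why the reuse-distance profile is a convenient proxy for cache behavior.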
Data Locality in LMS
Hypothesis: cache misses play an important role in the LMS algorithm.
→ [Strout+Hovland 04] The data ordering of irregular HPC applications impacts performance.
Quick check:
[Figure: reuse distance versus index of access for three orderings (random, original, BFS), with reported execution times of 7.6 s, 10.3 s, and 6.59 s.]
This work
[Figure: reuse distance profile of the LMS algorithm on a Carabiner mesh; reuse distance (log scale, 100 to 100000) versus time steps (0 to 800).]
Conjecture: the access pattern of LMS can be controlled by the initial quality of each node in the mesh. A reordering based on the initial iteration should work well.
Mesh reordering scheme RDR
◮ From a given node already ordered: sort all its unordered neighbors by increasing quality
◮ Append them to the list of already ordered nodes
◮ Mark the node as processed. Iterate from the unprocessed neighbor with the worst quality.
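The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `quality` is a hypothetical per-node score where lower means worse, and the fallback used when the current node has no unprocessed neighbor (resume from the earliest ordered but unprocessed node, then from any unordered node) is a guess that the slide does not specify.

```python
def rdr_order(neighbors, quality, start):
    """Return a node ordering following the RDR steps on the slide."""
    order = [start]
    ordered = {start}
    processed = set()
    current = start
    while current is not None:
        # Sort the unordered neighbors by increasing quality and append them.
        fresh = sorted(
            (n for n in neighbors[current] if n not in ordered),
            key=lambda n: quality[n],
        )
        order.extend(fresh)
        ordered.update(fresh)
        processed.add(current)
        # Continue from the unprocessed neighbor with the worst quality.
        candidates = [n for n in neighbors[current] if n not in processed]
        if candidates:
            current = min(candidates, key=lambda n: quality[n])
            continue
        # Assumption: fallback when every neighbor is already processed.
        current = next((n for n in order if n not in processed), None)
        if current is None:
            current = next((n for n in neighbors if n not in ordered), None)
            if current is not None:
                order.append(current)
                ordered.add(current)
    return order

# Toy graph (hypothetical): a 4-cycle with made-up quality scores.
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
quality = {0: 0.9, 1: 0.5, 2: 0.2, 3: 0.7}
order = rdr_order(neighbors, quality, start=0)
```

The resulting list can then be used to renumber the nodes so that memory layout follows the expected visiting order of the smoother.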
Evaluation
◮ Meshes are generated by Triangle [Shewchuk’02]
◮ LMS is done with Mesquite [Brewer et al’03].
Comparisons are made with respect to:
◮ ORI: original ordering given by Mesquite
◮ BFS: breadth-first search ordering [Strout+Hovland’05]
Experimental Setup
Runs done on an Intel Westmere-EX: four eight-core processors (up to 32 concurrent threads).
Cache     Size    Latency (cycles per access)
L1 (P)    32K     4
L2 (P)    256K    10
L3 (S)    24M     38-170
Mem       ∞       175-290
Results
[Figure: execution time on one core (seconds) for ORI, BFS, and RDR on meshes M1 to M9.]
[Figure: mean speedup versus T_ORI(1) for the ORI, BFS, and RDR orderings, as a function of the number of cores (up to 32).]
Cache Performance
Using the PAPI software, we can measure cache performance.
[Figure: three panels of miss rate (%) for ORI, BFS, and RDR on meshes M1 to M9. Cache performance results on one core when reorderings were applied.]
Better orderings will be characterized by better cache performance.
Can we find better orderings (or show that we cannot)?
First-Order approx.
By tracing all data accesses, we can measure the reuse distance of all accesses.
Assuming each node is 66 bytes (1), in a 24MB L3 cache, misses occur for all accesses with an RD greater than 372k (FOA).
(1) coordinates (two floats), connectivity (5/6 long) and fixed/boundary state (integer).
Reuse-distance quantiles per mesh and ordering:
mesh        #accesses     Ordering   50%   75%   90%     100%
carabiner   15,566,520    ORI        8     52    1,168   1,924,021
                          BFS        1     11    99      1,923,989
                          RDR        1     4     6       1,942
crake       14,226,264    ORI        8     43    642     1,767,468
                          BFS        1     11    80      1,767,488
                          RDR        1     4     6       3,903
dialog      14,614,336    ORI        7     39    306     1,819,234
                          BFS        1     10    79      1,803,850
                          RDR        1     5     11      6,198
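The first-order threshold is simply cache size divided by node size. The sketch below works through that arithmetic. The exact byte count behind "24MB" is not stated in the slides; reading it as 24 × 1024 × 1000 bytes is an assumption that happens to give about 372k, while other conventions give slightly different values. The function names are illustrative.

```python
NODE_BYTES = 66  # node size assumed on the slide

def foa_threshold(cache_bytes, node_bytes=NODE_BYTES):
    """Number of whole nodes that fit in the cache (first-order approximation)."""
    return cache_bytes // node_bytes

def foa_predicts_miss(reuse_distance, threshold):
    """FOA rule from the slide: a miss occurs when the reuse distance exceeds the threshold."""
    return reuse_distance > threshold

# Three readings of "24MB"; which one the authors used is an assumption.
mixed = foa_threshold(24 * 1024 * 1000)
binary = foa_threshold(24 * 2**20)
decimal = foa_threshold(24 * 10**6)

# Largest reuse distances reported for the carabiner mesh.
ori_max, rdr_max = 1_924_021, 1_942
```

With any of these readings, the largest RDR reuse distance on carabiner (1,942) is far below the threshold, whereas the largest ORI reuse distance (1,924,021) is well above it.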
FOA (II)
We know:
◮ L3 misses are due to external factors
◮ We can compute the application reuse distance
◮ We have access to PAPI cache misses
We can estimate the “real” number of data elements that fit in a cache: assuming that there are n_X LX misses, the n_X accesses with the largest reuse distance are the ones that missed.
Estimated max number of elements (x10^3):
mesh        Ordering   L1     L2     L3
carabiner   ORI        13.2   21.3   330
            BFS        10.2   21.2   1060
            RDR        1.6    1.88   1.94
crake       ORI        24.6   40.9   198
            BFS        18.3   39.2   986
            RDR        3.4    3.77   3.9
dialog      ORI        59     87.7   108
            BFS        53.2   89.3   157
            RDR        5.84   6.05   6.2
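One way to read the estimation rule is sketched below: rank the traced reuse distances from largest to smallest, assume the n_X largest are the misses, and take the smallest of those as the estimated capacity. Whether the authors report that value or the largest distance that still hits is not stated, so this is one possible reading. The function name and the sample numbers are made up for illustration.

```python
def estimated_capacity(reuse_distances, n_misses):
    """Smallest reuse distance among the n_misses accesses assumed to miss."""
    if not 1 <= n_misses <= len(reuse_distances):
        raise ValueError("n_misses must be between 1 and the number of accesses")
    ranked = sorted(reuse_distances, reverse=True)
    return ranked[n_misses - 1]

# Hypothetical trace and miss count (for example, a counter read through PAPI).
sample_distances = [1, 4, 6, 50, 900, 1200, 3000]
estimate = estimated_capacity(sample_distances, n_misses=2)
```

Here the two largest distances (3000 and 1200) are taken to be the misses, so the cache is estimated to hold on the order of 1200 elements.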
Reordering cost
[Figure: gain in execution time (%) versus number of cores (1, 2, 4, 8, 16, 24, 32), relative to BFS and to ORI.]
← Gain with scalability: the performance gain is $\frac{T_{algo}(x) - T_{RDR}(x)}{T_{algo}(x)}$, for algo being either ORI or BFS and x being the number of cores.
Reordering costs roughly one iteration of the algorithm: it adds about one iteration and saves between 10 and 40%. It is only worth it if you expect several iterations (> 3).
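A back-of-the-envelope sketch of the trade-off, assuming that reordering costs exactly one iteration and that the relative saving is the same in every iteration (the slide only says "roughly"). With k iterations, reordering pays off when 1 + k(1 - s) < k, that is, when k · s > 1. The function names and timings are illustrative.

```python
import math

def gain(t_algo, t_rdr):
    """Relative gain of RDR over another ordering: (T_algo - T_RDR) / T_algo."""
    return (t_algo - t_rdr) / t_algo

def break_even_iterations(saving):
    """Smallest iteration count k such that k * saving > 1."""
    return math.floor(1 / saving) + 1

# Hypothetical timings, chosen only to exercise the formula.
example_gain = gain(10.0, 6.0)

best_case = break_even_iterations(0.40)   # 40% saving per iteration
worst_case = break_even_iterations(0.10)  # 10% saving per iteration
```

Under these assumptions, a 40% saving pays off from about 3 iterations, close to the rule of thumb on the slide, while a 10% saving needs 11.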
Conclusion
Reordering strategies are known to be an efficient way to improve data locality (and hence performance).
Simple conjecture: each iteration of LMS follows roughly the same execution order.
◮ Simple reordering strategy based on this;
◮ We give an intuition that it may be hard to get better reordering strategies.