Modeling Cache Sharing for MPI Programs on Multi-core Machines
Bin Bao, Chen Ding
University of Rochester
Nov 10, 2011
The 10th Workshop On Compiler-Driven Performance
Thursday, November 10, 11
Multi-core Popularity
✤ More and more cluster machines are using multi-core processors
✤ TOP500.org (June 2011):
   ✤ “Quad-core processors are used in 46.2 percent of the systems, while already 42.4 percent of the systems use processors with six or more cores.”
Programming on Cluster
✤ MPI (Message-Passing Interface) is still dominant
✤ Scalability issues, e.g.
   ✤ Load balance
   ✤ Communication overhead
   ✤ Multicore: resource contention
Performance Impact of Resource Sharing
✤ Experimental studies: Chai et al. [CCGRID’07], Saini et al. [SC’09], etc.
✤ Intel Nehalem E5520 (4 cores, shared 8MB L3 cache)
✤ GCC 4.4.1, MPICH2 1.4.1
[Figure: speedup (y-axis, 0 to 3) of cg, ep, ft, is, lu, mg for the 1x2, 2x2, and 4x2 configurations]
Goal: Modeling Cache Contention
✤ Tool: reuse distance (aka LRU stack distance), the number of distinct data elements accessed between two consecutive references to the same element
   ✤ Example: in the trace a b c c d a, the second access to a has rd = 3 (b, c, and d are accessed in between)
✤ Reuse distance can be used to calculate cache miss rate
✤ Program A’s cache miss rate = P(A’s reuse distance ≥ cache size)
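The definition above can be made concrete with a short sketch. It is not the authors' tool: the function names are hypothetical, the trace is the one from the slide, the cache is assumed to be fully associative LRU measured in data elements, and first-time accesses are assumed to count as misses (infinite reuse distance).

```python
from collections import OrderedDict


def reuse_distances(trace):
    """Return one reuse distance per access; None marks a first access."""
    stack = OrderedDict()  # keys ordered from least to most recently used
    distances = []
    for element in trace:
        if element in stack:
            keys = list(stack)
            # Elements after `element` in LRU order are exactly the distinct
            # elements accessed since its previous reference.
            distances.append(len(keys) - 1 - keys.index(element))
            stack.move_to_end(element)
        else:
            distances.append(None)
            stack[element] = True
    return distances


def miss_rate(trace, cache_size):
    """Fraction of accesses with reuse distance >= cache_size.

    Assumption: first accesses (distance None) are counted as misses.
    """
    distances = reuse_distances(trace)
    misses = sum(1 for d in distances if d is None or d >= cache_size)
    return misses / len(distances)


trace = list("abccda")  # the example trace from the slide
print(reuse_distances(trace))
print(miss_rate(trace, cache_size=4))
```

This quadratic-time version is meant for reading, not for real traces; practical reuse distance analyzers use tree-based or approximate algorithms.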
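Locality (Reuse Distance) Scaling
✤ Strong scaling: fixed total problem size
✤ Fixed-distance reuses and scaled-distance reuses
$$
\begin{pmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n,1} & a_{n,2} & \cdots & a_{n,n}
\end{pmatrix}
\times
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}
=
\begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{pmatrix}
$$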
Reuse Distance Reference Histogram
[Figure: NAS-LU (B input); x-axis: reference partitions (0 to 1000); y-axis: average reuse distance (1KB to 1GB); series: 1x2, 2x2, 4x2]
More Examples
[Figure: four panels (FT, CG, MG, IS); x-axis: reference partitions (0 to 1000); y-axis: average reuse distance (1KB to 1GB); series: 1x2, 2x2, 4x2]
Linear Regression Based Reuse Distance Prediction
✤ For partition i: $rd_i = a_i \times \frac{1}{nproc} + b_i$
✤ The model captures scaled-distance reuses (the $a_i / nproc$ term) and fixed-distance reuses (the $b_i$ term)
✤ Each partition is independent
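A minimal sketch of the per-partition fit, assuming ordinary least squares on x = 1/nproc. The function names and all numbers are hypothetical and for illustration only; they are not measurements from the talk, and the slides do not say which fitting routine was used.

```python
def fit_partition(nprocs, distances):
    """Least-squares fit of rd = a * (1/nproc) + b for one partition."""
    xs = [1.0 / p for p in nprocs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, distances))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def predict(a, b, nproc):
    """Predicted average reuse distance of the partition at `nproc` tasks."""
    return a / nproc + b


# Hypothetical training runs at 1, 2, and 4 tasks (made-up distances).
train_nprocs = [1, 2, 4]
partitions = {
    "scaled": [8192.0, 4096.0, 2048.0],  # shrinks with 1/nproc
    "fixed": [64.0, 64.0, 64.0],         # independent of nproc
    "mixed": [4100.0, 2100.0, 1100.0],   # 4000/nproc + 100
}

# Each partition is fitted independently, then extrapolated to 8 tasks.
for name, measured in partitions.items():
    a, b = fit_partition(train_nprocs, measured)
    print(name, round(a, 3), round(b, 3), round(predict(a, b, 8), 3))
```

Fitting against 1/nproc rather than nproc keeps the model linear in its two coefficients, so a purely fixed-distance partition yields a slope near zero and a purely scaled one an intercept near zero.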
Cache Sharing
✤ General dilation model [Xiang et al. PPoPP’11]
   Program A:    a b c d e f a                  (rd = 5)
   Program B:    k m m m n o n                  (ft = 4)
   Program A&B:  a k b c m d m e m f n o n a    (rd’ = rd + ft = 9)
✤ Symmetric MPI programs: uniform interleaving assumption
   Task A:       a b c d e f a                  (rd = 5)
   Task B:       k l m n o p k                  (rd’ = 5)
   Task A&B:     a b c k l m d e n o f p a k    (rd” = 11)
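The two examples can be checked mechanically. The sketch below uses the traces from the slide; the helper names are hypothetical, and it assumes the co-running programs share no data, so the peer's footprint adds directly to the reuse distance.

```python
def reuse_distance_of_last(trace, element):
    """Distinct elements between the last two accesses to `element`."""
    positions = [i for i, e in enumerate(trace) if e == element]
    prev, last = positions[-2], positions[-1]
    return len(set(trace[prev + 1:last]))


def footprint(accesses):
    """Number of distinct elements touched in a window of accesses."""
    return len(set(accesses))


# General dilation model: program A co-runs with an unrelated program B.
program_a = "abcdefa"
program_b = "kmmmnon"       # B's accesses that fall inside A's reuse window
shared_ab = "akbcmdmemfnona"

rd = reuse_distance_of_last(program_a, "a")
ft = footprint(program_b)
rd_shared = reuse_distance_of_last(shared_ab, "a")
print(rd, ft, rd_shared)  # the slide states rd = 5, ft = 4, rd' = 9

# Symmetric MPI tasks: both tasks run the same code on different data.
task_a = "abcdefa"
task_b = "klmnopk"
shared_tasks = "abcklmdenofpak"
print(reuse_distance_of_last(task_a, "a"),
      reuse_distance_of_last(task_b, "k"),
      reuse_distance_of_last(shared_tasks, "a"))  # slide: 5, 5, 11
```

In the symmetric case the peer task behaves like the task itself, so its footprint inside the window does not need to be profiled separately; that is the simplification the uniform interleaving assumption provides.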
Experiments
✤ Pin-based trace cache simulator (16-way LRU, 8MB)
[Diagram: several MPI tasks, each instrumented by its own Pin tool, feeding a simulated cache; each row of cache blocks ($ Block) is labeled with a Pin lock]
✤ Performance counters (OProfile, LLC Misses)
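For readers unfamiliar with this kind of simulator, here is a minimal single-threaded sketch of a set-associative LRU cache with the stated geometry (16-way, 8MB). It is not the Pin tool from the talk: the class name is hypothetical, the 64-byte line size is an assumption the slides do not state, and the per-set locking that lets several Pin-instrumented tasks share one simulated cache is omitted.

```python
from collections import OrderedDict


class SetAssociativeLRUCache:
    """Set-associative cache with LRU replacement within each set."""

    def __init__(self, size_bytes=8 * 1024 * 1024, ways=16, line_bytes=64):
        # line_bytes=64 is an assumed value, not taken from the slides.
        self.ways = ways
        self.line_bytes = line_bytes
        self.num_sets = size_bytes // (ways * line_bytes)
        # One ordered dict per set: keys are tags, least recently used first.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Simulate one memory access; return True on a hit."""
        line = address // self.line_bytes
        cache_set = self.sets[line % self.num_sets]
        tag = line // self.num_sets
        if tag in cache_set:
            cache_set.move_to_end(tag)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)  # evict the least recently used tag
        cache_set[tag] = True
        return False


# Tiny 2-way, 2-set cache so the eviction behavior is easy to follow.
small = SetAssociativeLRUCache(size_bytes=256, ways=2, line_bytes=64)
outcomes = [small.access(addr) for addr in [0, 128, 0, 256, 0, 128]]
print(outcomes, small.hits, small.misses)
```

Addresses 0, 128, and 256 all map to set 0 in the small configuration, which is why the third distinct line forces an eviction.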
Cache Simulator vs Reuse Distance Based Calculation
[Figure: memory traffic (y-axis, 0.4 to 1.6) for cg, ep, ft, is, lu, mg; series: 1x2, 2x2, 4x2; panels: (a) Cache Simulator, (b) Reuse Distance Based Calculation]
Reuse Distance Prediction
[Figure: memory traffic (y-axis, 0.4 to 1.6) for cg, ep, ft, is, lu, mg; series: 1x2, 2x2, 4x2; panels: (a) Reuse Distance Based Calculation, (b) Reuse Distance Prediction (8-task)]
Hardware Performance Counter
[Figure: memory traffic (y-axis, 0.4 to 1.6) for cg, ep, ft, is, lu, mg; series: 1x2, 2x2, 4x2; panels: (a) Cache Simulator, (b) Performance Counter]
Hardware Performance Counter
   do k = 1, d3
      do ii = 0, d1 - fftblock, fftblock
         do j = 1, d2
            do i = 1, fftblock
               y(i,j,1) = x(i+ii,j,k)
            enddo
         enddo
         call cfftz (is, logd2, d2, y, y(1, 1, 2))
         do j = 1, d2
            do i = 1, fftblock
               xout(i+ii,j,k) = y(i,j,1)
            enddo
         enddo
      enddo
   enddo
[Figure: the same memory traffic panels, (a) Cache Simulator and (b) Performance Counter, shown behind the code]
Summary & Future Work
✤ Reuse distance reference histograms show clear patterns
✤ Linear regression based reuse distance prediction
✤ Coarse-granularity uniform interleaving assumption
✤ Verified with a Pin-based cache simulator
✤ Memory bandwidth contention modeling