ACMP: An Architecture to Handle Amdahl’s Law
M. Aater Suleman
Advisor: Yale Patt
HPS Research Group
Acknowledgements
• Eric Sprangle, Intel
• Anwar Rohillah, Intel
• Anwar Ghuloum, Intel
• Doug Carmean, Intel
Background
• Single-thread performance is power constrained
• To leverage CMPs for a single application, it must be parallelized
• Many kernels cannot be parallelized completely
• Applications likely include both serial and parallel portions
• Amdahl’s law is more applicable now than ever
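For context, Amdahl’s law ties these points together: with P the fraction of execution that can be parallelized and N the number of cores,

\[ \text{Speedup}(P, N) = \frac{1}{(1 - P) + \frac{P}{N}} \]

so even a small serial fraction (1 − P) caps the achievable speedup no matter how many cores are added.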
Serial Bottlenecks
• Inherently serial kernels, e.g. the recurrence below (see the sketch after this slide)
  For I = 1 to N
    A[I] = (A[I-1] + A[I])/2
• Parallelization requires effort
[Chart: Degree of Parallelism vs. Programmer Effort, with regions labeled Data-parallel Loops, Loops with early termination, and Irregular code]
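A minimal sketch of the distinction in C (function names are assumptions; OpenMP is used only for illustration): the recurrence above cannot be split across threads because iteration i reads the value written by iteration i-1, whereas an element-wise loop with independent iterations can be.

#include <stddef.h>

/* Inherently serial: iteration i depends on the result of iteration i-1,
   so the loop cannot be distributed across cores. */
void smooth_serial(double *a, size_t n) {
    for (size_t i = 1; i < n; i++)
        a[i] = (a[i - 1] + a[i]) / 2.0;
}

/* Data-parallel: each iteration touches only its own elements,
   so iterations can run concurrently on the small cores. */
void add_parallel(const double *a, const double *b, double *c, size_t n) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}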
CMP Architectures
• Tile small cores, e.g. Sun Niagara, Intel Larrabee
  – High throughput on the parallel part
  – Low serial-thread performance
  – Highest performance for completely parallelized applications
• Tile large cores, e.g. Intel Core2Duo, AMD Barcelona, and IBM Power 5
  – High serial-thread performance
  – Lower throughput than Niagara
ACMP
• Run the serial thread on the large core to extract ILP
• Run parallel threads on the small cores
Performance vs. Parallelism
[Chart: speedup vs. one P6-type core as a function of Degree of Parallelism (0 to 1) for ACMP, Niagara, and P6-Tile]
• At low parallelism, ACMP and P6-Tile outperform Niagara
• At high parallelism, Niagara outperforms ACMP
• At medium parallelism, ACMP wins
• The cut-off point moves to the right in the future
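Curves with this shape can be reproduced from a simple Amdahl-style model. The sketch below is illustrative only: it assumes the large core delivers twice the serial performance of a small core, occupies the area of four small cores (matching the 16-small / 4-large / 1-large+12-small configurations), and also joins the parallel phase on the ACMP.

#include <stdio.h>

#define LARGE_PERF 2.0   /* assumed: large core = 2x a small core on serial code */

/* p = parallel fraction; performance is in units of one small core */
static double exec_time(double p, double serial_perf, double parallel_perf) {
    return (1.0 - p) / serial_perf + p / parallel_perf;
}

int main(void) {
    double base = 1.0 / LARGE_PERF;   /* time on a single P6-type (large) core */
    for (double p = 0.0; p <= 1.0001; p += 0.1) {
        double niagara = exec_time(p, 1.0, 16.0);                     /* 16 small cores */
        double p6tile  = exec_time(p, LARGE_PERF, 4.0 * LARGE_PERF);  /* 4 large cores */
        double acmp    = exec_time(p, LARGE_PERF, LARGE_PERF + 12.0); /* 1 large + 12 small */
        printf("P=%.1f  ACMP=%.2f  Niagara=%.2f  P6-Tile=%.2f\n",
               p, base / acmp, base / niagara, base / p6tile);
    }
    return 0;
}

With these assumed ratios the output shows the same qualitative behavior as the plot: ACMP and P6-Tile lead at low parallelism, ACMP wins in the middle, and Niagara pulls ahead only near full parallelism.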
Experimental Methodology
• Large core: out-of-order (similar to P6)
• Small core: 2-wide, in-order
• Configurations:
  – Niagara: 16 small cores
  – P6-Tile: 4 large cores
  – ACMP: 1 large core, 12 small cores
• Single ISA, shared memory, private L1 and L2 caches, bi-directional ring interconnect
• Simulated existing multi-threaded applications without modification
• ACMP thread scheduling (a sketch follows this slide)
  – Master thread → large core
  – All additional threads → small cores
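A minimal sketch of this scheduling policy on a Linux-style system with pthreads. The CPU numbering is an assumption: logical CPU 0 stands in for the large core and CPUs 1..12 for the small cores; pthread_setaffinity_np is a GNU extension.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Assumed (hypothetical) CPU numbering on the ACMP */
#define LARGE_CORE  0
#define FIRST_SMALL 1
#define NUM_SMALL   12

static void pin_to_cpu(pthread_t t, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t, sizeof(set), &set);
}

static void *worker(void *arg) {
    /* parallel portion runs here, on a small core */
    return NULL;
}

int main(void) {
    pthread_t workers[NUM_SMALL];

    /* Master (serial) thread -> large core */
    pin_to_cpu(pthread_self(), LARGE_CORE);

    /* All additional threads -> small cores */
    for (int i = 0; i < NUM_SMALL; i++) {
        pthread_create(&workers[i], NULL, worker, NULL);
        pin_to_cpu(workers[i], FIRST_SMALL + i);
    }
    for (int i = 0; i < NUM_SMALL; i++)
        pthread_join(workers[i], NULL);

    /* serial portion continues on the large core */
    return 0;
}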
Performance Results
[Bar chart: speedup of P6-Tile and ACMP relative to Niagara (Niagara = 1) for benchmarks grouped by low, medium, and high parallelism: is_nasp, ep_nasp, art_omp, mg_nasp, fmm_splash, cholesky, page convert, h.264 ed, mcf, fft_splash, cg_nasp]
Summary
• ACMP trades peak parallel performance for serial performance
• Improves performance for a wide range of applications
• Performance is less dependent on the length of the serial portion
• Improves programmer efficiency
  – Programmers can parallelize only the easier-to-parallelize kernels
Future Work
• Enhanced ACMP scheduling
  – Accelerate execution of finer-grain serial portions (critical sections) using the large core (see the sketch after this slide)
  – Requires compiler support and minimal hardware
• Improved threading decisions based on run-time feedback
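A minimal software sketch of the critical-section idea, not the proposed mechanism (which would rely on compiler and hardware support); all names here are assumptions. A small-core thread packages a critical section as a function pointer plus argument, hands it to a server thread pinned to the large core (affinity setup as in the scheduling sketch above), and waits for completion. Because the server runs sections one at a time, they serialize on the fast core rather than on a lock held by a slow core.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request: the critical section packaged as code + argument */
typedef struct {
    void (*body)(void *);
    void *arg;
    bool done;
    pthread_mutex_t lock;
    pthread_cond_t cv;
} cs_request_t;

/* Single outstanding-request slot served by the large-core thread */
static cs_request_t *pending;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cv = PTHREAD_COND_INITIALIZER;

/* Called by small-core threads instead of running the section locally */
void run_on_large_core(void (*body)(void *), void *arg) {
    cs_request_t req = { .body = body, .arg = arg, .done = false };
    pthread_mutex_init(&req.lock, NULL);
    pthread_cond_init(&req.cv, NULL);

    pthread_mutex_lock(&q_lock);
    while (pending != NULL)              /* wait for the request slot */
        pthread_cond_wait(&q_cv, &q_lock);
    pending = &req;
    pthread_cond_broadcast(&q_cv);
    pthread_mutex_unlock(&q_lock);

    pthread_mutex_lock(&req.lock);       /* wait until the large core is done */
    while (!req.done)
        pthread_cond_wait(&req.cv, &req.lock);
    pthread_mutex_unlock(&req.lock);
}

/* Loop run by the thread pinned to the large core; sections execute
   one at a time, so they need no extra locking among themselves */
void *large_core_server(void *unused) {
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (pending == NULL)
            pthread_cond_wait(&q_cv, &q_lock);
        cs_request_t *req = pending;
        pending = NULL;
        pthread_cond_broadcast(&q_cv);
        pthread_mutex_unlock(&q_lock);

        req->body(req->arg);             /* execute the critical section */

        pthread_mutex_lock(&req->lock);
        req->done = true;
        pthread_cond_signal(&req->cv);
        pthread_mutex_unlock(&req->lock);
    }
    return NULL;
}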
Thank you