ACMP: An Architecture to Handle Amdahl’s Law
M. Aater Suleman
Advisor: Yale Patt
HPS Research Group
Acknowledgements
• Eric Sprangle, Intel
• Anwar Rohillah, Intel
• Anwar Ghuloum, Intel
• Doug Carmean, Intel
Background
• Single-thread performance is power constrained
• To leverage CMPs for a single application, it must be parallelized
• Many kernels cannot be parallelized completely
• Applications likely include both serial and parallel portions
• Amdahl’s law is more applicable now than ever
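For context, Amdahl’s law ties these points together: with P the fraction of execution that can be parallelized and N the number of cores,

\[ \text{Speedup}(P, N) = \frac{1}{(1 - P) + \frac{P}{N}} \]

so even a small serial fraction (1 − P) caps the achievable speedup no matter how many cores are added.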
Serial Bottlenecks
• Inherently serial kernels, e.g. the recurrence below (see the sketch after this slide)
  For I = 1 to N
    A[I] = (A[I-1] + A[I])/2
• Parallelization requires effort
[Chart: Degree of Parallelism vs. Programmer Effort, with regions labeled Data-parallel Loops, Loops with early termination, and Irregular code]
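A minimal sketch of the distinction in C (function names are assumptions; OpenMP is used only for illustration): the recurrence above cannot be split across threads because iteration i reads the value written by iteration i-1, whereas an element-wise loop with independent iterations can be.

#include <stddef.h>

/* Inherently serial: iteration i depends on the result of iteration i-1,
   so the loop cannot be distributed across cores. */
void smooth_serial(double *a, size_t n) {
    for (size_t i = 1; i < n; i++)
        a[i] = (a[i - 1] + a[i]) / 2.0;
}

/* Data-parallel: each iteration touches only its own elements,
   so iterations can run concurrently on the small cores. */
void add_parallel(const double *a, const double *b, double *c, size_t n) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}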
CMP Architectures
• Tile small cores, e.g. Sun Niagara, Intel Larrabee
  – High throughput on the parallel part
  – Low serial-thread performance
  – Highest performance for completely parallelized applications
• Tile large cores, e.g. Intel Core2Duo, AMD Barcelona, and IBM Power 5
  – High serial-thread performance
  – Lower throughput than Niagara
ACMP
• Run the serial thread on the large core to extract ILP
• Run parallel threads on the small cores
Performance vs. Parallelism
[Chart: speedup vs. one P6-type core as a function of Degree of Parallelism (0 to 1) for ACMP, Niagara, and P6-Tile]
• At low parallelism, ACMP and P6-Tile outperform Niagara
• At high parallelism, Niagara outperforms ACMP
• At medium parallelism, ACMP wins
• The cut-off point moves to the right in the future
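Curves with this shape can be reproduced from a simple Amdahl-style model. The sketch below is illustrative only: it assumes the large core delivers twice the serial performance of a small core, occupies the area of four small cores (matching the 16-small / 4-large / 1-large+12-small configurations), and also joins the parallel phase on the ACMP.

#include <stdio.h>

#define LARGE_PERF 2.0   /* assumed: large core = 2x a small core on serial code */

/* p = parallel fraction; performance is in units of one small core */
static double exec_time(double p, double serial_perf, double parallel_perf) {
    return (1.0 - p) / serial_perf + p / parallel_perf;
}

int main(void) {
    double base = 1.0 / LARGE_PERF;   /* time on a single P6-type (large) core */
    for (double p = 0.0; p <= 1.0001; p += 0.1) {
        double niagara = exec_time(p, 1.0, 16.0);                     /* 16 small cores */
        double p6tile  = exec_time(p, LARGE_PERF, 4.0 * LARGE_PERF);  /* 4 large cores */
        double acmp    = exec_time(p, LARGE_PERF, LARGE_PERF + 12.0); /* 1 large + 12 small */
        printf("P=%.1f  ACMP=%.2f  Niagara=%.2f  P6-Tile=%.2f\n",
               p, base / acmp, base / niagara, base / p6tile);
    }
    return 0;
}

With these assumed ratios the output shows the same qualitative behavior as the plot: ACMP and P6-Tile lead at low parallelism, ACMP wins in the middle, and Niagara pulls ahead only near full parallelism.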
Experimental Methodology
• Large core: out-of-order (similar to P6)
• Small core: 2-wide, in-order
• Configurations:
  – Niagara: 16 small cores
  – P6-Tile: 4 large cores
  – ACMP: 1 large core, 12 small cores
• Single ISA, shared memory, private L1 and L2 caches, bi-directional ring interconnect
• Simulated existing multi-threaded applications without modification
• ACMP thread scheduling (a sketch follows this slide)
  – Master thread → large core
  – All additional threads → small cores
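A minimal sketch of this scheduling policy on a Linux-style system with pthreads. The CPU numbering is an assumption: logical CPU 0 stands in for the large core and CPUs 1..12 for the small cores; pthread_setaffinity_np is a GNU extension.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Assumed (hypothetical) CPU numbering on the ACMP */
#define LARGE_CORE  0
#define FIRST_SMALL 1
#define NUM_SMALL   12

static void pin_to_cpu(pthread_t t, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t, sizeof(set), &set);
}

static void *worker(void *arg) {
    /* parallel portion runs here, on a small core */
    return NULL;
}

int main(void) {
    pthread_t workers[NUM_SMALL];

    /* Master (serial) thread -> large core */
    pin_to_cpu(pthread_self(), LARGE_CORE);

    /* All additional threads -> small cores */
    for (int i = 0; i < NUM_SMALL; i++) {
        pthread_create(&workers[i], NULL, worker, NULL);
        pin_to_cpu(workers[i], FIRST_SMALL + i);
    }
    for (int i = 0; i < NUM_SMALL; i++)
        pthread_join(workers[i], NULL);

    /* serial portion continues on the large core */
    return 0;
}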
Performance Results
[Bar chart: speedup of P6-Tile and ACMP relative to Niagara (Niagara = 1) for benchmarks grouped by low, medium, and high parallelism: is_nasp, ep_nasp, art_omp, mg_nasp, fmm_splash, cholesky, page convert, h.264 ed, mcf, fft_splash, cg_nasp]
Summary
• ACMP trades peak parallel performance for serial performance
• Improves performance for a wide range of applications
• Performance is less dependent on the length of the serial portion
• Improves programmer efficiency
  – Programmers can parallelize only the easier-to-parallelize kernels
Future Work
• Enhanced ACMP scheduling
  – Accelerate execution of finer-grain serial portions (critical sections) using the large core (see the sketch after this slide)
  – Requires compiler support and minimal hardware
• Improved threading decisions based on run-time feedback
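A minimal software sketch of the critical-section idea, not the proposed mechanism (which would rely on compiler and hardware support); all names here are assumptions. A small-core thread packages a critical section as a function pointer plus argument, hands it to a server thread pinned to the large core (affinity setup as in the scheduling sketch above), and waits for completion. Because the server runs sections one at a time, they serialize on the fast core rather than on a lock held by a slow core.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request: the critical section packaged as code + argument */
typedef struct {
    void (*body)(void *);
    void *arg;
    bool done;
    pthread_mutex_t lock;
    pthread_cond_t cv;
} cs_request_t;

/* Single outstanding-request slot served by the large-core thread */
static cs_request_t *pending;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cv = PTHREAD_COND_INITIALIZER;

/* Called by small-core threads instead of running the section locally */
void run_on_large_core(void (*body)(void *), void *arg) {
    cs_request_t req = { .body = body, .arg = arg, .done = false };
    pthread_mutex_init(&req.lock, NULL);
    pthread_cond_init(&req.cv, NULL);

    pthread_mutex_lock(&q_lock);
    while (pending != NULL)              /* wait for the request slot */
        pthread_cond_wait(&q_cv, &q_lock);
    pending = &req;
    pthread_cond_broadcast(&q_cv);
    pthread_mutex_unlock(&q_lock);

    pthread_mutex_lock(&req.lock);       /* wait until the large core is done */
    while (!req.done)
        pthread_cond_wait(&req.cv, &req.lock);
    pthread_mutex_unlock(&req.lock);
}

/* Loop run by the thread pinned to the large core; sections execute
   one at a time, so they need no extra locking among themselves */
void *large_core_server(void *unused) {
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (pending == NULL)
            pthread_cond_wait(&q_cv, &q_lock);
        cs_request_t *req = pending;
        pending = NULL;
        pthread_cond_broadcast(&q_cv);
        pthread_mutex_unlock(&q_lock);

        req->body(req->arg);             /* execute the critical section */

        pthread_mutex_lock(&req->lock);
        req->done = true;
        pthread_cond_signal(&req->cv);
        pthread_mutex_unlock(&req->lock);
    }
    return NULL;
}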
Thank you