High-Performance Execution of Multithreaded Workloads on CMPs
M. Aater Suleman
Advisor: Yale Patt
HPS Research Group, The University of Texas at Austin

How do we use the transistors?
• More transistors → Higher performance core
  – Performance increases without programmer effort
  – Larger cores are complex and consume more power
• More transistors → Bigger cache
  – Assist the core by reducing memory accesses
  – Easier to design and consume less power
  – Pentium M: 50M out of the 77M were cache
• More transistors → More cores
  – Chip Multiprocessors (CMPs)
  – Less complex
  – Run at lower frequency (Power ∝ frequency³)
But, do CMPs improve performance?

Multithreading
[Figure label: Single-Threaded]
• To leverage CMPs, applications must be split into threads
• But, can we do this for all applications?

Easy-to-parallelize Kernels
Kernel from ImageMagick:
    GrayscaleToMonochrome(picture)
        foreach (OldPixel in picture)
            if (OldPixel > Threshold)
                NewPixel = 1
            else
                NewPixel = 0

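A minimal Python sketch of why this kernel parallelizes easily: each output pixel depends only on its own input pixel, so the picture can be cut into chunks that are processed independently. The threshold value, the chunking scheme, and the function names are assumptions made for illustration; they are not taken from ImageMagick. CPython threads are used only to show the decomposition, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

THRESHOLD = 128  # hypothetical value; the slide does not specify one


def to_monochrome(pixels, threshold=THRESHOLD):
    """Serial reference: each output depends only on its own input pixel."""
    return [1 if p > threshold else 0 for p in pixels]


def to_monochrome_parallel(pixels, num_threads=4, threshold=THRESHOLD):
    """Threshold contiguous chunks independently, then concatenate in order."""
    chunk = max(1, -(-len(pixels) // num_threads))  # ceiling division
    chunks = [pixels[i:i + chunk] for i in range(0, len(pixels), chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        parts = list(pool.map(lambda c: to_monochrome(c, threshold), chunks))
    return [bit for part in parts for bit in part]


picture = [0, 200, 128, 129, 37, 255, 90, 140]  # made-up grayscale values
print(to_monochrome(picture))
print(to_monochrome_parallel(picture, num_threads=3))
```

Because no chunk reads another chunk's results, the threaded version returns the same list as the serial one for any number of threads.
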
Serial Kernels
Kernel from ImageMagick:
    Smooth(Picture)
        for i = 1 to N
            Pixel[i] = (Pixel[i-1] + Pixel[i]) / 2
[Figure: old pixels 1, 2, 3, 4 combined by "avg" steps into the new pixels]

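A short Python sketch of the dependency that makes this kernel serial, reading the pseudocode literally as an in-place update. It assumes 0-based indexing with the first pixel left unchanged, which the slide does not state; the function names are hypothetical. The second function shows what independent iterations would compute if each one read only the old picture, and the two results differ.

```python
def smooth_in_place(pixels):
    """The slide's loop: iteration i reads Pixel[i-1] after it was updated."""
    out = list(pixels)
    for i in range(1, len(out)):
        out[i] = (out[i - 1] + out[i]) / 2
    return out


def smooth_from_old_values(pixels):
    """Independent iterations: every read sees the original picture."""
    if not pixels:
        return []
    rest = [(pixels[i - 1] + pixels[i]) / 2 for i in range(1, len(pixels))]
    return [pixels[0]] + rest


old = [1, 2, 3, 4]
print(smooth_in_place(old))         # [1, 1.5, 2.25, 3.125]
print(smooth_from_old_values(old))  # [1, 1.5, 2.5, 3.5]
```

Iteration i cannot start until iteration i-1 has written its result, so splitting the loop across threads without changing the algorithm would change the output.
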
Amdahl’s Law
• As the number of cores increases, even a small serial part can have a significant impact on overall performance
• Future CMPs must improve performance of both parallel and serial parts

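A small Python sketch of the standard form of Amdahl's Law, speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction and n is the number of identical cores. The sample values of p and n are chosen only for illustration.

```python
def amdahl_speedup(p, n):
    """Speedup over one core when a fraction p of the work runs on n cores."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


for cores in (4, 16, 64, 256):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```

With p = 0.9, a 16-core chip yields a speedup of 6.4, and no core count can push the speedup past 1 / (1 - p) = 10, which is why the serial part matters more as cores are added.
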
Outline
• Background
• Speeding up serial part
  – Asymmetric Chip Multiprocessor (ACMP)
• Speeding up parallel part
  – Feedback-Driven Threading (FDT)
• Summary

Current CMP Architectures
“Niagara” Approach
• Tile many small cores
• Sun Niagara Processor
• High throughput on the parallel part
• Low performance on the serial part
[Figure: chip floorplan tiled with 16 Niagara-like small cores]

Current CMP Architectures
“Tile-Large” Approach
• Tile a few large cores
• IBM Power 5, AMD Barcelona, Intel Core2Quad
• High performance on the serial part
• Low throughput on the parallel part
[Figure: chip floorplans of the “Niagara” approach (16 Niagara-like small cores) and the “Tile-Large” approach (4 large cores)]

The Asymmetric Chip Multiprocessor (ACMP)
• Provide one large core and many small cores
• Accelerate serial part using the large core
• Execute parallel part on small cores for high throughput
[Figure: chip floorplans of the “Niagara”, “Tile-Large”, and ACMP approaches side by side]

The Asymmetric Chip Multiprocessor (ACMP)
• Analytical experiment details
  – One large core replaces four small cores
  – A large core provides 2x the performance of a small core

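A rough Python sketch of how the two analytical parameters above could be turned into speedup curves, using an Amdahl-style model. Several things here are assumptions rather than the author's stated method: the chip budget of 16 small-core areas (taken from the configurations in the methodology slide), perfect scaling of the parallel part, and the ACMP running its parallel part on the small cores only. All function and constant names are hypothetical.

```python
SMALL_PERF = 1.0  # performance of one small core (unit of measure)
LARGE_PERF = 2.0  # slide: a large core provides 2x performance
LARGE_AREA = 4    # slide: one large core replaces four small cores
CHIP_AREA = 16    # assumed budget, in small-core areas


def speedup_vs_one_large_core(p, serial_perf, parallel_throughput):
    """Amdahl-style time for parallel fraction p, normalized to one large core."""
    time = (1.0 - p) / serial_perf + p / parallel_throughput
    baseline_time = 1.0 / LARGE_PERF  # all work on a single large core
    return baseline_time / time


def niagara(p):
    # Serial part on one small core, parallel part on all 16 small cores.
    return speedup_vs_one_large_core(p, SMALL_PERF, CHIP_AREA * SMALL_PERF)


def tile_large(p):
    # Serial part on one large core, parallel part on all 4 large cores.
    cores = CHIP_AREA // LARGE_AREA
    return speedup_vs_one_large_core(p, LARGE_PERF, cores * LARGE_PERF)


def acmp(p):
    # Serial part on the large core, parallel part on the 12 small cores
    # (assumption: the large core sits out the parallel phase).
    small_cores = CHIP_AREA - LARGE_AREA
    return speedup_vs_one_large_core(p, LARGE_PERF, small_cores * SMALL_PERF)


for p in (0.5, 0.9, 0.96, 0.99):
    print(p, round(niagara(p), 2), round(tile_large(p), 2), round(acmp(p), 2))
```

Under these assumptions the ACMP leads at p = 0.9, and the Niagara and ACMP curves cross at p = 0.96. Letting the large core also take part in the parallel phase would move that crossover higher, so the exact point depends on the modeling choice.
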
Performance vs. Parallelism
[Figure: speedup vs. 1 large core (y-axis, 0 to 9) against degree of parallelism (x-axis, 0 to 1) for Niagara, Tile-Large, and ACMP]
• Both ACMP and Tile-Large outperform Niagara
• At medium parallelism, ACMP wins
• At high parallelism, Niagara outperforms ACMP
• Niagara beats ACMP at 97% parallelism

Throughput of ACMP vs. Niagara
[Figure: chip floorplan with one large core and many Niagara-like small cores]

ACMP Scheduling
[Figure: ACMP approach floorplan with one large core and 12 Niagara-like small cores]

Data Transfers in ACMP
• Data is transferred if the serial part requires the data generated by the parallel part or vice-versa
• ACMP
  – Data is transferred from all small cores
• Niagara/Tile-Large
  – Data is transferred from all but one core
• Number of data transfers increases by only 3.8%

Experimental Methodology
• Configurations:
  – Niagara: 16 small cores
  – Tile-Large: 4 large cores
  – ACMP: 1 large core, 12 small cores
• Simulated existing multithreaded applications without modification
• Simulation parameters:
  – x86 cycle-accurate processor simulator
  – Large core: 2 GHz, out-of-order, 128-entry window, 4-wide issue, 12-stage pipeline
  – Small core: 2 GHz, in-order, 2-wide, 5-stage pipeline
  – Private 32 KB L1, private 256 KB L2
  – On-chip interconnect: bi-directional ring

Performance Results
[Figure: speedup over Niagara (y-axis, 0 to 1.4) for Tile-Large and ACMP, with benchmarks grouped into low, medium, and high parallelism]