Simultaneous Multithreading: Multiplying Alpha Performance
Dr. Joel Emer
Principal Member Technical Staff
Alpha Development Group
Compaq Computer Corporation
www.compaq.com

Outline
- Alpha Processor Roadmap
- Motivation for Introducing SMT
- Implementation of an SMT CPU
- Performance Estimates
- Architectural Abstraction

Alpha Microprocessor Overview
[Figure: Alpha roadmap by first system ship date, 1998-2003, running from lower cost to higher performance. Parts shown: 21264 EV6, 21264 EV67, 21264 EV68, EV7, EV78, EV8. Process nodes shown: 0.35 µm, 0.28 µm, 0.18 µm, 0.125 µm.]

EV8 Technology Overview
- Leading edge process technology: 1.2-2.0 GHz
  - 0.125 µm CMOS
  - SOI-compatible
  - Cu interconnect
  - low-k dielectrics
- Chip characteristics
  - ~1.2 V Vdd
  - ~250 million transistors
  - ~1100 signal pins in flip chip packaging

EV8 Architecture Overview
- Enhanced out-of-order execution
- 8-wide superscalar
- Large on-chip L2 cache
- Direct RAMBUS interface
- On-chip router for system interconnect
- Glueless, directory-based ccNUMA for up to 512-way SMP
- 4-way simultaneous multithreading (SMT)

Goals
- Leadership single stream performance
- Extra multistream performance with multithreading
  - Without major architectural changes
  - Without significant additional cost

Instruction Issue
[Figure: issue diagram, time axis]
Reduced function unit utilization due to dependencies

Superscalar Issue
[Figure: issue diagram, time axis]
Superscalar leads to more performance, but lower utilization

Predicated Issue
[Figure: issue diagram, time axis]
Adds to function unit utilization, but results are thrown away

Chip Multiprocessor
[Figure: issue diagram, time axis]
Limited utilization when only running one thread

Fine Grained Multithreading
[Figure: issue diagram, time axis]
Intra-thread dependencies still limit performance

Simultaneous Multithreading
[Figure: issue diagram, time axis]
Maximum utilization of function units by independent operations
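
The utilization argument in the issue slides above can be made concrete with a toy model. This is only a sketch under stated assumptions, not a model of EV8: the issue width, the per-cycle counts of ready instructions, and all function names are invented for illustration, and instructions that are not issued in a cycle are simply dropped rather than carried over.

```python
# Toy issue-slot model (illustrative assumptions only, not EV8 data).
# Each thread is a list giving how many independent instructions it has
# ready in each cycle; unissued instructions are not carried forward.

WIDTH = 4  # hypothetical issue width
THREADS = [
    [2, 1, 0, 2],
    [1, 2, 2, 0],
    [0, 1, 2, 1],
    [2, 0, 1, 2],
]


def superscalar(ready, width):
    """Single thread: issue up to `width` ready instructions per cycle."""
    return sum(min(r, width) for r in ready)


def fine_grained(threads, width):
    """One thread owns every issue slot in a cycle, chosen round-robin."""
    cycles = len(threads[0])
    issued = 0
    for c in range(cycles):
        owner = threads[c % len(threads)]
        issued += min(owner[c], width)
    return issued


def smt(threads, width):
    """Any thread may fill any issue slot in the same cycle."""
    cycles = len(threads[0])
    return sum(min(sum(t[c] for t in threads), width) for c in range(cycles))


def utilization(issued, width, cycles):
    """Fraction of issue slots that did useful work."""
    return issued / (width * cycles)


if __name__ == "__main__":
    cycles = len(THREADS[0])
    for name, issued in [
        ("superscalar, thread 0 only", superscalar(THREADS[0], WIDTH)),
        ("fine-grained multithreading", fine_grained(THREADS, WIDTH)),
        ("simultaneous multithreading", smt(THREADS, WIDTH)),
    ]:
        print(f"{name}: {utilization(issued, WIDTH, cycles):.2%}")
```

With these made-up inputs the single thread fills 5 of 16 slots, fine-grained multithreading fills 8 of 16, and SMT fills all 16, because SMT is the only scheme that lets slots left empty by one thread be used by another in the same cycle.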

Basic Out-of-order Pipeline
[Figure: pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire. Structures shown: PC, Icache, Register Map, Regs, Dcache, Regs. Label: Thread-blind.]

SMT Pipeline
[Figure: pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire. Structures shown: PC, Icache, Register Map, Regs, Dcache, Regs.]

Changes for SMT
- Basic pipeline: unchanged
- Replicated resources
  - Program counters
  - Register maps
- Shared resources
  - Register file (size increased)
  - Instruction queue
  - First and second level caches
  - Translation buffers
  - Branch predictor

Multiprogrammed workload
[Chart: bars for 1T, 2T, 3T and 4T on a 0%-250% scale, grouped by SpecInt, SpecFP and Mixed Int/FP]

Decomposed SPEC95 Applications
[Chart: bars for 1T, 2T, 3T and 4T on a 0%-250% scale, grouped by Turb3d, Swm256 and Tomcatv]

Multithreaded Applications
[Chart: bars for 1T, 2T and 4T on a 0%-300% scale, grouped by Barnes, Chess, Sort and TP]

Architectural Abstraction
- 1 CPU with 4 Thread Processing Units (TPUs)
- Shared hardware resources
[Figure: TPU0, TPU1, TPU2 and TPU3 above shared Icache, TLB, Dcache and Scache]

System Block Diagram
[Figure: grid of EV8 processors, each with attached M and IO blocks]

Quiescing Idle Threads
- Problem:
  - Spin looping thread consumes resources
- Solution:
  - Provide quiescing operation that allows a TPU to sleep until a memory location changes

Summary
- Alpha will maintain single stream performance leadership
- SMT will significantly enhance multistream performance
  - Across a wide range of applications,
  - Without significant hardware cost, and
  - Without major architectural changes
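
The quiescing idea from the Quiescing Idle Threads slide can be illustrated with a software analogy. This is a minimal sketch, assuming an ordinary Python thread stands in for a TPU and a condition variable stands in for the hardware wake-up; the class `WatchedCell` and the method `quiesce_until_changed` are hypothetical names, and nothing here describes the actual Alpha operation.

```python
import threading


class WatchedCell:
    """A value that waiters can sleep on until it changes (analogy only)."""

    def __init__(self, value=0):
        self._value = value
        self._cond = threading.Condition()

    def store(self, value):
        """Write the cell and wake every sleeping waiter."""
        with self._cond:
            self._value = value
            self._cond.notify_all()

    def load(self):
        """Read the current value."""
        with self._cond:
            return self._value

    def quiesce_until_changed(self, old, timeout=None):
        """Sleep until the cell differs from `old`, instead of spinning.

        Returns the value seen on wake-up. If `timeout` expires first,
        the unchanged value is returned.
        """
        with self._cond:
            self._cond.wait_for(lambda: self._value != old, timeout=timeout)
            return self._value


def demo():
    """One waiter sleeps on the cell; the main thread then updates it."""
    cell = WatchedCell(0)
    seen = []
    waiter = threading.Thread(
        target=lambda: seen.append(cell.quiesce_until_changed(0, timeout=5.0))
    )
    waiter.start()
    cell.store(1)
    waiter.join()
    return seen


if __name__ == "__main__":
    print(demo())
```

A spin loop would call `load()` repeatedly and keep competing for shared resources while it waits. The waiting thread here consumes nothing until `store()` signals a change, which is the property the slide asks of the hardware.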

References
- "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95.
- "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreaded Processor" by Tullsen, Eggers, Emer, Levy, Lo and Stamm in ISCA96.
- "Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading" by Lo, Eggers, Emer, Levy, Stamm and Tullsen in ACM Transactions on Computer Systems, August 1997.
- "Simultaneous Multithreading: A Platform for Next-Generation Processors" by Eggers, Emer, Levy, Lo, Stamm and Tullsen in IEEE Micro, October 1997.