Inherently Lower Complexity Architectures using Dynamic Optimization
Michael Gschwind, Erik Altman

What is the Problem?
Out-of-order superscalars achieve high performance ...
... but at the cost of high hardware complexity:
- Predictors
- Complex decode
- Complex issue queues with wakeup and issue logic
- Register mapping tables
- ...

What is the Problem?
Out-of-order superscalars achieve high performance ...
... but at the cost of high power:
- Many out-of-order components operate every cycle.
- Many components query a large set of data to operate on a single element.

What is the Problem?
Out-of-order superscalars achieve high performance ...
... but at the cost of deep pipelines:
- Complex logic has long latency.
- To achieve high frequency with long latency, super pipelining is required.
- Deep pipelines require excellent branch predictors.
- Excellent branch predictors are complex.
- Complex logic has long latency ...

What is the Problem?
Out-of-order superscalars achieve high performance ...
... but at the cost of high verification and debug complexity.
With Moore's Law, schedule slips = performance slips:

Schedule Slip    Relative Performance
1 month          4%
3 months         12%
6 months         26%
9 months         41%
12 months        59%
18 months        100%

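The figures in the table above are consistent with performance doubling every 18 months. The slide does not state that period, so it is an assumption here, and the function name is made up for illustration. A minimal sketch that reproduces the percentages under that assumption:

```python
def relative_performance_gap(slip_months, doubling_months=18.0):
    """Performance gained by the rest of the field during a schedule slip.

    Assumes performance doubles every `doubling_months` months
    (an assumed reading of Moore's Law, not stated on the slide).
    """
    return 2.0 ** (slip_months / doubling_months) - 1.0


for months in (1, 3, 6, 9, 12, 18):
    gap = relative_performance_gap(months)
    print(f"{months:2d} month slip -> {gap:.0%}")
```

Rounded to whole percent, the output matches the slide's 4%, 12%, 26%, 41%, 59%, and 100%.
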
What is the Solution?
Software Dynamic Optimization
Allows reduced hardware complexity:
- Shorter pipelines for same frequency.
- Fewer hardware predictors.
- Simpler issue logic.
- Less power, a la Transmeta.
- Less debug and verification.
- Smaller chips and higher yield.

How to Implement the Solution
BOA Architecture for Complexity Effective Design
BOA = Binary Translation Optimized Architecture
BOA in combination with its dynamic optimization software is architecturally compatible with PowerPC.

What is interesting about BOA?
- Software dynamic optimization.
- Precise behavior on most memory faults.
- Load/Store order tables ensure memory semantics and allow aggressive dynamic software reordering.
- Instruction recirculation mechanism to simplify issue and exception handling.
- Predictable latencies handled by software, unpredictable by hardware.

BOA System Architecture
[Figure: flowchart of the interpret/translate loop. Boxes: "Interpret Ins X (PowerPC)"; "Update Statistics"; "Goto next ins"; "Form Group at X and Translate Ins to BOA Instruc"; "Exec Group X's BOA Translation". Decisions: "X Prev Seen 15 times?" (Yes/No); "X Translated Entry Pt?" (Yes/No).]

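One way to read the flowchart is as a dispatch loop: run the translation if X is a translated entry point, translate once X has been seen 15 times, and otherwise interpret and update statistics. The sketch below is a reconstruction under that reading, not the BOA implementation. The class name and the `interpret_one` and `translate_group` callbacks are hypothetical; only the threshold of 15 comes from the slide.

```python
from collections import defaultdict

HOT_THRESHOLD = 15  # from the slide: "X Prev Seen 15 times?"


class DynamicOptimizer:
    """Toy interpret/translate dispatcher (all names here are hypothetical)."""

    def __init__(self, interpret_one, translate_group):
        self.interpret_one = interpret_one      # addr -> next addr, or None to halt
        self.translate_group = translate_group  # addr -> callable returning next addr
        self.seen = defaultdict(int)            # interpreter statistics per address
        self.translations = {}                  # translated entry points

    def step(self, addr):
        group = self.translations.get(addr)
        if group is not None:                   # X is a translated entry point
            return group()
        if self.seen[addr] >= HOT_THRESHOLD:    # X previously seen 15 times
            group = self.translate_group(addr)  # form group at X and translate
            self.translations[addr] = group
            return group()
        self.seen[addr] += 1                    # update statistics
        return self.interpret_one(addr)         # interpret, go to next instruction

    def run(self, entry):
        addr = entry
        while addr is not None:
            addr = self.step(addr)


# Usage: a one-instruction loop at address 0 that executes 20 times.
remaining = {"n": 20}
counts = {"interpreted": 0, "translated": 0}


def interpret_one(addr):
    counts["interpreted"] += 1
    remaining["n"] -= 1
    return addr if remaining["n"] > 0 else None


def translate_group(addr):
    def run_group():
        counts["translated"] += 1
        remaining["n"] -= 1
        return addr if remaining["n"] > 0 else None
    return run_group


opt = DynamicOptimizer(interpret_one, translate_group)
opt.run(0)
print(counts)
```

In this toy run the first 15 visits are interpreted and the remaining 5 execute the translated group.
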
BOA ISA (1)
- BOA is a variable-length VLIW machine.
- BOA instructions (bundles) are 128 bits.
- Bundles have 3 primitive ops.
- Primitive ops have 39 bits plus a stop bit.
- Complex PowerPC ops are cracked.
- 8 bits of each bundle are reserved for future uses such as predication.
Instruction Issue:
- Up to 6 primitive ops are issued together.
- Only the last op issued may have its stop bit set.

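The bundle sizes above add up: 3 ops x (39 bits + 1 stop bit) + 8 reserved bits = 128 bits. The sketch below packs and unpacks a bundle to make that arithmetic concrete. The field order (reserved bits at the top, stop bit as the low bit of each slot) is an assumption made for illustration, as are the function names; the slides do not give the actual bit layout.

```python
OP_BITS = 39               # from the slide
SLOT_BITS = OP_BITS + 1    # primitive op plus its stop bit
OPS_PER_BUNDLE = 3         # from the slide
RESERVED_BITS = 8          # from the slide
BUNDLE_BITS = OPS_PER_BUNDLE * SLOT_BITS + RESERVED_BITS  # 128


def pack_bundle(ops, reserved=0):
    """Pack three (op, stop) pairs into one integer (layout is assumed)."""
    if len(ops) != OPS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly 3 primitive ops")
    if not 0 <= reserved < (1 << RESERVED_BITS):
        raise ValueError("reserved field is 8 bits")
    word = reserved
    for op, stop in ops:
        if not 0 <= op < (1 << OP_BITS):
            raise ValueError("primitive op must fit in 39 bits")
        word = (word << SLOT_BITS) | (op << 1) | int(bool(stop))
    return word


def unpack_bundle(word):
    """Inverse of pack_bundle: returns ([(op, stop), ...], reserved)."""
    slot_mask = (1 << SLOT_BITS) - 1
    ops = []
    for _ in range(OPS_PER_BUNDLE):
        slot = word & slot_mask
        ops.append((slot >> 1, bool(slot & 1)))
        word >>= SLOT_BITS
    ops.reverse()  # slots were read from the last op back to the first
    return ops, word


bundle = pack_bundle([(0x1234, False), (0x7FFFFFFFFF, False), (42, True)], reserved=0xA5)
print(BUNDLE_BITS, unpack_bundle(bundle))
```
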
BOA Instructions

BOA ISA (2)
- 64 Integer Registers
- 64 Float Registers
- 16 4-bit Condition Registers
- Branches take 1 cycle:
  - Branch mispredicts cost 7 cycles
  - Static branch pred (using interpreter stats)
  - At most one branch per cycle

PowerPC State and Precise Exceptions
[Figure: diagram labeled "PowerPC Regs", "Shadow Regs", "Scratch Regs".]

BOA Latencies
- Integer ops take 1 cycle
  - No bypass => Dependent ops must be 2 cycles apart
- LOADs take 3 cycles
  - No bypass => Dependent ops must be 4 cycles later

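Both rules above fit the same pattern: a dependent op can issue one cycle after the producer's latency has elapsed. The sketch below encodes that for a simple dependence chain. Only the latencies and the resulting distances come from the slide; the one-cycle "no bypass" penalty constant and the helper names are an illustrative reading, not a description of the BOA scheduler.

```python
LATENCY = {"int": 1, "load": 3}  # cycles, from the slide
NO_BYPASS_PENALTY = 1            # assumed: one extra cycle because there is no bypass


def earliest_dependent_issue(producer_cycle, producer_kind):
    """Earliest cycle at which a consumer of the producer's result may issue."""
    return producer_cycle + LATENCY[producer_kind] + NO_BYPASS_PENALTY


def schedule_chain(kinds, start_cycle=0):
    """Issue cycles for a chain in which each op depends on the previous one."""
    cycles = []
    cycle = start_cycle
    for kind in kinds:
        cycles.append(cycle)
        cycle = earliest_dependent_issue(cycle, kind)
    return cycles


print(earliest_dependent_issue(0, "int"))       # integer producer at cycle 0
print(earliest_dependent_issue(0, "load"))      # LOAD producer at cycle 0
print(schedule_chain(["load", "int", "int"]))   # load -> add -> add
```
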
BOA Resources
- Issue slots: 6
- 2 LOAD/STORE units
  - Each with own copy of register file
- 4 Integer units
  - Each with own copy of register file
- 2 Float units
- 1 Branch unit
- 32-entry Load and Store Buffers
- Register scoreboarding of LOAD values
  - Stall when an op tries to use a loaded value that has not yet arrived

Dynamic Optimization

BOA Dynamic Optimization
BOA's software optimizer originates with IBM's earlier DAISY project.
BOA adjusted and tuned optimizer:
- To support a narrower, higher frequency target machine.
- To optimize along single hyperblock paths, instead of tree region with multiple paths.
- Improves code packing, reduces TLB misses
- Improves code layout and helps IFetch, a la trace caches.

Dynamic Optimization Environments
Dynamic Optimization can be used in a variety of environments:
- Process level
  - Idealized virtual memory
  - Fewer difficult system/kernel code issues
- Operating system level
  - No modifications to operating system
  - More transparent
  - Less danger of compatibility issues

Dynamic Optimization Targets (1)
Simpler implementation of the same architecture
Ability to bail out and revert to native execution:
- If overhead is too high
- For hard-to-emulate sequences
- When no benefit of DO can be measured, or performance actually degrades

Dynamic Optimization Targets (2)
Different architecture, e.g., RISC => VLIW
- Drastically simplify architecture
- Reduce decoding overhead even further
- Add more registers, add new concepts
- All code must be emulated. Can cause severe degradation if low reuse, e.g. WinStone.
- Get benefits of code packing
