Monte Carlo Processor Modeling of Contemporary Computer Architectures
Jeanine Cook
Students: Waleed Alkohlani, Ram Srinivasan
New Mexico State University
LACSS 2008

Problem
• Need tools to do performance analysis of contemporary architectures (design, prediction, procurement)
• Cycle-accurate simulation
  • Great for accuracy, hard on time!
  • Lack of freely available simulators that simulate contemporary architectures
• Analytic models
  • Hard to use
  • Not very accurate or robust

Solution
• Statistical model
  • Based on processor and application characteristics
  • Generates fast, accurate predictions
  • Can do more than just predict execution time
  • Robust

Monte Carlo Processor Modeling
• Processor pipeline abstracted into statistical model using
  • dynamic application profiles
  • processor microarchitecture characteristics
• Based on CPI = CPI_I + CPI_S
  • CPI_I ==> Intrinsic CPI based on issue width
  • CPI_S ==> CPI due to stalls

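A minimal sketch of the decomposition above, assuming CPI_S is obtained by dividing accumulated stall cycles by the number of instructions executed. The function name `total_cpi` and the sample numbers are illustrative and do not come from the talk.

```python
def total_cpi(cpi_intrinsic, stall_cycles, instructions):
    """Return CPI = CPI_I + CPI_S, with CPI_S = stall cycles per instruction."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    cpi_stall = stall_cycles / instructions
    return cpi_intrinsic + cpi_stall


# Hypothetical example: an ideal 2-issue pipeline (CPI_I = 0.5) that
# accumulates 250 stall cycles over 1000 instructions.
print(total_cpi(0.5, 250, 1000))  # 0.75
```

In a Monte Carlo setting the stall-cycle count would itself come from sampled events rather than from a fixed number.
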
Current Capabilities
• Single and multi-core
• In-order instruction execution
• Flexible cache model
• Captures instruction sequence relationships
• Niagara 1 and 2, Cell, Itanium

Future Capabilities
• Improved flexible cache model
• Implement out-of-order model methodology
• Develop method for modeling multi-threaded processors
• Implement power models for consumption prediction
• Integrate into communication model ==> MP model
• Modeling framework

Cell Model
• PPE (PowerPC) - 2-issue, in-order, 2-way SMT
• SPEs - 2-issue, in-order, SIMD
• EIB - 96 bytes/cycle

Synergistic Processing Elements (SPEs)
• SPU - statically scheduled, 128 x 128-bit regs, 256 KB local store (LS)
• MFC - handles communication; DMA requests (from PPE and SPUs), mailboxes, signals

SPU EUs

Partition   EU (latency in cycles)
Even        FP6 (6), FP7 (7), FPD (13), FX2 (2), FX3 (4), FXB (4), NOP (0)
Odd         LSU (6), BR (4), SHUF (4), SPR (4), LNOP (0)

• FPD not fully pipelined; when insn issued, stalls global insn issue for 6 cycles

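The latency table can be held as a small lookup structure. The sketch below copies the slide's numbers; the names `EU_LATENCY`, `lookup`, and `issue_block_cycles` are hypothetical and are not taken from the authors' model.

```python
# Latencies in cycles, copied from the SPU EU table.
EU_LATENCY = {
    "even": {"FP6": 6, "FP7": 7, "FPD": 13, "FX2": 2, "FX3": 4, "FXB": 4, "NOP": 0},
    "odd": {"LSU": 6, "BR": 4, "SHUF": 4, "SPR": 4, "LNOP": 0},
}
FPD_ISSUE_STALL = 6  # an issued FPD insn stalls global issue for 6 cycles


def lookup(eu):
    """Return (partition, latency) for an execution unit name."""
    for partition, units in EU_LATENCY.items():
        if eu in units:
            return partition, units[eu]
    raise KeyError(f"unknown execution unit: {eu}")


def issue_block_cycles(eu):
    """Cycles for which issuing to this unit blocks all further issue."""
    return FPD_ISSUE_STALL if eu == "FPD" else 0


print(lookup("FPD"), issue_block_cycles("FPD"))  # ('even', 13) 6
```
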
12 Steps (1)
1. Issue mechanism: SPU stall-at-use
2. CPI_I: should be 1/2, but due to even/odd restrictions, measured from the dynamic insn stream
3. Stall reasons: unresolved dependences, mis-speculated branches
4. EU characteristics: on prior slide; not shared
5. Cache characteristics: no cache hierarchy
6. Memory characteristics: only modeled SPUs; no direct access to memory; LS access latency 6 cycles
7. Branch predictor characteristics: fixed 18-cycle branch mis-predict penalty

12 Steps (2)
8. Variable latency: branch unit depends on quality of branch hints
9. Application characteristics: CPI_I, dynamic insn mix (generate transition probs), dependence distance histograms, hint-to-branch histogram, prob of taken and hinted branches
10. Collect application profile: designed instrumentation tool for Cell
11. Model
12. Validation

Cell SPU Model
[Figure omitted]

Token Generation
• Instruction mix translated to probability
• Tokens encoded as integers for each insn class
• Markov token generator

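One way to read the bullets above is as a first-order Markov chain over instruction classes. The sketch below estimates transition probabilities from a short trace of integer tokens and then draws a synthetic token stream. The integer codes, the trace, and both function names are assumptions made for illustration; the talk does not specify them.

```python
import random
from collections import Counter, defaultdict


def transition_probabilities(trace):
    """Estimate P(next token | current token) from consecutive pairs."""
    counts = defaultdict(Counter)
    for current, following in zip(trace, trace[1:]):
        counts[current][following] += 1
    probs = {}
    for current, successors in counts.items():
        total = sum(successors.values())
        probs[current] = {tok: n / total for tok, n in successors.items()}
    return probs


def generate_tokens(probs, start, length, seed=0):
    """Draw `length` tokens by walking the chain from `start`.

    `start` must be a token that has at least one observed successor.
    """
    rng = random.Random(seed)
    token, out = start, []
    for _ in range(length):
        # A token never seen with a successor falls back to the start row.
        successors = probs.get(token, probs[start])
        token = rng.choices(list(successors), weights=list(successors.values()))[0]
        out.append(token)
    return out


# Hypothetical encoding: 0 = FX, 1 = FP, 2 = LD, 3 = BR.
trace = [0, 1, 0, 2, 0, 0, 3, 0]
probs = transition_probabilities(trace)
print(generate_tokens(probs, start=0, length=10))
```

Seeding the generator keeps a run reproducible, which helps when comparing model output against measurements.
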
Dependence Generator and Stall Unit
• Based on application dependence histograms (e.g., FP-use, LD-use)

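As a rough illustration of how a dependence histogram could drive stalls, the sketch below samples a producer-to-consumer distance and charges the part of the producer latency that the distance does not cover. That stall rule, the distance unit (cycles), and the histogram values are assumptions, not details given in the talk; only the 6-cycle LSU latency comes from the EU table.

```python
import random


def sample_distance(histogram, rng):
    """Draw one dependence distance, weighted by its observed count."""
    distances = list(histogram)
    counts = [histogram[d] for d in distances]
    return rng.choices(distances, weights=counts)[0]


def stall_cycles(producer_latency, distance):
    """ASSUMPTION: the consumer waits for the latency not yet elapsed."""
    return max(0, producer_latency - distance)


def mean_stall(histogram, producer_latency, samples=10_000, seed=0):
    """Monte Carlo estimate of the average stall per dependent use."""
    rng = random.Random(seed)
    total = sum(
        stall_cycles(producer_latency, sample_distance(histogram, rng))
        for _ in range(samples)
    )
    return total / samples


# Hypothetical LD-use histogram: distance in cycles -> observed count.
ld_use_hist = {1: 40, 2: 30, 4: 20, 8: 10}
print(mean_stall(ld_use_hist, producer_latency=6))
```
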
Branch Hints
• SPUs statically predict branches not taken (0 cycle penalty)
• 18 cycle penalty for taken (mis-predicted) branches
• Hinting mechanism to reduce penalty
  • Hints take effect after 4th pipe stage
  • Take 9 more cycles to fetch target
  • If branch appears within 4 cycles of hint, hint does nothing

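The numbers above can be combined into a penalty function for taken branches. The sketch below is only one possible reading: the slide gives the unhinted penalty (18 cycles), the 4-cycle hint setup, and the 9-cycle target fetch, while the zero penalty at 13 or more cycles of lead and the linear ramp in between are assumptions. The function and constant names are hypothetical.

```python
MISPREDICT_PENALTY = 18   # cycles, taken branch with no effective hint
HINT_SETUP_CYCLES = 4     # hint takes effect after the 4th pipe stage
TARGET_FETCH_CYCLES = 9   # further cycles needed to fetch the target


def taken_branch_penalty(hint_distance=None):
    """Stall cycles for a taken branch.

    hint_distance is the number of cycles between hint and branch,
    or None when the branch is not hinted.
    """
    # Unhinted, or hinted too late for the hint to take effect.
    if hint_distance is None or hint_distance < HINT_SETUP_CYCLES:
        return MISPREDICT_PENALTY
    full_lead = HINT_SETUP_CYCLES + TARGET_FETCH_CYCLES
    # ASSUMPTION: enough lead time hides the whole penalty.
    if hint_distance >= full_lead:
        return 0
    # ASSUMPTION: in between, the branch waits for the remaining fetch cycles.
    return full_lead - hint_distance


for d in (None, 2, 6, 13):
    print(d, taken_branch_penalty(d))
```
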
Hint-to-Branch: Taken
[Figure omitted]

Problem with Branch Hints
• Hinted, not-taken branches can stall up to 27 cycles!
• Hinting not-taken branch - hint is probably wrong

Hint-to-Branch: Not Taken
[Figure omitted]

Hint Unit
• Service time of BR unit based on hint-to-branch distance histogram
• Statistically determine from application
  • probability that a branch is taken or not taken
  • probability that a branch is hinted

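A hedged sketch of how such a unit might draw a BR service time: sample whether the branch is taken and whether it is hinted, sample a hint-to-branch distance if hinted, then add a penalty to the 4-cycle BR latency. Treating taken and hinted as independent, adding the penalty to the base latency, and the `placeholder_penalty` stand-in are all assumptions. A real model would map distance to penalty using the measured hint-to-branch data.

```python
import random

BR_LATENCY = 4  # BR unit latency in cycles, from the SPU EU table


def sample_branch_service_time(p_taken, p_hinted, hint_hist, penalty, rng):
    """Draw one BR service time in cycles.

    penalty(taken, distance) is caller-supplied; distance is None
    for an unhinted branch.
    """
    taken = rng.random() < p_taken
    hinted = rng.random() < p_hinted  # ASSUMPTION: independent of `taken`
    distance = None
    if hinted:
        distances = list(hint_hist)
        weights = [hint_hist[d] for d in distances]
        distance = rng.choices(distances, weights=weights)[0]
    return BR_LATENCY + penalty(taken, distance)


def placeholder_penalty(taken, distance):
    """Stand-in only: 18 cycles for an unhinted taken branch, else 0."""
    return 18 if taken and distance is None else 0


rng = random.Random(0)
hist = {5: 3, 12: 1}  # hypothetical hint-to-branch distance counts
print(sample_branch_service_time(0.6, 0.5, hist, placeholder_penalty, rng))
```
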
Model Parameters
• CPI_I
• Instruction transition probabilities
• Dependence distance histograms
• Hint-to-branch distance histograms
• Probability of taken branch
• Probability of hinted branch