Cost-Effective Compiler Directed Memory Prefetching and Bypassing
Daniel Ortega, Eduard Ayguadé, Jean-Loup Baer and Mateo Valero
Departamento de Arquitectura de Computadores, Universidad Politécnica de Cataluña – Barcelona ({dortega,eduard,mateo}@ac.upc.es)
Department of Computer Science and Engineering, University of Washington – Seattle (baer@cs.washington.edu)
PACT 2002 – p.1/20
Conventional Cache Hierarchy

    Regs
      | 1 cycle
    L1  L1
      | 10s of cycles
    L2
      | >100 cycles
    Main Memory
Conventional Cache Hierarchy

    Regs
      | 3-5 cycles (more?) (smaller?)
    L1  L1
      | 10s of cycles
    L2
      | >100 cycles
    Main Memory
Our approach
- Attack the register-L1 gap with memory instruction bypassing
- Use a hardware prefetcher for the L1-L2 gap
  - Directed by the software
  - Thus simple hardware
  - No recovery needed in case of misprediction
Limit Study
[Figure: speedup (axis from 1.0 to 2.5) for applu, apsi, hydro2d, swim, tomcatv and Average. Series: No Prefetching + No Bypassing; No Prefetching + Perfect Bypassing; Perfect Prefetching + No Bypassing; Perfect Prefetching + Perfect Bypassing.]
Limit values for a 4-way machine
Index
- Motivation
- Memory Instruction Bypassing
- Compiler Directed Memory Prefetcher
- Comparison with APDP
MIB through Renaming I

The prefetch in the loop (left) is given a register operand (right):

    ...                    ...
    pref a[i]              pref r, a[i]
    ...                    ...
    load r, a[i+1]         load r, a[i+1]
    load r, a[i]           load r, a[i]
    load r, a[i+3]         load r, a[i+3]
    ...                    ...
    loop branch            loop branch
MIB through Renaming II

Renaming Table: three entries, each mapping r to f, at every step.
Special Renaming Table: three entries, each either empty (r ...) or mapping r to f.

    Instruction        Special Renaming Table
    (initial state)    r ...   r ...   r ...
    pref r, a[i]       r f     r f     r f
    load r, a[i+1]     r f     r ...   r f
    load r, a[i]       r ...   r ...   r f
    load r, a[i+3]     r ...   r ...   r ...
    loop branch        r ...   r ...   r ...
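The table states on this slide can be replayed with a small toy model. This is only a sketch under stated assumptions, not the hardware described in the talk: it assumes that the register-destination prefetch fills the Special Renaming Table with logical-to-physical bindings, and that each later load whose destination register hits in that table is renamed to the bound physical register and frees the entry. The class name `RenameStage` and the register names (`r1`, `f7`, ...) are hypothetical, since the slide's register indices are not available.

```python
class RenameStage:
    """Toy model of a renaming table plus a special renaming table (assumed behaviour)."""

    def __init__(self):
        self.renaming = {}  # logical register -> physical register
        self.special = {}   # logical register -> physical register holding prefetched data

    def pref(self, bindings):
        # Assumption: the prefetch binds each upcoming load destination
        # to a physical register that will receive the prefetched data.
        self.special.update(bindings)

    def load(self, dest):
        # On a hit, the load is renamed to the bound register and the entry is freed.
        phys = self.special.pop(dest, None)
        if phys is None:
            return False  # no entry: the load follows the normal memory path
        self.renaming[dest] = phys
        return True  # bypassed: no L1 access modelled for this load


# Replay of the slide's sequence with hypothetical register names.
stage = RenameStage()
stage.pref({"r1": "f7", "r2": "f8", "r3": "f9"})  # pref r, a[i]
for dest in ("r2", "r1", "r3"):                   # loads of a[i+1], a[i], a[i+3]
    bypassed = stage.load(dest)
    print(dest, "bypassed" if bypassed else "normal", sorted(stage.special))
```

Keying the special table by destination register mirrors the slide, where the entries are shown as r-to-f pairs rather than addresses.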
Decoupled Prefetcher
- Compiler inserts prefetching operations
- Instructions bring data closer to the processor
- Memory instruction bypassing takes care of bringing data to the register file from L1
- Compiler also instructs prefetching hardware to prefetch ahead (to L1)
- The number of prefetches is minimised by compiler control
Prefetching Mechanism

Table entry: pctag | type | last @

- Decode phase: the PC is matched against pctag
- Address calc.: stride = eff @ - last @
- Generate new prefetch: eff @ + stride * N
  - N depends on type
  - wait for free port
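The steps on this slide can be sketched as a small stride-prefetch table. This is a minimal illustration under assumptions, not the authors' implementation: the class and method names are hypothetical, the table is unbounded (the evaluated configurations use 16, 32 or 64 entries), and the "wait for free port" step is not modelled. The lookahead values follow the form of the N=(pref 1, load 2) policy named in the results, used here only as an example setting.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Example lookahead N per instruction type (assumed setting, for illustration only).
LOOKAHEAD = {"pref": 1, "load": 2}


@dataclass
class Entry:
    kind: str                        # the "type" field
    last_addr: Optional[int] = None  # the "last @" field


class StridePrefetcher:
    def __init__(self) -> None:
        self.table: Dict[int, Entry] = {}  # keyed by pctag (here: the full PC)

    def decode(self, pc: int, kind: str) -> None:
        # Decode phase: allocate an entry the first time the PC is seen.
        self.table.setdefault(pc, Entry(kind))

    def address_calc(self, pc: int, eff_addr: int) -> Optional[int]:
        # Address calc.: stride = eff @ - last @, then prefetch eff @ + stride * N.
        entry = self.table.get(pc)
        if entry is None:
            return None
        last, entry.last_addr = entry.last_addr, eff_addr
        if last is None:
            return None  # first access: no stride known yet
        stride = eff_addr - last
        return eff_addr + stride * LOOKAHEAD[entry.kind]


prefetcher = StridePrefetcher()
prefetcher.decode(0x40, "load")
for addr in (1000, 1008, 1016):
    print(addr, prefetcher.address_calc(0x40, addr))
```

A real design would also need a replacement policy for the bounded table and a queue for prefetches that cannot issue until a cache port is free.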
Performance Results
[Figure: speedup (axis from 1.0 to 2.0) for applu, apsi, hydro2d, swim, tomcatv and Average. Series: O3 LoadStore; O3+SW pref LoadStore; O3+SW pref only Bypassing; O3+SW pref Proposal 16; O3+SW pref Proposal 32; O3+SW pref Proposal 64; O3+SW pref Proposal infinite.]
Effect of number of streams in a 4-way machine
Performance Results II
[Figure: speedup (axis from 1.0 to 2.0) for applu, apsi, hydro2d, swim, tomcatv and Average. Series: O3 LoadStore; O3+SW pref LoadStore; O3+SW pref only Bypassing; O3+SW pref Proposal N=(pref 1, load 1); O3+SW pref Proposal N=(pref 1, load 2); O3+SW pref Proposal N=(pref 2, load 1).]
Effect of lookahead policy in a 4-way machine with 32 entries
Comparison with APDP
- Address Prediction for Data Prefetching
  - Predicts addresses of memory operations
  - Makes the prediction available as soon as the load reaches decode
  - Needs a recovery mechanism in case of misprediction
  - Dynamic mechanism
J. González and A. González, ICS 97
Prefetching Mechanism

Table entry: v | address | stride | conf. | val

- Decode phase: the PC indexes the table
- Is value correct? If yes, val is bypassed to the register file
- Address calc.: the new address is checked (?) against the table entry
- Update table: the difference (-) between the new address and the stored address updates the entry
- Generate prefetch?: address + stride gives the new prediction
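For contrast with the compiler-directed table, the following is a generic stride address predictor with a confidence counter. It is a toy sketch loosely modelled on the fields shown on this slide (address, stride, conf.), not the published APDP design: the counter width, the threshold, the update rule and all names are assumptions, and the val field and the recovery mechanism are left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional

CONF_MAX = 3        # assumed saturating counter limit
CONF_THRESHOLD = 2  # assumed confidence needed before a prediction is used


@dataclass
class Entry:
    address: int
    stride: int = 0
    conf: int = 0


class AddressPredictor:
    def __init__(self) -> None:
        self.table: Dict[int, Entry] = {}  # keyed by PC

    def predict(self, pc: int) -> Optional[int]:
        # Decode phase: offer address + stride only when confidence is high enough.
        entry = self.table.get(pc)
        if entry is None or entry.conf < CONF_THRESHOLD:
            return None
        return entry.address + entry.stride

    def update(self, pc: int, actual: int) -> bool:
        # Address calc.: compare the real address with address + stride,
        # then refresh stride, address and confidence.
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = Entry(actual)
            return False
        correct = entry.address + entry.stride == actual
        entry.conf = min(entry.conf + 1, CONF_MAX) if correct else max(entry.conf - 1, 0)
        entry.stride = actual - entry.address
        entry.address = actual
        return correct


predictor = AddressPredictor()
for addr in (100, 108, 116, 124):
    predictor.update(0x40, addr)
print(predictor.predict(0x40))
```

Because the prediction is made before the real address is known, a wrong guess that was already consumed has to be undone, which is the recovery cost the compiler-directed proposal avoids.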
Comparison Results
[Figure: speedup (axis from 1.0 to 2.0) for applu, apsi, hydro2d, swim, tomcatv and Average. Series: O3 LoadStore; O3 APDP only prefetching; O3 APDP (squash); O3 APDP (selective); O3+SW pref LoadStore; O3+SW pref APDP only prefetching; O3+SW pref APDP (squash); O3+SW pref APDP (selective).]
Effect of only prefetching in APDP (4-way and 2 ports)
Comparison Results
[Figure: speedup (axis from 0.9 to 1.2) for applu, apsi, hydro2d, swim, tomcatv and Average. Series: O3 LoadStore; O3 APDP (squash); O3 APDP (selective); O3 only Bypassing; O3 Proposal 16 entries; O3 Proposal 32 entries.]
4-way machine comparison against APDP
Memory traffic

                applu   apsi    hydro2d   swim    tomcatv
    LoadStore   1.00    1.00    1.00      1.00    1.00
    APDP sel.   1.14    1.14    1.24      1.21    1.21
    APDP sq.    1.14    1.15    1.24      1.21    1.22
    Bypassing   0.90    0.74    0.82      0.72    0.89
    Proposal    0.98    0.85    0.95      0.83    0.94
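To compare the schemes at a glance, the per-benchmark figures above can be averaged. The sketch below simply takes the arithmetic mean of each row; it assumes the values are memory traffic normalised to the LoadStore baseline, which the row of 1.00 values suggests but the slide does not state. The variable names are illustrative.

```python
from statistics import mean

BENCHMARKS = ["applu", "apsi", "hydro2d", "swim", "tomcatv"]

# Values copied from the memory traffic table, in BENCHMARKS order.
TRAFFIC = {
    "LoadStore": [1.00, 1.00, 1.00, 1.00, 1.00],
    "APDP sel.": [1.14, 1.14, 1.24, 1.21, 1.21],
    "APDP sq.": [1.14, 1.15, 1.24, 1.21, 1.22],
    "Bypassing": [0.90, 0.74, 0.82, 0.72, 0.89],
    "Proposal": [0.98, 0.85, 0.95, 0.83, 0.94],
}

# Arithmetic mean per scheme, rounded to three decimals.
averages = {scheme: round(mean(values), 3) for scheme, values in TRAFFIC.items()}

for scheme, avg in averages.items():
    print(f"{scheme:<10} {avg:.3f}")
```

On these five benchmarks the Bypassing row averages about 0.81 and the Proposal row 0.91, while both APDP rows average about 1.19.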
Any questions?
Thank you
dortega@ac.upc.es