Lecture 10: Memory Dependence Detection and Speculation
Memory correctness, dynamic memory disambiguation, speculative disambiguation, Alpha 21264 example

Register and Memory Dependences
- Store: SW Rt, A(Rs)
  1. Calculate the effective memory address ⇒ dependent on Rs
  2. Write to the D-Cache ⇒ dependent on Rt, and cannot be speculative
- Load: LW Rt, A(Rs)
  1. Calculate the effective memory address ⇒ dependent on Rs
  2. Read the D-Cache ⇒ could be memory-dependent on pending writes!
- Compare with "ADD Rd, Rs, Rt": when is the memory dependence known? What is the difference?

Load/Store Buffer in Tomasulo
- Datapath: IM → Fetch Unit → Decode/Rename/Regfile with a Reorder Buffer; an S-buf, an L-buf, and reservation stations feed DM, FU1, and FU2
- Original Tomasulo:
  - Load/store addresses are pre-calculated before scheduling
  - Loads are not dependent on other instructions
  - Stores are dependent on the instructions producing the store data

Memory Correctness and Performance
- Correctness conditions:
  - Only committed store instructions can write to memory
  - Any load instruction receives its memory operand from its parent (a store instruction)
  - At the end of execution, any memory word holds the value of the last write
- Provide dynamic memory disambiguation: check the memory dependence between stores and loads
- Performance: exploit memory-level parallelism

Dynamic Scheduling with Integer Instructions
- Centralized design example: Fetch Unit → Decode/Rename/Regfile with a Reorder Buffer → centralized RS → I-FU, I-FU, FU, FU; an S-buf (addr, data) in front of the D-Cache
- Centralized reservation stations usually include the load buffer
- Integer units are shared by load/store and ALU instructions
- What is the challenge in detecting memory dependence?

Load/Store with Dynamic Execution
- Only committed store instructions can write to memory
  ⇒ use the store buffer as a temporary place for store instruction output
- Any memory word receives the value of the last write
  ⇒ store instructions write to memory in program order
- Memory-level parallelism can be exploited
  ⇒ non-speculative solution: load bypassing and load forwarding
  ⇒ speculative solution: speculative load execution
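The correctness conditions above (only committed stores write to memory, stores drain in program order, speculative stores get flushed) can be sketched with a toy store-buffer model. This is a minimal illustration, not the exact design on the slides; the class and method names (`StoreBuffer`, `issue`, `commit_oldest`, `drain`, `flush_speculative`) are my own assumptions.

```python
# Toy model of the rule "only committed store instructions can write to memory":
# stores sit in a store buffer until commit, then drain to memory in program order.
class StoreBuffer:
    def __init__(self):
        self.entries = []  # program order, oldest first: {addr, data, committed}

    def issue(self, addr, data):
        # Store executes: address/data are known, but memory is NOT touched yet.
        self.entries.append({"addr": addr, "data": data, "committed": False})

    def commit_oldest(self):
        # ROB retires the oldest uncommitted store: mark it committed.
        for e in self.entries:
            if not e["committed"]:
                e["committed"] = True
                return

    def drain(self, memory):
        # Only committed stores reach memory, and only in program order.
        while self.entries and self.entries[0]["committed"]:
            e = self.entries.pop(0)
            memory[e["addr"]] = e["data"]

    def flush_speculative(self):
        # Exception/mis-prediction: discard uncommitted (speculative) stores.
        self.entries = [e for e in self.entries if e["committed"]]

memory = {0x100: 0}
sb = StoreBuffer()
sb.issue(0x100, 42)   # speculative store executes
sb.drain(memory)      # nothing written: not committed yet
assert memory[0x100] == 0
sb.commit_oldest()
sb.drain(memory)      # the committed store now reaches memory
assert memory[0x100] == 42
```

A `flush_speculative()` call before `commit_oldest()` would have discarded the store entirely, which is exactly why a store "cannot be speculative" about its D-cache write.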
Store Buffer Design Example
- Store instruction life cycle:
  1. Wait in the RS until the base address and data are ready
  2. Calculate the address, move to the store buffer
  3. Move the data directly to the store buffer
  4. Wait for commit (if no exception/mis-predict)
  5. Wait for a memory port
  6. Write to the D-cache
- Otherwise the store is flushed before writing the D-cache
- Buffer entries carry a committed bit C and a data-ready bit Ry (young at top, old at bottom):

  C  Ry  addr  data
  0  0   ...   ...   (young)
  0  1   ...   ...
  1  -   ...   ...   (architectural state)
  1  -   ...   ...   (old, next to write the D-cache)

Memory Dependence
- Any load instruction receives its memory operand from its parent (a store instruction)
- If any previous store has not written the D-cache, what to do?
- If any previous store has not finished, what to do?
- Simple design: delay all following loads; but how about performance?

Memory-level Parallelism
- Example:
    for (i = 0; i < 100; i++)
        A[i] = A[i]*2;
    Loop: L.S  F2, 0(R1)
          MULT F2, F2, F4    ; F4 stores 2.0
          SW   F2, 0(R1)
          ADD  R1, R1, 4
          BNE  R1, R3, Loop
- Overlapping the reads and writes of different iterations gives a significant improvement over sequential reads/writes

Load Bypassing and Load Forwarding
- Non-speculative solution using dynamic disambiguation: match the load address with all store addresses in the store unit
- Load bypassing: start the cache read if no match is found
- Load forwarding: use the store buffer value if a match is found
- In-order execution limitation: a load must wait until all previous stores have finished

In-order Execution Limitation
- Example 1:
    for (i = 0; i < 100; i++)
        A[i] = A[i]/2;
    Loop: L.S  F2, 0(R1)
          DIV  F2, F2, F4
          SW   F2, 0(R1)
          ADD  R1, R1, 4
          BNE  R1, R3, Loop
  When is the SW result available, and when can the next load start?
  Possible solution: start the store address calculation early ⇒ more complex design
- Example 2:
    a->b->c = 100;
    d = x;
  When is the store address "a->b->c" available?

Speculative Load Execution
- If no dependence is predicted, send loads out even if the dependence is unknown
- Do the address matching when stores commit
- Match found: memory dependence violation; flush the pipeline
- Otherwise: continue
- Structures: a store queue and a load queue in front of the D-cache
- Note: load forwarding may still be needed (not shown)
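The non-speculative load bypassing / load forwarding rules can be sketched as a small address-matching routine. This is an illustrative sketch only; the function name `disambiguate_load` and the data layout are assumptions, not part of the lecture.

```python
# Non-speculative dynamic disambiguation: a load compares its address against
# all OLDER stores in the store buffer.
#   - some older store address still unknown -> stall (conservative)
#   - youngest matching store               -> load forwarding (use buffer data)
#   - no match                              -> load bypassing (read the D-cache)
def disambiguate_load(load_addr, older_stores, dcache):
    """older_stores: program order, oldest first; entries are
    (addr_or_None, data). addr None means the address is not yet computed."""
    if any(addr is None for addr, _ in older_stores):
        return ("stall", None)            # cannot prove independence yet
    # Search youngest-to-oldest: forward the most recent matching write.
    for addr, data in reversed(older_stores):
        if addr == load_addr:
            return ("forward", data)      # load forwarding
    return ("bypass", dcache[load_addr])  # load bypassing

dcache = {0x200: 7, 0x204: 9}
assert disambiguate_load(0x204, [(0x200, 1)], dcache) == ("bypass", 9)
assert disambiguate_load(0x200, [(0x200, 1), (0x200, 5)], dcache) == ("forward", 5)
assert disambiguate_load(0x200, [(None, 3)], dcache) == ("stall", None)
```

The `stall` case is exactly the in-order limitation on the slide: one unresolved store address (e.g. the pointer chain `a->b->c`) blocks every following load, even loads that are in fact independent.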
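Speculative load execution defers the dependence check to store commit time. A minimal sketch of that check, assuming a hypothetical load-queue layout (`(sequence number, address)` pairs) and function name:

```python
# Speculative load execution: loads access the D-cache immediately and record
# their address in a load queue. When a store commits, its address is matched
# against YOUNGER loads that already executed; a match means the load read
# stale data -> memory dependence violation -> flush and replay.
def check_store_commit(store_addr, store_seq, load_queue):
    """load_queue entries: (seq, addr) of loads that have already executed.
    seq numbers increase in program order."""
    for seq, addr in load_queue:
        if seq > store_seq and addr == store_addr:
            return "violation"  # flush the pipeline, replay from the load
    return "ok"

lq = [(5, 0x300), (8, 0x308)]    # loads 5 and 8 already executed
assert check_store_commit(0x400, 3, lq) == "ok"         # no address overlap
assert check_store_commit(0x308, 6, lq) == "violation"  # load 8 ran too early
assert check_store_commit(0x308, 9, lq) == "ok"         # load is older than store
```

Flushing on a violation is expensive, which motivates the prediction mechanism on the next slides: remember which loads have trapped and make only those wait.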
Alpha 21264 Pipeline
- Int issue queue feeding Addr ALU, Int ALU, Int ALU, Addr ALU over two copies of the Int RF (80 entries each)
- FP issue queue feeding two FP ALUs over the FP RF (72 entries)
- D-TLB, load queue (L-Q), store queue (S-Q), AF, and a dual-ported D-Cache

Alpha 21264 Load/Store Queues
- 32-entry load queue, 32-entry store queue

Speculative Memory Disambiguation
- The load PC indexes a 1024-entry, 1-bit table; the renamed instruction carries the bit to the int issue queue; loads and stores go through the load queue and store queue to the D-cache
- If a load/store address match is detected: mark a store-load trap to flush the pipeline (at commit)
- To help predict memory dependence:
  - Whenever a load causes a violation, set its stWait bit in the table
  - When the load is fetched, get its stWait bit from the table and send it to the issue queue with the load instruction
  - A load waits there if its stWait bit is set and any previous store exists
  - The table is cleared periodically

Load Bypassing, Forwarding, and RAW Detection
- At commit, check the ROB head: load or store?
- Load: WAIT if the load-queue head is not completed, then move the load-queue head
- Store: mark the committed store-queue head as completed, then move the store-queue head
- Load addresses and store addresses are matched at commit; if a match is found, forward the store data

Architectural Memory States
- Committed (completed) store-queue entries sit on top of the memory hierarchy: L1-Cache, L2-Cache, L3-Cache (optional), Memory, Disk/Tape, etc.
- A memory request searches the hierarchy from top to bottom

Summary of Superscalar Execution
- Instruction flow techniques: branch prediction, branch target prediction, and instruction prefetch
- Register data flow techniques: register renaming, instruction scheduling, in-order commit, mis-prediction recovery
- Memory data flow techniques: load/store units, memory consistency

Source: Shen & Lipasti reference book
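The 21264-style stWait predictor can be sketched as a small table model. The sizes and policies (1024 one-bit entries indexed by load PC, periodic clearing) come from the slides; the class name, indexing hash, and method names are my assumptions.

```python
# Sketch of a 21264-style store-wait predictor: a 1024-entry, 1-bit table
# indexed by the load PC. A set bit makes the load wait for older stores.
class StWaitTable:
    SIZE = 1024

    def __init__(self):
        self.bits = [0] * self.SIZE

    def index(self, pc):
        return (pc >> 2) % self.SIZE  # word-aligned PCs; simple direct index

    def record_violation(self, load_pc):
        # This load caused a store-load trap: make it wait next time.
        self.bits[self.index(load_pc)] = 1

    def must_wait(self, load_pc, prior_store_in_flight):
        # The load waits in the issue queue only if its stWait bit is set
        # AND some older store has not yet issued.
        return bool(self.bits[self.index(load_pc)]) and prior_store_in_flight

    def periodic_clear(self):
        # Cleared periodically so stale predictions do not throttle loads forever.
        self.bits = [0] * self.SIZE

t = StWaitTable()
pc = 0x1040
assert not t.must_wait(pc, prior_store_in_flight=True)   # first time: speculate
t.record_violation(pc)                                   # trap detected at commit
assert t.must_wait(pc, prior_store_in_flight=True)       # now the load waits
assert not t.must_wait(pc, prior_store_in_flight=False)  # no older store: issue
t.periodic_clear()
assert not t.must_wait(pc, prior_store_in_flight=True)
```

The periodic clear is the trade-off the slides point at: without it, a single past violation would pin a load to conservative in-order behavior even after the conflicting store pattern disappears.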