


Lecture 25: Advanced Data Prefetching Techniques
Zhao Zhang, CPRE 585, Fall 2003

Slide 1: Title

Topics: prefetching and data prefetching overview, stride prefetching, Markov prefetching, precomputation-based prefetching.

Slide 2: Memory Wall

[Figure: CPU vs. DRAM performance, 1980-2000, log scale from 1 to 1000; the CPU curve pulls steadily away from the DRAM curve.]

Consider a memory latency of 1000 processor cycles, or a few thousand instructions.

Slide 3: Where Are Solutions?

Consider a 4-way issue OOO processor with:
- a 20-entry issue queue
- an 80-entry ROB
- 100ns main-memory access latency

For how many cycles will the processor stall on a cache miss to main memory? (As a rough, assumed calculation: at 2 GHz, 100 ns is 200 cycles, while a 4-way front end can fill the 80-entry ROB in about 80/4 = 20 cycles, so the processor sits stalled for most of the miss.) OOO processors may tolerate L2 latency, but not main-memory latency.

Increase the cache size? Add more levels of memory hierarchy?
- Itanium: 2-4MB L3 cache
- IBM Power4: 32MB eDRAM cache
- Large caches are still very useful, but may not fully address the issue.

Slide 4: Where Are Solutions? (cont.)

1. Reducing miss rates: larger block size, larger cache size, higher associativity, victim caches, way prediction and pseudo-associativity, compiler optimization.
2. Reducing miss penalty: multilevel caches, critical word first, read miss first, merging write buffers.
3. Reducing miss penalty or miss rates via parallelism: non-blocking caches, hardware prefetching, compiler prefetching.
4. Reducing cache hit time: small and simple caches, avoiding address translation, pipelined cache access, trace caches.

Slide 5: Prefetching Evaluation

Prefetch: predict future accesses and fetch data before it is demanded. Evaluation criteria (a short code sketch of the ratios follows after slide 6):
- Accuracy: how many prefetched items are really needed?
  - False prefetching: fetching the wrong data.
  - Cache pollution: replacing "good" data with "bad" data.
- Coverage: how many cache misses are removed?
- Timeliness: does the data return before it is demanded?
- Other considerations: complexity and cost.

Slide 6: Prefetching Targets

- Instruction prefetching: the stream buffer is very useful here.
- Data prefetching: more complicated because of the diversity of data access patterns.
- Prefetching for dynamic data (hashing, heaps, sparse arrays, etc.): usually irregular access patterns.
- Linked-list prefetching (pointer chasing): a special type of data prefetching for data held in linked lists.
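The accuracy and coverage criteria from slide 5 reduce to two ratios. Below is a minimal C sketch, assuming hypothetical simulator counters (all names and numbers here are illustrative, not from the lecture):

    #include <stdio.h>

    /* Hypothetical counters collected by a cache simulator. */
    struct prefetch_stats {
        long issued;          /* prefetches sent to memory                */
        long useful;          /* prefetched blocks referenced before eviction */
        long misses_removed;  /* demand misses eliminated by prefetching  */
        long misses_left;     /* demand misses that still occurred        */
    };

    static void report(const struct prefetch_stats *s) {
        double accuracy = (double)s->useful / s->issued;
        double coverage = (double)s->misses_removed
                          / (s->misses_removed + s->misses_left);
        printf("accuracy = %.2f, coverage = %.2f\n", accuracy, coverage);
    }

    int main(void) {
        struct prefetch_stats s = { 1000, 700, 650, 350 };  /* made-up numbers */
        report(&s);
        return 0;
    }

Low accuracy shows up as false prefetching and cache pollution; low coverage means the prefetcher addresses only a small part of the demand-miss stream. Timeliness is the one criterion such counters cannot capture, since it needs per-block timing.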

Slide 7: Prefetching Implementations

- Sequential and stride prefetching:
  - Tagged prefetching
  - Simple stream buffer
  - Stride prefetching
- Correlation-based prefetching:
  - Markov prefetching
  - Dead-block correlating prefetching
- Precomputation-based prefetching:
  - Keep running the program on cache misses; or
  - use separate hardware for prefetching; or
  - use compiler-generated threads on multithreaded processors.
- Other considerations:
  - Predict on miss addresses or on all reference addresses?
  - Prefetch into the cache or into a temporary buffer?
  - Demand-based or decoupled prefetching?

Slide 8: Recall the Stream Buffer Diagram

[Figure: a direct-mapped cache (tags, data) sits between the processor and the next level of cache. Beside it, a stream buffer: a FIFO queue with head and tail, each entry holding a tag, an availability bit, and one cache block of data; a tag comparator checks the head entry, and a +1 unit generates the next sequential block address. Source: Jouppi, ISCA '90.]

Shown with a single stream buffer (one way); multiple ways and an allocation filter may be used.

Slide 9: Stride Prefetching

Limits of the stream buffer:
- A program may access data in either direction, e.g.
    for (i = N-1; i >= 0; i--) ...
- Data may be accessed in strides, e.g.
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        sum[i] += X[i][j];

Slide 10: Stride Prefetching Diagram

[Figure: a reference prediction table indexed by the load PC. Each entry holds an instruction tag, the previous address, the stride, and a state field. The prefetch address is computed by adding the stride to the effective address.]

(A code sketch of this table appears below, after slide 12.)

Slide 11: Stride Prefetching Example

    float a[100][100], b[100][100], c[100][100];
    ...
    for (i = 0; i < 100; i++)
      for (j = 0; j < 100; j++)
        for (k = 0; k < 100; k++)
          a[i][j] += b[i][k] * c[k][j];

Reference prediction table contents over the first three iterations of the k loop:

    Iteration 1:   tag     addr   stride  state
                   Load b  20000  0       init
                   Load c  30000  0       init
                   Load a  30000  0       init

    Iteration 2:   Load b  20004  4       trans.
                   Load c  30400  400     trans.
                   Load a  30000  0       steady

    Iteration 3:   Load b  20008  4       steady
                   Load c  30800  400     steady
                   Load a  30000  0       steady

Slide 12: Markov Prefetching

Targets irregular memory access patterns, e.g. the miss-address stream

    A B C D C E A C F F E A A B C D E A B C D C

[Figure: a Markov model built from the miss stream drives a prediction table. Each row maps a miss address to up to four predicted next addresses (pred 1 through pred 4), which are pushed into a prefetch queue. A code sketch follows below.]

Joseph and Grunwald, ISCA 1997.
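Slides 10-11 in code form: a minimal sketch of the reference prediction table, using a simplified three-state (init/trans/steady) variant of the Chen-and-Baer-style state machine; the table size, names, and exact transitions are assumptions chosen to reproduce the example table above:

    #include <stdbool.h>
    #include <stdint.h>

    enum rpt_state { INIT, TRANS, STEADY };

    struct rpt_entry {
        uint64_t tag;        /* PC of the load                  */
        uint64_t prev_addr;  /* last effective address observed */
        int64_t  stride;     /* last observed stride            */
        enum rpt_state state;
        bool     valid;
    };

    #define RPT_SIZE 64
    static struct rpt_entry rpt[RPT_SIZE];

    /* Called for every executed load. Returns true and sets *prefetch_addr
     * once the entry has confirmed its stride (steady state). */
    bool rpt_access(uint64_t pc, uint64_t addr, uint64_t *prefetch_addr) {
        struct rpt_entry *e = &rpt[pc % RPT_SIZE];

        if (!e->valid || e->tag != pc) {   /* allocate a fresh entry */
            *e = (struct rpt_entry){ pc, addr, 0, INIT, true };
            return false;
        }

        int64_t stride = (int64_t)(addr - e->prev_addr);
        if (stride == e->stride) {
            e->state = STEADY;             /* stride confirmed     */
        } else {
            e->state = (e->state == STEADY) ? INIT : TRANS;
            e->stride = stride;            /* learn the new stride */
        }
        e->prev_addr = addr;

        if (e->state == STEADY) {
            *prefetch_addr = addr + (uint64_t)e->stride;
            return true;
        }
        return false;
    }

Tracing the matrix-multiply example through this code reproduces the table on slide 11: Load a keeps stride 0 and reaches steady in iteration 2, while Load b and Load c pass through trans. in iteration 2 and reach steady (strides 4 and 400) in iteration 3.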

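Before the performance results on the next slides, a minimal sketch of the Markov prediction table from slide 12, assuming up to four successors per miss address kept in most-recently-used order (the row count, replacement policy, and names are illustrative, not from Joseph and Grunwald):

    #include <stdint.h>
    #include <string.h>

    #define MARKOV_ROWS 1024
    #define NPRED       4

    /* Each row maps a miss address to its most recent successor miss
     * addresses, most recent first (pred 1 .. pred 4 on the slide). */
    struct markov_row {
        uint64_t miss_addr;
        uint64_t pred[NPRED];
        int      npred;
    };

    static struct markov_row table[MARKOV_ROWS];
    static uint64_t last_miss;   /* previous miss address; 0 = none yet */

    /* Called on every cache miss: train on the (last_miss -> addr)
     * transition, then emit predictions for the new miss address. */
    int markov_miss(uint64_t addr, uint64_t out[NPRED]) {
        if (last_miss != 0) {
            struct markov_row *r = &table[last_miss % MARKOV_ROWS];
            if (r->miss_addr != last_miss) {         /* (re)allocate row */
                memset(r, 0, sizeof *r);
                r->miss_addr = last_miss;
            }
            /* Move addr to the front of the successor list (MRU). */
            int i, found = r->npred;
            for (i = 0; i < r->npred; i++)
                if (r->pred[i] == addr) { found = i; break; }
            if (found == r->npred && r->npred < NPRED) r->npred++;
            for (i = (found < NPRED ? found : NPRED - 1); i > 0; i--)
                r->pred[i] = r->pred[i - 1];
            r->pred[0] = addr;
        }
        last_miss = addr;

        /* Predict: return the recorded successors of this miss address. */
        struct markov_row *r = &table[addr % MARKOV_ROWS];
        if (r->miss_addr != addr) return 0;
        memcpy(out, r->pred, (size_t)r->npred * sizeof out[0]);
        return r->npred;                 /* number of prefetches to enqueue */
    }

Feeding it the slide's miss stream builds rows such as one for A holding its observed successors (B, C, and A itself, in recency order), so each later miss on A enqueues those addresses for prefetching.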
Slide 13: Markov Prefetching Performance

[Figure: miss coverage of Markov prefetching; bars from left to right correspond to increasing numbers of addresses in the prediction table.]

Slide 14: Markov Prefetching Performance (cont.)

[Figure: comparison of prefetchers; bars from left to right: stream, stride, correlation (Pomerene and Puzak), Markov, stream+stride+Markov in series, stream+stride+Markov in parallel.]

Slide 15: Predictor-Directed Stream Buffer

Cons of the existing approaches:
- Stride prefetching (using an updated stream buffer): only useful for stride accesses, and interfered with by non-stride accesses.
- Markov prefetching: works for general access patterns but requires large history storage (megabytes).

PSB combines the two methods:
- to improve the coverage of the stream buffer; and
- to keep the required storage low (several kilobytes).

Sair et al., MICRO 2000.

Slide 16: Predictor-Directed Stream Buffer (cont.)

- The Markov prediction table filters out irregular address transitions (reducing stream-buffer thrashing).
- The stream buffer filters out the addresses used to train the Markov prediction table (reducing storage).
- Which predictor is used for each address? (A selection-logic sketch appears at the end of this section.)

Slide 17: Precomputation-Based Prefetching

Potential problems of stream-buffer or Markov prefetching: low accuracy => high memory-bandwidth waste.

Another approach: use some computation resources for prefetching, because computation is increasingly cheap. Speculative execution for prefetching:
- no architectural changes;
- not limited by hardware;
- high accuracy and good coverage.

Collins et al., MICRO 2001.

Slide 18: Prefetching by Dynamically Building the Data-Dependence Graph

Annavaram et al., "Data prefetching by dependence graph precomputation", ISCA 2001.
- Needs external help to identify problematic loads.
- Builds the dependence graph in reverse order.
- Uses a separate prefetching engine.

Example (the backward slice is sketched in C right after this slide):

    for (i = 0; i < 10; i++)
      for (j = 0; j < 100; j++)
        data[j]->val[j]++;

    Loop: I1  load  r1=[r2]
          I2  add   r3=r3+1
          I3  add   r6=r3-100
          I4  add   r2=r2+8
          I5  add   r1=r4+r1
          I6  load  r5=[r1]
          I7  add   r5=r5+1
          I8  store [r1]=r5
          I9  blt   r6, Loop

[Figure: pipeline stages IF / PRE-DECODE / DECODE / EXE / WB / COMMIT, with a DG generator feeding a DG buffer (IT, INST, OP1, OP2 fields), an updated instruction fetch queue, and a prefetching EXE engine.]
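To make slide 18's example concrete: the address of the delinquent load I6 is produced by the backward slice I1 (load the pointer), I4 (advance the pointer), and I5 (add the offset). A minimal C sketch of a precomputation slice built from that graph, using GCC's __builtin_prefetch; the struct layout and names are assumptions, not from the paper:

    struct elem { int val[100]; };  /* assumed layout behind data[j]->val[j] */

    /* Hypothetical precomputation slice: runs ahead of the main loop and
     * touches the addresses the delinquent load I6 will need. It executes
     * only the address-generating instructions (I1, I4, I5) and replaces
     * the load/update itself (I6-I8) with a prefetch hint. */
    void prefetch_slice(struct elem **data, int n) {
        for (int j = 0; j < n; j++) {
            struct elem *p = data[j];              /* I1/I4: pointer chase    */
            __builtin_prefetch(&p->val[j], 1, 0);  /* I6: write, low locality */
        }
    }

Because the slice skips the loop-control and update instructions (I2, I3, I7, I8), it can run well ahead of the main loop, which is what gives the prefetches their timeliness.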

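One way to picture the predictor-directed stream buffer of slides 15-16 before moving on: each stream buffer follows a stride by default, but a small Markov table, trained only on the transitions the stream buffers cannot cover, can override the next address. This is only an illustrative sketch; the allocation and confidence logic in Sair et al. is considerably more involved:

    #include <stdbool.h>
    #include <stdint.h>

    struct stream_buf {
        uint64_t next_addr;  /* address this buffer will fetch next */
        int64_t  stride;     /* stride learned for this stream      */
    };

    /* Hypothetical lookup into a small Markov table; stubbed out here so
     * the sketch compiles. */
    static bool markov_lookup(uint64_t addr, uint64_t *next) {
        (void)addr; (void)next;
        return false;
    }

    /* Pick the next prefetch address for one stream buffer: a recorded
     * irregular transition overrides the default stride path, so a single
     * buffer can follow a stream that mixes both behaviors. */
    uint64_t psb_next(struct stream_buf *sb) {
        uint64_t next;
        if (markov_lookup(sb->next_addr, &next))
            sb->next_addr = next;                   /* irregular transition */
        else
            sb->next_addr += (uint64_t)sb->stride;  /* regular stride path  */
        return sb->next_addr;
    }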
Slide 19: Using "Future" Threads for Prefetching

Balasubramonian et al., "Dynamically allocating processor resources between nearby and distant ILP," ISCA 2001.
- OOO processors stall on cache misses to DRAM because they exhaust some resource (issue queue, ROB, or registers).
- Why not keep the program running during the stall time, for prefetching? Then resources must be reserved for the "future" thread.
- The future thread continues execution for prefetching, using the existing OOO pipeline and functional units.
- It may release registers or ROB entries speculatively, and thus can examine a much larger instruction window.
- It is still accurate in producing reference addresses.

Slide 20: Precomputation with SMT Supporting Speculative Threads

Collins et al., "Speculative precomputation: long-range prefetching of delinquent loads," ISCA 2001.
- Precomputation is done by an explicit speculative thread (p-thread).
- The code of p-threads may be constructed by the compiler or by hardware.
- The main thread spawns p-threads on triggers (e.g. when a given PC is encountered); it passes some register values and an initial PC to the p-thread.
- A p-thread may trigger another p-thread for further prefetching.

For more compiler issues, see Luk, "Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors", ISCA 2001.

Slide 21: Summary of Advanced Prefetching

- Actively studied because of the increasing CPU-memory speed gap.
- Improves cache performance beyond the limit of cache size.
- Precomputation may be limited in prefetching distance (how good is its timeliness?).
- Note that there is no perfect cache/prefetching solution, e.g. (a compilable version follows below):

    while (1) {
      myload(addr);
      addr = myrandom() + addr;
    }

- How do we design complexity-effective memory systems for future processors?
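A compilable version of the adversarial loop above (myload and myrandom are stand-ins on the slide; here they become a plain array read and rand()). Each address depends on a value produced only by the previous iteration, and the resulting miss stream is effectively random, so stream, stride, and Markov prefetchers all have nothing to learn:

    #include <stdlib.h>

    #define HEAP_SIZE (64UL * 1024 * 1024)

    /* 'heap' stands for a large allocated region provided by the caller;
     * the modulo keeps the random walk inside it. */
    long chase(const char *heap, long iters) {
        long sum = 0;
        size_t addr = 0;
        for (long i = 0; i < iters; i++) {
            sum += heap[addr % HEAP_SIZE];   /* myload(addr)             */
            addr = (size_t)rand() + addr;    /* addr = myrandom() + addr */
        }
        return sum;
    }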
