Bouquet of Instruction Pointers: Instruction Pointer Classifier-based Hardware Prefetching. DPC3@ISCA '19. Samuel Pakalapati (Intel Technology Pvt. Ltd. and BITS Pilani) and Biswabandan Panda (Indian Institute of Technology Kanpur)
Why a Bouquet? No single IP-based prefetcher performs well across all applications.
Our Goal: Idealistic Though ☺. [Diagram: Core, L1, L1 Prefetcher.] The dream: an L1 hit rate of 100% ☺, i.e., RIP memory wall ☺. Reality with the SPEC CPU 2017 benchmarks provided by DPC3: an L1 hit rate of 88.12%. What about L2? 23.55%.
Zooming into the Prefetcher. Inputs: the Instruction Pointer (IP, a.k.a. PC) and demand memory accesses (cache-line aligned addresses). Output: future memory accesses. We use the IP information: it can eliminate compulsory misses ☺. We started with the simplest IP prefetcher: IP-Stride.
IP-Stride Prefetcher [Fu et al., MICRO '92]. Table fields: IP, Last-address, Stride. Prefetch Address = Current Address + Stride. Good for constant strides.
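A minimal sketch of the IP-stride idea follows. The slide only gives the table fields and the formula Prefetch Address = Current Address + Stride; the function name, the dict-based table, and the rule that a stride must repeat once before a prefetch is issued are assumptions made for illustration.

```python
def ip_stride_prefetch(table, ip, addr):
    """Return a prefetch address, or None while the stride is unconfirmed.

    `table` maps ip -> (last_address, stride). Addresses are cache-line
    numbers. Requiring the same non-zero stride twice in a row is an
    assumption of this sketch, not something stated on the slide.
    """
    prefetch = None
    entry = table.get(ip)
    if entry is None:
        table[ip] = (addr, 0)          # first access by this IP
        return None
    last_addr, last_stride = entry
    stride = addr - last_addr
    if stride != 0 and stride == last_stride:
        prefetch = addr + stride       # Prefetch Address = Current + Stride
    table[ip] = (addr, stride)
    return prefetch


table = {}
issued = [ip_stride_prefetch(table, 0x400, a) for a in (100, 102, 104, 106)]
```

With the accesses 100, 102, 104, 106 from one IP, the first two calls only learn the stride, and the later calls prefetch one stride ahead.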
Our Bouquet. First IP prefetcher: constant stride.
Constant-stride prefetcher (CS class). Table fields: IP_index, IP_tag, Valid?, Last_page, Page_offset, Stride, Confidence. Page_offset is in [0,63]: the cache line offset within a 4KB OS page. If (current_page = last_page), then the stride is within a page. Page boundary learning: if (current_page = last_page ± 1), then Stride = 64 ± (page_offset_new - page_offset_old).
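One possible reading of the stride computation, including the page-boundary case, is sketched below. It treats the two (page, offset) pairs as positions on a line of 64-entry pages, so crossing into the next or previous page adds or subtracts 64. The function name and the choice to return None for jumps of more than one page are assumptions; the slide's own formula is the one stated above.

```python
LINES_PER_PAGE = 64  # offsets are in [0, 63] within a 4KB page


def cs_stride(last_page, last_offset, cur_page, cur_offset):
    """Stride in cache lines between two accesses by the same IP.

    Returns None when the pages are more than one apart, since the slide
    only describes learning within a page or across an adjacent page.
    """
    if cur_page == last_page:
        return cur_offset - last_offset
    if cur_page == last_page + 1:
        return cur_offset + LINES_PER_PAGE - last_offset
    if cur_page == last_page - 1:
        return cur_offset - LINES_PER_PAGE - last_offset
    return None
```

For example, moving from offset 62 of one page to offset 1 of the next page is a stride of +3 lines, the same value a stride-3 access stream would show within a page.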
Valid Bit? Table fields: IP_index, IP_tag, Valid?, Last_page, Page_offset, Stride, Confidence. Two different IP_tags (IP A and IP B) can map to the same IP_index. Example: IP A owns the entry: V=1. IP B maps to the same entry: V=0. IP A maps to the same entry again: V=1. If V=0 and the IP_tag is different, then clear the entry and make the confidence zero. Effect: ~ 2-way associative cache, minimizes collisions.
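The valid-bit behaviour can be sketched as a small state machine. This is one reading of the slide, not the authors' code: the class and function names are hypothetical, only the tag, valid bit, and confidence are modelled, and setting V=1 when a new IP takes over the entry is an assumption.

```python
class Entry:
    """A single IP-table entry, reduced to the fields this sketch needs."""

    def __init__(self):
        self.tag = None
        self.valid = 0
        self.confidence = 0


def touch(entry, ip_tag):
    """Update the valid bit for an access whose IP maps to `entry`."""
    if entry.tag == ip_tag:
        entry.valid = 1            # owner seen again: (re)assert V
        return "hit"
    if entry.valid == 1:
        entry.valid = 0            # first conflict: keep the owner, drop V
        return "kept"
    entry.tag = ip_tag             # second conflict in a row: replace
    entry.confidence = 0
    entry.valid = 1                # assumption: the new owner starts with V=1
    return "replaced"


e = Entry()
trace = [touch(e, t) for t in ("A", "B", "A", "B", "B")]
```

A single conflicting access does not evict the current owner; only two conflicting accesses without an intervening access by the owner do.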
Constant Stride Class. An IP accessing X, X+2, X+4, ...: constant stride of 2. An IP accessing X, X+3, X+4, X+2, ...: variable stride of ? See Signature Path Prefetching, DPC-2, MICRO '16.
Our Bouquet. First IP prefetcher: constant stride. Second IP prefetcher: complex stride.
Complex Stride (CPLX Class) [Kim et al., DPC-2/MICRO '16]. Table fields: IP, Signature, Stride, Confidence. Example: for IP A, signature Sig A (+1, +2, +3) predicts stride -3 with confidence 2/3. Observed strides: +1 +2 +3 -3, +1 +2 +3 -4, +1 +2 +3 -3. We call it the Delta Prediction Table (DPT).
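The slide's example can be reproduced with a simplified delta prediction table. In this sketch the signature is just the tuple of the last three strides and the confidence is a hit count over a total; a hardware design would hash the signature into a few bits and use saturating counters, so the class name and both choices are assumptions made for readability.

```python
from collections import Counter, defaultdict, deque


class DeltaPredictionTable:
    """Maps a signature (recent strides) to the strides that followed it."""

    def __init__(self, history_len=3):
        self.history_len = history_len
        self.history = deque(maxlen=history_len)
        self.table = defaultdict(Counter)

    def train(self, stride):
        # Record which stride followed the current signature, then shift it in.
        if len(self.history) == self.history_len:
            self.table[tuple(self.history)][stride] += 1
        self.history.append(stride)

    def predict(self, signature):
        """Return (stride, hits, total) for the signature, or None if unseen."""
        seen = self.table.get(tuple(signature))
        if not seen:
            return None
        stride, hits = seen.most_common(1)[0]
        return stride, hits, sum(seen.values())


dpt = DeltaPredictionTable()
for s in (+1, +2, +3, -3, +1, +2, +3, -4, +1, +2, +3, -3):
    dpt.train(s)
prediction = dpt.predict((+1, +2, +3))
```

After the twelve strides from the slide, the signature (+1, +2, +3) has been followed by -3 twice and by -4 once, which matches the slide's stride of -3 with confidence 2/3.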
From Stride to Stream: Global Stream. Accesses X, X+1, Y, Y+4, Z, ... are issued by IP X, IP Y, and IP Z. IP X drives the global stream: Y = X+2 and Z = X+7. IP independence can provide better coverage and timeliness.
Our Bouquet. First IP prefetcher: constant stride. Second IP prefetcher: complex stride. Third IP prefetcher: global stream.
Global Stream (GS Class). Access sequence: X, X+1, Y, Y+1, Z, ... Per-IP fields: IP, Stream Valid? (0/1), Stream Direction (+/-), Stream Strength? (e.g., IP X: Yes, +/-, Strong). Recent accesses are kept in the GHB (Global History Buffer) with n entries. If there are n/2 GHB hits, the stream is valid; if there are 3n/4 hits, it is strong. Prefetches issued: X+1, X+2, ..., X+PrefetchDegree.
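The two thresholds and the prefetch sequence can be sketched as follows. How a GHB hit is detected is not spelled out on the slide, so the hit count is taken as an input here; the function names and the use of inclusive (>=) comparisons are assumptions.

```python
def classify_stream(ghb_hits, n):
    """Return (stream_valid, stream_strong) from the GHB hit count.

    `n` is the number of GHB entries. Inclusive thresholds are an
    assumption of this sketch.
    """
    stream_valid = ghb_hits >= n / 2
    stream_strong = ghb_hits >= 3 * n / 4
    return stream_valid, stream_strong


def stream_prefetches(addr, direction, degree):
    """Cache lines addr+1 .. addr+degree (or downwards if direction is -1)."""
    return [addr + direction * k for k in range(1, degree + 1)]
```

For a 16-entry GHB, 8 hits would mark the stream valid and 12 hits would mark it strong under these assumptions.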
Our Bouquet. First IP prefetcher: constant stride. Second IP prefetcher: complex stride. Third IP prefetcher: global stream. Fourth prefetcher: next-line.
No-IP: Next-line (NL Class). Prefetch Address = Current Address + 1. Detrimental to performance in case of irregular accesses. SPECULATIVE NL: NL is ON if the L1 Misses Per Kilo Cycles (MPKC) is low (< 15 for single-core); NL is OFF otherwise.
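The speculative next-line switch amounts to a threshold on L1 misses per kilo cycles. A minimal sketch, assuming the counters are sampled over some window (the window length and the function names are not from the slide):

```python
def speculative_nl_enabled(l1_misses, cycles, threshold=15):
    """Return True when next-line prefetching should be ON.

    MPKC = L1 misses per 1000 cycles; 15 is the single-core threshold
    given on the slide.
    """
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    mpkc = l1_misses * 1000 / cycles
    return mpkc < threshold


def next_line(addr):
    """Prefetch Address = Current Address + 1 (in cache lines)."""
    return addr + 1
```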
The Bouquet: Constant Stride (CS class), Complex Stride (CPLX class), Global Stream (GS class), Next Line (NL class). Design choice: a hardware table for each class? Our proposal: IPCP, a single hardware table for all the classes.
Our Proposal (IPCP at L1). Input: L1 access [IP, access address]. Single IP table with fields: IP, Valid?, Page no., Page offset; Stride and Confidence (CS); Signature (CPLX); Stream valid?, Direction, Strength (GS). Supporting structures: the GHB and a Stride/Confidence table. Priority of classes: GS > CS > CPLX > NL. Prefetch degree: GS: 6, CS and CPLX: 3.
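The class priority can be read as a simple first-match selection. In the sketch below, the four boolean inputs stand for "this class is currently eligible for the IP" (for example, a valid stream or a confident stride); how eligibility is decided, and the function name, are assumptions. Only the degrees stated on the slide are included.

```python
# Prefetch degrees given on the slide; no degree is stated for NL.
PREFETCH_DEGREE = {"GS": 6, "CS": 3, "CPLX": 3}


def select_class(gs_ok, cs_ok, cplx_ok, nl_on):
    """Pick the class to prefetch with, following GS > CS > CPLX > NL."""
    for name, eligible in (("GS", gs_ok), ("CS", cs_ok),
                           ("CPLX", cplx_ok), ("NL", nl_on)):
        if eligible:
            return name
    return None  # no class applies: issue no prefetch
```

An IP that qualifies as both CS and CPLX would be handled as CS with degree 3 under this ordering.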
Our Proposal (IPCP at L2). On an L1 miss, the L1 prefetcher passes the class (GS, CS, CPLX, NL, NO) and the trained stride or stream direction to the L2. L2 table fields: IP, Valid?, Class_type, Stride. No IP classification at the L2; table construction is based on metadata. No prefetching for the CPLX class. Prefetch degree: 4 for GS and 4 for CS if the MSHR is less than half full, else 3.
Metadata. The L1 prefetch packet carries metadata: Stride (7 bits), Class-type (3 bits), SPEC_NL (1 bit). Stream direction in case of GS class type.
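The three fields fit in 11 bits. The sketch below packs and unpacks them; the field widths come from the slide, while the bit order, the two's-complement encoding of the stride, and the numeric class codes are assumptions made for illustration.

```python
STRIDE_BITS = 7
CLASS_BITS = 3


def pack_metadata(stride, class_type, spec_nl):
    """Pack (stride, class_type, spec_nl) into an 11-bit integer.

    Assumed layout, low to high bits: stride[6:0], class_type[2:0], spec_nl.
    """
    if not -64 <= stride <= 63:
        raise ValueError("stride does not fit in 7 signed bits")
    if not 0 <= class_type < (1 << CLASS_BITS):
        raise ValueError("class_type does not fit in 3 bits")
    word = stride & 0x7F                       # two's complement, 7 bits
    word |= class_type << STRIDE_BITS
    word |= int(bool(spec_nl)) << (STRIDE_BITS + CLASS_BITS)
    return word


def unpack_metadata(word):
    """Inverse of pack_metadata."""
    field = word & 0x7F
    stride = field - 128 if field & 0x40 else field   # sign-extend
    class_type = (word >> STRIDE_BITS) & 0x7
    spec_nl = (word >> (STRIDE_BITS + CLASS_BITS)) & 1
    return stride, class_type, spec_nl
```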
Hardware Overhead
Table        Entry size * #Entries                    Total
IP Table     77 * 1024 (L1) + 17 * 1024 (L2) bits     12.03 KB
DPT Table    9 * 4096 bits                            4.6 KB
GHB Table    16 * 58 bits                             928 bits
Others       100 bits, 86 bits
Total                                                 16.7 KB
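The table totals can be checked with a few lines of arithmetic. The sketch assumes 1 KB = 1000 bytes, which is the convention that reproduces the 12.03 KB figure, and leaves out the "Others" row because its breakdown is not clear from the slide.

```python
def kb(bits):
    """Convert bits to KB, assuming 1 KB = 1000 bytes."""
    return bits / 8 / 1000


ip_table_bits = 77 * 1024 + 17 * 1024   # L1 + L2 IP tables
dpt_bits = 9 * 4096
ghb_bits = 16 * 58

total_kb = kb(ip_table_bits + dpt_bits + ghb_bits)
```

The three tables alone come to about 16.76 KB, so the small "Others" row is consistent with the reported total of 16.7 KB.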
Single-core Performance [SPEC CPU 2017]. [Figure: per-trace results for the SPEC CPU 2017 traces, from 600.perlbench_s-570B to 657.xz_s-2302B, plus the geomean; y-axis from 0.9 to 2.5, with two off-scale values labeled 3.59 and 3.02.] On average: 43.75% improvement. Multi-core: 25 mixes, 22% improvement.
Distribution of IP Classes. [Figure: per-trace breakdown, from 0% to 100%, into the GS, CS, CPLX, and NL classes for the SPEC CPU 2017 traces, from 600.perlbench_s-570B to 657.xz_s-2302B, plus the mean.] On average, all classes trigger equally.
Comparison with the State-of-the-art: Performance [Higher the better]. Average improvement in %: BO [HPCA '16, DPC-2 Winner]: 34.53; SPP + Perceptron Filter [ISCA '19]: 40.40; IPCP: 43.75.
Key Takeaways. Access patterns can be classified based on IPs (IPCP). Classification at the L1, reuse at the L2 through metadata. Simple and modular collection of prefetchers. Prefetchers like ISB [MICRO '13] and IMP [MICRO '15] can be added to the bouquet seamlessly. High performance and low hardware overhead.
Dream ☺? With IPCP, the L1 hit rate jumps from 88.11% to 92.43% ☺. With IPCP, the L2 hit rate jumps from 23.55% to 51.82% ☺.
“Great things are done by a series of small things brought together.” Vincent Van Gogh, Dutch painter. Thank You