Prior work to mitigate the cost of rigid hierarchies
- Bypass levels to avoid cache pollution
  - Do not install lines at specific levels
  - Give lines low priority in the replacement policy
- Speculatively access up the hierarchy
  - Hit/miss predictors, prefetchers
  - Hide latency with speculative accesses
  - They must still check all levels for correctness, wasting energy and bandwidth
- It’s better to build the right hierarchy and avoid the root cause: unnecessary accesses to unwanted cache levels
[Figure: cache hierarchy diagrams with labels Private, L1 & L2, L3, and L4]

Jenga = flexible hardware + smart software
[Figure: timeline of the software/hardware loop. Software reads hardware monitors, optimizes hierarchies, and updates hierarchies in hardware. The loop repeats over time, with a 100 ms interval marked on the time axis.]

Jenga hardware: supporting virtual hierarchies (VHs)
- Cores consult the virtual hierarchy table (VHT) to find the access path
- Similar to Jigsaw [PACT’13, HPCA’15], but it supports two levels
[Figure: tiled chip with SRAM banks, DRAM banks, and NoC routers. Tile detail labels: Core, TLB, VH id, Addr, VHT, private cache. Highlighted example: a two-level VH using both SRAM and DRAM.]

Accessing a two-level virtual hierarchy
Access path: SRAM bank → DRAM bank → Mem
1. Core miss → VL1 bank: virtual L1 (VL1) in SRAM (bank 10)
2. VL1 miss → VL2 bank: virtual L2 (VL2) in DRAM (bank 38)
3. VL2 hit, serve line
[Figure: tile with core, private caches, and VHT, plus a DRAM cache bank. Numbered arrows 1 to 3 trace the access.]

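The access path above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the names `Bank`, `VirtualHierarchy`, and `access` are hypothetical, each virtual level is a single bank, and fills and evictions are not modeled. The real VHT lookup is a hardware structure and is not described at this level of detail on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Bank:
    """A cache bank, reduced to the set of line addresses it holds."""
    name: str
    lines: Set[int] = field(default_factory=set)


@dataclass
class VirtualHierarchy:
    """One VHT entry: a VL1 bank and an optional VL2 bank."""
    vl1: Bank
    vl2: Optional[Bank] = None  # None models a single-level VH


def access(vht: Dict[int, VirtualHierarchy], vh_id: int, addr: int) -> str:
    """Follow the access path after a private-cache miss."""
    vh = vht[vh_id]                      # the VHT gives the access path
    if addr in vh.vl1.lines:             # step 1: core miss goes to the VL1 bank
        return f"VL1 hit in {vh.vl1.name}"
    if vh.vl2 is not None and addr in vh.vl2.lines:  # step 2: VL1 miss goes to VL2
        return f"VL2 hit in {vh.vl2.name}"           # step 3: VL2 hit, serve line
    return "miss, go to main memory"


# Example mirroring the slide: VL1 in SRAM bank 10, VL2 in DRAM bank 38.
vht = {0: VirtualHierarchy(Bank("SRAM bank 10"), Bank("DRAM bank 38", {0x40}))}
print(access(vht, 0, 0x40))  # VL2 hit in DRAM bank 38
```

Setting `vl2=None` in an entry gives the single-level case, where a VL1 miss goes straight to memory.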
Accessing a single-level VH using SRAM + DRAM
- With the VHT, software can group any combination of banks to form a VH
- Example: a single-level VH using both SRAM and DRAM
[Figure: core with private caches and VHT. Accesses to Addr X and Addr Y are routed to banks of the single-level VH, with main memory behind it. Inset "Logically equivalent to…": core, private caches, and one cache level composed of SRAM banks and a DRAM bank.]

Jenga software: finding near-optimal hierarchies
Periodically, Jenga reconfigures VHs to minimize data movement
1. Hardware monitors (hardware) provide application miss curves
2. Virtual hierarchy allocation (software) chooses VH sizes & levels (VL1, VL2)
3. Bandwidth-aware placement (software) produces the final allocation
4. Reconfigure virtual hierarchies (hardware): set VHTs

Modeling performance of heterogeneous caches
- Treat SRAM and DRAM as different “flavors” of banks with different latencies
[Figure: chip map colored by latency from a start point, with DRAM banks marked. Plot: cache access latency vs. total capacity. Plot: latency vs. virtual cache size, where access latency plus miss latency (from the miss curve provided by hardware monitors) gives total latency. Caption: latency curve for a single-level, heterogeneous cache.]

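As a rough illustration of how such a latency curve could be assembled, the sketch below adds an access-latency term and a miss-latency term at every cache size. The model is an assumption made for this example, not the paper's exact formulation: capacity is taken from the lowest-latency banks first, accesses are assumed to spread evenly over the capacity used, and the function name, parameters, and numbers are all hypothetical.

```python
def latency_curve(banks, miss_ratio, mem_latency):
    """Total latency per access for each cache size from 1 to the total capacity.

    banks:       list of (capacity_units, access_latency), SRAM and DRAM alike
    miss_ratio:  miss_ratio[s] is the miss ratio with s units (from monitors)
    mem_latency: latency of a main-memory access
    """
    curve = []
    used = 0
    latency_sum = 0.0
    # Assumption: grow the cache through the lowest-latency banks first.
    for capacity, latency in sorted(banks, key=lambda b: b[1]):
        for _ in range(capacity):
            used += 1
            latency_sum += latency
            access = latency_sum / used               # average bank latency
            miss = miss_ratio[used] * mem_latency     # cost of going to memory
            curve.append((used, access + miss))
    return curve


# Hypothetical numbers: 2 units of SRAM (10 cycles), 2 units of DRAM (40 cycles).
banks = [(2, 10), (2, 40)]
miss_ratio = [1.0, 0.8, 0.6, 0.3, 0.1]  # index = cache size in units
for size, total in latency_curve(banks, miss_ratio, mem_latency=100):
    print(size, round(total, 1))
```

With these made-up inputs the total latency keeps falling as DRAM capacity is added, because the drop in miss latency outweighs the higher access latency.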
Optimizing hierarchies by minimizing system latency
- Our prior work has proposed algorithms that take latency curves, allocate capacity, and place the allocations on chip to minimize system latency
- But it only builds single-level VHs
[Figure: latency vs. capacity curves for App1, App2, and App3, and the resulting split of capacity among the three apps]

Multi-level hierarchies are much more complex
- Many intertwined factors
  - Best VL1 size depends on VL2 size
  - Best VL2 size depends on VL1 size
  - Should we have VL2? (Depends on total size)
- Jenga encodes these tradeoffs in a single curve
  - Can reuse prior allocation algorithms

How to get a latency curve for a multi-level VH
- Two-level hierarchies form a latency surface!
- Project: best 1- and 2-level hierarchy at every size
- Best overall hierarchy at every size
- Curve lets us optimize multi-level hierarchies!
[Figure: latency surface over VL1 and VL2 sizes, its projection onto total size, and the resulting single latency curve]

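The projection step can be illustrated with a small brute-force sketch: for every total size, try each split into VL1 and VL2, compare against the single-level option, and keep the lowest latency. The function `project` and the toy latency models `one_level` and `two_level` are assumptions made for this example; the slides do not give the actual latency formulas.

```python
def project(one_level, two_level, total):
    """Collapse the latency surface into one curve.

    Returns {size: (latency, vl1_size, vl2_size)}; vl2_size == 0 means the
    best hierarchy at that size is single-level.
    """
    best = {}
    for size in range(1, total + 1):
        choice = (one_level(size), size, 0)
        for vl1 in range(1, size):            # every VL1/VL2 split of this size
            candidate = (two_level(vl1, size - vl1), vl1, size - vl1)
            choice = min(choice, candidate)   # keep the lowest latency
        best[size] = choice
    return best


# Toy model (hypothetical numbers): bigger levels are slower to access.
miss = [1.0, 0.5, 0.4, 0.3, 0.1]              # miss ratio by capacity

def one_level(s):
    return (10 + 6 * s) + miss[s] * 100

def two_level(s1, s2):
    return (10 + 6 * s1) + miss[s1] * 25 + miss[s1 + s2] * 100

for size, (lat, vl1, vl2) in project(one_level, two_level, 4).items():
    print(size, round(lat, 1), "VL1 =", vl1, "VL2 =", vl2)
```

In this toy setup the single-level hierarchy wins at small sizes and a two-level hierarchy wins at the largest size, which is the "should we have VL2?" tradeoff the single curve is meant to capture.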
Allocating virtual hierarchies
1. Latency curves for VH1, VH2, and VH3
2. Cache allocation algorithm → total capacity of each VH
3. Decide the best hierarchy → virtual hierarchy size and levels (in the example: VH1 gets VL1; VH2 gets VL1; VH3 gets VL1 and VL2)

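A minimal sketch of this flow, with a simple greedy allocator standing in for the cache allocation algorithm from prior work. The greedy rule (give the next unit of capacity to the VH whose latency drops the most) is an assumption chosen for brevity and is only reliable when the curves are convex; the function name and all numbers are hypothetical.

```python
def allocate(curves, total):
    """Split `total` capacity units among VHs.

    curves[vh][s] is the latency of that VH with s units, for s = 0..total.
    Greedy stand-in: each unit goes to the VH with the largest latency drop.
    """
    alloc = {vh: 0 for vh in curves}
    for _ in range(total):
        def gain(vh):
            s = alloc[vh]
            return curves[vh][s] - curves[vh][s + 1]
        winner = max(alloc, key=gain)
        alloc[winner] += 1
    return alloc


# Hypothetical convex latency curves for two VHs, sizes 0..4.
curves = {
    "VH1": [100, 60, 40, 30, 25],
    "VH2": [80, 65, 53, 50, 48],
}
print(allocate(curves, total=4))  # {'VH1': 2, 'VH2': 2}
```

Once each VH has its total capacity, the hierarchy choice recorded while building its latency curve (single-level or VL1 plus VL2 at that size) can simply be looked up.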
Bandwidth-aware virtual hierarchy placement
- Place data close without saturating DRAM bandwidth
- Every iteration, Jenga …
  - Chooses a VH (via an opportunity cost metric, see paper)
  - Greedily places a chunk of its data in its closest bank
  - Updates DRAM bank latency
[Figure: SRAM and DRAM banks holding VL1 and VL2 data. As chunks are placed, DRAM bank latencies are updated, shown with labels 1.0X, 1.1X, and 1.3X latency.]

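The iteration described above can be sketched as follows. Two parts are stand-ins and should be read as assumptions: VHs are visited round-robin instead of by the opportunity cost metric (which is only described in the paper), and the latency multipliers simply reuse the 1.0X, 1.1X, 1.3X labels from the figure as an illustrative load model. All names are hypothetical.

```python
MULTIPLIER = [1.0, 1.1, 1.3]  # illustrative: latency factor after 0, 1, 2+ chunks


def loaded_latency(base, load):
    """Bank latency once `load` chunks have already been placed in it."""
    return base * MULTIPLIER[min(load, len(MULTIPLIER) - 1)]


def place(demand, base_latency, capacity):
    """Greedy placement of one-chunk units.

    demand[vh]:             chunks the VH still needs to place
    base_latency[vh][bank]: unloaded latency from the VH's core to the bank
    capacity[bank]:         chunks the bank can hold
    """
    load = {bank: 0 for bank in capacity}
    placement = {vh: {} for vh in demand}
    remaining = dict(demand)
    while any(remaining.values()):
        for vh in demand:                      # round-robin stand-in
            if remaining[vh] == 0:
                continue
            free = [b for b in capacity if load[b] < capacity[b]]
            if not free:
                raise ValueError("not enough bank capacity")
            # Closest bank, judged by its current (load-adjusted) latency.
            bank = min(free, key=lambda b: loaded_latency(base_latency[vh][b], load[b]))
            placement[vh][bank] = placement[vh].get(bank, 0) + 1
            load[bank] += 1                    # this raises the bank's latency
            remaining[vh] -= 1
    return placement


demand = {"A": 2, "B": 1}
base_latency = {"A": {"b0": 10, "b1": 12}, "B": {"b0": 10.5, "b1": 20}}
capacity = {"b0": 3, "b1": 3}
print(place(demand, base_latency, capacity))
# {'A': {'b0': 1, 'b1': 1}, 'B': {'b0': 1}}
```

In the example, A's second chunk goes to the farther bank b1 because b0 has become slower under load, which is the effect the latency update is meant to capture.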
Jenga adds small overheads