Scatter updates are inefficient on conventional hierarchies (slide 6)
- Poor temporal and spatial locality when inputs do not fit in cache
  - Wasteful data transfers from main memory
- Multiple threads update the same vertex
  - Cache line ping-ponging
[Figure: memory hierarchy with main memory, a shared cache, and per-core caches, next to an example graph with vertices 0 to 4.]
[Chart: Push PageRank on uk-2005 graph. Y-axis: memory requests per edge (0.0 to 1.4), broken down into CSR, source vertex, destination vertex, and updates. Bars: Push, UB. Callouts: 93% of traffic is due to scatter updates; 10x more traffic than compulsory.]
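
To make the access pattern concrete, here is a minimal Python sketch of a push-style update loop over a CSR graph. The graph, the `push_updates` name, and the unit contributions are illustrative assumptions, not taken from the talk. The point is only that the destination index is data-dependent, so each update can land on a different cache line.

```python
def push_updates(offsets, neighbors, contrib, num_vertices):
    """Push each source vertex's contribution to all of its out-neighbors."""
    acc = [0.0] * num_vertices
    for src in range(len(offsets) - 1):
        # The CSR edge list is read sequentially: good spatial locality.
        for e in range(offsets[src], offsets[src + 1]):
            dst = neighbors[e]
            # Scatter update: dst depends on the data, so this
            # read-modify-write may touch a different line on every edge.
            acc[dst] += contrib[src]
    return acc


# Hypothetical 4-source graph in CSR form (3 out-edges per source).
offsets = [0, 3, 6, 9, 12]
neighbors = [0, 5, 11, 0, 7, 9, 4, 6, 8, 3, 8, 12]
contrib = [1.0, 1.0, 1.0, 1.0]
acc = push_updates(offsets, neighbors, contrib, num_vertices=16)
```

With several threads running this loop over different source ranges, two threads can update the same `acc[dst]` entry, which is the case the slide labels cache line ping-ponging.
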
Prior hardware support for scatter updates (slide 7)
- Remote memory operations (RMOs) send update operations to a fixed location (e.g., shared cache banks) and perform them there
  - Avoids cache-line ping-ponging
- COUP [MICRO’15] modifies the coherence protocol to perform commutative operations in a distributed fashion
- Neither RMOs nor COUP improves locality
  - Both are bottlenecked by memory traffic with large inputs
PHI builds on Update Batching (UB) (slide 8)
- Maximizes spatial locality of memory transfers using two-phase execution
- Binning phase: Logs updates to memory, dividing them into cache-fitting slices (bins) of vertices
- Accumulation phase: Reads and applies logged updates bin-by-bin
[Figure: 1. Binning phase, 2. Accumulation phase. Source vertices and their destination ids: A: 0, 5, 11; B: 0, 7, 9; C: 4, 6, 8; D: 3, 8, 12. Destination vertices (marked 0, 8, 16) are divided into cache-fitting slices. Updates are logged as (destination id, source) entries into per-slice bins in main memory (Bin 0, Bin 1); Bin 0 begins 0 A, 5 A, 0 B and ends with 3 D.]
Prior work: Propagation Blocking [IPDPS’17], MILK [PACT’16]
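
The two phases can be sketched in a few lines of Python. This is a software illustration under stated assumptions, not the implementation from Propagation Blocking or MILK: the function names are hypothetical, the slice size of 8 destination vertices is read off the figure's 0/8/16 markers, each source contributes a value of 1, and bins are plain in-memory lists. Entries are logged as (destination id, source) pairs as drawn in the figure.

```python
from collections import defaultdict


def binning_phase(edges, slice_size):
    """Log one (dst, src) entry per edge into the bin of dst's slice."""
    bins = defaultdict(list)
    for src, dsts in edges.items():
        for dst in dsts:
            # Appends to a bin are sequential, so bin writes stream to memory.
            bins[dst // slice_size].append((dst, src))
    return bins


def accumulation_phase(bins, values, num_vertices):
    """Apply logged updates one bin at a time."""
    acc = [0] * num_vertices
    for b in sorted(bins):
        # All destinations in this bin fall in one cache-fitting slice.
        for dst, src in bins[b]:
            acc[dst] += values[src]
    return acc


edges = {"A": [0, 5, 11], "B": [0, 7, 9], "C": [4, 6, 8], "D": [3, 8, 12]}
values = {"A": 1, "B": 1, "C": 1, "D": 1}  # assumed contributions
bins = binning_phase(edges, slice_size=8)
acc = accumulation_phase(bins, values, num_vertices=16)
```

Note that the binning phase emits one log entry per edge regardless of how often a destination repeats, which is the cost discussed on the tradeoffs slide.
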
Update Batching tradeoffs (slide 9)
- Perfect spatial locality for all main memory transfers
  - Compulsory memory traffic for all data structures
- Binning phase ignores temporal locality
  - Generates large stream of updates even with structured inputs
[Chart: Push PageRank on uk-2005 graph, two panels (unstructured input, structured input). Y-axis: memory requests per edge (0.0 to 1.4), broken down into CSR, source vertex, destination vertex, and updates. Bars: Push, UB, PHI.]
Agenda (slide 10)
- Background
- PHI Design
- Evaluation
Key techniques of PHI (slide 11)
- In-cache update buffering and coalescing
  - Exploits temporal locality
- Selective update batching
  - Achieves high spatial locality
- Hierarchical buffering and coalescing
  - Enables update parallelism
  - Eliminates synchronization overheads
[Slide annotations: the first two techniques are labeled "Bandwidth efficient"; the third is labeled "Synchronization efficient".]
In-cache buffering and coalescing (slide 12)
- Buffer updates in cache without ever accessing main memory
- Treat cache as a large coalescing buffer for updates
- Reduction ALU in cache bank performs coalescing
[Figure: core, cache, and memory. Memory holds the value 10 at address 0xF00. The core issues UPDATE 0xF00, +4 and the cache buffers 4 for 0xF00. The core then issues UPDATE 0xF00, +2 and the cache coalesces it with the buffered value, which changes from 4 to 6.]
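
The figure's sequence can be mimicked with a small Python model. This is a sketch under assumptions, not PHI's hardware design: `CoalescingCache` is a hypothetical name, addition stands in for the commutative reduction, and a missing entry is treated as the identity value 0.

```python
class CoalescingCache:
    """Toy model of a cache that buffers update deltas instead of data."""

    def __init__(self):
        self.deltas = {}  # address -> buffered (coalesced) delta

    def update(self, addr, delta):
        # Stand-in for the reduction ALU: combine the incoming delta with
        # whatever is already buffered. Main memory is never read here.
        self.deltas[addr] = self.deltas.get(addr, 0) + delta


memory = {0xF00: 10}   # value held in main memory, as in the figure
cache = CoalescingCache()
cache.update(0xF00, 4)  # cache now buffers 4
cache.update(0xF00, 2)  # coalesced: cache now buffers 6
```

After both updates the cache holds a single delta of 6 for 0xF00, and the value 10 in memory has not been touched.
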
Handling cache evictions (slide 13)
- PHI adapts to the amount of spatial locality in the evicted line
- Cache controller performs update batching selectively
  - Achieves good spatial locality in all cases
- Key insight: Update batching is a good tradeoff only when the evicted line has poor spatial locality
Case 1: Evicted line has few updates (slide 14)
- Log updates to temporary buffers (stored in cache)
- These buffers are later evicted to memory when full
[Figure: cache and memory. Legend: INV = invalid line; "0xF0: 0 4 0 0" = buffered-updates line; "F00 4 A48 7" = line with batched updates. Animation steps:
 1. The cache holds the batched-updates line 0x10: F00 4 and the buffered-updates lines 0xA4: 0 0 7 0 and 0xF8: 0 3 0 0.
 2. Evict 0xA4: its update is logged into the batched line, giving 0x10: F00 4 A48 7, and 0xA4 becomes INV.
 3. Evict 0xF8: line 0x10 is full, so it is evicted to memory (memory now holds 0x10: F00 4 A48 7).
 4. The update from 0xF8 is logged into a new batched line, 0x11: F84 3, and 0xF8 becomes INV.]
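
The animation steps can be replayed with a short Python sketch. Everything named here is an assumption made for illustration: `evict_sparse_line` is hypothetical, lines are taken to be 16 bytes with four 4-byte words (which reproduces the A48 and F84 addresses in the figure), a batched-updates line holds two entries as drawn, and a zero word is treated as "no update".

```python
LINE_BYTES = 16      # assumed line size (4 words of 4 bytes)
WORD_BYTES = 4       # assumed word size
BATCH_CAPACITY = 2   # entries per batched-updates line, as drawn in the figure


def evict_sparse_line(line_addr, words, batch_line, memory_log):
    """Log the nonzero updates of an evicted line as (address, delta) entries."""
    for i, delta in enumerate(words):
        if delta == 0:
            continue  # zero means no buffered update for this word
        if len(batch_line) == BATCH_CAPACITY:
            # Batched line is full: write it to memory and start a new one.
            memory_log.append(list(batch_line))
            batch_line.clear()
        batch_line.append((line_addr * LINE_BYTES + i * WORD_BYTES, delta))


batch_line = [(0xF00, 4)]  # initial contents of batched line 0x10
memory_log = []            # batched lines that have been evicted to memory
evict_sparse_line(0xA4, [0, 0, 7, 0], batch_line, memory_log)
after_first = list(batch_line)
evict_sparse_line(0xF8, [0, 3, 0, 0], batch_line, memory_log)
```

After the second eviction, `memory_log` holds the full line (F00 4, A48 7) and `batch_line` holds the single new entry F84 3, matching the last frame of the figure.
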
Case 2: Evicted line has many valid updates (slide 15)
- Fetch line from main memory and merge updates
[Figure: memory holds line 0xF0: 1 2 1 7. The cache holds the buffered-updates lines 0xF0: 4 6 3 0, 0xDF: 0 7 9 2, and 0xBC: 5 6 1 8. Legend: INV = invalid line; "0xF0: 0 4 0 0" = buffered-updates line.]
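
For the dense case, a minimal Python sketch of the merge follows. It assumes addition as the reduction and uses the hypothetical name `evict_dense_line`; the merged values are computed by this sketch and are not read off the slide.

```python
def evict_dense_line(line_addr, deltas, memory):
    """Fetch the line from memory, apply every buffered delta, write it back."""
    line = memory[line_addr]  # one full-line fetch
    memory[line_addr] = [m + d for m, d in zip(line, deltas)]


memory = {0xF0: [1, 2, 1, 7]}                 # line contents from the figure
evict_dense_line(0xF0, [4, 6, 3, 0], memory)  # buffered updates for 0xF0
```

Because most words of the line carry an update, a single line fetch and writeback serves several updates at once, which is why batching is not needed here.
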