PHI: Architectural Support for Synchronization- and Bandwidth-Efficient Commutative Scatter Updates
Anurag Mukkara, Nathan Beckmann, Daniel Sanchez (MICRO 2019)

Scatter updates are common but inefficient


Scatter updates are inefficient on conventional hierarchies
- Poor temporal and spatial locality when inputs do not fit in cache
  - Wasteful data transfers from main memory
- Multiple threads update the same vertex
  - Cache-line ping-ponging
[Figure: Memory requests per edge (updates, destination vertex data, source vertex data, CSR) for push-style PageRank on the uk-2005 graph, comparing Push and UB. With Push, 93% of the traffic is due to scatter updates, 10x more traffic than compulsory.]
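To make the access pattern concrete, the following is a minimal C++ sketch of the push-style scatter loop over a CSR graph; the identifiers (CSRGraph, row_offsets, col_ids, contrib, dst) are illustrative, not taken from the paper. Each edge performs a read-modify-write to an input-dependent destination vertex, which is what causes the poor locality and, with multiple threads, the cache-line ping-ponging described above.

    #include <cstdint>
    #include <vector>

    // Minimal CSR representation of a directed graph (illustrative names).
    struct CSRGraph {
        std::vector<uint32_t> row_offsets;  // size |V| + 1
        std::vector<uint32_t> col_ids;      // size |E|, destination vertex ids
    };

    // Push-style scatter phase: every edge scatters a contribution to its
    // destination. dst[v] is touched in an input-dependent, irregular order,
    // so with large graphs most of these writes miss in the cache.
    void push_scatter(const CSRGraph& g,
                      const std::vector<double>& contrib,  // per-source contribution
                      std::vector<double>& dst) {          // per-destination accumulator
        const uint32_t num_vertices = g.row_offsets.size() - 1;
        for (uint32_t src = 0; src < num_vertices; ++src) {
            for (uint32_t e = g.row_offsets[src]; e < g.row_offsets[src + 1]; ++e) {
                dst[g.col_ids[e]] += contrib[src];  // commutative update; needs atomics if parallelized
            }
        }
    }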

Prior hardware support for scatter updates
- Remote memory operations (RMOs) send and perform update operations at a fixed location (e.g., shared cache banks)
  - Avoids cache-line ping-ponging
- COUP [MICRO'15] modifies the coherence protocol to perform commutative operations in a distributed fashion
- Neither RMOs nor COUP improves locality
  - Both are bottlenecked by memory traffic with large inputs

PHI builds on Update Batching (UB)
(Propagation Blocking [IPDPS'17], MILK [PACT'16])
- Maximizes the spatial locality of memory transfers using two-phase execution
- Binning phase: logs updates to main memory, dividing them into cache-fitting slices (bins) of destination vertices
- Accumulation phase: reads and applies the logged updates bin by bin (see the sketch below)
[Figure: Source vertices A-D send updates to destination vertex ids (e.g., A to 0, 5, 11; B to 0, 7, 9; C to 4, 6, 8; D to 3, 8, 12). The binning phase appends (destination id, value) pairs to per-slice bins in main memory, e.g., Bin 0 holds (0,A) (5,A) (0,B) ... (3,D) and Bin 1 holds (12,D) (11,A) (9,B) (7,B); each bin covers a cache-fitting slice of the destination vertices.]
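The following is a minimal C++ sketch of the two-phase execution described above; SLICE_BITS, binning_phase, and accumulation_phase are illustrative assumptions, not the interface of Propagation Blocking or MILK. The binning phase turns the random scatter writes into sequential appends, and the accumulation phase then replays one cache-fitting slice of destinations at a time.

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Each bin covers 2^SLICE_BITS destination vertices; in practice the slice is
    // sized so its portion of the destination array fits in the last-level cache.
    constexpr uint32_t SLICE_BITS = 20;  // assumed value for illustration

    using Update = std::pair<uint32_t, double>;  // (destination id, value)

    // 1. Binning phase: instead of updating dst[] directly, append each update to
    //    the bin that owns its destination. All writes are sequential appends with
    //    perfect spatial locality. bins must be pre-sized to cover all slices.
    void binning_phase(const std::vector<Update>& edge_updates,
                       std::vector<std::vector<Update>>& bins) {
        for (const Update& u : edge_updates) {
            bins[u.first >> SLICE_BITS].push_back(u);
        }
    }

    // 2. Accumulation phase: apply the logged updates bin by bin. All destinations
    //    touched by one bin fall within a single cache-fitting slice, so the
    //    random accesses now hit in the cache.
    void accumulation_phase(const std::vector<std::vector<Update>>& bins,
                            std::vector<double>& dst) {
        for (const std::vector<Update>& bin : bins) {
            for (const Update& u : bin) {
                dst[u.first] += u.second;  // commutative reduction
            }
        }
    }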

Update Batching tradeoffs
- Perfect spatial locality for all main memory transfers
  - Compulsory memory traffic for all data structures
- Binning phase ignores temporal locality
  - Generates a large stream of updates even with structured inputs
[Figure: Memory requests per edge (updates, destination vertex data, source vertex data, CSR) for push PageRank on the uk-2005 graph, comparing Push, UB, and PHI with an unstructured input and with a structured input.]

Agenda
- Background
- PHI Design
- Evaluation

Key techniques of PHI
- In-cache update buffering and coalescing
  - Exploits temporal locality
- Selective update batching
  - Achieves high spatial locality
  (together, these two techniques make PHI bandwidth-efficient)
- Hierarchical buffering and coalescing
  - Enables update parallelism
  - Eliminates synchronization overheads
  (this makes PHI synchronization-efficient)

In-cache buffering and coalescing
- Buffer updates in the cache without ever accessing main memory
- Treat the cache as a large coalescing buffer for updates
- A reduction ALU in the cache bank performs the coalescing
Example: memory holds the value 10 at address 0xF00. The core issues UPDATE 0xF00, +4; the cache allocates a buffered-updates line for 0xF00 holding just the delta 4, without fetching the line from memory. A later UPDATE 0xF00, +2 coalesces into the same line, which now holds 6, while memory still holds 10.
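The following sketch shows how the earlier scatter loop could be written against a commutative-update operation in the spirit of PHI's UPDATE. phi_update_add is a hypothetical stand-in modeled in plain C++ so the example compiles; on PHI hardware the update would instead be forwarded to the cache bank and coalesced there by the reduction ALU, so the core never reads or owns the destination line.

    #include <cstdint>
    #include <vector>

    // Hypothetical stand-in for an UPDATE instruction: send a commutative "+delta"
    // for the given address to the cache hierarchy. Modeled as a plain store-add
    // here purely so the sketch is runnable.
    inline void phi_update_add(double* addr, double delta) {
        *addr += delta;  // conceptually: UPDATE addr, +delta
    }

    // Scatter phase using the update interface: there is no read-modify-write at
    // the core, so threads updating the same vertex no longer ping-pong the line.
    void push_scatter_updates(const std::vector<uint32_t>& row_offsets,
                              const std::vector<uint32_t>& col_ids,
                              const std::vector<double>& contrib,
                              std::vector<double>& dst) {
        const uint32_t num_vertices = row_offsets.size() - 1;
        for (uint32_t src = 0; src < num_vertices; ++src) {
            for (uint32_t e = row_offsets[src]; e < row_offsets[src + 1]; ++e) {
                phi_update_add(&dst[col_ids[e]], contrib[src]);
            }
        }
    }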

Handling cache evictions
- PHI adapts to the amount of spatial locality in the evicted line
- The cache controller performs update batching selectively
  - Achieves good spatial locality in all cases
- Key insight: update batching is a good tradeoff only when the evicted line has poor spatial locality

Case 1: Evicted line has few updates
- Log the updates to temporary buffers (stored in the cache)
- These buffers are later evicted to memory when full
[Figure: Evicting a buffered-updates line with a single valid update (e.g., line 0xA4 holding the value 7) appends an (address, value) pair such as (A48, 7) to an in-cache log line at 0x10, next to an earlier entry (F00, 4). The log line at 0x10 is later evicted to main memory, and a subsequent eviction of line 0xF8 (holding 3) starts a new log line at 0x11 with (F84, 3).]

Case 2: Evicted line has many valid updates
- Fetch the line from main memory and merge the buffered updates into it
[Figure: On eviction, the buffered-updates line for 0xF0 is combined with the memory copy of 0xF0 (values 1, 2, 1, 7) using the reduction ALU, leaving the merged values in the cache line.]
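The following is a behavioral sketch, in plain C++, of the eviction handling described in the last three slides: if the evicted buffered-updates line has few valid updates, its (address, value) pairs are logged to an in-cache bin (Case 1); otherwise the line is fetched from memory and the updates are merged in place (Case 2). The threshold, data layout, and function names are assumptions for illustration, not the paper's exact mechanism.

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    constexpr size_t WORDS_PER_LINE = 8;      // assumed line geometry
    constexpr size_t BATCHING_THRESHOLD = 2;  // assumed cutoff for "few" updates

    // A buffered-updates line: per-word deltas plus a valid bit per word.
    struct BufferedLine {
        uint64_t base_addr;
        double delta[WORDS_PER_LINE];
        bool valid[WORDS_PER_LINE];
    };

    // Case 1: poor spatial locality. Log (address, delta) pairs to an in-cache bin,
    // which is itself written to memory sequentially once it fills.
    void log_to_bin(const BufferedLine& line,
                    std::vector<std::pair<uint64_t, double>>& bin) {
        for (size_t w = 0; w < WORDS_PER_LINE; ++w) {
            if (line.valid[w]) {
                bin.emplace_back(line.base_addr + w * sizeof(double), line.delta[w]);
            }
        }
    }

    // Case 2: good spatial locality. Fetch the memory copy and merge the deltas.
    void fetch_and_merge(const BufferedLine& line, double memory_copy[WORDS_PER_LINE]) {
        for (size_t w = 0; w < WORDS_PER_LINE; ++w) {
            if (line.valid[w]) {
                memory_copy[w] += line.delta[w];  // reduction ALU applies the commutative op
            }
        }
    }

    // Selective update batching on eviction: batch only when the evicted line has
    // poor spatial locality (few valid updates); otherwise merge with memory.
    void on_eviction(const BufferedLine& line,
                     std::vector<std::pair<uint64_t, double>>& bin,
                     double memory_copy[WORDS_PER_LINE]) {
        size_t valid_updates = 0;
        for (size_t w = 0; w < WORDS_PER_LINE; ++w) {
            valid_updates += line.valid[w] ? 1 : 0;
        }
        if (valid_updates <= BATCHING_THRESHOLD) {
            log_to_bin(line, bin);               // Case 1
        } else {
            fetch_and_merge(line, memory_copy);  // Case 2
        }
    }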
