
Nexus: A New Approach to Replication in Distributed Shared Caches

Po-An Tsai, Nathan Beckmann, and Daniel Sanchez

Executive summary: Data replication reduces the access latency of non-uniform caches (NUCA). But


  5. Nexus allows replication even when read-only data cannot fit in the local bank. A significant latency reduction over prior work! [Figure: access latency (lower is better) for four cases: data fits in the local bank, each thread owns 1 replica; 1 replica shared by every 4 neighbors; 1 replica shared by every 16 neighbors; 1 replica shared by all threads → same as S-NUCA.]

  9. Recent directory-less dynamic NUCAs enable replication beyond the local bank. Data placement is controlled using the virtual memory system and does not require a global directory. Data can be dynamically mapped to nearby banks and shared by arbitrary cores. [Figure: threads, a core with its TLB, and LLC data X, Y, Z.]

  13. The number of replicas (replication degree) is important. Replicating 4 times works best (4 × 4 MB read-only = 16 MB). Choosing how much to replicate is more important than choosing which lines to replicate. [Figure: threads, read-only data (4 MB), 16 MB LLC capacity.]

  17. The number of replicas (replication degree) is important. Replicating 8 times works best (8 × 1 MB read-only + 8 MB other = 16 MB). Too few replicas cause extra network traversals, while too many cause unnecessary cache misses. [Figure: threads, read-only data (1 MB), other data (8 MB), 16 MB LLC capacity.]
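
The capacity arithmetic in the two examples above can be sketched as follows. This is only an illustration, assuming that the best degree is the largest number of read-only copies that still fits next to the other data, which is what both examples show; the function name `max_replication_degree` and its parameters are hypothetical and not taken from the talk.

```python
def max_replication_degree(llc_mb, read_only_mb, other_mb=0.0):
    """Largest number of read-only replicas that fits in the LLC.

    Assumption: capacity left after the other data is split evenly among
    identical replicas of the read-only data.
    """
    if read_only_mb <= 0:
        raise ValueError("read_only_mb must be positive")
    free_mb = llc_mb - other_mb
    # Always keep at least one copy, even when capacity is tight.
    return max(1, int(free_mb // read_only_mb))


# The two examples from the slides.
first = max_replication_degree(16, 4)       # 4 x 4 MB = 16 MB
second = max_replication_degree(16, 1, 8)   # 8 x 1 MB + 8 MB = 16 MB
```

Going below this value leaves capacity unused and adds network traversals; going above it evicts useful data and adds misses.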

  20. No adaptive replication in directory-less D-NUCAs. Reactive-NUCA (R-NUCA) [Hardavellas, ISCA 2009] statically replicates instructions every 4 cores. Other directory-less D-NUCAs do not replicate data. [Figure: threads, instructions (read-only), other data.]

  24. Workloads have different preferences for replication degrees
      - Study read-only data intensive workloads running on a 144-core system
      - Apply different replication degrees for all read-only data
      - Observation 1: Applications prefer different degrees, requiring an adaptive approach.
      - Observation 2: A few replication degrees suffice.

  29. Nexus: enabling adaptive replication degrees in NUCA
      - Builds on top of directory-less D-NUCAs
        - Read-only data's on-chip location and coherence are tracked via the virtual memory system
        - Cores access and share closest replicas without directory overheads
      - Nexus-R builds on R-NUCA [Hardavellas, ISCA'09] (focus of this talk)
        - Supports flexible replication degrees for all read-only data
        - Leverages set-sampling to choose the best replication degree
      - Nexus-J builds on Jigsaw [PACT'13, HPCA'15]
        - Extends Jigsaw's configuration algorithm to select the best replication degree
        - Outperforms Nexus-R in multi-program workloads

  39. Nexus-R: Applying Nexus to R-NUCA
      - Nexus uses the virtual memory system to classify pages into three types.
      - Similar to R-NUCA, but differentiates all read-only data (not just instructions)
      - [Figure: timeline of Thread 0 and Thread 1 accessing pages X, Y, and Z. Page classes: Unknown, Thread-Private, Shared Read-only, Shared Read-write. Transitions: first TLB miss → Thread-Private; read TLB miss from other thread → Shared Read-only (Nexus-R replicates this); write TLB miss from other thread → Shared Read-write. Accesses shown: Read X; Read Y, Read Y; Read Z, Write Z.]
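
The page classification walked through above can be sketched as a small state machine. This is a hedged illustration, not the authors' implementation: the names `PageClass`, `Page`, and `on_tlb_miss` are hypothetical, and the rule that a write to a shared read-only page reclassifies it as shared read-write is an assumption not stated on the slides.

```python
from enum import Enum


class PageClass(Enum):
    UNKNOWN = "unknown"
    PRIVATE = "thread-private"
    SHARED_RO = "shared read-only"
    SHARED_RW = "shared read-write"


class Page:
    def __init__(self):
        self.cls = PageClass.UNKNOWN
        self.owner = None  # thread that first touched the page


def on_tlb_miss(page, thread, is_write):
    """Update a page's class on a TLB miss and return the new class."""
    if page.cls is PageClass.UNKNOWN:
        # First TLB miss: the page becomes private to the touching thread.
        page.cls, page.owner = PageClass.PRIVATE, thread
    elif page.cls is PageClass.PRIVATE and thread != page.owner:
        # TLB miss from another thread: a read makes the page shared
        # read-only, a write makes it shared read-write.
        page.cls = PageClass.SHARED_RW if is_write else PageClass.SHARED_RO
    elif page.cls is PageClass.SHARED_RO and is_write:
        # Assumption: a later write demotes a read-only page to read-write.
        page.cls = PageClass.SHARED_RW
    return page.cls


# Replaying the accesses from the figure (thread assignment is assumed).
x, y, z = Page(), Page(), Page()
on_tlb_miss(x, thread=0, is_write=False)   # Read X
on_tlb_miss(y, thread=0, is_write=False)   # Read Y
on_tlb_miss(y, thread=1, is_write=False)   # Read Y from the other thread
on_tlb_miss(z, thread=0, is_write=False)   # Read Z
on_tlb_miss(z, thread=1, is_write=True)    # Write Z from the other thread
```

Only pages that end up shared read-only are candidates for replication in this sketch.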

  48. Nexus-R: Applying Nexus to R-NUCA
      - Supports flexible replication degrees via flexible cluster sizes
      - R-NUCA always uses a cluster size of 4; Nexus-R supports reconfigurable sizes
      - Private data: always local
      - Shared read-write data: always like S-NUCA
      - Shared read-only data: replicated clusters
      - Replication degree of 9 on 36 cores → clusters of size 4 (36 divided by 9)
      - Replication degree of 4 → clusters of size 9
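
The relationship between replication degree and cluster size can be sketched as below. The `cluster_size` arithmetic follows the slide (cores divided by degree). The `cluster_of` helper is a hypothetical illustration that assumes a square mesh tiled by square clusters with row-major core numbering; the actual R-NUCA and Nexus-R bank mapping is not described here and may differ.

```python
import math


def cluster_size(num_cores, degree):
    """Cores per cluster when read-only data is replicated `degree` times."""
    if degree <= 0 or num_cores % degree != 0:
        raise ValueError("degree must evenly divide the number of cores")
    return num_cores // degree


def cluster_of(core, num_cores, degree):
    """Cluster index of a core (assumes square mesh and square clusters)."""
    side = math.isqrt(num_cores)
    size = cluster_size(num_cores, degree)
    cside = math.isqrt(size)
    if side * side != num_cores or cside * cside != size or side % cside:
        raise ValueError("this sketch only handles square tilings")
    row, col = divmod(core, side)
    clusters_per_row = side // cside
    return (row // cside) * clusters_per_row + (col // cside)


# The two configurations from the slide, on 36 cores.
size_for_degree_9 = cluster_size(36, 9)   # clusters of 4 cores
size_for_degree_4 = cluster_size(36, 4)   # clusters of 9 cores
```

Each cluster holds one replica, so every core finds a copy within its own cluster.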

  57. Nexus-R leverages set-sampling to select the best degree
      - Enhances set-sampling to monitor the performance of different degrees
      - Compares the cumulative access latency of each degree from sampled sets
      - Step 1: L1 miss; address goes to the bank/set lookup logic
      - Step 2: Sampled access for degree of 4
      - Step 3: Sampled access returns latency X
      - Step 4: Update counters (+X, -X, -X)
      - Counters record the latency difference between degrees: 1/4, 1/9, 1/36, 4/9, 4/36, 9/36
      - [Figure: core, L1s, MSHR, bank/set lookup logic, and the six counters.]
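
The counter update in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the hardware design: the class `DegreeSampler` is hypothetical, a counter for the pair a/b is assumed to accumulate latency(b) minus latency(a), and the final choice (the degree that wins the most pairwise comparisons) is an assumed selection rule. It also assumes each degree receives a comparable number of sampled accesses.

```python
from itertools import combinations

DEGREES = (1, 4, 9, 36)  # degrees named on the slide


class DegreeSampler:
    def __init__(self, degrees=DEGREES):
        self.degrees = tuple(sorted(degrees))
        # One counter per pair: 1/4, 1/9, 1/36, 4/9, 4/36, 9/36.
        self.counters = {pair: 0 for pair in combinations(self.degrees, 2)}

    def record(self, degree, latency):
        """Account for one sampled access that used `degree`."""
        for a, b in self.counters:
            if degree == b:
                self.counters[(a, b)] += latency
            elif degree == a:
                self.counters[(a, b)] -= latency

    def best_degree(self):
        """Degree with the most pairwise wins (lower cumulative latency)."""
        wins = {d: 0 for d in self.degrees}
        for (a, b), diff in self.counters.items():
            # diff > 0 means b accumulated more latency than a.
            wins[a if diff > 0 else b] += 1
        return max(self.degrees, key=lambda d: wins[d])


# A sampled access for degree 4 with latency X = 10 touches three counters.
sampler = DegreeSampler()
sampler.record(4, 10)
touched = [sampler.counters[p] for p in ((1, 4), (4, 9), (4, 36))]
```

With this sign convention the three counters involving degree 4 change by +X, -X, and -X, matching the order shown on the slide.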
