

  1. NUMA-Friendly Stack (using Delegation and Elimination) Irina Calciu, Justin Gottschlich, Maurice Herlihy, HotPar '13

  2. Trends for Future Architectures

  3. Uniform Memory Access (UMA)

  4. Non-Uniform Memory Access (NUMA): several NUMA nodes (each with multiple cores sharing a Last Level Cache), connected by an interconnect. Cache coherency is maintained between caches on different NUMA nodes.

  5. Overview
  • Motivation
  • Algorithms
  • Results
  • Conclusions

  6. Delegation [diagram: clients on NUMA node 0 and NUMA node 1, a server thread, and a sequential stack (SEQ STACK)]

  7. Delegation: each client (clients 1–8, four per NUMA node) has its own slot; the server loops through all slots and applies pending requests to the sequential stack (SEQ STACK).
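The slot scheme above can be sketched in Python. This is a minimal single-node sketch under stated assumptions: the names (`Slot`, `DelegationStack`, `_delegate`) are illustrative, not from the paper, and CPython's GIL stands in for the memory fences a C implementation would need. Each client owns one slot; the server thread loops over all slots and applies pending requests to a plain sequential stack.

```python
import threading

class Slot:
    """One mailbox per client: written by the client, drained by the server."""
    def __init__(self):
        self.request = None               # ("push", value) or ("pop", None)
        self.response = None
        self.done = threading.Event()     # server -> client: response is ready

class DelegationStack:
    """Clients never touch the stack; a single server thread loops over all
    slots and applies pending requests to a sequential list."""
    def __init__(self, num_clients):
        self.slots = [Slot() for _ in range(num_clients)]
        self.items = []                   # the sequential stack (server-only)
        self.running = True
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        while self.running:
            for slot in self.slots:       # server loops through all slots
                req = slot.request
                if req is None:
                    continue
                op, value = req           # take message
                if op == "push":          # process message
                    self.items.append(value)
                    slot.response = True
                else:                     # "pop"
                    slot.response = self.items.pop() if self.items else None
                slot.request = None
                slot.done.set()           # send response

    def _delegate(self, client_id, op, value=None):
        slot = self.slots[client_id]      # find corresponding slot
        slot.done.clear()
        slot.request = (op, value)        # post message
        slot.done.wait()                  # wait for response
        return slot.response              # get response

    def push(self, client_id, value):
        return self._delegate(client_id, "push", value)

    def pop(self, client_id):
        return self._delegate(client_id, "pop")
```

Because every operation is applied by the one server thread, the stack itself needs no synchronization; ordering between clients falls out of the server's slot-scanning loop.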

  8. Elimination, Rendezvous

  9. Local Rendezvous [diagram: one rendezvous area per NUMA node (node 0 and node 1), in front of the shared STACK]
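Elimination rests on the observation that a concurrent push and pop cancel out: they can exchange the value directly and both complete without ever touching the stack. A minimal rendezvous sketch follows (all names illustrative; this is not the paper's code). In the NUMA-friendly variant, each node would own its own `Exchanger` so pairings stay within one Last Level Cache:

```python
import threading

class Exchanger:
    """Single-slot rendezvous point: a push leaves its value and waits
    briefly; a pop that arrives in time takes it, and the pair eliminates."""
    def __init__(self):
        self.lock = threading.Lock()
        self.offer = None          # (token, value) left by a waiting push
        self.taken = None          # Event for the current offer

    def try_eliminate_push(self, value, timeout=0.05):
        token = object()           # identifies this particular offer
        with self.lock:
            if self.offer is not None:
                return False       # another push is already waiting here
            self.offer = (token, value)
            self.taken = threading.Event()
            taken = self.taken
        if taken.wait(timeout):
            return True            # a pop took our value: both ops are done
        with self.lock:
            if self.offer is not None and self.offer[0] is token:
                self.offer = None  # withdraw; caller falls back to the stack
                return False
        return True                # a pop slipped in right at the deadline

    def try_eliminate_pop(self):
        with self.lock:
            if self.offer is None:
                return False, None # no partner; caller falls back to the stack
            _, value = self.offer
            self.offer = None
            self.taken.set()       # wake the matching push
            return True, value
```

An unmatched operation gives up after a short wait and falls back to the stack (or, combined with delegation, to its slot), so elimination only ever removes work from the serialized path.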

  10. Delegation + Elimination [diagram: clients on NUMA node 0 and NUMA node 1, server, SEQ STACK]

  11. Delegation + LOCAL Elimination [diagram: clients on NUMA node 0 and NUMA node 1, server, SEQ STACK]

  12. Effect of Elimination [throughput charts, higher is better; workloads: 90% push / 10% pop and 50% push / 50% pop]

  13. Effect of Delegation [throughput charts, higher is better; workloads: 90% push / 10% pop and 50% push / 50% pop]

  14. Number of Slots [throughput charts, higher is better; workloads: 90% push / 10% pop and 50% push / 50% pop]

  15. Workloads: Balanced vs. Unbalanced [throughput charts, higher is better; workloads: 70% push / 30% pop and 50% push / 50% pop]

  16. Advantages
  • Memory and cache locality
  • Reduced bus traffic
  • Increased parallelism through elimination

  17. Drawbacks
  • Communication cost between clients and the server thread
    o Insignificant compared to the benefits
  • Serializing otherwise parallel data structures
    o Parallelism is recovered through elimination
  • Elimination opportunities decrease as the workload becomes more unbalanced

  18. Open Questions
  • Are there other data structures where we can use delegation and elimination?
  • Are there data structures where direct access is much better?
  • What can we do for those data structures?

  19. Thank you! Questions?

  20. References
  • A Scalable Lock-free Stack Algorithm http://www.inf.ufsc.br/~dovicchi/pos-ed/pos/artigos/p206-hendler.pdf
  • Flat Combining and the Synchronization-Parallelism Tradeoff http://www.cs.bgu.ac.il/~hendlerd/papers/flat-combining.pdf
  • Fast and Scalable Rendezvousing http://www.cs.tau.ac.il/~afek/rendezvous.pdf

  21. Cache to Cache Traffic [chart; "Better" marks the direction of improvement]

  22. Coefficient of Variation [chart; "Better" marks the direction of improvement]

  23. Flat Combining

  24. Delegation
  SERVER:
    Loop through all slots:
      If slot has message:
        Take message
        Process message
        Send response
  CLIENT:
    Find corresponding slot (by NUMA node and cpuid)
    Post message
    Wait for response
    Get response
  (time flows downward)

  25. Delegation (client adds elimination)
  SERVER: as on slide 24
  CLIENT:
    Find corresponding slot (by NUMA node and cpuid)
    try_elimination: if (eliminate) return
    Post message
    Wait for response (else try_elimination)
    Get response

  26. Delegation (client adds slot locking)
  SERVER: as on slide 24
  CLIENT:
    Find corresponding slot (by NUMA node and cpuid)
    try_elimination: if (eliminate) return
    if (Acquire slot lock):
      Post message
      Wait for response
      Get response
      Release slot lock
    else try_elimination
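Slides 24–26 compose into a single client path: try elimination first; if no partner shows up, try to acquire the slot lock and delegate; if the slot is busy, go back to elimination. A hedged sketch of just that control flow, with the elimination and delegation steps injected as callables (`try_eliminate`, `delegate`, and the non-blocking lock acquire are illustrative assumptions, not the paper's API):

```python
import threading

def client_op(op, value, slot_lock, try_eliminate, delegate):
    """Client path from slides 24-26: elimination first, then delegation
    via the slot, retrying elimination whenever the slot is busy."""
    while True:
        eliminated, result = try_eliminate(op, value)   # "try_elimination"
        if eliminated:
            return result                               # "if (eliminate) return"
        if slot_lock.acquire(blocking=False):           # "if (Acquire slot lock)"
            try:
                # post message, wait for response, get response
                return delegate(op, value)
            finally:
                slot_lock.release()                     # "Release slot lock"
        # slot busy: loop and retry elimination         # "else try_elimination"
```

Retrying elimination while the slot is contended is what lets several clients share one slot without simply queueing behind its lock.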

  27. Open Questions
  • Performance
  • Scalability
  • Power
