NUMA-Friendly Stack (using Delegation and Elimination)

SLIDE 1

NUMA-Friendly Stack (using Delegation and Elimination)

Irina Calciu Justin Gottschlich Maurice Herlihy

HotPar ’13

SLIDE 2

Trends for Future Architectures

SLIDE 3

Uniform Memory Access (UMA)

SLIDE 4

Non-Uniform Memory Access (NUMA)

[Diagram: four NUMA nodes, each with multiple cores sharing a last-level cache, connected by an interconnect. Cache coherency is maintained between caches on different NUMA nodes.]

SLIDE 5

Overview

  • Motivation
  • Algorithms
  • Results
  • Conclusions

SLIDE 6

Delegation

[Diagram: clients on NUMA node 0 and NUMA node 1 post operations to a single server thread, which applies them to a sequential stack.]

SLIDE 7

Delegation

[Diagram: clients 1-4 on NUMA node 0 and clients 5-8 on NUMA node 1 each own a slot; the server thread loops through all slots and applies the posted operations to the sequential stack.]
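The slot mechanism above can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the names `Slot`, `server_loop`, and `delegate` are invented, CPython's GIL stands in for the memory fences a C version would need, and clients simply spin while waiting.

```python
import threading

EMPTY, REQUEST, RESPONSE = 0, 1, 2

class Slot:
    """Per-client mailbox: holds one pending request or response at a time."""
    def __init__(self):
        self.state = EMPTY
        self.op = None      # ("push", value) or ("pop", None)
        self.result = None

def server_loop(slots, stack, stop):
    """Server thread: the only thread that ever touches the sequential stack."""
    while not stop.is_set():
        for slot in slots:
            if slot.state == REQUEST:
                op, arg = slot.op
                if op == "push":
                    stack.append(arg)
                    slot.result = None
                else:  # pop
                    slot.result = stack.pop() if stack else None
                slot.state = RESPONSE  # publish the response to the client

def delegate(slot, op, arg=None):
    """Client side: post a message in the slot and spin for the response."""
    slot.op = (op, arg)
    slot.state = REQUEST
    while slot.state != RESPONSE:
        pass  # spin; a real client would back off or try elimination here
    result = slot.result
    slot.state = EMPTY  # hand the slot back for the next request
    return result
```

Because only the server executes stack operations, the stack itself needs no synchronization; all coordination happens through the per-client slots.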

SLIDE 8

Elimination, Rendezvous

SLIDE 9

Local Rendezvous

[Diagram: threads on NUMA node 0 and NUMA node 1 rendezvous with partners on their own node before falling back to the stack.]
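Elimination pairs a concurrent push with a concurrent pop so the two cancel out without touching the stack at all; keeping the rendezvous local to a NUMA node means the exchange stays inside one node's last-level cache. Below is a deliberately simplified sketch with invented names (`EliminationSlot`, `try_push`, `try_pop`); a real elimination layer would also let a parked pusher time out and reclaim its value.

```python
import threading

class EliminationSlot:
    """One rendezvous slot: a waiting push can be matched by a later pop."""
    def __init__(self):
        self.lock = threading.Lock()
        self.value = None
        self.occupied = False

    def try_push(self, value):
        """Park a value in the slot; False means the caller should
        fall back to the real stack (or to delegation)."""
        with self.lock:
            if self.occupied:
                return False
            self.value, self.occupied = value, True
            return True

    def try_pop(self):
        """Take a parked value, eliminating a push/pop pair; None means
        no partner was available and the caller must use the stack."""
        with self.lock:
            if not self.occupied:
                return None
            self.occupied = False
            return self.value
```

With one array of such slots per NUMA node, a thread only rendezvouses with threads on its own node, which is exactly the "local" part of the slide title.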

SLIDE 10

Delegation + Elimination

[Diagram: clients on NUMA node 0 and NUMA node 1, a server thread, and the sequential stack; clients may eliminate with each other before delegating.]

SLIDE 11

Delegation + LOCAL Elimination

[Diagram: clients on NUMA node 0 and NUMA node 1, a server thread, and the sequential stack; elimination is attempted only among clients on the same NUMA node.]

SLIDE 12

Effect of Elimination

[Chart: throughput (higher is better) for 50% push / 50% pop and 90% push / 10% pop workloads.]

SLIDE 13

Effect of Delegation

[Chart: throughput (higher is better) for 50% push / 50% pop and 90% push / 10% pop workloads.]

SLIDE 14

Number of Slots

[Chart: throughput (higher is better) for 50% push / 50% pop and 90% push / 10% pop workloads.]

SLIDE 15

Workloads: Balanced vs. Unbalanced

[Chart: throughput (higher is better) for 50% push / 50% pop and 70% push / 30% pop workloads.]

SLIDE 16

Advantages

  • Memory and cache locality
  • Reduced bus traffic
  • Increased parallelism through elimination

SLIDE 17

Drawbacks

  • Communication cost between clients and the server thread
      • Insignificant compared to the benefits
  • Serializing otherwise parallel data structures
      • Parallelism regained through elimination
  • Elimination opportunities decrease as the workload becomes more unbalanced

SLIDE 18

Open Questions

  • Are there other data structures where we can use delegation and elimination?
  • Are there data structures where direct access is much better?
  • What can we do for those data structures?

SLIDE 19

Thank you! Questions?

SLIDE 20

References

  • Hendler, Shavit, Yerushalmi. A Scalable Lock-free Stack Algorithm.
    http://www.inf.ufsc.br/~dovicchi/pos-ed/pos/artigos/p206-hendler.pdf
  • Hendler, Incze, Shavit, Tzafrir. Flat Combining and the Synchronization-Parallelism Tradeoff.
    http://www.cs.bgu.ac.il/~hendlerd/papers/flat-combining.pdf
  • Afek, Hakimi, Morrison. Fast and Scalable Rendezvousing.
    http://www.cs.tau.ac.il/~afek/rendezvous.pdf

SLIDE 21

Cache to Cache Traffic

[Chart: cache-to-cache traffic per algorithm; lower is better.]

SLIDE 22

Coefficient of Variation

[Chart: coefficient of variation per algorithm; lower is better.]

SLIDE 23

Flat Combining

SLIDE 24

Delegation

CLIENT:
  Find corresponding slot (by NUMA node and cpuid)
  Post message
  Wait for response
  Get response

SERVER:
  Loop through all slots:
    If slot has message:
      Take message
      Process message
      Send response

(Time runs downward.)

SLIDE 25

Delegation

CLIENT:
  Find corresponding slot (by NUMA node and cpuid)
  try_elimination:
    if (eliminate)
      return
    else
      Post message
      Wait for response
      Get response

SERVER:
  Loop through all slots:
    If slot has message:
      Take message
      Process message
      Send response

(Time runs downward.)

SLIDE 26

Delegation

CLIENT:
  Find corresponding slot (by NUMA node and cpuid)
  try_elimination:
    if (eliminate)
      return
  if (Acquire slot lock)
    Post message
    Wait for response
    Get response
    Release slot lock
  else
    try_elimination

SERVER:
  Loop through all slots:
    If slot has message:
      Take message
      Process message
      Send response

(Time runs downward.)
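The client-side protocol on this slide can be written out as a small control-flow skeleton: try elimination first, delegate through the slot if the slot lock is free, and otherwise retry elimination. Everything here is illustrative; `try_elimination`, `post_and_wait`, and the slot lock are hypothetical stand-ins passed in as parameters, not the paper's actual interfaces.

```python
import threading

def client_operation(op, arg, try_elimination, slot_lock, post_and_wait):
    """Client protocol from the backup slides: eliminate if possible,
    otherwise delegate through the per-client slot, retrying elimination
    whenever the slot is busy."""
    while True:
        eliminated, result = try_elimination(op, arg)
        if eliminated:
            return result  # a matching operation absorbed ours
        if slot_lock.acquire(blocking=False):
            try:
                return post_and_wait(op, arg)  # delegation path
            finally:
                slot_lock.release()
        # Slot busy: loop back and attempt elimination again.
```

The non-blocking `acquire` is what lets a client productively retry elimination instead of queueing behind another thread that is already using the slot.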

SLIDE 27

Open Questions

  • Performance
  • Scalability
  • Power
