28th August 2017
HELP YOUR BUSY NEIGHBOURS: DYNAMIC MULTICASTS OVER STATIC TOPOLOGIES
Robert Kuban, Randolf Rotta, Jörg Nolte
Distributed Systems / Operating Systems

OUR TARGET SCENARIO
objective: scalable multicasts
  + acknowledgement of completion
  + dynamic group membership (join/leave)
applications: cache invalidation, esp. TLB shootdown
hardware: many-cores like Intel Xeon Phi, Tilera TilePro, ...
  + cache-coherent shared memory
  + point-to-point message passing
1 · Motivation

EXAMPLE: LINUX TLB SHOOTDOWN
Linux 4.11 x86 smp_call_function_many()
Initiator (Sender):
  1. update page tables
  2. enqueue invalidation tasklet at each thread
  3. send IPI to each thread
  4. wait on flag in each tasklet
Other CPU Threads (IPI handler processes tasklet):
  1. invalidate page(s) in TLB
  2. set ACK flag in tasklet
⇒ flat topology
  (+) fast join/leave via bit-mask
  (-) O(n) latency
[Figure: sender S sends to receivers R0, R1, R2, ..., Rn; each receiver returns an ack]

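A minimal sketch of the flat scheme above, in Python rather than kernel code. It is not the Linux implementation: Python threads stand in for CPU threads, starting a thread stands in for sending an IPI, and the names flat_shootdown and handler are hypothetical. It only illustrates that the initiator touches every member in turn, which is where the O(n) latency comes from.

```python
import threading

def flat_shootdown(member_ids):
    """Simulate one acknowledged invalidation round over a flat topology."""
    acks = {t: threading.Event() for t in member_ids}  # one ACK flag per member
    invalidated = []
    lock = threading.Lock()

    def handler(t):
        # Stand-in for the IPI handler: "invalidate", then set the ACK flag.
        with lock:
            invalidated.append(t)
        acks[t].set()

    workers = [threading.Thread(target=handler, args=(t,)) for t in member_ids]
    for w in workers:      # stand-in for sending an IPI to each thread
        w.start()
    for t in member_ids:   # the initiator waits on every flag: O(n) steps
        acks[t].wait()
    for w in workers:
        w.join()
    return sorted(invalidated)
```

Calling flat_shootdown([3, 1, 2]) returns once all three simulated threads have set their flags.
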
EXAMPLE: MULTICASTS IN BARRELFISH
propagate along a tree topology
use constraint solver for optimized topology
proposed for TLB shootdowns [1]
  (-) expensive join/leave or interrupt ex-members
  (+) O(log n) latency
[Figure: tree below a root over receivers R0 to R7; send goes down the tree, ack comes back up]
[1] Baumann et al., The multikernel: A new OS architecture for scalable multicore systems, 2009

DESIGN SPACE
Flat, Multicasts (just members):
  (+) low latency for small groups
  (-) high latency for large groups
  (+) fast join/leave
Flat, Broadcasts (over all threads):
  (-) always high latency
  (-) interrupts non-members
Tree, Multicasts (just members):
  (+) good latency for large groups
  (-) bad latency for small groups
  (-) costly join/leave
Tree, Broadcasts (over all threads):
  (+) always low latency
  (-) interrupts non-members

MULTICASTS ON A STATIC TOPOLOGY
Problem Statement: Combine...
  fast join/leave like with flat topology
  low latency like in tree topologies (parallel propagation)
Solution Idea:
  use static tree topology like in broadcasts (can be hand-crafted for the processor)
  membership as bit-mask for fast join/leave
  exploit shared memory to skip non-members, just message passing to actual members
2 · Multicasts on a Static Topology

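A minimal sketch of membership as a bit-mask, assuming one bit per core. The class name Group and its methods are hypothetical; on real hardware the updates would be atomic operations on a word in shared memory, which plain Python integers do not model.

```python
class Group:
    """Group membership stored as one integer bit-mask (bit i = core i)."""

    def __init__(self):
        self.mask = 0

    def join(self, core):
        # Joining sets a single bit; no topology has to be rebuilt.
        self.mask |= 1 << core

    def leave(self, core):
        # Leaving clears a single bit.
        self.mask &= ~(1 << core)

    def is_member(self, core):
        # Any core can test membership by reading the shared mask.
        return bool((self.mask >> core) & 1)
```

Join and leave are constant-time bit operations, which is the property the flat topology already has and the static tree is meant to keep.
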
TREES WITH ACKNOWLEDGEMENT
Nodes = Cores; two roles at each node
[Figure: tree below a root with "send" and "ack" edges]

TREES WITH ACKNOWLEDGEMENT
Logical nodes for larger design space & simpler code
[Figure: scatter nodes (send) and gather nodes (ack), each shown as a tree with its own root]

NON-MEMBER NODES IN BROADCASTS
[Figure: static topology over nodes 0 to 9; every edge is labelled "send"]

SOLUTION: HELPING
Skip non-member scatter nodes
[Figure: static topology over nodes 0 to 9; edges are labelled "help" or "send"]

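A minimal sketch of helping, assuming the static tree is given as a child list per node and membership as a bit-mask read from shared memory. The function name direct_sends and the seven-node example tree are hypothetical. When a child is a non-member, the current node does that child's forwarding work itself instead of sending it a message.

```python
def direct_sends(children, members, node):
    """Return (targets, helped): members that `node` messages directly,
    and non-members whose forwarding work `node` took over."""
    targets, helped = [], []
    stack = list(reversed(children.get(node, [])))
    while stack:
        child = stack.pop()
        if (members >> child) & 1:
            targets.append(child)   # member: send a real message
        else:
            helped.append(child)    # non-member: help, descend on its behalf
            stack.extend(reversed(children.get(child, [])))
    return targets, helped

# Hypothetical binary tree over cores 0..6 and a group {0, 2, 3, 4}.
CHILDREN = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
MEMBERS = 0b0011101
```

With these inputs, direct_sends(CHILDREN, MEMBERS, 0) messages cores 3, 4 and 2 and helps core 1. If core 0 is the only member, the same call sends nothing but still visits all six other nodes, which is the small-group overhead.
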
HUGE OVERHEAD FOR SMALL GROUPS :(
[Figure: static topology over nodes 0 to 9; every edge is labelled "help"]

SOLUTION: SKIPPING
Jump over whole subtrees
[Figure: static topology over nodes 0 to 9; edges are labelled "help" or "skip"]

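A minimal sketch of skipping. The slides do not say how an empty subtree is detected; this sketch assumes a precomputed bit-mask per node that covers its whole subtree, so one AND against the membership mask decides whether to jump over it. All names and the seven-node example tree are hypothetical.

```python
def subtree_masks(children, root):
    """Precompute, for every node, the bit-mask of all nodes in its subtree."""
    masks = {}

    def visit(n):
        m = 1 << n
        for c in children.get(n, []):
            m |= visit(c)
        masks[n] = m
        return m

    visit(root)
    return masks

def sends_with_skipping(children, masks, members, node):
    """Return (targets, visited): direct message targets of `node` and the
    number of nodes actually inspected after skipping empty subtrees."""
    targets, visited = [], 0
    stack = list(reversed(children.get(node, [])))
    while stack:
        child = stack.pop()
        if (members & masks[child]) == 0:
            continue                # skip: no member anywhere below
        visited += 1
        if (members >> child) & 1:
            targets.append(child)   # member: send a real message
        else:
            stack.extend(reversed(children.get(child, [])))  # help
    return targets, visited

# Hypothetical binary tree over cores 0..6.
CHILDREN = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
MASKS = subtree_masks(CHILDREN, 0)
```

For the group {0, 6}, core 0 skips the subtree below core 1 with a single test and inspects only cores 2 and 6.
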
EVALUATION SETUP
[Figures: Flat Topology; Binary Tree]
Setup:
  Intel Xeon Phi Knights Corner (1.053 GHz)
  60 cores
  message passing via shared memory polling
3 · Evaluation

FLAT TOPOLOGY
multicast similar to Linux TLB shootdown
[Plot: median latency [k cycles] (0 to 80) over group size (0 to 60); series: broadcast, multicast]

FLAT TOPOLOGY WITH HELPING
Overhead from membership tests and graph traversal
[Plot: median latency [k cycles] (0 to 80) over group size (0 to 60); series: broadcast, broadcast with helping, multicast]

BINARY TREE WITH HELPING, SKIPPING
[Plot: median latency [k cycles] (0 to 80) over group size (0 to 60); series: broadcast with helping, broadcast with skipping]

CONCLUSION
Scalable, acknowledged, dynamic multicasts for many-cores:
  Challenges: generating good topologies is costly, flat topology not scalable, non-members should not be interrupted
  Solution: static optimized broadcast topology, help and skip non-member cores
  Result: success for large groups, alright for small
  Implications: improve Linux TLB shootdown for many-core HPC apps

ACKNOWLEDGE VIA SHARED MEMORY
Decrement shared variable instead of message passing
[Figure: two panels over nodes 0, 1, 2. "Only message passing" shows ack messages; "Using shared memory" shows dec operations on a shared variable]

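A minimal sketch of acknowledging by decrementing a shared variable. The class name AckCounter is hypothetical, and the lock stands in for an atomic fetch-and-decrement on shared memory, which Python does not expose directly.

```python
import threading

class AckCounter:
    """Shared countdown: each receiver decrements once instead of sending an ack."""

    def __init__(self, expected):
        self._pending = expected
        self._lock = threading.Lock()   # stand-in for an atomic decrement
        self.done = threading.Event()   # set once every receiver has decremented

    def dec(self):
        """Decrement; return True only for the receiver that reaches zero."""
        with self._lock:
            self._pending -= 1
            last = self._pending == 0
        if last:
            self.done.set()
        return last
```

Only the receiver that brings the counter to zero has to notify anyone; all others finish with a single memory operation.
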
HELPING WITH SHARED MEM ACK
→ tree combining [1] for gather nodes
[Figure: static topology over nodes 0 to 9; edges are labelled "send", "help", "dec" or "dec+ack"]
[1] Yew et al., Distributing Hot-Spot Addressing in Large-Scale Multiprocessors, 1987

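A minimal sketch of tree combining over gather nodes, run sequentially for clarity. The slides only name the technique, so the details are assumptions: each gather node holds a counter of decrements it still expects, and the node that brings a counter to zero decrements the parent's counter in turn. The function name combine_acks and the example tree are hypothetical.

```python
def combine_acks(parent, pending, acks):
    """Apply decrements in the given order.

    parent  : dict mapping each gather node to its parent (root has none)
    pending : dict mapping each gather node to its expected decrement count
    acks    : sequence of gather nodes that receive one decrement each
    Returns the gather nodes in the order their counters reached zero.
    """
    completed = []
    for node in acks:
        while node is not None:
            pending[node] -= 1
            if pending[node] > 0:
                break                  # others are still outstanding here
            completed.append(node)     # counter hit zero: combine upwards
            node = parent.get(node)
    return completed

# Hypothetical tree: gather nodes 1 and 2 below root 0.
PARENT = {1: 0, 2: 0}
```

With pending counts {0: 2, 1: 2, 2: 1} and decrements arriving at nodes 1, 2, 1, the counters complete in the order 2, 1, 0, so the root sees only two decrements instead of three.
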