ACCELERATING ASYNCHRONOUS EVENTS FOR HYBRID PARALLEL RUNTIMES Kyle C. Hale and Peter Dinda 1
nautilus.halek.co v3vee.org/palacios v3vee.org 2
SOFTWARE EVENTS an event occurs in some execution context (for example, a thread); another execution context takes action based on the event 3
SOME TYPES OF EVENTS message arrival; work is completed; work is available; something terrible happened 4
AN EXAMPLE: LEGION worker threads (thread 0, thread 1, thread 2 on CPU 0, CPU 1, CPU 2) wait for work in pthread_cond_wait(); when a unit of work arrives, pthread_cond_broadcast() wakes them all and they RACE for it 5
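The broadcast-and-race pattern on the Legion slide can be sketched with plain pthreads. This is a minimal illustration and not Legion's code: the struct work_pool type and the race_for_one_unit() function are hypothetical names, and the sketch assumes exactly one unit of work is published.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Shared state for the demo; all names here are illustrative. */
struct work_pool {
    pthread_mutex_t lock;
    pthread_cond_t  work_ready;
    int posted;      /* set once the unit of work has been published */
    int work_units;  /* units of work not yet claimed */
    int claimed;     /* workers that won the race */
    int lost;        /* workers that woke up and found nothing left */
};

static void *worker(void *arg)
{
    struct work_pool *p = arg;

    pthread_mutex_lock(&p->lock);
    while (!p->posted)                  /* loop guards against spurious wakeups */
        pthread_cond_wait(&p->work_ready, &p->lock);
    if (p->work_units > 0) {            /* first thread to get the lock wins */
        p->work_units--;
        p->claimed++;
    } else {
        p->lost++;
    }
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

/* Starts nworkers waiters, publishes one unit of work, broadcasts, and
 * returns how many workers claimed it; *lost_out receives the number that
 * woke up for nothing. Returns -1 on setup failure. */
int race_for_one_unit(int nworkers, int *lost_out)
{
    struct work_pool p;
    pthread_t *threads;
    int i, started = 0, claimed;

    if (nworkers <= 0)
        return -1;
    threads = malloc(sizeof *threads * (size_t)nworkers);
    if (threads == NULL)
        return -1;

    pthread_mutex_init(&p.lock, NULL);
    pthread_cond_init(&p.work_ready, NULL);
    p.posted = p.work_units = p.claimed = p.lost = 0;

    for (i = 0; i < nworkers; i++) {
        if (pthread_create(&threads[i], NULL, worker, &p) != 0)
            break;
        started++;
    }

    /* Publish one unit of work and wake every waiter. Workers that have
     * not reached pthread_cond_wait() yet see posted == 1 and skip it. */
    pthread_mutex_lock(&p.lock);
    p.work_units = 1;
    p.posted = 1;
    pthread_cond_broadcast(&p.work_ready);
    pthread_mutex_unlock(&p.lock);

    for (i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    assert(p.claimed + p.lost == started);
    claimed = p.claimed;
    if (lost_out != NULL)
        *lost_out = p.lost;

    pthread_cond_destroy(&p.work_ready);
    pthread_mutex_destroy(&p.lock);
    free(threads);
    return started == nworkers ? claimed : -1;
}
```

With four workers, one thread claims the work and the other three are woken only to go back empty-handed, which is the cost the slide labels RACE.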
ASYNCHRONOUS EVENTS the receiving side is not blocked; other things can run 6
WE WANT FAST EVENTS notification latency: the time from the moment of the event trigger to the first instruction of the event handling code; we want to minimize this 7
WHAT’S THE LOWER LIMIT? light! 8
what we want: SoL†; what we actually get with existing software events: SoL†† († speed of light, †† s**t out of luck) 9
OUTLINE software abstractions for asynchronous events; hardware capabilities; event performance; NEMO: benefits of kernel mode; NEMO: closer to the hardware 10
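CONDITION VARIABLES a thread that calls pthread_cond_wait() sleeps on the condition variable queue; pthread_cond_signal() from a running thread moves it to the CPU's ready queue, where it sits through a scheduling delay before it runs 11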
BROADCAST pthread_cond_broadcast() moves each thread waiting on the condition variable queue to the ready queue of a CPU (CPU 0, CPU 1) 12
IMMEDIATELY VISIBLE ISSUES we’re at the mercy of the scheduler; broadcast is linear in the number of waiters; we can’t tell the scheduler to initiate a “fast” wakeup 13
WHAT CAN WE DO IN HARDWARE? inter-processor interrupts (IPIs): an interrupt on vector n indexes entry n of the IDT, which points to the handler code; that is the first instruction executed on the receiving end 15
IPIS ARE FAST CDF of IPI latency in cycles measured from the BSP (core 0), x-axis from 1000 to 2400 cycles, with regions of the curve labeled logical core, physical core, NUMA domain, and socket; 95th percentile = 1728 cycles 16
MEASURING EVENT WAKEUP LATENCY core 0 calls read_start_time() and triggers the event; core i calls read_end_time() at the first instruction of the handler 18
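The measurement can be approximated in user space for the condition variable case. The sketch below is an assumption-laden stand-in for the slide's method: it uses clock_gettime(CLOCK_MONOTONIC) and reports nanoseconds where the slides report cycles, it does not pin threads to cores, and struct probe, now_ns(), and measure_condvar_wakeup_ns() are hypothetical names. The two timestamps play the roles of read_start_time() and read_end_time().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <time.h>

struct probe {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int waiting;       /* waiter has taken the lock and is about to block */
    int triggered;     /* the event has been signaled */
    int64_t start_ns;  /* taken by the signaling thread just before the trigger */
    int64_t end_ns;    /* taken by the waiter right after it wakes up */
};

static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void *waiter(void *arg)
{
    struct probe *p = arg;

    pthread_mutex_lock(&p->lock);
    p->waiting = 1;
    while (!p->triggered)
        pthread_cond_wait(&p->cond, &p->lock);
    p->end_ns = now_ns();            /* stands in for read_end_time() */
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

/* Returns the signal-to-wakeup latency in nanoseconds, or -1 on failure. */
int64_t measure_condvar_wakeup_ns(void)
{
    struct probe p;
    pthread_t t;
    int64_t latency;

    pthread_mutex_init(&p.lock, NULL);
    pthread_cond_init(&p.cond, NULL);
    p.waiting = p.triggered = 0;
    p.start_ns = p.end_ns = 0;

    if (pthread_create(&t, NULL, waiter, &p) != 0) {
        pthread_cond_destroy(&p.cond);
        pthread_mutex_destroy(&p.lock);
        return -1;
    }

    /* The waiter sets `waiting` with the lock held and only releases the
     * lock inside pthread_cond_wait(), so once we observe waiting == 1
     * under the lock, the waiter is blocked on the condition variable. */
    for (;;) {
        pthread_mutex_lock(&p.lock);
        if (p.waiting)
            break;                   /* leave the loop with the lock held */
        pthread_mutex_unlock(&p.lock);
        sched_yield();
    }

    p.start_ns = now_ns();           /* stands in for read_start_time() */
    p.triggered = 1;
    pthread_cond_signal(&p.cond);
    pthread_mutex_unlock(&p.lock);

    pthread_join(t, NULL);
    assert(p.end_ns >= p.start_ns);
    latency = p.end_ns - p.start_ns;

    pthread_cond_destroy(&p.cond);
    pthread_mutex_destroy(&p.lock);
    return latency;
}
```

The end timestamp is taken after pthread_cond_wait() has reacquired the mutex, so the result includes that reacquisition as well as the scheduling delay. A single sample is noisy; a distribution over many runs is more informative.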
MEASURING EVENT BROADCAST LATENCY core 0 calls read_start_time() and triggers the event; each receiving core (i, j, k) records its own end time (read_end_time_i(), read_end_time_j(), read_end_time_k()) at the first instruction of its handler 19
EXISTING SOFTWARE EVENTS ARE SLOW cycles to wakeup: pthread condvar (min = 1145, max = 29955, µ = 25176.5, σ = 3698.93); futex wakeup (min = 81, max = 29996, µ = 24640.5, σ = 3750.51); unicast IPI (min = 1150, max = 17397, µ = 1572.68, σ = 523.279); the software mechanisms are 16x slower than the unicast IPI 20
BROADCASTS ARE ALSO TERRIBLE cycles to wakeup: pthread condvar (min = 17538, max = 2.17277e+06, µ = 995795, σ = 544512); futex wakeup (min = 16402, max = 1.89553e+06, µ = 370630, σ = 199680); broadcast IPI (min = 1252, max = 57467, µ = 12827.3, σ = 2931.32); 29x gap between the futex wakeup and the broadcast IPI 21
SYNCHRONY for broadcasts, we want events to be delivered to all cores at the same time; useful for, e.g., BSP (bulk-synchronous parallel) apps with events; we measure the deviation of wakeup time across cores in a broadcast 22
SYNCHRONY 70x difference between hardware IPIs and software mechanisms; hardly any predictability! 23
NAUTILUS User Mode: (nothing). Kernel Mode: the parallel application and parallel runtime run as a hybrid runtime (HRT) on top of the Aerokernel, which provides Thread, Synch/Events, Topo, Paging, Alloc, Ints, Timers, and Misc components, with full privileged access to the hardware. [Hale, Dinda HPDC ’15] [Hale, Dinda VEE ’16] [Hale, Hetland, Dinda FRIDAY] 26
RETAINING FAMILIAR INTERFACES use a lightweight, kernel-mode framework (like Nautilus) to eliminate overheads; maintain userspace interfaces (e.g. condition variable wait, signal, broadcast, etc.); if we build our kernel from scratch, how fast can we get? 28
NEMO HAS 2 COMPATIBLE CONDITION VARIABLE IMPLEMENTATIONS (1) lightweight condition variables; (2) condition variables that leverage IPI access to “kick” the scheduler 29
NEMO SPEEDS THINGS UP cycles to wakeup: pthread condvar (min = 1145, max = 29955, µ = 25176.5, σ = 3698.93); futex wakeup (min = 81, max = 29996, µ = 24640.5, σ = 3750.51); unicast IPI (min = 1150, max = 17397, µ = 1572.68, σ = 523.279); the two Aerokernel variants, Aerokernel condvar and Aerokernel condvar + IPI, fall in between, one at min = 4195, max = 29990, µ = 9128.78, σ = 3025.12 and the other at min = 4730, max = 6392, µ = 5348.51, σ = 290.006 31
NEMO SPEEDS THINGS UP the same chart, with the two Aerokernel bars labeled “Nemo events” and a 5x improvement marked 32
NEMO BRINGS US CLOSER TO IPI BROADCAST LATENCY cycles to wakeup: pthread condvar (min = 17538, max = 2.17277e+06, µ = 995795, σ = 544512); futex wakeup (min = 16402, max = 1.89553e+06, µ = 370630, σ = 199680); broadcast IPI (min = 1252, max = 57467, µ = 12827.3, σ = 2931.32); the two Aerokernel variants labeled “Nemo events”, Aerokernel condvar and Aerokernel condvar + IPI, measure min = 3258, max = 612959, µ = 265820, σ = 159421 and min = 7842, max = 464015, µ = 132417, σ = 98637.4; a 3x improvement is marked 34
SYNCHRONY we can do 2x better than user-space mechanisms (with compatible interfaces) 35
WHAT IF WE GIVE UP THE FAMILIAR INTERFACE? modify condition variable semantics: we don’t necessarily care which context (thread) receives the event, as long as it’s handled at a particular core; not appropriate for all situations 37
ACTIVE MESSAGES a message arriving at a CPU names a handler in memory, handle_msg(), which runs on arrival; claim: a better fit than, e.g., condition variables for many event-based schemes 38
we want to use IPIs as an active message substrate; problem: IPIs don’t have a payload! 39
allocate several event IDs; when a core receives the interrupt, it looks up the event ID in a table indexed by core ID. Action Lookup Table: one entry per core (core 0, core 1, …, core n-1); nemo_notify_event(core=1, event=3) places event ID 3 in the entry for core 1 40
each event ID corresponds to an “action” (a handler). Action Descriptor Table: indexed by event ID (0, 1, …, m-1); the entry for event ID 3 holds a handler address (0xdeadbeef in the diagram) that points to handle_event() 41
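The two-table scheme can be sketched as a single-threaded user-space simulation. Only nemo_notify_event(core, event) and handle_event() appear on the slides; the table sizes, nemo_init(), nemo_register_event(), nemo_ipi_handler(), and fake_send_ipi() are hypothetical names for illustration, and the IPI is replaced by a direct function call. A real implementation would need atomic updates and a real interrupt, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CORES  4      /* n: assumed core count for the sketch */
#define MAX_EVENTS 8      /* m: assumed number of event IDs */
#define NO_EVENT   (-1)

typedef void (*nemo_action_t)(int core);

/* Action Descriptor Table: event ID -> handler (the "action"). */
nemo_action_t action_descriptor_table[MAX_EVENTS];
int num_events;

/* Action Lookup Table: core ID -> event ID pending for that core. */
int action_lookup_table[NUM_CORES];

void nemo_init(void)
{
    int i;

    num_events = 0;
    for (i = 0; i < MAX_EVENTS; i++)
        action_descriptor_table[i] = NULL;
    for (i = 0; i < NUM_CORES; i++)
        action_lookup_table[i] = NO_EVENT;
}

/* Allocates the next free event ID for a handler; -1 if none is left. */
int nemo_register_event(nemo_action_t action)
{
    if (action == NULL || num_events == MAX_EVENTS)
        return -1;
    action_descriptor_table[num_events] = action;
    return num_events++;
}

/* What the receiving core would run on the interrupt. The IPI carries no
 * payload, so the event ID is recovered from this core's table entry. */
static void nemo_ipi_handler(int core)
{
    int event;

    assert(core >= 0 && core < NUM_CORES);
    event = action_lookup_table[core];
    action_lookup_table[core] = NO_EVENT;
    if (event != NO_EVENT)
        action_descriptor_table[event](core);
}

/* Stand-in for sending a real IPI: runs the handler directly. */
static void fake_send_ipi(int core)
{
    nemo_ipi_handler(core);
}

/* Records the event for the target core, then "interrupts" it.
 * Returns 0 on success, -1 for an invalid core or event ID. */
int nemo_notify_event(int core, int event)
{
    if (core < 0 || core >= NUM_CORES || event < 0 || event >= num_events)
        return -1;
    action_lookup_table[core] = event;
    fake_send_ipi(core);
    return 0;
}

/* Example action: remembers which core handled the event. */
int last_core_handled = -1;

void handle_event(int core)
{
    last_core_handled = core;
}
```

After nemo_init() and one nemo_register_event(handle_event) call, which returns event ID 0, nemo_notify_event(1, 0) runs handle_event() for core 1. The slide's nemo_notify_event(core=1, event=3) would work the same way once four events have been registered.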
NEMO WAKEUPS HAVE CONSTANT OFFSET FROM IPIS CDFs of wakeup latency for unicast IPI and Nemo event notify, in cycles measured from the BSP (core 0); 95th percentile = 1728 cycles for the unicast IPI vs. 1824 cycles for Nemo event notify, an offset of ~100 cycles 42
BROADCAST LATENCY ALSO ON PAR WITH IPIS cycles to wakeup for IPI broadcast and Nemo broadcast are nearly identical: the two bars are annotated min = 1376, max = 29703, µ = 12958, σ = 2819.29 and min = 1252, max = 26838, µ = 12792, σ = 2718.73 43
NEMO ACHIEVES TIGHT SYNCHRONY < 50 cycles variation in broadcast wakeups between cores 44
SUMMARY if you want asynchronous event delivery close to hardware latency, existing mechanisms are pretty terrible. SOME WAYS TO FIX IT: throw out general-purpose OS abstractions (e.g. the user/kernel boundary); throw out typical event abstractions; use the hardware directly! 45
ありがとう THANKS me: http://halek.co; our lab: http://presciencelab.org; Nautilus: http://nautilus.halek.co; Hobbes Exascale OS/R project: http://xstack.sandia.gov/hobbes 46
BACKUPS 47