Building a C++CSP Channel Using C++ Atomics: A Busy Channel Performance Analysis
Dr Kevin Chalmers
School of Computing, Edinburgh Napier University, Edinburgh
k.chalmers@napier.ac.uk
Outline 1 Introduction and Background 2 Current Channel Operations 3 An Atomic Channel Implementation 4 Benchmarking 5 Conclusions
Motivation
• Most CSP-inspired libraries build channels using a mutex.
• Channel communication in most libraries is slow due to context switching.
• A typical context switch on an i7 may take 1000 ns or more.
• A channel built on a mutex may incur up to three context switches:
  1. Writer to reader, after the writer has stored the value to be written.
  2. Reader to writer, after the reader has retrieved the value.
  3. Writer to reader, to complete the operation.
• A channel effectively stops parallelism and forces two processes to become sequential during communication.
• The aim of this work is to explore atomic operations as a method to improve channel performance.
Atomic Values
• Any trivially copyable type can be an atomic in C++, but how it is interacted with depends on its base type.
• Five types:
  • Atomic Flag
  • Atomic Boolean
  • Atomic Integral
  • Atomic Pointer
  • Atomic trivially copyable user type
• Flag and Boolean are different.
  • Flag is guaranteed lock free.
  • Flag only has two operations.
    • Test and Set (sets the flag and returns its previous value, so false means this call set it).
    • Clear.
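The two flag operations can be sketched as a try-lock. This is a minimal illustration only: `try_acquire`, `release` and `flag_demo` are hypothetical names, not part of C++CSP. Note that `test_and_set` returns the value held before the call.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: attempt to take the flag once.
// test_and_set returns the PREVIOUS value, so false means this call set it.
inline bool try_acquire(std::atomic_flag& flag) {
    return !flag.test_and_set(std::memory_order_acquire);
}

// Hypothetical helper: give the flag back.
inline void release(std::atomic_flag& flag) {
    flag.clear(std::memory_order_release);
}

// Single-threaded walk through both operations.
inline bool flag_demo() {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;  // starts clear
    bool first = try_acquire(flag);    // flag was clear: acquired
    bool second = try_acquire(flag);   // flag already set: refused
    release(flag);
    bool third = try_acquire(flag);    // clear again: acquired
    assert(first && third);
    return first && !second && third;
}
```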
Atomic Operations
• Atomic Flag already mentioned.
• Other types support, dependent on type:
  • store: atomically stores a value.
  • load: atomically retrieves a value.
  • exchange: atomically retrieves the current value and stores a new value.
  • compare and exchange: tests the current value and, if it matches the expected value, exchanges it with the new value. Provides the current value if not.
    • strong: guarantees the comparison is correct.
    • weak: may return false even if the values match.
  • fetch and op: atomically applies the given operation and returns the value held beforehand.
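A short single-threaded sketch of these operations on `std::atomic<int>`, using the default (sequentially consistent) ordering. The function name `atomic_ops_demo` is illustrative only.

```cpp
#include <atomic>
#include <cassert>

// Runs each operation once and checks the values it reports.
inline bool atomic_ops_demo() {
    std::atomic<int> x{0};

    x.store(5);                          // store
    bool ok = (x.load() == 5);           // load

    int old = x.exchange(7);             // exchange: returns 5, x becomes 7
    ok = ok && (old == 5);

    int expected = 5;                    // deliberately stale
    bool swapped = x.compare_exchange_strong(expected, 9);
    // Comparison fails: x is untouched and expected now holds the current value.
    ok = ok && !swapped && (expected == 7);

    swapped = x.compare_exchange_strong(expected, 9);  // now matches
    ok = ok && swapped && (x.load() == 9);

    int before = x.fetch_add(1);         // fetch and op: returns 9, x becomes 10
    ok = ok && (before == 9) && (x.load() == 10);

    assert(ok);
    return ok;
}
```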
Memory Ordering
• Atomic operations in C++ work on the principle of what has been observed by the different threads.
• We want to cause memory synchronisation to ensure a certain state has been reached.
• Five types of memory ordering of interest:
  • Sequentially consistent: everything behaves as if everyone is watching. Slow, but easy to think about.
  • Relaxed: no synchronisation of memory. Fast. Can achieve the memory synchronisation from subsequent operations.
  • Acquire: for load operations, etc. Matches a release.
  • Release: for store operations, etc. Matches an acquire.
  • Acquire-release: for fetch and op, etc. Matches a release and acquire.
• A naive explanation is that operations chain together to allow a memory history.
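The release/acquire pairing can be illustrated with a small publish-and-consume sketch. The names `payload`, `ready` and `publish_and_consume` are assumptions made for this example. The consumer spins on the flag, then reads plain data that the producer wrote before its release store.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns the value the consumer thread observed.
inline int publish_and_consume() {
    int payload = 0;                   // plain, non-atomic data
    std::atomic<bool> ready{false};    // synchronisation flag
    int seen = -1;

    std::thread consumer([&] {
        // Acquire load: once true is observed, writes made before the
        // matching release store are visible to this thread.
        while (!ready.load(std::memory_order_acquire)) {
            // busy spin
        }
        seen = payload;
    });

    payload = 42;                                  // ordinary write
    ready.store(true, std::memory_order_release);  // publish it
    consumer.join();

    assert(seen == 42);
    return seen;
}
```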
C++CSP Channel Model
    procedure read
        lock
        if strength > 0 then
            throw
        if empty then
            empty ← false
            wait
        else
            empty ← true
        to_return ← hold
        notify
        if strength > 0 then
            throw
        return to_return

    procedure write(value)
        lock
        if strength > 0 then
            throw
        hold ← value
        if empty then
            empty ← false
            if alting then
                schedule
        else
            empty ← true
            notify
        wait
        if strength > 0 then
            throw
C++CSP Channel Model
    procedure enable
        lock
        if strength > 0 then
            return true
        if empty then
            alting ← true
            return false
        else
            return true

    procedure disable
        lock
        alting ← false
        return !empty or strength > 0
Objectives
• The aim is to discover if atomics give us a better performing channel.
• Objectives:
  1. Build an atomic-based channel implementation. This is believed to be the first such implementation, based on other CSP libraries.
  2. Undertake a performance analysis of the atomic-based channel implementation. This analysis can be used to understand when to use an atomic-based channel.
Operation Pairings
• Six interactions possible when working with channels.
  1. Write and read.
  2. Write and select.
  3. Write and poison.
  4. Read and poison.
  5. Select and poison.
• The lock is removed and an analysis of the required ordering of instructions is made.
• Only write-read and write-select are covered here. The others are in the paper.
Write and Read - Read-first
    procedure read          procedure write(value)
    empty ← false           hold ← value
    wait                    empty ← true
    wait                    notify
    to_return ← hold        wait
    notify                  wait
Write and Read - Write-first
    procedure read          procedure write(value)
                            hold ← value
                            empty ← false
    empty ← true            wait
    to_return ← hold        wait
    notify                  wait
    return to_return
Write and Select
• If write goes first we have no concerns.
    procedure enable        procedure write(value)
    alting ← true           hold ← value
    return false            empty ← false
                            schedule
                            wait
Initial Analysis
• Channel operations are fairly simple from an instruction point of view.
• There is complexity in the ordering of the operations.
• A mutex-based channel avoids this complexity by enforcing sequential behaviour during the interaction.
• An atomic-based channel will need to synchronise on state, and then progress through the operations only when the correct next state is observed.
• This requires some busy spinning (equivalent to waiting).
Atomic-based Channel Members
• The atomic-based channel uses six values:
  • hold: an atomic to store the value communicated via the channel.
  • reading: an atomic boolean to indicate if the reader is reading or not.
  • writing: an atomic boolean to indicate if the writer is writing or not.
  • alt: a reference to any alt construct currently engaged with the channel.
  • alting: an atomic boolean to indicate if the channel is in a selection process.
  • strength: an atomic integral storing the current poison applied to the channel.
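The six members might be declared along the following lines. This is a sketch under assumptions: `AtomicChannelState`, `Alt` and `is_idle` are hypothetical names, the pointer type for `alt` and the `unsigned` type for `strength` are guesses, and `T` is assumed to be trivially copyable.

```cpp
#include <atomic>
#include <cassert>

struct Alt;  // hypothetical stand-in for the library's selection construct

template <typename T>
struct AtomicChannelState {
    std::atomic<T> hold{};                // value communicated via the channel
    std::atomic<bool> reading{false};     // reader is mid-communication
    std::atomic<bool> writing{false};     // writer is mid-communication
    Alt* alt = nullptr;                   // alt currently engaged, if any
    std::atomic<bool> alting{false};      // channel is part of a selection
    std::atomic<unsigned> strength{0};    // poison applied (0 = none)
};

// Hypothetical helper: true when no operation is in flight and no poison is set.
template <typename T>
bool is_idle(const AtomicChannelState<T>& c) {
    return !c.reading.load() && !c.writing.load() &&
           !c.alting.load() && c.strength.load() == 0;
}
```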
Atomic Read
    procedure atomic read
        while !load(writing, acquire) do
            skip
        if load(strength, relaxed) > 0 then
            throw
        to_return ← load(hold, relaxed)
        store(reading, true, release)
        while load(writing, acquire) do
            skip
        store(reading, false, release)
        return to_return
Atomic Write
    procedure atomic write(value)
        store(hold, value, relaxed)
        store(writing, true, release)
        if load(alting, acquire) then
            schedule
        while !load(reading, acquire) do
            skip
        if load(strength, relaxed) > 0 then
            throw
        store(writing, false, release)
        while load(reading, acquire) do
            skip
Atomic Enable
    procedure atomic enable
        store(alting, true, release)
        temp ← load(writing, acquire)
        if load(strength, relaxed) > 0 then
            return true
        return temp
Atomic Disable
    procedure atomic disable
        store(alting, false, release)
        return load(writing, acquire)
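The enable and disable procedures translate almost directly into C++. In this sketch `SelectState` is a hypothetical subset of the channel members, and the memory orders are transcribed from the pseudocode as-is; whether they are sufficient for every enable/write interleaving is not examined here.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical subset of the channel state needed for selection.
struct SelectState {
    std::atomic<bool> writing{false};
    std::atomic<bool> alting{false};
    std::atomic<unsigned> strength{0};
};

// Returns true if the channel is already ready (writer committed or poisoned).
inline bool atomic_enable(SelectState& c) {
    c.alting.store(true, std::memory_order_release);
    bool temp = c.writing.load(std::memory_order_acquire);
    if (c.strength.load(std::memory_order_relaxed) > 0) {
        return true;   // poisoned channels always report ready
    }
    return temp;
}

// Withdraws from the selection and reports whether a writer is committed.
inline bool atomic_disable(SelectState& c) {
    c.alting.store(false, std::memory_order_release);
    return c.writing.load(std::memory_order_acquire);
}
```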
Write and Read
Algorithm 1: Atomic Read-Write Interaction
    procedure atomic read                   procedure atomic write(value)
    while !load(writing, acquire) do        store(hold, value, relaxed)
        skip                                store(writing, true, release)
    to_return ← load(hold, relaxed)         while !load(reading, acquire) do
    store(reading, true, release)               skip
    while load(writing, acquire) do         store(writing, false, release)
        skip                                while load(reading, acquire) do
    store(reading, false, release)              skip
    return to_return                        return
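The read-write interaction above can be exercised with a small one-to-one channel. This is a sketch, not the C++CSP implementation: `BusyChannel` and `transfer` are hypothetical names, poison and select handling are omitted, the payload is fixed to `int`, and each spin calls `std::this_thread::yield()` (an addition to the pseudocode's bare `skip`) so the example also completes on machines with few cores.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// One-to-one busy channel for int; poison and alting are left out.
class BusyChannel {
    std::atomic<int> hold{0};
    std::atomic<bool> reading{false};
    std::atomic<bool> writing{false};

public:
    void write(int value) {
        hold.store(value, std::memory_order_relaxed);
        writing.store(true, std::memory_order_release);    // offer the value
        while (!reading.load(std::memory_order_acquire))   // wait for the reader
            std::this_thread::yield();
        writing.store(false, std::memory_order_release);   // acknowledge the read
        while (reading.load(std::memory_order_acquire))    // wait for reader to finish
            std::this_thread::yield();
    }

    int read() {
        while (!writing.load(std::memory_order_acquire))   // wait for a value
            std::this_thread::yield();
        int to_return = hold.load(std::memory_order_relaxed);
        reading.store(true, std::memory_order_release);    // value taken
        while (writing.load(std::memory_order_acquire))    // wait for acknowledgement
            std::this_thread::yield();
        reading.store(false, std::memory_order_release);   // ready for the next round
        return to_return;
    }
};

// Sends 0..count-1 through the channel and returns what the reader received.
inline std::vector<int> transfer(int count) {
    BusyChannel chan;
    std::vector<int> received;
    std::thread reader([&] {
        for (int i = 0; i < count; ++i) received.push_back(chan.read());
    });
    for (int i = 0; i < count; ++i) chan.write(i);
    reader.join();
    assert(received.size() == static_cast<std::size_t>(count));
    return received;
}
```

Each communication is a four-step handshake on `writing` and `reading`, so neither side can start the next round until both flags have returned to false.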
Write and Select
• Three possible outcomes (see paper). Concurrent version:
    procedure atomic enable                 procedure atomic write(value)
    store(alting, true, release)            store(hold, value, relaxed)
                                            store(writing, true, release)
    temp ← load(writing, acquire)           if load(alting, acquire) then
    return temp                                 schedule
                                            ...
Test Bed
• Intel Core i7-4770K CPU at 3.50 GHz.
• Hyper-threading: 4 cores, 8 hardware threads.
• Any benchmark with more than 8 threads will hit performance problems.
• Linux 4.4.
• GCC 7.1, -O3 flag.
Communication Time
Multiple Communication Times
Stressed Select
Channels  Writers/Channel  Busy Select  Busy stddev  Mutex Select  Mutex Select stddev
       2                1       380.34        11.50       2646.69               430.68
       2                2       526.06        34.80       3074.77               510.18
       2                4     93526.69    110770.80       2892.89               374.15
       2                8     31519.77     15630.47       2909.16               407.85
       4                1       308.96         7.85        884.27               169.14
       4                2       516.41       173.68       1228.36               378.08
       4                4     25776.55     33808.00       1295.66               314.01
       4                8     57581.95     48035.26       1354.72               142.64
       8                1      6434.99     19581.05        994.48               249.51
       8                2     40251.15     50379.31       1233.26               234.37
       8                4    187678.50    213596.23       1099.47               312.47
       8                8    563502.86    721881.75       1065.47               196.07