Reactive Software with Elegance
Reactive design patterns for microservices on multicore
Reactive Summit, 22/10/18
charly.bechara@tredzone.com
Outline
1. Microservices on Multicore
2. Reactive Multicore Patterns
3. Modern Software Roadmap
1. MICROSERVICES ON MULTICORE
Microservices on Multicore: microservice architecture with the actor model (diagram: µServices composed of actors communicating by message passing)
Microservices on Multicore: fast data means more inter-communication (diagram: from batch to stream computations, real-time event processing, highly interconnected workflows, more communications)
Microservices on Multicore: microservice architecture + fast data (diagram: the same actor-based µServices exchanging messages, now with new interactions, then more interactions between them)
Microservices on Multicore: more microservices should run on the same multicore machine
Microservices on Multicore: microservice architecture + fast data + multicore machine (diagram: µServices mapped onto cores)
Microservices on Multicore: microservice architecture + fast data + multicore machine
Universal Scalability Law (Gunther's law), a performance model of a system based on queueing theory:
C(N) = N / (1 + σ(N - 1) + κN(N - 1))
Perfect scalability (C(N) = N) requires σ = 0, κ = 0; contention (σ >> 0, κ = 0) caps the speedup; coherency cost (σ >> 0, κ > 0) makes throughput degrade beyond some core count.
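To make the three regimes concrete, here is a minimal numeric sketch of Gunther's law; the σ and κ values are illustrative assumptions, not measurements from the talk:

    #include <cstdio>

    // Universal Scalability Law: relative capacity C(N) on N cores.
    // sigma models contention (serialization), kappa models coherency
    // (core-to-core crosstalk); sigma = kappa = 0 is perfect scalability.
    double usl(double n, double sigma, double kappa) {
        return n / (1.0 + sigma * (n - 1.0) + kappa * n * (n - 1.0));
    }

    int main() {
        for (int n = 1; n <= 32; n *= 2) {
            std::printf("N=%2d  perfect=%2d  contention=%5.2f  +coherency=%5.2f\n",
                        n, n, usl(n, 0.05, 0.0), usl(n, 0.05, 0.005));
        }
        return 0;
    }

With these sample coefficients, contention alone flattens the curve, while adding coherency cost makes capacity peak and then fall as cores are added.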
Microservices on Multicore: from inter-thread communications...
Microservices on Multicore: ...to inter-core communications
Microservices on Multicore: inter-core communication => cache coherency
Memory access latencies (assuming freq = 3 GHz):
- Registers: 1 cycle (0.3 ns)
- L1 I$ / L1 D$: 4 cycles (1.3 ns)
- L2$: 12 cycles (4 ns)
- Shared L3$ or LLC: > 30 cycles (10 ns)
- Core-to-core MESI coherency traffic: > 600 cycles (200 ns)
Microservices on Multicore: exchange software is pushing performance to hardware limits (volume: from k msg/s to M msg/s; velocity: from msec to µsec; stability: from the 50th to the 99.99th percentile)
Simplx: one thread per core => no context switching
Simplx: actor multitasking per thread => high core utilization
Simplx: one lock-free event loop per core for communications (one loop iteration = ~300 ns); Simplx runs on all cores
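The runtime internals are not shown in the deck; as a mental model only, here is a minimal sketch of "one lock-free event loop per core" with simplified types, not the real Simplx code:

    #include <atomic>
    #include <vector>

    // Minimal sketch: one thread is pinned per core; each runs its own
    // event loop over the actors deployed on that core. The hot path
    // takes no lock; cross-core messages would arrive through per-core
    // single-producer queues drained at the top of the loop.
    struct Actor {
        virtual void onCallback() = 0; // must be short and non-blocking
        virtual ~Actor() = default;
    };

    void coreEventLoop(std::vector<Actor*>& localActors,
                       std::atomic<bool>& running) {
        while (running.load(std::memory_order_relaxed)) {
            // drainInboundQueues();   // hypothetical: read cross-core events
            for (Actor* actor : localActors)
                actor->onCallback();   // multitask all actors of this core
        }
    }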
Multicore WITHOUT multithreaded programming?
Microservices on Multicore: very good resources, but no multicore-related patterns
2. REACTIVE MULTICORE PATTERNS
Reactive Multicore Patterns: 7 patterns to unleash multicore reactivity
- Core-to-core messaging (2 patterns)
- Core monitoring (2 patterns)
- Core-to-core flow control (1 pattern)
- Core-to-cache management (2 patterns)
Core-to-core messaging patterns
Pattern #1: the core-aware messaging pattern
Inter-core communication: push message ~500 ns (versus ~1 µs - 10 µs from sender to a destination server over a socket)
    Pipe pipe(greenActorId);
    pipe.push<HelloEvent>();
Pattern #1: the core-aware messaging pattern
Intra-core communication: push message ~300 ns (asynchronous)
    Pipe pipe(greenActorId);
    pipe.push<HelloEvent>();
Pattern #1: the core-aware messaging pattern
Intra-core communication: x150 speedup with a direct call (~2 ns, synchronous) over a push (~300 ns, asynchronous)
    ActorReference<GreenActor> target = getLocalReference(greenActorId);
    [...]
    target->hello();
Optimize calls according to the deployment
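Putting the two together, a deployment-aware dispatch might look like the sketch below; isOnSameCore is a hypothetical helper, the rest reuses the calls shown on the slides:

    // Hedged sketch, not the Simplx API verbatim: pick the ~2 ns
    // synchronous direct call when the target actor lives on this core,
    // otherwise fall back to the asynchronous pipe push.
    void sayHello(const ActorId& greenActorId) {
        if (isOnSameCore(greenActorId)) {           // hypothetical helper
            ActorReference<GreenActor> target = getLocalReference(greenActorId);
            target->hello();                        // direct call, synchronous
        } else {
            Pipe pipe(greenActorId);
            pipe.push<HelloEvent>();                // push, asynchronous
        }
    }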
Pattern #2: the message mutualization pattern
Network optimizations, core optimizations: same fight. In this use case, the 3 red consumers process the same data pushed from another core.
Pattern #2: the message mutualization pattern
Communication has a cost: many events mean heavy cache-coherency traffic (L3). Pushing the same data to the 3 consumers costs 3 inter-core events.
Pattern #2: the message mutualization pattern
Let's mutualize inter-core communications: instead of 3 inter-core events, push 1 event to a local router, which fans it out with 3 direct calls, as sketched below.
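A sketch of the router side, in the same slide style; the types and handler names are assumptions:

    #include <vector>

    // The producer now pushes ONE inter-core DataEvent to this router,
    // which lives on the consumers' core and fans out with direct calls.
    struct LocalRouterActor : Actor {
        std::vector<ActorReference<RedConsumerActor>> localConsumers;

        void onEvent(const DataEvent& data) {   // 1 inter-core event in...
            for (auto& consumer : localConsumers)
                consumer->process(data);        // ...3 cheap direct calls out
        }
    };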
Pattern #2: the message mutualization pattern
WITH pattern vs WITHOUT pattern: linear improvement
Core monitoring patterns @ real-time
Pattern #3: the core stats pattern
Use case: monitoring the data distribution throughput. We want to know, in real time, the number of messages received per second, globally and per core.
    StartSequence startSequence;
    startSequence.addActor<RedActor>(0); // core 0
    startSequence.addActor<RedActor>(0); // core 0
    startSequence.addActor<RedActor>(1); // core 1
    startSequence.addActor<RedActor>(1); // core 1
    Simplx simplx(startSequence);
Pattern #3: the core stats pattern
Use case: monitoring the data distribution throughput. Each RedActor reports to a per-core singleton monitor through a direct call.
    struct LocalMonitorActor : Actor {
        [...]
        void newMessage() { ++count; } // increase the local message counter
    };

    struct RedActor : Actor {
        [...]
        ActorReference<LocalMonitorActor> monitor;
        RedActor() {
            monitor = newSingletonActor<LocalMonitorActor>(); // one singleton monitor per core
        }
        void onEvent() { monitor->newMessage(); } // local monitoring, direct call
    };
Pattern #3: the core stats pattern
Use case: monitoring the data distribution throughput. Every second, a timer fires and the local monitor pushes its count to the service monitoring.
    struct LocalMonitorActor : Actor, TimerProxy {
        [...]
        LocalMonitorActor() : TimerProxy(*this) {
            setRepeat(1000); // 1-second timer
        }
        virtual void onTimeout() {
            // inform the service monitoring of the last second's statistics
            serviceMonitoringPipe.push<StatsEvent>(count);
            count = 0;
        }
    };
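On the receiving end of serviceMonitoringPipe, a global monitoring service could aggregate the per-core counts; a sketch in the same style (StatsEvent's field name is an assumption):

    // Runs on a dedicated monitoring core: every second it receives one
    // StatsEvent per core and derives global and per-core msg/s figures.
    struct ServiceMonitoringActor : Actor {
        uint64_t globalCountPerSecond = 0;

        void onEvent(const StatsEvent& stats) {
            globalCountPerSecond += stats.count; // global = sum over cores
        }
    };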
Pattern #4: the core usage pattern
Core utilization: detect overloaded cores before it is too late. Relying on the CPU usage provided by the OS is not enough:
- 100% does not mean the runtime is overloaded
- 10% does not tell how much data you can really process
Pattern #4: the core usage pattern
No push, no event, no work. Toy illustration: 20 idle loops in a second = 0% core usage (reality is more like 3 million loops per second).
Pattern #4: the core usage pattern
Efficient core usage: with 20 idle loops per second meaning 0% usage, a second containing 11 loops of which 3 are working loops means 60% core usage.
Pattern #4: the core usage pattern
Runtime performance counters help the measurement. Each loop is flagged idleLoop = 0|1, and in the toy illustration Duration(idleLoop) = 0.05 s:
CoreUsage = 1 - ∑(idleLoop) * Duration(idleLoop)
Over 1 second, 11 loops with 8 idle loops and 3 working loops give 1 - 8 * 0.05 = 60% core usage, reported by a core usage actor. In reality, an idle loop lasts ~300 ns.
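As a sketch of the mechanics (not Simplx internals), the counter only needs one flag per loop iteration plus the calibrated idle-loop duration:

    #include <cstdint>

    // Each event-loop iteration reports whether it did any work. Idle
    // iterations have a known, calibrated duration, so the idle time of
    // the last second is simply idleLoops * idleLoopDurationSec.
    struct CoreUsageCounter {
        uint64_t idleLoops = 0;
        double idleLoopDurationSec;   // ~300e-9 in reality, 0.05 in the toy example

        explicit CoreUsageCounter(double d) : idleLoopDurationSec(d) {}

        void onLoop(bool didWork) {
            if (!didWork) ++idleLoops;
        }

        double usageAndReset() {      // called by the 1-second timer
            double usage = 1.0 - idleLoops * idleLoopDurationSec;
            idleLoops = 0;
            return usage < 0.0 ? 0.0 : usage;
        }
    };

With the toy numbers (0.05 s per idle loop, 8 idle loops), usageAndReset() returns the slide's 60%.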
Demo: real-time core monitoring of a typical trading workflow (data stream + data processing)
Core-to-core flow control patterns
Pattern #5: the queuing prevention pattern
What if producers overflow a consumer? Even when your software cannot be optimized further, the incoming throughput can still be too high, implying heavy queuing. Continue? Stop the flow? Merge data? Throttle? Whatever the decision, we first need to detect the issue.
Pattern #5: the queuing prevention pattern
What's happening behind a push?
Pattern #5: the queuing prevention pattern
Local Simplx loops handle the inter-core communication (Batch ID = 145).
Pattern #5: the queuing prevention pattern
Once the destination reads the data, the batch ID is incremented (Batch ID = 145 -> 146).
Pattern #5: the queuing prevention pattern
The batch ID does not increment if the destination core is busy (Batch ID stays at 145).
Pattern #5: the queuing prevention pattern
Core-to-core communication at max pace:
    BatchID batchID(pipe);
    pipe.push<Event>();
    (...)
    if (batchID.hasChanged()) {
        // push again
    } else {
        // destination is busy:
        // merge data, start throttling, reject orders...
    }
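Wrapped in a producer loop, the check becomes a back-pressure valve; mergePending and throttle below are hypothetical application hooks, the rest follows the slide's code:

    // Push only while the consumer keeps incrementing the batch ID;
    // otherwise hold the data back and let the application decide.
    void onNewData(Pipe& pipe, BatchID& batchID, const Data& data) {
        if (batchID.hasChanged()) {
            pipe.push<Event>();    // destination is keeping up
        } else {
            mergePending(data);    // hypothetical: merge / throttle / reject
            throttle();
        }
    }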
Pattern #5: the queuing prevention pattern
Demo (code): same batch ID as last time => queuing; a new batch ID => no queuing.
Core-to-cache management patterns
Pattern #6: the cache-aware split pattern
FIX + execution engine: new order.
Pattern #6: the cache-aware split pattern FIX + execution engine A FIX order can easily size ~ 200 Bytes new w or orde der ack cknowledgment Almost all tags sent in the new order request need to be sent back in the acknowledgment 52
Pattern #6: the cache-aware split pattern
Stability depends on the ability to be cache-friendly. At ~200 bytes per order, one core can keep ~1,300 open orders "in-cache" for stable performance, yet an order book can hold 10,000 open orders. Hence the cache-aware split between the order book and local storage, as sketched below.
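One plausible reading of the split, as a hypothetical layout (the names, field choices, and sizes are assumptions, not the talk's code): keep only the fields the matching path touches in a compact in-cache record, and park the full ~200-byte FIX tag set in cold local storage until the acknowledgment has to be built.

    #include <cstdint>

    struct HotOrder {          // compact record kept on the matching core
        uint64_t orderId;
        int64_t  price;
        uint32_t quantity;
        uint32_t coldIndex;    // where the cold part of this order lives
    };                         // ~24 bytes: far more open orders stay in-cache

    struct ColdOrder {         // full FIX tags, touched only for the ack
        char fixTags[200];
    };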