Learning Queuing Networks by Recurrent Neural Networks
Giulio Garbi, Emilio Incerto and Mirco Tribastone
IMT School for Advanced Studies Lucca, Lucca, Italy
giulio.garbi@imtlucca.it
ICPE 2020 Virtual Conference, April 20–24, 2020
Motivation
• Performance means revenue
• « We are not the fastest retail site on the internet today » [Walmart, 2012]
• « […] page speed will be a ranking factor for mobile searches. » [Google]
⇒ It's worth investing in system performance. How?
Motivation
• Question: where to invest?
• Performance estimation:
  • Profiling: easy, but does not predict
  • Modeling: needs an expert and continuous updates, but gives predictions
Motivation: our vision
• If we had a model, we could try all possible choices, forecast, and choose the best option.
⇒ Automate model generation!
Our Main Contribution
• Direct association between:
  • Model: fluid approximation of closed queuing networks
  • Automation: recurrent neural networks
• Automatic generation of models from data
Model: Queuing Networks
• A model that represents contention for resources by clients
• Clients ask for work from stations (the resources)
• Stations have a maximum concurrency level and a speed
• Once served, clients ask for another resource according to a routing matrix
[Figure: a three-station closed queuing network; each station i has service rate and concurrency level ⟨µ_i, s_i⟩, queue length x_i, and routing probabilities P_{1,2}, P_{2,1}, P_{1,3}, P_{3,1}.]
Model of a system
• Resources ⇒ hardware
• Routing matrix ⇒ program code
• Clients ⇒ program instances
[Figure: the same three-station queuing network as on the previous slide.]
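The slides do not spell out the fluid equations, so here is a minimal simulation sketch, assuming the standard fluid approximation of a closed queuing network, dx_i/dt = Σ_j P_{j,i}·µ_j·min(x_j, s_j) − µ_i·min(x_i, s_i). The three-station parameters below are made up for illustration; they are not the ones used in the talk.

```python
import numpy as np

# Illustrative parameters: 3 stations.
mu = np.array([10.0, 8.0, 5.0])        # service rates (clients / time unit)
s  = np.array([2.0, 1.0, 3.0])         # concurrency levels (servers per station)
P  = np.array([[0.0, 0.6, 0.4],        # routing matrix: P[i, j] = probability of
               [1.0, 0.0, 0.0],        # going from station i to station j after service
               [1.0, 0.0, 0.0]])
x  = np.array([60.0, 0.0, 0.0])        # initial queue lengths (N = 60 clients, all at station 1)

dt, horizon = 0.01, 6.0
for _ in range(int(horizon / dt)):
    # Fluid approximation of a closed QN, one explicit-Euler step:
    # dx_i/dt = sum_j P[j, i] * mu_j * min(x_j, s_j) - mu_i * min(x_i, s_i)
    throughput = mu * np.minimum(x, s)
    x = x + dt * (P.T @ throughput - throughput)

print(x)  # queue lengths at t = 6; the total stays equal to N (closed network)
```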
How our procedure works
Profiling ⇒ Learning ⇒ Model ⇒ Changes ⇒ Prediction
Recurrent Neural Networks
• Recurrent neural networks (RNNs) work with sequences (e.g. time series)
• We encode the model as an RNN with a custom structure
[Figure: the RNN unrolled over H−1 time steps; each cell updates the queue lengths x_1 … x_M through min and weighted-sum operations parameterised by the service rates µ_i, the concurrency levels s_i, and the routing probabilities P_{i,j}.]
Recurrent Neural Networks
• The system parameters are directly encoded in the RNN cell
⇒ The learned model explains the system! (explainable neural network)
• We can modify the system afterwards to do prediction
[Figure: the same unrolled RNN as on the previous slide.]
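A sketch of how such a cell could look in PyTorch, under the assumption that one RNN step corresponds to one explicit-Euler step of the fluid equations. The softmax parameterisation of the routing matrix, the initialisation, and the omission of positivity constraints on µ and s are my own simplifications, not necessarily the choices made in the paper.

```python
import torch
import torch.nn as nn

class FluidQNCell(nn.Module):
    """One RNN step = one Euler step of the fluid QN equations.
    The learnable tensors ARE the queuing-network parameters
    (service rates, concurrency levels, routing matrix)."""
    def __init__(self, n_stations, dt=0.01):
        super().__init__()
        self.dt = dt
        self.mu = nn.Parameter(torch.rand(n_stations) * 10)   # service rates
        self.s  = nn.Parameter(torch.rand(n_stations) * 10)   # concurrency levels
        self.P_logits = nn.Parameter(torch.zeros(n_stations, n_stations))

    def forward(self, x):
        P = torch.softmax(self.P_logits, dim=1)               # rows sum to 1 (routing matrix)
        throughput = self.mu * torch.minimum(x, self.s)
        return x + self.dt * (throughput @ P - throughput)

# Unrolling the cell over a profiled trace and minimising the gap between
# predicted and measured queue lengths recovers mu, s and P by gradient descent.
cell = FluidQNCell(n_stations=3)
x = torch.tensor([60.0, 0.0, 0.0])
for _ in range(100):
    x = cell(x)
```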
Synthetic case studies: setting
• 10 random systems: five with M = 5 stations, five with M = 10 stations
• Concurrency levels between 15 and 30
• Service rates between 4 and 30 clients/time unit
• 100 traces, each one an average of 500 executions, with [0, 40·M] clients
• Learning time: 74 min for M = 5 and 86 min for M = 10
• Error function: % of clients wrongly placed (see the sketch below)
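A sketch of the error metric, assuming that "% clients wrongly placed" means half the L1 distance between predicted and measured queue-length vectors, normalised by the total population; the exact definition is in the paper, not on the slide.

```python
import numpy as np

def placement_error(x_pred, x_true):
    """Percentage of clients placed at the wrong station (assumed definition:
    half the L1 distance between the queue-length vectors, divided by N)."""
    N = x_true.sum()
    return 100.0 * 0.5 * np.abs(x_pred - x_true).sum() / N

# Example: 60 clients, 10 of them predicted at the wrong station -> ~16.7%
print(placement_error(np.array([50.0, 5.0, 5.0]), np.array([40.0, 10.0, 10.0])))
```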
Synthetic case studies: prediction with different #clients
• No significant difference across network sizes and client populations
⇒ Good predictive power across different conditions
[Plot: prediction error (err) vs. number of clients N from 100 to 800, one curve for M = 5 and one for M = 10.]
Synthetic case studies: prediction with different concurrency levels
• Concurrency increased so as to resolve the bottleneck
⇒ Learning outcome resilient to changes in part of the network
[Plot: prediction error (err) vs. number of clients N from 50 to 250, one curve for M = 5 and one for M = 10.]
Real case study: setting
• node.js web application, replicated 3 times
• Python script simulates N clients
• Learning time: 27 min for N = 26
[Figure: real system architecture (a workload machine W, a load balancer LB, and three node.js replicas C1–C3) and the corresponding QN model with stations M1–M4; the service rates ⟨µ_i, s_i⟩ and routing probabilities P_{i,j} are the unknown parameters to be learned.]
Real case study: prediction with different #clients
[Plots: measured vs. RNN-learned queue lengths of M1–M4 over t ∈ [0, 6] s for N = 52, 78, 104, and 130 clients; per-panel prediction errors of 5.03%, 6.46%, 6.45%, and 9.05%.]
M3 is the bottleneck, and this affects the UX. We need to solve it…
Real case study: prediction with different structure
…by increasing the concurrency level of M3 (err: 5.98%), or …by changing the LB scheduling policy (err: 6.10%)
[Plots: measured vs. RNN-learned queue lengths of M1–M4 over t ∈ [0, 6] s for the two modified configurations.]
Bottleneck solved. Nice results also on a real HW+SW system.
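A sketch of the first what-if analysis (raising M3's concurrency level), reusing the fluid-simulator idea from the earlier sketch. All numerical values here are hypothetical placeholders, not the parameters actually learned from the real system.

```python
import numpy as np

# Hypothetical learned parameters for a 4-station model (M1..M4).
mu = np.array([20.0, 10.0, 8.0, 6.0])     # service rates
s  = np.array([10.0, 10.0, 5.0, 6.0])     # concurrency levels
P  = np.array([[0.0, 0.5, 0.3, 0.2],      # routing matrix (rows sum to 1)
               [1.0, 0.0, 0.0, 0.0],
               [1.0, 0.0, 0.0, 0.0],
               [1.0, 0.0, 0.0, 0.0]])

def simulate(s_vec, N=104, dt=0.01, horizon=6.0):
    """Run the fluid equations with a given vector of concurrency levels."""
    x = np.array([float(N), 0.0, 0.0, 0.0])   # all clients start at M1
    for _ in range(int(horizon / dt)):
        throughput = mu * np.minimum(x, s_vec)
        x = x + dt * (P.T @ throughput - throughput)
    return x

print(simulate(s))                  # baseline run: clients accumulate at M3 (index 2)
s_fix = s.copy()
s_fix[2] *= 2                       # what-if: double M3's concurrency level
print(simulate(s_fix))              # M3's queue drains in the modified model;
                                    # compare against a run on the modified real system
```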
Limits
• Many traces required to learn the system.
• System must be observed at high frequency.
• Layered systems currently not supported.
• Resilient to limited changes, not extensive ones.
Related work
• Performance models from code (e.g. PerfPlotter; not predictive)
• Modelling black-box systems (e.g. Siegmund et al., tree-structured models)
• Program-driven generation of models (e.g. Hrischuk et al., distributed components that communicate via RPC)
• Estimation of service demands in QNs through several techniques (we estimate both service demands and the routing matrix)
Conclusions
• We provided a method to estimate QN parameters using an RNN that converges to feasible parameters.
• With the estimated parameters, it is possible to predict the evolution of the system under a population different from the one used during learning, or after structural modifications.
• Future work: apply the technique to more complex systems (e.g. LQN, multiclass), use other learning methodologies (e.g. neural ODEs), and improve the accuracy of the results.
Thank you!