Fragility Risks of Low Latency Dynamic Queuing in Large-Scale Clouds: Complex System Perspective
Vladimir Marbukh
FIT 2017
Outline
• Empirical observations & modeling perspectives
• Markov model and approximations of systemic risk
• Cloud models
• Gradual vs. abrupt instabilities
• Implications for Internet transport
• Conclusion, future research
Complex/Networked Systems: Empirical Observations
Connectivity carries an inherent systemic benefit/risk tradeoff.
Connectivity is economically driven (rich gets richer, economy of scale, risk sharing, etc.)
Economics fails to address systemic risks: (cyber)security, cascading failures, etc.
Conventional risk management: use historical data to extrapolate, i.e., “fight the last war”.
Challenge: unexpected consequences due to
- externalities due to strategic selfish or malicious (cybersecurity, terrorism) components
- non-linear component interactions, randomness, e.g., stochastic resonance
Ultimate goal: systemic risk/benefit control through a combination of regulations and incentives
Markov Micro-description
Markov process with locally interacting components [R. Dobrushin, 1971]
Graph: nodes = components, (directed) links = interactions
Internal node dynamics $x_n(t)$: Markov process with transition rates dependent on the internal states of neighbors
System microstate: $X(t) = (x_1(t), \ldots, x_N(t))$
Non-steady and steady probabilities $P(t, X) = \Pr(X(t) = X)$ and $P(X) = \lim_{t \to \infty} P(t, X)$ are solutions to the corresponding Kolmogorov equations.
The Kolmogorov system's dimension is ~ exp(N) => solution intractable, metastability
In the “very particular case” of a time-reversible Markov process, $P(X) \sim \exp[-U(X)]$
Local minima of the potential $U(X)$ = metastable states (Landau theory of phase transitions)
In the general case we use the mean-field approximation $P(t, X) \approx \prod_{n=1}^{N} P(t, x_n)$, based on the “hypothesis of chaos propagation”.
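The gap between the full Markov description and the mean-field product form can be made concrete by counting unknowns. The sketch below is only an illustration, assuming every component has the same small number of internal states; the function names are hypothetical and not taken from the slides.

```python
def full_state_space_size(num_components: int, states_per_component: int) -> int:
    """Number of microstates X = (x_1, .., x_N), one Kolmogorov equation each."""
    return states_per_component ** num_components


def mean_field_unknowns(num_components: int, states_per_component: int) -> int:
    """Number of marginal probabilities P(t, x_n) tracked under the product form."""
    return num_components * states_per_component


# Assumption: binary components (2 internal states each).
for n in (5, 10, 20, 40):
    print(n, full_state_space_size(n, 2), mean_field_unknowns(n, 2))
```

The first count grows exponentially in N, the second only linearly, which is why the product-form approximation is tractable where the exact Kolmogorov system is not.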
Individual & Systemic Risks
Desirable states $\{x_n^*\}$: indicator value 1
Undesirable states $\{x_n\} \setminus \{x_n^*\}$: indicator value 0
Negative externalities: the expectation $E[\cdot]$ for component n is ordered across two configurations (labeled 1 and 2) of the other components $(i \ne n)$
Individual risk: $s_n = E[\cdot]$, an expectation over component n's state, with a parameter equal to 0 (1) depending on whether individual risk can (can't) be transferred to the neighboring components
Example: Lorenz, J., Battiston, S., and Schweitzer, F. 2009
Systemic risk: $S = \sum_n w_n s_n$
Cloud: Operational Model
Server group j (j = 1, .., J): operational with prob. $1 - f_j$, non-operational with prob. $f_j$
Failures/recoveries occur on a much slower time scale than job arrivals/departures
Static load balancing is possible if $f_j = 0$ and the utilization of each server group, i.e., its load relative to its capacity $N_j c_j$, leaves sufficient slack
Problems: exogenous load uncertain, $f_j > 0$, other uncertainties.
Possible solution: dynamic load balancing based on dynamic utilization, e.g., numbers of occupied servers, queue sizes, etc.
Problem: serving non-native requests is less efficient: $c_{ij} < c_i$, $i \ne j$, and according to A.L. Stolyar and E. Yudovina (2013) this may cause instability of “natural” dynamic load balancing
Cloud: Markov Model
Failures/recoveries occur on a much slower time scale than job arrivals/departures
Probability of the configuration $\delta = (\delta_1, \ldots, \delta_I)$: $P(\delta) = \prod_{i=1}^{I} [\delta_i f_i + (1 - \delta_i)(1 - f_i)]$, where $\delta_i = 0$ (1) if server group i is operational (non-operational)
Loss probability for class i jobs is $L_i(\delta)$, equal to one minus an expectation over availability indicators that take the values 0, 1 if server group i is, or respectively is not, available
q: probability that a class i job is admitted to the native server group i
A class i job that is not admitted natively attempts non-native service with the complementary probability
$J_i$ characterizes system topology
Markov description is intractable even for moderate size systems since it requires solving a number of Kolmogorov equations that grows exponentially with the number of server groups I
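The product form over server groups and the exponential number of failure configurations can be illustrated with a short enumeration. This is a minimal sketch, assuming groups fail independently and that `delta[i] = 1` marks group i as non-operational with probability `f[i]`; the function names and the numeric failure probabilities are hypothetical.

```python
from itertools import product
from math import prod


def config_probability(delta, f):
    """Probability of one configuration: delta[i] = 1 marks a non-operational
    group (probability f[i]), delta[i] = 0 an operational one (1 - f[i])."""
    return prod(fi if di else 1.0 - fi for di, fi in zip(delta, f))


def all_config_probabilities(f):
    """Enumerate all 2**I configurations; feasible only for small I."""
    return {delta: config_probability(delta, f)
            for delta in product((0, 1), repeat=len(f))}


f = [0.01, 0.02, 0.05]  # hypothetical failure probabilities for I = 3 groups
probs = all_config_probabilities(f)
print(len(probs))            # number of configurations, 2**I
print(sum(probs.values()))   # should be 1 up to rounding
```

Even before job arrivals and departures are modeled, the configuration count doubles with every added server group, which is the source of the intractability noted on the slide.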
Cloud: Mean-field & Fluid Approximations
Mean-field approximation: the expectation of a product of availability indicators factorizes into the product of the individual expectations, where the indicator for server group i equals 0 (1) if the group has (does not have) available resources
Informally: utilizations of different server groups are jointly statistically independent and described by the Erlang distribution with loads determined by self-consistency conditions, i.e., mean-field equations: a fixed-point system with one equation per server group, i = 1, .., I, for the effective loads $\tilde{\rho}_i$
In the case of large server groups, fluctuations are negligible: the probability that server group i has no available resources becomes $\max(0, 1 - 1/\tilde{\rho}_i)$, resulting in the fluid approximation.
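The passage from the Erlang description to the fluid approximation can be checked numerically. The sketch below assumes a pure loss system without buffers (the classical Erlang B formula), which is a simplification of the slide's model; `erlang_b` and `fluid_loss` are illustrative names.

```python
def erlang_b(num_servers: int, offered_load: float) -> float:
    """Erlang B blocking probability via the standard stable recursion
    B(0) = 1, B(n) = a * B(n-1) / (n + a * B(n-1))."""
    b = 1.0
    for n in range(1, num_servers + 1):
        b = offered_load * b / (n + offered_load * b)
    return b


def fluid_loss(rho: float) -> float:
    """Fluid-limit loss max(0, 1 - 1/rho) for a per-server load rho > 0."""
    return max(0.0, 1.0 - 1.0 / rho)


# Per-server load rho is held fixed while the group size N grows.
for n in (10, 100, 1000, 10000):
    for rho in (0.8, 1.2):
        print(n, rho, erlang_b(n, rho * n), fluid_loss(rho))
```

As the group size grows, the Erlang value approaches zero below unit load and approaches 1 - 1/rho above it, which is the sense in which fluctuations become negligible.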
Symmetric Cloud: Loss Model
[Figure, left panel: Revenue loss vs. exogenous load for different levels of resource sharing]
[Figure, right panel: Revenue loss vs. resource sharing level for medium exogenous load]
Implications:
• for a sufficiently low level of resource sharing, there is no metastability
• as the resource sharing level increases, metastability emerges
• performance in the “normal” (“congested”) metastable state gets better (worse)
• economics drives the system operator towards the stability boundary
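How metastability can emerge as resource sharing increases may be illustrated with a toy fixed-point calculation. This is not the model from the slides: it is a sketch assuming a symmetric system in the fluid regime, where a blocked job attempts non-native service with probability `alpha` and then consumes `gamma` times the native resources, so the effective load per group satisfies load = rho * (1 + alpha * gamma * loss(load)). All names and parameter values are hypothetical.

```python
import math


def fluid_loss(load: float) -> float:
    """Fluid-limit loss max(0, 1 - 1/load); zero for non-positive load."""
    return max(0.0, 1.0 - 1.0 / load) if load > 0 else 0.0


def fixed_points(rho: float, alpha: float, gamma: float) -> list:
    """Solutions of load = rho * (1 + alpha * gamma * fluid_loss(load))."""
    k = alpha * gamma
    points = []
    if rho <= 1.0:
        points.append(rho)  # uncongested branch: no blocking, load == rho
    # Congested branch (load > 1): load**2 - rho*(1 + k)*load + rho*k = 0.
    disc = (rho * (1.0 + k)) ** 2 - 4.0 * rho * k
    if disc >= 0.0:
        for sign in (-1.0, 1.0):
            root = (rho * (1.0 + k) + sign * math.sqrt(disc)) / 2.0
            if root > 1.0:
                points.append(root)
    return sorted(points)


# Exogenous load slightly below capacity, increasing sharing level alpha.
for alpha in (0.3, 0.6, 1.0):
    print(alpha, fixed_points(0.95, alpha, 2.0))
```

In this toy model a single uncongested solution exists whenever alpha * gamma <= 1, while for larger values congested solutions can coexist with the uncongested one at loads below capacity, mirroring the qualitative implications listed above.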
Symmetric Cloud: Queuing Model
[Figure, left panel: Large service groups: discontinuity & metastability in queue size vs. exogenous load for sufficient resource sharing]
[Figure, right panel: Small service groups: discontinuity in queue size vs. exogenous load for sufficient level of resource sharing]
Implications:
• for a sufficiently low level of resource sharing, there is no discontinuous instability
• as the resource sharing level increases, discontinuous instability emerges
• performance in the “normal” (“congested”) metastable state gets better (worse)
• economics drives the system operator towards the stability boundary
Resource Sharing Drivers
Generic: economy of scale
Specific: multiplexing gain due to mitigating local imbalances
We propose to quantify the benefits of resource sharing by the increase in the operational region.
Inefficiency of accommodating component i's individual risk/load by component j: a coefficient indexed by (i, j), $i \ne j$, compared with the native coefficient (i, i), which equals 1
System operational region without risk sharing: OAEBO
System operational region with complete risk sharing: OACEDBO
[Figure: operational region for two components, axes indexed 1 and 2, with vertices O, A, B, C, D, E]
Operational Region Boundary: Gradual/Abrupt Instability
Loss system under fluid approximation with risk amplification
[Figure, left panel: Low level of resource sharing]
[Figure, right panel: High level of resource sharing]
Thesis: since instabilities are unavoidable due to exogenous demand variability, hardware breakdowns, etc., systemic risk management should favor gradual rather than abrupt instability on the boundary of the operational region.
Motivation:
- Gradual instabilities may be signaled by critical slowdown, anomalous fluctuations, etc. [M. Scheffer, et al., Early-warning signals for critical transitions, Nature, 2009].
- Abrupt/discontinuous instabilities may cause unacceptably high performance deterioration as the system leaves the operational region.
- Abrupt/discontinuous instabilities are typically associated with undesirable metastable states inside the operational region.