Reliability Support in Virtual Infrastructures
2nd IEEE International Conference on Cloud Computing Technology and Science (CloudCom), Indianapolis, 2010
Guilherme Koslovski (INRIA – University of Lyon), Wai-Leong Yeow (DoCoMo USA Labs), Cedric Westphal (DoCoMo USA Labs), Tram Truong Huu (University of Nice – I3S), Johan Montagnat (CNRS – I3S), Pascale Vicat-Blanc Primet (INRIA – LYaTiss)
Reliability as a Service
• Reliability: probability that a system will survive failures
• Availability: fraction of time that a system is functional
• Providers advertise figures such as 99.95% availability, 99.9% availability, 99.95% reliability, or 100% (network) uptime
• In practice, these claims are nothing more than SLAs:
  – a failure only entitles the customer to service credits
  – they encourage lock-in
  – they provide no guarantees at all
Context
• Convergence of computing and communication: the Virtual Infrastructure is a concept emerging from virtual networks and Infrastructure as a Service
• New models and tools are needed to manage the virtualized substrate and to help users execute their applications
• [Diagram: users with complex applications on top of a distributed and virtualized substrate, built from network virtualization, resource virtualization, grid-computing experience, and IaaS/PaaS/XaaS concepts]
Issue
• Network and IT resources are subject to random failures
• Failure rates can be measured, e.g., through the mean time between failures (MTBF)
• Impact of a failure on a distributed application:
  – a worker-node failure can increase the total execution time
  – a database or server failure can compromise the entire execution
• Some applications can recover from failures, but:
  – the recovery process usually increases the execution time
  – it complicates application development
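To make the failure figures concrete, here is a minimal sketch of how an MTBF can be turned into a per-node failure probability for a run of a given length. It assumes an exponential (memoryless) failure model, which the slides do not state explicitly; the run length of 1205 s is the application makespan reported later in the talk.

```python
import math

def failure_probability(runtime_s: float, mtbf_s: float) -> float:
    """Probability that a node fails at least once during the run,
    assuming failures follow a Poisson process with the given MTBF."""
    return 1.0 - math.exp(-runtime_s / mtbf_s)

# Example: a 1205 s run on a node with an MTBF of 15000 s
# fails with a probability of roughly 7.7%.
print(f"{failure_probability(1205, 15000):.3f}")
```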
Our proposal
• Reliability as a service offered by the infrastructure provider
• [Diagram: a user's application can request either a basic infrastructure ("provide me a basic infrastructure") or a reliable one ("provide me a reliable infrastructure"); each virtual machine VM 1 … VM n runs on a physical machine (PM) and, in the reliable case, is paired with a backup BKP 1 … BKP n hosted on another PM]
Our proposal
• Reliability becomes a service offered by the infrastructure provider
• Reliability provisioning is transparent: users (applications) have no knowledge of physical failures
Outline
• Providing transparent reliability
• Reliable Virtual Infrastructure description
• Automatic generation of backup nodes and backup links
• Allocation algorithm
• Evaluation through a use-case application
• Conclusion & future work
Mechanism for providing transparent reliability
I. Virtual Infrastructure description
II. Translation of reliability requirements into real backup nodes
III. Allocation of a reliable Virtual Infrastructure
Virtual Infrastructure description: the VXDL language
• VXDL: Virtual private eXecution infrastructure Description Language – http://www.ens-lyon.fr/LIP/RESO/Software/vxdl/
• A VXDL file contains a general description, the resource descriptions, the network topology, and a timeline description
• Example (illustrated in the sketch below): a group of worker nodes (workers: 100 nodes, 1 GB RAM, 2 GHz, 2 cores, location lyon.fr, reliability 99.9%) and a database node (database: 1–2 GB RAM, 2 GHz, 2 cores, location lyon.fr, reliability 99.99%)
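The sketch below is a hypothetical Python representation of the information such a description carries; it is not VXDL syntax, and the made-up fields (the infrastructure name, lease type, and timeline stage) are there only for illustration.

```python
# Hypothetical in-memory view of the example request above (NOT VXDL syntax).
virtual_infrastructure = {
    "general": {"name": "example-vi", "lease": "short-term"},   # assumed fields
    "resources": [
        {"id": "workers", "count": 100, "ram_gb": 1, "cpu_ghz": 2,
         "cores": 2, "location": "lyon.fr", "reliability": 0.999},
        {"id": "database", "count": 1, "ram_gb": (1, 2), "cpu_ghz": 2,
         "cores": 2, "location": "lyon.fr", "reliability": 0.9999},
    ],
    "network": [
        # one class of virtual links between the database and the workers
        {"from": "database", "to": "workers", "bandwidth_mbps": 10},
    ],
    "timeline": [{"stage": "execution", "uses": ["workers", "database"]}],
}
```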
Virtual Infrastructure extension: translating reliability requirements into replica nodes
• Opportunistic Redundancy Pooling (ORP) mechanism [W. Yeow et al., 2010]
  – Input: the reliability level (user requirement), the probability of physical failures (derived from the MTBF), and the number of protected virtual nodes (user requirement)
  – Output: the number of backup nodes (see the sketch after this slide)
• Backup nodes can be shared among different groups of critical nodes
  – for example, when two sets of backup nodes (k1 and k2) are shared to protect two groups of critical nodes, thanks to ORP only min(k1, k2) backups are required
• [W. Yeow et al., 2010]: Designing and Embedding Reliable Virtual Infrastructures, VISA Workshop 2010
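The following is a minimal sketch of the core calculation behind the input/output relation above, not the ORP algorithm itself: it finds the smallest number of backups k such that at most k of the n protected nodes fail simultaneously with the requested probability, assuming independent node failures and that any backup can take over any failed node.

```python
from math import comb, exp

def node_failure_prob(runtime_s: float, mtbf_s: float) -> float:
    """Per-node failure probability over one run, assuming exponential failures."""
    return 1.0 - exp(-runtime_s / mtbf_s)

def backups_needed(n_protected: int, p_fail: float, reliability: float) -> int:
    """Smallest k such that no more than k of the n protected nodes fail
    with probability >= reliability (independent failures assumed)."""
    for k in range(n_protected + 1):
        p_ok = sum(comb(n_protected, i) * p_fail**i * (1 - p_fail)**(n_protected - i)
                   for i in range(k + 1))
        if p_ok >= reliability:
            return k
    return n_protected

p = node_failure_prob(1205, 60000)      # per-worker failure probability (~2%)
print(backups_needed(30, p, 0.999))     # backups needed for 30 workers at 99.9%
```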
Virtual Infrastructure extension: backup links
• Backup links are generated so that the network topology remains consistent after a failover
• [Diagram: three-step illustration of the backup-link generation]
• One plausible interpretation of this step is sketched below
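The sketch below shows one plausible way to generate backup links: every backup node receives a link to each neighbour of every node it may replace, mirroring the bandwidth of the original link. This is an interpretation of the slide, not necessarily the exact mechanism of the paper.

```python
def add_backup_links(links, protected, backups):
    """Give every backup node a link to each neighbour of every protected
    node it may replace, mirroring that link's bandwidth, so the virtual
    topology stays consistent after a failover.
    `links` is a list of (node_a, node_b, bandwidth_mbps) tuples."""
    extended = list(links)
    seen = set()
    for b in backups:
        for a, c, bw in links:
            for v, neighbour in ((a, c), (c, a)):
                if v in protected and (b, neighbour) not in seen:
                    seen.add((b, neighbour))
                    extended.append((b, neighbour, bw))
    return extended

# Example: a database linked to two workers at 10 Mbps, one shared backup.
links = [("database", "w1", 10), ("database", "w2", 10)]
print(add_backup_links(links, protected={"w1", "w2"}, backups={"bkp1"}))
```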
Allocation of a reliable Virtual Infrastructure
• The extended graph is composed of the original description plus the backup components
• Backup components can have specific constraints
  – for example, an original node and its backup should be allocated on different physical racks
• The extended graph is embedded onto the physical substrate using subgraph-isomorphism detection [Lischka et al., 2009]
• [Diagram: the embedded graph mapped onto the physical substrate]
• A toy illustration of node mapping under such a constraint follows
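Below is a toy backtracking node mapping, not the subgraph-isomorphism algorithm of Lischka et al., intended only to show how an anti-affinity constraint (original node and backup on different racks) enters the embedding; node capacities and virtual-link embedding are ignored.

```python
def embed(virtual_nodes, physical_nodes, rack_of, anti_affinity):
    """Place each virtual node on a distinct physical node while keeping
    anti-affinity pairs on different racks (toy sketch)."""
    mapping = {}

    def feasible(v, p):
        if p in mapping.values():
            return False
        for a, b in anti_affinity:
            other = b if v == a else a if v == b else None
            if other in mapping and rack_of[mapping[other]] == rack_of[p]:
                return False
        return True

    def search(i):
        if i == len(virtual_nodes):
            return True
        for p in physical_nodes:
            if feasible(virtual_nodes[i], p):
                mapping[virtual_nodes[i]] = p
                if search(i + 1):
                    return True
                del mapping[virtual_nodes[i]]
        return False

    return mapping if search(0) else None

# Example: a VM and its backup must land on different racks.
print(embed(["vm1", "bkp1"], ["pmA", "pmB", "pmC"],
            {"pmA": "rack1", "pmB": "rack1", "pmC": "rack2"},
            anti_affinity=[("vm1", "bkp1")]))
```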
From mapping to allocation
• The map produced by the allocation step is interpreted and instantiated using the HIPerNet framework [P. Primet et al., 2010]
• Original VMs and their replicas are synchronized by a modified version of the Remus live protection mechanism [B. Cully et al., 2008]
Evaluation through a use-case application
• Bronze Standard: a distributed, large-scale application
  – quantifies the maximal error resulting from medical-image analysis
  – uses large databases: the more data, the higher the accuracy
  – 31 VMs: 512 MB RAM, 1 GHz each
  – 10 Mbps for each virtual link between the database and the workers
• Workflow: (I) the infrastructure is described in VXDL, then (II) submitted to HIPerNet
• Two reliability-requirement scenarios:
  – database protection: a database failure stops the application execution
  – worker protection: a worker failure increases the execution time
• Testbed: Grid'5000
  – the physical substrate is composed of 100 nodes
  – simulated MTBF values: 60000 s, 30000 s, 15000 s [D. Atwood et al., 2008]
Experimental results: cost
• Goal: quantify the cost of a reliable Virtual Infrastructure
• Prices are based on Amazon EC2 for Europe: a basic node costs $0.095/h on a short-term lease and $0.031/h on a long-term lease; no specific link pricing is included
• Cost without reliability support (short-term lease): $2.95/h
• Prices for computing-node protection (30 VMs, 99.9% reliability); a sketch reproducing these figures follows:

  MTBF    | Backup nodes | Short term: total cost | Reliability cost / total | Long term: total cost | Reliability cost / total
  60000 s | 5            | $3.42/h                | 16.1%                    | $3.10/h               | 5.3%
  30000 s | 8            | $3.71/h                | 25.8%                    | $3.19/h               | 8.4%
  15000 s | 12           | $4.09/h                | 38.7%                    | $3.32/h               | 12.6%
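The sketch below reproduces the figures in the table under two assumptions inferred from the numbers rather than stated on the slide: the 31 application VMs are always billed at the short-term rate, and the "long term" column bills only the backup nodes at the long-term rate; the relative reliability cost is computed against the $2.95/h baseline.

```python
# Sketch reproducing the cost table (billing model inferred, see above).
SHORT, LONG = 0.095, 0.031        # $/h per node (Amazon EC2 Europe prices)
BASE_VMS = 31

base_cost = BASE_VMS * SHORT      # ~$2.95/h without reliability support
for mtbf_s, backups in [(60000, 5), (30000, 8), (15000, 12)]:
    for label, rate in (("short", SHORT), ("long", LONG)):
        total = base_cost + backups * rate
        overhead = (total - base_cost) / base_cost
        print(f"MTBF {mtbf_s:>6} s, {backups:>2} backups, {label}-term lease: "
              f"${total:.2f}/h (+{overhead:.1%})")
```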
Experimental results: application behavior
• Goal: evaluate the application behavior when executing with reliability support
• Application makespan without substrate failures: 1205 s, used as the baseline (NI)
• Database protection (DB): the database is the only protected component; the makespan increases proportionally to the number of failures
• Worker-node protection (WN): only the computing nodes are protected; the makespan increases only slightly
• [Chart: makespan in seconds for NI and for MTBF values of 60000 s, 30000 s, and 15000 s, in the DB and WN scenarios]

  MTBF    | DB: makespan increase | WN: makespan increase
  60000 s | 16.26%                | 0.2%
  30000 s | 26.47%                | 1.7%
  15000 s | 40.08%                | 3.2%
Experimental results: reliability service vs. a resubmission mechanism
• In the resubmission scenario, the application is aware of substrate failures and a failed task is resubmitted on a new computing node
• The makespan difference would have been even larger if the backup nodes had not been pre-allocated and pre-configured
• [Chart: makespan in seconds for the reliability service and for resubmission, for MTBF values of 60000 s, 30000 s, and 15000 s]

  MTBF    | Makespan increase of resubmission over the reliability service
  60000 s | +13.08%
  30000 s | +19.67%
  15000 s | +22.19%
Conclusions
• Reliability becomes a service offered by the infrastructure provider
• We have developed a framework that provides transparent reliability:
  – a language to specify the reliability requirements
  – a mechanism to interpret these requirements and transform them into replicas (nodes and links)
  – a mapping and allocation process to provision the reliability level required by the user
• The framework was implemented on top of the HIPerNet framework and validated on the Grid'5000 testbed
• Future work includes:
  – implementing a mechanism to protect virtual links
  – a detailed investigation of the economic aspects
• Tomorrow there will be a demonstration of the industrial version of the HIPerNet framework (LYaTiss core) – http://www.lyatiss.com/