Reliability Support in Virtual Infrastructures
2nd IEEE International Conference on Cloud Computing Technology and Science (CloudCom), Indianapolis, 2010
Guilherme Koslovski (INRIA – University of Lyon), Wai-Leong Yeow (DoCoMo USA Labs), Cedric Westphal (DoCoMo USA Labs), Tram Truong Huu (University of Nice – I3S), Johan Montagnat (CNRS – I3S), Pascale Vicat-Blanc Primet (INRIA – LYaTiss)
Reliability as a Service
• Reliability: probability that a system will survive failures
• Availability: fraction of time that a system is functional
• Providers advertise figures such as 99.95% availability, 99.9% availability, 99.95% reliability, or 100% (network) uptime
• In practice, these claims are nothing more than SLAs:
  – a failure only entitles the customer to service credits
  – they encourage lock-in
  – they provide no guarantees at all
Context
• Convergence of computing and communication: the Virtual Infrastructure is a concept emerging from virtual networks and Infrastructure as a Service
• New models and tools are needed to manage the virtualized substrate and to help users execute their applications
• [Diagram: users with complex applications on top of a distributed and virtualized substrate, built from network virtualization, resource virtualization, grid-computing experience, and IaaS/PaaS/XaaS concepts]
Issue
• Network and IT resources are subject to random failures
• Failure rates can be measured, e.g., through the mean time between failures (MTBF)
• Impact of a failure on a distributed application:
  – a worker-node failure can increase the total execution time
  – a database or server failure can compromise the entire execution
• Some applications can recover from failures, but:
  – the recovery process usually increases the execution time
  – it complicates application development
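To make the failure figures concrete, here is a minimal sketch of how an MTBF can be turned into a per-node failure probability for a run of a given length. It assumes an exponential (memoryless) failure model, which the slides do not state explicitly; the run length of 1205 s is the application makespan reported later in the talk.

```python
import math

def failure_probability(runtime_s: float, mtbf_s: float) -> float:
    """Probability that a node fails at least once during the run,
    assuming failures follow a Poisson process with the given MTBF."""
    return 1.0 - math.exp(-runtime_s / mtbf_s)

# Example: a 1205 s run on a node with an MTBF of 15000 s
# fails with a probability of roughly 7.7%.
print(f"{failure_probability(1205, 15000):.3f}")
```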
Our proposal
• Reliability as a service offered by the infrastructure provider
• [Diagram: a user's application can request either a basic infrastructure ("provide me a basic infrastructure") or a reliable one ("provide me a reliable infrastructure"); each virtual machine VM 1 … VM n runs on a physical machine (PM) and, in the reliable case, is paired with a backup BKP 1 … BKP n hosted on another PM]
Our proposal
• Reliability becomes a service offered by the infrastructure provider
• Reliability provisioning is transparent: users (applications) have no knowledge of physical failures
Outline
• Providing transparent reliability
• Reliable Virtual Infrastructure description
• Automatic generation of backup nodes and backup links
• Allocation algorithm
• Evaluation through a use-case application
• Conclusion & future work
Mechanism for providing transparent reliability
I. Virtual Infrastructure description
II. Translation of reliability requirements into real backup nodes
III. Allocation of a reliable Virtual Infrastructure
Virtual Infrastructure description: the VXDL language
• VXDL: Virtual private eXecution infrastructure Description Language – http://www.ens-lyon.fr/LIP/RESO/Software/vxdl/
• A VXDL file contains a general description, the resource descriptions, the network topology, and a timeline description
• Example (illustrated in the sketch below): a group of worker nodes (workers: 100 nodes, 1 GB RAM, 2 GHz, 2 cores, location lyon.fr, reliability 99.9%) and a database node (database: 1–2 GB RAM, 2 GHz, 2 cores, location lyon.fr, reliability 99.99%)
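The sketch below is a hypothetical Python representation of the information such a description carries; it is not VXDL syntax, and the made-up fields (the infrastructure name, lease type, and timeline stage) are there only for illustration.

```python
# Hypothetical in-memory view of the example request above (NOT VXDL syntax).
virtual_infrastructure = {
    "general": {"name": "example-vi", "lease": "short-term"},   # assumed fields
    "resources": [
        {"id": "workers", "count": 100, "ram_gb": 1, "cpu_ghz": 2,
         "cores": 2, "location": "lyon.fr", "reliability": 0.999},
        {"id": "database", "count": 1, "ram_gb": (1, 2), "cpu_ghz": 2,
         "cores": 2, "location": "lyon.fr", "reliability": 0.9999},
    ],
    "network": [
        # one class of virtual links between the database and the workers
        {"from": "database", "to": "workers", "bandwidth_mbps": 10},
    ],
    "timeline": [{"stage": "execution", "uses": ["workers", "database"]}],
}
```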
Virtual Infrastructure extension: translating reliability requirements into replica nodes
• Opportunistic Redundancy Pooling (ORP) mechanism [W. Yeow et al., 2010]
  – Input: the reliability level (user requirement), the probability of physical failures (derived from the MTBF), and the number of protected virtual nodes (user requirement)
  – Output: the number of backup nodes (see the sketch after this slide)
• Backup nodes can be shared among different groups of critical nodes
  – for example, when two sets of backup nodes (k1 and k2) are shared to protect two groups of critical nodes, thanks to ORP only min(k1, k2) backups are required
• [W. Yeow et al., 2010]: Designing and Embedding Reliable Virtual Infrastructures, VISA Workshop 2010
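The following is a minimal sketch of the core calculation behind the input/output relation above, not the ORP algorithm itself: it finds the smallest number of backups k such that at most k of the n protected nodes fail simultaneously with the requested probability, assuming independent node failures and that any backup can take over any failed node.

```python
from math import comb, exp

def node_failure_prob(runtime_s: float, mtbf_s: float) -> float:
    """Per-node failure probability over one run, assuming exponential failures."""
    return 1.0 - exp(-runtime_s / mtbf_s)

def backups_needed(n_protected: int, p_fail: float, reliability: float) -> int:
    """Smallest k such that no more than k of the n protected nodes fail
    with probability >= reliability (independent failures assumed)."""
    for k in range(n_protected + 1):
        p_ok = sum(comb(n_protected, i) * p_fail**i * (1 - p_fail)**(n_protected - i)
                   for i in range(k + 1))
        if p_ok >= reliability:
            return k
    return n_protected

p = node_failure_prob(1205, 60000)      # per-worker failure probability (~2%)
print(backups_needed(30, p, 0.999))     # backups needed for 30 workers at 99.9%
```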
Virtual Infrastructure extension: backup links
• Backup links are generated so that the network topology remains consistent after a failover
• [Diagram: three-step illustration of the backup-link generation]
• One plausible interpretation of this step is sketched below
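The sketch below shows one plausible way to generate backup links: every backup node receives a link to each neighbour of every node it may replace, mirroring the bandwidth of the original link. This is an interpretation of the slide, not necessarily the exact mechanism of the paper.

```python
def add_backup_links(links, protected, backups):
    """Give every backup node a link to each neighbour of every protected
    node it may replace, mirroring that link's bandwidth, so the virtual
    topology stays consistent after a failover.
    `links` is a list of (node_a, node_b, bandwidth_mbps) tuples."""
    extended = list(links)
    seen = set()
    for b in backups:
        for a, c, bw in links:
            for v, neighbour in ((a, c), (c, a)):
                if v in protected and (b, neighbour) not in seen:
                    seen.add((b, neighbour))
                    extended.append((b, neighbour, bw))
    return extended

# Example: a database linked to two workers at 10 Mbps, one shared backup.
links = [("database", "w1", 10), ("database", "w2", 10)]
print(add_backup_links(links, protected={"w1", "w2"}, backups={"bkp1"}))
```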
Allocation of a reliable Virtual Infrastructure
• The extended graph is composed of the original description plus the backup components
• Backup components can have specific constraints
  – for example, an original node and its backup should be allocated on different physical racks
• The extended graph is embedded onto the physical substrate using subgraph-isomorphism detection [Lischka et al., 2009]
• [Diagram: the embedded graph mapped onto the physical substrate]
• A toy illustration of node mapping under such a constraint follows
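Below is a toy backtracking node mapping, not the subgraph-isomorphism algorithm of Lischka et al., intended only to show how an anti-affinity constraint (original node and backup on different racks) enters the embedding; node capacities and virtual-link embedding are ignored.

```python
def embed(virtual_nodes, physical_nodes, rack_of, anti_affinity):
    """Place each virtual node on a distinct physical node while keeping
    anti-affinity pairs on different racks (toy sketch)."""
    mapping = {}

    def feasible(v, p):
        if p in mapping.values():
            return False
        for a, b in anti_affinity:
            other = b if v == a else a if v == b else None
            if other in mapping and rack_of[mapping[other]] == rack_of[p]:
                return False
        return True

    def search(i):
        if i == len(virtual_nodes):
            return True
        for p in physical_nodes:
            if feasible(virtual_nodes[i], p):
                mapping[virtual_nodes[i]] = p
                if search(i + 1):
                    return True
                del mapping[virtual_nodes[i]]
        return False

    return mapping if search(0) else None

# Example: a VM and its backup must land on different racks.
print(embed(["vm1", "bkp1"], ["pmA", "pmB", "pmC"],
            {"pmA": "rack1", "pmB": "rack1", "pmC": "rack2"},
            anti_affinity=[("vm1", "bkp1")]))
```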
From mapping to allocation
• The map produced by the allocation step is interpreted and instantiated using the HIPerNet framework [P. Primet et al., 2010]
• Original VMs and their replicas are synchronized by a modified version of the Remus live protection mechanism [B. Cully et al., 2008]
Evaluation through a use-case application
• Bronze Standard: a distributed, large-scale application
  – quantifies the maximal error resulting from medical-image analysis
  – uses large databases: the more data, the higher the accuracy
  – 31 VMs: 512 MB RAM, 1 GHz each
  – 10 Mbps for each virtual link between the database and the workers
• Workflow: (I) the infrastructure is described in VXDL, then (II) submitted to HIPerNet
• Two reliability-requirement scenarios:
  – database protection: a database failure stops the application execution
  – worker protection: a worker failure increases the execution time
• Testbed: Grid'5000
  – the physical substrate is composed of 100 nodes
  – simulated MTBF values: 60000 s, 30000 s, 15000 s [D. Atwood et al., 2008]
Experimental results: cost
• Goal: quantify the cost of a reliable Virtual Infrastructure
• Prices are based on Amazon EC2 for Europe: a basic node costs $0.095/h on a short-term lease and $0.031/h on a long-term lease; no specific link pricing is included
• Cost without reliability support (short-term lease): $2.95/h
• Prices for computing-node protection (30 VMs, 99.9% reliability); a sketch reproducing these figures follows:

  MTBF    | Backup nodes | Short term: total cost | Reliability cost / total | Long term: total cost | Reliability cost / total
  60000 s | 5            | $3.42/h                | 16.1%                    | $3.10/h               | 5.3%
  30000 s | 8            | $3.71/h                | 25.8%                    | $3.19/h               | 8.4%
  15000 s | 12           | $4.09/h                | 38.7%                    | $3.32/h               | 12.6%
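The sketch below reproduces the figures in the table under two assumptions inferred from the numbers rather than stated on the slide: the 31 application VMs are always billed at the short-term rate, and the "long term" column bills only the backup nodes at the long-term rate; the relative reliability cost is computed against the $2.95/h baseline.

```python
# Sketch reproducing the cost table (billing model inferred, see above).
SHORT, LONG = 0.095, 0.031        # $/h per node (Amazon EC2 Europe prices)
BASE_VMS = 31

base_cost = BASE_VMS * SHORT      # ~$2.95/h without reliability support
for mtbf_s, backups in [(60000, 5), (30000, 8), (15000, 12)]:
    for label, rate in (("short", SHORT), ("long", LONG)):
        total = base_cost + backups * rate
        overhead = (total - base_cost) / base_cost
        print(f"MTBF {mtbf_s:>6} s, {backups:>2} backups, {label}-term lease: "
              f"${total:.2f}/h (+{overhead:.1%})")
```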
Experimental results: application behavior
• Goal: evaluate the application behavior when executing with reliability support
• Application makespan without substrate failures: 1205 s, used as the baseline (NI)
• Database protection (DB): the database is the only protected component; the makespan increases proportionally to the number of failures
• Worker-node protection (WN): only the computing nodes are protected; the makespan increases only slightly
• [Chart: makespan in seconds for NI and for MTBF values of 60000 s, 30000 s, and 15000 s, in the DB and WN scenarios]

  MTBF    | DB: makespan increase | WN: makespan increase
  60000 s | 16.26%                | 0.2%
  30000 s | 26.47%                | 1.7%
  15000 s | 40.08%                | 3.2%
Experimental results: reliability service vs. a resubmission mechanism
• In the resubmission scenario, the application is aware of substrate failures and a failed task is resubmitted on a new computing node
• The makespan difference would have been even larger if the backup nodes had not been pre-allocated and pre-configured
• [Chart: makespan in seconds for the reliability service and for resubmission, for MTBF values of 60000 s, 30000 s, and 15000 s]

  MTBF    | Makespan increase of resubmission over the reliability service
  60000 s | +13.08%
  30000 s | +19.67%
  15000 s | +22.19%
Conclusions
• Reliability becomes a service offered by the infrastructure provider
• We have developed a framework that provides transparent reliability:
  – a language to specify the reliability requirements
  – a mechanism to interpret these requirements and transform them into replicas (nodes and links)
  – a mapping and allocation process to provision the reliability level required by the user
• The framework was implemented on top of the HIPerNet framework and validated on the Grid'5000 testbed
• Future work includes:
  – implementing a mechanism to protect virtual links
  – a detailed investigation of the economic aspects
• Tomorrow there will be a demonstration of the industrial version of the HIPerNet framework (LYaTiss core) – http://www.lyatiss.com/