Grid Aware HA-OSCAR

Kshitij Limaye (1), Box Leangsuksun (1), Venkata K. Munganuru (1), Zeno Greenwood (1), Stephen L. Scott (2), Richard Libby (3), and Kasidit Chanchio (4)

1. Louisiana Tech University   2. Oak Ridge National Laboratory   3. Intel   4. Thammasat University, Thailand

OSCAR'05 Symposium, May 2005
Outline
• Introduction
• Traditional & dual-head architectures
• Proposed framework
• Smart Failover framework
• Experiment
• Planned & unplanned downtime
• Conclusion
• Future work
Introduction
• Scientists across the world have employed grid computing to overcome various resource-level hurdles.
• Clusters are favored job sites in grids.
• High availability becomes increasingly important as critical applications shift to grid systems.
• Although a grid is distributed, inevitable failures can make a site unusable, reducing the overall resource pool and slowing down computation.
Introduction (continued)
• Efforts need to concentrate on making critical systems highly available and on eliminating single points of failure in grids and clusters.
• HA-OSCAR removes the single point of failure of a cluster-based job site (Beowulf) through component redundancy and self-healing capabilities.
• The Smart Failover feature makes the failover mechanism graceful in terms of job management.
Traditional intra-site cluster configuration
• The Site-Manager (the cluster head node running the Globus services) is the node acting as the gateway between the cluster and the grid.
• The Site-Manager is critical for the site to be used to its full potential.
• Failure of the Site-Manager causes the whole site to go unused until it becomes healthy again.
• Outages are non-periodic and unpredictable, so measures should be taken to guarantee high availability of services; hence the proposed architecture.
Critical service monitoring & failover-failback capability for the Site-Manager
[Figure: a client submits an MPI job to the Site-Manager; HA-OSCAR fails over to the standby head node if critical services (gatekeeper, GridFTP, PBS) die; compute nodes sit behind the Site-Manager.]
Proposed framework
• Most current efforts have focused on task-level fault tolerance, such as retrying the job on an alternate site.
• There is a dearth of solutions for fault detection and recovery at the site level.
• HA-OSCAR provides service monitoring and a policy-based recovery mechanism.
• We monitor the gatekeeper and GridFTP services in the service-monitoring sublayer, and fail over (and later fail back) in irreparable situations.
[Figure: layered stack — applications, grid layer, cluster software, operating system — with HA-OSCAR service monitoring and policy-based recovery spanning the layers.]
Grid-enabled HA service
• HA-OSCAR monitors the gatekeeper and GridFTP services every 3 seconds.
• When a service fails to restart after 3 attempts, failover occurs.
• The standby also polls the primary every 3 seconds to check whether it is alive.
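A minimal sketch of the kind of monitoring loop described above, written for illustration only: the service names, restart commands, and the failover_to_standby() hook are assumptions, not the actual HA-OSCAR implementation.

```python
import subprocess
import time

POLL_INTERVAL = 3      # seconds between health checks (matches the 3 s interval above)
MAX_RESTARTS = 3       # restart attempts before declaring the service dead

# Illustrative service names and init-style restart commands (assumptions).
SERVICES = {
    "gatekeeper": ["service", "globus-gatekeeper", "restart"],
    "gridftp":    ["service", "globus-gridftp",    "restart"],
}

def is_alive(name):
    """Crude liveness check: does a process matching this name exist?"""
    return subprocess.call(["pgrep", "-f", name],
                           stdout=subprocess.DEVNULL) == 0

def failover_to_standby():
    """Placeholder for the failover action (e.g. moving the service alias to the standby)."""
    print("all restart attempts failed -> triggering failover to standby head")

def monitor():
    while True:
        for name, restart_cmd in SERVICES.items():
            if is_alive(name):
                continue
            # Service is down: try to restart it a few times before failing over.
            for attempt in range(MAX_RESTARTS):
                subprocess.call(restart_cmd)
                time.sleep(1)
                if is_alive(name):
                    break
            else:
                failover_to_standby()
                return
        time.sleep(POLL_INTERVAL)

if __name__ == "__main__":
    monitor()
```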
Smart Failover framework
• The event monitor triggers the job-queue monitor on events such as JOB_ADD, JOB_COMPLETE, and system events.
• On sensing a change in the job queue, the job-queue monitor triggers the backup updater to update the backup (see the sketch below).
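A rough sketch of the event-to-backup-update flow described above, assuming a simple callback design. The event names come from the slide; the queue-state location, standby hostname, and rsync-based copy are illustrative assumptions.

```python
import subprocess

# Job-queue events named on the slide; the trigger mechanism itself is an assumption.
JOB_EVENTS = {"JOB_ADD", "JOB_COMPLETE"}

QUEUE_STATE = "/var/spool/pbs/server_priv/jobs"   # assumed location of scheduler queue state
STANDBY_HOST = "standby"                          # assumed hostname of the standby head node

def backup_updater():
    """Copy the current job-queue state to the standby head node."""
    # rsync keeps the copy incremental; scp would also work for a small queue.
    subprocess.call(["rsync", "-a", QUEUE_STATE + "/",
                     STANDBY_HOST + ":" + QUEUE_STATE + "/"])

def job_queue_monitor(event):
    """Called by the event monitor; reacts only to job-queue changes."""
    if event in JOB_EVENTS:
        backup_updater()

# Example: the event monitor would invoke this on every scheduler event.
job_queue_monitor("JOB_ADD")
```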
HA-OSCAR in a cluster-based grid environment
• Production-quality open-source Linux-cluster project.
• Combines HA and HPC clustering techniques to enable critical HPC infrastructure.
• Self-configuring, multi-head Beowulf system.
• HA-enabled HPC services: active/hot standby.
• Self-healing with 3-5 second automatic failover time.
• The first known field-grade open-source HA Beowulf cluster release.
Experiment
• Globus Toolkit 3.2
• OSCAR 3.0
• HA-OSCAR beta 1.0
Observations
• Average failover time was 19 seconds and average failback time was 20 seconds.
• Services were restarted within 1-3 seconds, depending on when the last monitoring poll had occurred.

Sample service-monitor alerts:

  #  Group        Service     Type      Time                      Alert
  1  Service_mon  Gatekeeper  Alert     Sun Nov 21 09:10:30 2004  Xinetd alert
  2  Service_mon  Gatekeeper  Up alert  Sun Nov 21 09:10:33 2004  Mail alert

Sample primary-server alerts:

  #  Group           Service  Type      Time                      Alert
  1  Primary_server  Ping     Alert     Sun Nov 21 09:30:20 2004  Server-down alert
  2  Primary_server  Ping     Up alert  Sun Nov 21 09:35:39 2004  Server-up alert
Time needed for jobs to complete with/without Smart Failover
• Assumes jobs start running immediately after the cluster head reboots.
• TLR = time to complete the last running jobs (the jobs that were running when the failure occurred).

  MTTR             Total time without Smart Failover                           Total time with Smart Failover
  120 s (2 min)    120 + run time of predecessors + TLR (running jobs lost)    20 + run time of predecessors - TLR
  600 s (10 min)   600 + run time of predecessors + TLR (running jobs lost)    20 + run time of predecessors - TLR
  3600 s (60 min)  3600 + run time of predecessors + TLR (running jobs lost)   20 + run time of predecessors - TLR
  7200 s (2 h)     7200 + run time of predecessors + TLR (running jobs lost)   20 + run time of predecessors - TLR
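A small numeric illustration of the table's formulas as reconstructed above (20 s is the measured failover time); the example MTTR, predecessor run time, and TLR values are made up.

```python
def completion_time(mttr_s, predecessors_s, tlr_s, smart_failover):
    """Total time for the queued jobs to finish after a head-node failure,
    following the two columns of the table above."""
    if smart_failover:
        return 20 + predecessors_s - tlr_s
    return mttr_s + predecessors_s + tlr_s   # running jobs are lost and rerun

# Example values (assumptions): 1 h MTTR, 2 h of queued predecessors, 30 min TLR.
mttr, pred, tlr = 3600, 7200, 1800
print("without Smart Failover:", completion_time(mttr, pred, tlr, False), "s")  # 12600 s
print("with Smart Failover:   ", completion_time(mttr, pred, tlr, True), "s")   # 5420 s
```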
Planned downtime
• The time taken to set up and configure software adds to the planned downtime.
• We have developed an easy Globus Toolkit configuration helper package.
• It also helps install side packages, such as schedulers, MPI implementations, etc.
• Automating the process helps reduce planned downtime.
Unplanned downtime
• Modeling package used: SPNP.
• Availability for a grid with 4 traditional (single-head) clusters as the intra-site solution: 0.968, i.e. 11.68 days of downtime per year.
• Availability for a grid with HA-OSCAR-enabled clusters as the intra-site solution: 0.99992, i.e. about 2 hours of downtime per year.
• Hence the obvious availability gain.
[Figure: HA-OSCAR-enabled grid vs. traditional grid — availability per year vs. mean time to failure (MTTF, 1000-6000 minutes) for single-head 4-cluster and 10-cluster grids and for HA-OSCAR-enabled 4-cluster and 10-cluster grids.]
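For reference, a one-line conversion from a steady-state availability figure to the expected downtime per year; the availability value used is the traditional-grid figure quoted on the slide.

```python
def annual_downtime_hours(availability):
    """Expected downtime per year implied by a steady-state availability figure."""
    return (1.0 - availability) * 24 * 365

# Traditional 4-cluster grid availability from the slide.
print(annual_downtime_hours(0.968) / 24, "days per year")   # ~11.7 days, matching the slide
```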
Polling overhead measurement
• 20-second failover time.
• 0.9% CPU usage at each monitoring interval.
[Figure: HA-OSCAR network load in packets/min, measured with tcptrace, versus HA-OSCAR Mon polling interval (1-60 s) — a comparison of network usage for different polling intervals.]
Summary
• Institutions have significant investments in resources, and those investments need to be protected.
• "Smart Failover" HA-OSCAR makes failover graceful in terms of job management.
• "Smart Failover" HA-OSCAR with a failover-aware solution for the Site-Manager provides better availability, self-healing, and fault tolerance.
• HA-OSCAR ensures service- and job-level resilience for clusters and grids.
Current status
• Smart Failover feature tested with OSCAR 3.0 and OpenPBS as the scheduler.
• A failover-aware client has been written to achieve resilience for jobs submitted through the grid.
• A lab-grade automated Globus installation package is ready.
Future work
• Develop a wrapper around the scheduler for per-job add/complete events.
• Test the Smart Failover feature with the event-monitoring system.
• Integrate Smart Failover into the next release of HA-OSCAR.
• Research a lazy failback mechanism.
Thank you