Grid Aware HA-OSCAR

Kshitij Limaye (1), Box Leangsuksun (1), Venkata K. Munganuru (1), Zeno Greenwood (1), Stephen L. Scott (2), Richard Libby (3), and Kasidit Chanchio (4)

1. Louisiana Tech University   2. Oak Ridge National Laboratory   3. Intel   4. Thammasat University, Thailand

OSCAR'05 Symposium, May 2005
Outline
• Introduction
• Traditional & dual-head architectures
• Proposed framework
• Smart Failover framework
• Experiment
• Planned & unplanned downtime
• Conclusion
• Future work
Introduction
• Scientists across the world have employed grid computing to overcome various resource-level hurdles.
• Clusters are favored job sites in grids.
• High availability becomes increasingly important as critical applications shift to grid systems.
• Although a grid is distributed, inevitable failures can make a site unusable, reducing the overall resource pool and slowing down computation.
Introduction (continued)
• Efforts need to concentrate on making critical systems highly available and on eliminating single points of failure in grids and clusters.
• HA-OSCAR removes the single point of failure of a cluster-based job site (Beowulf) through component redundancy and self-healing capabilities.
• The Smart Failover feature makes the failover mechanism graceful in terms of job management.
Traditional intra-site cluster configuration
• The Site-Manager (the cluster head node running the Globus services) is the node acting as the gateway between the cluster and the grid.
• The Site-Manager is critical for the site to be used to its full potential.
• Failure of the Site-Manager causes the whole site to go unused until it becomes healthy again.
• Outages are non-periodic and unpredictable, so measures should be taken to guarantee high availability of services; hence the proposed architecture.
Critical service monitoring & failover-failback capability for the Site-Manager
[Figure: a client submits an MPI job to the Site-Manager; HA-OSCAR fails over to the standby head node if critical services (gatekeeper, GridFTP, PBS) die; compute nodes sit behind the Site-Manager.]
Proposed framework
• Most current efforts have focused on task-level fault tolerance, such as retrying the job on an alternate site.
• There is a dearth of solutions for fault detection and recovery at the site level.
• HA-OSCAR provides service monitoring and a policy-based recovery mechanism.
• We monitor the gatekeeper and GridFTP services in the service-monitoring sublayer, and fail over (and later fail back) in irreparable situations.
[Figure: layered stack — applications, grid layer, cluster software, operating system — with HA-OSCAR service monitoring and policy-based recovery spanning the layers.]
Grid-enabled HA service
• HA-OSCAR monitors the gatekeeper and GridFTP services every 3 seconds.
• When a service fails to restart after 3 attempts, failover occurs.
• The standby also polls the primary every 3 seconds to check whether it is alive.
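A minimal sketch of the kind of monitoring loop described above, written for illustration only: the service names, restart commands, and the failover_to_standby() hook are assumptions, not the actual HA-OSCAR implementation.

```python
import subprocess
import time

POLL_INTERVAL = 3      # seconds between health checks (matches the 3 s interval above)
MAX_RESTARTS = 3       # restart attempts before declaring the service dead

# Illustrative service names and init-style restart commands (assumptions).
SERVICES = {
    "gatekeeper": ["service", "globus-gatekeeper", "restart"],
    "gridftp":    ["service", "globus-gridftp",    "restart"],
}

def is_alive(name):
    """Crude liveness check: does a process matching this name exist?"""
    return subprocess.call(["pgrep", "-f", name],
                           stdout=subprocess.DEVNULL) == 0

def failover_to_standby():
    """Placeholder for the failover action (e.g. moving the service alias to the standby)."""
    print("all restart attempts failed -> triggering failover to standby head")

def monitor():
    while True:
        for name, restart_cmd in SERVICES.items():
            if is_alive(name):
                continue
            # Service is down: try to restart it a few times before failing over.
            for attempt in range(MAX_RESTARTS):
                subprocess.call(restart_cmd)
                time.sleep(1)
                if is_alive(name):
                    break
            else:
                failover_to_standby()
                return
        time.sleep(POLL_INTERVAL)

if __name__ == "__main__":
    monitor()
```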
Smart Failover framework
• The event monitor triggers the job-queue monitor on events such as JOB_ADD, JOB_COMPLETE, and system events.
• On sensing a change in the job queue, the job-queue monitor triggers the backup updater to update the backup (see the sketch below).
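A rough sketch of the event-to-backup-update flow described above, assuming a simple callback design. The event names come from the slide; the queue-state location, standby hostname, and rsync-based copy are illustrative assumptions.

```python
import subprocess

# Job-queue events named on the slide; the trigger mechanism itself is an assumption.
JOB_EVENTS = {"JOB_ADD", "JOB_COMPLETE"}

QUEUE_STATE = "/var/spool/pbs/server_priv/jobs"   # assumed location of scheduler queue state
STANDBY_HOST = "standby"                          # assumed hostname of the standby head node

def backup_updater():
    """Copy the current job-queue state to the standby head node."""
    # rsync keeps the copy incremental; scp would also work for a small queue.
    subprocess.call(["rsync", "-a", QUEUE_STATE + "/",
                     STANDBY_HOST + ":" + QUEUE_STATE + "/"])

def job_queue_monitor(event):
    """Called by the event monitor; reacts only to job-queue changes."""
    if event in JOB_EVENTS:
        backup_updater()

# Example: the event monitor would invoke this on every scheduler event.
job_queue_monitor("JOB_ADD")
```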
HA-OSCAR in a cluster-based grid environment
• Production-quality open-source Linux-cluster project.
• Combines HA and HPC clustering techniques to enable critical HPC infrastructure.
• Self-configuring, multi-head Beowulf system.
• HA-enabled HPC services: active/hot standby.
• Self-healing with 3-5 second automatic failover time.
• The first known field-grade open-source HA Beowulf cluster release.
Experiment
• Globus Toolkit 3.2
• OSCAR 3.0
• HA-OSCAR beta 1.0
Observations
• Average failover time was 19 seconds and average failback time was 20 seconds.
• Services were restarted within 1-3 seconds, depending on when the last monitoring poll had occurred.

Sample service-monitor alerts:

  #  Group        Service     Type      Time                      Alert
  1  Service_mon  Gatekeeper  Alert     Sun Nov 21 09:10:30 2004  Xinetd alert
  2  Service_mon  Gatekeeper  Up alert  Sun Nov 21 09:10:33 2004  Mail alert

Sample primary-server alerts:

  #  Group           Service  Type      Time                      Alert
  1  Primary_server  Ping     Alert     Sun Nov 21 09:30:20 2004  Server-down alert
  2  Primary_server  Ping     Up alert  Sun Nov 21 09:35:39 2004  Server-up alert
Time needed for jobs to complete with/without Smart Failover
• Assumes jobs start running immediately after the cluster head reboots.
• TLR = time to complete the last running jobs (the jobs that were running when the failure occurred).

  MTTR             Total time without Smart Failover                           Total time with Smart Failover
  120 s (2 min)    120 + run time of predecessors + TLR (running jobs lost)    20 + run time of predecessors - TLR
  600 s (10 min)   600 + run time of predecessors + TLR (running jobs lost)    20 + run time of predecessors - TLR
  3600 s (60 min)  3600 + run time of predecessors + TLR (running jobs lost)   20 + run time of predecessors - TLR
  7200 s (2 h)     7200 + run time of predecessors + TLR (running jobs lost)   20 + run time of predecessors - TLR
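A small numeric illustration of the table's formulas as reconstructed above (20 s is the measured failover time); the example MTTR, predecessor run time, and TLR values are made up.

```python
def completion_time(mttr_s, predecessors_s, tlr_s, smart_failover):
    """Total time for the queued jobs to finish after a head-node failure,
    following the two columns of the table above."""
    if smart_failover:
        return 20 + predecessors_s - tlr_s
    return mttr_s + predecessors_s + tlr_s   # running jobs are lost and rerun

# Example values (assumptions): 1 h MTTR, 2 h of queued predecessors, 30 min TLR.
mttr, pred, tlr = 3600, 7200, 1800
print("without Smart Failover:", completion_time(mttr, pred, tlr, False), "s")  # 12600 s
print("with Smart Failover:   ", completion_time(mttr, pred, tlr, True), "s")   # 5420 s
```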
Planned downtime
• The time taken to set up and configure software adds to the planned downtime.
• We have developed an easy Globus Toolkit configuration helper package.
• It also helps install side packages, such as schedulers, MPI implementations, etc.
• Automating the process helps reduce planned downtime.
Unplanned downtime
• Modeling package used: SPNP.
• Availability for a grid with 4 traditional (single-head) clusters as the intra-site solution: 0.968, i.e. 11.68 days of downtime per year.
• Availability for a grid with HA-OSCAR-enabled clusters as the intra-site solution: 0.99992, i.e. about 2 hours of downtime per year.
• Hence the obvious availability gain.
[Figure: HA-OSCAR-enabled grid vs. traditional grid — availability per year vs. mean time to failure (MTTF, 1000-6000 minutes) for single-head 4-cluster and 10-cluster grids and for HA-OSCAR-enabled 4-cluster and 10-cluster grids.]
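For reference, a one-line conversion from a steady-state availability figure to the expected downtime per year; the availability value used is the traditional-grid figure quoted on the slide.

```python
def annual_downtime_hours(availability):
    """Expected downtime per year implied by a steady-state availability figure."""
    return (1.0 - availability) * 24 * 365

# Traditional 4-cluster grid availability from the slide.
print(annual_downtime_hours(0.968) / 24, "days per year")   # ~11.7 days, matching the slide
```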
Polling overhead measurement
• 20-second failover time.
• 0.9% CPU usage at each monitoring interval.
[Figure: HA-OSCAR network load in packets/min, measured with tcptrace, versus HA-OSCAR Mon polling interval (1-60 s) — a comparison of network usage for different polling intervals.]
Summary
• Institutions have significant investments in resources, and those investments need to be protected.
• "Smart Failover" HA-OSCAR makes failover graceful in terms of job management.
• "Smart Failover" HA-OSCAR with a failover-aware solution for the Site-Manager provides better availability, self-healing, and fault tolerance.
• HA-OSCAR ensures service- and job-level resilience for clusters and grids.
Current status
• Smart Failover feature tested with OSCAR 3.0 and OpenPBS as the scheduler.
• A failover-aware client has been written to achieve resilience for jobs submitted through the grid.
• A lab-grade automated Globus installation package is ready.
Future work
• Develop a wrapper around the scheduler for per-job add/complete events.
• Test the Smart Failover feature with the event-monitoring system.
• Integrate Smart Failover into the next release of HA-OSCAR.
• Research a lazy failback mechanism.
Thank you