
Monitoring and Workflow management - PowerPoint PPT Presentation



  1. Monitoring and Workflow management in large distributed systems. March 2011

  2. The MonALISA Framework. MonALISA is a Dynamic, Distributed Service System capable of collecting any type of information from different systems, analyzing it in near real time, and providing support for automated control decisions and global optimization of workflows in complex grid systems. The MonALISA system is designed as an ensemble of autonomous, multi-threaded, self-describing agent-based subsystems which are registered as dynamic services and are able to collaborate and cooperate in performing a wide range of monitoring tasks. These agents can analyze and process the information in a distributed way and provide optimization decisions in large-scale distributed applications. Iosif Legrand, March 2011

  3. The MonALISA Architecture. Layers, from top to bottom: (a) HL services: Regional or Global High Level Services, Repositories & Clients. (b) Proxies: secure and reliable communication, dynamic load balancing, scalability & replication, AAA for clients. (c) Agents and MonALISA services: distributed system for gathering and analyzing information based on mobile agents (customized aggregation, triggers, actions). (d) Network of JINI-Lookup Services, secure & public: distributed dynamic registration and discovery, based on a lease mechanism and remote events. Fully distributed system with no single point of failure.
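
The lease mechanism named on slide 3 can be illustrated with a minimal sketch. This is not the JINI API and not MonALISA code: the class `LeaseRegistry`, its methods, and the injectable `clock` are hypothetical names chosen for illustration. The point is only that a service which stops renewing its lease disappears from discovery without any explicit deregistration.

```python
import time


class LeaseRegistry:
    """Toy lookup service (hypothetical): entries expire unless renewed."""

    def __init__(self, lease_seconds, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock          # injectable so the behaviour can be tested
        self._expiry = {}           # service name -> time at which the lease ends

    def register(self, name):
        # Grant a fresh lease starting now.
        self._expiry[name] = self.clock() + self.lease_seconds

    def renew(self, name):
        # Renewing is the same as re-registering: the expiry moves forward.
        self.register(name)

    def discover(self):
        # Drop every service whose lease has run out, then list the rest.
        now = self.clock()
        self._expiry = {n: t for n, t in self._expiry.items() if t > now}
        return sorted(self._expiry)
```

With a 10-second lease, a service that registered at t=0 and never renewed is no longer returned by `discover()` after t=10, which is how a crashed service is removed without a central failure detector.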

  4. MonALISA Service & Data Handling. [Diagram: Lookup Services (registration, discovery); Data Store (Postgres); Data Cache Service & DB; Web Service (WSDL, SOAP) for WS clients and services; data delivered via ML Proxy to clients or higher level services, using predicates & agents; configuration control (SSL) by applications; agents; filters / triggers; monitoring modules with dynamic loading, collecting any type of information, push and pull.]

  5. Registration / Discovery; Admin Access and AAA for Clients. [Diagram: MonALISA services register (signed certificate) with the Lookup Services; clients (or other services) and applications discover them through the Lookup Services; data flows from the MonALISA services to clients through Proxy Multiplexers running filters & agents; client authentication is handled by AAA services; admin access uses an SSL connection checked against a trust keystore.]

  6. Monitoring Grid sites, Running Jobs, Network Traffic, and Connectivity. [Screenshots: Running Jobs, JOBS, TOPOLOGY, ACCOUNTING.]

  7. Monitoring CMS Jobs Worldwide. CMS is using MonALISA and ApMon to monitor all the production and analysis jobs. This information is then used in the CMS dashboard frontend. Rate of collected monitoring values: rates up to more than 6000 values per second. Total collected values: ~5*10^10 monitoring values collected in the last 12 months. Lost in UDP: < 5*10^-6. Organize and structure Monitoring Information. More than 3 years of continuous operation without any problems.
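
The figures on slide 7 can be cross-checked with a short calculation. The sketch below uses only the numbers quoted on the slide; it assumes a 365-day year and assumes that the quoted UDP loss fraction applies to the same 12-month total, which the slide does not state explicitly.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: "12 months" taken as 365 days

total_values = 5e10        # "~5*10^10 monitoring values in the last 12 months"
peak_rate = 6000           # "more than 6000 values per second"
udp_loss_fraction = 5e-6   # "Lost in UDP < 5*10^-6", treated as an upper bound

# Average collection rate implied by the yearly total.
average_rate = total_values / SECONDS_PER_YEAR

# Upper bound on lost values, assuming the loss fraction applies to the total.
max_lost_values = total_values * udp_loss_fraction

print(f"average rate: about {average_rate:.0f} values/s")
print(f"average / peak: about {average_rate / peak_rate:.2f}")
print(f"lost values: fewer than about {max_lost_values:.0f}")
```

Under these assumptions the average works out to roughly 1,585 values per second, about a quarter of the quoted peak, and the loss bound corresponds to at most about 250,000 values out of 5*10^10.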

  8. Monitoring CMS Jobs Worldwide. Organize and structure Monitoring Information. User-level task monitoring.

  9. Monitoring CMS Jobs Worldwide. Organize and structure Monitoring Information. User-level task monitoring.

  10. Monitoring architecture in ALICE. [Diagram: ApMon instances attached to the AliEn components (Job Agents, CE, SE, Cluster Monitor, TQ, IS, Optimizers, Brokers), the MySQL servers, CastorGrid scripts, API services, MyProxy status and LCG tools report parameters such as job status, run time, vsz, rss, cpu ksi2k, processes, sockets, open files, job slots, net in/out, active sessions, migrated mbytes and disk used to MonALISA @Site, the LCG Site service and MonALISA @CERN. Aggregated data goes to the MonaLisa Repository with its long history DB, which produces alerts and actions.]

  11. ALICE: Global Views, Status & Jobs. http://pcalimonitor.cern.ch

  12. Monitoring in ALICE: jobs, resources, services.

  13. Monitoring in ALICE: Xrootd servers. Iosif Legrand, August 2009

  14. Active Available Bandwidth measurements between all the ALICE grid sites.

  15. Active Available Bandwidth measurements between all the ALICE grid sites (2).

  16. Local and Global Decision Framework. Two levels of decisions: local (autonomous) and global (correlations). Each ML Service takes local decisions, with actions based on local information from its sensors (Temperature, Humidity, A/C Power, …; Traffic, Jobs, Hosts, Apps). Global ML Services take global decisions, with actions based on global information. Actions are triggered by: values above/below given thresholds; absence/presence of values; correlations between any values. Action types: alerts (emails/instant msg/atom feeds); automatic chart annotations in the repository; running custom code, like securely ordering ML services to change connectivity (optimize traffic), submit jobs, or (re)start a global service.
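
The three trigger kinds listed on slide 16 (thresholds, absence/presence of values, correlations between values) can be sketched as composable predicates. This is an illustrative sketch only, not the MonALISA trigger API: `Trigger`, `above`, `absent`, `both`, `evaluate` and the parameter names used with them are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Values = Dict[str, float]  # latest value per monitored parameter (hypothetical shape)


@dataclass
class Trigger:
    name: str
    condition: Callable[[Values], bool]   # when to fire
    action: Callable[[str], None]         # what to do (alert, annotate, run code)


def above(param: str, limit: float) -> Callable[[Values], bool]:
    # Threshold trigger: fires when the value is present and exceeds the limit.
    return lambda values: param in values and values[param] > limit


def absent(param: str) -> Callable[[Values], bool]:
    # Absence trigger: fires when no value was reported for the parameter.
    return lambda values: param not in values


def both(a: Callable[[Values], bool], b: Callable[[Values], bool]) -> Callable[[Values], bool]:
    # Simple correlation: fires only when two conditions hold together.
    return lambda values: a(values) and b(values)


def evaluate(triggers: List[Trigger], values: Values) -> List[str]:
    # Run the action of every trigger whose condition holds; return their names.
    fired = []
    for trigger in triggers:
        if trigger.condition(values):
            trigger.action(trigger.name)
            fired.append(trigger.name)
    return fired
```

A local decision would evaluate such triggers against one service's own values, while a global decision would evaluate them against values merged from many services; the evaluation loop is the same in both cases.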

  17. ALICE: Automatic job submission; Restarting Services. The MySQL daemon is automatically restarted when it runs out of memory. Trigger: threshold on VSZ memory usage. The ALICE production jobs queue is kept full by the automatic submission. Trigger: threshold on the number of aliprod waiting jobs. Administrators are kept up-to-date on the services' status. Trigger: presence/absence of monitored information.
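
The two threshold triggers on slide 17 amount to small decision functions. The sketch below is an assumption-laden illustration: the function names, the `max_batch` cap and the concrete limits in the usage note are hypothetical and do not come from the ALICE configuration.

```python
def jobs_to_submit(waiting_jobs: int, target_waiting: int, max_batch: int) -> int:
    """How many jobs to submit so the waiting queue climbs back to its target.

    max_batch (hypothetical) caps a single submission round so one trigger
    firing cannot flood the queue.
    """
    deficit = max(0, target_waiting - waiting_jobs)
    return min(deficit, max_batch)


def should_restart(vsz_bytes: int, vsz_limit_bytes: int) -> bool:
    """Restart decision for a daemon whose virtual memory size is monitored."""
    return vsz_bytes > vsz_limit_bytes
```

For example, with a target of 500 waiting jobs and a batch cap of 200, a queue holding 120 waiting jobs leads to 200 new submissions, while a queue holding 450 leads to 50.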

  18. Automatic actions in ALICE. ALICE is using the monitoring information to automatically: resubmit error jobs until a target completion percentage is reached; submit new jobs when necessary (watching the task queue size for each service account), both production jobs and RAW data reconstruction jobs, for each pass; restart site services whenever tests of VoBox services fail but the central services are OK; send email notifications / add chart annotations when a problem was not solved by a restart; dynamically modify the DNS aliases of central services for efficient load balancing. Most of the actions are defined by configuration files of only a few lines.
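
The first action on slide 18, resubmitting error jobs until a target completion percentage is reached, can be sketched as follows. The state labels `"DONE"`, `"ERROR"` and `"RUNNING"` and the function name are hypothetical placeholders for illustration, not the actual AliEn job states or API.

```python
from typing import Dict, List


def error_jobs_to_resubmit(states: Dict[str, str], target_percent: float) -> List[str]:
    """Return the ids of failed jobs to resubmit, or [] once the target is met.

    states maps a job id to a state label; the labels used here are
    illustrative assumptions.
    """
    if not states:
        return []
    done = sum(1 for state in states.values() if state == "DONE")
    completion_percent = 100.0 * done / len(states)
    if completion_percent >= target_percent:
        return []  # target reached: remaining errors are left alone
    return sorted(job for job, state in states.items() if state == "ERROR")
```

Calling this on every monitoring update gives the "until a target is reached" behaviour: failed jobs keep being resubmitted while completion is below the target, and resubmission stops as soon as the target is met.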
