Fabric Management with ELFms
Presented by U. Schwickerath – CERN/IT
Outline
• The ELFms framework
• Quattor
• Lemon
• SLS
• LEAF
German Cancio – CERN/IT
Fabric Management with ELFms (I)
ELFms stands for ‘Extremely Large Fabric management system’
Subsystems:
• Quattor: configuration, installation and management of nodes
• Lemon: system / service monitoring
• LEAF: hardware / state management
[Diagram: a node, with Configuration Management and Node Management]
• ELFms manages and controls most of the nodes in the CERN CC
  • ~4700 nodes out of ~5500, and increasing
  • Multiple functionalities and cluster sizes (batch nodes, disk servers, tape servers, DB, web, …)
  • Heterogeneous hardware (CPU, memory, HD size, …)
  • Supported OS: Linux (RHES 3/4, Scientific Linux 3/4, 32/64-bit) and Solaris (RIP…)
Fabric Management with ELFms (II)
• ELFms (Quattor/Lemon) was started within the scope of EU DataGrid.
• Development is now coordinated by CERN/IT in collaboration with other HEP institutes.
• Quattor/Lemon are used in production inside and outside CERN
  • LCG T1/T2 sites, ranging from 50 to 1000 nodes per site
  • Complete configuration of system and LCG Grid middleware via Quattor
  • Integration with Grid services, e.g. monitoring (GridICE, MonALISA)
Quattor
http://quattor.org
Quattor
Quattor takes care of the configuration, installation and management of fabric nodes
• A Configuration Database holds the ‘desired state’ of all fabric elements
  • Node setup (CPU, HD, memory, software RPMs/PKGs, network, system services, location, audit info, …)
  • Cluster (name and type, batch system, load balancing info, …)
• Autonomous management agents running on the node for
  • Base installation
  • Service (re-)configuration
  • Software installation and management
Architecture
[Architecture diagram; labels recovered from the figure]
• Configuration server: CDB with SQL and XML backends; access via CLI, GUI and scripts (SOAP), and via SQL
• XML configuration profiles, delivered over HTTP
• Install server: Install Manager, base OS, system installer (HTTP / PXE)
• SW server(s): SW repository holding RPMs (HTTP)
• Managed nodes:
  • Node Configuration Manager (NCM) with components CompA, CompB, CompC for ServiceA, ServiceB, ServiceC
  • SW Package Manager (SPMA) for RPMs / PKGs
Configuration Information
• Configuration is expressed using a language called Pan
• Information is arranged into templates
  • Common properties set only once
• Using templates it is possible to create hierarchies to match service structures
[Diagram: template hierarchy; labels recovered from the figure]
• CERN CC: name_srv1: 137.138.16.5, time_srv1: ip-time-1
  • lxbatch: cluster_name: lxbatch, master: lxmaster01, pkg_add (lsf5.1)
  • lxplus: cluster_name: lxplus, pkg_add (lsf5.1)
    • Nodes: lxplus001, lxplus020, lxplus029
    • Node-level properties shown: eth0/ip: 137.138.4.246, eth0/ip: 137.138.4.225, pkg_add (lsf6_beta)
  • disk_srv
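The hierarchy idea can be sketched in a few lines of Python. This is not Pan syntax and not how the Pan compiler works; it only illustrates, under simple assumptions, how properties set once at site level are inherited by clusters and nodes, while more specific levels override or extend them. The function name `resolve_profile`, the dictionary layout, and the node address `192.0.2.10` are hypothetical.

```python
def resolve_profile(levels):
    """Merge template levels, ordered from most general to most specific."""
    profile = {"packages": []}
    for level in levels:
        for key, value in level.items():
            if key == "pkg_add":
                # Package additions accumulate down the hierarchy.
                profile["packages"].extend(value)
            else:
                # Plain properties: the more specific level wins.
                profile[key] = value
    return profile


# Site-wide values taken from the slide; set only once.
site = {"name_srv1": "137.138.16.5", "time_srv1": "ip-time-1"}
# Cluster-level template.
cluster = {"cluster_name": "lxplus", "pkg_add": ["lsf5.1"]}
# Hypothetical node-level template (address is a documentation-range IP).
node = {"eth0/ip": "192.0.2.10", "pkg_add": ["lsf6_beta"]}

profile = resolve_profile([site, cluster, node])
print(profile)
```

The node never repeats the name server or time server settings; it only states what differs from its cluster.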
Quattor Deployment
• Quattor in complete control of Linux boxes (~4700 nodes, to grow to ~6000-8000 in 2008)
  • CDB holding information of all systems in CERN-CC
  • Over 100 NCM configuration components developed
    • From basic system configuration to Grid services setup … (including desktops)
  • SPMA used for managing all software
    • Security and functional updates (including kernel upgrades)
    • E.g. KDE security upgrade (~300 MB per node) and LSF client upgrade in 30 mins, without service interruption
    • Handles (occasional) downgrades as well
• Developments ongoing:
  • CDB: fine-grained ACL protection for templates, namespaces, improved SQL backend, …
  • Security: deployment of HTTPS instead of HTTP (usage of host certificates)
  • Proxy architecture for enhanced scalability …
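A minimal sketch of the desired-state idea behind software management follows: compare the package list a node should have with what is installed, and derive the operations needed, including downgrades. This is an illustration only and does not reproduce SPMA's actual algorithm or interfaces. The function names, the dotted-integer version format, the package names, and the assumption that unlisted packages are removed are all hypothetical.

```python
def parse_version(version):
    """Turn '3.3.1' into (3, 3, 1); assumes purely numeric dotted versions."""
    return tuple(int(part) for part in version.split("."))


def plan_operations(desired, installed):
    """Return the operations that bring 'installed' in line with 'desired'."""
    operations = []
    for name in sorted(desired.keys() | installed.keys()):
        want = desired.get(name)
        have = installed.get(name)
        if have is None:
            operations.append(("install", name, want))
        elif want is None:
            # Assumption: anything not in the desired list is removed.
            operations.append(("remove", name, have))
        elif parse_version(want) > parse_version(have):
            operations.append(("upgrade", name, want))
        elif parse_version(want) < parse_version(have):
            operations.append(("downgrade", name, want))
        # Equal versions need no operation.
    return operations


# Hypothetical package names and versions.
desired = {"lsf": "6.0", "kdelibs": "3.3.1", "openssh": "3.9"}
installed = {"lsf": "5.1", "kdelibs": "3.3.2", "telnet": "0.17"}

operations = plan_operations(desired, installed)
for operation in operations:
    print(operation)
```

Because the plan is derived from the difference between two states, a downgrade is just another outcome of the comparison rather than a special case.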
Proxy server setup
[Diagram; labels recovered from the figure]
• Backend (“Master”): server cluster M, M’ holding installation images, RPMs and configuration profiles
• Frontend: L1 proxies, DNS-load balanced HTTP
• L2 proxies (“Head” nodes, H) serving Rack 1, Rack 2, …, Rack N
Quattor outside CERN
• Many sites (a dozen, including LAL/IN2P3, NIKHEF, DESY, …) adopt Quattor as fabric management framework …
  • See Quattor tool survey: quattor.org/documentation/misc/feedback-poll-0605.htm
• … leading to improved core software robustness and completeness
  • Identified and removed site dependencies and assumptions
  • Documentation, installation guides, bug tracking, release cycles
• Components available for a fully automated LCG configuration
Lemon
http://cern.ch/lemon
Lemon – LHC Era Monitoring
[Architecture diagram; labels recovered from the figure]
• Nodes: Monitoring Agent with Sensors, sending samples over TCP/UDP
• Monitoring Repository with SQL backend, accessed via SOAP
• Correlation Engines (SOAP)
• RRDTool / PHP on apache, served over HTTP
• User workstations: web browser, Lemon CLI
Deployment and Enhancements
• Smooth production running of Monitoring Agent and Oracle-based repository at CERN-CC
  • ~400 metrics, sampled at intervals from 30 s to 1 day; ~2 GB of data per day on ~4500 nodes
• Usage outside CERN-CC, collaborations
  • GridICE (>100 LCG sites)
  • CMS-Online
  • IN2P3
  • INFN/CNAF
  • Others …
• Correlation and Fault Recovery
  • Light-weight local self-healing module (e.g. /tmp cleanup, restart daemons)
• Security for sample transport (TCP and UDP) (BARC)
• Status and performance visualization pages …
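To put the quoted volume in perspective, the figures above can be turned into rough per-node and aggregate rates. This is back-of-the-envelope arithmetic only, assuming 1 GB = 10^9 bytes and an even spread across nodes and across the day, neither of which the slide states.

```python
NODES = 4500
BYTES_PER_DAY = 2 * 10**9          # ~2 GB/day, assuming 1 GB = 10^9 bytes
SECONDS_PER_DAY = 24 * 60 * 60

# Average volume contributed by one node in a day.
per_node_per_day = BYTES_PER_DAY / NODES
# Average rate arriving at the repository, all nodes combined.
aggregate_rate = BYTES_PER_DAY / SECONDS_PER_DAY

print(f"{per_node_per_day / 1e3:.0f} kB per node per day")
print(f"{aggregate_rate / 1e3:.1f} kB/s into the repository")
```

Under these assumptions each node contributes on the order of a few hundred kilobytes per day, and the repository ingests a few tens of kilobytes per second on average.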
Monitoring the Fabric
Using a web-based status display:
• CC Overview
• Clusters and nodes
• VOs
• Power
• Error trending
• Batch system
LAS (Lemon Alarm System)
• Alarm system for operators
  • Allows round-the-clock (24/7) operators to receive, acknowledge, ignore, hide and process alarms received via Lemon
  • Recently put into production at CERN, replacing the legacy SURE system
Quattor-Lemon integration
Quattor and Lemon are tightly integrated at CERN
• Note however that Quattor and Lemon have no mutual dependencies!
• Configuration of Lemon Agent and Server:
  • CDB holds definitions of all sensors, metric classes, and metric instances
  • An NCM component (ncm-fmonagent) generates the Agent config file
  • Another NCM component updates the Oracle Server configuration
• Configuration of Lemon Web Pages:
  • Information on which clusters exist, and which nodes belong to which cluster, is extracted from CDBSQL
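The pattern described here, a component that turns central definitions into a local config file, can be sketched as follows. The slides do not show the real ncm-fmonagent logic or the Lemon agent file format, so the output syntax, the function name `render_agent_config`, and all sensor and metric values below are hypothetical.

```python
def render_agent_config(sensors, metrics):
    """Render sensor and metric definitions into a text config (made-up format)."""
    lines = []
    for name in sorted(sensors):
        lines.append(f"sensor {name} command={sensors[name]}")
    for metric in sorted(metrics, key=lambda m: m["id"]):
        # Refuse to emit a metric that points at an undefined sensor.
        if metric["sensor"] not in sensors:
            raise ValueError(f"metric {metric['id']} uses unknown sensor")
        lines.append(
            f"metric {metric['id']} name={metric['name']} "
            f"sensor={metric['sensor']} interval={metric['interval_s']}"
        )
    return "\n".join(lines) + "\n"


# Hypothetical definitions, standing in for what a configuration database would hold.
sensors = {"linux": "/usr/bin/sensor-linux"}
metrics = [
    {"id": 2, "name": "load_avg", "sensor": "linux", "interval_s": 30},
    {"id": 1, "name": "uptime", "sensor": "linux", "interval_s": 86400},
]

config_text = render_agent_config(sensors, metrics)
print(config_text, end="")
```

Generating the file from central definitions means the agent configuration on every node can be regenerated whenever those definitions change, instead of being edited by hand.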
Quattor-Lemon integration (II)
• Visualization of Quattor configuration
  • Indexed CDB templates available, linked to node and cluster status pages
  • XML profiles display
• Alarm generation
  • E.g. generate an alarm if the configured kernel version differs from the actual one
• Visualization of CC equipment
  • Geometry of CC (racks, robots, etc.)
  • Location of each node in the CC (which rack)
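The kernel example above amounts to comparing a configured value with a measured one. A minimal sketch, assuming the configured version is available as a plain string and the running version comes from Python's `platform.release()`; the function name, the alarm dictionary layout, the node name and the kernel strings are hypothetical and do not reflect Lemon's actual alarm format.

```python
import platform


def kernel_alarm(node, configured, actual=None):
    """Return an alarm record if configured and running kernels differ, else None."""
    if actual is None:
        # Ask the local system for the running kernel release.
        actual = platform.release()
    if configured == actual:
        return None
    return {
        "node": node,
        "alarm": "kernel_mismatch",
        "configured": configured,
        "actual": actual,
    }


# Hypothetical values: the node was configured for a newer kernel but not yet rebooted.
alarm = kernel_alarm("node001", "2.6.9-42.EL", actual="2.6.9-34.EL")
print(alarm)
```

A typical cause of such a mismatch is a kernel upgrade that has been installed but not yet activated by a reboot.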
SLS
http://cern.ch/sls
SLS (Service Level Status)
• Service-based views (user/management perspective)
• Synoptic view of which services are running and how well; appropriate for end users and managers
• http://cern.ch/sls
• See screenshots on the next slides
SLS
Using a web-based status display:
• (Meta-)Services Overview
• Drilling down to one meta-service
• More details: Tier-1 sites
• A specific Tier-1 site: availability history
• Service-specific information
• Other entry views: what services users are interested in