GRNET NOC
Use Puppet and network inventory to populate Nagios/Icinga configuration
TF-NOC Dublin
Alexandros Kosiaris (alex@noc.grnet.gr)
http://www.grnet.gr
Network & Equipment

Optical Network:
    ~70 cities (+30 within next year)
    15-year leased dark fiber
    DWDM/CWDM network
Optical Equipment:
    Alcatel 1626LM, 1696MS, 1678MCC
    Adva FSP2000
Routing Equipment:
    Juniper T1600, Juniper MX960
    ~10x Cisco 12000s, a few Cisco 7200s/7300s
Switching Equipment:
    Cisco 6500
    Several Cisco 3750, Cisco 2970, Juniper ex4200, Extreme X450a/X350

• Storage Equipment:
    Netapp/IBM N5300
    EMC Celerra NS-480
• Computing Equipment:
    Virtualization (KVM)
    12 Blade servers, HP BL-460c
    12 IBM 1U Servers
    128 1U Fujitsu Servers
    275 2U HP Proliant Servers
    ~200 VMs
Nagios + Network Equipment
or (more accurately) Switching and Routing

In-house developed Network Inventory (a.k.a. GRNETDB)
• A MySQL database of almost 150 tables
• Populated multiple times a day by a PHP discovery script
    SNMP, telnet + expect
• Basic concepts:
    Node
    Interface
    Layer
    Domain
    Location
• These concepts get extended to represent functionality:
    Routing, Switching nodes
    Layer2, Layer3 interfaces
    Switching, administrative domains
Nagios + Network Equipment
or (more accurately) Switching and Routing

In-house developed Python Django project, with multiple sub-apps
• Network (the interface to the database)
• RG (router graphs, take a peek at http://mon.grnet.gr/rg)
• Maps (take a look at http://mon.grnet.gr/network/maps)
• Hostmaster
• Optical network (built mostly on Location info)
• Nadjicingo
    Builds on the network app and generates a Nagios/Icinga configuration
• Nagvis
    Same thing, but generates/updates the NagVis config
Nadjicingo

A Django management command outputting Nagios/Icinga configuration
• Run by crontab every hour (manage.py nadjicingo)
• Generates Nagios configuration objects for:
    Routers
    Switches
    Interfaces
• L3 topology aware: populates the parents field for most devices (Nagios hates cyclic dependencies, i.e. redundant links)
• Hardware checks on devices
• Business logic embedded in interface descriptions. Part of each description is a unique identifier for a customer's link:
    - [.NTUA-4] => National Technical University's L3 link
    - [AUTH@ERMOU-1] => Aristotle University of Thessaloniki L2 link at Ermou PoP
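One way to keep the parents field free of cycles despite redundant links is to derive it from a breadth-first walk of the topology, so that each device gets a single parent. The sketch below is only an illustration under that assumption; the slides do not say how Nadjicingo actually picks parents. The function names, the device names and the choice of monitoring root are all hypothetical, and the rendered host objects are deliberately minimal (real ones need further directives such as an address or a template).

```python
from collections import deque


def assign_parents(links, root):
    """Breadth-first walk of the topology, starting at the monitoring root.

    Each device gets exactly one parent (the neighbour through which it was
    first reached), so redundant links can never produce a cycle.
    """
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    parents = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for peer in sorted(neighbours.get(node, ())):  # sorted for stable output
            if peer not in parents:
                parents[peer] = node
                queue.append(peer)
    return parents


def render_host(name, parent):
    """Render a minimal Nagios host object; 'parents' is omitted for the root."""
    lines = ["define host {", f"    host_name  {name}"]
    if parent is not None:
        lines.append(f"    parents    {parent}")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical topology: a ring of three routers plus one switch.
links = [("core1", "core2"), ("core2", "core3"), ("core3", "core1"), ("core2", "sw1")]
parents = assign_parents(links, root="core1")
config = "\n\n".join(render_host(n, p) for n, p in sorted(parents.items()))
print(config)
```

The ring link between core2 and core3 is simply not used as a parent relation, which is what keeps the dependency graph a tree.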
Nagvis

A Django management command (again...)
• Run by crontab every hour (manage.py nagvis)
• Updates a specific NagVis map configuration by:
    Removing obsolete nodes
    Adding new nodes to a special area for manual positioning on the map
• Also features an automated positioning mode based on each device's latitude and longitude. Nice for showing off, but not for an overview in monitoring applications
• Only populates host objects in the map
• Service objects cluttered it too much, and that information is readily available anyway
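The update step above amounts to two set differences between the hosts already on the map and the hosts in the inventory, and the automated positioning mode to a projection from coordinates to map pixels. The following sketch shows both under stated assumptions: the function names, host names, bounding box and map size are hypothetical, and a plain linear (equirectangular) projection is assumed since the slides do not say which one is used.

```python
def plan_map_update(map_hosts, inventory_hosts):
    """Return (obsolete, new): hosts to drop from the map and hosts to add."""
    obsolete = sorted(set(map_hosts) - set(inventory_hosts))
    new = sorted(set(inventory_hosts) - set(map_hosts))
    return obsolete, new


def project(lat, lon, bounds, size):
    """Linear (equirectangular) projection of lat/lon onto map pixels."""
    (lat_min, lat_max), (lon_min, lon_max) = bounds
    width, height = size
    x = round((lon - lon_min) / (lon_max - lon_min) * width)
    y = round((lat_max - lat) / (lat_max - lat_min) * height)  # y grows downwards
    return x, y


# Hypothetical data: hosts currently drawn on the map vs. hosts in the inventory.
on_map = ["router-a", "router-b", "switch-old"]
in_inventory = ["router-a", "router-b", "switch-new"]
obsolete, new = plan_map_update(on_map, in_inventory)
print(obsolete, new)

# Hypothetical bounding box (lat range, lon range) and a 1000x800 pixel map.
bounds = ((34.0, 42.0), (19.0, 29.0))
position = project(38.0, 24.0, bounds, size=(1000, 800))
print(position)
```

Hosts already on the map are left alone, so any manual positioning done earlier survives the hourly run.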
Nagvis Network Map
Servers, Services?

A little bit of history
• For years, GRNET only had very basic services (DNS, email, Web)
• And some router supporting services (Looking glass, mrtg, rancid)
• And very few servers (<=10)
• 3 years ago, major paradigm shift from networking to services
• 20 servers bought, and then 132, and recently 275 more
• End user services were born:
    Public cloud storage service (Pithos)
    Virtual Private Servers (ViMa)
    Students books statements (Eudoxus)
    Student Id cards (Paso)
    Public IaaS (Okeanos)
    Academic Professor Elections (Apella)
• Plus many other services and projects (TCS, Whois, NTP, VoD, ...)
• The result? 200 VMs were created for managing all this infrastructure
Puppet to the rescue

What is Puppet?
• It's a stack of applications
• It's a language (a declarative one as well)
• It's a policy and state enforcing tool
• It's an attribute and state discovery tool (kind of...)
• It's a new paradigm in managing systems!

What is Puppet not?
• Not just an automation tool
• Not a "for loop"
• Not a command execution framework (it can be reduced to that though)

AGAIN: A new paradigm, you need to change the way you work
Puppet Concepts

Facts
• Attributes of a system:
    OS version and family
    Available memory
    CPUs
    Block devices
    IP addresses/netmasks
    MAC addresses
• And anything else you can write discovery code for:
    LLDP neighbours
    IPMI functionality
    Hardware info
    Apache vhosts
• Discovered by facter and then made available to Puppet
Puppet Concepts (2)

Resources
• Files, Directories
• Users, Groups
• Packages
• Vlans
• Interfaces
• Nagios objects!!!!
• And a lot more (http://docs.puppetlabs.com/references/latest/type.html)

Classes
• A way to group resources
• Support inheritance and mixins (aka including)
• The standard class has 3 resources defined:
    package { 'software': }
    file { '/etc/software.conf': }
    service { 'softwared': }
Puppet Concepts (3)

• Nodes
    A.k.a. machines (VM or hardware)
    A node CAN (and probably will) have multiple puppet classes
    Node population can be done in multiple ways:
        Puppet language config
        LDAP
        External script
• Puppetd agents run on each machine (daemon or crontab)
• A central Puppetmaster (with an RDBMS) holds all the configuration and data
Hello World example

    class helloworld {
      file { '/tmp/helloworld':
        ensure  => present,
        owner   => root,
        group   => root,
        mode    => '0640',
        content => 'Hello world',
      }
    }

    node mynode {
      include helloworld
    }

Will create /tmp/helloworld with all the attributes defined above.
More importantly, if run again it will wipe any changes made in the meantime and restore the state defined above.
Back to nagios

Let's use a puppet native type:

    nagios_host { "$hostname":
      address        => '10.10.10.10',
      alias          => 'myhost',
      contact_groups => 'hostadmins',
      hostgroups     => 'Puppeted Servers',
    }

/etc/nagios/nagios_host.cfg gets populated.

Problem is...
• This is executed on the machine running puppetd, not on the nagios server.

No problem: Puppet supports exported resources.
Exported resources

Let's prepend the definition with two @ signs:

    @@nagios_service { 'myservice':
      contact_groups => 'hostadmins',
      host_name      => $hostname,
      tag            => 'collect_me_nagios_server',
    }

• Exports the resource but does not realize it on the machine running puppetd
• No /etc/nagios/nagios_service.cfg file is created there

    Nagios_service <<| tag == 'collect_me_nagios_server' |>>

• Goes in the nagios server's manifest; the tag must match the exported one exactly
• /etc/nagios/nagios_service.cfg gets populated on the nagios server
• nagios.cfg/icinga.cfg can now just include the file/directory, and monitoring begins
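The export/collect flow can be pictured with a toy model: every agent run stores its exported resources centrally, and the nagios server later pulls in everything of a given type carrying a given tag. The class below is only a conceptual sketch and not how Puppet implements this; ResourceStore, its methods and the host names are hypothetical.

```python
class ResourceStore:
    """Stands in for the puppetmaster's database of exported resources."""

    def __init__(self):
        self._exported = []

    def export(self, node, rtype, title, tag, **params):
        """Record a resource exported by 'node' without applying it there."""
        self._exported.append(
            {"node": node, "type": rtype, "title": title, "tag": tag, "params": params}
        )

    def collect(self, rtype, tag):
        """Return every exported resource of this type with exactly this tag."""
        return [r for r in self._exported if r["type"] == rtype and r["tag"] == tag]


store = ResourceStore()

# Two hypothetical agents export one service check each.
for host in ("web1", "db1"):
    store.export(
        node=host,
        rtype="nagios_service",
        title=f"myservice-{host}",
        tag="collect_me_nagios_server",
        host_name=host,
        contact_groups="hostadmins",
    )

# The nagios server collects by type and tag.
collected = store.collect("nagios_service", "collect_me_nagios_server")
print([r["title"] for r in collected])

# A tag that differs by a single character matches nothing.
missed = store.collect("nagios_service", "collect_me_nagiosserver")
print(len(missed))
```

The second lookup shows why a mistyped tag fails silently: the collector simply finds nothing and no service objects are written.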
Simple example

A manifest for all authoritative DNS servers
• Install bind9, install its configuration and ensure it is running
• Open up the firewall
• Set up a simple DNS check

    class authoritativedns {
      include bind9
      include service::dns
      @@nagios_service { "authdns":
        check_command => "check_dig!www.grnet.gr",
        servicegroups => "DNS,DNS:Authoritative",
      }
    }
Interesting use cases

Class hierarchy means:
• A base class nagios::host that is included in all others
• So all servers are nagios-monitored without any intervention

But:
• A server is physical and has IPMI capabilities
• So export another nagios host for it

    if $ipmi_capable {
      @@nagios_host { "$ipmi_dns":
        address => $ipmi_ipaddress,
        tag     => "hardwarehost",
      }
    }
Interesting use cases (2)

Server is an HP Proliant Server

    class hp-health {
      package { [ 'hp-health', 'hpacucli' ]:
        ensure => present,
      }
      nagios::host::service { 'hpacucli':
        ensure        => present,
        servicegroups => 'HARDWARE',
        command       => 'check_nrpe!dsa-check-hpacucli!0',
      }
      nagios::host::service { 'hpasm':
        ensure        => present,
        servicegroups => 'HARDWARE',
        command       => 'check_nrpe!dsa-check-hpasm!0',
      }
    }
Interesting use cases (3)

Multicast beacons (double exported resources!!!)

    define ssmping_check($ipv4, $ipv6) {
      $local  = $::fqdn
      $remote = $name
      if ($::ipaddress and $ipv4 and $local != $remote) {
        @@nagios_service { "ping-ssm-$remote-$local-v4":
          ensure              => present,
          check_command       => "check_nrpe!check_ssmping!$ipv4",
          host_name           => $local,
          service_description => "Multicast from $remote SSM IPv4",
        }
      …
    }

    # export the checks...
    @@ssmping_check { $fqdn:
      ipv4 => $ipaddress,
      ipv6 => $ipv6address,
    }
Interesting use cases (4)

Standard checks for all servers

    nagios::host::service { "disk":
      command => "check_nrpe!check_disk!13% 7%",
    }
    nagios::host::service { "load":
      command => "check_nrpe!check_load!4,3,2 5,4,3",
    }
    nagios::host::service { "users":
      command => "check_nrpe!check_users!20 30",
    }
    nagios::host::service { "swap":
      command => "check_nrpe!check_swap!60 40",
    }
    nagios::host::service { "check_tainted":
      command => "check_nrpe!check_tainted!0",
    }
    nagios::host::service { "check_firewall":
      command => "check_nrpe!check_firewall!0",
    }