Monitoring Systems and POWER5/6 LPARs with Ganglia Michael Perzl – michael@perzl.org
Agenda
Ganglia – what is it?
Ganglia components and data flow
An introduction to RRDTool
Ganglia metrics – what can be measured?
New POWER5/6 metrics (AIX & Linux)
Extending Ganglia with gmetric
Add device specific information to Ganglia
Ganglia network communication
Installation issues
Where to get Ganglia for AIX and Linux on POWER?
Best practices
Future additions / plans
Discussion
Links
Ganglia – what is it?
Ganglia – what is it? (1/3)
Ganglia is an Open Source cluster performance monitoring tool that has been extended to cover POWER5/6 features such as shared processor LPARs, entitlement, and physical CPU usage.
This session covers:
  – the technical details of Ganglia and the POWER5/6 extensions
  – how to set it up and use it to monitor all LPARs in a single machine, as well as many machines
Ganglia – what is it? (2/3)
Ganglia properties:
scalable distributed monitoring system for high-performance computing systems such as clusters and grids
based on a hierarchical design targeted at federations of clusters
relies on a multicast-based listen/announce protocol to monitor state within clusters, and uses a tree of point-to-point connections amongst representative cluster nodes to federate clusters and aggregate their state
leverages widely used technologies such as
  – XML for data representation
  – XDR (eXternal Data Representation) for compact, portable data transport
  – RRDtool for data storage and visualization
uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency
robust implementation
Open Source, written in C
  – Downloaded 110,000+ times, 145+ countries, 500+ clusters, 2000+ nodes
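To make "compact, portable data transport" concrete, here is a minimal sketch of XDR-style encoding using only Python's struct module. It follows the general XDR rules (4-byte big-endian units, strings padded to a multiple of four bytes). The message layout at the end is hypothetical and is not the actual Ganglia wire format.

```python
import struct


def xdr_uint(value: int) -> bytes:
    """Encode an unsigned 32-bit integer as 4 big-endian bytes."""
    return struct.pack(">I", value)


def xdr_string(text: str) -> bytes:
    """Encode a string: 4-byte length, the bytes, zero padding to a multiple of 4."""
    data = text.encode("utf-8")
    padding = (4 - len(data) % 4) % 4
    return xdr_uint(len(data)) + data + b"\x00" * padding


def xdr_float(value: float) -> bytes:
    """Encode a single-precision IEEE 754 float as 4 big-endian bytes."""
    return struct.pack(">f", value)


# Hypothetical layout for illustration only (metric name followed by its value).
message = xdr_string("load_one") + xdr_float(0.5)
```

Because every field is a multiple of four bytes in a fixed byte order, the same message can be decoded on any architecture, which is why the format suits a mixed AIX/Linux cluster.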
Ganglia – what is it? (3/3)
Ganglia properties (cont.):
has been ported to an extensive set of operating systems and processor architectures:
  – AIX
  – Darwin
  – FreeBSD
  – HP-UX
  – IRIX
  – Linux
  – OSF
  – NetBSD
  – Solaris
  – Windows (via Cygwin)
is currently in use on 500+ clusters around the world
has been used to link clusters across university campuses and around the world, and can scale to handle clusters with 2000+ nodes
  – check http://ganglia.info/ for more details
Ganglia components and data flow
Ganglia components
The Ganglia system consists of:
two unique daemons:
  – Ganglia Monitoring Daemon (gmond)
    • monitoring daemon, collects the metrics
    • runs on each node
  – Ganglia Meta Daemon (gmetad)
    • polls all gmond clients and stores the collected metrics in Round-Robin Databases (RRDs)
a PHP-based web frontend
a few other small utility programs
  – gmetric
    • can be used to easily extend Ganglia with additional user-defined metrics
  – gstat
  – gexec
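As a rough illustration of extending Ganglia with a user-defined metric, the sketch below builds a gmetric command line from Python. It assumes gmetric is installed and on the PATH; the metric name, value, and units are invented for the example.

```python
import subprocess


def gmetric_command(name, value, metric_type, units=""):
    """Build the argument list for one gmetric invocation."""
    cmd = [
        "gmetric",
        f"--name={name}",
        f"--value={value}",
        f"--type={metric_type}",
    ]
    if units:
        cmd.append(f"--units={units}")
    return cmd


def publish(name, value, metric_type, units=""):
    """Run gmetric; raises CalledProcessError if the command fails."""
    subprocess.run(gmetric_command(name, value, metric_type, units), check=True)


# Hypothetical metric for illustration; call publish(...) with the same
# arguments on a node where gmetric is available.
example = gmetric_command("rack_temp", 21.5, "float", "Celsius")
```

Running publish() periodically (for example from cron) would keep such a metric current on the node where it is called.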
Ganglia – Schematic View
[Figure: schematic view of Ganglia]
From: “Ganglia: Past, Present and Future” by Matt Massie, URL: http://ganglia.info/talks/lug_lbl_talk/
Ganglia Architecture
[Figure: Ganglia architecture]
Ganglia Monitoring Daemon (gmond)
The Ganglia Monitoring Daemon (gmond) is a multi-threaded daemon which runs on each cluster node you want to monitor.
Installation is easy:
  – just the daemon and a configuration file (/etc/gmond.conf)
gmond has four main responsibilities:
  1. monitor changes in host state
  2. announce relevant changes
  3. listen to the state of all other Ganglia nodes via a unicast or multicast channel
  4. answer requests for an XML description of the cluster state
Each gmond transmits information in two different ways:
  – unicasting or multicasting host state in external data representation (XDR) format using UDP messages
  – sending XML over a TCP connection
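The XML description of the cluster state can be processed with any XML parser. The sketch below groups metrics by host. The sample document and its element and attribute names (CLUSTER, HOST, METRIC, NAME, VAL) are assumptions modelled on typical Ganglia output and may differ between versions.

```python
import xml.etree.ElementTree as ET

# Invented sample data; a real gmond reports many more metrics and attributes.
SAMPLE_XML = """<GANGLIA_XML VERSION="3.0" SOURCE="gmond">
  <CLUSTER NAME="demo">
    <HOST NAME="lpar1" IP="10.0.0.1">
      <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
      <METRIC NAME="cpu_num" VAL="4" TYPE="uint16" UNITS="CPUs"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def metrics_by_host(xml_text):
    """Return {host name: {metric name: value string}} from a Ganglia XML tree."""
    root = ET.fromstring(xml_text)
    result = {}
    for host in root.iter("HOST"):
        result[host.get("NAME")] = {
            metric.get("NAME"): metric.get("VAL") for metric in host.iter("METRIC")
        }
    return result


cluster_state = metrics_by_host(SAMPLE_XML)
```

Values are kept as strings here; a consumer would convert them according to the TYPE attribute.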
Ganglia Meta Daemon (gmetad) (1/2)
The Ganglia Meta Daemon (gmetad) is a daemon which typically runs on only one specific cluster node, or on several when using a staged setup.
Installation is easy:
  – just the daemon and a configuration file (/etc/gmetad.conf)
Federation in Ganglia is achieved using a tree of point-to-point connections amongst representative cluster nodes to aggregate the state of multiple clusters.
At each node in the tree a gmetad
  – periodically polls a collection of child data sources
  – parses the collected XML
  – saves all numeric volatile metrics to round-robin databases
  – exports the aggregated XML over a TCP socket to clients
Ganglia Meta Daemon (gmetad) (2/2)
Data sources may be either
  – gmond daemons, representing specific clusters, or
  – other gmetad daemons, representing sets of clusters
Data sources use source IP addresses for access control
  – Multiple IP addresses can be specified for failover
  – This capability is natural for aggregating data from clusters, since each gmond daemon contains the entire state of its cluster
Ganglia PHP web frontend (1/2)
Web frontend properties:
provides a view of the gathered information via real-time dynamic web pages
displays Ganglia data in a meaningful way for system administrators and users
  – For example, one can view the CPU utilization over the past hour, day, week, month, or year
  – The web frontend shows similar graphs for memory usage, disk usage, network statistics, number of running processes, and all other Ganglia metrics
Ganglia PHP web frontend (2/2)
Web frontend properties (cont.):
depends on the existence of the gmetad, which provides it with data from several Ganglia sources
opens a connection to the local port 8651 (by default) and expects to receive a Ganglia XML tree
the web pages themselves are highly dynamic; any change to the Ganglia data appears immediately on the site
  – This behavior leads to a very responsive site, but requires that the full XML tree be parsed on every page access
  – Therefore, the Ganglia web frontend should run on a fairly powerful, dedicated machine if it presents a large amount of data
is written in the PHP scripting language and uses graphs generated by gmetad to display history information
has been tested on many flavors of Unix (primarily Linux) with the Apache web server and the PHP 4.1 module
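What the frontend does on each page access can be approximated in a few lines: connect to the port, read until the peer closes the connection, and hand the text to an XML parser. This sketch assumes the daemon writes the whole tree and then closes the socket; the function name and defaults are illustrative.

```python
import socket


def fetch_xml(host="localhost", port=8651, timeout=10.0):
    """Read everything the peer sends until it closes the connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(4096)
            if not data:  # empty read means the peer closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Calling fetch_xml() on the machine that runs gmetad would return the aggregated tree as one string. The cost of parsing that string on every page access is the reason for the hardware advice above.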
Ganglia - data flow (1/4)
[Diagram: one gmond daemon per node/LPAR, with /etc/gmond.conf and the operating system's performance stats API. Legend: file access, network, web.]
Ganglia - data flow (2/4)
[Diagram: one gmond daemon per node/LPAR (/etc/gmond.conf, operating system performance stats API); gmetad (/etc/gmetad.conf) runs on the web server with the rrdtool database of statistics; browser. Legend: file access, network, web.]
Ganglia - data flow (3/4)
[Diagram: one gmond daemon per node/LPAR (/etc/gmond.conf, operating system performance stats API); gmetad (/etc/gmetad.conf) runs on the web server with the rrdtool database of statistics; Ganglia FE scripts under Apache2 + PHP5; browser. Legend: file access, network, web.]
Ganglia - data flow (4/4)
[Diagram: user command gmetric; one gmond daemon per node/LPAR (/etc/gmond.conf, operating system performance stats API); gmetad (/etc/gmetad.conf) runs on the web server with the rrdtool database of statistics; Ganglia FE scripts under Apache2 + PHP5; browser. Legend: file access, network, web.]
Ganglia - data flow again
[Diagram: one gmond daemon per node/LPAR, each with its own /etc/gmond.conf (three shown); only one gmetad instance (/etc/gmetad.conf) with the web server, together with the rrdtool database of statistics and the PHP scripts under Apache2 + PHP5; browser. Legend: file access, network, web.]
An introduction to RRDTool
RRDTool
Homepage: http://oss.oetiker.ch/rrdtool/
RRD is the acronym for Round-Robin Database.
RRD is a system to store and display time-series data (e.g., network bandwidth, machine-room temperature, server load average).
It stores the data in a very compact way that will not expand over time (fixed size of DB), and it presents useful graphs by processing the data to enforce a certain data density.
It can be used either via simple wrapper scripts (from shell or Perl) or via frontends that poll network devices and put a friendly user interface on it.
RRDTool is the industry standard tool to store and display time-series data!
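The "fixed size of DB" property comes from the round-robin idea: a fixed number of slots, where the newest consolidated value overwrites the oldest. The toy class below illustrates that idea with averaging as the consolidation function. It is not RRDtool's file format or API, and the class and parameter names are invented.

```python
from collections import deque


class RoundRobinArchive:
    """Keep the last `rows` averages, each built from `steps_per_row` samples."""

    def __init__(self, rows, steps_per_row):
        self.rows = deque(maxlen=rows)  # oldest row is dropped automatically
        self.steps_per_row = steps_per_row
        self._pending = []

    def update(self, value):
        """Add one raw sample; emit a consolidated row when enough are collected."""
        self._pending.append(value)
        if len(self._pending) == self.steps_per_row:
            self.rows.append(sum(self._pending) / self.steps_per_row)
            self._pending = []


archive = RoundRobinArchive(rows=3, steps_per_row=2)
for sample in range(10):
    archive.update(sample)
# Only the three most recent averages remain: 4.5, 6.5, 8.5
```

However many samples arrive, storage stays at three rows. Averaging several samples into one row is what enforces a fixed data density per archive.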
RRDTool example graph
Graph taken from http://oss.oetiker.ch/rrdtool/gallery/index.en.html
The graph shows inbound and outbound call traffic going in and out of the switch via the 6 trunks connected to the Diamond exchange. Inbound traffic is shown as positive and uses a lowest-free fill method. Outbound traffic is shown as negative and uses a distributed fill method. Tech details on RRDtrac.