  1. Monitoring 6000+ hosts in Zabbix: A Pseudo-DevOps journey

  2. About me
     - Senior Systems Engineer, Tools and Automation, Kinetic IT @ Department of Education
     - Co-founder of Passive Eye Ltd
     - Open Source contributor

  3. Dept. of Education
     - ~800 schools
     - ~400,000 end users
     - ~5,000 SOE servers @ schools
     - ~1,500 heterogeneous servers @ central office
     - Hub-and-spoke topology
     - Vast geographic distribution

  4. Problem definition
     - Multiple, disconnected monitoring tools
     - Poor coverage
     - Lack of correlation
     - Duplication of effort
     - Inconsistent practice
     - Difficulty measuring SLAs

  5. Requirements
     - Single pane of glass
     - Scalability
     - Extensibility
     - Ease of use
     - Low licensing costs

  6. Enter Zabbix
     - All-in-one
     - Support for diverse devices
     - Small footprint and scalable architecture
     - Extensible API
     - Configuration UI
     - Free and open source (+ commercial support)

  7. How Zabbix works
     - Primary server: database, web frontend
     - Proxy servers
     - Agents and passive devices, monitored via proxies

  8. How Zabbix works
     - Items and low-level discovery (LLD)
     - Active and passive checks
     - Hosts and templates
     - Triggers, events and actions
     - Graphs, screens and maps

  9. Planning
     - Scrum + Jira Agile

  10. Building
      - Discrete environments in Vagrant
      - Infrastructure as code
      - Discrete feature branches
      - Monolithic source repo
      - Branching model: http://nvie.com/posts/a-successful-git-branching-model/

  11. Puppet code

  12. Testing
      - Bamboo
      - Cucumber

  13. Deployment
      - Server, DB, Web
      - No proxies (so far)

  14. Performance

  15. Integrations
      - Active Directory
      - CMDB
      - Service Management
      - ICT Dashboard

  16. Autonomy
      - Host registration
      - Low-level discovery
      - User provisioning
      - Remediation scripts
      - Data housekeeping
      - Incidents and escalations

  17. Template hierarchy
      - Host > Class > Templates > Items, etc.

  18. Windows monitoring
      - Out-of-the-box support (example item keys below):
        - WMI queries
        - Performance counter probes
        - Event log monitoring
        - Service state
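
      For example, the stock Windows agent exposes these checks through item
      keys such as the following (the key parameters here are illustrative,
      not taken from the deck):

        perf_counter[\Processor(_Total)\% Processor Time]
        service_state[Spooler]
        eventlog[System,,"Error|Warning"]
        wmi.get[root\cimv2,"SELECT Caption FROM Win32_OperatingSystem"]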

  19. Windows monitoring
      - Customisations:
        - Hostname casing and format
        - Service discovery
        - Performance counter discovery
        - Failover Cluster discovery
        - Persistent disk and volume identification

  20. Windows monitoring
      - Tools:
        - MSI installer package
        - Test PowerShell script
        - Performance counter template builder

  21. Windows monitoring
      - Convert-PerfCountersToZabbixTemplate.ps1
        - Counter set > Application
        - Multi-instance counter > Discovery rule + prototypes
        - Single-instance counter > Item check

      PS> ./Convert-PerfCountersToZabbixTemplate.ps1 -CounterSet Processor | Out-File template.xml

  22. Linux monitoring
      - Loadable modules (minimal sketch below)
      - Extensions for the Linux kernel
      - PostgreSQL
      - Packaging
      - Test script
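
      A loadable module is a shared library implementing the Zabbix module
      API. A minimal sketch in C (the dummy.ping key is illustrative, not one
      of the modules from the talk):

        #include "module.h"   /* ships with the Zabbix sources */

        /* Handler for the illustrative key "dummy.ping"; always returns 1. */
        static int dummy_ping(AGENT_REQUEST *request, AGENT_RESULT *result)
        {
            SET_UI64_RESULT(result, 1);
            return SYSINFO_RET_OK;
        }

        /* Item keys exported by this module. */
        static ZBX_METRIC keys[] = {
            /* KEY           FLAG  FUNCTION     TEST PARAMETERS */
            { "dummy.ping",  0,    dummy_ping,  NULL },
            { NULL }
        };

        int zbx_module_api_version(void) { return ZBX_MODULE_API_VERSION_ONE; }
        int zbx_module_init(void)        { return ZBX_MODULE_OK; }
        int zbx_module_uninit(void)      { return ZBX_MODULE_OK; }
        ZBX_METRIC *zbx_module_item_list(void) { return keys; }

      Build it as a shared object (e.g. gcc -shared -fPIC) and load it with
      the LoadModule directive in the agent or server configuration.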

  23. SNMP monitoring
      - CRAC
      - UPS
      - Dell iDRAC
      - IPS
      - Mail gateways

  24. SNMP monitoring
      - mib2zabbix.pl
        - Tree nodes > Applications
        - OID tables > Discovery rules + prototypes
        - OID scalars > Item checks
        - OID traps > Item + trigger + snmptt config
        - Enums > Value maps

      $ ./mib2zabbix.pl --template --oid=.1.3.6.1.2.1.25 --name="Host resources"

  25. Application monitoring
      - Microsoft Exchange
      - Microsoft SCCM
      - Microsoft SQL Server
      - Microsoft Active Directory
      - PostgreSQL Server
      - EMC Avamar
      - HP BPM
      - Squid Proxy
      - Zabbix Server
      - …

  26. Risk mitigation
      - Document in code
      - Source control
      - Clearly defined interfaces
      - Quality gates
      - Upstream contribution
      - Change the hiring criteria to avoid a SPOF

  27. Agent stress testing
      - Critical for finding:
        - Memory leaks
        - Race conditions
        - Impact on the system
        - Regressions
      - Validates efficiency improvements

  28. It’s no magic bullet…
      - Data aggregation
      - Visualizations
      - Alert scripts

  29. Future
      - Zabbix v3 upgrade
      - Better engagement from ITOps
      - More devices and apps
      - More automation
      - Better use of data
      - Enterprise Integration Patterns
      - Cloud monitoring

  30. DevOps?
      - Meta-software
      - Agile delivery
      - Infrastructure as Code
      - Continuous Integration
      - Theory of Constraints

  31. Contrib
      - PostgreSQL monitoring: https://github.com/cavaliercoder/libzbxpgsql
      - Agent benchmarking: https://github.com/cavaliercoder/zabbix_agent_bench
      - Windows MSI package: https://github.com/cavaliercoder/zabbix-msi
      - Golang module adapter: https://github.com/cavaliercoder/g2z

  32. Windows counters
      - Performance counter IDs are non-persistent
      - Today G: is PhysicalDisk(3 G:), tomorrow it is PhysicalDisk(5)
      - Graphs and alerts break
      - Mapping is not practical in scripting APIs

  33. Windows physical disks
      - Mutable performance counter ID: PhysicalDisk(0 C: Q:)
      - The index ('0') changes on reboot, swap, failover, etc.
      - The drive letters ('C:', 'Q:') are undocumented
      - MBR signatures and GPT GUIDs are more persistent

  34. Windows physical disks
      - Q: What runtime counter index maps to the desired MBR signature/GPT GUID?
      - Identify the disk via its MBR signature or GPT GUID:
        DeviceIoControl(IOCTL_DISK_GET_DRIVE_LAYOUT_EX)
      - Get the device index (\\.\PHYSICALDRIVE<i>):
        DeviceIoControl(IOCTL_STORAGE_GET_DEVICE_NUMBER)
      - Iterate the PhysicalDisk counters (ignoring drive letters):
        PdhEnumObjectItems
      - Sample code: https://github.com/cavaliercoder/sysinv/blob/master/diskinfo.cpp
        (a simplified sketch follows below)
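
      A minimal sketch of the lookup, assuming MBR disks (GPT disks would
      compare Gpt.DiskId instead; the function name is hypothetical, and
      sysinv's diskinfo.cpp is the real implementation):

        #include <windows.h>
        #include <winioctl.h>
        #include <stdio.h>

        /* Return the runtime device number of the physical drive whose MBR
         * signature matches, or -1 if not found. */
        static int FindDiskIndexBySignature(DWORD signature)
        {
            for (int i = 0; i < 64; i++) {  /* probe a bounded drive range */
                char path[32];
                snprintf(path, sizeof(path), "\\\\.\\PHYSICALDRIVE%d", i);
                HANDLE h = CreateFileA(path, 0,
                    FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                    OPEN_EXISTING, 0, NULL);
                if (h == INVALID_HANDLE_VALUE)
                    continue;

                BYTE buf[4096];  /* DRIVE_LAYOUT_INFORMATION_EX + entries */
                DWORD bytes;
                int found = -1;
                if (DeviceIoControl(h, IOCTL_DISK_GET_DRIVE_LAYOUT_EX,
                        NULL, 0, buf, sizeof(buf), &bytes, NULL)) {
                    DRIVE_LAYOUT_INFORMATION_EX *layout =
                        (DRIVE_LAYOUT_INFORMATION_EX *)buf;
                    STORAGE_DEVICE_NUMBER num;
                    if (layout->PartitionStyle == PARTITION_STYLE_MBR &&
                        layout->Mbr.Signature == signature &&
                        DeviceIoControl(h, IOCTL_STORAGE_GET_DEVICE_NUMBER,
                            NULL, 0, &num, sizeof(num), &bytes, NULL))
                        found = (int)num.DeviceNumber;
                }
                CloseHandle(h);
                if (found >= 0)
                    return found;
            }
            return -1;
        }

      The returned index is then matched against the leading number in each
      PhysicalDisk instance name returned by PdhEnumObjectItems (e.g. "3 G:"
      matches index 3), ignoring the mutable drive letters.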

  35. Windows volumes
      - Performance counter ID: LogicalDisk(C:) or LogicalDisk(HarddiskVolumeN)
      - The drive letter is mutable
      - N is mutable
      - Volume GUIDs or serials are more persistent

  36. Windows volumes
      - Q: Which runtime counter ID matches a given volume GUID?
      - Find volumes with FindFirstVolume/FindNextVolume
      - Compare the GUID/serial against the name returned by GetVolumeInformation
      - Enumerate LogicalDisk counters with PdhEnumObjectItems
      - Test mount paths (N:) returned by GetVolumePathNamesForVolumeName
      - Test the DOS device path (\Device\HarddiskVolumeN) returned by QueryDosDevice
      - Sample code: https://github.com/cavaliercoder/sysinv/blob/master/diskinfo.cpp
        (a simplified sketch follows below)
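
      A sketch of the volume walk (error handling trimmed; sysinv's
      diskinfo.cpp is the real implementation):

        #include <windows.h>
        #include <stdio.h>

        int main(void)
        {
            WCHAR vol[MAX_PATH];  /* receives \\?\Volume{GUID}\ */
            HANDLE find = FindFirstVolumeW(vol, MAX_PATH);
            if (find == INVALID_HANDLE_VALUE)
                return 1;

            do {
                /* Mount paths, e.g. "C:\" - compare with LogicalDisk(C:) */
                WCHAR paths[1024] = L"";
                DWORD len = 0;
                GetVolumePathNamesForVolumeNameW(vol, paths, 1024, &len);

                /* DOS device, e.g. \Device\HarddiskVolume3 - compare with
                 * LogicalDisk(HarddiskVolumeN). QueryDosDevice wants the
                 * name without the \\?\ prefix or trailing backslash. */
                WCHAR dev[MAX_PATH] = L"", name[MAX_PATH];
                size_t n = wcslen(vol);
                if (n > 5) {
                    wcsncpy_s(name, MAX_PATH, vol + 4, n - 5);
                    QueryDosDeviceW(name, dev, MAX_PATH);
                }

                wprintf(L"%ls\n  first mount: %ls\n  device: %ls\n",
                        vol, paths, dev);
            } while (FindNextVolumeW(find, vol, MAX_PATH));

            FindVolumeClose(find);
            return 0;
        }

      Each candidate LogicalDisk instance from PdhEnumObjectItems is then
      tested against both the mount paths and the DOS device name, so the
      match holds whether PDH names the volume by drive letter or by
      HarddiskVolumeN.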

  37. Windows Failover Clusters
      - Disks move between nodes
      - Node disks are visible on the cluster IP
      - IDs and drive letters change

  38. Windows Failover Clusters
      - Q: Is an MBR signature/GPT GUID listed as a cluster resource?
      - The Cluster API identifies disks by MBR signature or GPT GUID!
      - Enumerate "Physical Disk" resources in the cluster with ClusterEnum
      - Add a discovery parameter for clustered/non-clustered disks
      - Sample code: https://github.com/cavaliercoder/sysinv/blob/master/cluster.cpp
        (a simplified sketch follows below)
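
      A sketch of the resource walk (link against clusapi.lib; sysinv's
      cluster.cpp is the real implementation):

        #include <windows.h>
        #include <clusapi.h>
        #include <stdio.h>

        int main(void)
        {
            HCLUSTER cluster = OpenCluster(NULL);  /* NULL = local cluster */
            if (cluster == NULL)
                return 1;

            HCLUSENUM e = ClusterOpenEnum(cluster, CLUSTER_ENUM_RESOURCE);
            WCHAR name[256];

            for (DWORD i = 0; ; i++) {
                DWORD type, cch = 256;
                DWORD rc = ClusterEnum(e, i, &type, name, &cch);
                if (rc == ERROR_NO_MORE_ITEMS)
                    break;
                if (rc != ERROR_SUCCESS)
                    continue;

                /* Keep only resources of the built-in "Physical Disk" type */
                HRESOURCE res = OpenClusterResource(cluster, name);
                if (res != NULL) {
                    WCHAR resType[256] = L"";
                    DWORD bytes = 0;
                    ClusterResourceControl(res, NULL,
                        CLUSCTL_RESOURCE_GET_RESOURCE_TYPE,
                        NULL, 0, resType, sizeof(resType), &bytes);
                    if (wcscmp(resType, L"Physical Disk") == 0)
                        wprintf(L"clustered disk resource: %ls\n", name);
                    CloseClusterResource(res);
                }
            }

            ClusterCloseEnum(e);
            CloseCluster(cluster);
            return 0;
        }

      From there, the resource's private properties (e.g. DiskSignature)
      expose the identifier to compare against the local disks.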

  39. Thank you! Ryan Armstrong
      - Blog: cavaliercoder.com
      - Twitter: @cavaliercoder
      - GitHub: cavaliercoder
