Nagios Improvements
From UGCS
We should update Nagios to better reflect what the servers are doing:
Services to watch
- apollo - AFS, rsync
- athena - AFS
- demeter - DHCP, TFTP
- hera - kerberos (probably not needed)
- I would definitely monitor kerberos
- hermes - AFS
- You should also monitor "internal" mail services like amavis, spamassassin, etc. I know we have an end-to-end email heartbeat, but checking individual services would be good too.
- amavis
- Mailman
- spamassassin
- You could also check for a "correct" number of imap-login processes.
- hestia - add NFS
- dionysus
- sks (PGP key server)
- nagios (yes, you need to make sure nagios is running... otherwise all of this may end up being for naught)
- persephone - bacula
- zeus - kerberos (probably not needed)
- Sometimes LDAP replication breaks and we don't notice it. I think there are ways to check the status of the link, but I haven't been able to find them recently.
- poseidon
- postgres
- mysql
- kabta
- postfix (secondary MX record)
Shellservers
- distcc
All Machines
- All machines have their own postfix that is used to send messages via sendmail. If it goes down, email on a host can pile up and not get delivered.
- (Almost) all machines have RAID that we should check with Nagios in addition to mdadm.
- All non-shellservers run apcupsd, which should be checked.
- rwhod
- ntpd (time drifts can cause kerberos to break)
- bacula-fd
Types of monitoring
There are a few ways to monitor services:
- Make sure the process exists. This is the simplest way and can catch many problems. For some services it may be the only way to check them that isn't horribly complex.
- The checks for multiplecron and svn work this way.
- You can also look for a "correct" number of processes. If we have fewer than 2 or 3 apache processes on poseidon, for example, something is probably wrong. This method can catch subtler bugs, but it has a higher false-positive rate.
- Make sure you can reach it over the network. This usually involves making a test TCP connection and making sure it doesn't get refused.
- ssh, http, https, etc. are tested this way.
- Functionality test. These are the hardest to write but the most useful, as they can catch almost any problem. We have a few of these set up:
- Mail. There is a heartbeat script that sends an email every 5 minutes through a path that includes our incoming SMTP server as well as the various delivery mechanisms (both mailman and local delivery).
- Web. Nagios checks both the status of the HTTP port (is it open?) and several test pages (like http://jdtest.caltech.edu/test.php) that exercise the mechanism that lets users run scripts as their own user, as well as database connectivity.
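The first two kinds of check above (process count and TCP reachability) can be sketched as a small Nagios-style plugin. This is only a rough illustration, not how our existing checks are written: the function names, the argument layout, and the choice of pgrep for counting are all assumptions made for the example. The only fixed convention it relies on is the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import socket
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING",
               CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def count_processes(name):
    """Count running processes named exactly `name` by asking pgrep."""
    result = subprocess.run(["pgrep", "-x", name],
                            capture_output=True, text=True)
    # pgrep prints one PID per line; no matches means empty output.
    return len(result.stdout.split())


def classify_count(count, minimum):
    """Map a process count to a state: none running is CRITICAL,
    fewer than `minimum` is WARNING, anything else is OK."""
    if count == 0:
        return CRITICAL
    if count < minimum:
        return WARNING
    return OK


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port is accepted."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def main(argv):
    """Hypothetical argument layout (an assumption for this sketch):
    ["proc", NAME, MINIMUM] or ["tcp", HOST, PORT].
    Prints one status line and returns the exit code to use."""
    if len(argv) != 3 or argv[0] not in ("proc", "tcp"):
        print("UNKNOWN - usage: proc NAME MINIMUM | tcp HOST PORT")
        return UNKNOWN
    try:
        number = int(argv[2])
    except ValueError:
        print("UNKNOWN - third argument must be an integer")
        return UNKNOWN
    if argv[0] == "proc":
        try:
            count = count_processes(argv[1])
        except OSError:
            print("UNKNOWN - could not run pgrep")
            return UNKNOWN
        state = classify_count(count, number)
        print(f"{STATE_NAMES[state]} - {count} {argv[1]} process(es) running")
        return state
    state = OK if tcp_port_open(argv[1], number) else CRITICAL
    print(f"{STATE_NAMES[state]} - TCP {argv[1]}:{number}")
    return state
```

To use it as a plugin, a wrapper script would call `sys.exit(main(sys.argv[1:]))`. For the apache example above, the arguments might be `proc apache2 3` (the process name `apache2` is a guess; use whatever the binary is actually called on poseidon). Treating a low count as WARNING rather than CRITICAL is one way to soften the higher false-positive rate of process counting.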