Nagios Improvements

From UGCS

We should update Nagios to better reflect what the servers are doing:

Services to watch

  • apollo - AFS, rsync
  • athena - AFS
  • demeter - DHCP, TFTP
  • hera - kerberos (probably not needed)
    • I would definitely monitor kerberos
  • hermes - AFS
    • You should also monitor "internal" mail services like amavis, spamassassin, etc. I know we have end-to-end email heartbeat, but checking individual services would be good too.
    • amavis
    • Mailman
    • spamassassin
    • You could also check for a "correct" number of imap-login processes.
  • hestia - add NFS
  • dionysus
    • sks (PGP key server)
    • nagios (yes, you need to make sure nagios is running... otherwise all of this may end up being for naught)
  • persephone - bacula
  • zeus - kerberos (probably not needed)
    • Sometimes LDAP replication breaks and we don't notice it. I think there are ways to check the status of the link, but I haven't been able to find them recently.
  • poseidon
    • postgres
    • mysql
  • kabta
    • postfix (secondary MX record)
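
The suggestion above to check for a "correct" number of imap-login processes could be sketched roughly as below. This is only an illustration, not an existing UGCS check: the script name, the function names, and the thresholds are all assumptions, and the stock Nagios plugin check_procs would normally be the first thing to try. The only Nagios convention relied on is the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
#!/usr/bin/env python3
"""Sketch of a Nagios-style process-count check (hypothetical, Linux only)."""
import glob
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def count_processes(name):
    """Count running processes whose command name equals `name`.

    Reads /proc/<pid>/comm, which the kernel truncates to 15 characters.
    """
    count = 0
    for path in glob.glob("/proc/[0-9]*/comm"):
        try:
            with open(path) as handle:
                if handle.read().strip() == name:
                    count += 1
        except OSError:
            # The process exited between listing and reading; skip it.
            continue
    return count


def classify(name, count, min_ok, max_ok):
    """Map a process count to an (exit_code, status_line) pair."""
    if count == 0:
        return CRITICAL, f"PROCS CRITICAL: no {name} processes running"
    if count < min_ok or count > max_ok:
        return WARNING, (f"PROCS WARNING: {count} {name} processes "
                         f"(expected {min_ok}-{max_ok})")
    return OK, f"PROCS OK: {count} {name} processes"


def main(argv):
    """argv is [NAME, MIN, MAX]; the thresholds are site-specific guesses."""
    try:
        name, min_ok, max_ok = argv[0], int(argv[1]), int(argv[2])
    except (IndexError, ValueError):
        print("PROCS UNKNOWN: usage: check_proc_count.py NAME MIN MAX")
        return UNKNOWN
    code, line = classify(name, count_processes(name), min_ok, max_ok)
    print(line)
    return code


# Only act when arguments are given, so the file can also be imported.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

On hermes this might be run as `check_proc_count.py imap-login 2 50`, where 2 and 50 are placeholder bounds that would have to be tuned against what the machine normally shows. Returning WARNING rather than CRITICAL for an out-of-range count is a deliberate choice here, since count-based checks are more prone to false positives.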

Shellservers

  • distcc

All Machines

  • All machines have their own postfix that is used to send messages via sendmail. If it goes down, email on a host can pile up and not get delivered.
  • (Almost) all machines have RAID, which we should check with Nagios in addition to mdadm
  • All non-shellservers run apcupsd that should be checked
  • rwhod
  • ntpd (time drifts can cause kerberos to break)
  • bacula-fd
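
For the ntpd item above, checking that the process exists does not show whether the clock is actually right, which is what matters for Kerberos. A rough sketch of an offset check follows. Everything in it is an assumption made for illustration: the function names, the thresholds, and the choice to send a single SNTP query instead of asking the local ntpd. The critical threshold is set well below 300 seconds, the default clock skew MIT Kerberos tolerates, so the alert would fire before authentication starts failing.

```python
import socket
import struct
import time

# Standard Nagios plugin exit codes used here.
OK, WARNING, CRITICAL = 0, 1, 2

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_DELTA = 2208988800


def ntp_offset(server, timeout=5.0):
    """Estimate (server clock - local clock) in seconds with one SNTP query."""
    request = b"\x1b" + 47 * b"\0"  # LI=0, version 3, mode 3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sent = time.time()
        sock.sendto(request, (server, 123))
        reply, _ = sock.recvfrom(512)
        received = time.time()
    if len(reply) < 48:
        raise ValueError("short NTP reply")
    # Bytes 40-47 hold the server's transmit timestamp (seconds, fraction).
    seconds, fraction = struct.unpack("!II", reply[40:48])
    server_time = seconds - NTP_EPOCH_DELTA + fraction / 2**32
    # Compare against the midpoint of the local send and receive times.
    return server_time - (sent + received) / 2


def classify_offset(offset, warn=1.0, crit=60.0):
    """Map a clock offset in seconds to an (exit_code, status_line) pair.

    warn and crit are placeholder thresholds, not values from this page.
    """
    magnitude = abs(offset)
    if magnitude >= crit:
        return CRITICAL, f"TIME CRITICAL: offset {offset:+.3f}s"
    if magnitude >= warn:
        return WARNING, f"TIME WARNING: offset {offset:+.3f}s"
    return OK, f"TIME OK: offset {offset:+.3f}s"
```

A wrapper would call `classify_offset(ntp_offset(server))` against whichever time server the hosts are meant to follow, print the status line, and exit with the returned code. A query that times out raises an exception, which the wrapper would need to report as its own failure state.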

Types of monitoring

There are a few ways to monitor services:

  1. Make sure the process exists. This is the simplest way and can catch many problems. For some services it may be the only way to check them that isn't horribly complex.
    • The checks for multiplecron and svn work this way.
    • You can also look for a "correct" number of processes. If we have fewer than 2 or 3 apache processes on poseidon, for example, something is probably wrong. This method can be good and catch subtler bugs, but has a higher false-positive rate.
  2. Make sure you can reach it over the network. This usually involves making a test TCP connection and making sure it doesn't get refused.
    • ssh, http, https, etc. get tested this way.
  3. Functionality test. These are the hardest to write but the most useful, as they can catch almost any problem. We have a few of these set up:
    • Mail. There is a heartbeat script that sends an email every 5 minutes through a path that includes our incoming SMTP server as well as the various delivery mechanisms (both mailman and local delivery).
    • Web. Nagios checks the status of the http port (is it open?) as well as several test pages (like http://jdtest.caltech.edu/test.php) that test the scripts that let users run scripts as their own user, along with database connectivity.
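
The second kind of check (can we open a TCP connection without being refused?) is simple enough to sketch in a few lines. The function name and the timeout below are assumptions made for illustration; the stock Nagios plugin check_tcp already covers this, so the sketch is mainly meant to show what such a check does and does not prove.

```python
import socket
import time

# Standard Nagios plugin exit codes used here.
OK, CRITICAL = 0, 2


def check_tcp(host, port, timeout=5.0):
    """Try one TCP connection and return an (exit_code, status_line) pair."""
    start = time.monotonic()
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as exc:
        # Covers refused connections, timeouts, and name-resolution failures.
        return CRITICAL, f"TCP CRITICAL: {host}:{port} unreachable ({exc})"
    return OK, f"TCP OK: {host}:{port} accepted connection in {elapsed:.3f}s"
```

A successful result only shows that something is listening on the port. A hung daemon whose kernel socket still accepts connections would pass, which is why the functionality tests in item 3 are worth the extra effort for the important services.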