Alerting
We have a variety of automated alerting systems at UGCS to let us know when things are breaking or already broken.
Notes on alerts
Some alerts are critical, so it is nice if they go to a cell phone ("paging device"). Each carrier usually has its own email-to-SMS gateway that turns an email address into a text message. As of 2011, Verizon is tendigitnumber@vtext.com (like 6505555555@vtext.com) and AT&T is tendigitnumber@txt.att.net.
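A quick way to sanity-check one of these gateway addresses before putting it in a config file (the number below is made up, and this assumes a working local MTA plus the mailx command) is to mail it directly:

    echo "ugcs alert test" | mail -s "test page" 6505555555@vtext.com

If the text shows up on the phone, the same address can go into the alerting configs described below.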
Sometimes these alerts from nagios will get really spammy and you may get hundreds of texts (some tweaking of nagios's alert rate limiting should be done). Make sure this will not bankrupt you if you put your phone number in these config files.
We've tried to set up most of our other alerts so they go to sysadmins-alerts@ugcs.caltech.edu, instead of sysadmins@, so that people can be on sysadmins and not get all of the alert spam.
Nagios Alerting
Most of our alerts come from Nagios. These include things like host down, service not running, or other problems. You should edit (in cfengine) nagios3/conf.d/contacts.cfg and nagios3/conf.d/critical_notices.cfg to add yourself to the list. There is also an 'sms-all' list that contains the mail aliases for pagers.
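For reference, adding yourself usually amounts to a contact stanza along these lines, plus listing that contact in the relevant contact group. The names below are placeholders and the real stanzas in contacts.cfg may set different options, so copy an existing entry rather than this sketch:

    # hypothetical contact; copy an existing stanza in contacts.cfg and change the names
    define contact {
        contact_name                   jdoe
        alias                          Jane Doe
        email                          jdoe@ugcs.caltech.edu
        service_notification_period    24x7
        host_notification_period       24x7
        service_notification_options   w,u,c,r
        host_notification_options      d,u,r
        service_notification_commands  notify-service-by-email
        host_notification_commands     notify-host-by-email
    }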
Some services in nagios have a separate critical-level alert (e.g. "Critical load"). These are the alerts that will get sent to paging devices, so they typically have higher thresholds or longer hold times before they fire. By default they will go through IMSS's mail servers instead of ours, so we can still get notified if our mail system is down.
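As a rough sketch (not taken from our config; the host, command, and group names here are hypothetical), a paging-level check is usually just a second service definition with looser thresholds, a notification delay, and its own contact group:

    # hypothetical paging-level check; the real definitions live in critical_notices.cfg
    define service {
        use                        generic-service
        host_name                  somehost
        service_description        Critical load
        check_command              check_nrpe!check_load_critical
        first_notification_delay   30
        contact_groups             critical-notices
    }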
Splunk Alerts
Splunk does regular scans of all of our logs and can alert based on log messages it sees. See Splunk Alerts for more information.
Kabta ping test
There is a script running on Kabta that pings UGCS and complains if it gets no response. That complaint should definitely go to a paging device if possible. You will have to ssh to kabta to edit the script (it is called from a root crontab or an /etc/cron.d entry).
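The general shape is something like the following, though the actual script, schedule, and recipients on kabta will differ (everything below is a placeholder sketch, assuming Linux iputils ping and a working MTA on kabta):

    #!/bin/sh
    # hypothetical version of the kabta ping check, run every few minutes from cron
    # (ssh to kabta to see the real script and its real recipients)
    if ! ping -c 3 -w 10 ugcs.caltech.edu > /dev/null 2>&1; then
        echo "ugcs.caltech.edu is not answering pings from kabta" \
            | mail -s "UGCS ping failure" 6505555555@vtext.com
    fi

Since this check exists precisely for when UGCS is down, its notification address should be a pager gateway or other non-UGCS mailbox.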
Email Heartbeat
We have an end-to-end email testing system that sends a message through UGCS once every 5 minutes and complains if the message arrives late or not at all. You should edit the config file in hermes:/etc/email_heartbeat to add yourself.
By default these complaints go through IMSS's mail server (this is pretty clear from the config file). You should probably send them to a paging device and a non-UGCS email address.
Cron Jobs
While not technically alerts, mail about cron job failures goes to root@, which forwards to sysadmins@. It would be nice if these messages were redirected to sysadmin-cron or something similar, because they can be very spammy.
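One low-effort way to do that redirection, assuming a sysadmin-cron alias actually exists, is to set MAILTO in the crontabs (or per /etc/cron.d file) whose output we want diverted; the job below is just a placeholder:

    # /etc/cron.d/example -- MAILTO redirects the cron mail for this file's jobs
    # (assumes sysadmin-cron@ugcs.caltech.edu is a real alias)
    MAILTO=sysadmin-cron@ugcs.caltech.edu
    0 3 * * *  root  /usr/local/sbin/nightly-backup.sh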