Alerting

From UGCS

Revision as of 06:42, 12 September 2011

We have a variety of automated alerting at UGCS to let us know when things are breaking or already broken.

Notes on alerts

Some alerts are critical, so it is best if they go to a cell phone (a "paging device"). Most carriers provide an email-to-SMS gateway, so mail sent to a carrier-specific address is delivered to the phone as a text. As of 2011, the Verizon address is tendigitnumber@vtext.com (for example, 6505555555@vtext.com) and the AT&T address is tendigitnumber@txt.att.net.
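
As an illustration of the address format described above, here is a minimal Python sketch that builds a gateway address from a phone number. The function name sms_gateway_address is hypothetical and not part of any UGCS tooling; the gateway domain is passed in by the caller, since carrier gateways change over time.

```python
import re


def sms_gateway_address(phone_number: str, gateway_domain: str) -> str:
    """Build an email-to-SMS address from a US phone number and a gateway domain.

    Hypothetical helper for illustration only.
    """
    # Drop spaces, dashes, parentheses, and anything else that is not a digit.
    digits = re.sub(r"\D", "", phone_number)
    # Strip a leading US country code if one was supplied.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        raise ValueError(f"expected a ten-digit number, got {phone_number!r}")
    return f"{digits}@{gateway_domain}"
```

For example, sms_gateway_address("(650) 555-5555", "txt.att.net") returns "6505555555@txt.att.net", which is the form to put in an alert config file.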

Nagios alerts can get very spammy, and you may receive hundreds of texts at once (the Nagios alert rate limiting still needs some tuning). If you put your phone in these config files, make sure that volume of texts will not bankrupt you.

Nagios Alerting

Most of our alerts come from Nagios. These cover things like a host being down, a service not running, or other problems. To add yourself to the list, edit nagios3/conf.d/contacts.cfg and nagios3/conf.d/critical_notices.cfg (in cfengine). There is also an 'sms-all' list that contains the mail aliases for pagers.

Splunk Alerts

Splunk does regular scans of all of our logs and can alert based on log messages it sees. See Splunk Alerts for more information.

Kabta ping test

There is a script running on Kabta that pings UGCS and complains if it gets no reply. This should definitely go to a paging device if possible. You will have to ssh to Kabta to edit it.
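
The script on Kabta is not reproduced here. As a rough sketch of how a check like this can work, the Python below pings a host and decides whether to raise an alert. It assumes a Linux-style ping command (the -c and -W flags), and the names host_is_reachable and should_alert are hypothetical. Requiring several failed probes before alerting is also an assumption, added so that a single dropped packet does not page anyone.

```python
import subprocess
from typing import Callable


def host_is_reachable(host: str, timeout_s: int = 5) -> bool:
    """Return True if one ICMP echo to `host` succeeds (assumes Linux ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def should_alert(
    host: str,
    attempts: int = 3,
    probe: Callable[[str], bool] = host_is_reachable,
) -> bool:
    """Alert only if every one of `attempts` probes fails.

    any() stops at the first successful probe, so a healthy host costs one ping.
    """
    return not any(probe(host) for _ in range(attempts))
```

A cron job could call should_alert with the real UGCS hostname and send mail to a paging address when it returns True. The probe parameter exists so the decision logic can be exercised without sending real pings.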

Email Heartbeat

We have an end-to-end email testing system that sends a message through UGCS once every 5 minutes and complains if the message arrives too late. To receive these complaints, edit the config file in hermes:/etc/email_heartbeat and add yourself.

By default these complaints go through IMSS's mail server (this is pretty clear from the config file). You should probably send them to a paging device and a non-UGCS email address, since UGCS mail may be exactly what is broken when they fire.
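
The heartbeat's own code and config format are not shown on this page. To make the "too late" check concrete, here is a minimal Python sketch of one way to decide that a heartbeat is overdue. The 5-minute interval comes from the description above; the grace period, the constant names, and the function heartbeat_is_late are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

SEND_INTERVAL = timedelta(minutes=5)  # from the text: one message every 5 minutes
GRACE_PERIOD = timedelta(minutes=5)   # assumed slack for normal delivery delay


def heartbeat_is_late(
    last_received: Optional[datetime],
    now: datetime,
    interval: timedelta = SEND_INTERVAL,
    grace: timedelta = GRACE_PERIOD,
) -> bool:
    """Return True if no heartbeat has arrived within interval + grace."""
    if last_received is None:
        return True  # nothing has ever arrived
    return now - last_received > interval + grace
```

With these defaults, a last message received 4 minutes ago is fine, while one received 11 minutes ago counts as late and would trigger a complaint.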
