Splunk Alerts

From UGCS

Latest revision as of 06:32, 12 September 2011

Splunk runs a number of saved searches that can trigger alerts. To see them, log in to Splunk and go to "admin" and then "saved searches". Splunk saved search scripts are located at charon:/opt/splunk/bin/scripts. These alerts are designed to let us know about problems so they can be fixed quickly and we can improve our overall service level. If you find a problem that can be detected through a log search, please add a saved search.

Users logging into coreservers

This saved search runs once a minute and scans the window between two minutes ago and one minute ago for invalid users logging into coreservers. It then runs a script (muggles_trying_coreservers.py) that sends the user a helpful email. It has a couple of safeguards so that people don't get notices about ssh brute-force attempts. Note that this search doesn't actually work and doesn't currently run.
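
The point of scanning "between one and two minutes ago" is that runs one minute apart each cover a distinct minute, with no gaps and no overlap. The sketch below only illustrates that windowing idea. It is not taken from muggles_trying_coreservers.py or from the saved search itself; the function name search_window and the snapping to whole minutes are assumptions made for the example.

```python
from datetime import datetime, timedelta


def search_window(now):
    """Return (earliest, latest) for a run that starts at `now`.

    Assumption: the window is snapped to whole minutes, so a run started
    anywhere inside minute M covers exactly the minute [M-2, M-1).
    """
    this_minute = now.replace(second=0, microsecond=0)
    earliest = this_minute - timedelta(minutes=2)
    latest = this_minute - timedelta(minutes=1)
    return earliest, latest


if __name__ == "__main__":
    first = search_window(datetime(2011, 9, 12, 6, 32, 45))
    second = search_window(datetime(2011, 9, 12, 6, 33, 10))
    # The second run starts where the first one ended.
    print(first, second, second[0] == first[1])
```

Scanning a minute that has already finished, rather than the current one, gives late-arriving log lines a chance to be indexed before the search looks for them.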

LDAP server down

This alert sends notices to the sysadmins, at both their UGCS and non-UGCS addresses, if there are too many "ldap server down" messages. If you get it, double-check that Hera and Zeus are up and running correctly.

Mail forwarded in past 15 min

This alert checks whether we've forwarded any mail in the past 15 minutes. If we haven't for a few periods, that likely indicates a problem. It emails sysadmins and external sysadmins if it finds a problem. If you get one in the middle of the night, it's not a big deal. If you get 3 or 4 in a row, look through the postfix logs for errors.
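
The "3 or 4 in a row" rule of thumb amounts to counting how many of the most recent 15-minute periods forwarded no mail. Here is a minimal sketch of that check, assuming you already have a list of per-period forwarded-mail counts, oldest first. The function names and the default threshold of 3 are hypothetical and are not part of the saved search.

```python
def trailing_quiet_periods(forwarded_counts):
    """Count how many of the most recent periods forwarded no mail.

    `forwarded_counts` is a list of per-period message counts, oldest first.
    """
    quiet = 0
    for count in reversed(forwarded_counts):
        if count > 0:
            break
        quiet += 1
    return quiet


def should_investigate(forwarded_counts, threshold=3):
    """True once at least `threshold` consecutive recent periods were quiet."""
    return trailing_quiet_periods(forwarded_counts) >= threshold


if __name__ == "__main__":
    # One quiet period overnight is not a big deal; several in a row are.
    print(should_investigate([12, 0, 4]))
    print(should_investigate([12, 4, 0, 0, 0]))
```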

Client key expired

This alert lets you know if a Kerberos principal has expired. If one has, you should go reset its expiration date. This is especially important for server principals, but an expired principal also causes users a lot of pain.

IMAP Folder too full

When a user's mailbox fills up, they can't check their mailbox through IMAP. This alert lets us know if we need to increase their quota a little bit so they can check their mailbox and clean it up.

Email heartbeat

Hermes runs a cron job that sends a test email through the system and checks whether it gets all the way through. If there is too much delay, it sends alerts. The code is in hermes:/usr/local/sbin/email_tester.py. See Email Heartbeat.
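
The heartbeat decision itself can be pictured as comparing when a test message was sent with when (or whether) it arrived. The sketch below is only an illustration of that idea and is not the contents of email_tester.py. The function name heartbeat_status, the status strings, and the five-minute default for max_delay are all assumptions.

```python
from datetime import datetime, timedelta


def heartbeat_status(sent_at, received_at, now, max_delay=timedelta(minutes=5)):
    """Classify one test message as 'ok', 'late', 'pending', or 'missing'.

    `received_at` is None if the message has not arrived yet.
    """
    if received_at is None:
        # Not delivered yet: only a problem once the allowed delay has passed.
        return "missing" if now - sent_at > max_delay else "pending"
    return "late" if received_at - sent_at > max_delay else "ok"


if __name__ == "__main__":
    sent = datetime(2011, 9, 12, 6, 0, 0)
    print(heartbeat_status(sent, sent + timedelta(seconds=20), sent + timedelta(minutes=1)))
    print(heartbeat_status(sent, None, sent + timedelta(minutes=10)))
```

In a sketch like this, "late" and "missing" would be the cases that trigger an alert, while "pending" just means it is too early to tell.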

See Also

* Nagios
* Logging
* Splunk