Splunk Alerts
See also: Nagios, Logging, Splunk
Category: Sysadmin Documentation
Latest revision as of 06:32, 12 September 2011
Splunk runs a number of saved searches that can trigger alerts. To see them, log in to Splunk and go to "admin", then "saved searches". The scripts that saved searches run are located at charon:/opt/splunk/bin/scripts. These alerts are designed to let us know about problems so that they can be fixed quickly and we can improve our overall service level. If you find a problem that can be detected through a log search, please add a saved search.
Users logging into coreservers
This saved search runs once a minute and scans the window from two minutes ago to one minute ago for invalid users logging into coreservers. It then runs a script (muggles_trying_coreservers.py) that sends the user a helpful email. It has a couple of safeguards to keep people from getting notices caused by ssh brute-force attempts. Note: this alert doesn't actually work and is not currently running.
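As an illustration only, the search window and one possible brute-force safeguard might look like the sketch below. This is not the contents of muggles_trying_coreservers.py: the function names, the `known_users` set, and the `max_names_per_host` limit are all hypothetical, and the safeguards shown (skip usernames that are not real accounts, skip hosts that try many different usernames) are assumptions about what such protections could be.

```python
from datetime import datetime, timedelta


def search_window(now):
    """Return (earliest, latest): from two minutes ago to one minute ago."""
    return now - timedelta(minutes=2), now - timedelta(minutes=1)


def users_to_notify(attempts, known_users, max_names_per_host=3):
    """Pick the users who should get an email.

    attempts: iterable of (username, source_host) pairs from the search.
    known_users: set of real account names (hypothetical input).
    max_names_per_host: assumed limit; a host that tries more distinct
        usernames than this is treated as a brute-force source.
    """
    attempts = list(attempts)

    # Collect the distinct usernames each source host tried.
    names_by_host = {}
    for user, host in attempts:
        names_by_host.setdefault(host, set()).add(user)

    notify = set()
    for user, host in attempts:
        if user not in known_users:
            continue  # not a real account, so there is nobody to email
        if len(names_by_host[host]) > max_names_per_host:
            continue  # host looks like a brute-force scanner
        notify.add(user)
    return sorted(notify)
```

With this sketch, a real user who mistypes a hostname still gets the email, while a scanner cycling through usernames from one address does not generate notices even when it happens to hit a valid account name.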
LDAP server down
This alert sends notices to sysadmins, including their non-UGCS addresses, if there are too many "ldap server down" messages. If you get it, double-check that Hera and Zeus are up and running correctly.
Mail forwarded in past 15 min
This alert checks whether we've forwarded any mail in the past 15 minutes. If we haven't for a few periods in a row, that likely indicates a problem. It emails sysadmins and external sysadmins if it finds a problem. A single alert in the middle of the night is not a big deal. If you get 3 or 4 in a row, look through the postfix logs for errors.
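The "3 or 4 in a row" rule of thumb can be expressed as a small check over per-period counts. The sketch below is a hypothetical helper, not part of the saved search: it assumes you already have the number of forwarded messages for each 15-minute period, oldest first, and the threshold of 3 is taken from the guidance above.

```python
def trailing_empty_periods(forwarded_counts):
    """Count how many of the most recent periods forwarded no mail.

    forwarded_counts: messages forwarded per 15-minute period, oldest first.
    """
    empty = 0
    for count in reversed(forwarded_counts):
        if count != 0:
            break
        empty += 1
    return empty


def needs_attention(forwarded_counts, threshold=3):
    """True when at least `threshold` consecutive recent periods were empty."""
    return trailing_empty_periods(forwarded_counts) >= threshold
```

A single quiet period overnight gives a count of 1 and is ignored, while a run of empty periods that is still ongoing crosses the threshold.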
Client key expired
This alert lets you know if a Kerberos principal has expired. If one has, you should reset its expiration date. This is especially important for server principals, but an expired user principal also causes a lot of user pain.
IMAP Folder too full
When a user's mailbox fills up, they can't check their mailbox through IMAP. This alert lets us know if we need to increase their quota a little bit so they can check their mailbox and clean it up.
Email heartbeat
Hermes runs a cron job that sends a test email through the system and checks whether it gets all the way through. If there is too much delay, it sends alerts. The code is in hermes:/usr/local/sbin/email_tester.py. See Email Heartbeat.
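The general shape of such a heartbeat can be sketched as follows. This is not the code in email_tester.py: the function names, the token-in-subject scheme, and the 300-second limit are assumptions made for illustration. Actually sending the probe (for example with `smtplib.SMTP(...).send_message(msg)`) and finding it in the destination mailbox are left out.

```python
import time
import uuid
from email.message import EmailMessage


def build_probe(sender, recipient):
    """Build a test message carrying a unique token so it can be found later."""
    token = uuid.uuid4().hex
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"heartbeat {token}"
    msg.set_content(f"sent_at={time.time():.0f}")
    return token, msg


def delay_too_long(sent_at, received_at, now, max_delay_seconds=300):
    """Decide whether the probe took too long.

    sent_at, received_at, now: Unix timestamps in seconds.
    received_at is None if the probe has not arrived yet, in which case
    the time waited so far is compared against the limit.
    max_delay_seconds: assumed limit, not taken from the real script.
    """
    end = received_at if received_at is not None else now
    return (end - sent_at) > max_delay_seconds
```

Putting a unique token in each probe lets the checker match a received message to the exact send time, so a stale probe from an earlier run is never mistaken for the current one.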