Wish List
Latest revision as of 14:37, 16 September 2011
This page aims to list current improvements we would like to make to the cluster. Ask jdhutchin if you have any questions about them.
Good starter Projects
Fix splunk
- The upgrade to 4.x broke it.
- Requires setting up access to charon (useful for other stuff too)
- Allows our log alerting, etc. to be set up again.
Squeeze Upgrade
The following computers need to be upgraded to squeeze:
- Hermes (complicated, postfix needs to be rebuilt with a small patch)
- Hera (not too bad, DNS CNAMEs need to be changed ahead of time)
- Charon, enlil, kabta
Fix backups
- The drives on persephone are too small, so we run out of space
- This is urgent as we can't currently run a full backup cycle
- We might want to run bacula "Base" backups to save space and time (see the bacula documentation)
- Also, tape backups need to be run more often than never
Migrate to postgres 8.4
- Not too bad
- User notification required
- Test mediawiki with it.
Upgrade ugcs_libs
- The package is mostly built, just needs some testing
- Needs to be deployed to get rid of deprecation warnings
Audit mailing lists
- People have signed up random accounts on them and are spying on our mail
Write auto-scanner for malware
- We need to look at our web serving and auto-detect when we are serving spam off of it.
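One possible starting point, as a minimal sketch only: walk a web root and flag files whose contents match spam-like patterns. The function name `scan_web_root`, the pattern list, and the set of file extensions are all illustrative assumptions, not an existing UGCS tool, and a real scanner would need a maintained pattern list.

```python
import os
import re

# Hypothetical patterns; a real scanner would need a curated, maintained list.
SUSPICIOUS_PATTERNS = [
    re.compile(rb"eval\s*\(\s*base64_decode", re.IGNORECASE),
    re.compile(rb"viagra|cialis", re.IGNORECASE),
]
# Assumed set of file types worth scanning.
SCANNED_EXTENSIONS = (".php", ".html", ".htm", ".js")


def scan_web_root(root):
    """Return a sorted list of (path, pattern) pairs for files that match."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(SCANNED_EXTENSIONS):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    data = handle.read()
            except OSError:
                continue  # unreadable file; skip rather than abort the scan
            for pattern in SUSPICIOUS_PATTERNS:
                if pattern.search(data):
                    hits.append((path, pattern.pattern.decode()))
    return sorted(hits)
```

Run from cron, the returned list could be mailed to the sysadmins whenever it is non-empty.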
Mediawiki upgrader
- The upgrade to 1.16 for users may break, especially if they tried to install more than one MediaWiki on their account
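To find the accounts most likely to break, something like the following sketch could list every MediaWiki install under a user's web directory. It assumes an install can be recognized by the presence of `LocalSettings.php`; the function names and the idea of flagging multi-install accounts for manual handling are illustrative, not part of any existing upgrader.

```python
import os


def find_mediawiki_installs(web_dir):
    """Return sorted directories under web_dir that contain LocalSettings.php."""
    installs = []
    for dirpath, _dirnames, filenames in os.walk(web_dir):
        if "LocalSettings.php" in filenames:
            installs.append(dirpath)
    return sorted(installs)


def needs_manual_upgrade(web_dir):
    """Flag accounts with more than one install for hand inspection."""
    return len(find_mediawiki_installs(web_dir)) > 1
```

Accounts for which `needs_manual_upgrade` returns True could be skipped by the automatic upgrade and handled by hand.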
More website auto-setup
- Write more utilities like setup-mediawiki for Django, etc., so people can easily set up their own web stuff under UGCS.
Add news / tip of the day
Even better, write your own nice little utilities and then let people know about them.
Upgrade the juniper switch
There is a new version of JOS out that we should upgrade to.
Move Kabta
Kabta currently sees a lot of intermittent packet loss in its current location.
Autofixers
Set up nagios so it more aggressively auto-restarts stuff when it is down.
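Nagios event handlers are one way to do this. The sketch below shows the decision logic such a handler might use, assuming it is passed the service state, state type, and attempt number (the values of the `$SERVICESTATE$`, `$SERVICESTATETYPE$`, and `$SERVICEATTEMPT$` macros) followed by a restart command. The function names, the argument order, and the restart command are hypothetical.

```python
import subprocess


def should_restart(state, state_type, attempt, soft_attempt=3):
    """Decide whether to restart, given Nagios state macro values.

    Restart on a HARD CRITICAL state, or on the soft_attempt-th SOFT CRITICAL
    check so the fix is tried before the state goes HARD.
    """
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    return state_type == "SOFT" and attempt == soft_attempt


def main(argv):
    """Assumed argument layout: state, state type, attempt, restart command..."""
    if len(argv) < 5:
        return 2  # not enough arguments to act on
    state, state_type, attempt = argv[1], argv[2], int(argv[3])
    restart_command = argv[4:]  # hypothetical, e.g. an init script invocation
    if should_restart(state, state_type, attempt):
        return subprocess.call(restart_command)
    return 0
```

Lowering `soft_attempt` makes the handler more aggressive; a wrapper script would call `sys.exit(main(sys.argv))`.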
Maintenance
These are things that we have to do even if there aren't full-time student sysadmins.
- Account requests and password resets: SLA: 1 day
  - How do we know: We get emails
- Fix it when it breaks: Server down
  - SLA: 1 hr
  - How do we know: Email alerts for most things, SMS to jdhutchin's phone for really urgent things.
  - Owner: jdhutchin
- Fix minor support requests for things that are broken: SLA: 5 days
  - Sooner would be better
- Answer user questions: SLA: Best-effort
  - It would be nice if we could do this but it isn't a top priority
Software
Fix mex (matlab compiler)
Add support for distributed Mathematica on mortals
Small fixes
Small things that need to be fixed across various services/machines:
- Email heartbeat
- Hestia SSL cert
- Change kabta back to ssh keys after Alex/Raymond add theirs
- Find the sysadmins' PGP key
- Fix the backup schedules to something sensible
Mail System
See Mail Improvements
Automatic group creation/management
See ugcs groups
Large file hosting
Almost done! See NFS servers.
The server is running and exporting things correctly. All we need now is disk quotas.
Account creator / password reset
- Re-work as necessary to ensure robustness
- Add exception reporting system (email to sysadmins)
- Write full test suite to ensure quality
- Fix the bug where mis-entering your Kerberos password half-creates the account anyway, which is a pain to straighten out.
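One way to avoid half-created accounts is to treat creation as a list of steps, each paired with an undo action, and roll back whatever has already run when a later step fails. This is only a sketch of that pattern: it assumes the account creator can be split into such steps, and `create_account` and the (do, undo) pairs are hypothetical names, not the existing code.

```python
def create_account(username, steps):
    """Run (do, undo) step pairs in order; undo completed steps on failure.

    `steps` is a list of (do, undo) callables, each taking the username.
    Returns True on success. On failure, rolls back and re-raises.
    """
    completed = []
    try:
        for do, undo in steps:
            do(username)
            completed.append(undo)
    except Exception:
        # Undo in reverse order so later steps are cleaned up first.
        for undo in reversed(completed):
            undo(username)
        raise
    return True
```

Checking the password before any step runs would be simpler still; the rollback covers failures that can only be detected partway through.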
Network
- Write a system that shows us mac/ip/port number
- Add port mirroring to charon for desirable traffic
- Improve firewalls
- Enable switch port security
- Fix switch names
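For the mac/ip/port system, the core is a join between an ARP table (IP to MAC) and the switch's forwarding table (MAC to port). The sketch below shows just that join on data that has already been collected; how the two tables are gathered is left out, and the function names and input shapes are assumptions.

```python
def normalize_mac(mac):
    """Lowercase a MAC address and strip separators for comparison."""
    return mac.lower().replace(":", "").replace("-", "").replace(".", "")


def join_tables(arp_entries, switch_entries):
    """Combine (ip, mac) pairs with (mac, port) pairs into sorted rows.

    Returns (mac, ip, port) tuples sorted by the IP string; port is None
    when the switch table has no entry for that MAC.
    """
    port_by_mac = {normalize_mac(mac): port for mac, port in switch_entries}
    rows = []
    for ip, mac in arp_entries:
        key = normalize_mac(mac)
        rows.append((key, ip, port_by_mac.get(key)))
    return sorted(rows, key=lambda row: row[1])
```

Rows with a port of None are worth a look: the host is answering ARP but the switch has not seen its MAC.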
Hardware
- Set up hestia to take over for dionysus - in progress
- Network card flip: put one of the single gigabit cards into charon, and move its two-port gigabit cards into poseidon and persephone
Web hosting
- Add a failover web server
Global login records
We need to implement something with LDAP so that we have global login records.
Documentation
- We need a printed-out copy of critical wiki stuff
- We need to make more documentation about our services for disaster recovery.
- We need to update all of the core server pages with correct disk setups and currently running services.