Wish List

From UGCS
Latest revision as of 14:37, 16 September 2011

This page aims to list current improvements we would like to make to the cluster. Ask jdhutchin if you have any questions about them.

Good starter Projects

Fix splunk

  • The upgrade to 4.x broke it.
  • Requires setting up access to charon (useful for other stuff too)
  • Allows our log alerting, etc. to be set up again.

Squeeze Upgrade

The following computers need to be upgraded to squeeze:

  • Hermes (complicated, postfix needs to be rebuilt with a small patch)
  • Hera (not too bad, DNS CNAMEs need to be changed ahead of time)
  • Charon, enlil, kabta

Fix backups

  • The drives on persephone are too small, so we run out of space
  • This is urgent as we can't currently run a full backup cycle
  • We might want to run bacula "Base" backups to save space and time (see the bacula documentation)
  • Also, tape backups need to be run more often than never

Migrate to postgres 8.4

  • Not too bad
  • User notification required
  • Test mediawiki with it.

Upgrade ugcs_libs

  • The package is mostly built, just needs some testing
  • Needs to be deployed to get rid of deprecation warnings

Audit mailing lists

  • People have signed up random accounts on them and are spying on our mail

Write auto-scanner for malware

  • We need to look at our web serving and auto-detect when we are serving spam off of it.
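
One possible starting point for such a scanner is sketched below. The patterns, the file extensions, and the function names (`suspicious_matches`, `scan_tree`) are assumptions for illustration, not an existing UGCS tool; a real scanner would need patterns tuned against actual incidents.

```python
import os
import re

# Hypothetical patterns: common spam keywords and hidden iframes.
SPAM_PATTERNS = [
    re.compile(r"\b(viagra|cialis|casino)\b", re.IGNORECASE),
    re.compile(r"<iframe[^>]+style=[\"'][^\"']*display:\s*none", re.IGNORECASE),
]


def suspicious_matches(text):
    """Return the sorted, de-duplicated, lowercased strings that matched."""
    hits = set()
    for pattern in SPAM_PATTERNS:
        for match in pattern.finditer(text):
            hits.add(match.group(0).lower())
    return sorted(hits)


def scan_tree(root, extensions=(".html", ".htm", ".php")):
    """Walk a web root and yield (path, matches) for files with hits."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="replace") as handle:
                    matches = suspicious_matches(handle.read())
            except OSError:
                continue  # unreadable file: skip rather than abort the scan
            if matches:
                yield path, matches
```

Running `scan_tree` over each user's web directory from cron and mailing any hits to the sysadmins would be one way to wire this in.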

Mediawiki upgrader

  • The upgrade to 1.16 may break for users, especially if they tried to install more than one mediawiki on their account

More website auto-setup

  • Write more utilities like setup-mediawiki (for django, etc.) so people can easily set up their own web stuff under UGCS.

Add news / tip of the day

Even better, write your own nice little utilities and then let people know about them.

Upgrade the juniper switch

There is a new version of JOS out that we should upgrade to.

Move Kabta

Kabta currently sees a lot of intermittent packet loss in its current location.

Autofixers

Set up nagios so it more aggressively auto-restarts stuff when it is down.
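
Nagios supports this through event handlers, which it can call with the service state, state type, and check attempt number as arguments. Below is a minimal sketch of the decision logic such a handler might use. The restart policy (third soft failure, or any hard failure), the argument order, and the init script path are assumptions to adapt to our setup.

```python
import subprocess


def should_restart(state, state_type, attempt, soft_attempt=3):
    """Decide whether a failed service should be restarted.

    Restart once on the Nth SOFT failure (hoping to fix it before anyone
    is notified), and again if the problem reaches a HARD state.
    """
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    return state_type == "SOFT" and attempt == soft_attempt


def handle_event(argv):
    """Entry point for an event handler script.

    Assumed argument order: state, state type, attempt, service name,
    e.g. passed in the command definition as
    $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ plus a service name.
    """
    state, state_type, attempt, service = argv[0], argv[1], int(argv[2]), argv[3]
    if should_restart(state, state_type, attempt):
        # Hypothetical restart command; adjust per service.
        return subprocess.call(["/etc/init.d/" + service, "restart"])
    return 0
```

Restarting on exactly one soft attempt rather than on every failed check avoids restart loops while a service is still coming back up.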

Maintenance

These are things that we have to do even if there aren't full-time student sysadmins.

  • Account requests and password resets: SLA: 1 day
    • How do we know: We get emails
  • Fix it when it breaks: Server down
    • SLA: 1 hr
    • How do we know: Email alerts for most things, SMS to jdhutchin's phone for really urgent things.
    • Owner: jdhutchin
  • Fix minor support requests for things that are broken: SLA: 5 days
    • Sooner would be better
  • Answer user questions: SLA: Best-effort
    • It would be nice if we could do this but it isn't a top priority


Software

Fix mex (MATLAB compiler)

Add support for distributed Mathematica on mortals

Small fixes

Small things that need to be fixed across various services/machines:

  • Email heartbeat
  • Hestia SSL cert
  • Change kabta back to ssh keys after Alex/Raymond add theirs
  • Find the sysadmins' PGP key
  • Fix the backup schedules to something sensible

Mail System

See Mail Improvements

Automatic group creation/management

See ugcs groups

Large file hosting

Almost done! See NFS servers.

The server is running and exporting things correctly. All we need now is disk quotas.

Account creator / password reset

  • Re-work as necessary to ensure robustness
  • Add exception reporting system (email to sysadmins)
  • Write full test suite to ensure quality
  • Fix the bug where mis-entering your Kerberos password half-creates the account anyway, which is a pain to straighten out.
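
One way to avoid half-created accounts is to pair every creation step with an undo step and roll back whatever has completed when a later step fails. The sketch below illustrates that pattern only. The helper name `run_with_rollback` and the example steps are hypothetical and do not describe the current account creator.

```python
def run_with_rollback(steps):
    """Run (do, undo) pairs in order; undo completed steps on failure.

    If any do() raises, the undo() of every step that already finished
    is called in reverse order, then the original exception is re-raised.
    """
    completed = []
    try:
        for do, undo in steps:
            do()
            completed.append(undo)
    except Exception:
        for undo in reversed(completed):
            try:
                undo()
            except Exception:
                # Best effort: a real tool should report this to the sysadmins.
                pass
        raise


# Usage sketch with stand-in steps (the real ones would touch LDAP,
# Kerberos, home directories, and so on).
if __name__ == "__main__":
    actions = []

    def bad_password():
        raise ValueError("Kerberos password mismatch")

    try:
        run_with_rollback([
            (lambda: actions.append("create home"), lambda: actions.append("remove home")),
            (bad_password, lambda: None),
        ])
    except ValueError:
        print(actions)
```

This also gives the exception reporting item a natural hook: the `except` branch is where an email to the sysadmins could be sent.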

Network

  • Write a system that shows us mac/ip/port number
  • Add port mirroring to charon for desirable traffic
  • Improve firewalls
  • Enable switch port security
  • Fix switch names
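
For the mac/ip/port item, the core of such a system is a join between an IP-to-MAC table (for example from ARP) and a MAC-to-port table (for example from the switch's forwarding table). The sketch below shows only that join. How the two tables are collected is left open, and the function name and data shapes are assumptions.

```python
def join_tables(arp, forwarding):
    """Combine ip->mac and mac->port mappings into (ip, mac, port) rows.

    MAC addresses are lowercased before matching so that differently
    formatted sources still line up. The port is None when the switch
    has no entry for that MAC. Rows are sorted by IP string.
    """
    ports = {mac.lower(): port for mac, port in forwarding.items()}
    rows = []
    for ip, mac in arp.items():
        mac = mac.lower()
        rows.append((ip, mac, ports.get(mac)))
    return sorted(rows)


# Usage sketch with made-up addresses and port names.
if __name__ == "__main__":
    arp = {"10.0.0.2": "00:11:22:AA:BB:CC", "10.0.0.3": "00:11:22:aa:bb:dd"}
    forwarding = {"00:11:22:aa:bb:cc": "ge-0/0/1"}
    for row in join_tables(arp, forwarding):
        print(row)
```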

Hardware

  • Set up hestia to take over for dionysus - in progress
  • Network card flip: put one of the single gigabit cards into charon, and move its two-port gigabit cards into poseidon and persephone

Web hosting

  • Add a failover web server

Global login records

We need to implement something with LDAP so that we have global login records.

Documentation

  • We need a printed-out copy of critical wiki stuff
  • We need to make more documentation about our services for disaster recovery.
  • We need to update all of the core server pages with correct disk setups and currently running services.