Wish List

From UGCS
Latest revision as of 14:37, 16 September 2011

This page aims to list current improvements we would like to make to the cluster. Ask jdhutchin if you have any questions about them.

Good starter Projects

Fix splunk

  • The upgrade to 4.x broke it.
  • Fixing it requires setting up access to charon (useful for other stuff too)
  • Fixing it allows our log alerting, etc. to get set up again.

Squeeze Upgrade

The following computers need to be upgraded to squeeze:

  • Hermes (complicated, postfix needs to be rebuilt with a small patch)
  • Hera (not too bad, DNS CNAMEs need to be changed ahead of time)
  • Charon, enlil, kabta

Fix backups

  • The drives on persephone are too small, so we run out of space
  • This is urgent as we can't currently run a full backup cycle
  • We might want to run bacula "Base" backups to save space and time (see the bacula documentation)
  • Also, tape backups need to be run more often than never

Migrate to postgres 8.4

  • Not too bad
  • User notification required
  • Test mediawiki with it.

Upgrade ugcs_libs

  • The package is mostly built, just needs some testing
  • Needs to be deployed to get rid of deprecation warnings

Audit mailing lists

  • People have signed up random accounts on them and are spying on our mail
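One way to start the audit is to dump each list's subscribers (for example with Mailman's `list_members` command) and compare them against the addresses we expect. The sketch below is only an illustration: the function name, the allow-list, and the allowed-domain idea are assumptions, not an existing UGCS script.

```python
def find_unexpected_subscribers(subscribers, allowed_addresses, allowed_domains=()):
    """Return subscribers that are neither explicitly allowed nor in an allowed domain.

    All three arguments are plain iterables of strings; how they are
    collected (Mailman dump, LDAP query, a text file) is left open.
    """
    allowed = {a.strip().lower() for a in allowed_addresses}
    domains = {d.strip().lower() for d in allowed_domains}
    unexpected = set()
    for raw in subscribers:
        addr = raw.strip().lower()
        if not addr:
            continue  # skip blank lines in the dump
        domain = addr.rpartition("@")[2]
        if addr in allowed or domain in domains:
            continue
        unexpected.add(addr)
    return sorted(unexpected)
```

Anything the function returns still needs a human look before it is unsubscribed; a sysadmin's personal address would show up here unless it is on the allow-list.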

Write auto-scanner for malware

  • We need to look at our web serving and auto-detect when we are serving spam off of it.
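A very rough starting point is a script that walks the web roots and scores files against a list of spam markers. The patterns, threshold, and file extensions below are hypothetical placeholders and would need tuning against real incidents; this is a sketch of the idea, not a finished scanner.

```python
import os
import re

# Hypothetical spam markers; a real list would come from past incidents.
SPAM_PATTERNS = [
    re.compile(r"\b(viagra|cialis|payday loans?)\b", re.IGNORECASE),
    # Links hidden inside an element styled display:none.
    re.compile(r"display\s*:\s*none[^>]*>\s*<a\s", re.IGNORECASE),
]


def score_text(text):
    """Count how many times any spam pattern matches in the text."""
    return sum(len(pattern.findall(text)) for pattern in SPAM_PATTERNS)


def scan_tree(root, threshold=3, extensions=(".html", ".htm", ".php")):
    """Return (path, score) pairs for files scoring at or above the threshold."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as handle:
                    score = score_text(handle.read())
            except OSError:
                continue  # unreadable file; skip rather than abort the scan
            if score >= threshold:
                hits.append((path, score))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Run from cron over the user web directories, the output could be mailed to the sysadmins for review rather than acted on automatically, since keyword matching will produce false positives.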

Mediawiki upgrader

  • The upgrade to 1.16 for users may break, especially if they tried to install more than one mediawiki on their account

More website auto-setup

  • Write more utilities like setup-mediawiki for django, etc so people can easily set up their own web stuff under UGCS.

Add news / tip of the day

Even better, write your own nice little utilities and then let people know about them.

Upgrade the juniper switch

There is a new version of JOS out that we should upgrade to.

Move Kabta

Kabta currently sees a lot of intermittent packet loss in its current location.

Autofixers

Set up nagios so it more aggressively auto-restarts stuff when it is down.
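Nagios supports this through event handlers, which are commands run on state changes and which can be passed the `$SERVICESTATE$`, `$SERVICESTATETYPE$`, and `$SERVICEATTEMPT$` macros. The sketch below shows the decision logic such a handler might use. The function names, the choice to restart on the second soft failure, and the `/etc/init.d/<service>` path are all assumptions for illustration.

```python
import subprocess


def should_restart(state, state_type, attempt, restart_on_soft_attempt=2):
    """Decide whether this state change warrants a restart.

    Restarts once early (on a chosen SOFT attempt) and again when the
    problem goes HARD, instead of on every failed check.
    """
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    return state_type == "SOFT" and attempt == restart_on_soft_attempt


def handle_event(state, state_type, attempt, service, runner=subprocess.call):
    """Restart the service if warranted; return the runner's result or None."""
    if should_restart(state, state_type, int(attempt)):
        # Assumption: the service has an init script under /etc/init.d.
        return runner(["/etc/init.d/" + service, "restart"])
    return None
```

Lowering `restart_on_soft_attempt` makes the handler more aggressive. The `runner` parameter exists only so the logic can be exercised without actually restarting anything.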

Maintenance

These are things that we have to do even if there aren't full-time student sysadmins.

  • Account requests and password resets: SLA: 1 day
    • How do we know: We get emails
  • Fix it when it breaks: Server down
    • SLA: 1 hr
    • How do we know: Email alerts for most things, SMS to jdhutchin's phone for really urgent things.
    • Owner: jdhutchin
  • Fix minor support requests for things that are broken: SLA: 5 days
    • Sooner would be better
  • Answer user questions: SLA: Best-effort
    • It would be nice if we could do this but it isn't a top priority


Software

Fix mex (matlab compiler)

Add support for distributed Mathematica on mortals

Small fixes

Small things that need to be fixed across various services/machines:

  • Email heartbeat
  • Hestia SSL cert
  • Change kabta back to ssh keys after Alex/Raymond add theirs
  • Find the sysadmins' PGP key
  • Fix the backup schedules to something sensible

Mail System

See Mail Improvements

Automatic group creation/management

See ugcs groups

Large file hosting

Almost done! See NFS servers. The server is running and exporting things correctly. All we need now is disk quotas.

Account creator / password reset

  • Re-work as necessary to ensure robustness
  • Add exception reporting system (email to sysadmins)
  • Write full test suite to ensure quality
  • Fix bug where if you mis-enter your krb pw it half-creates the account anyway and is a pain to straighten out.
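One possible shape for the half-created-account fix is to treat account creation as a list of steps, each paired with an undo action, and roll back whatever already succeeded when a later step fails. This is a sketch under assumptions: the step names and the `AccountCreationError` class are hypothetical, and the real creator may be structured quite differently.

```python
class AccountCreationError(Exception):
    """Raised when account creation fails and has been rolled back."""


def create_account(steps):
    """Run (name, do, undo) steps in order; undo completed steps if one fails.

    Returns the names of the completed steps on success.
    """
    completed = []
    try:
        for name, do, undo in steps:
            do()
            completed.append((name, undo))
    except Exception as exc:
        rollback_failures = []
        # Undo in reverse order so later steps are removed first.
        for name, undo in reversed(completed):
            try:
                undo()
            except Exception as undo_exc:
                rollback_failures.append((name, undo_exc))
        raise AccountCreationError(
            "account creation failed (%s); rollback problems: %r"
            % (exc, rollback_failures)
        ) from exc
    return [name for name, _undo in completed]
```

The `except` branch is also a natural place to hook in the exception report to the sysadmins. A simpler complementary fix would be to validate the password entry before running any step at all.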

Network

  • Write a system that shows us mac/ip/port number
  • Add port mirroring to charon for desirable traffic
  • Improve firewalls
  • Enable switch port security
  • Fix switch names
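The mac/ip/port system could begin as a join between an ARP table (IP to MAC) and the switch forwarding table (MAC to port). The sketch below parses Linux `arp -an` style output and merges it with a MAC-to-port mapping. How that mapping is pulled from the switches (SNMP, CLI scraping) is left open, and the port names in the usage note are made up.

```python
import re

# Matches lines such as: ? (10.0.0.5) at 00:16:3e:aa:bb:cc [ether] on eth0
ARP_LINE = re.compile(r"\((?P<ip>[0-9.]+)\) at (?P<mac>[0-9a-fA-F:]{17})")


def parse_arp(output):
    """Return {mac: ip} from `arp -an` style output, skipping incomplete entries.

    Simplification: if one MAC answers for several IPs, only the last is kept.
    """
    table = {}
    for line in output.splitlines():
        match = ARP_LINE.search(line)
        if match:
            table[match.group("mac").lower()] = match.group("ip")
    return table


def join_tables(mac_to_ip, mac_to_port):
    """Return sorted (mac, ip, port) rows, with '?' where a side is unknown."""
    rows = []
    for mac in sorted(set(mac_to_ip) | set(mac_to_port)):
        rows.append((mac, mac_to_ip.get(mac, "?"), mac_to_port.get(mac, "?")))
    return rows
```

For example, `join_tables(parse_arp(arp_output), {"00:16:3e:aa:bb:cc": "ge-0/0/3"})` would yield one row per known MAC, which makes machines with a port but no IP (or the reverse) easy to spot.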

Hardware

  • Set up hestia to take over for dionysus - in progress
  • network card flip: put one of the single gigabit cards into charon, and move its two-port gigabit cards into poseidon and persephone

Web hosting

  • Add a failover web server

Global login records

We need to implement some stuff with LDAP so we have global login records.

Documentation

  • We need a printed-out copy of critical wiki stuff
  • We need to make more documentation about our services for disaster recovery.
  • We need to update all of the core server pages with correct disk setups and currently running services.