Wish List

(Stuff for Spring Break)
(Good starter Projects)
 
(45 intermediate revisions by 3 users not shown)
Line 1: Line 1:
==Stuff for Spring Break==
+
This page aims to list current improvements we would like to make to the cluster.  Ask jdhutchin if you have any questions about them.
* Get lab set up to use projector and sound system
+
* Bacula backup for servers
+
* Start on buildserver? I'd put a bit of time into this (Assign: alexr, ?)
+
* Get iodine reliable and publicized to lusers (Assign: alexr)
+
* Get shellservers listening to SSH on ports 80 and 53 for people behind restrictive firewalls (I'll do when I get a few mins) (Assign: alexr)
+
* Start looking into some of the warnings we're receiving from cron, nagios, etc, and silence (fix or set to ignore) (Assign: alexr, ?)
+
* Script to let people remove job listings automatically (Assign: alexr)
+
** This should go in a remctl command
+
* Netboot UGCS disc
+
* SVN for software in ugcs-admin
+
  
 +
==Good starter Projects==
  
* Virtual hosting (did you mean via some sort of dropfile-esque interface? -alexr)
+
===Fix splunk===
* Spamassassin/ldap settings- Done before break [[User:Jdhutchin@ugcs.caltech.edu|Jdhutchin@ugcs.caltech.edu]] 05:05, 16 March 2008 (PDT)
+
* The upgrade to 4.x broke it.
* Debug ldap and mail- this will involve some downtime.
+
* Requires setting up access to charon (useful for other stuff too)
** We should try to get ldaps working again if possible, or at least figure out why it doesn't work anymore. I know we're not using ldap for anything secure right now but I don't want to rule it out in the future. Also, I suspect it's behind the Alpine failures. (alexr)
+
* Allows our log alerting, etc to get set up again.
** It seems to be working- ldaps will have to wait for later versions of libldap that don't suck
+
** Basically done [[User:Jdhutchin@ugcs.caltech.edu|Jdhutchin@ugcs.caltech.edu]] 05:05, 16 March 2008 (PDT)
+
  
==Stuff for Winter break==
+
===Squeeze Upgrade===
* Fix account creation system- cgi principals and postgres databases 
+
The following computers need to be upgraded to squeeze:
** Done [[User:Jdhutchin@ugcs.caltech.edu|Jdhutchin@ugcs.caltech.edu]] 21:03, 8 December 2007 (PST)
+
* Hermes (complicated, postfix needs to be rebuilt with a small patch)
* Get cfengine to do dns/bind
+
* Hera (not too bad, dns CNAMES need to be changed ahead of time)
** Done [[User:Jdhutchin@ugcs.caltech.edu|Jdhutchin@ugcs.caltech.edu]] 22:05, 8 December 2007 (PST)
+
* Charon, enlil, kabta
* [[Setup gale]]
+
* Apache logs for users (Joshua)  
+
* Get Spamassassin to let users set their own settings
+
* Scripts to set up wikis
+
* Tripwire
+
** Binaries are sitting on hephaestus in /var/local
+
* Tabulate cluster usage statistics
+
* Netboot UGCS Disk
+
  
==Remaining==
+
===Fix backups===
===Critical===
+
* The drives on persephone are too small, so we run out of space
* Allow user creation of mailing lists, automated mailing list updates
+
* This is urgent as we can't currently run a full backup cycle
* Clean up pagsh entries on poseidon to avoid needing to periodically reboot
+
* We might want to run bacula "Base" backups to save space and time (see the bacula documentation)
 +
* Also, tape backups need to be run more often than never
  
===High===
+
===Migrate to postgres 8.4===
* [[Sysadmin:Security_Todo|Security To-Do]]
+
* Not too bad
* Add to debian [http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=417917 bug#417917] regarding AFS module unload oops
+
* User notification required
* Get to.ugcs working, suggested to use http://www.stanford.edu/~riepel/lbnamed/Stanford-DNSserver/ or http://twistedmatrix.com/trac/wiki/TwistedNames
+
* Test mediawiki with it.
* New account scripts
+
** Create postgres user and database upon user creation
+
* Mailman list creation by users (this will be done with Mailman SSO)
+
* Write documentation for users- partially done, what other specific topics do we need?
+
** Online man pages / software documentation
+
* redoing the build daemon (maurer)
+
* Set up poseidon with services
+
* Gale (packaging 90% complete, need to test maintainer scripts)
+
* More restrictive iptables rules on Charon
+
* Fix issues with swap not mounting on muses
+
* Debug dovecot hanging issue
+
* Increase AFS Cache sizes--we're not using the disks for much else
+
* Write a script so users can update their ldap info
+
* Backups
+
  
===Medium===
+
===Upgrade ugcs_libs===
* mailAlternateAddress scripts
+
* The package is mostly built, just needs some testing
* Let users see their own Apache logs
+
* Needs to be deployed to get rid of deprecation warnings
* Create list of standard required packages- Mostly done in cfengine package.conf
+
** Better solution: pkgsync
+
* Mailman SSO
+
* Automated quota notices for mail and home
+
  
===Low===
+
===Audit mailing lists===
* Install power to center (pending MHF grant result)
+
* People have signed up random accounts on them and are spying on our mail
* Set up Persephone for backup
+
* Web based manuals for software
+
** Do we really want to duplicate manpages online?
+
** You can do things like hyperlink and have an index with apropos info that man doesn't have. Also, I think the info pages are nicer to browse. Also, some software only has HTML docs / vastly superior HTML docs next to the other documentation --[[User:Goldstei@ugcs.caltech.edu|Goldstei@ugcs.caltech.edu]] 09:43, 8 October 2007 (PDT)
+
* Various old UGCS niceties
+
** finger
+
** dictd running somewhere
+
** global login records
+
** sl
+
** configure and (auto nice daemon)
+
** tools for viewing global login records
+
** review old '/ug/adm/scripts' for useful stuff and ask Josh Goldstein if what is going on makes no sense
+
* Put all configs in cfengine to ease rebuilding machines from known good state
+
* General application-specific tweaks
+
* Get distcc working properly
+
  
==Completed==
+
===Write auto-scanner for malware===
* Decide whether to go with AFS or NFS
+
* We need to look at our web serving and auto-detect when we are serving spam off of it.
** Decided to do NFS for root filesystem, AFS for user homedirs and maildirs.
+
 
* Do NFS setup for netboot root
+
===Mediawiki upgrader===
* Get DNS on Demeter up so we can properly reference 'task' hostnames
+
* The upgrade to 1.16 for users may break, esp if they tried to install >1 mediawiki on their account
* Agree on IP allocation
+
 
* Set up Hermes for mail
+
===More website auto-setup===
* Set up core switches
+
* Write more utilities like setup-mediawiki for django, etc so people can easily set up their own web stuff under UGCS.
* Set up Charon for routing, get snort running
+
 
* Budget planning
+
===Add news / tip of the day===
* PXE setup on Demeter
+
Even better, write your own nice little utilities and then let people know about them.
* Start migrating over Pukes to serve as test client machines
+
 
* Start using CFEngine to manage sudo so that all machines stay in sync and we can setup different sudoers for different machines (e.g. donut, new sysadmins, etc. when the time comes)
+
===Upgrade the juniper switch===
* Investigate pam_access group restrictions to prevent non-sysadmin login into core machine
+
There is a new version of JOS out that we should upgrade to.
** used pam_access and cfengine
+
 
* Migrate user data from NIS to LDAP
+
===Move Kabta===
* Set up password migration frontend
+
Kabta currently sees a lot of intermittent packet loss in its current location.
* Send administrative e-mails warning users
+
 
* Update hostmaster with new IP allocations for rDNS
+
===Autofixers===
* Set up and migrate Kerberos/LDAP to Zeus
+
Set up nagios so it more aggressively auto-restarts stuff when it is down.
** In progress - Zeus is physically up on the [[Compaq_Proliant_3U]].
+
 
* Set up new CA properly
+
==Maintenance==
* Order muses/naiads
+
These are things that we have to do even if there aren't full-time student sysadmins.
* Migrate mailing lists
+
 
** Groundwork completed
+
* Account requests and password resets SLA: 1 day
* Set up nullmailer to redirect mail to hermes
+
** How do we know: We get emails
* Set up poseidon
+
 
* Migrate network cabling
+
* Fix it when it breaks: Server down
* Set up hephaestus
+
** SLA: 1hr
* Migrate Apollo IP
+
** How do we know: Email alerts for most things, sms to jdhutchin's phone for really urgent things.
* Convert to task-based CNAMES i.e. ldap-head
+
** Owner: jdhutchin
* Configure pipermail on hermes
+
 
* Mount old UGCS NFS on Apollo
+
* Fix minor support requests for things that are broken: SLA: 5 days
* Contact alumni association i.e. andy shaindlin and karen carlson
+
** Sooner would be better
* Set up charon to record logging data
+
 
* chsh script
+
* Answer user questions: SLA: Best-effort
* Migrate mail data
+
** It would be nice if we could do this but it isn't a top priority
* Prepare to migrate user public html data
+
 
* mailForwardingAddress script: mail_forward (available on pukes)
+
 
* work out suexec/afs/kerberos interactions - finally freaking done
+
==Software==
* secure and test pseudo-suexec
+
Fix mex (matlab compiler)
* mailForwardingAddress debugging - found problem (needed to apply the alias map twice; once in virtual alias maps, second time in local alias maps)
+
 
* Pine SSL certificate issue
+
Add support for distributed Mathematica on mortals
* Finish tweaking netboot- should be pretty much done
+
 
* Reformat old pukes to be used as muses- takes about 10min/machine
+
==Small fixes==
* Bring down Purchase, migrate demeter IP
+
Small things that need to be fixed across various services/machines:
* Fix POP/IMAP issue with full resend and relogins for every request
+
* Email heartbeat
** http://www.dovecot.org/list/dovecot/2007-January/018653.html and friends -
+
* Hestia SSL cert
* Migrate homedir data
+
* Change kabta back to ssh keys after Alex/Raymond add theirs
* Investigate LDAP slave mirror problems
+
* Find the sysadmins' PGP key
* Mount hardware on racks and remove obsolete hardware
+
* Fix the backup schedules to something sensible
* Send in RMA athena disk
+
 
* Fixed mail_forward script
+
==Mail System==
* [[MHF grant]] (October 6 deadline)
+
See [[Mail Improvements]]
** submit progress report on previous funds
+
 
** ask for splunk, projector, money for constructing overhead power drop, money for [[general lab improvements]]
+
==Automatic group creation/management==
** People to touch base with for 'context' section: Elizabeth Allen (alumni); Ruthanne Bevier (imss security); Michael Vanier (cs); Chris Gonzales (ascit); Michael Woods (ihc); Wenyee Lo (imss houserep program); Marissa Cevallos (the tech); Craig Montuori (donut)
+
See [[ugcs groups]]
* memory rebate (Liz)
+
 
* purchase video cards (Liz)
+
==Large file hosting==
* Perhaps a more sane way to access webmail, i.e. webmail.ugcs?
+
Almost done!
* Splunk configuration, apache logs
+
See [[NFS servers]]
* reimbursement for purchases (Liz)
+
Server is running and exporting things correctly.  All we need now is disk quotas.
* Job posting scripts
+
 
* Restore Athena to operation
+
==Account creator / password reset==
* Set up Hera with backup KDC and LDAP
+
* Re-work as necessary to ensure robustness
* Migrate nfs to hestia
+
* Add exception reporting system (email to sysadmins)
* set up postgres on poseidon, set up databases for each user
+
* Write full test suite to ensure quality
* mrsh replacement (mssh <classname> <command> <arg1> ...)
+
* Fix bug where if you mis-enter your krb pw it half-creates the account anyway and is a pain to straighten out.
 +
 
 +
==Network==
 +
* Write a system that shows us mac/ip/port number
 +
* Add port mirroring to charon for desirable traffic
 +
* Improve firewalls
 +
* Enable switch port security
 +
* Fix switch names
 +
 
 +
==Hardware==
 +
* Set up hestia to take over for dionysus - in progress
 +
* network card flip: put one of the single gigabit cards into charon, and move its two-port gigabit cards into poseidon and persephone
 +
 
 +
==Web hosting==
 +
* Add a failover web server
 +
 
 +
==Global login records==
 +
We need to implement some stuff with ldap so we have global login records
 +
 
 +
==Documentation==
 +
* We need a printed-out copy of critical wiki stuff
 +
* We need to make more documentation about our services for disaster recovery.
 +
* We need to update all of the core server pages with correct disk setups and currently running services.

Latest revision as of 14:37, 16 September 2011

This page aims to list current improvements we would like to make to the cluster. Ask jdhutchin if you have any questions about them.

Good starter Projects

Fix splunk

  • The upgrade to 4.x broke it.
  • Requires setting up access to charon (useful for other stuff too)
  • Allows our log alerting, etc to get set up again.

Squeeze Upgrade

The following computers need to be upgraded to squeeze:

  • Hermes (complicated, postfix needs to be rebuilt with a small patch)
  • Hera (not too bad, dns CNAMES need to be changed ahead of time)
  • Charon, enlil, kabta

Fix backups

  • The drives on persephone are too small, so we run out of space
  • This is urgent as we can't currently run a full backup cycle
  • We might want to run bacula "Base" backups to save space and time (see the bacula documentation)
  • Also, tape backups need to be run more often than never

Migrate to postgres 8.4

  • Not too bad
  • User notification required
  • Test mediawiki with it.

Upgrade ugcs_libs

  • The package is mostly built, just needs some testing
  • Needs to be deployed to get rid of deprecation warnings

Audit mailing lists

  • People have signed up random accounts on them and are spying on our mail
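
One way to start the audit is to compare each list's subscribers against a roster of addresses we recognize. This is only a sketch: it assumes both sides have already been exported to plain lists of addresses, and the function name and sample addresses are made up for illustration.

```python
def find_unknown_subscribers(subscribers, known_addresses):
    """Return subscriber addresses that are not on the known roster.

    Comparison is case-insensitive and ignores surrounding whitespace.
    Blank entries are skipped. The result is sorted and de-duplicated.
    """
    known = {addr.strip().lower() for addr in known_addresses}
    seen = {s.strip().lower() for s in subscribers if s.strip()}
    return sorted(seen - known)


if __name__ == "__main__":
    # Hypothetical sample data, not real subscribers.
    roster = ["alice@ugcs.caltech.edu", "bob@ugcs.caltech.edu"]
    members = ["Alice@ugcs.caltech.edu", "stranger@example.com", ""]
    print(find_unknown_subscribers(members, roster))
```

Anything this flags still needs a human look before removal, since legitimate subscribers may use outside addresses.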

Write auto-scanner for malware

  • We need to look at our web serving and auto-detect when we are serving spam off of it.
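
A very rough first pass at such a scanner could walk the web root and flag files whose contents match known-bad patterns. The sketch below is an assumption-heavy starting point: the pattern list is a placeholder, the size cap is arbitrary, and the web root path would have to be supplied by whoever runs it.

```python
import os
import re

# Placeholder patterns for illustration only; a real list would be built
# from incidents we have actually seen.
SUSPICIOUS = re.compile(rb"(viagra|casino|eval\(base64_decode)", re.IGNORECASE)


def scan_tree(root, max_bytes=1_000_000):
    """Return a sorted list of files under root whose first max_bytes
    bytes match one of the suspicious patterns."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    data = fh.read(max_bytes)
            except OSError:
                # Unreadable files are skipped rather than treated as hits.
                continue
            if SUSPICIOUS.search(data):
                hits.append(path)
    return sorted(hits)
```

Run from cron, the returned list could be mailed to the sysadmins. Keyword matching will produce false positives, so this should report rather than delete.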

Mediawiki upgrader

  • The upgrade to 1.16 for users may break, esp if they tried to install >1 mediawiki on their account
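
Before running an upgrader it would help to know which accounts have more than one install. A minimal sketch, assuming each account has its own top-level directory under some web root and that an install can be recognized by its LocalSettings.php file (the directory layout and function names here are assumptions):

```python
import os
from collections import defaultdict


def find_mediawiki_installs(web_root):
    """Map each top-level account directory name to the list of
    directories beneath it that contain a LocalSettings.php file."""
    installs = defaultdict(list)
    for account in sorted(os.listdir(web_root)):
        account_dir = os.path.join(web_root, account)
        if not os.path.isdir(account_dir):
            continue
        for dirpath, _dirnames, filenames in os.walk(account_dir):
            if "LocalSettings.php" in filenames:
                installs[account].append(dirpath)
    return dict(installs)


def accounts_with_multiple_installs(web_root):
    """Return the sorted account names that have more than one install."""
    found = find_mediawiki_installs(web_root)
    return sorted(name for name, dirs in found.items() if len(dirs) > 1)
```

Accounts on the second list are the ones most likely to need manual attention during the upgrade.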

More website auto-setup

  • Write more utilities like setup-mediawiki for django, etc so people can easily set up their own web stuff under UGCS.

Add news / tip of the day

Even better, write your own nice little utilities and then let people know about them.

Upgrade the juniper switch

There is a new version of JOS out that we should upgrade to.

Move Kabta

Kabta currently sees a lot of intermittent packet loss in its current location.

Autofixers

Set up nagios so it more aggressively auto-restarts stuff when it is down.
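
Nagios event handlers are the usual hook for this: the handler command can be passed the $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ macros as arguments. The sketch below shows one possible restart policy. The threshold of 2 soft failures is an assumption chosen to be more aggressive than waiting for a hard state, and the restart command is left to the caller rather than guessed.

```python
import subprocess


def should_restart(state, state_type, attempt, soft_threshold=2):
    """Decide whether to restart a service.

    state:      value of $SERVICESTATE$ (OK, WARNING, UNKNOWN, CRITICAL)
    state_type: value of $SERVICESTATETYPE$ (SOFT or HARD)
    attempt:    value of $SERVICEATTEMPT$ as an int
    """
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    return state_type == "SOFT" and attempt >= soft_threshold


def handle_event(state, state_type, attempt, restart_cmd, runner=subprocess.call):
    """Run restart_cmd (an argument list) when the policy says so.

    Returns True if a restart was attempted, False otherwise. The runner
    parameter exists so the policy can be tested without touching a service.
    """
    if not should_restart(state, state_type, attempt):
        return False
    runner(list(restart_cmd))
    return True
```

Lowering soft_threshold to 1 would restart on the first failed check, which is probably too eager for services that flap.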

Maintenance

These are things that we have to do even if there aren't full-time student sysadmins.

  • Account requests and password resets SLA: 1 day
    • How do we know: We get emails
  • Fix it when it breaks: Server down
    • SLA: 1hr
    • How do we know: Email alerts for most things, sms to jdhutchin's phone for really urgent things.
    • Owner: jdhutchin
  • Fix minor support requests for things that are broken: SLA: 5 days
    • Sooner would be better
  • Answer user questions: SLA: Best-effort
    • It would be nice if we could do this but it isn't a top priority
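
If we ever want to check these targets automatically, the SLAs above can be written down as data. A small sketch follows. The category names are invented for illustration, and best-effort is modelled as having no deadline.

```python
from datetime import datetime, timedelta

# Category keys are hypothetical labels; the durations come from the list above.
SLAS = {
    "account_request": timedelta(days=1),
    "server_down": timedelta(hours=1),
    "minor_support": timedelta(days=5),
    "user_question": None,  # best-effort: never counted as overdue
}


def is_overdue(category, opened_at, now):
    """Return True if a request opened at opened_at has exceeded its SLA."""
    limit = SLAS[category]
    if limit is None:
        return False
    return now - opened_at > limit


if __name__ == "__main__":
    opened = datetime(2011, 9, 16, 12, 0)
    later = datetime(2011, 9, 16, 14, 0)
    print(is_overdue("server_down", opened, later))
    print(is_overdue("account_request", opened, later))
```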


Software

Fix mex (matlab compiler)

Add support for distributed Mathematica on mortals

Small fixes

Small things that need to be fixed across various services/machines:

  • Email heartbeat
  • Hestia SSL cert
  • Change kabta back to ssh keys after Alex/Raymond add theirs
  • Find the sysadmins' PGP key
  • Fix the backup schedules to something sensible

Mail System

See Mail Improvements

Automatic group creation/management

See ugcs groups

Large file hosting

Almost done! See NFS servers. The server is running and exporting things correctly. All we need now is disk quotas.

Account creator / password reset

  • Re-work as necessary to ensure robustness
  • Add exception reporting system (email to sysadmins)
  • Write full test suite to ensure quality
  • Fix bug where if you mis-enter your krb pw it half-creates the account anyway and is a pain to straighten out.
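
The half-created account bug suggests making creation all-or-nothing: run each step in order, and if one fails, undo the steps that already succeeded. The sketch below shows that pattern only. The step names and the do/undo callables are placeholders, not the actual account scripts.

```python
class AccountCreationError(Exception):
    """Raised when account creation fails and has been rolled back."""


def create_account(username, steps):
    """Run creation steps in order with rollback on failure.

    steps is a list of (name, do, undo) tuples, where do and undo are
    callables taking the username. If any do() raises, the undo() of
    every previously completed step runs in reverse order, and an
    AccountCreationError is raised. Returns the list of completed step
    names on success.
    """
    completed = []
    for name, do, undo in steps:
        try:
            do(username)
        except Exception as exc:
            for _done_name, done_undo in reversed(completed):
                done_undo(username)
            raise AccountCreationError(
                f"step {name!r} failed for {username}: {exc}"
            ) from exc
        completed.append((name, undo))
    return [name for name, _undo in completed]
```

The raised exception is also a natural place to hook in the email report to sysadmins. Checking the Kerberos password before any step runs would avoid most rollbacks in the first place.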

Network

  • Write a system that shows us mac/ip/port number
  • Add port mirroring to charon for desirable traffic
  • Improve firewalls
  • Enable switch port security
  • Fix switch names
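
For the mac/ip/port item, the core of such a system is a join between an ARP table (IP to MAC) and a switch forwarding table (MAC to port). The sketch below assumes both tables have already been collected into dictionaries by some other means. How to pull them from charon or the switches is not covered, and the function names are made up.

```python
import ipaddress


def normalize_mac(mac):
    """Lower-case a MAC address and use ':' as the separator."""
    return mac.strip().lower().replace("-", ":")


def build_port_map(arp_table, fdb_table):
    """Join ip -> mac with mac -> port.

    Returns a list of (ip, mac, port) tuples sorted numerically by IP.
    port is None when the MAC is not in the forwarding table.
    """
    fdb = {normalize_mac(mac): port for mac, port in fdb_table.items()}
    rows = []
    for ip, mac in arp_table.items():
        mac = normalize_mac(mac)
        rows.append((ip, mac, fdb.get(mac)))
    rows.sort(key=lambda row: ipaddress.ip_address(row[0]))
    return rows
```

Rows with a port of None are worth a look on their own, since they are hosts we can see at layer 3 but cannot place on a switch.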

Hardware

  • Set up hestia to take over for dionysus - in progress
  • network card flip: put one of the single gigabit cards into charon, and move its two-port gigabit cards into poseidon and persephone

Web hosting

  • Add a failover web server

Global login records

We need to implement some stuff with ldap so we have global login records

Documentation

  • We need a printed-out copy of critical wiki stuff
  • We need to make more documentation about our services for disaster recovery.
  • We need to update all of the core server pages with correct disk setups and currently running services.