On-Call Guide

This is a draft of an on-call guide for POMS; please feel free to ask questions, add updates, etc.

Overall

  • POMS processes / services currently run
    • on pomsgpvm01.fnal.gov
    • on pomsgpvm02.fnal.gov
    • under the poms account
    • managed by supervisor
  • most web usage is fronted by the system Apache, over which we have limited configuration control: we can edit files in /etc/httpd/conf.d and use sudo to restart Apache (see below).
  • supervisor config is in $HOME/supervisord.conf
  • POMS software (except supervisord) is currently installed in virtualenv areas under /home/poms/poms_versions; supervisor itself is installed via ups/upd in $HOME/products.
  • logs are under $HOME/private/logs/poms
  • config files are under $HOME/private/config/poms

Checklist

  • check main page https://pomsgpvm01.fnal.gov/poms/
  • check that "ssh poms@pomsgpvm01 bin/health_check" reports
    • free memory in 4 digits
    • idle CPU in 2 digits
    • Ok: recent job updates
    • Backtraces: hopefully zero, but a 3-digit count is probably OK.
  • check that "ssh poms@pomsgpvm02 bin/health_check" reports
    • free memory in 4 digits
    • idle CPU in 2 digits
    • Ok: recent job updates
    • Backtraces: hopefully zero, but a 3-digit count is probably OK.
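The two health checks in the list above can be run in one pass. This is only a sketch: the function name run_health_checks and the SSH_CMD override are hypothetical helpers introduced here for illustration (SSH_CMD exists so the loop can be dry-run with echo), and the output of bin/health_check itself is not parsed, only shown.

```shell
# Sketch: run the health check on both POMS hosts, one after the other.
# SSH_CMD is a hypothetical override (not part of POMS); it defaults to ssh.
SSH_CMD="${SSH_CMD:-ssh}"

run_health_checks() {
    for host in pomsgpvm01 pomsgpvm02; do
        echo "== $host =="
        # Same command as in the checklist; keep going if one host fails.
        $SSH_CMD "poms@$host" bin/health_check || echo "health_check failed on $host"
    done
}
```

Calling run_health_checks prints each host's report under its own header, so the numbers can be compared against the expectations listed above by eye.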

Backtrace reason:

sqlalchemy.exc.UnboundExecutionError: Could not locate a bind configured on SQL expression or this Session

This usually needs an application restart to clear (possibly a DB handle leak).
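To see how often this particular error is occurring, the log can be searched for it directly. The sketch below assumes the backtraces land in error.log under the POMS log directory (both names appear elsewhere in this guide, but whether health_check counts from the same file is an assumption); count_unbound_errors is a hypothetical helper name.

```shell
# Sketch: count log lines that mention the UnboundExecutionError backtrace.
# The default path is an assumption built from the log directory and the
# error.log file name used elsewhere in this guide.
count_unbound_errors() {
    logfile="${1:-$HOME/private/logs/poms/error.log}"
    # -F matches the string literally; -c prints the number of matching lines.
    # grep exits non-zero when the count is 0, so "|| true" keeps the exit status clean.
    grep -cF 'sqlalchemy.exc.UnboundExecutionError' "$logfile" || true
}
```

Run it as count_unbound_errors on the server, or pass another log file as the first argument.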

If things aren't running, or there are database session errors

  1. Make a Servicedesk ticket, or mark the existing one "Work in Progress".
  2. Log into the server: ssh -l poms pomsgpvm01.fnal.gov. If you can't get in, try to ping pomsgpvm01.fnal.gov, and in any case make a ticket to Scientific Server Support to get it restarted.
  3. Check memory and CPU usage with top: we ought to have about 1/6 of our memory free and double-digit idle CPU.
  4. Check whether services are running: cd $HOME; supervisorctl status. If one is not, start it with supervisorctl start service. NOTE: on pomsgpvm01 the poms_jobsub_q_scraper should not be running, and on pomsgpvm02 only poms_jobsub_q_scraper and poms_webservice should be running.
  5. Check for exceptions, etc.: cd $HOME/private/logs/poms ; grep ' line ' error.log
  6. Try to restart the webservice: cd $HOME; supervisorctl restart poms_webservice
    • If it doesn't start, try killall uwsgi and then supervisorctl start poms_webservice
  7. Try to restart Apache: sudo /etc/init.d/httpd restart
  8. If all else fails, cut a ticket to Scientific Server Support and ask them to reboot the VM.
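The per-host rule in step 4 is easy to get backwards, so here is a sketch that checks supervisorctl status output against it. The function names should_run and check_services are hypothetical, and the sketch assumes each status line starts with the plain service name followed by its state (RUNNING, STOPPED, ...); if the services are configured in supervisor groups, the names would appear as group:name and the comparison would need adjusting.

```shell
# should_run HOST SERVICE: succeed if SERVICE is expected to run on HOST,
# following the NOTE in step 4.
should_run() {
    case "$1" in
        pomsgpvm01*)
            # Everything except the jobsub queue scraper runs here.
            [ "$2" != "poms_jobsub_q_scraper" ] ;;
        pomsgpvm02*)
            # Only the scraper and the webservice run here.
            [ "$2" = "poms_jobsub_q_scraper" ] || [ "$2" = "poms_webservice" ] ;;
        *)
            return 1 ;;
    esac
}

# check_services HOST: read "supervisorctl status" output on stdin and print
# one line for each service whose state does not match the expectation.
check_services() {
    host="$1"
    while read -r name state _rest; do
        [ -n "$name" ] || continue
        if should_run "$host" "$name"; then
            [ "$state" = "RUNNING" ] || echo "$name: expected RUNNING, got $state"
        else
            [ "$state" != "RUNNING" ] || echo "$name: expected stopped, got RUNNING"
        fi
    done
}
```

On the server this could be used as cd $HOME; supervisorctl status | check_services "$(hostname)". No output means every listed service is in the expected state for that host.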