SISPI Database Guide¶
The SISPI database uses PostgreSQL as the underlying database server. Instructions on how to install and initially configure the database can be found in the DES DocDB 4029. The notes below describe the SISPI-specific configuration of the database. They assume you are familiar with performing admin tasks on a PostgreSQL database.
Table of Contents¶
- Database Configuration
- Database Replication
- Database Web Viewers
- Database Monitoring
- Database Backups
Database Configuration¶
The database software is installed as an eups product in the SISPI products tree, /usr/remote/software/products. We are currently using version 9.0.13. The master database is currently running at port 5442 on server3 (referred to as the master); the database directory on server3 is /raid3/database/sispi_db. The CTIO mirror is currently running at port 5442 on server2 (referred to as the standby); the database directory on server2 is /raid2/database/sispi_db. The notes below reflect the current configuration. The database runs as user postgres. It is started at boot time by a System V style start script in /etc/init.d/postgresql. The database directories are owned by the postgres account.
The following configuration options in postgresql.conf are set differently from the defaults:
# CONNECTIONS AND AUTHENTICATION
listen_addresses = '*'
max_connections = 500

# ERROR REPORTING AND LOGGING
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m.log'
log_truncate_on_rotation = on
log_rotation_age = 43200
log_line_prefix = '%t %h %d %u'

# CLIENT CONNECTION DEFAULTS
search_path = 'constants,telemetry,alarms,users,exposure,bss,obstac,"$user",public'
timezone = 'UTC'
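You can confirm that a running server has picked up these values by querying them directly; a minimal check, assuming the port and dbadmin account described above:

# Ask the running server for the effective value of a setting
psql -p 5442 -U dbadmin -d postgres -c "SHOW max_connections;"
psql -p 5442 -U dbadmin -d postgres -c "SHOW search_path;"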
pg_hba.conf:

# All SISPI machines in subnet 139.229.14 (addresses < 128)
host    all         all            139.229.14.0/25    md5
# Access from the raspberrypi/anemometer
host    decam_prd   decam_writer   139.229.13.55/32   md5
# Access from the LN2 accelerometer computer
host    decam_prd   decam_writer   139.229.14.95/32   md5
# Access for Marco Bonati
host    decam_prd   decam_reader   139.229.3.77/32    md5
If you add a new entry to the pg_hba.conf file on server3 then also add the same entry to the file on server2 but leave it commented out. This simplifies the failover process.
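Note that pg_hba.conf changes only take effect after the server re-reads its configuration. A sketch of how to trigger this, assuming the data directory on server3 given above:

# As user postgres: signal the server to re-read pg_hba.conf without a restart
pg_ctl reload -D /raid3/database/sispi_db
# or from a superuser psql session:
psql -p 5442 -U dbadmin -d postgres -c "SELECT pg_reload_conf();"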
Database Replication¶
We are using PostgreSQL 9 streaming replication to make a standby copy of the database. We are also streaming a copy to Fermilab to des61 (referred to as the mirror). Here are the notes on setting up the replication and doing the failover. The master is running on server3 and the standby on server2. Note that the configuration of each database must be the same, e.g., if the primary has max_connections = 500 then the standby must also. These notes apply to v9 of PostgreSQL; they need to be modified slightly for the latest version of the product.
Configuration of the master¶
- In postgresql.conf on the master at CTIO
wal_level = hot_standby
max_wal_senders = 4
wal_keep_segments = 400
archive_mode = on       # allows archiving to be done
                        # (change requires restart)
archive_command = 'test ! -f /raid3/database/wal_archive/%f && cp %p /raid3/database/wal_archive/%f'
                        # command to use to archive a logfile segment
We are also saving the WAL segments in an archive directory, /raid3/database/wal_archive, which allows the mirror to be re-synched without copying the entire database. If a new master is created by triggering the failover (see below) then these lines need to be activated in the configuration for the new master (they are in the CTIO standby configuration but are commented out).
- In pg_hba.conf on the master
# Replication for the standby server2.ctio.noao.edu
host replication dbadmin 139.229.14.37/32 md5
# Replication for the standby server at FNAL
host replication dbadmin 131.225.58.50/32 md5
# Access from des61.fnal.gov
host decam_prd dbadmin 131.225.58.50/32 md5
Setting up the initial standby¶
- On the master (server3) make a base backup by copying the master database data directory to the standby.
cd /raid3/database/sispi_db
psql -c "SELECT pg_start_backup('label', true)" -U dbadmin postgres
rsync -a * server2:/raid2/database/sispi_db/ --exclude postmaster.pid --exclude pg_log --exclude=postmaster.opt
psql -c "SELECT pg_stop_backup()" -U dbadmin postgres
- The standby will get a copy of the postgresql.conf from the master. Edit the following lines to change raid3 to raid2 and comment them out:

#archive_mode = on      # allows archiving to be done
                        # (change requires restart)
#archive_command = 'test ! -f /raid2/database/wal_archive/%f && cp %p /raid2/database/wal_archive/%f'
                        # command to use to archive a logfile segment
- The standby will also get a copy of the pg_hba.conf file. This should be edited to comment out all entries EXCEPT:
# "local" is for Unix domain socket connections only
local   all   all                    md5
# IPv4 local connections:
host    all   all   127.0.0.1/32     md5
# IPv6 local connections:
host    all   all   ::1/128          md5
# All SISPI machines in subnet 139.229.14 (addresses < 128)
host    all   all   139.229.14.0/25  md5
- Create a recovery.conf file in the data directory on the standby (/raid2/database/sispi_db).
# This parameter must be set to on.
standby_mode = 'on'
# Specifies a connection string which is used for the standby server to connect
# with the primary.
primary_conninfo = 'host=server3.ctio.noao.edu port=5442 user=dbadmin'
# Specifies a trigger file whose presence should cause streaming replication to
# end (i.e., failover).
trigger_file = '/raid2/database/failover_trigger'
- Start the standby server.
- You should see a wal receiver process on the standby and a wal sender process on the master.
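A quick way to check this from the shell (the exact process titles vary slightly between PostgreSQL releases):

# On the master: the wal sender serving the standby should be present
ps -ef | grep '[w]al sender'
# On the standby: the wal receiver streaming from the master should be present
ps -ef | grep '[w]al receiver'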
Setting up the initial FNAL mirror¶
- On the master (server3) make a tar file for transport to FNAL
cd /raid3/database/sispi_db
psql -c "SELECT pg_start_backup('label', true)" -U dbadmin postgres
tar -c -f ../sispi_db_dump.tar --exclude postmaster.pid* --exclude=pg_log --exclude=postmaster.opt *
psql -c "SELECT pg_stop_backup()" -U dbadmin postgres
This file will be very large so you should run gzip on it first.
- Copy the gzipped tar file to des61 using scp.
- As user postgres cd to /decam_database/tmp and unwind the file. Then cd to sispi_db and rsync
cd sispi_db
rsync -av * ../../sispi_db/.
- The mirror will get a copy of the postgresql.conf from the master. Edit the file to add the following in the # - Standby Servers - section:

hot_standby = on

This allows the FNAL database mirror to support read-only queries. The CTIO standby is not configured to do this.
- Create a recovery.conf file in the data directory on the mirror (/decam_database/sispi_db). This does not need the trigger file location.

# This parameter must be set to on.
standby_mode = 'on'
# Specifies a connection string which is used for the standby server to connect
# with the primary.
primary_conninfo = 'host=server3.ctio.noao.edu port=5442 user=dbadmin'
- Start the mirror.
- You should see a wal receiver process on the mirror and a wal sender process on the master.
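Because the mirror runs with hot_standby = on, replication can also be verified with read-only queries; a sketch, assuming local psql access on des61:

# Returns 't' while the mirror is replaying WAL from the master
psql -U dbadmin -d postgres -c "SELECT pg_is_in_recovery();"
# Shows how far the mirror has received WAL (function name as of 9.0)
psql -U dbadmin -d postgres -c "SELECT pg_last_xlog_receive_location();"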
Triggering failover¶
The current master is server3 and the current standby is server2. So we will assume that server3 fails and we need to fail over to server2, which will become the master. We assume that server3 will then become the standby.
- On server2 create the trigger file in the specified path by running the failover.sh script in /usr/remote/user/postgres. The standby will detect this file and switch out of replication mode. It will copy the recovery.conf file to recovery.done in the data directory.
- The pg_hba.conf file on the standby contains all of the necessary entries to be a master but they need to be uncommented, namely:
# Access from the raspberrypi/anemometer
host decam_prd decam_writer 139.229.13.55/32 md5
# Access from the LN2 accelerometer computer
host decam_prd decam_writer 139.229.14.95/32 md5
# Access for Marco Bonati
host decam_prd decam_reader 139.229.3.77/32 md5
- Activate the WAL archiving by uncommenting the necessary lines in the postgresql.conf file on the new master (see above).
- Remove the recovery.done file.
- You should now be able to start the new master server with pg_ctl start (a verification sketch follows this list).
- You will need to switch the various web viewers to point to the new server. To do this run the script switch_web_tools.sh in ~codemanager/bin (you will need to be user codemanager) to switch the links. You will also need to point the ECL at the new server. Go to the directory /usr/remote/software/www/wsgi/ECL/decam and do:

mv settings.py settings.py_server3
mv settings.py_server2 settings.py
- The Site package needs to be updated to use the new master.
- The alarmserver and the telemetryserver will need to be restarted so that they send data to the new master. See the ICS page for instructions on how to do this (the Site package must be updated first).
- Once the machine containing the old master has been restored you will need to start up a new CTIO standby. The instructions assume that you are starting with a clean disk and so you need to recreate the database completely.
- On server3 execute the command
initdb -D /raid3/database/sispi_db -U dbadmin -W
- Create a recovery.conf file in /raid3/database/sispi_db on server3
# This parameter must be set to on.
standby_mode = 'on'
# Specifies a connection string which is used for the standby server to connect
# with the primary.
primary_conninfo = 'host=server2.ctio.noao.edu port=5442 user=dbadmin'
# Specifies a trigger file whose presence should cause streaming replication to
# end (i.e., failover).
trigger_file = '/raid3/database/failover_trigger'
- On server2 make a base backup by copying the master database data directory to the standby.
cd /raid2/database/sispi_db
psql -c "SELECT pg_start_backup('label', true)" -U dbadmin postgres
rsync -a * server3:/raid3/database/sispi_db/ --exclude postmaster.pid --exclude pg_log
psql -c "SELECT pg_stop_backup()" -U dbadmin postgres
- Edit pg_hba.conf to comment out all lines except
# "local" is for Unix domain socket connections only
local   all   all                    md5
# IPv4 local connections:
host    all   all   127.0.0.1/32     md5
# IPv6 local connections:
host    all   all   ::1/128          md5
# All SISPI machines in subnet 139.229.14 (addresses < 128)
host    all   all   139.229.14.0/25  md5
- On the standby change the address for replication from server3 (139.229.14.36) to server2 (139.229.14.37). This prepares the file for the next failover.
# Replication for the standby server2.ctio.noao.edu
#host replication dbadmin 139.229.14.37/32 md5
- On the master edit the pg_hba.conf file to uncomment the lines
# Replication for the standby server3.ctio.noao.edu
host replication dbadmin 139.229.14.36/32 md5
# Replication for the standby server at FNAL
host replication dbadmin 131.225.58.50/32 md5
# Access from des20.fnal.gov
host decam_prd dbadmin 131.225.58.50/32 md5
- Start the standby server.
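After a failover it is also worth confirming that the new master is really out of recovery and accepting writes; a minimal sanity check, assuming the port and roles described above:

# Should return 'f' once the trigger file has been processed
psql -p 5442 -U dbadmin -d postgres -c "SELECT pg_is_in_recovery();"
# Assigning a transaction id proves the server will accept writes
psql -p 5442 -U dbadmin -d postgres -c "SELECT txid_current();"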
Re-establishing the FNAL mirror after a failover event¶
If the FNAL mirror needs to be re-established because failover was triggered at CTIO then it must be done using a backup of the database, similar to the procedure described above, because the master has changed. The example assumes that the new master is server2.
- On the master (server2):

cd /raid2/database/sispi_db
psql -c "SELECT pg_start_backup('label', true)" -U dbadmin postgres
tar -c -f ../sispi_db_dump.tar --exclude postmaster.pid* --exclude=pg_log --exclude=pg_hba.conf* --exclude=postgresql.conf* --exclude=postmaster.opt *
psql -c "SELECT pg_stop_backup()" -U dbadmin postgres
This file will be very large so you should run gzip on it first.
- Copy the gzipped tar file to des61 using scp.
- As user postgres cd to /decam_database/tmp and unwind the file. Then cd to sispi_db and rsync
cd sispi_db
rsync -av * ../../sispi_db/.
- Edit the recovery.conf file in the data directory on the mirror to point to server2 (/decam_database/sispi_db).
# This parameter must be set to on.
standby_mode = 'on'
# Specifies a connection string which is used for the standby server to connect
# with the primary.
primary_conninfo = 'host=server2.ctio.noao.edu port=5442 user=dbadmin'
- Start the mirror.
Re-synching the standby or the FNAL mirror¶
The standby or the FNAL mirror on des61 will halt if a WAL segment file that it needs has already been removed from the master. This will be indicated by an error of the form
2014-07-14 15:52:45 CDT LOG:  streaming replication successfully connected to primary
2014-07-14 15:52:45 CDT FATAL:  could not receive data from WAL stream: FATAL:  requested WAL segment 000000030000005F000000A7 has already been removed
in the logfile in the pg_log directory. In order for replication to resume you need to fetch the missing segment files from the master and put them in the pg_xlog directory.
- Login to the master (server3) at CTIO and go to the /raid3/database/wal_archive directory.
- Make a tar file containing the segment files starting from the one that the mirror is complaining about. It should hopefully be quite small. For the FNAL mirror, copy it back to des20 and unwind it in the pg_xlog directory. For the standby, copy it to the standby and unwind it in the pg_xlog directory.
- It is not necessary to restart the server, it will automatically find the files. Recovery will commence and the error messages should disappear.
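As a sketch of the copy step, assuming the missing segment named in the example error above (WAL file names sort in replay order, so a simple string comparison selects everything from the missing segment onwards):

cd /raid3/database/wal_archive
# Collect the missing segment and everything after it into a tar file
ls | awk '$0 >= "000000030000005F000000A7"' | tar -cf /tmp/wal_segments.tar -T -
# Copy the tar file to the standby or mirror and unpack it there, e.g.:
#   tar -xf /tmp/wal_segments.tar -C <data_directory>/pg_xlog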
Database Web Viewers¶
Configuring Apache
The web space is in /usr/remote/software/www. The Apache server runs on system1. The following configuration options are set in httpd.conf:
# This needs to be on a local disk
LockFile /data_local/web_logs/accept.lock

# Pick up our version of mod_wsgi
LoadModule wsgi_module /usr/remote/software/products/mod_wsgi-3.3/lib/mod_wsgi.so
The server should be started/stopped/restarted by running:

/etc/init.d/httpd start/stop/restart

It should be started as codemanager. This should normally be taken care of when the machine boots. The script /etc/init.d/httpd has been modified from the default: it runs a script runApache.sh which lives in ~codemanager/bin, and it allows codemanager to restart Apache using the /etc/init.d/httpd script, which is usually restricted to root.
runApache.sh
#! /bin/bash

currDir=$PWD
user=`whoami`

# Location of apache binaries
apacheBin=/usr/sbin
APACHECTL=${apacheBin}/apachectl

# Where user stuff is
apacheHome=/usr/remote/software/www
confFile=${apacheHome}/conf/httpd.conf
myDocRoot=${apacheHome}/html
pythonLib=/usr/remote/software/dbwebtools/webpylib

##################################################
# This section simply sets up all products needed
# to construct the appropriate environment for
# apache.
##################################################
. /usr/remote/software/products/eups/bin/setups.sh
setup python
export PYTHONHOME=/usr/remote/software/products/python-2.7.3-Linux64
setup PIL
setup postgresql
setup psycopg2
##################################################

#############
# If you must add anything else to PYTHONPATH you're doing something wrong.
export PYTHONPATH=${pythonLib}/dbiapp_pylib:${pythonLib}:${myDocRoot}:${PYTHONPATH}
#echo PYTHONPATH = $PYTHONPATH
#############

SSL=    #-DSSL

############################################
# Now we can do what the user requested
############################################
ARGV="$@"
ERROR=0
case $ARGV in
start)
    ${APACHECTL} ${SSL} -d ${apacheHome} -f ${confFile} -k start
    ERROR=$?
    ;;
stop)
    ${APACHECTL} ${SSL} -d ${apacheHome} -f ${confFile} -k stop
    ERROR=$?
    ;;
restart)
    ${APACHECTL} ${SSL} -d ${apacheHome} -f ${confFile} -k restart
    ERROR=$?
    ;;
graceful)
    ${APACHECTL} ${SSL} -d ${apacheHome} -f ${confFile} -k graceful
    ERROR=$?
    ;;
*)
    echo " "
    echo "runApache.sh start|stop|restart|graceful"
    echo " "
    ERROR=1
esac

exit $ERROR
/etc/init.d/httpd
#!/bin/bash
#
# httpd        Startup script for the Apache HTTP Server
#
# chkconfig: - 85 15
# description: Apache is a World Wide Web server. It is used to serve \
#              HTML files and CGI.
# processname: httpd
# config: /etc/httpd/conf/httpd.conf
# config: /etc/sysconfig/httpd
# pidfile: /var/run/httpd.pid

# Source function library.
. /etc/rc.d/init.d/functions

if [ -f /etc/sysconfig/httpd ]; then
    . /etc/sysconfig/httpd
fi

# Start httpd in the C locale by default.
HTTPD_LANG=${HTTPD_LANG-"C"}

# This will prevent initlog from swallowing up a pass-phrase prompt if
# mod_ssl needs a pass-phrase from the user.
INITLOG_ARGS=""

# Set HTTPD=/usr/sbin/httpd.worker in /etc/sysconfig/httpd to use a server
# with the thread-based "worker" MPM; BE WARNED that some modules may not
# work correctly with a thread-based MPM; notably PHP will refuse to start.

# Path to the apachectl script, server binary, and short-form for messages.
apachectl=/usr/sbin/apachectl
httpd=${HTTPD-/usr/sbin/httpd}
prog=httpd
#pidfile=${PIDFILE-/var/run/httpd.pid}
pidfile=${PIDFILE-/usr/remote/software/www/run/httpd.pid}
lockfile=${LOCKFILE-/usr/remote/software/www/run/httpd}
RETVAL=0

su_codemanager() {
    user=`/usr/bin/whoami`
    if [ "${user}" = "codemanager" ]; then
        /bin/bash -c "${1}"
    else
        /bin/su codemanager --login --command="${1}"
    fi
    return $?
}

# check for 1.3 configuration
check13 () {
    CONFFILE=/usr/remote/software/www/conf/httpd.conf
    GONE="(ServerType|BindAddress|Port|AddModule|ClearModuleList|"
    GONE="${GONE}AgentLog|RefererLog|RefererIgnore|FancyIndexing|"
    GONE="${GONE}AccessConfig|ResourceConfig)"
    if LANG=C grep -Eiq "^[[:space:]]*($GONE)" $CONFFILE; then
        echo
        echo 1>&2 " Apache 1.3 configuration directives found"
        echo 1>&2 " please read /usr/share/doc/httpd-2.2.3/migration.html"
        failure "Apache 1.3 config directives test"
        echo
        exit 1
    fi
}

# The semantics of these two functions differ from the way apachectl does
# things -- attempting to start while running is a failure, and shutdown
# when not running is also a failure. So we just do it the way init scripts
# are expected to behave here.
start() {
    echo -n $"Starting $prog: "
    check13 || exit 1
    /usr/bin/logger "Starting $prog"
    #LANG=$HTTPD_LANG daemon --pidfile=${pidfile} $httpd $OPTIONS
    su_codemanager "/usr/remote/user/codemanager/bin/runApache.sh start"
    RETVAL=$?
    echo
    [ $RETVAL = 0 ] && touch ${lockfile}
    chown codemanager:sispi ${lockfile}
    return $RETVAL
}

# When stopping httpd a delay of >10 second is required before SIGKILLing the
# httpd parent; this gives enough time for the httpd parent to SIGKILL any
# errant children.
stop() {
    echo -n $"Stopping $prog: "
    #killproc -p ${pidfile} -d 10 $httpd
    /usr/remote/user/codemanager/bin/runApache.sh stop
    RETVAL=$?
    echo
    [ $RETVAL = 0 ] && rm -f ${lockfile}
}

reload() {
    echo -n $"Reloading $prog: "
    if ! LANG=$HTTPD_LANG $httpd $OPTIONS -t >&/dev/null; then
        RETVAL=$?
        echo $"not reloading due to configuration syntax error"
        failure $"not reloading $httpd due to configuration syntax error"
    else
        killproc -p ${pidfile} $httpd -HUP
        RETVAL=$?
    fi
    echo
}

# See how we were called.
case "$1" in
  start)
        start
        ;;
  stop)
        stop
        ;;
  status)
        status -p ${pidfile} $httpd
        RETVAL=$?
        ;;
  restart)
        stop
        start
        ;;
  condrestart)
        if [ -f ${pidfile} ] ; then
            stop
            start
        fi
        ;;
  reload)
        reload
        ;;
  graceful|help|configtest|fullstatus)
        $apachectl $@
        RETVAL=$?
        ;;
  *)
        echo $"Usage: $prog {start|stop|restart|condrestart|reload|status|fullstatus|graceful|help|configtest}"
        exit 1
esac

exit $RETVAL
The *.conf files for TV, ConstantsDB, QE, AV and ECL are in the conf.d directory. These applications all use wsgi.
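For reference, a conf.d entry for one of these applications generally follows the standard mod_wsgi pattern; the file name, URL path and script path below are illustrative only, not copied from the actual configuration:

# Hypothetical conf.d/des_tv.conf -- illustrative only
WSGIScriptAlias /TV /usr/remote/software/dbwebtools/tv/current/product/tv.wsgi
<Directory /usr/remote/software/dbwebtools/tv/current/product>
    Order allow,deny
    Allow from all
</Directory>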
Installing a mod_wsgi Web Viewer
Install robots.txt in html:
User-agent: *
Disallow: /
The application configurations are in html/cfg. There is one file for the primary database server and one for the standby. There is a link to the current version.
constants.cfg -> constants.cfg_server3
constants.cfg_server2
constants.cfg_server3
des_alarms.cfg -> des_alarms.cfg_server3
des_alarms.cfg_server2
des_alarms.cfg_server3
des_tv.cfg -> des_tv.cfg_server3
des_tv.cfg_server2
des_tv.cfg_server3
QE_decam.cfg -> QE_decam.cfg_server3
QE_decam.cfg_server2
In the event of the database failing over to the standby on server2 you will need to run the script switch_web_tools.sh in ~codemanager/bin (you will need to be user codemanager) to switch the links to the server2 version.
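In essence the script re-points each application cfg link at the server2 variant; a minimal sketch of the idea (the actual script may differ):

# Hypothetical sketch of switch_web_tools.sh -- not the actual script
cd /usr/remote/software/www/html/cfg
for cfg in constants des_alarms des_tv QE_decam; do
    ln -sf ${cfg}.cfg_server2 ${cfg}.cfg
done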
The applications are installed in /usr/remote/software/dbwebtools. The support applications ChartDirector, dbiapp_pylib and wsgi_py are in the webpylib subdirectory.
Install the latest version of dbiapp_pylib in /usr/remote/software/dbwebtools/webpylib. Make a link to the lib directory of the product in webpylib:
ln -s dbiapp_pylib_v1_4a/lib dbiapp_pylib
We use a package called ChartDirector to render the plots in the TV. Download it from here (we are using 5.0.2, 64 bit). Install the contents of the lib directory into a directory called ChartDirector. It was necessary to add a Python __init__.py file containing:
from pychartdir import *
ChartDirector also requires a license file (chartdir.lic) which should be installed in the ChartDirector directory.
Versions
ConstantsDB    1.5d.wsgi
TV             3.3a
AV             1.1
dbiapp_pylib   v1_4a
wsgi_py        v1_5a
ChartDirector  5.0.2 (64 bit version)
The current version of each product has a link called current. In the www/html directory there is a link to this current version:

AV -> /usr/remote/software/dbwebtools/des_alarms/current/product
ConstantsDB -> /usr/remote/software/dbwebtools/ConstantsDB/current
TV -> /usr/remote/software/dbwebtools/tv/current/product
QE/DECam -> /usr/remote/software/dbwebtools/QEngine/current
The tar files for the various products are located at the Files tab.
Exposure Browser¶
The Exposure Browser is based on the Query Engine product. The Query Engine should be installed using the instructions for installing a mod_wsgi viewer. You then need to install the EXPO.tar package into the www/html directory as it contains the actual html pages. These pages use the QueryEngine to process the html forms.
Database Monitoring¶
On server2 and server3:

#
# Every hour, check if database is running. If not send mail to postgres-status alias.
1 * * * * /usr/remote/user/postgres/bin/db_status.sh

alias postgres-status buckley@fnal.gov elliotta@mps.ohio-state.edu dgomez@ctio.noao.edu
On des20, check to see if the mirror is running:

2 * * * * /var/lib/pgsql/bin/db_status.sh

which sends mail to:

alias decam-mirror buckley@fnal.gov neilsen@fnal.gov yanny@fnal.gov
The scripts also check that the standby at CTIO and the mirror at FNAL are updating. They do this by looking for the wal_receiver process. Occasional errors from this check are okay but persistent errors indicate a problem and should not be ignored.
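The db_status.sh script itself is not reproduced here; as a rough sketch of what such a check does under the assumptions above (the real script lives in /usr/remote/user/postgres/bin and may differ):

#!/bin/bash
# Hypothetical sketch of the hourly check -- not the actual db_status.sh
# Is the database answering queries at all?
if ! psql -p 5442 -U dbadmin -d postgres -c "SELECT 1;" > /dev/null 2>&1; then
    echo "SISPI database on $(hostname) is not responding" | mail -s "SISPI DB down" postgres-status
fi
# On the standby/mirror: a missing wal receiver means replication has stalled
if ! ps -ef | grep -q '[w]al receiver'; then
    echo "No wal receiver process on $(hostname)" | mail -s "SISPI replication check" postgres-status
fi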
Database Backups¶
The SISPI database is backed up on a weekly basis. The eups package pgTools contains the script used for the backup as well as a set of utilities that are used by the script. The backup currently runs via cron every Sunday morning at 11:02 am local CTIO time. On completion it sends email to the alias postgres-status defined above.
#
# Run weekly database backups at 2 minutes past 11 am on Sunday
#
2 11 * * Sun source /usr/remote/software/products/eups/bin/setups.sh; setup pgTools; run_backup.sh

The backup has the following features:
- We use pg_dump with the -F c option to reduce the size of the dump files - this is the PostgreSQL custom format; these dumps are binary and must be restored using pg_restore (see the example after this list).
- The backup directory is /raid3/database/db_backup/server3. Each backup has its own directory.
- The script dumps out the current size of the database tables.
- Large tables are backed up into individual dump files. The tables in this list currently are:
photo_diodes
focal_plane_temperature_cntrl
image_health
facility_utilities
ion_pump
ln2_process
environmental_data
cage_temperatures
telescope_data
donut_fn1  donut_fn2  donut_fn3  donut_fn4
donut_fs1  donut_fs2  donut_fs3  donut_fs4
cm_telemetry
vacuum
rasicam_data
ccd_readoutnoise
ln2_pump_accelerometer
- The monsoon telemetry tables are put into another dump file, as are all the rest of the small telemetry tables.
- If new tables are added to the database then they need to be added to the script, either as their own dump file or as part of the small telemetry tables.
- The other schemas are fairly small and are each put into their own dump file (obstac, constants, alarms, exposure).
- For the desbrdev (Quick Reduce) database, the job_runs, processes and ccd tables are dumped into their own dump files, along with another dump of the rest.
- The decamcrl (ECL) database has its own dump file.
- The user accounts and other administrative things are put into another dump file.
- A copy of the configuration files is made.
- Old dump files are cleaned up.
- The backup directory is rsynced to /raid2/database/db_backup on server2.
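Because the dumps are written in the custom format, they are restored with pg_restore rather than psql; a hedged example, with the dump file name invented for illustration:

# List the contents of a custom-format dump without restoring anything
pg_restore -l telemetry_large.dump
# Restore a single table into the production database
pg_restore -p 5442 -U dbadmin -d decam_prd -t telescope_data telemetry_large.dump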