PostgreSQL replication

This page is under construction.


This page documents the setup and administration of streaming replication for the chimera PostgreSQL database cluster. It does not presently document (1) usage of the Replica by applications, or (2) general progress with or working notes for this project.


Prerequisites

  • Two hosts: one to host the Primary (master) installation of a database, and the other to host a Replica (hot standby). Note that the Replica host can serve more than one database cluster, as long as the clusters listen on separate ports.
  • An operational database cluster on the Primary host - this is to be replicated, e.g. chimera.

PostgreSQL 9.2

PostgreSQL 9.2 is required on the Replica host.

  • Install the current packages on SLF6. Note that specifying the rpm --nodeps option is one way to work around the unsatisfied sl-release package dependency.
    [root@gccensrv1 ~]# rpm --nodeps -ivh
    [root@gccensrv1 ~]# yum -y install postgresql92 postgresql92-devel postgresql92-libs postgresql92-server postgresql92-contrib
  • Configure the service:
    [root@gccensrv1 ~]# sed -i'-original' -e 's/postgres:postgres/enstore:enstore/g' -e 's/l postgres /l enstore /g' /etc/init.d/postgresql-9.2
  • Configure the data directory:
    [root@gccensrv1 ~]# echo "PGDATA=/db/database/chimera/psql-data" > /etc/sysconfig/pgsql/postgresql-9.2
  • Ensure the service remains stopped and disabled. It can be enabled later after the replication setup is otherwise complete.


Configuration

The configuration files noted below reside in the respective database cluster's data directory.

With the documented exception of the recovery file, the replication-related settings in postgresql.conf and pg_hba.conf are identical on the Primary and Replica. The backup and restore procedure provisions these files from the Primary to the Replica.


postgresql.conf

# Applies to both Primary and Replica:
bytea_output = 'escape'  # Replication could break with default value of 'hex', although this is unconfirmed.

# Applies to Primary. Ignored on Replica:
wal_level = hot_standby  # Superset of 'archive'.
max_wal_senders = 8  # Allow at least 2 per replica.
wal_keep_segments = 256  # Based on a rate of ~1 segment/min on stken; retains ~4 hrs of WAL, using 16M*256 = ~4G.

# Applies to Replica. Ignored on Primary:
hot_standby = on
max_standby_streaming_delay = 600s  # Running queries on replica that conflict with pending unapplied WAL are canceled after this wait.

# Unnecessary:
#archive_timeout = 600
#max_standby_archive_delay = 600s
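The wal_keep_segments value above follows from simple retention arithmetic. As a sketch (the ~1 segment/min rate is the stken observation quoted above; substitute your own rate):

```shell
# Sketch of the retention arithmetic behind wal_keep_segments = 256:
# each WAL segment is 16 MiB, and stken generates roughly one segment
# per minute, so 256 segments cover ~4 hours and ~4 GiB of disk.
SEGMENTS=256
SEGMENT_MIB=16
RATE_PER_MIN=1                          # segments generated per minute (observed)
KEEP_MIB=$((SEGMENTS * SEGMENT_MIB))    # 4096 MiB, i.e. ~4 GiB
KEEP_MIN=$((SEGMENTS / RATE_PER_MIN))   # 256 minutes, i.e. ~4.3 hours
echo "retains ${KEEP_MIB} MiB and ~${KEEP_MIN} minutes of WAL"
```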


pg_hba.conf

The pg_hba.conf file must contain entries, as below, for the IP addresses of both the Primary and Replica hosts. This permits swapping their roles without having to update this file; if only the Replica's address were present, and not the Primary's, a swap would require an update.

In column 2 below, the keyword "all" does not match replication connections, so "replication" requires its own entry.

With dmsen01 as Primary and gccensrv1 as Replica,

host    all,replication all      trust  # dmsen01
host    all,replication all     trust  # gccensrv1
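For reference, each entry has five columns: connection type, database, user, client address, and auth method. A fully populated sketch follows, using placeholder addresses from the documentation range (192.0.2.0/24) rather than the real host IPs:

```
# TYPE  DATABASE         USER  ADDRESS          METHOD
host    all,replication  all   192.0.2.10/32    trust   # dmsen01 (placeholder address)
host    all,replication  all   192.0.2.11/32    trust   # gccensrv1 (placeholder address)
```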


recovery.conf

The presence of a configured recovery.conf file is what fundamentally distinguishes a Replica from a Primary. If a Replica is promoted to become a Primary, its recovery.conf file is automatically renamed to recovery.done.

As such, the recovery.conf file must not exist on the Primary under this name. It may, however, exist on the Primary under another name, e.g. recovery.done.

The file's owner and permissions must match those of the other configuration files in the directory.

With dmsen01 as Primary,

standby_mode = 'on'
primary_conninfo = 'host=dmsen01 dbname=chimera user=enstore'
recovery_target_timeline = 'latest'
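For the role-swap scenario discussed above, the corresponding recovery.conf on dmsen01 (with gccensrv1 promoted to Primary) would differ only in the host value:

```
standby_mode = 'on'
primary_conninfo = 'host=gccensrv1 dbname=chimera user=enstore'
recovery_target_timeline = 'latest'
```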



Initialization

With dmsen01 as Primary and gccensrv1 as the intended Replica, initialize the Replica as follows:

  • Confirm that the postgresql-9.2 service is running on the Primary host and is stopped on the Replica host.
  • Prepare the database directory on the Replica host:
    [root@gccensrv1 ~]# source /etc/sysconfig/pgsql/postgresql-9.2  # Sets $PGDATA
    [root@gccensrv1 ~]# rm -rf $PGDATA
    [root@gccensrv1 ~]# ln -ns /srv1 /db  # As needed, assuming /srv1 is the actual storage and /db is root dir of $PGDATA
    [root@gccensrv1 ~]# mkdir -p $PGDATA
    [root@gccensrv1 ~]# chown enstore:enstore $PGDATA
    [root@gccensrv1 ~]# cd $(dirname $PGDATA)
  • Restore the most recent data backup of the Primary database cluster onto the Replica host, noting that the pg_xlog directory is excluded from the backup:
    [root@gccensrv1 chimera]# BKPFILE=$(ssh enstore@dmsen03 ls -t /srv3/enstore/backups/pnfs-backup/psql-data.*.tar.Z | head -1)
    [root@gccensrv1 chimera]# rsync -av enstore@dmsen03:${BKPFILE} .
    [root@gccensrv1 chimera]# tar xzvf $BKPFILE --preserve-permissions --same-owner
  • Provision the recovery.conf file in the psql-data directory on the Replica. The value of host in this file must be the Primary host.
  • Optimize postgresql.conf for the Replica host.
    [enstore@gccensrv1 ~]$ rm -rf ./pgtune
    [enstore@gccensrv1 ~]$ git clone  # If SSL error, retry with "export GIT_SSL_NO_VERIFY=true" 
    [enstore@gccensrv1 ~]$ source /etc/sysconfig/pgsql/postgresql-9.2  # Sets $PGDATA
    [enstore@gccensrv1 ~]$ PGCONF=${PGDATA}/postgresql.conf
    [enstore@gccensrv1 ~]$ MAXCON=$(grep ^max_connections $PGCONF | tail -1 | awk '{print $3}')
    [enstore@gccensrv1 ~]$ ./pgtune/src/ -c=${MAXCON} -f=0.5 > ${PGDATA}/postgresql.tuned.conf  # f=0.5 since it'll host 2 db clusters
    [enstore@gccensrv1 ~]$ echo "include 'postgresql.tuned.conf'" >> $PGCONF
    [enstore@gccensrv1 ~]$ echo "log_directory = 'pg_log'" >> $PGCONF  # can be something else otherwise
    [enstore@gccensrv1 ~]$ echo "archive_mode = off" >> $PGCONF
  • Copy the pg_xlog directory from the Primary to the Replica. Failing this, the intended Replica will not start.
    [root@gccensrv1 chimera]# cd $PGDATA
    [root@gccensrv1 psql-data]# rsync -av enstore@dmsen01:/srv1/pnfs/db/psql-data/pg_xlog .
  • Confirm that the oldest WAL file on the Primary is older than the newest WAL file on the intended Replica. Failing this assertion, the intended Replica cannot catch up with the Primary. This error state can happen if wal_keep_segments is too low, the backup is too old, or too much time passed between the backup and its restoration.
    [enstore@dmsen01 ~]$ source /etc/sysconfig/pgsql/postgresql-9.2
    [enstore@dmsen01 ~]$ ls -lhrt ${PGDATA}/pg_xlog/ | grep -v ^total | grep -v archive_status | head -1
    -rw------- 1 enstore enstore  16M Jul 29 02:12 00000001000000800000002A
    [enstore@gccensrv1 ~]$ source /etc/sysconfig/pgsql/postgresql-9.2
    [enstore@gccensrv1 ~]$ ls -lhrt ${PGDATA}/pg_xlog/ | grep -v ^total | grep -v archive_status | tail -1
    -rw------- 1 enstore enstore  16M Aug 20 12:15 000000010000008000000023
  • Start the postgresql service on the Replica host.
  • Check the startup log of the Replica database. Confirm per the log that the database is connected to the Primary and is read-only. The Replica is now ready to use.
    [enstore@gccensrv1 ~]$ ls -t ${PGDATA}/pg_log/* | head -1 | xargs cat
    2015-02-13 12:53:53.382 CSTLOG:  database system was shut down in recovery at 2015-02-13 12:53:52 CST
    2015-02-13 12:53:53.394 CSTLOG:  entering standby mode
    2015-02-13 12:53:53.400 CSTLOG:  consistent recovery state reached at 83/4800FEF0
    2015-02-13 12:53:53.400 CSTLOG:  redo starts at 83/4800FEF0
    2015-02-13 12:53:53.400 CSTLOG:  invalid record length at 83/4800FF88
    2015-02-13 12:53:53.402 CSTLOG:  database system is ready to accept read only connections
    2015-02-13 12:53:53.408 CSTLOG:  streaming replication successfully connected to primary
  • Testing the Replica to confirm that it is working is not strictly necessary, given that the SQL commands for monitoring it via the Primary are a sufficient substitute. Nevertheless, below is a simple test of changes mirrored from the Primary to the Replica.
    [enstore@dmsen01 ~]$ psql -qt chimera -c 'CREATE TABLE test1 (t time); INSERT INTO test1 VALUES (now()); SELECT * FROM test1;'
    [enstore@gccensrv1 ~]$ psql -qt chimera -c 'SELECT * FROM test1;'
    [enstore@dmsen01 ~]$ psql -qt chimera -c 'DROP TABLE test1;'
    [enstore@gccensrv1 ~]$ psql -qt chimera -c 'SELECT * FROM test1;'
    ERROR:  relation "test1" does not exist
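The WAL catch-up check above compares file modification times. As an illustrative sketch (not part of the original procedure), segment file names can also be compared directly, since they are fixed-width uppercase hex and order lexicographically; note, however, that the Primary recycles old segments under future names, so the mtime-based check remains the authoritative one. The segment names below are hypothetical:

```shell
# Illustrative helper: returns success if WAL segment name $1 <= $2.
# Names are fixed-width (24 hex digits), so lexicographic comparison
# orders them correctly; sort picks the smaller of the two.
wal_segment_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort | head -n 1)" = "$1" ]
}

# Hypothetical values; substitute the actual ls output from each host.
PRIMARY_OLDEST=000000010000008000000023
REPLICA_NEWEST=00000001000000800000002A
if wal_segment_le "$PRIMARY_OLDEST" "$REPLICA_NEWEST"; then
  echo "Replica can catch up"
else
  echo "Replica cannot catch up; reinitialize it"
fi
```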


Reinitialization

If the Replica loses its connection with the Primary for an extended period, insufficient WAL files may exist on the Primary for the Replica to catch up when it later reconnects. In this case, the Replica will have to be reinitialized.




Monitoring

On the Primary

$ ps aux | grep '[p]ostgres: wal'
enstore  27047  0.0  0.0 2287648 1316 ?        Ss   Feb09   0:00 postgres: wal writer process                                       
enstore  29291  0.0  0.0 2288628 2824 ?        Ss   12:53   0:00 postgres: wal sender process enstore streaming 83/48010D48

The presence of the sender process indicates a connection from the noted Replica.


=# \x
Expanded display is on.

=# select pg_is_in_recovery(), pg_last_xlog_receive_location(), pg_last_xlog_replay_location(), pg_last_xact_replay_timestamp();
-[ RECORD 1 ]-----------------+--
pg_is_in_recovery             | f
pg_last_xlog_receive_location |
pg_last_xlog_replay_location  |
pg_last_xact_replay_timestamp | 

=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 29291
usesysid         | 10
usename          | enstore
application_name | walreceiver
client_addr      |
client_hostname  | 
client_port      | 41859
backend_start    | 2015-02-13 12:53:53.421209-06
state            | streaming
sent_location    | 83/48010DE0
write_location   | 83/48010DE0
flush_location   | 83/48010DE0
replay_location  | 83/48010DE0
sync_priority    | 0
sync_state       | async

=# select txid_current(), txid_current_snapshot();
-[ RECORD 1 ]---------+-----------------
txid_current          | 6826558
txid_current_snapshot | 6826558:6826558:

=# select * from pg_stat_database_conflicts where datname = current_database();
-[ RECORD 1 ]----+--------
datid            | 16387
datname          | chimera
confl_tablespace | 0
confl_lock       | 0
confl_snapshot   | 0
confl_bufferpin  | 0
confl_deadlock   | 0



On the Replica

$ ps aux | grep '[p]ostgres: wal'
enstore  30859  0.0  0.0 710056  2976 ?        Ss   12:53   0:00 postgres: wal receiver process   streaming 83/48011070


=# \x
Expanded display is on.

=# select pg_is_in_recovery(), pg_last_xlog_receive_location(), pg_last_xlog_replay_location(), pg_last_xact_replay_timestamp(), pg_is_xlog_replay_paused();
-[ RECORD 1 ]-----------------+------------------------------
pg_is_in_recovery             | t
pg_last_xlog_receive_location | 83/48011070
pg_last_xlog_replay_location  | 83/48011070
pg_last_xact_replay_timestamp | 2015-02-13 14:55:25.006476-06
pg_is_xlog_replay_paused      | f

=# select txid_current_snapshot();
-[ RECORD 1 ]---------+-----------------
txid_current_snapshot | 6826559:6826559:
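A rough wall-clock view of replication delay is also available on the Replica; note that the result is only meaningful while the Primary is actively committing transactions:

```
=# select now() - pg_last_xact_replay_timestamp() as replay_delay;
```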


Promotion

If the Primary fails, the Replica can be promoted to become the new Primary per the steps below:

  1. Ensure that the failed Primary database is stopped and remains stopped.
  2. On the Replica host, with the Replica database running, use pg_ctl to promote the read-only Replica database to a read-write Primary database. Change to the respective data directory and run "pg_ctl promote -D .". For this action to succeed, it must be performed as the user running the database (enstore), not as root. PostgreSQL then automatically renames the recovery.conf file to recovery.done.
  3. Check the most recent log file for the database to confirm that it has been promoted successfully and that there are no errors of concern.
  4. As relevant, the original failed Primary can be initialized as the new Replica. For this, reference the Initialization section.

This section does not discuss (automatic) failover given the risk of a false-positive trigger.