I currently have 1 master server (a) and 3 replication servers (b, c, d) that replicate directly from the master. The archive_command I use is the following script: https://gist.github.com/Geesu/8640616
All servers run Ubuntu 12.04.4 with PostgreSQL 9.2.6.
And here is the recovery.conf for each of the replication servers: https://gist.github.com/Geesu/8640635
What's strange is that about 6 hours after I started the replication servers, 2 of them suddenly fell behind. They are now stuck trying to catch up, but they keep falling further behind. Here is how far behind they are compared to the master:
a 20287.825072
b 2.521136
c 19994.51653

Does anyone have any ideas as to why one of the servers is nearly caught up completely, but the others keep falling behind? I have verified that a and c are processing the WAL segments; they are just not able to do it fast enough.
Some log examples from a and c:
cp: cannot stat `/var/lib/postgresql/9.2/archive/000000080000109E0000009A': No such file or directory
2014-01-26 23:02:14 GMT LOG: record with zero length at 109E/9AE622D8
cp: cannot stat `/var/lib/postgresql/9.2/archive/000000080000109E0000009A': No such file or directory
2014-01-26 23:02:14 GMT LOG: streaming replication successfully connected to primary
2014-01-26 23:03:36 GMT FATAL: could not receive data from WAL stream: SSL error: sslv3 alert unexpected message
cp: cannot stat `/var/lib/postgresql/9.2/archive/000000080000109E000000B9': No such file or directory
2014-01-26 23:03:41 GMT LOG: record with zero length at 109E/B9E797E0
cp: cannot stat `/var/lib/postgresql/9.2/archive/000000080000109E000000B9': No such file or directory
2014-01-26 23:03:41 GMT LOG: streaming replication successfully connected to primary

Maybe this is related? Eventually the standby does get the appropriate WAL segment and processes it.
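To check how the missing files relate to the "record with zero length" lines, the WAL location in the log can be mapped to a segment file name. This is only a minimal sketch in Python, assuming the default 16 MB segment size and the 9.2 file naming scheme (timeline, log ID, and segment number, each written as 8 hex digits). The function name is my own and not part of PostgreSQL.

```python
# Assumption: default WAL segment size of 16 MB (not changed at build time).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def wal_segment_name(timeline, location):
    """Return the WAL segment file name containing a location like '109E/9AE622D8'.

    Hypothetical helper for illustration; assumes PostgreSQL 9.2 naming.
    """
    log_id_hex, offset_hex = location.split("/")
    log_id = int(log_id_hex, 16)
    offset = int(offset_hex, 16)
    # The segment number is the byte offset within the log divided by the segment size.
    segment = offset // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (timeline, log_id, segment)


if __name__ == "__main__":
    # Timeline 8 is taken from the first 8 hex digits of the file names in the log.
    for loc in ("109E/9AE622D8", "109E/B9E797E0"):
        print(loc, "->", wal_segment_name(8, loc))
```

For the two locations in the log above, this gives the same file names that cp reports as missing, so the file it cannot find appears to be the segment containing the location where the zero-length record was found.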
Any suggestions on how I can further debug this?