WHO WANTS A SERVICE WITH ZERO DOWNTIME?
… EVERYBODY
IS IT THAT GOOD?
NOT JUST TECHNOLOGY. RISKS, PROCEDURES, PEOPLE
FROM 0 TO ~100: BUSINESS CONTINUITY WITH POSTGRESQL Gabriele Bartolini Head of Support @ 2ndQuadrant PgDay.IT 2017, Milan
2ndquadrant.com @_GBartolini_ #PGDayIT ABOUT MYSELF ▸ Open Source passionate and programmer since 1995 ▸ First time with Postgres in 1997, regular from ~2000 ▸ Lean and DevOps practitioner ▸ Co-Founder of ITPUG and PostgreSQL Europe ▸ Entrepreneur, with 2ndQuadrant since 2008 ▸ Co-Author of “PostgreSQL Administration Cookbook” ▸ Came up with the name “Barman”
2ndquadrant.com @_GBartolini_ #PGDayIT BUSINESS CONTINUITY ▸ Disaster Recovery ▸ High Availability ▸ Types of disaster/failures ▸ Availability = Uptime / (Uptime + Downtime)
2ndquadrant.com @_GBartolini_ #PGDayIT OBJECTIVES ▸ Recovery Point Objective (RPO) ▸ How much data can I afford to lose? ▸ Recovery Time Objective (RTO) ▸ How long will it take me to recover?
2ndquadrant.com @_GBartolini_ #PGDayIT SERVICE RELIABILITY ▸ Cost of downtime ▸ How many €/$/£/AUD/…? ▸ Risk management ▸ SLI, SLO and SLA
2ndquadrant.com @_GBartolini_ #PGDayIT SOME NOTES FOR THIS PRESENTATION ▸ PostgreSQL on Linux ▸ Servers can be either physical or virtual ▸ Storage must be redundant ▸ RAID is required ▸ VOLUME: redundant disk mounted on a system
LET’S START
0. ONE POSTGRES SERVER
2ndquadrant.com @_GBartolini_ #PGDayIT ARCHITECTURE Server name: hope
2ndquadrant.com @_GBartolini_ #PGDayIT RECAP ▸ Why is RPO = ∞? ▸ Why is RTO = n/a? ▸ “Hope is not a strategy” (cit. Google) ▸ More common than you’d expect
10. ONE POSTGRES SERVER + LOGICAL BACKUPS
2ndquadrant.com @_GBartolini_ #PGDayIT ARCHITECTURE Add systematic backups with pg_dump LOGICAL BACKUP LOGICAL BACKUP LOGICAL BACKUP … Day 0
 4AM Day -1 4AM Day -2 4AM
2ndquadrant.com @_GBartolini_ #PGDayIT RECAP ▸ How do you feel now? ▸ Still: RPO = ∞ and RTO = n/a. Why? ▸ A backup is valid only if you have tested it ▸ Unfortunately, this is very common
20. ONE POSTGRES SERVER + LOGICAL BACKUPS + LOGICAL RESTORES
2ndquadrant.com @_GBartolini_ #PGDayIT ARCHITECTURE Test your backups with pg_restore LOGICAL BACKUP Day 0
 4AM
2ndquadrant.com @_GBartolini_ #PGDayIT DEFINING SOME OBJECTIVES ▸ Measure time for pg_restore ▸ RPO = backup frequency ▸ RTO = maximum time of recovery ▸ Provision another server ▸ Configure another server (automated, right?) ▸ Time to restore the last backup (measure it)
HAVE WE REALLY THOUGHT ABOUT EVERYTHING?
TIME OF REACTION
2ndquadrant.com @_GBartolini_ #PGDayIT RECAP ▸ Can this architecture work for you? ▸ We need reliable monitoring ▸ From now on, we assume we have it in place! ▸ We need to reduce both RPO and RTO
HOW? POINT-IN-TIME-RECOVERY
2ndquadrant.com @_GBartolini_ #PGDayIT POSTGRESQL’S PITR ▸ Part of core (fully open source) ▸ Rebuild a cluster at a point in time ▸ From crash recovery to sync streamrep (physical/logical) ▸ RPO = 0 (zero data loss) ▸ Hot base backup, continuous WAL archiving, Recovery ▸ API
2ndquadrant.com @_GBartolini_ #PGDayIT BASIC CONCEPTS ▸ Continuous copy of WAL data (continuous archiving) ▸ Physical base backups ▸ Recovery: ▸ copy base backup to another location ▸ recovery mode (replay of WALs until target)
2ndquadrant.com @_GBartolini_ #PGDayIT BARMAN ▸ In this presentation: Barman 2.3 ▸ Open Source (GNU GPL 3) ▸ Written in Python ▸ Developed and maintained by 2ndQuadrant ▸ Available at www.pgbarman.org
40. ONE POSTGRES SERVER + ONE BARMAN SERVER
2ndquadrant.com @_GBartolini_ #PGDayIT ARCHITECTURE Continuous backup
2ndquadrant.com @_GBartolini_ #PGDayIT BASIC CONCEPTS ▸ Remote backup and recovery ▸ Multiple server management ▸ Backup catalogue and WAL archive ▸ Retention policies
2ndquadrant.com @_GBartolini_ #PGDayIT COPY METHOD ▸ PostgreSQL streaming ▸ Practical/Windows/Docker ▸ Rsync/SSH ▸ Incremental backup and recovery (via hard links) ▸ Parallel backup and recovery ▸ Network compression and bandwidth limitation
2ndquadrant.com @_GBartolini_ #PGDayIT WAL SHIPPING METHOD ▸ “archiving”, through “archive_command”: ▸ RPO ~ 16MB of WAL data, or ▸ “archive_timeout” ▸ “streaming”, through streaming replication: ▸ “pg_receivewal” or “pg_receivexlog” ▸ continuous stream, RPO ~ 0 ▸ PostgreSQL 9.2+ required
2ndquadrant.com @_GBartolini_ #PGDayIT EXAMPLE FROM POSTGRESQL.CONF archive_mode = on wal_level = logical max_wal_senders = 10 max_replication_slots = 10 archive_command = 'rsync -a %p barman@HOST:/var/lib/barman/ID/incoming'
2ndquadrant.com @_GBartolini_ #PGDayIT EXAMPLE FROM BARMAN.CONF [angus] description = “Angus Young database" ssh_command = ssh postgres@angus conninfo = user=barman-acdc dbname=postgres host=angus retention_policy = RECOVERY WINDOW OF 6 MONTHS copy_method = rsync reuse_backup = link parallel_jobs = 4 archiver = true streaming_archiver = true slot_name = barman_streaming_acdc
2ndquadrant.com @_GBartolini_ #PGDayIT RECAP ▸ How do you feel now? ▸ Still: RPO = ∞ and RTO = n/a. Why? ▸ A backup is valid only if you have tested it ▸ Barman reduces backup risks, does not exclude them ▸ Systematic tests (especially custom scripts) ▸ Business risk is very high
60. ONE POSTGRES SERVER + ONE BARMAN SERVER + ONE RECOVERY SERVER
2ndquadrant.com @_GBartolini_ #PGDayIT ARCHITECTURE Test your backups with barman recover
WHAT A WASTE!
TESTING OR BI? HAVE YOU EVER THOUGHT OF USING IT FOR
2ndquadrant.com @_GBartolini_ #PGDayIT HOOK SCRIPTS ▸ Barman has hook scripts: ▸ pre and post backup ▸ pre and post archiving ▸ with retry option (until the script returns SUCCESS)
2ndquadrant.com @_GBartolini_ #PGDayIT EXAMPLE OF RECOVERY SCRIPT ▸ Write a bash script that: ▸ connects to a remote server via SSH ▸ stops the PostgreSQL server ▸ issues a “barman recover” with target “immediate” ▸ starts the PostgreSQL ▸ Set it as post-backup script
2ndquadrant.com @_GBartolini_ #PGDayIT SOME FOOD FOR THOUGHT ▸ Outcomes: ▸ Systematically test your backup ▸ Measure your recovery time ▸ Identical server? This is a backup server ready to start ▸ You can use a different data centre ▸ Be creative, PostgreSQL gives you infinite freedom!
2ndquadrant.com @_GBartolini_ #PGDayIT RECAP ▸ RPO ~ 0 (your backups work, every time) ▸ RTO = Time of reaction + Recovery time ▸ Example: RPO ~0 and RTO < 1 day ▸ Acceptable or not acceptable? ▸ Entry level architecture for business continuity ▸ Priority now: improve RTO
HOW? REPLICATION
2ndquadrant.com @_GBartolini_ #PGDayIT POSTGRESQL’S REPLICATION ▸ Part of core (fully open source) ▸ One master, multiple standby servers ▸ Evolution of PITR ▸ Standby server is in continuous recovery mode ▸ Hot standby (read-only) ▸ Both streaming (9.0+) and file based pulling of WAL ▸ Cascading from a standby
2ndquadrant.com @_GBartolini_ #PGDayIT SYNCHRONOUS REPLICATION ▸ Fine control (from global down to transaction level) ▸ 2-safe replication ▸ COMMIT of a write transactions waits until written on both the master and a standby (or more from 9.6) ▸ More than a synchronous client is required ▸ Read consistency of a cluster ▸ RPO = 0 (zero data loss)
80. TWO POSTGRES SERVERS + ONE BARMAN SERVER + ONE RECOVERY SERVER
2ndquadrant.com @_GBartolini_ #PGDayIT ARCHITECTURE barman_restore_wal barman recover Symmetric Cluster master standbyANGUS MALCOLM
2ndquadrant.com @_GBartolini_ #PGDayIT EXCERPT FROM POSTGRESQL’S CONFIGURATION postgresql.conf: hot_standby = on recovery.conf: standby_mode = ‘on' # Streaming primary_conninfo = 'host=angus user=replica application_name=ha sslmode=require’ # Fallback via Barman restore_command = 'barman-wal-restore -U barman acdc angus %f %p'
2ndquadrant.com @_GBartolini_ #PGDayIT SWITCHOVER (PLANNED) ▸ Applications are paused (start of downtime) ▸ Shut down the master ▸ Allow the standby to catch up with the master ▸ Promote the standby ▸ Switch virtual IPs ▸ Resume applications (end of downtime) ▸ Reconfigure the former master as standby
2ndquadrant.com @_GBartolini_ #PGDayIT FAILOVER (UNPLANNED) ▸ The master is down (start of downtime) ▸ Promote the standby ▸ Change the virtual IP ▸ DEGRADED SYSTEM
2ndquadrant.com @_GBartolini_ #PGDayIT MANUAL SWITCHOVER AND FAILOVER ▸ Manual switchover != manual switchover procedure ▸ Manual switchover = manually triggered ▸ Automate the procedure!!! ▸ bash (good) ▸ Ansible (better) ▸ Enhance gradually
2ndquadrant.com @_GBartolini_ #PGDayIT RECAP ▸ RPO ~ 0 (your backups work, every time) ▸ RTO = Time of reaction + Time of promotion ▸ Criticality: manual intervention ▸ Reliable monitoring ▸ Trained people (practice & docs!)
2ndquadrant.com @_GBartolini_ #PGDayIT MANUAL FAILOVER VS AUTOMATED FAILOVER ▸ Risk management ▸ Split brain nightmare ▸ Automated is built on manual (test!) ▸ Your choice ▸ Very good solution for business continuity ▸ Uptime > 99.99% in a year
90. TWO POSTGRES SYNC SERVERS + ONE BARMAN SERVER + ONE RECOVERY SERVER
2ndquadrant.com @_GBartolini_ #PGDayIT ARCHITECTURE barman_restore_wal barman recover Potential synchronous Synchronous ZERO DATA LOSS
2ndquadrant.com @_GBartolini_ #PGDayIT SYNCHRONOUS REPLICATION ▸ Primary: Barman ▸ Zero data loss backup ▸ Primary: Standby ▸ Zero data loss cluster (reduce RTO) ▸ Just one configuration line in PostgreSQL ▸ synchronous_standby_names = '1 (ha, barman_receive_wal)'
~100. TWO POSTGRES SYNC SERVERS + ONE BARMAN SERVER + ONE RECOVERY SERVER + REPMGR (AUTO-FAILOVER)
2ndquadrant.com @_GBartolini_ #PGDayIT ARCHITECTURE Potential synchronous Synchronous repmgr repmgr repmgr witness
WHAT’S MORE?
2ndquadrant.com @_GBartolini_ #PGDayIT PUSH THE BOUNDARIES ▸ Repeatable architectures ▸ PgBouncer ▸ Virtual IPs ▸ S3 relay via Barman hook scripts ▸ Multiple standby servers and cascading replication ▸ Docker containers ▸ Logical replication backups
2ndquadrant.com @_GBartolini_ #PGDayIT CONCLUSIONS ▸ Babysteps and KISS ▸ New? Explore and learn ▸ Practice is the only way to mastery (drills) ▸ Plan regular healthy downtimes ▸ Use switchovers to perform PostgreSQL updates ▸ Smart downtimes increase long-term uptime
2ndquadrant.com @_GBartolini_ #PGDayIT ANY QUESTIONS? ▸ PostgreSQL: www.postgresql.org ▸ Barman: www.pgbarman.org ▸ Barman Cli: github.com/2ndquadrant-it/barman-cli ▸ PgBouncer: pgbouncer.github.io ▸ Repmgr: www.repmgr.org ▸ Our blog: blog.2ndquadrant.com
2ndquadrant.com @_GBartolini_ #PGDayIT LICENCE Attribution 4.0 International (CC BY 4.0) You are free to: ▸ Share — copy and redistribute the material in any medium or format ▸ Adapt — remix, transform, and build upon the material for any purpose, even commercially. The licensor cannot revoke these freedoms as long as you follow the license terms.

From 0 to ~100: Business Continuity with PostgreSQL