Maintenance with Zero Downtime

Maintenance tasks for Postgres usually require some degree of service disruption which may, in some cases, become an issue for your business.

In this runbook we’ll demonstrate a technique to achieve zero downtime for your maintenance tasks using StackGres and an external PgBouncer instance in transaction pooling mode.

IMPORTANT: When transaction pooling is used, clients must not use any session-based features, since each transaction may end up on a different connection and thus get a different session state. Starting with version 1.21.0, PgBouncer added some support for prepared statements in transaction pooling mode.
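If your clients rely on protocol-level prepared statements, a minimal pgbouncer.ini sketch enabling that support could look like this (assuming PgBouncer 1.21.0 or later; the value shown is illustrative, and 0 keeps the feature disabled):

[pgbouncer]
pool_mode=transaction
; Track prepared statements and replay them on whichever server connection
; a transaction lands on (available since PgBouncer 1.21.0)
max_prepared_statements=100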

Restart and Switchover with Zero Downtime

In some cases Postgres has to be restarted in order for some parameter changes to take effect. This can be performed manually using the patronictl restart command (see the example below). In other cases you need to re-create the Pod in order for some configuration changes to take place, for example after an upgrade of the StackGres operator or when a change in the SGCluster modifies the Pod. To re-create a Pod you may simply delete it. But in both cases a manual operation may be dangerous and may not respect the order in which each Pod of the SGCluster has to be re-created. In general the restart, security upgrade, or minor version upgrade SGDbOps are the way to go: they handle the operation smoothly, performing a controlled switchover of the primary instance when needed and following the right order to update all of the Pods that need updating.
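For illustration only, a manual in-place restart of a single instance could look like the following. This is a sketch that assumes an SGCluster named cluster whose first Pod is cluster-0 and whose Patroni container is named patroni; check the output of patronictl list for the actual scope and member names:

# List the Patroni members to identify the scope name and the current primary
kubectl exec -ti cluster-0 -c patroni -- patronictl list

# Restart a single member in place, skipping the interactive confirmation
kubectl exec -ti cluster-0 -c patroni -- patronictl restart --force cluster cluster-0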

The switchover operation is not perfect since it disconnects all the Postgres clients, forcing them to return an error to the final user. PgBouncer offers a mechanism to avoid that when configured in transaction pooling mode, through the PAUSE and RESUME commands. Sending the PAUSE command with transaction pooling will wait for all the connections to complete their current transaction and then pause all of them. Sending the RESUME command will resume the connections so that they can continue to send transactions to the target Postgres instance. While connections are paused we can change the target Postgres instance, achieving a zero-downtime experience for our users.
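For reference, the pause/resume cycle boils down to two commands sent to PgBouncer’s admin console, shown here with the pgbouncer_admin user and the pgbouncer Service that are created later in this runbook:

# Wait for in-flight transactions to complete, then stop handing out server connections
psql postgresql://pgbouncer_admin:pgb0unc3r@pgbouncer/pgbouncer -c PAUSE

# Once the new primary is ready, let client connections flow again
psql postgresql://pgbouncer_admin:pgb0unc3r@pgbouncer/pgbouncer -c RESUME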

To achieve this we create a PgBouncer instance (using the same image the SGCluster uses for connection pooling) that will target the read-write Service of an SGCluster. The following snippet creates it together with a Service that applications will use to connect to PgBouncer:

cat << 'EOF' | kubectl replace --force -f -
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbouncer
spec:
  selector:
    matchLabels:
      app: pgbouncer
  template:
    metadata:
      labels:
        app: pgbouncer
    spec:
      terminationGracePeriodSeconds: 0
      containers:
      - name: pgbouncer
        image: quay.io/ongres/pgbouncer:v1.22.1-build-6.33
        command:
        - sh
        - /usr/local/bin/start-pgbouncer.sh
        ports:
        - containerPort: 5432
          name: pgbouncer
          protocol: TCP
        volumeMounts:
        - name: dynamic
          mountPath: /etc/pgbouncer
        - name: config
          mountPath: /etc/pgbouncer/pgbouncer.ini
          subPath: pgbouncer.ini
        - name: config
          mountPath: /usr/local/bin/start-pgbouncer.sh
          subPath: start-pgbouncer.sh
      volumes:
      - name: dynamic
        emptyDir: {}
      - name: config
        configMap:
          defaultMode: 0444
          name: pgbouncer
          optional: false
---
apiVersion: v1
kind: Service
metadata:
  name: pgbouncer
spec:
  type: ClusterIP
  selector:
    app: pgbouncer
  ports:
  - name: pgbouncer
    port: 5432
    protocol: TCP
    targetPort: pgbouncer
EOF

The PgBouncer instance references a ConfigMap containing the configuration that targets the SGCluster primary Service. The ConfigMap also contains an initialization script that creates the credentials for the PgBouncer users and configures an auth query against the Postgres instance in order to authenticate the other users:

cat << 'EOF' | kubectl replace --force -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: pgbouncer
data:
  pgbouncer.ini: |
    [databases]
    * = host=cluster port=5432

    [pgbouncer]
    listen_addr=0.0.0.0
    listen_port=5432

    pool_mode=transaction
    max_client_conn=1000
    default_pool_size=100
    max_db_connections=0
    max_user_connections=0

    auth_type=md5
    auth_file=/etc/pgbouncer/userlist.txt
    auth_user=postgres
    auth_query=SELECT usename, passwd FROM pg_shadow WHERE usename=$1

    admin_users=pgbouncer_admin
    stats_users=pgbouncer_stats
    application_name_add_host=1
    ignore_startup_parameters=extra_float_digits

    server_check_query=;
  start-pgbouncer.sh: |
    #!/bin/sh
    printf '"%s" "%s"\n' "postgres" "sup3rus3r" >> /etc/pgbouncer/userlist.txt
    printf '"%s" "%s"\n' "pgbouncer_admin" "pgb0unc3r" >> /etc/pgbouncer/userlist.txt
    printf '"%s" "%s"\n' "pgbouncer_stats" "pgb0unc3r" >> /etc/pgbouncer/userlist.txt
    exec pgbouncer /etc/pgbouncer/pgbouncer.ini
EOF

We then create an SGCluster that has the logic to send the PAUSE and RESUME commands thanks to the before_stop guard script and the on_role_change callback:

cat << 'EOF' | kubectl replace --force -f -
---
apiVersion: stackgres.io/v1
kind: SGCluster
metadata:
  name: cluster
spec:
  configurations:
    patroni:
      initialConfig:
        postgresql:
          callbacks:
            on_role_change: /callbacks/on_role_change
            before_stop: /callbacks/before_stop
  credentials:
    users:
      superuser:
        password:
          name: credentials
          key: superuser-password
  instances: 2
  profile: development
  pods:
    customVolumeMounts:
      patroni:
      - name: custom-callbacks
        mountPath: /callbacks
    customVolumes:
    - name: callbacks
      configMap:
        defaultMode: 0775
        name: callbacks
        optional: false
    persistentVolume:
      size: 5Gi
  postgres:
    version: latest
---
apiVersion: v1
kind: Secret
metadata:
  name: credentials
stringData:
  superuser-password: sup3rus3r
EOF

The scripts are mounted through a custom volume mount from a ConfigMap. The before_stop script is executed synchronously by Patroni and blocks the primary instance from being stopped by a switchover or a restart until the PAUSE command has been sent to the PgBouncer instance. This allows the connections to complete their ongoing transactions before the primary goes offline. The on_role_change script is executed asynchronously by Patroni and does not block the promotion of a primary. It waits for the instance to be promoted to primary and then sends the RESUME command so that connections sent to the instance will be able to write to the new primary:

cat << 'EOF' | kubectl replace --force -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: callbacks
data:
  before_stop: |
    #!/bin/sh
    set -x
    PATRONI_NAME="$(cat /etc/hostname)"
    PATRONI_HISTORY="$(patronictl history -f tsv | tail -n +2)"
    PATRONI_LIST="$(patronictl list -f tsv | tail -n +2)"
    if {
        [ "x$PATRONI_HISTORY" = x ] \
          && ! printf %s "$PATRONI_LIST" | grep -v $'^[^\t]\+\t'"$PATRONI_NAME"$'\t' | grep -q $'^[^\t]\+\t[^\t]\+\t[^\t]\+\tLeader\t'
      } \
      || printf %s "$PATRONI_HISTORY" | grep -q $'^[^\t]\+\t[^\t]\+\t[^\t]\+\t[^\t]\+\t'"$PATRONI_NAME"'$'
    then
      psql postgresql://pgbouncer_admin:pgb0unc3r@pgbouncer/pgbouncer -c PAUSE
    fi
    exit 0
  on_role_change: |
    #!/bin/sh
    set -x
    if [ "$#" = 0 ] || [ "x$2" = xmaster ]
    then
      until psql -tA -c 'SELECT pg_is_in_recovery()' | grep -qxF f
      do
        true
      done
      psql postgresql://pgbouncer_admin:pgb0unc3r@pgbouncer/pgbouncer -c RESUME
      psql postgresql://pgbouncer_admin:pgb0unc3r@pgbouncer/pgbouncer -tA -c 'SHOW STATE' | grep -q 'paused|no'
    fi
EOF

Now, to demonstrate the effectiveness of this deployment, let’s create the following Job that will launch pgbench against the PgBouncer instance:

cat << 'EOF' | kubectl replace --force -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: pgbench
spec:
  template:
    spec:
      restartPolicy: OnFailure
      terminationGracePeriodSeconds: 0
      containers:
      - name: pgbench
        image: quay.io/ongres/postgres-util:v16.3-build-6.34
        command:
        - sh
        - -c
        - |
          pgbench postgresql://postgres:sup3rus3r@pgbouncer/postgres -i
          pgbench postgresql://postgres:sup3rus3r@pgbouncer/postgres -T 300 -c 4 -j 4 -P 2 --progress-timestamp
EOF

Wait for the Job to start and print some progress of the benchmark (you can tail its logs as shown below), then create a restart SGDbOps that will restart the replica, perform a switchover, and then restart the primary.
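To follow the benchmark output while it runs you can tail the Job logs:

kubectl logs -f job/pgbench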

cat << 'EOF' | kubectl replace --force -f -
apiVersion: stackgres.io/v1
kind: SGDbOps
metadata:
  name: restart
spec:
  op: restart
  sgCluster: cluster
  restart:
    method: InPlace
EOF

After the operation has completed, wait for the Job to complete and check that no errors were raised:

kubectl wait sgdbops restart --for=condition=Completed
kubectl wait job pgbench --for=condition=Completed

Check the pgbench Job’s logs and you should not find any failed connections!
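A quick way to verify this is to scan the Job output for connection errors (a simple illustration; pgbench prints error and aborted-client messages when connections fail):

kubectl logs job/pgbench | grep -iE 'error|aborted' || echo "No failed connections found"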