This runbook shows how to add Postgres alerts to a default Prometheus installation deployed with kube-prometheus-stack, and gives you examples of useful alerts.
The Prometheus installation will add some CRDs to your Kubernetes cluster:
```
$ kubectl get crd -o name | grep monitoring
customresourcedefinition.apiextensions.k8s.io/alertmanagerconfigs.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io/podmonitors.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io/probes.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io/prometheusrules.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io/thanosrulers.monitoring.coreos.com
```
These resources allow you to add Prometheus instances and Prometheus rules. The Prometheus rules define your alerts using PromQL expressions plus a few other required parameters. This is an example rule:
```yaml
- alert: PostgresInstanceDown
  expr: pg_up == 0
  for: 1m
  labels:
    severity: critical
    service: "PostgreSQL"
  annotations:
    summary: "Postgres server instance is down"
    description: "Postgres has not been responding for the past 1 minute on {{ $labels.instance }}"
    title: "Postgres server instance {{ $labels.instance }} is down"
```
The `expr` parameter defines the PromQL query that retrieves the data for the alert. The `pg_up` metric returns `0` if the instance is down and `1` if it is up. The `for` parameter defines how long the expression must evaluate to true before the alert fires. You can also add labels and annotations to describe your alert more precisely. So, the above example means: *if the `pg_up` metric value is equal to zero for 1 minute, then fire the alert*.
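Before loading a rule into the cluster, you can optionally validate its syntax offline. This is a minimal sketch that assumes you have `promtool` (shipped with Prometheus) available locally and that the rule group is saved as a plain Prometheus rules file rather than as the `PrometheusRule` custom resource; the file name `postgres-rules-check.yaml` is just illustrative:

```
# Write the example rule group to a plain Prometheus rules file
$ cat > postgres-rules-check.yaml <<'EOF'
groups:
- name: StackGres
  rules:
  - alert: PostgresInstanceDown
    expr: pg_up == 0
    for: 1m
EOF
# Validate the rule definitions and the PromQL expression
$ promtool check rules postgres-rules-check.yaml
```

Note that `promtool` only checks that the PromQL parses and the rule fields are well formed; it does not verify that the metrics actually exist in your Prometheus instance.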
Check the alerts section below for the full alerts description and YAML definitions.
These are some useful alerts that you can add to your StackGres cluster.
Alert | Severity | Threshold | Description |
---|---|---|---|
PostgresExporterErrors | critical | 5m | Check the last scrape from postgres-exporter |
PostgresInstanceDown | critical | 1m | Check if the postgres service is up |
PostgresSplitBrain | critical | 1m | Check if there is more than one read/write instance in the cluster |
PostgresPromotedNode | critical | 5m | Check if the replication leader was changed |
PostgresInactiveReplicationSlots | warning | 30m | Check if there is an inactive replication slot |
PostgresReplicationLagSizeTooLarge | warning | 5m | Check if the replication lag is increasing |
PostgresTooManyDeadTuples | warning | 30m | Check if there are too many dead tuples |
PostgresTooManyConnections | warning | 5m | Check if postgres connections are above 90% of max_connections |
PostgresNotEnoughConnections | warning | 5m | Check if postgres available connections are less than 5 |
PgBouncerWaitingClients | critical | 1m | Check if pgbouncer has waiting clients |
PgBouncerNotEnoughConnections | critical | 5m | Check if the pool size is not enough for the current connections |
PgBouncerPoolFillingUp | warning | 5m | Check if the pgbouncer pool is filling up |
PgBouncerAvgWaitTimeTooHigh | warning | 5m | Check if the time spent by clients waiting for a connection is > 1s |
PgBouncerQueryTimeTooHigh | warning | 5m | Check if the average query duration is more than 5 seconds |
DatabaseLowDiskAvailable | warning | 15m | Check if the database available disk space is <= 20% |
For a default Prometheus stack installation, create the YAML file `stackgres-alerts.yaml`:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: kube-prometheus-stack
    release: prometheus-operator
  name: stackgres-rules
  namespace: monitoring
spec:
  groups:
  - name: StackGres
    rules:
    - alert: PostgresInstanceDown
      expr: pg_up == 0
      for: 1m
      labels:
        severity: critical
        service: "PostgreSQL"
        cluster: "StackGres"
      annotations:
        summary: "Postgres server instance is down"
        description: "Postgres has not been responding for the past 1 minute on {{ $labels.instance }}"
        title: "Postgres server instance {{ $labels.instance }} is down"
    - alert: PostgresExporterErrors
      expr: pg_exporter_last_scrape_error == 1
      for: 10m
      labels:
        severity: critical
        service: "PostgreSQL"
        cluster: "StackGres"
      annotations:
        summary: "Postgres Exporter is down or is showing errors"
        description: "postgres-exporter is not running or it is showing errors {{ $labels.instance }}"
    - alert: PostgresReplicationLagSizeTooLarge
      expr: pg_replication_status_lag_size > 1e+09
      for: 5m
      labels:
        severity: critical
        service: "PostgreSQL"
        cluster: "StackGres"
      annotations:
        summary: "Postgres replication lag size is too large"
        description: "Replication lag size on server {{$labels.instance}} ({{$labels.application_name}}) is currently {{ $value | humanize1024}}B behind the leader in cluster {{$labels.cluster_name}}"
    - alert: PostgresTooManyDeadTuples
      expr: ((pg_stat_user_tables_n_dead_tup > 1e+06) / (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup)) >= 0.05
      for: 30m
      labels:
        severity: warning
        service: "PostgreSQL"
        cluster: "StackGres"
      annotations:
        summary: "PostgreSQL dead tuple ratio is too large"
        description: "The dead tuple ratio of {{$labels.relname}} on database {{$labels.datname}} is greater than 5% in cluster {{$labels.cluster_name}}"
    - alert: PostgresInactiveReplicationSlots
      expr: pg_replication_slots_active == 0
      for: 30m
      labels:
        severity: warning
        service: "PostgreSQL"
        cluster: "StackGres"
      annotations:
        summary: "There are inactive replication slots"
        description: "There are some inactive replication slots on {{$labels.instance}} in cluster {{$labels.cluster_name}}"
    - alert: PostgresSplitBrain
      expr: count by(cluster_name) (pg_replication_is_replica == 0) > 1
      for: 1m
      labels:
        severity: critical
        service: "PostgreSQL"
        cluster: "StackGres"
      annotations:
        summary: "There is more than one instance in read-write mode"
        description: "Split Brain: too many postgres databases in cluster {{$labels.cluster_name}} in read-write mode"
    - alert: PostgresTooManyConnections
      expr: sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) > pg_settings_max_connections * 0.9
      for: 5m
      labels:
        severity: warning
        service: "PostgreSQL"
        cluster: "StackGres"
      annotations:
        summary: Postgresql too many connections (instance {{ $labels.instance }} in cluster {{$labels.cluster_name}})
        description: "PostgreSQL instance has too many connections\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
    - alert: PostgresNotEnoughConnections
      expr: sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) < 5
      for: 5m
      labels:
        severity: warning
        service: "PostgreSQL"
        cluster: "StackGres"
      annotations:
        summary: Postgresql not enough connections (instance {{ $labels.instance }} in cluster {{$labels.cluster_name}})
        description: "PostgreSQL instance should have more connections (> 5)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
    - alert: PostgresPromotedNode
      expr: pg_replication_is_replica and changes(pg_replication_is_replica[1m]) > 0
      for: 5m
      labels:
        severity: warning
        service: "PostgreSQL"
        cluster: "StackGres"
      annotations:
        summary: "Postgresql promoted node (instance {{ $labels.instance }}, cluster {{ $labels.cluster_name }})"
        description: "Postgresql standby server has been promoted as primary node\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
    # Connection Pooling alerts
    - alert: PgBouncerWaitingClients
      expr: pgbouncer_show_pools_cl_waiting > 0
      for: 5m
      labels:
        severity: warning
        service: "PgBouncer"
        cluster: "StackGres"
      annotations:
        summary: PgBouncer has waiting clients on instance {{ $labels.instance }} in cluster {{$labels.cluster_name}}
        description: "PgBouncer instance has waiting clients\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
    - alert: PgBouncerNotEnoughConnections
      expr: (sum by (database,instance) (pgbouncer_show_pools_cl_active{database!~"template.*|postgres|pgbouncer"}) + sum by (database, instance) (pgbouncer_show_pools_cl_waiting{database!~"template.*|postgres|pgbouncer"})) - on (database,instance) (pgbouncer_show_databases_pool_size{database!~"template.*|postgres|pgbouncer"}) > 0
      for: 10m
      labels:
        severity: critical
        service: "PgBouncer"
        cluster: "StackGres"
      annotations:
        summary: PgBouncer pool size is not enough for the current connections on {{ $labels.instance }} in cluster {{$labels.cluster_name}}
        description: "PgBouncer is getting more connections than the pool size, extra connections = {{ $value }}"
    - alert: PgBouncerPoolFillingUp
      expr: (sum by (database,instance) (pgbouncer_show_databases_pool_size{database!~"template.*|postgres|pgbouncer"}) - on (database,instance) pgbouncer_show_databases_current_connections) <= 15
      for: 5m
      labels:
        severity: warning
        service: "PgBouncer"
        cluster: "StackGres"
      annotations:
        summary: PgBouncer pool is filling up on {{ $labels.instance }} in cluster {{$labels.cluster_name}}
        description: "PgBouncer pool is filling up, remaining connections = {{ $value }}"
    - alert: PgBouncerAvgWaitTimeTooHigh
      expr: pgbouncer_show_stats_avg_wait_time > 1e+6
      for: 5m
      labels:
        severity: warning
        service: "PgBouncer"
        cluster: "StackGres"
      annotations:
        summary: PgBouncer time spent by clients waiting for a connection is too high on {{ $labels.instance }} in cluster {{$labels.cluster_name}}
        description: "PgBouncer wait time for a server connection is too high = {{ $value }}"
    - alert: PgBouncerQueryTimeTooHigh
      expr: pgbouncer_show_stats_avg_query_time > 5e+6
      for: 5m
      labels:
        severity: warning
        service: "PgBouncer"
        cluster: "StackGres"
      annotations:
        summary: PgBouncer average query duration is more than 5 seconds on {{ $labels.instance }} in cluster {{$labels.cluster_name}}
        description: "PgBouncer average query duration is more than 5 seconds = {{ $value }}"
    - alert: DatabaseLowDiskAvailable
      expr: (1.0 - node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql",fstype!=""} / node_filesystem_size_bytes{mountpoint="/var/lib/postgresql",fstype!=""}) * 100 >= 80
      for: 15m
      labels:
        severity: warning
        cluster: "StackGres"
      annotations:
        summary: Database disk is filling up, less than 20% available on {{ $labels.instance }} in cluster {{$labels.cluster_name}}
        description: "Database disk is filling up, less than 20% available; currently occupied {{ $value }}%"
```
and deploy it to Kubernetes:

```
kubectl apply -f stackgres-alerts.yaml
```
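To confirm that the rule resource was created, you can list it through the `prometheusrules` CRD shown earlier (the output columns may vary slightly by operator version):

```
$ kubectl get prometheusrules -n monitoring stackgres-rules
```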
The following describes how Prometheus discovers the alerts.
The default Prometheus installation contains a few deployments:
```
$ kubectl get deployments.apps -n monitoring
NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
prometheus-operator-grafana              1/1     1            1           25d
prometheus-operator-kube-p-operator      1/1     1            1           25d
prometheus-operator-kube-state-metrics   1/1     1            1           25d
```
and a Prometheus instance:
```
$ kubectl get pods -n monitoring -l app=prometheus
NAME                                                 READY   STATUS    RESTARTS   AGE
prometheus-prometheus-operator-kube-p-prometheus-0   2/2     Running   0          25d
```
The instance defines a `ruleSelector` section with some labels:
```
$ kubectl get -n monitoring prometheus prometheus-operator-kube-p-prometheus -o yaml | grep -A3 ruleSelector
  ruleSelector:
    matchLabels:
      app: kube-prometheus-stack
      release: prometheus-operator
```
You need to make sure the `metadata` labels of your Prometheus rule match the `ruleSelector` labels of your Prometheus instance. For our example:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: kube-prometheus-stack
    release: prometheus-operator
  name: stackgres-rules
  namespace: monitoring
...
```
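To quickly compare both sides, you can print the labels of the rule you created; this is a simple check that assumes the `stackgres-rules` example above:

```
$ kubectl get prometheusrule -n monitoring stackgres-rules --show-labels
```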
Prometheus will auto-discover the alerts according to the matching labels and add them to the instance.
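If you want to verify this from the command line instead of the Prometheus web UI, a minimal sketch (assuming the pod name shown earlier and a free local port 9090) is to port-forward the Prometheus pod and query its rules API:

```
$ kubectl port-forward -n monitoring prometheus-prometheus-operator-kube-p-prometheus-0 9090:9090 &
$ curl -s http://localhost:9090/api/v1/rules | grep -o 'PostgresInstanceDown'
```

The `/api/v1/rules` endpoint returns all loaded rule groups, so the alert names from `stackgres-rules` should appear in the output once the rule has been picked up.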
Once you have configured the alerts on Prometheus, you can send notifications when an alert fires, depending on your infrastructure. You could use email, Slack, PagerDuty, Opsgenie, and others. Check the Alert Manager Configuration for more details.
In this example, you’ll see how to configure alerts with email messages.
The Prometheus stack installation includes the Alert Manager:
```
$ kubectl get pods -n monitoring -l app=alertmanager
NAME                                                     READY   STATUS    RESTARTS   AGE
alertmanager-prometheus-operator-kube-p-alertmanager-0   2/2     Running   0          68m
```
The alerting is configured using a secret:
```
$ kubectl get secrets -n monitoring alertmanager-prometheus-operator-kube-p-alertmanager
NAME                                                   TYPE     DATA   AGE
alertmanager-prometheus-operator-kube-p-alertmanager   Opaque   1      30d
```
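If you want to inspect the configuration currently stored in this secret before replacing it, you can decode it. This is a minimal sketch; the key name `alertmanager.yaml` is the same one patched later in this runbook:

```
$ kubectl get secret -n monitoring alertmanager-prometheus-operator-kube-p-alertmanager \
    -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d
```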
To update the configuration, you first need to create a file `alertmanager.yaml` with the following content. Replace the users and credentials according to your needs:
```yaml
global:
  resolve_timeout: 5m
route:
  receiver: 'email-alert'
  group_by: ['datname']
  routes:
  - receiver: 'email-alert'
    group_wait: 50s
    group_interval: 5m
    repeat_interval: 12h
receivers:
- name: email-alert
  email_configs:
  - to: email-group@ongres.com
    from: stackgres-alerts@ongres.com
    # Your smtp server address
    smarthost: mail.ongres.com:587
    auth_username: myuser@ongres.com
    auth_identity: myuser@ongres.com
    auth_password: XXXXXXX
```
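Optionally, you can validate this file before loading it. This sketch assumes you have `amtool` (shipped with Alertmanager) installed locally:

```
$ amtool check-config alertmanager.yaml
```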
Then, you need to encode the file contents in Base64:
```
$ cat alertmanager.yaml | base64 -w0
```
This returns a string like:
```
Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0Kcm91dGU6CiAgcmVjZWl2ZXI6ICdlbWFpbC1hbGVydCcK
```
Patch the Alert Manager secret with the string generated by the command above:
```
$ kubectl patch secret -n monitoring alertmanager-prometheus-operator-kube-p-alertmanager -p='{"data":{"alertmanager.yaml": "Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0Kcm91dGU6CiAgcmVjZWl2ZXI6ICdlbWFpbC1hbGVydCcK"}}' -v=1
```
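If you prefer not to handle the Base64 string by hand, an equivalent approach (a sketch assuming kubectl 1.18 or newer for `--dry-run=client`) is to regenerate the secret directly from the file:

```
$ kubectl create secret generic alertmanager-prometheus-operator-kube-p-alertmanager \
    -n monitoring --from-file=alertmanager.yaml \
    --dry-run=client -o yaml | kubectl apply -f -
```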
After a few seconds, the new configuration will be applied. You can verify it by accessing the Alert Manager console, with:
```
kubectl port-forward -n monitoring alertmanager-prometheus-operator-kube-p-alertmanager-0 9093:9093
```
Then open your web browser at: http://localhost:9093
Now, go to the `Status` section and verify that your changes were applied.
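Alternatively, with the port-forward above still running, you can query the Alertmanager API and check that the new receiver appears in the returned configuration:

```
$ curl -s http://localhost:9093/api/v2/status | grep -o 'email-alert'
```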
If there’s an active alert on your cluster, you should see it in the Alert Manager console.
And if the email configuration and credentials you provide are OK, you’ll receive an email.
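To test the whole pipeline without waiting for a real incident, you can push a synthetic alert to Alertmanager through the same port-forward. This is only a sketch; the alert name `TestAlert` is an illustrative label, not one of the rules defined above:

```
$ curl -s -XPOST http://localhost:9093/api/v2/alerts \
    -H 'Content-Type: application/json' \
    -d '[{"labels": {"alertname": "TestAlert", "severity": "warning"}}]'
```

The alert should show up in the Alert Manager console and, with the email receiver configured above as the default route, trigger a message to the configured address.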