This runbook shows how to add Postgres alerts to a default Prometheus installation deployed with kube-prometheus-stack, and gives you examples of useful alerts.
The Prometheus installation will add some CRDs to your Kubernetes cluster:
```
$ kubectl get crd -o name | grep monitoring
customresourcedefinition.apiextensions.k8s.io/alertmanagerconfigs.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io/podmonitors.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io/probes.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io/prometheusrules.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io/thanosrulers.monitoring.coreos.com
```
These resources allow you to add Prometheus instances and Prometheus rules. The Prometheus rules define your alerts using PromQL expressions plus a few other required parameters. This is an example rule:
```yaml
- alert: PostgresInstanceDown
  expr: pg_up == 0
  for: 1m
  labels:
    severity: critical
    service: "PostgreSQL"
  annotations:
    summary: "Postgres server instance is down"
    description: "Postgres has not been responding for the past 1 minute on {{ $labels.instance }}"
    title: "Postgres server instance {{ $labels.instance }} is down"
```
The `expr` parameter defines the PromQL query that retrieves the data for the alert. The `pg_up` metric returns `0` if the instance is down and `1` if it is up. The `for` parameter defines how long the expression must evaluate to true before the alert fires. You can also add labels and annotations to describe your alert more precisely. So, the above example means: *if the `pg_up` metric value is equal to zero for 1 minute, then fire the alert*.
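Before loading a rule into the cluster, you can optionally validate its syntax offline. This is a minimal sketch that assumes you have `promtool` (shipped with Prometheus) available locally and that the rule group is saved as a plain Prometheus rules file rather than as the `PrometheusRule` custom resource; the file name `postgres-rules-check.yaml` is just illustrative:

```
# Write the example rule group to a plain Prometheus rules file
$ cat > postgres-rules-check.yaml <<'EOF'
groups:
- name: StackGres
  rules:
  - alert: PostgresInstanceDown
    expr: pg_up == 0
    for: 1m
EOF
# Validate the rule definitions and the PromQL expression
$ promtool check rules postgres-rules-check.yaml
```

Note that `promtool` only checks that the PromQL parses and the rule fields are well formed; it does not verify that the metrics actually exist in your Prometheus instance.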
Check the alerts section below for the full alerts description and YAML definitions.
These are some useful alerts that you can add to your StackGres cluster.
Alert | Severity | Threshold | Description |
---|---|---|---|
PostgresExporterErrors | critical | 5m | Check the last scrape from postgres-exporter |
PostgresInstanceDown | critical | 1m | Check if the postgres service is up |
PostgresSplitBrain | critical | 1m | Check if there is more than one read/write instance in the cluster |
PostgresPromotedNode | critical | 5m | Check if the replication leader was changed |
PostgresInactiveReplicationSlots | warning | 30m | Check if there is an inactive replication slot |
PostgresReplicationLagSizeTooLarge | warning | 5m | Check if the replication lag is increasing |
PostgresTooManyDeadTuples | warning | 30m | Check if there are too many dead tuples |
PostgresTooManyConnections | warning | 5m | Check if postgres connections are above 90% of max_connections |
PostgresNotEnoughConnections | warning | 5m | Check if postgres available connections are less than 5 |
PgBouncerWaitingClients | critical | 1m | Check if pgbouncer has waiting clients |
PgBouncerNotEnoughConnections | critical | 5m | Check if the pool size is not enough for the current connections |
PgBouncerPoolFillingUp | warning | 5m | Check if the pgbouncer pool is filling up |
PgBouncerAvgWaitTimeTooHigh | warning | 5m | Check if the time spent by clients waiting for a connection is > 1s |
PgBouncerQueryTimeTooHigh | warning | 5m | Check if the average query duration is more than 5 seconds |
DatabaseLowDiskAvailable | warning | 15m | Check if the database available disk space is <= 20% |
For a default Prometheus stack installation, create the YAML file `stackgres-alerts.yaml`:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: kube-prometheus-stack
    release: prometheus-operator
  name: stackgres-rules
  namespace: monitoring
spec:
  groups:
  - name: StackGres
    rules:
    - alert: PostgresInstanceDown
      expr: pg_up == 0
      for: 1m
      labels:
        severity: critical
        service: "PostgreSQL"
        cluster: "StackGres"
      annotations:
        summary: "Postgres server instance is down"
        description: "Postgres has not been responding for the past 1 minute on {{ $labels.instance }}"
        title: "Postgres server instance {{ $labels.instance }} is down"
    - alert: PostgresExporterErrors
      expr: pg_exporter_last_scrape_error == 1
      for: 10m
      labels:
        severity: critical
        service: "PostgreSQL"
        cluster: "StackGres"
      annotations:
        summary: "Postgres Exporter is down or is showing errors"
        description: "postgres-exporter is not running or it is showing errors {{ $labels.instance }}"
    - alert: PostgresReplicationLagSizeTooLarge
      expr: pg_replication_status_lag_size > 1e+09
      for: 5m
      labels:
        severity: critical
        service: "PostgreSQL"
        cluster: "StackGres"
      annotations:
        summary: "Postgres replication lag size is too large"
        description: "Replication lag size on server {{$labels.instance}} ({{$labels.application_name}}) is currently {{ $value | humanize1024}}B behind the leader in cluster {{$labels.cluster_name}}"
    - alert: PostgresTooManyDeadTuples
      expr: ((pg_stat_user_tables_n_dead_tup > 1e+06) / (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup)) >= 0.05
      for: 30m
      labels:
        severity: warning
        service: "PostgreSQL"
        cluster: "StackGres"
      annotations:
        summary: "PostgreSQL dead tuple ratio is too large"
        description: "The dead tuple ratio of {{$labels.relname}} on database {{$labels.datname}} is greater than 5% in cluster {{$labels.cluster_name}}"
    - alert: PostgresInactiveReplicationSlots
      expr: pg_replication_slots_active == 0
      for: 30m
      labels:
        severity: warning
        service: "PostgreSQL"
        cluster: "StackGres"
      annotations:
        summary: "There are inactive replication slots"
        description: "There are some inactive replication slots on {{$labels.instance}} in cluster {{$labels.cluster_name}}"
    - alert: PostgresSplitBrain
      expr: count by(cluster_name) (pg_replication_is_replica == 0) > 1
      for: 1m
      labels:
        severity: critical
        service: "PostgreSQL"
        cluster: "StackGres"
      annotations:
        summary: "There is more than one instance in read-write mode"
        description: "Split Brain: too many postgres databases in cluster {{$labels.cluster_name}} in read-write mode"
    - alert: PostgresTooManyConnections
      expr: sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) > pg_settings_max_connections * 0.9
      for: 5m
      labels:
        severity: warning
        service: "PostgreSQL"
        cluster: "StackGres"
      annotations:
        summary: Postgresql too many connections (instance {{ $labels.instance }} in cluster {{$labels.cluster_name}})
        description: "PostgreSQL instance has too many connections\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
    - alert: PostgresNotEnoughConnections
      expr: sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) < 5
      for: 5m
      labels:
        severity: warning
        service: "PostgreSQL"
        cluster: "StackGres"
      annotations:
        summary: Postgresql not enough connections (instance {{ $labels.instance }} in cluster {{$labels.cluster_name}})
        description: "PostgreSQL instance should have more connections (> 5)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
    - alert: PostgresPromotedNode
      expr: pg_replication_is_replica and changes(pg_replication_is_replica[1m]) > 0
      for: 5m
      labels:
        severity: warning
        service: "PostgreSQL"
        cluster: "StackGres"
      annotations:
        summary: "Postgresql promoted node (instance {{ $labels.instance }}, cluster {{ $labels.cluster_name }})"
        description: "Postgresql standby server has been promoted as primary node\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
    # Connection Pooling alerts
    - alert: PgBouncerWaitingClients
      expr: pgbouncer_show_pools_cl_waiting > 0
      for: 5m
      labels:
        severity: warning
        service: "PgBouncer"
        cluster: "StackGres"
      annotations:
        summary: PgBouncer has waiting clients on instance {{ $labels.instance }} in cluster {{$labels.cluster_name}}
        description: "PgBouncer instance has waiting clients\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
    - alert: PgBouncerNotEnoughConnections
      expr: (sum by (database,instance) (pgbouncer_show_pools_cl_active{database!~"template.*|postgres|pgbouncer"}) + sum by (database, instance) (pgbouncer_show_pools_cl_waiting{database!~"template.*|postgres|pgbouncer"})) - on (database,instance) (pgbouncer_show_databases_pool_size{database!~"template.*|postgres|pgbouncer"}) > 0
      for: 10m
      labels:
        severity: critical
        service: "PgBouncer"
        cluster: "StackGres"
      annotations:
        summary: PgBouncer pool size is not enough for the current connections on {{ $labels.instance }} in cluster {{$labels.cluster_name}}
        description: "PgBouncer is getting more connections than the pool size, extra connections = {{ $value }}"
    - alert: PgBouncerPoolFillingUp
      expr: (sum by (database,instance) (pgbouncer_show_databases_pool_size{database!~"template.*|postgres|pgbouncer"}) - on (database,instance) pgbouncer_show_databases_current_connections) <= 15
      for: 5m
      labels:
        severity: warning
        service: "PgBouncer"
        cluster: "StackGres"
      annotations:
        summary: PgBouncer pool is filling up on {{ $labels.instance }} in cluster {{$labels.cluster_name}}
        description: "PgBouncer pool is filling up, remaining connections = {{ $value }}"
    - alert: PgBouncerAvgWaitTimeTooHigh
      expr: pgbouncer_show_stats_avg_wait_time > 1e+6
      for: 5m
      labels:
        severity: warning
        service: "PgBouncer"
        cluster: "StackGres"
      annotations:
        summary: PgBouncer time spent by clients waiting for a connection is too high on {{ $labels.instance }} in cluster {{$labels.cluster_name}}
        description: "PgBouncer wait time for a server connection is too high = {{ $value }}"
    - alert: PgBouncerQueryTimeTooHigh
      expr: pgbouncer_show_stats_avg_query_time > 5e+6
      for: 5m
      labels:
        severity: warning
        service: "PgBouncer"
        cluster: "StackGres"
      annotations:
        summary: PgBouncer average query duration is more than 5 seconds on {{ $labels.instance }} in cluster {{$labels.cluster_name}}
        description: "PgBouncer average query duration is more than 5 seconds = {{ $value }}"
    - alert: DatabaseLowDiskAvailable
      expr: (1.0 - node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql",fstype!=""} / node_filesystem_size_bytes{mountpoint="/var/lib/postgresql",fstype!=""}) * 100 >= 80
      for: 15m
      labels:
        severity: warning
        cluster: "StackGres"
      annotations:
        summary: Database disk is filling up, less than 20% available on {{ $labels.instance }} in cluster {{$labels.cluster_name}}
        description: "Database disk is filling up, less than 20% available; currently occupied {{ $value }}%"
```
and deploy it to Kubernetes:

```
kubectl apply -f stackgres-alerts.yaml
```
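To confirm that the rule resource was created, you can list it through the `prometheusrules` CRD shown earlier (the output columns may vary slightly by operator version):

```
$ kubectl get prometheusrules -n monitoring stackgres-rules
```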
The following describes how Prometheus discovers the alerts.
The default Prometheus installation contains a few deployments:
```
$ kubectl get deployments.apps -n monitoring
NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
prometheus-operator-grafana              1/1     1            1           25d
prometheus-operator-kube-p-operator      1/1     1            1           25d
prometheus-operator-kube-state-metrics   1/1     1            1           25d
```
and a Prometheus instance:
```
$ kubectl get pods -n monitoring -l app=prometheus
NAME                                                 READY   STATUS    RESTARTS   AGE
prometheus-prometheus-operator-kube-p-prometheus-0   2/2     Running   0          25d
```
The instance defines a `ruleSelector` section with some labels:
```
$ kubectl get -n monitoring prometheus prometheus-operator-kube-p-prometheus -o yaml | grep -A3 ruleSelector
  ruleSelector:
    matchLabels:
      app: kube-prometheus-stack
      release: prometheus-operator
```
You need to make sure the `metadata` labels of your Prometheus rule match the `ruleSelector` labels of your Prometheus instance. For our example:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: kube-prometheus-stack
    release: prometheus-operator
  name: stackgres-rules
  namespace: monitoring
...
```
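To quickly compare both sides, you can print the labels of the rule you created; this is a simple check that assumes the `stackgres-rules` example above:

```
$ kubectl get prometheusrule -n monitoring stackgres-rules --show-labels
```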
Prometheus will auto-discover the alerts according to the matching labels and add them to the instance.
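If you want to verify this from the command line instead of the Prometheus web UI, a minimal sketch (assuming the pod name shown earlier and a free local port 9090) is to port-forward the Prometheus pod and query its rules API:

```
$ kubectl port-forward -n monitoring prometheus-prometheus-operator-kube-p-prometheus-0 9090:9090 &
$ curl -s http://localhost:9090/api/v1/rules | grep -o 'PostgresInstanceDown'
```

The `/api/v1/rules` endpoint returns all loaded rule groups, so the alert names from `stackgres-rules` should appear in the output once the rule has been picked up.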
Once you have configured the alerts on Prometheus, you can send notifications when an alert fires, depending on your infrastructure. You could use email, Slack, PagerDuty, Opsgenie, and others. Check the Alert Manager Configuration for more details.
In this example, you’ll see how to configure alerts with email messages.
The Prometheus stack installation includes the Alert Manager:
```
$ kubectl get pods -n monitoring -l app=alertmanager
NAME                                                     READY   STATUS    RESTARTS   AGE
alertmanager-prometheus-operator-kube-p-alertmanager-0   2/2     Running   0          68m
```
The alerting is configured using a secret:
```
$ kubectl get secrets -n monitoring alertmanager-prometheus-operator-kube-p-alertmanager
NAME                                                   TYPE     DATA   AGE
alertmanager-prometheus-operator-kube-p-alertmanager   Opaque   1      30d
```
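If you want to inspect the configuration currently stored in this secret before replacing it, you can decode it. This is a minimal sketch; the key name `alertmanager.yaml` is the same one patched later in this runbook:

```
$ kubectl get secret -n monitoring alertmanager-prometheus-operator-kube-p-alertmanager \
    -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d
```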
To update the configuration, you first need to create a file `alertmanager.yaml` with the following content. Replace the users and credentials according to your needs:
```yaml
global:
  resolve_timeout: 5m
route:
  receiver: 'email-alert'
  group_by: ['datname']
  routes:
  - receiver: 'email-alert'
    group_wait: 50s
    group_interval: 5m
    repeat_interval: 12h
receivers:
- name: email-alert
  email_configs:
  - to: email-group@ongres.com
    from: stackgres-alerts@ongres.com
    # Your smtp server address
    smarthost: mail.ongres.com:587
    auth_username: myuser@ongres.com
    auth_identity: myuser@ongres.com
    auth_password: XXXXXXX
```
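Optionally, you can validate this file before loading it. This sketch assumes you have `amtool` (shipped with Alertmanager) installed locally:

```
$ amtool check-config alertmanager.yaml
```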
Then, you need to encode the file contents in Base64:
```
$ cat alertmanager.yaml | base64 -w0
```
This returns a string like:
```
Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0Kcm91dGU6CiAgcmVjZWl2ZXI6ICdlbWFpbC1hbGVydCcK
```
Patch the Alert Manager secret with the string generated by the command above:
```
$ kubectl patch secret -n monitoring alertmanager-prometheus-operator-kube-p-alertmanager -p='{"data":{"alertmanager.yaml": "Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0Kcm91dGU6CiAgcmVjZWl2ZXI6ICdlbWFpbC1hbGVydCcK"}}' -v=1
```
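If you prefer not to handle the Base64 string by hand, an equivalent approach (a sketch assuming kubectl 1.18 or newer for `--dry-run=client`) is to regenerate the secret directly from the file:

```
$ kubectl create secret generic alertmanager-prometheus-operator-kube-p-alertmanager \
    -n monitoring --from-file=alertmanager.yaml \
    --dry-run=client -o yaml | kubectl apply -f -
```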
After a few seconds, the new configuration will be applied. You can verify it by accessing the Alert Manager console, with:
```
kubectl port-forward -n monitoring alertmanager-prometheus-operator-kube-p-alertmanager-0 9093:9093
```
Then open your web browser at: http://localhost:9093
Now, go to the `Status` section and verify that your changes were applied.
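Alternatively, with the port-forward above still running, you can query the Alertmanager API and check that the new receiver appears in the returned configuration:

```
$ curl -s http://localhost:9093/api/v2/status | grep -o 'email-alert'
```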
If there’s an active alert on your cluster, you should see it in the Alert Manager console.
And if the email configuration and credentials you provide are OK, you’ll receive an email.
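To test the whole pipeline without waiting for a real incident, you can push a synthetic alert to Alertmanager through the same port-forward. This is only a sketch; the alert name `TestAlert` is an illustrative label, not one of the rules defined above:

```
$ curl -s -XPOST http://localhost:9093/api/v2/alerts \
    -H 'Content-Type: application/json' \
    -d '[{"labels": {"alertname": "TestAlert", "severity": "warning"}}]'
```

The alert should show up in the Alert Manager console and, with the email receiver configured above as the default route, trigger a message to the configured address.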