@LukasFittl Monitoring PostgreSQL At Scale
@LukasFittl @LukasFittl pganalyze
@LukasFittl HostedPostgreSQLMonitoring pganalyze.com
@LukasFittl For the purposes of this talk, Monitoring=
 Monitoring+Observability
@LukasFittl BasicsofPostgresMonitoring
@LukasFittl “We had an outage yesterday at 10am - whathappened?”
@LukasFittl Keeping Historic Statistics Data IsEssential
@LukasFittl DIYMonitoringHack: Save pg_stat_activity and pg_stat_database every 10 seconds into a separate monitoring database
@LukasFittl
@LukasFittl
@LukasFittl High-Level Metrics
@LukasFittl AppServers PostgreSQL DataDirectory Storage WAL Storage CPU Memory Connection Pooler
@LukasFittl AppServers PostgreSQL DataDirectory Storage WAL Storage CPU Memory Connection Pooler
@LukasFittl Workload Metrics vs Operational Metrics
@LukasFittl Workload Metrics vs Operational Metrics
@LukasFittl PostgreSQL ConnectionProcessConnectionProcess DataDirectory Storage WAL Storage CPU Memory Checkpointer Background Writer AppServers/ ConnectionPooler Shared Buffers ConnectionProcess QueryExecution QueryPlanning autovacuum WALSender WALWriter
@LukasFittl WorkloadMetrics Data Sources
@LukasFittl Postgres Statistics Tables
@LukasFittl EXPLAIN Plans
@LukasFittl OperationalMetrics Data Sources
@LukasFittl Postgres Statistics Tables
@LukasFittl System Statistics (CPU Util %, I/O Util %, etc)
@LukasFittl Log Files
@LukasFittl OperationalMetrics Can Be Understood Better Through A Focus on Workload
@LukasFittl 3Quick WorkloadMetrics
 ToMonitor
@LukasFittl AgeOfOldestTransaction SELECT MAX(now() - xact_start)
 FROM pg_stat_activity
 WHERE state <> ‘idle’
@LukasFittl SELECT sum(heap_blks_hit) / nullif(sum(heap_blks_hit + heap_blks_read),0) FROM pg_statio_user_tables TableCacheHitRate Target: >= 99%
@LukasFittl SELECT relname, n_live_tup, seq_scan + idx_scan, 100 * idx_scan / (seq_scan + idx_scan) FROM pg_stat_user_tables ORDER BY n_live_tup DESC IndexHitRate Target: >= 95% on large, active tables
@LukasFittl Whatarethecausesofproblems?
@LukasFittl Workload BadDataModel/Indices WrongPostgresConfig InsufficientHardware
@LukasFittl Workload BadDataModel/Indices WrongPostgresConfig InsufficientHardware
@LukasFittl Did the workloadchange?
@LukasFittl New code deployed New customers App config change
@LukasFittl Ability to Drill Down From “HighCPUUtilization” To Specific Set of Queries
@LukasFittl Workload BadDataModel/Indices WrongPostgresConfig InsufficientHardware
@LukasFittl Did the schema change recently?
@LukasFittl Did the queryplan change recently?
@LukasFittl Workload BadDataModel/Indices WrongPostgresConfig InsufficientHardware
@LukasFittl AutomaticPostgresConfig
 Management/Validation
@LukasFittl shared_buffers max_connections (maintenance_)work_mem autovacuum settings checkpointer settings planner settings
@LukasFittl Workload BadDataModel/Indices WrongPostgresConfig InsufficientHardware
@LukasFittl On Cloud Providers, Closely Watch Your I/OMetrics
@LukasFittl
@LukasFittl Drill-Down IntoWorkload Using Query Metrics
@LukasFittl WhichQueriesAreRunning?
@LukasFittl userid | 10 dbid | 1397527 query | SELECT * FROM x WHERE y = $1 calls | 5 total_time | 15.249 rows | 0 shared_blks_hit | 451 shared_blks_read | 41 shared_blks_dirtied | 26 shared_blks_written | 0 local_blks_hit | 0 local_blks_read | 0 local_blks_dirtied | 0 local_blks_written | 0 temp_blks_read | 0 temp_blks_written | 0 blk_read_time | 0 blk_write_time | 0 pg_stat_statements
@LukasFittl Supportedoncloudplatforms
@LukasFittl SELECT * FROM pg_stat_statements LIMIT 1; userid | 10 dbid | 17025 queryid | 1720234670 query | SELECT * FROM x WHERE y = $1 calls | 14 total_time | 0.151 rows | 28 shared_blks_hit | 14 shared_blks_read | 0 shared_blks_dirtied | 0 shared_blks_written | 0 local_blks_hit | 0 local_blks_read | 0 local_blks_dirtied | 0 local_blks_written | 0 temp_blks_read | 0 temp_blks_written | 0 blk_read_time | 0 blk_write_time | 0
@LukasFittl queryid | 1720234670 query | SELECT * FROM x WHERE y = ? calls | 5 total_time | 15.249 Query+No.ofCalls+AvgTime
@LukasFittl shared_blks_hit | 2447215 shared_blks_read | 55335 Avg.SharedBufferHitRate:97% hit_rate = shared_blks_hit / (shared_blks_hit + shared_blks_read)
@LukasFittl blk_read_time | 14.594 blk_write_time | 465.661 Timespentreading/writingtodisk track_io_timing = on
pg_qtop Simple top-like tool that shows pg_stat_statements data https://github.com/lfittl/pg_qtop
AVG | QUERY -------------------------------------------------------------------------------- 10.7ms | SELECT oid, typname, typelem, typdelim, typinput FROM pg_type 3.0ms | SET time zone 'UTC' 0.4ms | SELECT a.attname, format_type(a.atttypid, a.atttypmod), pg_get_expr(d.adbin, d.adrelid), a.attnotnull, a.atttypid, a.atttypmod FROM pg_attribute a LEFT JOIN pg_attrdef d ON a.attrelid = d.adrelid AND a.attnum = d.adnum WHERE a.attrelid = ?::regclass AND a.attnum > ? AND NOT a.attisdropped ORDER BY a.attnum 0.2ms | SELECT pg_stat_statements_reset() 0.1ms | SELECT query, calls, total_time FROM pg_stat_statements 0.1ms | SELECT attr.attname FROM pg_attribute attr INNER JOIN pg_constraint cons ON attr.attrelid = cons.conrelid AND attr.attnum = cons.conkey[?] WHERE cons.contype = ? AND cons.conrelid = ?: :regclass 0.0ms | SELECT COUNT(*) FROM pg_class c LEFT JOIN pg_namespace n ON n.oid = c.relnamespace WHERE c.relkind in (?,?) AND c.relname = ? AND n.nspname = ANY (current_schemas(?)) 0.0ms | SELECT * FROM posts JOIN users ON (posts.author_id = users.id) WHERE users.login = ?; 0.0ms | SET client_min_messages TO 'panic' 0.0ms | set client_encoding to 'UTF8' 0.0ms | SHOW client_min_messages 0.0ms | SELECT * FROM ad_reels WHERE id = ?; 0.0ms | SELECT * FROM posts WHERE guid = ?; 0.0ms | SELECT ? 0.0ms | SET client_min_messages TO 'warning' 0.0ms | SET standard_conforming_strings = on 0.0ms | SELECT "posts".* FROM "posts" ORDER BY "posts"."id" DESC LIMIT ? 0.0ms | SHOW TIME ZONE pg_qtop -d testdb
AVG | QUERY -------------------------------------------------------------------------------- 0.0ms | SELECT * FROM posts JOIN users ON (posts.author_id = users.id) WHERE users.login = ?; 0.0ms | SELECT * FROM posts WHERE guid = ?; 0.0ms | SELECT "posts".* FROM "posts" ORDER BY "posts"."id" DESC LIMIT ? pg_qtop -d testdb -t posts
AVG | CALLS | HIT RATE | QUERY -------------------------------------------------------------------------------- 0.1ms | 1 | 100.0 | SELECT * FROM users; 0.1ms | 1 | - | SELECT * FROM databases; 0.0ms | 1 | - | SELECT * FROM invoices; 0.0ms | 1 | - | SELECT * FROM query_snapshots; pg_qtop -d testdb -s select
pganalyze.com
@LukasFittl QueryPlans
@LukasFittl
@LukasFittl log_min_duration_statement gives you queries withparameters
@LukasFittl
@LukasFittl
@LukasFittl auto_explain logs the query plan
 for specific slow queries
@LukasFittl
@LukasFittl Scaling # of Servers Scaling # of Applications Scaling # of Developers
@LukasFittl Scaling # of Servers Scaling # of Applications Scaling # of Developers
@LukasFittl Read Replicas: Monitor ReplicaLag
 & CPUUtilization
@LukasFittl Sharding: Monitor Everything*
 
 * Each Data/Worker Node Should
 Be Treated Like Single-Node Postgres
@LukasFittl Scaling # of Servers Scaling # of Applications Scaling # of Developers
@LukasFittl Monitoring Data: application_name
@LukasFittl pg_stat_activity
@LukasFittl LogFiles Add %a to your log_line_prefix
@LukasFittl Annotate Queries With Their QueryOrigin
@LukasFittl
@LukasFittl
@LukasFittl
@LukasFittl github.com/basecamp/marginalia Automatic QueryAnnotationsForRubyonRails
@LukasFittl Make sure all apps are
 monitored the same way through Checklists
@LukasFittl Scaling # of Servers Scaling # of Applications Scaling # of Developers
@LukasFittl Make Tooling Accessible To AppDevelopers,NotJustDBAs
@LukasFittl It Only Takes 1Developer To
 
 …IntroduceASlowQueryWithoutAnIndex …WriteaMigrationThatTakesALongExclusiveLock …DropATable
@LukasFittl Code Review Needs To Include SchemaChangeReview &
 MissingIndexReview
@LukasFittl SETstatement_timeout=30s When Your (Web) App Connects To Protect Against Unexpected Slow Queries
@LukasFittl log_statement=ddl
@LukasFittl Alerting
@LukasFittl LevelsofNotification: - Page (Wake Up On-Call) - Instant Email Notification - Daily/Weekly Report Email
@LukasFittl NobodyWantsToBePaged NobodyReadsEmails
@LukasFittl Tiedatabasealerts backtobusinessimpact
@LukasFittl Onlypagewhen thedatabaseisactuallydown (orabouttobedown)
@LukasFittl FocusonQueryPerformance Distinguish user-visible slowness from secondary systems
 (e.g. background workers)
@LukasFittl Establish Weekly/Monthly DatabaseReviewPractice
@LukasFittl TopQueriesByTotalRuntime
@LukasFittl UnusedIndices
@LukasFittl UnusualLogEvents
@LukasFittl 3Take-Aways 1.CollectHistoricMetrics 2.FocusonDrill-DownToQueryLevel 3.AnnotateYourQueriesWith TheirOrigin
@LukasFittl Monitor YourPostgres: pganalyze.com
 
 Shard Your Postgres: citusdata.com Thanks!
@LukasFittl
@LukasFittl 🐜 Lock Contention
@LukasFittl LongHeldLocksinTransactions Rails Counter Cache & Timestamps BEGIN SELECT 1 AS one FROM "post_votes" WHERE (…) LIMIT 1 SELECT "posts".* FROM "posts" WHERE "posts"."id" = $1 LIMIT 1 INSERT INTO "notifications" (…) VALUES (…) RETURNING "id" SELECT "users".* FROM "users" WHERE "users"."id" = $1 LIMIT 1 UPDATE "users" SET "updated_at" = ? WHERE "users"."id" = ? INSERT INTO "post_votes" (…) VALUES (…) RETURNING "id" UPDATE "posts" SET "votes" = COALESCE("votes", 0) + 1 WHERE "posts"."id" = ? UPDATE "posts" SET "credible_post_votes_count" = … WHERE "posts"."id" = ? UPDATE "users" SET "updated_at" = ? WHERE "users"."id" = ? UPDATE "posts" SET "updated_at" = ? WHERE "posts"."id" = ? COMMIT
@LukasFittl log_lock_waits=on deadlock_timeout = 1000 ms process 20679 still waiting for ExclusiveLock on tuple (566,1) of relation 16421 after 1000.115 ms process 20678 still waiting for ExclusiveLock on tuple (566,1) of relation 16421 after 1000.126 ms process 15533 still waiting for ExclusiveLock on tuple (566,1) of relation 16421 1000.129 ms process 20663 still waiting for ExclusiveLock on tuple (566,1) of relation 16421 1000.100 ms process 15537 still waiting for ExclusiveLock on tuple (566,1) of relation 16421 1000.130 ms process 15536 still waiting for ExclusiveLock on tuple (566,1) of relation 16421 1000.222 ms process 20734 still waiting for ExclusiveLock on tuple (566,1) of relation 16421 1000.130 ms process 15538 still waiting for ExclusiveLock on tuple (566,1) of relation 16421 1000.136 ms process 15758 still waiting for ShareLock on transaction 250175899 after 1000.073 ms
@LukasFittl pg_stat_activity lock information https://github.com/postgrespro/pg_wait_sampling
@LukasFittl IdleTransactions
@LukasFittl RED: Long Queries Pid Duration Query ----- --------------- ----------------------------------------------------------------------------------------- 31868 00:10:52.165883 UPDATE "posts" SET "link_visits" = COALESCE("link_visits", 0) + 1 WHERE "posts"."id" = $1 28835 01:06:16.505554 UPDATE "posts" SET "link_visits" = COALESCE("link_visits", 0) + 1 WHERE "posts"."id" = $1 28836 01:06:16.486394 UPDATE "posts" SET "link_visits" = COALESCE("link_visits", 0) + 1 WHERE "posts"."id" = $1 28837 01:06:16.41853 UPDATE "posts" SET "link_visits" = COALESCE("link_visits", 0) + 1 WHERE "posts"."id" = $1 28838 01:06:16.380474 UPDATE "posts" SET "link_visits" = COALESCE("link_visits", 0) + 1 WHERE "posts"."id" = $1 31869 00:10:26.068636 UPDATE "posts" SET "link_visits" = COALESCE("link_visits", 0) + 1 WHERE "posts"."id" = $1 31870 00:08:47.790245 UPDATE "posts" SET "link_visits" = COALESCE("link_visits", 0) + 1 WHERE "posts"."id" = $1 23781 01:12:14.603475 UPDATE "posts" SET "link_visits" = COALESCE("link_visits", 0) + 1 WHERE "posts"."id" = $1 31862 00:10:47.98641 UPDATE "posts" SET "link_visits" = COALESCE("link_visits", 0) + 1 WHERE "posts"."id" = $1 31863 00:11:05.921372 UPDATE "posts" SET "link_visits" = COALESCE("link_visits", 0) + 1 WHERE "posts"."id" = $1 27648 01:17:38.773909 UPDATE "posts" SET "link_visits" = COALESCE("link_visits", 0) + 1 WHERE "posts"."id" = $1 27098 01:28:51.216489 UPDATE "posts" SET "link_visits" = COALESCE("link_visits", 0) + 1 WHERE "posts"."id" = $1 27106 01:27:18.455351 UPDATE "posts" SET "link_visits" = COALESCE("link_visits", 0) + 1 WHERE "posts"."id" = $1 31880 00:10:52.877779 UPDATE "posts" SET "link_visits" = COALESCE("link_visits", 0) + 1 WHERE "posts"."id" = $1 31881 00:10:52.168877 UPDATE "posts" SET "link_visits" = COALESCE("link_visits", 0) + 1 WHERE "posts"."id" = $1 RED: Idle in Transaction 23729 01:37:12.566497 UPDATE "posts" SET "link_unique_visits" = COALESCE("link_unique_visits", 0) + 1 WHERE "posts"."id" = $1 RED: Blocking Queries 23760 UPDATE "posts" SET "link_unique_visits" = COALESCE("link_unique_visits", 0) + 1 WHERE "posts"."id" = $1 01:37:12.646868 23729 UPDATE "posts" SET "link_visits" = COALESCE("link_visits", 0) + 1 WHERE "posts"."id" = $1 01:31:59.088208 Dead Background Worker caused silent queue back-up for 1hr+

PGConf APAC 2018 - Monitoring PostgreSQL at Scale