1.189 Database Observability & Profiling#
Survey of database observability and profiling tools: pg_stat_statements and pgBadger (PostgreSQL), pt-query-digest and PMM (MySQL/Percona), EXPLAIN ANALYZE for query plan inspection, pghero, and pganalyze. Covers query statistics, slow log analysis, real-time monitoring, and query plan debugging.
Domain Explainer: Database Observability & Profiling#
The Problem: The Database Is a Black Box#
Imagine a restaurant kitchen at peak dinner service. Orders pour in, dishes go out, and somewhere in the middle, things are getting slow. The front-of-house manager can see that tables are waiting longer than usual. But without going into the kitchen, they cannot tell whether the delay is coming from a single overwhelmed station, from one dish that takes too long to plate, from a delivery that blocked the walk-in cooler, or from a slow prep step that cascades across everything else.
A database under load is exactly this situation. Thousands of queries arrive per second. Applications wait for responses. Users experience slowness. But from the outside — from the application’s perspective — the database is a black box. You can measure that it is slow. You cannot easily tell what inside it is causing the slowness.
Database observability is the discipline of opening that black box: instrumenting, logging, and analyzing what the database is doing internally so that you can diagnose problems, understand costs, and make informed decisions about optimization.
Why This Is Hard#
Several properties of databases make observability genuinely difficult:
Volume: A busy web application database might process 50,000 queries per second. You cannot meaningfully log every query in full detail — the logging overhead alone would slow the database, and the resulting data would be unmanageably large. Observability tools must find ways to summarize, sample, and aggregate.
Heterogeneity: Not all queries are equal. A simple primary key lookup that runs in 0.1ms is very different from a multi-table JOIN that runs in 500ms, which is very different from a reporting query that runs in 45 seconds. A useful observability tool must help you find the queries that matter — the ones consuming the most cumulative time, the ones that run frequently with moderate cost, and the ones that occasionally spike catastrophically.
Context dependency: The same SQL query can perform very differently depending on the parameter values passed to it. WHERE status = 'pending' might return 5 rows or 5 million rows depending on your data. Observability tools that aggregate all executions of the same query pattern together can hide this: a query that is usually fast may be catastrophic when called with certain parameters, and the average washes that out.
The optimizer is invisible: Modern databases (PostgreSQL, MySQL, Oracle, SQL Server) use a query optimizer — a component that takes your SQL and produces an execution plan. The plan describes how the database will physically access data: which indexes to use, in what order to join tables, whether to sort in memory or on disk. The optimizer’s choices are usually good. But sometimes, due to stale statistics, unusual data distributions, or limitations in the optimizer’s model, it makes a poor choice. Without tools to inspect the execution plan, you cannot know whether you have a bad plan or just an intrinsically expensive query.
Time horizons: Performance problems operate on different time scales. A query that degrades suddenly is likely a plan change (the optimizer switched to a worse strategy) or a data volume change (the table grew past a threshold). A query that degrades gradually over weeks is likely accumulating data volume or index bloat. Distinguishing these requires historical data — not just a snapshot of what is slow now.
The Four Solution Categories#
Database observability tools fall into four broad categories, each addressing a different part of the problem:
1. Query Statistics: Aggregated Over Time#
The most fundamental form of database observability. The database itself accumulates counters about every query it processes: how many times each query was called, how much total time was spent on it, how many rows were processed, how much I/O it generated.
These statistics answer the most important diagnostic question: across all the work this database is doing, what is actually expensive? Not “this specific query I am looking at right now” but “over the last hour (or day, or week), where did the database actually spend its time?”
This category of tool is non-disruptive — the database collects these statistics as a side effect of normal operation. There are no additional log files to manage, no separate processes to run, no queries to manually monitor. The statistics are just there, accumulated, waiting to be queried.
The limitation is aggregation: by pooling all executions of the same query pattern together, you lose per-execution detail. You learn that the SELECT * FROM orders WHERE customer_id = ? pattern consumed 2 hours of total execution time today across 400,000 calls. You do not learn that 99% of those calls were fast and 1% were catastrophically slow.
2. Log-Based Analysis: Post-Hoc Batch Processing#
Databases can be configured to write slow queries to a log file — capturing queries that exceed a configurable time threshold. A query that takes 2 seconds appears in the log with its full text (including the actual parameter values), its execution time, and other diagnostic information.
Log-based analysis tools process these log files after the fact. They fingerprint and group queries, aggregate statistics, and produce summary reports identifying the top offenders.
Log-based analysis has several advantages over statistics-only approaches. First, it captures the actual parameter values — you can see that WHERE status = 'pending' triggered the slow execution, while WHERE status = 'active' does not. Second, it captures events that statistics alone do not: lock wait events, checkpoint activity, connection patterns, temporary file creation. Third, it provides a concrete batch artifact — a report you can share, archive, and compare week-over-week.
The limitation is that slow query logging is not free. Every slow query logged has disk I/O overhead. Very aggressive thresholds (logging everything) can themselves degrade performance. And the log only captures queries that exceeded the threshold — queries that are individually fast but called millions of times and collectively expensive may not appear.
3. Real-Time Monitoring: Live Dashboards and Alerting#
Statistics and log analysis are fundamentally retrospective. Real-time monitoring tools connect to the database continuously, collect metrics on a short polling interval (every few seconds to every minute), store them in a time-series database, and display them as live dashboards with alerting.
This category answers different questions: “Is the database getting slower right now?” “Has query volume spiked in the last five minutes?” “Is replication lag growing?” “Are we approaching a connection limit?”
Real-time monitoring is essential for operational alerting — catching problems as they develop rather than discovering them in Monday’s batch report. It is also useful for correlating database behavior with application events: deploying a new version, running a data migration, or triggering a batch job.
The limitation is granularity. Real-time monitoring typically operates on aggregated metrics (queries per second, average latency, error rate) rather than per-query-fingerprint detail. You can see that the database slowed down at 3:47 PM; you need a different tool to know which query caused it.
4. Query Plan Inspection: Why Is This Query Slow?#
The first three categories tell you what is slow. Query plan inspection tells you why.
Every SQL query goes through an optimizer before execution. The optimizer produces an execution plan — a tree of operations describing how the database will retrieve and process data. The plan determines whether an index is used, how tables are joined, whether data is sorted in memory or on disk. A query with a bad plan can be orders of magnitude slower than the same query with a good plan.
Query plan inspection commands (EXPLAIN and EXPLAIN ANALYZE in SQL) show you the optimizer’s chosen plan. They reveal:
- Whether a full table scan is happening when an index scan was expected
- How many rows the optimizer estimated would match a filter vs. how many actually matched
- Whether a join operation is spilling to disk because data doesn’t fit in memory
- Which step in a complex multi-table query is actually the expensive one
Plan inspection is not automated or continuous — it is an interactive diagnostic tool. You use it after identifying a specific slow query via statistics or log analysis, to understand the root cause. It is the step where you move from “this query is slow” to “this query is slow because the optimizer is doing a full table scan on a 10-million-row table when an index on the created_at column would serve it far better.”
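As a sketch of what that diagnostic step looks like in practice (the events table, its size, and the annotated output lines are hypothetical, shown only to illustrate the before/after contrast):

```sql
-- Before: no index on created_at, so the only option is a full scan.
EXPLAIN ANALYZE
SELECT * FROM events WHERE created_at > now() - interval '1 day';
-- Seq Scan on events  (actual time=0.02..4100.55 rows=12000 ...)
--   Filter: (created_at > ...)
--   Rows Removed by Filter: 9988000   <- scanned ~10M rows to return 12k

CREATE INDEX events_created_at_idx ON events (created_at);

-- After: the planner can now choose an Index Scan that touches
-- only the matching rows instead of the whole table.
```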
When to Invest in Each Category#
These four categories are not mutually exclusive. Mature database operations use all of them, but the appropriate starting point depends on where you are:
Starting from zero: Enable query statistics first. It costs almost nothing and gives you immediate visibility into what is expensive. Configure slow query logging at the same time — set a threshold and let it run, so you have historical data when you need it.
First incident: Your first serious performance incident will likely require log-based analysis (to find the offending query with its actual parameters) and plan inspection (to understand why it is slow). Statistics alone will have told you something is wrong; the other categories tell you what and why.
Growing team: As more developers write queries and the schema evolves, real-time monitoring becomes important for catching regressions before they affect users. An alert when average query latency exceeds a threshold catches problems faster than a weekly report.
Scale: At high scale, all four categories become essential infrastructure, not optional tools. Statistics drive dashboards. Log analysis drives weekly reviews. Real-time monitoring drives alerting and on-call response. Plan inspection is a skill every developer on the team should be comfortable with.
The Core Insight#
Database observability is not about having the right tool. It is about having visibility at multiple levels simultaneously:
- Aggregate visibility (what is most expensive overall?)
- Historical visibility (is this getting better or worse over time?)
- Real-time visibility (is something going wrong right now?)
- Detailed visibility (why is this specific query slow?)
Each level requires different instrumentation. The tools in this space are specializations for each level, and the mature approach is to deploy multiple tools that together provide complete coverage — not to pick one and hope it answers every question.
The database is no longer a black box when you have query statistics showing you the cost distribution, slow query logs giving you specific examples, dashboards showing you real-time trends, and the ability to run EXPLAIN ANALYZE when you need to diagnose a specific query. Together, these transform database performance from guesswork into engineering.
S1: Rapid Discovery — Database Observability & Profiling#
Quick Answer#
For PostgreSQL: enable pg_stat_statements first (always), run pgBadger on your slow query logs for batch analysis, and use EXPLAIN ANALYZE interactively when you need to diagnose a specific query. For MySQL and MariaDB: pt-query-digest from Percona Toolkit is the standard slow-log analyzer, and Percona Monitoring and Management (PMM) gives you a full open source monitoring stack. If you want a polished web dashboard on top of PostgreSQL without building dashboards yourself, pghero wraps pg_stat_statements with a UI. If you want managed PostgreSQL monitoring with index advice, pganalyze offers a SaaS collector. For catching bad query patterns before they reach production, sqlfluff lints SQL at write time.
The single most impactful thing you can do for a PostgreSQL database is load pg_stat_statements. Everything else builds on it.
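Enabling it is a two-step process: load the library at server start, then create the extension in each database you query it from.

```sql
-- In postgresql.conf (requires a server restart):
--   shared_preload_libraries = 'pg_stat_statements'

-- Then, once per database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Sanity check: the view should return rows once traffic flows.
SELECT count(*) FROM pg_stat_statements;
```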
The Landscape in Brief#
Database observability sits at the intersection of three concerns: what queries are running and how expensive they are, why a specific query is slow, and how database health trends over time. Different tools address different parts of this problem.
There is a rough division between log-based tools (pgBadger, pt-query-digest) and statistics-based tools (pg_stat_statements, PMM). Log-based tools work after the fact — you collect a slow query log over some period and then analyze the batch. Statistics-based tools accumulate counters inside the database process itself and can be queried at any time. EXPLAIN ANALYZE is in a third category: it actually executes the query and measures real plan behavior, making it the most powerful single-query diagnostic available.
pg_stat_statements#
Stars: Built into PostgreSQL — no GitHub stars, but effectively universal adoption in the PostgreSQL ecosystem.
What it is: A PostgreSQL extension that tracks execution statistics for every query the server processes. Once enabled, it maintains a shared hash table of normalized query fingerprints with counters for calls, total execution time, mean execution time, row counts, buffer hits and reads, and more.
Community consensus: Universally recommended. Every PostgreSQL DBA guide, every production setup guide, and every managed PostgreSQL service (AWS RDS, Google Cloud SQL, Azure Database) exposes pg_stat_statements data. It is the foundation on which pgBadger reports, pghero dashboards, pganalyze analysis, and Prometheus exporters all depend. If you run PostgreSQL and do not have pg_stat_statements enabled, you are flying blind.
Trade-offs: It has some overhead, typically measured at 1-5% of query throughput on write-heavy workloads. The pg_stat_statements.max setting controls how many distinct query fingerprints to track (default 5000); under very high query diversity this can evict older entries. Statistics accumulate since the last reset, so the view shows historical totals, not real-time rates. You need to query it periodically and compute deltas yourself (or use a tool that does this for you) to see trends.
pgBadger#
Stars: ~3,700 on GitHub. Actively maintained by Dalibo (a French PostgreSQL consulting firm).
What it is: A Perl script that parses PostgreSQL log files and produces rich HTML (or JSON) reports. It reads the standard PostgreSQL log format, extracts slow queries (controlled by log_min_duration_statement), and produces summaries of the top queries by total time, average time, and frequency. It also analyzes lock waits, checkpoints, connections, and autovacuum activity.
Community consensus: The de facto standard for PostgreSQL log analysis. When someone says “run your logs through a log analyzer,” they mean pgBadger. It requires very little setup beyond enabling the right PostgreSQL logging parameters.
Trade-offs: Purely batch/historical — you run it against a log file, not in real time. Log file sizes can be large; on a busy server the log can grow quickly if log_min_duration_statement is set too low. The HTML report is excellent but not interactive beyond what a browser provides. Requires Perl.
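A typical invocation, with illustrative paths, looks like:

```bash
# One-shot HTML report from a day's logs
pgbadger -o /tmp/report.html /var/log/postgresql/postgresql-*.log

# -j parallelizes parsing across cores, useful on multi-GB logs
pgbadger -j 4 -o /tmp/report.html /var/log/postgresql/postgresql-*.log
```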
pt-query-digest (Percona Toolkit)#
Stars: Percona Toolkit has ~2,400 stars. pt-query-digest is its flagship tool.
What it is: A Perl script that analyzes MySQL and MariaDB slow query logs. It fingerprints queries (normalizes literals to placeholders), groups them, and reports on the top offenders by total execution time, average time, and query count. It can also read from PERFORMANCE_SCHEMA, tcpdump captures, and binary logs.
Community consensus: The standard MySQL slow log analysis tool. MySQL DBAs reach for this the way PostgreSQL DBAs reach for pgBadger. Percona maintains it actively.
Trade-offs: MySQL/MariaDB only (does not work with PostgreSQL). Requires configuring slow_query_log and long_query_time in MySQL. Like pgBadger, it is batch/historical. The output is text-based, not HTML by default (though HTML output is available with a flag).
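A typical run, with illustrative paths, looks like:

```bash
# Summarize the slow log; the report goes to stdout
pt-query-digest /var/lib/mysql/slow.log > digest.txt

# Restrict analysis to a time window
pt-query-digest --since '2024-01-15 00:00:00' --until '2024-01-16 00:00:00' \
    /var/lib/mysql/slow.log
```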
EXPLAIN / EXPLAIN ANALYZE#
Stars: N/A — built into every major relational database.
What it is: The SQL command for inspecting query execution plans. EXPLAIN shows the plan the query planner would choose. EXPLAIN ANALYZE actually runs the query and shows both the estimated plan and real measured timings, row counts, and buffer usage (with BUFFERS option). PostgreSQL’s EXPLAIN output is the most detailed of any major database.
Community consensus: Essential, non-negotiable skill for any database developer or DBA. When you find a slow query via pg_stat_statements or pt-query-digest, EXPLAIN ANALYZE is how you diagnose why it is slow.
Trade-offs: EXPLAIN ANALYZE actually executes the query — for slow queries this is the point, but for expensive writes you may want to wrap in a transaction and roll back. The plan output is not trivial to read; tools like explain.dalibo.com and pgMustard exist to visualize it. Statistics divergence (where estimated rows differ greatly from actual rows) is the most common sign of a problem.
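The transaction-wrapping pattern mentioned above looks like this (table and predicate hypothetical):

```sql
BEGIN;
-- EXPLAIN ANALYZE really executes the DELETE and measures it...
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM orders WHERE created_at < now() - interval '5 years';
-- ...but ROLLBACK discards its effects. The plan and timings
-- were already printed before the rollback.
ROLLBACK;
```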
auto_explain#
Stars: Built into PostgreSQL contrib — no GitHub stars.
What it is: A PostgreSQL extension that automatically logs the EXPLAIN (ANALYZE) output of queries that exceed a configurable time threshold. It runs in the background, so you get plan information in your PostgreSQL logs without having to run EXPLAIN manually on every slow query.
Community consensus: Recommended for production environments where you want plan-level information on slow queries without manual intervention. Often overlooked in favor of pg_stat_statements, but it provides data that pg_stat_statements does not: the actual execution plan.
Trade-offs: Adds overhead proportional to the queries it logs. If log_min_duration is set very low, it can generate enormous log volume. Nested statement logging (log_nested_statements) requires care to avoid logging utility queries inside functions.
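A representative configuration (the threshold values are illustrative, not recommendations):

```ini
# postgresql.conf
shared_preload_libraries = 'pg_stat_statements,auto_explain'
auto_explain.log_min_duration = '500ms'   # log plans for queries over 500ms; -1 disables
auto_explain.log_analyze = on             # include actual row counts and timings (adds overhead)
auto_explain.log_buffers = on             # include buffer usage in the logged plan
auto_explain.log_nested_statements = off  # skip statements inside functions
```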
pghero#
Stars: ~8,600 on GitHub (Instacart). Most starred PostgreSQL performance tool.
What it is: A Ruby gem (and Docker image) that provides a web dashboard for PostgreSQL performance. It wraps pg_stat_statements and adds opinions — it surfaces the top slow queries, identifies missing indexes, shows connection counts, bloat estimates, and replication lag. It is opinionated and ready to use out of the box.
Community consensus: Very popular for teams that want immediate visibility without building dashboards. The Ruby dependency and web server requirement are sometimes a barrier, but the Docker image makes deployment simpler. It is read-only and safe to run against production.
Trade-offs: Requires pg_stat_statements. Ruby runtime or Docker. No alerting, no time-series storage — it reflects the current state of pg_stat_statements at query time. Not a replacement for proper monitoring (Prometheus + Grafana).
pganalyze#
Stars: The open source collector has ~900 stars. The SaaS product is the main offering.
What it is: A SaaS monitoring service for PostgreSQL with an open source collector agent. The collector ships statistics from pg_stat_statements, schema information, and EXPLAIN plans (via auto_explain integration) to the pganalyze cloud. The SaaS provides index recommendations, EXPLAIN plan visualization, vacuum analysis, and alerting.
Community consensus: Well-regarded in the PostgreSQL community. Particularly valued for its index advisor and for its EXPLAIN plan visualization (clearer than plain text). The open source collector means you are not locked in for data collection.
Trade-offs: The useful features are SaaS and paid. Plans are priced per server. Not suitable for teams with strict data residency requirements (query samples are sent to pganalyze servers). Not self-hostable (collector is open source; the analysis platform is not).
Percona Monitoring and Management (PMM)#
Stars: ~1,000 on GitHub. Percona-maintained, enterprise-grade.
What it is: A full open source monitoring stack for MySQL, PostgreSQL, MongoDB, and ProxySQL. It includes a server component (deployed via Docker or as a VM), an agent (deployed alongside each database), Prometheus for metric storage, Grafana for dashboards, and a Query Analytics module that surfaces slow queries. It is self-hosted.
Community consensus: The standard choice for teams that want Datadog-style database monitoring without the SaaS cost. Widely adopted in the MySQL ecosystem. PostgreSQL support is solid but the MySQL/Percona Server support is deeper.
Trade-offs: Significant operational overhead — you run a PMM server plus agents on every monitored host. Initial setup is non-trivial. Very powerful once running. Better suited to teams with dedicated operations capacity.
sqlfluff#
Stars: ~8,000 on GitHub. Active community.
What it is: A SQL linter and formatter that checks SQL code against a configurable rule set. It is not a runtime monitoring tool — it runs on SQL files at write time or in CI. It can catch patterns like SELECT *, unqualified column references, and style issues.
Community consensus: The default SQL linter choice. Most useful for teams that write a lot of SQL in files or dbt models. It does not detect slow queries at runtime; it detects potentially problematic patterns in source code.
Trade-offs: Cannot detect runtime performance problems — a perfectly linted query can still produce a sequential scan. Most valuable in teams using dbt or writing SQL migrations in version-controlled files.
Summary: Decision by Scenario#
| Scenario | Tool |
|---|---|
| PostgreSQL, first thing to do | pg_stat_statements |
| PostgreSQL, weekly slow query report | pgBadger |
| PostgreSQL, diagnose one slow query | EXPLAIN ANALYZE |
| PostgreSQL, automatic plan logging | auto_explain |
| PostgreSQL, quick web dashboard | pghero |
| PostgreSQL, managed SaaS monitoring | pganalyze |
| MySQL/MariaDB, slow query analysis | pt-query-digest |
| MySQL + PostgreSQL + MongoDB monitoring | PMM |
| SQL quality in CI/CD | sqlfluff |
The PostgreSQL and MySQL ecosystems evolved separately, and their tooling reflects this. PostgreSQL has richer built-in statistics infrastructure (pg_stat_statements, auto_explain) while the MySQL ecosystem historically leaned harder on log-based tools and third-party monitoring stacks. PMM bridges both.
S2: Comprehensive Discovery — Database Observability & Profiling#
pg_stat_statements: Deep Technical Analysis#
How It Works Internally#
pg_stat_statements is a PostgreSQL shared library loaded via shared_preload_libraries. Once loaded, it installs hooks into the executor and planner so that every query that completes is recorded. It maintains a fixed-size hash table in shared memory. The key for each entry is a hash of the normalized query text — meaning all literal values are replaced with positional placeholders ($1, $2, etc.) before hashing. This allows SELECT * FROM orders WHERE id = 1 and SELECT * FROM orders WHERE id = 42 to be tracked as the same query fingerprint.
The extension adds the pg_stat_statements view which exposes this hash table. Each row represents one distinct normalized query fingerprint.
Key Columns and Their Meaning#
calls: Total number of times this query has been executed since the last statistics reset. Divide other cumulative values by this to get per-execution averages.
total_exec_time (PostgreSQL 13+, formerly total_time): Cumulative wall clock time spent executing this query across all calls, in milliseconds. This is the primary metric for finding the queries with the highest aggregate cost: a query called 10,000 times at 1ms each accumulates 10,000ms here, outranking a query called once that took 5,000ms. Aggregate time surfaces the frequently called cheap queries that per-execution metrics miss.
mean_exec_time: total_exec_time / calls. Useful for finding queries that are individually expensive.
stddev_exec_time: Standard deviation of execution time. A high standard deviation on a normally fast query often indicates plan instability — the planner sometimes picks a bad plan.
rows: Total rows returned or affected across all calls.
shared_blks_hit: Buffer pool cache hits — pages found in PostgreSQL’s shared_buffers. High ratio of hits to reads is good.
shared_blks_read: Pages read from disk (or OS page cache). High values here indicate I/O pressure. The ratio shared_blks_hit / (shared_blks_hit + shared_blks_read) is a rough cache hit rate per query.
shared_blks_dirtied and shared_blks_written: Blocks modified and flushed to disk, relevant for write-heavy queries.
local_blks_* and temp_blks_*: Temporary table and work_mem spill statistics. temp_blks_read and temp_blks_written spike when a sort or hash join exceeds work_mem and spills to disk — a common performance problem.
wal_bytes (PostgreSQL 13+): WAL (write-ahead log) bytes generated by this query. Useful for finding queries with surprising write amplification.
queryid: A stable numeric hash of the normalized query. The same queryid appears consistently across resets within a PostgreSQL version, enabling correlation with external systems.
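Two diagnostic queries built on these columns (the thresholds are illustrative): the first finds work_mem spills, the second flags plan instability via stddev.

```sql
-- Queries spilling sorts or hashes to disk
SELECT left(query, 60) AS query_snippet,
       calls,
       temp_blks_written
FROM pg_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 10;

-- Queries whose runtime varies wildly: candidates for plan instability
SELECT left(query, 60) AS query_snippet,
       round(mean_exec_time::numeric, 2)   AS mean_ms,
       round(stddev_exec_time::numeric, 2) AS stddev_ms
FROM pg_stat_statements
WHERE calls > 100
  AND stddev_exec_time > 2 * mean_exec_time
ORDER BY stddev_exec_time DESC
LIMIT 10;
```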
Query Normalization#
The normalization process is important to understand. pg_stat_statements replaces all constant literals with $N placeholders before computing the hash. This means:
- String literals become placeholders ($1, $2, …)
- Numeric literals become placeholders
- Array literals are normalized
- Multiple constants in sequence are numbered sequentially
This normalization is intentional: it lets you see the performance of a query pattern regardless of the specific parameter values. However, it means you cannot distinguish between WHERE status = 'active' and WHERE status = 'deleted' — they hash to the same entry even though they may have very different performance due to data distribution.
In PostgreSQL 14+, the pg_stat_statements view includes a toplevel boolean to distinguish queries called at the top level from those called within a PL/pgSQL function or procedure.
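To see normalization in action (the orders table is hypothetical):

```sql
-- Two executions with different literals...
SELECT * FROM orders WHERE id = 1;
SELECT * FROM orders WHERE id = 42;

-- ...are tracked as one fingerprint with calls = 2:
SELECT queryid, calls, query
FROM pg_stat_statements
WHERE query LIKE '%orders WHERE id = $1%';
```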
Essential Queries for Finding Top Offenders#
```sql
-- Top 10 queries by total execution time (most impactful overall)
SELECT
    round(total_exec_time::numeric, 2) AS total_ms,
    calls,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS pct,
    left(query, 80) AS query_snippet
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

Resetting Statistics#
SELECT pg_stat_statements_reset(); resets all accumulated statistics. This is useful after major schema changes (new indexes, query rewrites) to establish a fresh baseline. On busy servers, a common practice is to reset at a fixed interval (e.g., every 24 hours) and record snapshots, giving you time-windowed statistics rather than ever-growing totals.
The pg_stat_statements.max configuration parameter (default 5000) controls how many distinct fingerprints are tracked. When the table is full and a new fingerprint appears, an existing entry is evicted (LRU-ish policy). On systems with enormous query diversity (ORMs generating highly varied SQL, many ad-hoc queries), this limit can be reached, causing statistics loss. Raising it to 10,000 or 20,000 is reasonable on busy systems with sufficient shared memory.
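One way to get time-windowed statistics is a snapshot table populated on a schedule. A minimal sketch (the pgss_snapshot table is our own invention, and it ignores the userid/dbid dimensions for brevity):

```sql
-- Create an empty snapshot table with matching column types
CREATE TABLE IF NOT EXISTS pgss_snapshot AS
SELECT now() AS snapped_at, queryid, calls, total_exec_time
FROM pg_stat_statements
WHERE false;

-- Run periodically (cron, pg_cron, etc.)
INSERT INTO pgss_snapshot
SELECT now(), queryid, calls, total_exec_time
FROM pg_stat_statements;

-- Execution time per query between the two most recent snapshots
SELECT cur.queryid,
       cur.total_exec_time - prev.total_exec_time AS ms_in_window,
       cur.calls - prev.calls                     AS calls_in_window
FROM pgss_snapshot cur
JOIN pgss_snapshot prev USING (queryid)
WHERE cur.snapped_at  = (SELECT max(snapped_at) FROM pgss_snapshot)
  AND prev.snapped_at = (SELECT max(snapped_at) FROM pgss_snapshot
                         WHERE snapped_at < cur.snapped_at)
ORDER BY ms_in_window DESC;
```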
pgBadger: Log Parsing and Analysis#
Required PostgreSQL Configuration#
pgBadger works by parsing PostgreSQL log files. The log format and content must be configured appropriately:
log_min_duration_statement: The most critical setting. Any query that takes longer than this value (in milliseconds) is logged. Setting it to 0 logs everything; setting it to 1000 logs queries over 1 second. For high-traffic production systems, starting at 500ms or 1000ms is reasonable. Too low a threshold generates massive log volume.
log_line_prefix: Must include timestamp and process ID at minimum. pgBadger expects a specific format. The recommended prefix is '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h ' which includes timestamp, PID, line number, user, database, application, and client host.
log_checkpoints = on: Logs checkpoint information, which pgBadger uses to analyze checkpoint frequency and spread — relevant for I/O burst analysis.
log_connections = on and log_disconnections = on: Enables connection lifecycle logging, used by pgBadger to compute connection statistics and detect connection churn.
log_lock_waits = on: Logs lock wait events, enabling pgBadger’s lock analysis section.
log_temp_files = 0: Logs any temporary file creation (work_mem spills). pgBadger reports on these.
log_autovacuum_min_duration = 0: Logs autovacuum runs, useful for understanding vacuum pressure.
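Collected into one postgresql.conf fragment (the 1-second threshold is illustrative):

```ini
log_min_duration_statement = 1000   # log queries over 1 second
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0                  # log all temp file creation
log_autovacuum_min_duration = 0     # log all autovacuum runs
```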
How pgBadger Parses Logs#
pgBadger reads log files line by line, using regular expressions to match log line prefixes and extract structured fields: timestamp, PID, database, user, duration, and query text. It handles multi-line queries (query text wrapped across lines with the PID as a continuation marker).
It processes the entire log file in a single pass, maintaining in-memory accumulators for:
- Query frequency and total duration per normalized fingerprint
- Hourly and per-minute query rate histograms
- Lock wait events with durations
- Temporary file creation events
- Connection/disconnection timestamps
- Checkpoint events
At the end, it renders these accumulators into the output format. The default HTML output is a single self-contained HTML file with embedded JavaScript for charts.
Output Formats and Content#
The standard invocation produces an HTML report covering:
- Overview: Total queries, unique queries, total connections, total duration
- Slowest queries: Top N individual query executions by duration
- Most time-consuming: Top N query fingerprints by aggregate duration
- Most frequent: Top N fingerprints by call count
- Normalized queries: Aggregated statistics with full normalized query text
- Locks: Queries that waited for locks, with lock duration
- Temporary files: Queries that spilled to disk
- Checkpoints: Timing and frequency
- Connections: Connection rate and duration analysis
- Sessions: Session duration distribution
pgBadger also supports --format json for machine-readable output, useful for integration with other tools or custom dashboards. Incremental processing (--last-parsed) allows running pgBadger daily on rotating logs without reprocessing the entire history.
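Typical invocations for these modes (paths illustrative):

```bash
# Incremental mode: -I keeps per-day reports and last-parsed state under -O
pgbadger -I -O /var/www/pgbadger /var/log/postgresql/postgresql-*.log

# Machine-readable output for custom dashboards
pgbadger -o /tmp/report.json /var/log/postgresql/postgresql-*.log
```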
EXPLAIN ANALYZE: Query Plan Inspection#
Plan Node Types#
PostgreSQL’s query planner decomposes a SQL query into a tree of plan nodes. Understanding the common node types is essential for reading EXPLAIN output:
Seq Scan: Sequential scan of an entire table. Not inherently bad — for small tables or queries returning most of the table, a sequential scan is faster than an index scan. A Seq Scan on a large table with high filter ratio (few rows returned, many rows scanned) is a red flag indicating a missing index.
Index Scan: Traverses an index to find matching rows, then fetches the corresponding heap pages. Efficient for selective queries (few rows returned). For each indexed row, it does a random heap page access — expensive if many rows match.
Index Only Scan: Like Index Scan but satisfied entirely from the index without visiting the heap. Requires that all queried columns are in the index and that the visibility map indicates the heap pages are all-visible. Very efficient.
Bitmap Index Scan + Bitmap Heap Scan: Two-phase approach. First, a Bitmap Index Scan builds a bitmap of matching heap pages. Then, Bitmap Heap Scan reads those pages in sequential order. More efficient than Index Scan when many rows match, because it reads heap pages in order rather than randomly. Multiple Bitmap Index Scans can be combined with BitmapAnd/BitmapOr for multi-condition queries.
Hash Join: Builds a hash table from the smaller relation (the “inner”), then probes it with each row from the larger relation (the “outer”). Efficient for large joins where no useful index exists on the join key; if the hash table exceeds work_mem, it is split into batches that spill to disk. Requires a hash-compatible join condition (equality).
Nested Loop: For each row in the outer relation, probes the inner relation. Efficient when the outer relation is small and the inner relation has an index on the join column. Terrible when both relations are large.
Merge Join: Requires both inputs sorted on the join key. Can be efficient when both sides have sorted indexes or when sorts are amortized.
Sort: Explicit sort operation. Watch for Sort Method: external merge Disk: NNNkB — this indicates a work_mem spill to disk.
Hash: Building phase of a Hash Join. Watch for Batches: N where N > 1 — this means the hash table spilled to disk.
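To make the node types concrete, here is a hedged sketch of a plan containing several of them (the table names, the query, and the output shape are illustrative, not real measurements):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, sum(li.amount)
FROM orders o
JOIN line_items li ON li.order_id = o.id
WHERE o.created_at > now() - interval '1 day'
GROUP BY o.id;

-- Illustrative output shape (values invented):
--   HashAggregate (actual rows=412 loops=1)
--     ->  Hash Join
--           ->  Index Scan using orders_created_at_idx on orders o
--           ->  Hash  Buckets: 1024  Batches: 1
--                 ->  Seq Scan on line_items li
```

Read it bottom-up: the Seq Scan feeds the Hash build, the Index Scan drives the probe side, and the aggregate consumes the join output. `Batches: 1` on the Hash node means the hash table fit in work_mem.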
Cost Units and Estimates#
The cost values in EXPLAIN output (cost=0.00..8.42) are dimensionless units calibrated by seq_page_cost (default 1.0), random_page_cost (default 4.0), cpu_tuple_cost (default 0.01), and cpu_operator_cost (default 0.0025). These are tunable; many SSD-backed workloads benefit from lowering random_page_cost to around 1.1, since random reads on SSDs cost nearly the same as sequential reads.
The first number is the startup cost (cost before first row is returned), the second is the total cost. The planner optimizes for total cost by default, but cursor_tuple_fraction can shift this for CURSOR-based queries.
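As a rough illustration of how these constants combine, here is a simplified sketch of the sequential-scan cost formula. The real planner adds further terms, such as per-operator CPU cost for filter clauses, so treat this as a teaching model rather than PostgreSQL's actual implementation:

```python
# Simplified sketch of PostgreSQL's sequential-scan costing, using the
# default cost constants described above. The real planner also charges
# cpu_operator_cost per operator evaluated in WHERE clauses, among other terms.
SEQ_PAGE_COST = 1.0     # cost to read one page sequentially
CPU_TUPLE_COST = 0.01   # cost to process one tuple

def seq_scan_cost(pages: int, tuples: int) -> float:
    """Total cost estimate: read every page, then process every tuple."""
    return pages * SEQ_PAGE_COST + tuples * CPU_TUPLE_COST

# A 100-page table holding 1,000 rows:
print(seq_scan_cost(100, 1_000))  # 110.0
```

This is why a Seq Scan on a tiny table can legitimately be cheaper than an Index Scan: with few pages and few tuples, the total cost is already minimal.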
Detecting Statistics Divergence#
The most important diagnostic in EXPLAIN ANALYZE is comparing rows=N (planner estimate) to actual rows=M (measured). Significant divergence — an order of magnitude or more — indicates that the planner’s statistics are stale or that the data distribution is unusual.
Common causes:
- Outdated table statistics: Run `ANALYZE table_name` to refresh. autovacuum normally handles this, but large bulk inserts can outpace it.
- Correlated columns: The planner assumes independence between column predicates. For `WHERE city = 'Seattle' AND state = 'WA'`, the planner doesn't know these are perfectly correlated and will underestimate the number of matching rows. Extended statistics (CREATE STATISTICS) can fix this in PostgreSQL 10+.
- Non-uniform distributions: The planner uses histograms but with limited granularity (default 100 buckets). Highly skewed distributions can fool it.
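A minimal sketch of the estimate-vs-actual check, applied to raw EXPLAIN ANALYZE text. The regular expression and the 10x threshold here are illustrative choices for this example, not part of PostgreSQL or any tool:

```python
import re

# Flag plan lines whose planner estimate and measured row count diverge
# by an order of magnitude or more.
LINE_RE = re.compile(r"rows=(\d+) width=\d+\) \(actual time=[^ ]+ rows=(\d+)")

def divergent_nodes(plan_text: str, factor: int = 10):
    """Return (estimated, actual) row-count pairs diverging by >= factor."""
    flagged = []
    for est_s, act_s in LINE_RE.findall(plan_text):
        est, act = max(int(est_s), 1), max(int(act_s), 1)  # clamp to avoid /0
        if max(est, act) / min(est, act) >= factor:
            flagged.append((int(est_s), int(act_s)))
    return flagged

plan = """\
Hash Join  (cost=1.10..90.12 rows=50000 width=8) (actual time=0.08..1.20 rows=12 loops=1)
  ->  Seq Scan on line_items  (cost=0.00..35.50 rows=2550 width=4) (actual time=0.01..0.40 rows=2550 loops=1)"""

print(divergent_nodes(plan))  # [(50000, 12)]
```

The Hash Join node (50,000 estimated vs. 12 actual) is flagged; the Seq Scan, whose estimate was exact, is not.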
The BUFFERS Option#
EXPLAIN (ANALYZE, BUFFERS) adds buffer statistics to each plan node:
- `shared hit=N` — pages found in shared_buffers
- `shared read=N` — pages read from disk
- `shared dirtied=N` — pages modified
- `local hit/read` — temporary table pages
- `temp read/written` — work_mem spill pages
This shows exactly where I/O is happening in the plan. A node with high shared read values is performing disk I/O; checking whether the referenced table/index has appropriate caching is the next diagnostic step.
Timing Overhead#
EXPLAIN ANALYZE times every node. This adds overhead, particularly for queries executing many small operations (e.g., index lookups on OLTP queries). For extremely fast queries (sub-millisecond), EXPLAIN (ANALYZE, TIMING OFF) reduces timing overhead while still measuring actual row counts.
auto_explain: Automatic Plan Capture#
auto_explain is a PostgreSQL extension (in contrib) that hooks into the executor to log EXPLAIN output for slow queries without requiring manual intervention.
auto_explain.log_min_duration: Queries taking longer than this threshold (milliseconds) have their plan logged. Set to -1 to disable, 0 to log everything. Typical production setting: 1000ms to 5000ms.
auto_explain.log_analyze: If true, runs EXPLAIN ANALYZE (measuring actual row counts and timings). More informative but adds overhead to slow queries. Generally worth it.
auto_explain.log_buffers: If true, adds buffer statistics (requires log_analyze = true). Highly recommended — provides disk I/O context.
auto_explain.log_timing: If false, suppresses per-node timing while still collecting row counts. Reduces overhead.
auto_explain.log_nested_statements: If true, logs plans for statements inside PL/pgSQL functions and procedures. Useful for debugging stored procedures but can generate verbose output.
auto_explain.log_format: text (default), xml, json, or yaml. JSON is parseable by downstream tools.
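Putting those settings together, a postgresql.conf fragment might look like this (the values are examples within the ranges discussed above, not recommendations):

```ini
# Load alongside pg_stat_statements; changing this requires a server restart.
shared_preload_libraries = 'pg_stat_statements,auto_explain'

auto_explain.log_min_duration = '2s'   # log plans for queries slower than 2s
auto_explain.log_analyze = on          # actual row counts and timings
auto_explain.log_buffers = on          # requires log_analyze = on
auto_explain.log_format = 'json'       # parseable by downstream tools
```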
The plans appear in the PostgreSQL log file alongside other log messages, identifiable by the duration: prefix and the plan text below it. pgBadger can parse auto_explain output when both are in the same log file.
Percona Toolkit: pt-query-digest Deep Dive#
Fingerprinting Algorithm#
pt-query-digest normalizes queries by applying a series of transformations:
- Strip comments
- Replace all quoted string literals with `?`
- Replace numeric literals with `?`
- Replace lists of values `(1, 2, 3)` with `(?+)` (one or more values)
- Collapse consecutive placeholders in IN lists
- Normalize whitespace
The resulting fingerprint is a compact string that represents the query pattern. An MD5 hash of the fingerprint is used as the query ID (reported as `# Query_id`).
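A toy version of this normalization pipeline can be sketched in a few regular expressions. This is a simplification for illustration only; pt-query-digest's real fingerprinting handles many more cases (hex literals, multi-line values, repeated UNION branches, and so on):

```python
import hashlib
import re

def fingerprint(sql: str) -> str:
    s = re.sub(r"/\*.*?\*/|--[^\n]*", "", sql, flags=re.S)  # strip comments
    s = re.sub(r"'(?:[^'\\]|\\.)*'", "?", s)                # string literals -> ?
    s = re.sub(r"\b\d+(\.\d+)?\b", "?", s)                  # numeric literals -> ?
    s = re.sub(r"\(\s*\?(?:\s*,\s*\?)*\s*\)", "(?+)", s)    # value lists -> (?+)
    return re.sub(r"\s+", " ", s).strip().lower()           # normalize whitespace

def query_id(sql: str) -> str:
    """MD5 of the fingerprint, used as the query ID."""
    return hashlib.md5(fingerprint(sql).encode()).hexdigest().upper()

print(fingerprint("SELECT * FROM t WHERE id IN (1, 2, 3) -- lookup"))
# select * from t where id in (?+)
```

Note that `IN (42)` and `IN (1, 2, 3)` collapse to the same fingerprint, which is exactly what lets the digest aggregate thousands of distinct query texts into one pattern.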
Event Sources#
pt-query-digest can read from multiple sources:
Slow query log (most common): MySQL’s slow query log records queries exceeding long_query_time. Enabled with slow_query_log = ON and slow_query_log_file = /path/to/slow.log. Set long_query_time = 1 for 1-second threshold, or 0 to log everything (high volume).
PERFORMANCE_SCHEMA (alternative for live analysis): pt-query-digest --processlist h=localhost polls the processlist or PERFORMANCE_SCHEMA events_statements tables. Less overhead than a very verbose slow log.
tcpdump capture: pt-query-digest --type=tcpdump parses network traffic captured with tcpdump. Captures ALL queries regardless of duration, at the cost of network capture overhead. Useful when you cannot modify MySQL configuration.
Binary log: pt-query-digest --type=binlog analyzes replication binary log. Useful for analyzing write workloads on replicas.
Output Structure#
pt-query-digest produces a text report organized as:
- Overall summary: Total query count, unique fingerprints, time range analyzed
- Profile: Top N fingerprints ranked by total execution time, with percentage contribution
- Per-fingerprint analysis: For each top fingerprint — response time distribution, calls, rows examined vs. rows sent ratio, full normalized query, and example (the actual query text from one occurrence)
The rows examined vs. rows sent ratio is particularly valuable: a ratio of 10,000:1 (examined 10,000 rows, returned 1) is a strong indicator of a missing index causing a full table scan for each execution.
Percona Monitoring and Management (PMM)#
PMM consists of:
- PMM Server: Docker image running Prometheus, Grafana, ClickHouse (for Query Analytics), and the PMM UI. Deployed once per monitoring environment.
- PMM Client (pmm-agent): Lightweight agent deployed on each monitored host. Runs exporters for each database and system metrics.
For MySQL, the agent runs mysqld_exporter for Prometheus metrics and a slow log or PERFORMANCE_SCHEMA collector for Query Analytics. For PostgreSQL, it runs postgres_exporter and pg_stat_statements integration. The Query Analytics (QAN) module provides a UI similar to pt-query-digest’s text output but interactive and time-windowed.
sqlfluff: Static SQL Analysis#
sqlfluff parses SQL files using a configurable dialect (ANSI, PostgreSQL, MySQL, BigQuery, Snowflake, dbt, and many more) and applies a rule set. Rules cover:
- Style: keyword casing, indentation, trailing whitespace
- Structure: `SELECT *` (L036), missing table aliases (L011), ambiguous column references
- Anti-patterns: some rules catch patterns that are likely to be slow (e.g., functions on indexed columns in WHERE clauses)
It integrates into CI via sqlfluff lint and can auto-fix many style issues with sqlfluff fix. It is particularly valuable in dbt projects where SQL is written in Jinja-templated files — sqlfluff understands dbt’s {{ ref() }} syntax.
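Typical CI usage looks like the following (the path and dialect are placeholders to adapt to your project):

```shell
sqlfluff lint models/ --dialect postgres   # report violations; nonzero exit on failure
sqlfluff fix models/ --dialect postgres    # rewrite files to fix auto-fixable rules
```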
sqlfluff does not execute queries or have database connectivity by default. It is purely static analysis. For runtime query performance, other tools in this survey are needed.
Summary: Technical Capability Matrix#
| Tool | Aggregated Stats | Per-Query Plan | Log-Based | Real-Time | MySQL | PostgreSQL | Self-Hosted |
|---|---|---|---|---|---|---|---|
| pg_stat_statements | Yes | No | No | Yes | No | Yes | Yes (built-in) |
| pgBadger | Yes (batch) | No | Yes | No | No | Yes | Yes |
| pt-query-digest | Yes (batch) | No | Yes | Partial | Yes | No | Yes |
| EXPLAIN ANALYZE | No | Yes | No | Ad-hoc | Yes | Yes | N/A |
| auto_explain | No | Yes (logged) | Yes | No | No | Yes | Yes |
| pghero | Yes | No | No | Yes | No | Yes | Yes |
| pganalyze | Yes | Yes | No | Yes | No | Yes | Collector only |
| PMM | Yes | Yes (QAN) | Yes | Yes | Yes | Yes | Yes |
| sqlfluff | No | No | No | N/A | Yes | Yes | Yes |
S3: Need-Driven Discovery — Database Observability & Profiling#
Who Needs Database Observability, and When?#
Database observability is not a single tool problem. The right tooling depends heavily on your role, your database platform, and whether you are debugging a crisis, doing proactive maintenance, or building long-term monitoring infrastructure. This section maps specific personas and use cases to the tools that serve them best.
Persona 1: Python Developer Debugging Slow API Responses#
Context: A backend developer at a mid-sized company has received a Datadog alert — the /api/orders endpoint has degraded from 120ms to 800ms average response time over the past two days. The service uses SQLAlchemy connecting to a PostgreSQL 15 database. The developer has SSH access to the application server and read access to the database.
The investigation workflow:
The developer’s first instinct is to look at what changed. But with hundreds of queries potentially running against this endpoint, they need to narrow down which query regressed. pg_stat_statements is the right starting point.
They connect to the database and query pg_stat_statements sorted by total execution time, looking for fingerprints with a high mean time that were not in last week's profile. One stands out: a JOIN on the line_items table with a mean execution time of 2,400ms, an order of magnitude above every other query in the top ten. Because pg_stat_statements records only aggregate statistics, not plans, it can show that this query is slow but not why.
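The sort-by-total-time query is typically some variant of the following (column names assume PostgreSQL 13+, where total_time was split into planning and execution components):

```sql
SELECT queryid,
       calls,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       round(total_exec_time::numeric, 1) AS total_ms,
       rows,
       left(query, 60)                    AS query_sample
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```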
They extract the actual query from the application logs (or from pgBadger’s sample, since the slow query log has been capturing it), then run EXPLAIN ANALYZE on it directly. The plan reveals a Hash Join where the planner estimated 50,000 rows on the line_items table but only 12 rows were returned. The inner side is doing a Seq Scan on line_items rather than using the index on order_id. Running ANALYZE on line_items updates the statistics — the table had a large bulk import two days ago that outpaced autovacuum, leaving the statistics stale.
After running ANALYZE, the plan shifts to an Index Scan. The mean execution time drops to 2ms. The API endpoint recovers.
Tools used and why:
- pg_stat_statements: First stop — found the offending query fingerprint without needing to know which query it was in advance. The aggregated statistics pointed directly at the problem.
- EXPLAIN ANALYZE: Revealed the root cause (statistics divergence, bad plan). Would not have been found from statistics alone.
- pgBadger (secondary): Used to get an actual sample of the query from the slow log, since pg_stat_statements normalizes away the parameter values.
What this persona does NOT need: A SaaS monitoring service, a web dashboard, or PMM. The tooling needed was already built into the database. The entire investigation took 15 minutes.
Persona 2: DBA Generating a Weekly Performance Report#
Context: A senior DBA is responsible for five PostgreSQL databases across three environments. Every Monday morning, they generate a performance summary for the engineering leads — which queries degraded, what new slow queries appeared, how vacuum is keeping up, whether there were any lock contention events. This is proactive maintenance, not crisis response.
The workflow:
The DBA has configured PostgreSQL logging appropriately on all five instances: log_min_duration_statement = 500, log_checkpoints = on, log_lock_waits = on, log_temp_files = 0, log_autovacuum_min_duration = 0. Log files rotate daily.
Every Sunday night, a cron job runs pgBadger against the week’s accumulated logs on each server. The output is an HTML report dropped into an S3 bucket with a dated prefix. Monday morning, the DBA reviews five reports.
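Such a cron job might look like this sketch (the schedule, paths, and bucket name are placeholders):

```shell
# Sundays at 02:00: incremental pgBadger run over the week's logs,
# then upload the dated HTML report. (% must be escaped in crontab.)
0 2 * * 0  pgbadger --last-parsed /var/lib/pgbadger/last_parsed.state \
               --outfile /tmp/pgbadger-$(date +\%F).html \
               /var/log/postgresql/postgresql-*.log \
           && aws s3 cp /tmp/pgbadger-$(date +\%F).html s3://db-reports/weekly/
```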
The pgBadger report immediately surfaces:
- Top queries by total time: Shows that a new reporting query added by the analytics team on Thursday is consuming 18% of total database time — it was not there the previous week. The DBA files a ticket to add an index.
- Lock analysis: Shows three lock wait events lasting over 5 seconds on the `accounts` table Friday afternoon, coinciding with the end-of-week billing run. The DBA schedules a review of the transaction ordering in the billing code.
- Temporary files: A nightly ETL job is creating 4GB of temporary files, indicating that a sort or hash join is exceeding `work_mem`. The DBA considers raising `work_mem` for that specific session or adding a sort index.
- Checkpoint analysis: Checkpoint completion percentage is consistently high (95%), indicating each checkpoint barely completes before the next one starts. The DBA increases `max_wal_size`.
Tools used and why:
- pgBadger: Perfect for this use case — batch, historical, comprehensive. The HTML report is easy to share with engineering leads. The lock and autovacuum sections are not available from pg_stat_statements alone.
- pg_stat_statements: The DBA snapshots it weekly (running `pg_stat_statements_reset()` after each snapshot), using the delta to identify newly expensive queries.
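One hedged sketch of that snapshot-and-reset workflow (the snapshot table and its schema are an assumption for illustration, not part of any tool):

```sql
-- Keep a dated copy of the counters, then reset them for the next period.
CREATE TABLE IF NOT EXISTS pgss_weekly AS
    SELECT now() AS taken_at, * FROM pg_stat_statements WITH NO DATA;

INSERT INTO pgss_weekly
    SELECT now(), * FROM pg_stat_statements;

SELECT pg_stat_statements_reset();
```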
What this persona does NOT need: Real-time dashboards (they are not on-call), a SaaS product (privacy and cost constraints), or pt-query-digest (PostgreSQL environment only).
Persona 3: MySQL DBA at Scale with pt-query-digest and PMM#
Context: A DBA team manages 40 MySQL 8.0 instances across multiple product lines for an e-commerce company. They use Percona Server for MySQL (a drop-in MySQL replacement with enhanced performance_schema). Query performance has regressed on the catalog service database — the team needs to identify the root cause and also wants continuous monitoring going forward.
Immediate investigation with pt-query-digest:
The DBA team has slow_query_log enabled with long_query_time = 0.5 on all Percona Server instances. They copy the slow log from the affected instance and run pt-query-digest.
The digest output shows that one query pattern — a product search with a complex WHERE clause across three tables — accounts for 34% of total query time. The rows examined vs. rows sent ratio is 85,000:1: for every result row returned, 85,000 rows are being scanned. This is a missing index diagnosis. The pt-query-digest output includes an example query (with actual parameter values, not normalized), which the DBA runs with EXPLAIN to confirm a full table scan on the product_attributes table.
Adding a composite index resolves the issue.
Long-term monitoring with PMM:
The immediate fix is done, but the DBA team needs continuous visibility across all 40 instances without running pt-query-digest manually each week. They deploy PMM.
The PMM server runs as a Docker container on a dedicated monitoring VM. The pmm-agent is installed on each database host and configured with credentials to monitor MySQL. Within 10 minutes of deployment, Grafana dashboards show real-time metrics for each instance: QPS (queries per second), slow queries per second, InnoDB buffer pool hit ratio, replication lag, and disk I/O.
The Query Analytics (QAN) module provides a browser-based interface equivalent to pt-query-digest but with time range selection and drill-down into individual query plans. When the catalog service regression occurs next week on a different instance, any team member can open PMM’s QAN, select the time range of the regression, and see the top offenders immediately — without needing SSH access or running command-line tools.
PMM also integrates with Alertmanager for alerting: when a specific query’s average execution time exceeds a threshold, an alert fires to PagerDuty. The team can set SLOs at the query level, not just the host level.
Tools used and why:
- pt-query-digest: Fast, low-overhead immediate analysis. Available on any MySQL host with a slow log.
- PMM: Persistent monitoring infrastructure — eliminates the need for manual log analysis for routine monitoring, provides team-wide visibility, and integrates with the alerting stack.
- EXPLAIN: Still used at the end of the investigation to confirm the diagnosis and verify the fix.
What this persona does NOT need: pg_stat_statements (PostgreSQL only), pgBadger (PostgreSQL only), or pganalyze (PostgreSQL only).
Persona 4: DevOps Engineer Adding Database Metrics to Dashboards#
Context: A DevOps engineer is responsible for a Prometheus + Grafana observability stack that monitors all services including infrastructure. The engineering team runs PostgreSQL databases but the current Grafana dashboards show only host-level metrics (CPU, memory, disk I/O) — there is no query-level visibility. The engineer needs to add database metrics without deploying additional services.
Integrating pg_stat_statements with Prometheus:
The engineer deploys postgres_exporter (Prometheus community project) alongside each PostgreSQL instance. The exporter connects to PostgreSQL and scrapes metrics from pg_stat_statements (and other views like pg_stat_bgwriter, pg_stat_replication, pg_locks) on each Prometheus scrape interval (typically 15s).
The key metrics exported from pg_stat_statements include:
- `pg_stat_statements_calls_total` — counter of query executions by fingerprint
- `pg_stat_statements_seconds_total` — cumulative execution time by fingerprint
- `pg_stat_statements_rows_total` — cumulative rows by fingerprint
From these counters, Grafana’s rate() and increase() functions compute:
- Queries per second by fingerprint (from the rate of `calls`)
- Mean execution time per query (from `rate(seconds) / rate(calls)`)
- Error rate trends
The engineer imports the standard community Grafana dashboard for PostgreSQL (Dashboard ID 9628 or the Percona PostgreSQL dashboard) which is pre-configured with these metrics. Within an hour of setup, the team has per-query-pattern latency tracking in the same Grafana instance used for all other service metrics.
Adding alerting:
The engineer adds Prometheus alerting rules:
- Alert if mean query execution time for any fingerprint exceeds 2s for more than 5 minutes
- Alert if the `pg_stat_bgwriter_buffers_checkpoint` rate is high (indicates checkpoint pressure)
- Alert if `pg_stat_replication` lag exceeds 30s
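Expressed as a Prometheus rule, the first alert might look roughly like this (metric names depend on the postgres_exporter configuration in use, so verify them against your /metrics endpoint):

```yaml
groups:
  - name: postgres-queries
    rules:
      - alert: SlowQueryFingerprint
        expr: |
          rate(pg_stat_statements_seconds_total[5m])
            / rate(pg_stat_statements_calls_total[5m]) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Mean query latency above 2s for a fingerprint"
```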
Where pghero complements this:
The Prometheus/Grafana stack provides excellent time-series trending but poor ad-hoc query investigation. The engineer also deploys pghero as a Docker container connected to production PostgreSQL (read-only). When an alert fires indicating high query execution time, developers can open pghero to see which query fingerprints are currently most expensive — without needing database credentials or CLI access. pghero also surfaces missing index recommendations (queries that reference columns without indexes) automatically.
Tools used and why:
- pg_stat_statements: Data source. The exporter reads it via SQL.
- postgres_exporter: Bridge between pg_stat_statements and Prometheus.
- Grafana: Visualization, alerting. Already in the stack.
- pghero: Developer self-service dashboard for ad-hoc investigation, lower barrier to access than psql.
What this persona does NOT need: pgBadger (batch/log-based, doesn’t integrate with Prometheus), pganalyze (they already have Grafana; SaaS cost is not justified), or PMM (they already have a Prometheus stack; deploying PMM would be redundant infrastructure).
Cross-Persona Patterns#
Several patterns emerge across these personas:
pg_stat_statements is universal for PostgreSQL. Every PostgreSQL persona uses it, directly or through a tool that wraps it. There is no scenario where a PostgreSQL team should not have it enabled.
EXPLAIN ANALYZE is always the final diagnostic. Statistics and log analysis tell you what is slow. EXPLAIN ANALYZE tells you why. Every investigation eventually reaches this step.
The right tool depends on access pattern. The Python developer needs ad-hoc CLI tools. The DBA needs batch reporting. The DevOps engineer needs Prometheus integration. The MySQL DBA team needs both immediate analysis and persistent monitoring. There is no single tool that serves all these needs.
SaaS vs. self-hosted is a meaningful decision. pganalyze offers better UX and index advice but sends query samples to a third-party server. PMM provides equivalent depth but requires significant operational investment. Teams should evaluate their data sensitivity and operational capacity before choosing.
Slow log configuration is foundational. Whether you use pgBadger or pt-query-digest, the slow log must be properly configured before you need it. Enabling it after a performance event means you have no historical data. Configure log_min_duration_statement and slow_query_log from day one.
S4: Strategic Discovery — Database Observability & Profiling#
Ecosystem Health and Long-Term Viability#
Choosing database observability tooling is a long-term infrastructure decision. These tools become embedded in operational workflows, CI pipelines, runbooks, and dashboards. Switching them is expensive — not because data migration is complex, but because the knowledge and workflows built around them take time to rebuild. This section assesses the long-term health and strategic positioning of each tool.
Tool-by-Tool Longevity Assessment#
pg_stat_statements — Immortal#
pg_stat_statements ships with every PostgreSQL installation as a contrib extension and is effectively mandatory for any serious PostgreSQL deployment. It has been part of PostgreSQL since version 8.4 (2009). It is not a third-party project with funding risk or a SaaS that can be acquired or shut down — it is maintained by the PostgreSQL Global Development Group alongside the database itself.
Each PostgreSQL major version brings improvements: PostgreSQL 13 split total_time into total_exec_time and total_plan_time. PostgreSQL 14 added the toplevel column for nested statement tracking. PostgreSQL 15 added jit_functions, jit_optimization_time, and other JIT metrics. The extension evolves with the database.
Strategic verdict: Zero longevity risk. Enable it unconditionally on every PostgreSQL instance. Every other PostgreSQL observability tool in this survey depends on it.
pgBadger — Mature and Stable#
pgBadger is maintained by Dalibo, a well-established French PostgreSQL consulting company. The project has been active since 2012, with regular releases. The GitHub repository shows steady maintenance — releases happen several times per year, focused on compatibility with new PostgreSQL log formats, bug fixes, and minor features.
The development pace is slow, which is appropriate for a mature tool. There are no major features left to add: it parses PostgreSQL logs and produces reports, and it does that well. The risk is not abandonment but stagnation — if PostgreSQL significantly changes its log format (unlikely given backward compatibility commitments), pgBadger would need to keep up.
Dalibo’s business depends on PostgreSQL expertise; pgBadger is a flagship product of their open source portfolio. The likelihood of abandonment is very low.
Strategic verdict: Safe long-term choice. Mature, low maintenance, well-understood. Deploy and automate.
Percona Toolkit (pt-query-digest) — Actively Supported#
Percona is a commercial company built on MySQL expertise. The Percona Toolkit is their flagship open source offering and is central to their commercial support business — Percona’s support engineers use these tools to diagnose customer problems. This creates strong commercial alignment with maintaining the toolkit.
pt-query-digest has been in continuous use since the mid-2000s (originally under the Maatkit project name). It is deeply embedded in MySQL operations culture. GitHub activity shows regular maintenance, primarily around MySQL 8.x compatibility as Percona keeps pace with upstream MySQL changes.
The primary risk is MySQL’s own trajectory. Oracle maintains MySQL; Percona maintains a fork (Percona Server for MySQL) and a separate distribution (Percona XtraDB Cluster). If MySQL’s slow query log format or PERFORMANCE_SCHEMA changes significantly, Percona Toolkit must adapt. Historically, Percona has kept pace reliably.
Strategic verdict: Safe for MySQL/MariaDB/Percona Server environments. Deep community trust. Stable.
auto_explain — Immortal (PostgreSQL contrib)#
Like pg_stat_statements, auto_explain is a PostgreSQL contrib module maintained by the PostgreSQL Global Development Group. It is not a separate project; it ships with PostgreSQL. No longevity risk.
Strategic verdict: Enable it alongside pg_stat_statements on production systems where plan-level debugging information is needed.
pghero — Healthy Independent Project#
pghero is maintained by Andrew Kane, a prolific Rails/Ruby open source developer at Instacart. The project has strong GitHub activity, with regular updates to track PostgreSQL version changes and add new insights. Its 8,600+ stars make it the most starred PostgreSQL performance tool, indicating broad adoption.
The risk factors: it is a single-maintainer project. If Andrew Kane shifts focus, maintenance could slow. The Ruby dependency can be a barrier for non-Ruby shops. The Docker image partially mitigates this, but the underlying project maintenance risk remains.
That said, pghero’s simplicity — it is primarily a SQL queries wrapper around pg_stat_statements — means it does not require frequent maintenance to remain functional. It could survive reduced maintenance for years on existing PostgreSQL versions.
Strategic verdict: Good choice for teams wanting a quick web dashboard. Acceptable single-maintainer risk given the tool’s simplicity. Monitor for maintenance activity if adopting long-term.
pganalyze — SaaS with Open Source Collector#
pganalyze has operated as a SaaS since 2013 and has a stable commercial business. The open source collector (pganalyze-collector) is maintained on GitHub and is genuinely open. The collector is designed to function as the integration point between PostgreSQL and the pganalyze service.
The strategic risk is SaaS dependency. If pganalyze raises prices significantly, pivots, or is acquired, you lose the analysis platform — even though the collector is open source. The collector could theoretically be repurposed (it emits data in a documented format), but the value is in the SaaS analysis, which is not open source.
pganalyze’s index advisor and EXPLAIN plan visualization are genuinely differentiated features not easily replicated with open source tools. Teams that value these and accept the SaaS model should evaluate it seriously.
Strategic verdict: Good SaaS choice for PostgreSQL teams with tolerance for external dependency. Evaluate against your data residency requirements (query samples are sent to pganalyze infrastructure). Acceptable for non-sensitive workloads; problematic for regulated industries.
Percona Monitoring and Management (PMM) — Operationally Heavy but Durable#
PMM is Percona’s flagship monitoring product and a direct competitive answer to commercial solutions like Datadog’s database monitoring. Percona actively develops PMM; it is central to their commercial support offering (PMM is free, but support contracts generate revenue).
PMM’s architecture has matured significantly. PMM 2.x replaced the previous all-in-one design with a containerized, Prometheus-native architecture. The Query Analytics module (QAN) added ClickHouse as the storage backend for query analytics data, which scales well.
The strategic challenge with PMM is operational overhead. Deploying and maintaining PMM server requires dedicated resources (a VM or Kubernetes namespace, persistent storage, regular upgrades). Small to mid-size teams often find that the overhead exceeds the benefit compared to simpler alternatives (pghero + postgres_exporter). Large teams running dozens of database instances across multiple platforms (MySQL + PostgreSQL + MongoDB) find PMM pays for itself in consolidated visibility.
Strategic verdict: Strong long-term viability backed by Percona’s commercial model. Best suited for teams running multiple database platforms at scale and with operational capacity to manage it. Overkill for single-platform small deployments.
sqlfluff — Growing Rapidly#
sqlfluff has experienced rapid growth since 2020, driven largely by the dbt ecosystem. dbt users write extensive SQL in version-controlled files, creating exactly the use case sqlfluff was designed for. The project has a large contributor community and is no longer dependent on any single maintainer.
The risk is fragmentation: SQL dialects are numerous and evolving rapidly (especially cloud data warehouse SQL), and maintaining accurate parsers for each is ongoing work. The project has handled this reasonably well by treating dialect support as pluggable.
Strategic verdict: Strong long-term momentum. Particularly valuable for teams using dbt. Adoption is accelerating, not declining.
Decision Framework: Choosing Your Stack#
PostgreSQL vs. MySQL Platform Split#
The most fundamental decision is your database platform. The PostgreSQL and MySQL ecosystems have largely separate tooling:
| Need | PostgreSQL | MySQL |
|---|---|---|
| Query statistics | pg_stat_statements | PERFORMANCE_SCHEMA |
| Slow log analysis | pgBadger | pt-query-digest |
| Web dashboard | pghero | PMM QAN |
| Managed monitoring | pganalyze | PMM or PlanetScale Insights |
| Plan inspection | EXPLAIN ANALYZE | EXPLAIN ANALYZE |
| SQL linting | sqlfluff | sqlfluff |
EXPLAIN ANALYZE and sqlfluff work across both platforms.
Open Source vs. SaaS#
Go open source when:
- Data sensitivity/residency requirements prohibit sending query samples to third parties
- Team has operational capacity to deploy and maintain self-hosted tools
- Cost at scale makes SaaS prohibitive (pganalyze charges per server)
- You want full control and no external dependencies
Go SaaS when:
- Team is small and operational overhead of self-hosted tools is high
- Index advisor and EXPLAIN visualization justify the cost
- Integration with commercial monitoring stacks (PagerDuty, Slack alerts) is needed out of the box
- Managed PostgreSQL hosts (RDS, Cloud SQL) already provide pg_stat_statements access
Real-Time vs. Log-Based#
Real-time (pg_stat_statements, pghero, PMM, postgres_exporter):
- Shows current state — what is expensive right now
- Counters accumulate since last reset; requires periodic snapshotting for trend data
- Immediately actionable during incidents
- Does not capture individual slow query examples with parameter values
Log-based (pgBadger, pt-query-digest):
- Retroactive — requires slow log to have been enabled before the problem occurred
- Captures actual parameter values (not normalized)
- Better for weekly reporting and post-incident analysis
- Provides lock analysis, checkpoint analysis, connection analysis not available from statistics alone
The operationally mature answer is to use both: enable slow logging with a reasonable threshold, and run pg_stat_statements continuously. Use statistics for real-time alerting and dashboards; use log analysis for weekly reports and post-incident reviews.
Minimum Viable Observability Stack#
For a team starting from zero with a PostgreSQL database:
- Enable pg_stat_statements — zero cost, built-in, essential. Query it weekly or integrate with postgres_exporter.
- Configure slow query logging — `log_min_duration_statement = 1000`. Install pgBadger. Run it weekly.
- Enable auto_explain — `auto_explain.log_min_duration = 5000`, `log_analyze = true`. Gets you plan-level data for the worst offenders.
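Collected into one postgresql.conf fragment, the three steps look like this (thresholds are the example values above; tune for your workload):

```ini
shared_preload_libraries = 'pg_stat_statements,auto_explain'

log_min_duration_statement = 1000      # slow log input for pgBadger (ms)
log_checkpoints = on
log_lock_waits = on
log_temp_files = 0                     # log every temporary file

auto_explain.log_min_duration = 5000   # plans for the worst offenders (ms)
auto_explain.log_analyze = on
```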
This stack costs nothing, has negligible operational overhead, and covers 80% of observability needs. Add pghero for a web dashboard, postgres_exporter + Grafana for trending, and pganalyze or PMM if you need managed insights or multi-instance coverage.
Competitive Alternatives Not Covered in Depth#
Datadog Database Monitoring: Commercial, per-host pricing, excellent UX, integrates with Datadog APM for query-to-trace correlation. Strong choice for teams already paying for Datadog. Captures explain plans automatically.
New Relic Database Monitoring: Similar to Datadog, APM-integrated.
PgHero alternatives: pgwatch2 (open source, more complex), pgmetrics (CLI report generator, no web UI).
eBPF-based monitoring (emerging): Tools like odigos and pixie use eBPF to capture database traffic at the kernel level without query instrumentation. Still maturing but potentially powerful for environments where agent deployment is constrained.
pgMustard: SaaS EXPLAIN plan analyzer. Paste in EXPLAIN output and get annotated advice. Complementary to pg_stat_statements — helps interpret plans, not find them.
Summary Recommendation#
For most PostgreSQL teams: pg_stat_statements + pgBadger + EXPLAIN ANALYZE + auto_explain. These four tools cost nothing, require minimal configuration, and cover the complete observability lifecycle from query statistics to plan debugging to batch reporting.
For teams that want a web dashboard without building one: add pghero.
For teams with managed monitoring infrastructure: add postgres_exporter and pull metrics into existing Grafana/Prometheus stacks.
For MySQL-centric teams: pt-query-digest + PMM is the standard answer, equivalent to pgBadger + pghero in the PostgreSQL world.
For teams requiring SaaS convenience with deeper insight: pganalyze (PostgreSQL) or PMM + Percona support (MySQL/mixed).