When the Postgres planner lies to you (and how to tell)

EXPLAIN ANALYZE is not a confession. It's a story the planner tells you about what it expected and then what it did. Both parts can mislead, in different ways.

The two numbers

Every node in the plan has two numbers I care about: the estimated rows and the actual rows. They look like this:

Index Scan using users_pkey on users
  (cost=0.43..8.45 rows=1 width=320)
  (actual time=0.012..0.013 rows=1 loops=1)

Estimated 1, actual 1. Healthy. The lies start when those numbers diverge by an order of magnitude or more — and especially when they diverge in a way the planner couldn't have known about.

Three kinds of lie

1. The stats are stale. ANALYZE hasn't run since you bulk-loaded that table. The planner thinks the table has 100k rows; it has 50M. Run ANALYZE; the lie evaporates. This is the easy case and accounts for maybe a third of "the planner is broken" complaints I see.
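A quick way to confirm this case before staring at plans: compare what the planner thinks the table size is against when stats were last gathered. A sketch, assuming the `users` table from above:

```sql
-- When did ANALYZE (or autoanalyze) last run, and how big does
-- the stats collector think the table is?
SELECT relname, n_live_tup, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'users';

-- reltuples is the estimate the planner actually uses:
SELECT reltuples::bigint FROM pg_class WHERE relname = 'users';

-- If those numbers are nowhere near reality:
ANALYZE users;
```

If `last_analyze` and `last_autoanalyze` both predate your bulk load, you've found the lie.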

2. The stats are fine, the correlation is wrong. The planner knows column A has selectivity 0.01 and column B has selectivity 0.1, so WHERE A AND B "should" return 0.001 of the table. But A and B are not independent. They might be perfectly correlated. The planner has no way to know unless you create extended statistics:

CREATE STATISTICS users_country_lang
    (dependencies, ndistinct)
    ON country, language FROM users;
ANALYZE users;

I forget about CREATE STATISTICS approximately once a quarter and rediscover it the hard way.
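After the ANALYZE, it's worth confirming Postgres actually computed the dependency — the `pg_stats_ext` view (Postgres 12+) shows what was gathered. A sketch:

```sql
-- Did the extended statistics get populated?
SELECT statistics_name, attnames, dependencies, n_distinct
FROM pg_stats_ext
WHERE tablename = 'users';
```

A non-null `dependencies` column means the planner now knows country and language move together.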

3. The plan is right, the timing isn't. A plan can have correct row estimates and still be slow because of buffer cache behavior, lock contention, or JIT compilation taking 200ms to warm up on a short query. EXPLAIN ANALYZE reports wall-clock time, and JIT compilation is part of that wall clock. If the output ends with a JIT block whose compile time rivals the Execution Time, or the Planning Time dwarfs it, the plan is fine; the engine is being weird.
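The JIT case is easy to test directly: the output carries its own timing block, and you can switch JIT off for the session and compare. A sketch:

```sql
-- The tell at the bottom of EXPLAIN ANALYZE output looks roughly like:
--   JIT:
--     Functions: 12
--     Timing: Generation 4.3 ms, Inlining 80.1 ms,
--             Optimization 90.2 ms, Emission 30.0 ms, Total 204.6 ms
-- If that Total rivals your Execution Time, rerun without JIT:
SET jit = off;   -- session-level; jit_above_cost is the finer knob
```

If the query is suddenly fast, the plan was never the problem.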

Reading the diff

My eye goes to one thing first: the loops column.

-> Index Scan using comments_post_id_idx
   (actual time=0.014..0.018 rows=8 loops=1247)

That node ran 1247 times. The "actual time" and "rows" figures are both per-loop averages, not totals. Total rows produced: 1247 × 8 ≈ 10k. Total time: 1247 × 0.018 ms ≈ 22ms. If you read the per-loop number and forget to multiply, you will be confused for an embarrassing amount of time, and I have been.

BUFFERS, always

Run EXPLAIN (ANALYZE, BUFFERS), every time. The buffer counts tell you what the timing won't:

  • shared hit — pages already in cache. Free.
  • shared read — pages from disk. Not free.
  • shared dirtied — pages this query modified, often just hint-bit updates the first time freshly loaded rows are read. Surprising on a read-only query.

A query that's "fast" because the data is hot in cache will be slow in production after a restart. BUFFERS shows you that. Timing doesn't.
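The flag just goes in the options list, and the counts then appear on every node. A sketch of what to run and the line to look for:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE country = 'SE';

-- Each node then carries a line roughly like:
--   Buffers: shared hit=1204 read=3371 dirtied=12
-- A big "read" count under a fast timing is the cold-cache warning sign.
```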

The rest of my checklist

  1. Are estimates within 2× of actual? If not, ANALYZE. If still not, extended stats.
  2. Is the plan using the index you expected? If not, why not — selectivity, types, casts?
  3. Is there a Sort node that could be replaced by an index ordering?
  4. Is there a Materialize or Hash node spilling to disk? (Look at the node details next to Memory Usage: a "Disk:" or "Disk Usage:" figure means you've spilled.)
  5. Are there nested loops with high loops=? Often a missed hash join opportunity.
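For the spill case in item 4, the fix is usually work_mem, and the spill is visible right in the node details. A sketch, with a hypothetical size:

```sql
-- A spilled sort announces itself:
--   Sort Method: external merge  Disk: 10240kB
-- A spilled hash shows Batches > 1 plus a Disk Usage figure.
-- Give this session enough memory and re-check:
SET work_mem = '64MB';   -- hypothetical; size it to the spill you saw
```

Don't set it globally without thinking — work_mem is per sort/hash node, per backend, so it multiplies fast.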

The trick I keep relearning

When a query is mysteriously slow and the plan looks fine, run it twice. Then run EXPLAIN (ANALYZE, BUFFERS) on the second run. Most of the "lies" I chase turn out to be cache-warmth artifacts. The plan was telling the truth; I was reading it at the wrong moment.
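If you'd rather see the warmth than infer it from run-to-run variance, the pg_buffercache extension can count how many of a table's pages are currently resident. A sketch, assuming you have the privileges to install it and the users table from earlier:

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- How many 8kB pages of users sit in shared_buffers right now?
SELECT count(*) AS cached_pages
FROM pg_buffercache b
WHERE b.relfilenode = pg_relation_filenode('users');
```

Run it before and after the query: if the count jumps, your first run was paying the cache-fill cost the second run gets for free.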

Edit, Mar 2026: Postgres 18 changes some of this. Updating soon.
