Tail latency for people who write CRUD apps

If you build CRUD apps for a living and someone in a meeting starts talking about p99, this post is for you. It's not a textbook chapter. It's the thing I wish I had read in 2017.

The shape of the problem

Latency is not a number. It's a distribution. The most common mistake I see — in code, in dashboards, in over-the-shoulder conversations — is collapsing the distribution into a single number too early and then reasoning about that number as if it were the truth.

Average response time is the worst offender. A service can have a lovely 40 ms average and be torturing 5% of users. The average cannot tell you that. It is, structurally, incapable.

"Use p99," people say. p99 is better. p99 is also, in a different way, a liar — and that's most of this post.

Why p99 lies

Three reasons, in increasing order of importance.

One: p99 hides the worst 1%. If your service handles a million requests an hour, the worst 1% is ten thousand requests an hour where users are having a bad time. p99 says nothing about how bad. The p99 might be 200 ms; the p99.9 might be 8 seconds. You don't know unless you measure.

Two: p99 of components ≠ p99 of the system. If your request fans out to four downstream services, each with a p99 of 100 ms, your end-to-end p99 is not 100 ms. It's much worse. The probability that at least one of the four hits its tail is roughly 1 − 0.99⁴ ≈ 4%. Your end-to-end p96 is now where the ugly stuff lives.
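
If the 4% feels hand-wavy, it's quick to check by simulation. A minimal sketch with made-up numbers: four independent downstream calls, each landing in its own worst 1% with probability 0.01.

package main

import (
	"fmt"
	"math/rand"
)

func main() {
	r := rand.New(rand.NewSource(1))
	const n = 1_000_000
	slow := 0
	for i := 0; i < n; i++ {
		// four independent downstream calls; each lands in its
		// own worst 1% with probability 0.01
		for j := 0; j < 4; j++ {
			if r.Float64() < 0.01 {
				slow++ // at least one leg of the fan-out was slow
				break
			}
		}
	}
	// prints ≈ 3.9%, i.e. 1 − 0.99⁴: the end-to-end tail starts
	// near p96, not p99
	fmt.Printf("requests touching a tail: %.2f%%\n", 100*float64(slow)/float64(n))
}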

Three: most measurements are coordinated-omission'd. This is the Gil Tene point. If your load tester sends one request, waits for the response, and only then sends the next, a slow response delays all the requests that should have been sent in the meantime, so the slow period is under-sampled and the slow request looks like a one-off instead of a pile-up. The load tool is colluding with the system to hide tail behavior. Most homemade benchmark scripts do this; most off-the-shelf tools used to, and these days mostly don't.
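
If that's abstract, here's the difference in miniature. fire and record are hypothetical stand-ins for "send one request" and "store one latency sample", not a real load tool:

package main

import "time"

// fire and record are hypothetical stand-ins: send one request,
// store one latency sample.
func fire()                  { time.Sleep(5 * time.Millisecond) }
func record(d time.Duration) {}

func main() {
	const n = 1000

	// closed loop: the next request waits for the previous one, so a
	// slow response suppresses exactly the load that exposed it
	for i := 0; i < n; i++ {
		start := time.Now()
		fire()
		record(time.Since(start))
	}

	// open loop: requests start on a fixed schedule no matter how slow
	// earlier ones were; measuring from the intended start time charges
	// any queueing delay to the sample
	tick := time.NewTicker(10 * time.Millisecond)
	defer tick.Stop()
	for i := 0; i < n; i++ {
		intended := <-tick.C
		go func(intended time.Time) {
			fire()
			record(time.Since(intended))
		}(intended)
		// (a real tool would wait for in-flight requests before exiting)
	}
}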

What to measure instead

In rough order of how often I reach for them:

  1. Histogram, not summary. Prometheus has both; prefer histograms with sensible buckets. Summaries can't be aggregated across instances. Histograms can. (There's a sketch after this list.)
  2. p99 and p99.9 and max. max is noisy but worth keeping. Some of the most useful incidents I've debugged started with "max is 30 seconds and p99.9 is 800 ms; what changed?"
  3. Latency by route, not in aggregate. One slow endpoint can poison your global p99 and you'll never figure out which one without slicing.
  4. The CDF, occasionally. When I'm trying to understand a service I've never touched, I'll generate a CDF of the last hour's response times. The shape tells you almost everything. Bimodal distribution? Two paths through the code, one of them slow. Long tail past p99? Some external dependency is leaking. Shelf at exactly 30 seconds? Timeout.
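
Here's what points 1 and 3 look like in Go with client_golang. A sketch rather than a drop-in: the bucket boundaries and route names are assumptions you'd replace with your own.

package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// buckets are a guess you should tune: dense around your p50,
// sparse out past your worst credible timeout
var reqDur = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "http_request_duration_seconds",
	Help:    "Request latency, labeled by route.",
	Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 30},
}, []string{"route"})

// instrument wraps a handler and observes its latency per route
func instrument(route string, h http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		h(w, r)
		reqDur.WithLabelValues(route).Observe(time.Since(start).Seconds())
	}
}

func main() {
	http.HandleFunc("/users", instrument("/users", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}

The payoff of the histogram is that PromQL can sum the per-instance buckets and run histogram_quantile over the whole fleet, which a summary can't give you.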

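And for the CDF: you don't need plotting infrastructure to get the shape. A sketch, assuming you've already pulled an hour of durations out of your logs (the hardcoded samples are stand-ins):

package main

import (
	"fmt"
	"sort"
	"time"
)

func main() {
	// stand-in data; in practice this comes out of your logs
	samples := []time.Duration{
		8 * time.Millisecond, 9 * time.Millisecond, 11 * time.Millisecond,
		12 * time.Millisecond, 60 * time.Millisecond, 95 * time.Millisecond,
		4 * time.Second, 30 * time.Second,
	}
	sort.Slice(samples, func(i, j int) bool { return samples[i] < samples[j] })
	// each line is one point on the CDF: the fraction of requests
	// at or below that duration
	for i, d := range samples {
		fmt.Printf("%6.2f%%  %v\n", 100*float64(i+1)/float64(len(samples)), d)
	}
}

Even as text, a shelf at exactly 30 seconds jumps out.
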
An example, kept boring on purpose

I had a Go service last year that fronted a Postgres database. CRUD, fundamentally — list-by-user, get-by-id, the usual. p50 around 8 ms, p99 around 60 ms, max usually under 200 ms. Healthy.

Once a day, briefly, the p99.9 would jump to 4 seconds. Nobody had noticed for months because the dashboard only showed p99.

What I did:

// instrument the slow path (route and uid are in scope in the handler)
defer func(start time.Time) {
    if d := time.Since(start); d > 500*time.Millisecond {
        log.Warn().Dur("dur", d).
            Str("route", route).
            Int64("user_id", uid).
            Msg("slow request")
    }
}(time.Now())

Eight lines. Within a day I had a list of slow requests. Same route. Same user. Different times of day. The user had 2.3 million rows in a child table that the route was full-scanning because the ORM was emitting a query the planner didn't know how to use. The fix was a partial index. Took an hour.

The point isn't the fix. The point is: the slow requests had been happening for months and nothing was telling me about them because nothing crossed the p99 threshold. p99 was a surveillance system pointed at the wrong place.

Some heuristics I trust

  • Anything past p99 is mostly the network and the runtime, not your code. GC pauses, TCP retransmits, kernel scheduling decisions. Your business logic only owns the head.
  • If max is much worse than p99.9, look at timeouts and connection pools first. A request that waited 10 seconds for a free connection and then ran for 8 ms is a different bug than a request that ran for 10 seconds. (There's a sketch for telling them apart after this list.)
  • Periodicity is a clue, not a feature. Daily spike at 03:00? Some cron job is running. Hourly spike on the hour? Some metric scrape, or some stats-flushing thing. Spike every Tuesday at 14:00 UTC? Welcome to my current contract.
  • If your p50 is moving and p99 isn't, you have a load problem. If p99 is moving and p50 isn't, you have a contention problem. They're different fixes.
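
On the connection-pool point: database/sql keeps counters for exactly this, so you don't have to guess. A minimal sketch; the once-a-minute cadence is arbitrary:

package main

import (
	"database/sql"
	"log"
	"time"
)

// watchPool periodically logs connection-pool wait stats, so a slow
// request can be blamed on the pool or on the query. Call it once at
// startup: go watchPool(db).
func watchPool(db *sql.DB) {
	for range time.Tick(time.Minute) {
		s := db.Stats()
		// WaitDuration is the total time spent blocked waiting for a
		// free connection; WaitCount is how often it happened
		log.Printf("pool: waits=%d total_wait=%v in_use=%d idle=%d",
			s.WaitCount, s.WaitDuration, s.InUse, s.Idle)
	}
}

If WaitDuration is climbing while query times stay flat, the fix is pool sizing, not query tuning.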

What I tell juniors

Don't optimize for averages. Don't even optimize for p99 directly. Optimize for: "how often does my service make a real human wait visibly long enough to be annoyed?" That's usually p99.5 or p99.9 depending on the service. The exact number matters less than the discipline of asking the question.

And: log the slow requests. With context. The single most leveraged eight lines of code I write.


Edit, Mar 2026: a reader pointed out — correctly — that the "p99 of components" math above is a simplification: it assumes the four downstream calls are independent, and they usually aren't. The real number is worse. I'll write a follow-up. Thank you, P.

Originally posted on Feb 2, 2026.
