PostgreSQL Connection Debugging for Web Apps: DATABASE_URL, sslmode, Pool Size, and Deploy Failures
How to debug Postgres connection failures on deploy: percent-encoding in DATABASE_URL, what each sslmode value does, pool size math against max_connections, connect timeouts, and why migrations hang.
PostgreSQL Connection Debugging for Web Apps: DATABASE_URL, sslmode, Pool Size, and Deploy Failures
The app runs fine on your laptop. You deploy it, and the health check dies with FATAL: password authentication failed — even though you pasted the password straight from the dashboard. Or the first request after a quiet night takes 30 seconds and then fails. Or the migration step hangs forever and the deploy times out. Almost every one of these traces back to five things: how the DATABASE_URL was parsed, which sslmode is in effect, how many connections the pool opens, what the timeouts are, and what the migration is waiting on. This post walks through each one with the actual error text you'll see, so you can match symptom to cause instead of guessing.
If you want the compact reference version of the commands used below (\conninfo, pg_stat_activity, lock inspection), the PostgreSQL Cheatsheet has them organized by task with copy buttons.
Start with the exact FATAL line, not the config
Postgres connection errors are unusually specific. Read the whole line before changing anything:
Connection refused— nothing is listening at that host:port. Wrong host, wrong port, or the database isn't up. Postgres itself never saw you.FATAL: password authentication failed for user "app_user"— you reached Postgres. The host and port are right. Only the credentials (or the user's auth method) are wrong.FATAL: no pg_hba.conf entry for host "10.0.0.5", user "app_user", database "shop", no encryption— credentials may be fine; the server refuses this combination. The trailingno encryptionis the tell: the server requires TLS and your client connected without it. This is the single most common managed-Postgres deploy failure, and the fix issslmode=require, not a password reset.FATAL: database "shop" does not exist— auth succeeded. You're on the wrong database name, or the URL path segment got lost.timeout expired/Connection timed out— packets are being dropped, usually a security group, firewall, or a private-network hostname resolved from the wrong network.
Each error rules out everything before it in the handshake. Connection refused means don't touch the password. password authentication failed means don't touch the firewall.
DATABASE_URL parsing: percent-encoding bites first
A connection URL is parsed as a URL, and that has a consequence people hit constantly: reserved characters in the password must be percent-encoded. Suppose the real password is p@ss:w0rd. This URL is broken:
postgres://app_user:p@ss:w0rd@db.internal:5432/shop
The parser splits on the first @, so it reads the password as p, the host as ss:w0rd@db.internal — and you get either a DNS failure on a garbage hostname or password authentication failed with a password you know is correct. The working form encodes @ as %40 and : as %3A:
postgres://app_user:p%40ss%3Aw0rd@db.internal:5432/shop?sslmode=require
Input: p@ss:w0rd → output: p%40ss%3Aw0rd. That's plain URL component encoding — the URL Encoder / Decoder does exactly this transformation if you don't want to hand-count percent signs. And because these URLs live in env files where a stray space or an unquoted ? breaks everything silently, running the file through the .env Validator before deploying catches the mechanical mistakes (duplicate keys, unquoted special characters, trailing whitespace) that produce "works locally, fails on the server" behavior.
Two more parsing traps: a URL that ends in / has an empty database name (some drivers then default to the username as the database, giving database "app_user" does not exist), and query parameters after the first one must be joined with &, not a second ?.
sslmode: six values, three that matter
The sslmode query parameter controls whether the client uses TLS and whether it checks who it's talking to:
disable— plaintext. Fine for localhost, guaranteedno pg_hba.conf entry ... no encryptionon most managed providers.prefer(libpq's default) — tries TLS, falls back to plaintext. The dangerous one: it works everywhere, so you never notice which mode you're actually in.require— TLS mandatory, but the server's certificate is not verified. This is what most hosting docs tell you to use. It encrypts traffic but doesn't authenticate the server.verify-ca/verify-full— TLS plus certificate verification (verify-fullalso checks the hostname). These need a CA file (sslrootcert=...) and are what you want when the database crosses a network you don't control.
The failure signature for a missing sslmode is precisely the no encryption FATAL above. The failure signature for verify-full without a CA cert is root certificate file "/root/.postgresql/root.crt" does not exist. Match the message; don't rotate credentials for a TLS problem.
Pool size × instances must fit under max_connections
Postgres defaults to max_connections = 100 (per the PostgreSQL documentation), and every connection is a forked backend process. Andres Freund's 2020 measurement on the Microsoft Citus blog put the real memory overhead of an idle connection at under 2 MB — smaller than the 10 MB folklore number, but his companion scalability benchmark showed throughput degrading measurably as idle connection counts climb into the thousands, because of snapshot and cache-management costs. Connections are not free even when idle.
The math that actually breaks deploys is simpler than any benchmark: pool size × app instances + workers + migrations + one human with psql ≤ max_connections. A pool of 20 looks harmless until autoscaling brings up 6 instances (120 connections) against a managed plan capped at 100, and new instances crash-loop on FATAL: remaining connection slots are reserved for non-replication superuser connections. When that hits, see who is holding what:
SELECT usename, state, count(*)
FROM pg_stat_activity
GROUP BY 1, 2
ORDER BY 3 DESC;
If most rows are idle, your pools are oversized; shrink them or put PgBouncer in transaction-pooling mode in front. If they're idle in transaction, the app is holding transactions open across await points — that's an application bug, and it also blocks the migrations below.
Timeouts and hanging migrations on deploy
Two timeout settings prevent the classic "first request after idle takes forever" incident. connect_timeout in libpq defaults to 0 — wait indefinitely — so a dropped SYN packet means your app waits until the OS gives up. Set connect_timeout=5 in the URL. On the server side, statement_timeout bounds runaway queries; setting it per-role (ALTER ROLE app_user SET statement_timeout = '30s') keeps one bad query from pinning a connection forever.
Migrations hang for a different reason: ALTER TABLE needs an ACCESS EXCLUSIVE lock, and it queues behind any open transaction touching that table — including an idle in transaction session from a stale pod. Worse, every new query then queues behind the waiting ALTER TABLE. The deploy looks frozen; the database is just waiting politely. Wrap migrations with SET lock_timeout = '5s' so they fail fast and retry instead of stalling the whole app. If you're running Postgres in a container during CI, note the port mapping too — docker run -p 5433:5432 means your migration connects to localhost:5433, and the Docker Cheatsheet covers the port and network inspection commands for confirming what's actually exposed.
I lost a Friday evening to the pg_hba variant of all this. Our staging URL had sslmode unset, and libpq's prefer silently fell back to plaintext against a self-hosted box that allowed it. When we moved staging to a managed provider, the same URL produced no pg_hba.conf entry ... no encryption, and because the message mentions the user and database, I spent an hour rotating credentials that were never wrong. Appending ?sslmode=require fixed it in one line. Since then, every connection string I write gets an explicit sslmode — prefer is the default you only discover during an outage.
The pattern across all five areas is the same: the error text names the failing layer precisely. Decode the URL, pin the sslmode, do the pool arithmetic, set the two timeouts, and give migrations a lock_timeout — and the next deploy failure will be something more interesting.
Made by Toolora · Updated 2026-07-02