Background jobs and queues for web apps
When to push work off the request path, how to choose a database queue or Redis, and how to write a worker that survives crashes and retries.

The moment your app sends an email, resizes an image, or calls a slow third-party API inside a request handler, you have a problem waiting to happen. The user stares at a spinner, your server holds a connection open for three seconds it did not need to, and one flaky API turns into a wall of timeout errors. Background jobs are the fix: do the slow thing later, off the request path, where it can fail and retry without anyone watching.
What actually belongs in a background job
Not everything needs a queue. The honest test is simple: does the user need the result before you can send the HTTP response? If yes, do it inline. If no, push it to a job.
Things that almost always belong in the background:
- Sending email and SMS (the SMTP or provider call is slow and flaky)
- Generating PDFs, thumbnails, or video transcodes
- Calling external APIs that you do not control (payment reconciliation, syncing to a CRM)
- Fan-out work, like notifying 5,000 followers about a new post
- Anything you want to retry automatically on failure
Things that should usually stay inline: validating input, writing the row the user just submitted, returning the ID they need next. Do not background work that the next page load depends on. You will spend the rest of the week explaining race conditions to yourself.
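As a rough illustration of that split, here is a minimal sketch of a signup handler. Everything in it is hypothetical: the in-memory `users` map and `queue` array stand in for a real database and job queue, and `handleSignup` and `enqueue` are invented names, not part of any framework.

```typescript
type Job = { type: string; payload: Record<string, unknown> };
type SignupResult = { status: number; body: { id?: string; error?: string } };

// Hypothetical in-memory stand-ins for the database and the job queue.
const users = new Map<string, { email: string }>();
const queue: Job[] = [];
let nextId = 1;

function enqueue(job: Job): void {
  queue.push(job);
}

function handleSignup(email: string): SignupResult {
  // Inline: validate input, because the user needs the error right now.
  if (!email.includes("@")) {
    return { status: 400, body: { error: "invalid email" } };
  }

  // Inline: write the row and keep the ID the next page load depends on.
  const id = String(nextId++);
  users.set(id, { email });

  // Background: the welcome email is not needed to build the response.
  enqueue({ type: "welcome_email", payload: { userId: id } });

  return { status: 201, body: { id } };
}
```

The response goes out as soon as the row exists. The email happens whenever a worker gets to it.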
Do you even need Redis? Start with your database
The default reflex is to reach for Redis and BullMQ. Sometimes that is right. But if you already run Postgres and your job volume is modest (say, under a few hundred jobs a minute), a database-backed queue is less infrastructure to run, gives you transactional guarantees for free, and is easy to inspect with plain SQL.
The trick that makes Postgres a decent queue is SELECT ... FOR UPDATE SKIP LOCKED. It lets multiple workers grab different rows without stepping on each other. No worker ever waits on a row another worker already claimed.
Here is a minimal but real worker loop in TypeScript using the pg driver.
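
    import { Pool } from "pg";

    const pool = new Pool({ connectionString: process.env.DATABASE_URL });

    type Job = {
      id: string;
      type: string;
      payload: Record<string, unknown>;
      attempts: number;
    };

    type Handler = (payload: any) => Promise<void>;

    // Atomically claim the oldest due job. SKIP LOCKED makes concurrent
    // workers pass over rows that another transaction has already locked.
    // Note: nothing here resets a row left in 'running' by a worker that
    // died mid-job; recovering those needs a separate periodic sweep.
    async function claimJob(): Promise<Job | null> {
      const { rows } = await pool.query<Job>(
        `UPDATE jobs
            SET status = 'running', started_at = now()
          WHERE id = (
            SELECT id FROM jobs
             WHERE status = 'pending'
               AND run_after <= now()
             ORDER BY run_after
             LIMIT 1
               FOR UPDATE SKIP LOCKED
          )
        RETURNING id, type, payload, attempts`
      );
      return rows[0] ?? null;
    }

    // Run at most one job. Resolves to false when there was nothing to claim.
    async function runOnce(handlers: Record<string, Handler>): Promise<boolean> {
      const job = await claimJob();
      if (!job) return false;
      try {
        const handler = handlers[job.type];
        if (!handler) throw new Error(`No handler registered for job type "${job.type}"`);
        await handler(job.payload);
        await pool.query(`DELETE FROM jobs WHERE id = $1`, [job.id]);
      } catch (err) {
        const attempts = job.attempts + 1;
        const backoff = Math.min(2 ** attempts, 3600); // seconds, capped at 1h
        const failed = attempts >= 5; // after 5 attempts, park the job as dead
        await pool.query(
          `UPDATE jobs
              SET status = $2,
                  attempts = $3,
                  last_error = $4,
                  run_after = now() + ($5 || ' seconds')::interval
            WHERE id = $1`,
          [job.id, failed ? "dead" : "pending", attempts, String(err), backoff]
        );
      }
      return true;
    }

    // One simple way to drive runOnce: keep claiming while there is work,
    // and sleep briefly when the queue is empty. idleMs is a tunable guess.
    async function work(handlers: Record<string, Handler>, idleMs = 1000): Promise<never> {
      for (;;) {
        const ranJob = await runOnce(handlers);
        if (!ranJob) await new Promise((resolve) => setTimeout(resolve, idleMs));
      }
    }

The table behind it is unremarkable, which is the point: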

    CREATE TABLE jobs (
      -- gen_random_uuid() is built in from PostgreSQL 13;
      -- older versions need the pgcrypto extension.
      id         uuid PRIMARY KEY DEFAULT gen_random_uuid(),
      type       text NOT NULL,
      payload    jsonb NOT NULL,
      status     text NOT NULL DEFAULT 'pending',  -- pending | running | dead
      attempts   int NOT NULL DEFAULT 0,
      run_after  timestamptz NOT NULL DEFAULT now(),
      last_error text,
      started_at timestamptz
    );

    -- Partial index: only pending rows are indexed, which is all the claim query reads.
    CREATE INDEX jobs_pending_idx ON jobs (run_after)
      WHERE status = 'pending';

The partial index keeps the claim query fast even when running or dead rows pile up. The biggest win here: you can enqueue a job in the same transaction that writes your business data. If the transaction rolls back, the job never existed. With Redis you cannot do that. The enqueue happens outside the database transaction, so a rollback can leave behind jobs for orders that were never created.
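
To make the transactional enqueue concrete, here is a minimal sketch. It assumes a hypothetical `orders` table and a hypothetical `send_receipt` job type, and it takes a small `Queryable` interface instead of importing pg directly, so treat the names as illustrative rather than as a fixed API.

```typescript
// Minimal shape of a database client. A dedicated pg client checked out
// with pool.connect() is the intended fit.
interface Queryable {
  query(text: string, params?: unknown[]): Promise<{ rows: any[] }>;
}

type NewOrder = { customerId: string; totalCents: number };

// Insert the order and its follow-up job in one transaction.
// The `orders` table and the "send_receipt" job type are hypothetical.
async function createOrderWithJob(db: Queryable, order: NewOrder): Promise<string> {
  await db.query("BEGIN");
  try {
    const { rows } = await db.query(
      `INSERT INTO orders (customer_id, total_cents)
       VALUES ($1, $2)
       RETURNING id`,
      [order.customerId, order.totalCents]
    );
    const orderId: string = rows[0].id;

    // Same connection, same transaction: the job commits or rolls back
    // together with the order.
    await db.query(`INSERT INTO jobs (type, payload) VALUES ($1, $2)`, [
      "send_receipt",
      JSON.stringify({ orderId }),
    ]);

    await db.query("COMMIT");
    return orderId;
  } catch (err) {
    await db.query("ROLLBACK");
    throw err;
  }
}
```

With pg, every statement in the transaction has to run on a single client obtained from `pool.connect()` and released afterwards. Calling `pool.query` for each statement may use a different connection each time, which silently breaks the transaction.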
When Redis and a real queue earn their keep
Push past a few hundred jobs a second, or want sub-second latency between enqueue and execution, and Postgres starts to feel the strain. Polling adds latency, and the constant updates and deletes leave dead tuples that vacuum has to keep up with. This is where BullMQ on Redis is genuinely better: it uses blocking Redis commands so workers wake almost instantly, and it ships with rate limiting and scheduled jobs. Dashboards such as Bull Board are available as separate packages.
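
    import { Queue, Worker } from "bullmq";

    const connection = { host: "127.0.0.1", port: 6379 };

    export const emailQueue = new Queue("email", { connection });

    // enqueue from your API route (`user` is the record the route just created)
    await emailQueue.add(
      "welcome",
      { userId: user.id },
      {
        attempts: 5,
        backoff: { type: "exponential", delay: 1000 }, // base delay in ms, doubling on each retry
        removeOnComplete: 1000, // keep only the most recent 1,000 completed jobs
        removeOnFail: 5000, // keep only the most recent 5,000 failed jobs
      }
    );

    // the worker (a separate process); sendWelcomeEmail is your own function
    const worker = new Worker(
      "email",
      async (job) => {
        await sendWelcomeEmail(job.data.userId);
      },
      { connection, concurrency: 10 }
    );

    // log failures so they do not pass silently
    worker.on("failed", (job, err) => {
      console.error(`email job ${job?.id} failed:`, err);
    });

The cost is that Redis is now a piece of stateful infrastructure you have to back up, monitor, and reason about. If Redis goes down and you have not configured persistence (AOF or RDB snapshots), you lose every queued and in-flight job. That is a fine trade at scale and a needless one when you have 40 emails an hour.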
A note on serverless
If you deploy to a platform where functions are short-lived (Vercel, Lambda), you cannot run a long-lived worker process the way the loop above assumes. Two practical options: trigger a function on a cron schedule to drain the queue, or use a managed queue (SQS, Upstash QStash, or a durable workflow runtime) that calls an HTTP endpoint per job. The job logic stays the same. Only the thing that wakes the worker changes.
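For the cron option, a drain function with a time budget is one way to stay inside the platform's execution limit. This is a sketch under assumptions: `drainQueue` and `budgetMs` are invented names, and `runOnce` is expected to behave like the worker function above, resolving to true when it ran a job and false when nothing was waiting.

```typescript
// Drain the queue until it is empty or the time budget runs out.
// Returns how many jobs were processed in this invocation.
async function drainQueue(
  runOnce: () => Promise<boolean>,
  budgetMs: number,
  now: () => number = Date.now // injectable clock, handy for tests
): Promise<number> {
  const deadline = now() + budgetMs;
  let processed = 0;
  while (now() < deadline) {
    const ranJob = await runOnce();
    if (!ranJob) break; // queue is empty, stop early
    processed += 1;
  }
  return processed;
}
```

The deadline is only checked between jobs, so a job that starts just before it can still overrun. Set the budget comfortably below the function timeout, leaving room for your slowest job.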
The three rules that keep you out of trouble
Most background job pain comes from skipping one of these.
Make handlers idempotent. A job can and will run twice. The worker might crash after sending the email but before deleting the row. Design every handler so that running it again is harmless. Charge a card? Use an idempotency key. Send an email? Record that you sent it and check first. Assume at-least-once delivery, never exactly-once.
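A minimal sketch of the "record that you sent it and check first" pattern. The in-memory `sentKeys` set is a hypothetical stand-in for a table with a unique constraint on the key, and `sendEmailOnce` and `SendFn` are invented names.

```typescript
// Hypothetical stand-in for a table with a unique constraint on the key,
// for example sent_emails(idempotency_key text PRIMARY KEY).
const sentKeys = new Set<string>();

type SendFn = (to: string, template: string) => Promise<void>;

async function sendEmailOnce(
  key: string,
  to: string,
  template: string,
  send: SendFn
): Promise<"sent" | "skipped"> {
  // Check first: a retry or duplicate delivery finds the record and stops.
  if (sentKeys.has(key)) return "skipped";

  await send(to, template);

  // Record only after the send succeeded, so a failed send is retried.
  sentKeys.add(key);
  return "sent";
}
```

A key derived from the job's meaning, such as `welcome:${userId}`, works better than a random one, because a duplicate enqueue then maps to the same key. This narrows the duplicate window but does not close it: a crash between the send and the record still sends twice, which is why passing an idempotency key to the provider, where it supports one, is the stronger guarantee.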
Always have a retry and a dead state. Transient failures (a network blip, a rate limit) should retry with exponential backoff. Permanent failures (a malformed payload, a deleted user) should stop after a few tries and land in a dead state where a human can look at them. A job that retries forever is a slow-motion incident.
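Retries only help if a failed job gets back to pending. The Postgres worker earlier handles thrown errors, but a process that dies mid-job leaves its row stuck in 'running'. One way to cover that case is a periodic sweep, sketched below against the same `jobs` table. The function name, the 15-minute timeout, and the `Queryable` interface are all assumptions for illustration.

```typescript
// Minimal shape of a database client (pg's Pool is the intended fit).
interface Queryable {
  query(text: string, params?: unknown[]): Promise<{ rows: any[] }>;
}

// Move jobs stuck in 'running' back into the retry path, or to 'dead'
// once they have used up their attempts. staleAfterSeconds must be longer
// than your slowest legitimate job, or healthy jobs get requeued mid-run.
async function requeueStaleJobs(
  db: Queryable,
  staleAfterSeconds = 900,
  maxAttempts = 5
): Promise<number> {
  const { rows } = await db.query(
    `UPDATE jobs
        SET status = CASE WHEN attempts + 1 >= $2 THEN 'dead' ELSE 'pending' END,
            attempts = attempts + 1,
            last_error = 'worker crashed or timed out',
            run_after = now()
      WHERE status = 'running'
        AND started_at < now() - make_interval(secs => $1)
    RETURNING id`,
    [staleAfterSeconds, maxAttempts]
  );
  return rows.length; // number of jobs that were recovered
}
```

Run it on a timer from any worker, or from cron. A requeued job may already have done part of its work, so the sweep is only safe when handlers are idempotent.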
Watch the queue depth. The single most useful metric is how many jobs are pending and how old the oldest one is. If pending depth climbs and never drains, your workers are down or too slow, and you want to know before your users do. Alert on it.
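For the Postgres queue, both numbers come out of one query. The sketch below assumes the `jobs` table from earlier. The `getQueueDepth` and `shouldAlert` names and the default thresholds are made up for illustration and will need tuning for a real workload.

```typescript
// Minimal shape of a database client (pg's Pool is the intended fit).
interface Queryable {
  query(text: string, params?: unknown[]): Promise<{ rows: any[] }>;
}

type QueueDepth = { pending: number; oldestAgeSeconds: number };

// Count jobs that are due now and measure how long the oldest has waited.
// Jobs scheduled for the future (backoff delays) are not counted as backlog.
async function getQueueDepth(db: Queryable): Promise<QueueDepth> {
  const { rows } = await db.query(
    `SELECT count(*)::int AS pending,
            COALESCE(EXTRACT(EPOCH FROM now() - min(run_after)), 0)::float8
              AS oldest_age_seconds
       FROM jobs
      WHERE status = 'pending'
        AND run_after <= now()`
  );
  return {
    pending: Number(rows[0].pending),
    oldestAgeSeconds: Number(rows[0].oldest_age_seconds),
  };
}

// Thresholds are illustrative assumptions, not recommendations.
function shouldAlert(depth: QueueDepth, maxPending = 1000, maxAgeSeconds = 300): boolean {
  return depth.pending > maxPending || depth.oldestAgeSeconds > maxAgeSeconds;
}
```

Age is often the better signal of the two: a queue can be deep and healthy during a burst, but an old job at the head means nothing is draining.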
Takeaway
Start with the simplest thing that works. If you already run Postgres, a SKIP LOCKED queue will carry you a long way and gives you transactional enqueue for free. Reach for Redis and BullMQ when volume or latency actually demands it, not on reflex. Whatever you choose, make handlers idempotent, cap your retries, and watch the queue depth. Get those three right and background work becomes the boring, reliable part of your system instead of the part that pages you at 2am.
If you want a second pair of eyes on a queue that keeps falling over, that is the kind of thing we like untangling as part of our API and backend engineering work.

