Observability basics for production web apps

Most apps ship to production with exactly one observability tool: the gut feeling you get when a user emails to say something is broken. That works right up until the day it doesn't, usually at 2am, usually during a sale. Observability is the boring insurance that turns "the site feels slow" into "checkout p95 jumped to 4 seconds because the payments API started timing out at 14:03."

What observability actually means

There is a lot of vendor noise around this word, so let's keep it plain. Observability is your ability to ask new questions about what your system is doing right now, without shipping new code to answer them. The classic split is three signals:

Logs: discrete events. "User 4821 placed order 9930." Good for the specific story of one request.
Metrics: numbers aggregated over time. "Requests per second, error rate, p95 latency." Good for trends and alerts.
Traces: the path of a single request across services. "This request spent 30ms in your handler and 2.1s waiting on Postgres."

You do not need all three on day one. But you do need to know which question each one answers, because reaching for the wrong signal is how people burn an afternoon grepping logs for something a single latency graph would have shown in five seconds.

Start with structured logs

If you only do one thing this quarter, make your logs structured. A line like console.log("order failed", err) is invisible to every tool that could help you. JSON logs with consistent fields are queryable, filterable, and groupable.

Use a real logger. In Node, pino is fast and unopinionated. Here is a setup that attaches a request ID to every log line so you can reconstruct a single request later.

// lib/logger.ts
import pino from "pino";
 
export const logger = pino({
  level: process.env.LOG_LEVEL ?? "info",
  // Redact anything that could leak into your log store.
  redact: ["req.headers.authorization", "req.headers.cookie", "*.password"],
  base: {
    service: "checkout-api",
    env: process.env.NODE_ENV,
  },
});

// app/api/orders/route.ts
import { randomUUID } from "node:crypto";
import { logger } from "@/lib/logger";
import { createOrder } from "@/lib/orders"; // defined in lib/orders.ts further down
 
export async function POST(req: Request) {
  const requestId = req.headers.get("x-request-id") ?? randomUUID();
  const log = logger.child({ requestId });
  const start = performance.now();
 
  try {
    const order = await createOrder(await req.json());
    log.info({ orderId: order.id, durationMs: performance.now() - start }, "order created");
    return Response.json(order, { headers: { "x-request-id": requestId } });
  } catch (err) {
    log.error(
      { err, durationMs: performance.now() - start },
      "order creation failed",
    );
    return Response.json({ error: "internal_error", requestId }, { status: 500 });
  }
}

Two details that matter more than they look. First, returning the requestId to the client means a user can paste it into a support ticket and you can find the exact failure in one query. Second, redact is not optional. The fastest way to turn an observability project into a security incident is to log a full request body with a session cookie in it.

Log levels people actually use

Pick a discipline and hold it. A version that survives contact with real on-call rotations:

error: something failed and a human or a retry needs to deal with it.
warn: degraded but handled. A fallback kicked in, a retry succeeded.
info: meaningful business events. Order created, payment captured, user signed up.
debug: noisy detail you switch on temporarily via LOG_LEVEL.

If everything is info, nothing is. The point of levels is that you can alert on error, sample info, and drop debug in production without losing the plot.
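To make "alert on error, sample info, drop debug" concrete, here is a minimal sketch of that keep-or-drop decision. The names shouldKeep and KeepPolicy and the 10 percent sample rate are hypothetical, not part of pino, and in practice this logic often lives in your log pipeline rather than in the app. The numeric ranks mirror pino's default level values.

```typescript
// Hypothetical helper, not a pino API: decides whether a log line is kept.
type Level = "debug" | "info" | "warn" | "error";

// Same numeric ordering pino uses for its default levels.
const LEVEL_RANK: Record<Level, number> = { debug: 20, info: 30, warn: 40, error: 50 };

interface KeepPolicy {
  minLevel: Level;        // anything below this level is dropped
  infoSampleRate: number; // fraction of info lines to keep, between 0 and 1
}

// `random` is injectable so the sampling decision can be tested deterministically.
function shouldKeep(
  level: Level,
  policy: KeepPolicy,
  random: () => number = Math.random,
): boolean {
  if (LEVEL_RANK[level] < LEVEL_RANK[policy.minLevel]) return false;
  if (level === "info") return random() < policy.infoSampleRate;
  return true; // warn and error are always kept
}

// Assumed production policy: drop debug, keep roughly 1 in 10 info lines.
const production: KeepPolicy = { minLevel: "info", infoSampleRate: 0.1 };
```

With this policy, every warn and error survives, debug never does, and info is thinned out. If an info line is a business event you cannot afford to lose, that is a hint it should not be sampled at all.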

Metrics tell you when, traces tell you why

Logs are great for the specific request you already know is broken. They are a terrible way to notice that error rates are climbing across thousands of requests. That is what metrics are for.

The three numbers worth watching first, known as the RED method, are Rate, Errors, and Duration per endpoint. Add saturation (CPU, memory, DB connection pool usage) as a fourth and you have covered most real outages. You can derive a lot of this from a hosting platform's built-in analytics, but the moment you have your own backend, exposing your own metrics pays off fast.
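If the RED numbers feel abstract, this rough sketch computes them from a window of raw request samples. Sample, summarize, and percentile are hypothetical names, and the nearest-rank percentile is a simplification: a real metrics library would use histograms instead of keeping every sample in memory.

```typescript
// One observed request. Hypothetical shape for illustration.
interface Sample {
  durationMs: number;
  ok: boolean;
}

interface RedSummary {
  ratePerSec: number; // Rate
  errorRate: number;  // Errors, as a fraction of all requests
  p95Ms: number;      // Duration, 95th percentile
}

// Nearest-rank percentile; sorts a copy so the input is left untouched.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Summarizes all samples observed during a window of `windowSec` seconds.
function summarize(samples: Sample[], windowSec: number): RedSummary {
  const errors = samples.filter((s) => !s.ok).length;
  return {
    ratePerSec: samples.length / windowSec,
    errorRate: samples.length === 0 ? 0 : errors / samples.length,
    p95Ms: percentile(samples.map((s) => s.durationMs), 95),
  };
}
```

Run this per endpoint rather than across the whole app. A healthy average across twenty routes can hide one checkout route that is on fire.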

When a metric alert fires, you will want to know why the slow requests are slow. Traces answer that. A trace breaks one request into spans: time in your handler, time in the database, time in each downstream call. OpenTelemetry is the vendor-neutral standard, and Next.js has first-class support for it.

// instrumentation.ts
import { registerOTel } from "@vercel/otel";
 
export function register() {
  registerOTel({ serviceName: "checkout-api" });
}

That one file gets you automatic spans for incoming requests and outgoing fetch calls. For the parts that matter most, your own database queries and business logic, add manual spans so the trace tells the real story.

// lib/orders.ts
import { SpanStatusCode, trace } from "@opentelemetry/api";

const tracer = trace.getTracer("orders");

export async function createOrder(input: OrderInput) {
  return tracer.startActiveSpan("createOrder", async (span) => {
    try {
      span.setAttribute("order.itemCount", input.items.length);
      // Shows up as a child span only if your DB client is instrumented.
      const order = await db.insertOrder(input);
      span.setAttribute("order.id", order.id);
      return order;
    } catch (err) {
      // recordException only attaches an event; setStatus marks the span as failed.
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}

Now when that p95 latency alert fires, you open one slow trace and see the answer in the waterfall: 2.1 seconds in a single Postgres query against a table that is missing an index. No guessing.

Connect the database, because it is usually the database

In a web app, the slow thing is the database far more often than people expect. Two cheap wins.

Turn on slow query logging in Postgres so the database tells you which statements are expensive before your users do:

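-- Log any statement that takes longer than 500ms.
-- ALTER SYSTEM needs superuser rights; managed Postgres services usually expose this setting through their own configuration instead.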
ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();

And keep an eye on connection pool saturation. A surprising number of "the app is down" incidents are really "every connection in the pool is checked out and new requests are queuing." Expose that as a metric and alert on it. A pool sitting at 95 percent usage is a slow-motion outage you can fix during business hours instead of at 2am.
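As a sketch of what "expose that as a metric" could look like, the helper below turns pool counters into a saturation number. It assumes a client that reports totalCount, idleCount, and waitingCount, which is what node-postgres exposes on its Pool. The article does not say which client you use, so treat the field names, poolStats, and isSaturated as assumptions.

```typescript
// Structural subset of the counters a node-postgres Pool exposes.
// Any object with these three numbers works, so no pg import is needed here.
interface PoolCounters {
  totalCount: number;   // connections currently open
  idleCount: number;    // open connections not checked out
  waitingCount: number; // callers queued for a connection
}

interface PoolStats {
  inUse: number;
  saturation: number; // fraction of the configured maximum in use
  waiting: number;
}

// `max` is the pool size you configured; it is passed in explicitly.
function poolStats(pool: PoolCounters, max: number): PoolStats {
  const inUse = pool.totalCount - pool.idleCount;
  return {
    inUse,
    saturation: max > 0 ? inUse / max : 0,
    waiting: pool.waitingCount,
  };
}

// Assumed rule: alert at 95 percent usage, or as soon as callers start queuing.
function isSaturated(stats: PoolStats, threshold = 0.95): boolean {
  return stats.saturation >= threshold || stats.waiting > 0;
}
```

Sample these numbers on a timer, for example every 10 seconds, and ship them wherever your other metrics go. The waiting count is the one to watch most closely, because queued callers are requests that are already slow.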

Alerts: fewer, sharper, owned

Observability without alerting is a dashboard nobody looks at until after the incident. But the opposite failure is worse: 40 alerts a day until the team mutes the channel and misses the one that mattered. Some rules that hold up:

Alert on symptoms users feel (error rate, latency, checkout failures), not on every internal metric.
Every alert needs an owner and a rough idea of what to do. An alert with no runbook is a notification, not an alert.
Page for things that need a human now. Everything else goes to a dashboard or a daily digest.

A good starting alert is simply: error rate above 2 percent for 5 minutes on a critical route. It is specific, it maps to user pain, and it is hard to argue with at 2am.
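Here is one way to read that rule as code, assuming you already have per-minute request and error counts for the route. MinuteBucket and shouldAlert are hypothetical names, and "for 5 minutes" is interpreted as every one of the last five minutes breaching the threshold, which is only one reasonable reading. Most alerting tools let you express the same thing declaratively.

```typescript
// Request and error counts for one minute on one route. Hypothetical shape.
interface MinuteBucket {
  requests: number;
  errors: number;
}

// Fires only when each of the last `minutes` buckets is above the threshold,
// so a single bad minute does not page anyone.
function shouldAlert(
  buckets: MinuteBucket[],
  threshold = 0.02,
  minutes = 5,
): boolean {
  if (buckets.length < minutes) return false; // not enough history yet
  return buckets
    .slice(-minutes)
    .every((b) => b.requests > 0 && b.errors / b.requests > threshold);
}
```

Minutes with zero traffic count as healthy here, which keeps a quiet route from paging on one failed request at 4am. Whether that is right for you depends on how much traffic the route normally sees.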

The takeaway

You do not need a six-figure observability platform to stop flying blind. Start with structured JSON logs and a request ID you can trace end to end. Add RED metrics on your critical endpoints, then OpenTelemetry traces so you can answer why something is slow without redeploying. Turn on Postgres slow query logging, because that is where the time usually goes. Wire up two or three sharp alerts tied to real user pain, and resist the urge to add more. Do that much and the next 2am incident becomes a ten-minute fix instead of an archaeology dig.

If you want a second set of eyes on your production setup before the next traffic spike, that is the kind of unglamorous work we genuinely enjoy.
