
How to Add Idempotency Keys to Your API Without Breaking Existing Clients

A field-tested guide to adding idempotency keys to a live API — Postgres-backed, opt-in, and safe to roll out without breaking the clients you already have in production.

Abhishek Sharma · Head of Engineering @ Fordel Studios

Here's how we approach this at Fordel. The first time a client double-charges a customer because their phone hiccuped on a flaky train, you stop treating idempotency as a Stripe-only luxury. The second time, you build it into every write endpoint that touches money, inventory, or external state.

This guide walks through the exact pattern we use — Postgres-backed, opt-in, and rolled out without forcing a single existing client to change a line of code.

What is an idempotency key actually for?

An idempotency key is a client-supplied string that lets the client say, "if you see this same key again, treat it as the same request — don't do the work twice." It is the contract that lets a client safely retry a POST without worrying that they just charged the customer twice or shipped the same parcel twice.

The server's job is to remember the key, the request, and the response — long enough that retries during the realistic failure window land on the same answer.
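From the client's side, the contract looks like this: generate one key per logical operation and reuse it on every retry. A minimal TypeScript sketch, assuming a fetch-based client; the URL and function name are illustrative:

typescript
import crypto from "crypto"

// One key per logical operation, not per HTTP attempt.
// Every retry of the same operation reuses the same key.
async function createChargeWithRetry(payload: object, maxAttempts = 3): Promise<Response> {
  const idempotencyKey = crypto.randomUUID()

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fetch("https://api.example.com/charges", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "Idempotency-Key": idempotencyKey, // identical on every attempt
        },
        body: JSON.stringify(payload),
      })
    } catch {
      // Network failure: we do not know if the server did the work.
      // Retrying with the SAME key lets the server dedupe for us.
    }
  }
  throw new Error("request did not complete after retries")
}

The key identifies the operation, not the attempt; that is the whole contract.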

Prerequisites

What you need before starting:
  • A backend with a real database — Postgres in our examples, but Redis or DynamoDB work with adjustments
  • Control over the API layer (middleware, framework hooks, or a gateway)
  • Clients you can talk to — even if you do not require the header, you want at least one client opting in for testing
  • An honest read on which endpoints are non-idempotent today (POST /charges, POST /orders, POST /webhooks/inbound, anything that calls a third-party API)
  • The ability to ship a database migration without a maintenance window

How does the storage layer work?

Start with the table. We use Postgres because it gives us transactions, advisory locks, and TTL via a simple cron — without pulling in another piece of infrastructure.

sql
CREATE TABLE idempotency_keys (
  tenant_id     uuid        NOT NULL,
  key           text        NOT NULL,
  request_hash  text        NOT NULL,
  request_path  text        NOT NULL,
  response_code int         NOT NULL,
  response_body jsonb       NOT NULL,
  response_headers jsonb    NOT NULL,
  status        text        NOT NULL CHECK (status IN ('in_flight', 'completed')),
  locked_at     timestamptz,
  created_at    timestamptz NOT NULL DEFAULT now(),
  completed_at  timestamptz,
  PRIMARY KEY (tenant_id, key)
);

CREATE INDEX idempotency_keys_created_at_idx
  ON idempotency_keys (created_at);

A few decisions baked into that schema:

Tenant scoping. The key is unique per tenant, not globally. Two different customers using the literal string "order-1" should never collide. If you skip this, you have a multi-tenant data leak waiting to happen.

Request hash. We store a SHA-256 of (method + path + body). On a replay, we recompute and compare. If the client sent the same key with a different body, that is a client bug — we return 409 instead of replaying the wrong response.

Status field. "in_flight" means we accepted the key and started work. "completed" means we have a response stored. Without this, two simultaneous retries race and both run the work.

How do you handle the request lifecycle?

The middleware does five things, in order:

1. If no Idempotency-Key header, pass through. This is what makes the rollout safe — clients who do not opt in see no change in behavior.

2. Try to insert a row with status='in_flight'. The primary key constraint does the heavy lifting. If the insert succeeds, this is a fresh request — proceed.

3. If the insert fails (key already exists), read the row. If status is 'completed', verify the request hash matches and replay the response. If the hashes diverge, return 409 Conflict with a body that explains the mismatch.

4. If status is 'in_flight', return 409 with a Retry-After header. Do not block waiting for the original to finish — that creates connection-pool pressure under load. Tell the client to retry in a second.

5. After the handler runs, update the row to 'completed' with the actual response, then send that response to the client as usual.

Here is the middleware in TypeScript, using Express and pg:

typescript
import type { Request, Response, NextFunction } from "express"
import crypto from "crypto"
import { pool } from "./db"

// tenantId is attached by the auth middleware and log by the request logger
// (pino-http, for example); declare them once so TypeScript knows about them.
declare module "express-serve-static-core" {
  interface Request {
    tenantId?: string
    log?: { error: (obj: object, msg: string) => void }
  }
}

const IDEMPOTENT_METHODS = new Set(["POST", "PATCH", "DELETE"])

export async function idempotency(req: Request, res: Response, next: NextFunction) {
  const key = req.header("Idempotency-Key")
  if (!key || !IDEMPOTENT_METHODS.has(req.method)) return next()

  const tenantId = req.tenantId // set by your auth middleware
  if (!tenantId) return next() // unauthenticated traffic skips idempotency
  const hash = crypto
    .createHash("sha256")
    .update(req.method + req.path + JSON.stringify(req.body ?? {}))
    .digest("hex")

  const claim = await pool.query(
    `INSERT INTO idempotency_keys
       (tenant_id, key, request_hash, request_path, response_code, response_body, response_headers, status)
     VALUES ($1, $2, $3, $4, 0, '{}'::jsonb, '{}'::jsonb, 'in_flight')
     ON CONFLICT (tenant_id, key) DO NOTHING
     RETURNING tenant_id`,
    [tenantId, key, hash, req.path]
  )

  if (claim.rowCount === 0) {
    const existing = await pool.query(
      `SELECT request_hash, request_path, response_code, response_body, response_headers, status
         FROM idempotency_keys WHERE tenant_id = $1 AND key = $2`,
      [tenantId, key]
    )
    const row = existing.rows[0]

    if (row.status === "in_flight") {
      res.setHeader("Retry-After", "1")
      return res.status(409).json({ error: "request_in_progress" })
    }
    if (row.request_hash !== hash || row.request_path !== req.path) {
      return res.status(409).json({ error: "idempotency_key_reused_with_different_body" })
    }
    res.set(row.response_headers)
    return res.status(row.response_code).json(row.response_body)
  }

  // Capture the outgoing response so we can persist it.
  const originalJson = res.json.bind(res)
  res.json = (body?: unknown) => {
    pool.query(
      `UPDATE idempotency_keys
         SET response_code = $3, response_body = $4, response_headers = $5,
             status = 'completed', completed_at = now()
       WHERE tenant_id = $1 AND key = $2`,
      [tenantId, key, res.statusCode, body, { "content-type": "application/json" }]
    ).catch((err) => {
      // Persistence failure is logged but does not break the response.
      req.log?.error({ err, key }, "idempotency_persist_failed")
    })
    return originalJson(body)
  }

  next()
}
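Wiring it up is one line per route; this sketch assumes the middleware lives in ./idempotency, and the route and response id are illustrative:

typescript
import express from "express"
import { idempotency } from "./idempotency"

const app = express()
app.use(express.json()) // the body must be parsed before the middleware hashes it
app.post("/charges", idempotency, async (_req, res) => {
  // ... do the actual work ...
  res.status(201).json({ id: "ch_123" })
})

Note the ordering: express.json() must run first, or req.body is undefined when the request hash is computed.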

How do you roll this out without breaking existing clients?

This is the part most guides skip. You have clients in production that do not send the header. You cannot make it required overnight. Here is the sequence we use:

Phase 1 — ship middleware, log only.

Deploy the middleware in observation mode. If a key is present, log the request hash and path but do not store anything yet. Watch your traffic for a week. You are looking for: clients already sending the header (some SDKs do by default), endpoints where keys would be most useful, and the volume of POST traffic that arrives with no key.
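A sketch of what observation mode can look like, assuming the same Express stack as the middleware above; the console.info call stands in for whatever structured logger you use:

typescript
import type { Request, Response, NextFunction } from "express"
import crypto from "crypto"

// Phase 1: observe only. No rows are written; the goal is to learn which
// clients already send the header and which routes would benefit most.
export function idempotencyObserver(req: Request, _res: Response, next: NextFunction) {
  const key = req.header("Idempotency-Key")
  if (key && req.method === "POST") {
    const hash = crypto
      .createHash("sha256")
      .update(req.method + req.path + JSON.stringify(req.body ?? {}))
      .digest("hex")
    console.info("idempotency_observed", { path: req.path, hash })
  }
  next()
}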

Phase 2 — turn on storage for one endpoint.

Pick the highest-stakes endpoint — usually payments or order creation. Turn the middleware on for that route only. Keep it opt-in: clients without the header keep working. Clients with the header get the new guarantees. Run for two weeks. Watch the 409 rate.

Phase 3 — expand by category, not by endpoint.

Group endpoints by blast radius: money, inventory, external API calls, internal-only writes. Turn on idempotency for the whole money group, then the whole inventory group. Endpoint-by-endpoint rollouts drag for months and you forget which ones are protected.

Phase 4 — make the header required for new clients only.

Old clients keep the optional behavior. New clients (a new API version, a new SDK release) get the header generated automatically by the SDK, and requests without it are rejected. This is the only way to migrate without a hard cutover. We tag this in the SDK as a breaking change at version bump time.


How long should you keep the keys around?

Stripe keeps idempotency keys for 24 hours. We default to 48 hours — long enough to cover a client retrying after a weekend on-call escalation, short enough that the table does not grow unbounded.

Run a daily cleanup job:

sql
DELETE FROM idempotency_keys
WHERE created_at < now() - interval '48 hours';
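If you would rather keep the schedule inside Postgres than in an external cron, pg_cron can run the same statement. This assumes the pg_cron extension is available, which nothing above requires:

sql
-- Assumes pg_cron; a plain OS cron job running psql works just as well.
SELECT cron.schedule(
  'idempotency-keys-cleanup',
  '15 3 * * *',  -- daily at 03:15
  $$DELETE FROM idempotency_keys WHERE created_at < now() - interval '48 hours'$$
);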

If you see clients legitimately retrying after 48 hours, that is a separate problem — they are queuing requests for too long and you have a different reliability conversation to have with them, not a TTL knob to tune.

What about webhooks coming in?

Inbound webhooks are the other half of this story. The provider is now your client, and they will retry aggressively. The good news: most providers send a unique event ID with each delivery (Stripe sends an id in the event payload, GitHub sends an X-GitHub-Delivery header, Slack's Events API includes an event_id in the payload).

Treat the provider's event ID as the idempotency key. Same table, same middleware, slightly different extraction — pull from the header or body instead of Idempotency-Key. The replay protection you get against duplicates is the same.
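A sketch of that extraction: a small adapter that derives the key from the provider's event identity, then falls through to the same middleware. The header and payload field names are the providers' documented ones; the adapter itself is hypothetical:

typescript
import type { Request, Response, NextFunction } from "express"

// Map the provider's event identity onto our Idempotency-Key header,
// then let the normal idempotency middleware handle everything else.
export function webhookKeyExtractor(req: Request, _res: Response, next: NextFunction) {
  const eventId =
    req.header("X-GitHub-Delivery") ??          // GitHub: delivery ID header
    (req.body?.id as string | undefined) ??     // Stripe: event id in the payload
    (req.body?.event_id as string | undefined)  // Slack Events API: event_id in the payload

  if (eventId) {
    // Prefix with the route so two providers' IDs can never collide.
    req.headers["idempotency-key"] = `webhook:${req.path}:${eventId}`
  }
  next()
}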

Common mistakes we have made (so you do not have to)

Hashing the wrong thing.

If you hash the body without the path, two different endpoints with the same payload will collide. If you hash with timestamps the client did not send, your hash is non-deterministic and replays always 409. Hash exactly what the client controls: method, path, body.

Storing the response in a separate transaction from the work.

If the work commits but the idempotency row update fails, the next retry will run the work again. Either persist the response in the same transaction as the work, or accept that the response storage is best-effort and make the work itself idempotent at the database layer with unique constraints.
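A sketch of the first option, with the business write and the idempotency update in one transaction; the charges table and response shape are illustrative:

typescript
import { pool } from "./db"

// The charge row and the idempotency row commit or roll back together,
// so a retry can never observe "work done, but key still in_flight".
async function createCharge(tenantId: string, key: string, amountCents: number) {
  const client = await pool.connect()
  try {
    await client.query("BEGIN")
    const charge = await client.query(
      `INSERT INTO charges (tenant_id, amount_cents) VALUES ($1, $2) RETURNING id`,
      [tenantId, amountCents]
    )
    const body = { id: charge.rows[0].id, amount_cents: amountCents }
    await client.query(
      `UPDATE idempotency_keys
         SET response_code = 201, response_body = $3,
             status = 'completed', completed_at = now()
       WHERE tenant_id = $1 AND key = $2`,
      [tenantId, key, body]
    )
    await client.query("COMMIT")
    return body
  } catch (err) {
    await client.query("ROLLBACK")
    throw err
  } finally {
    client.release()
  }
}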

Forgetting tenant scoping.

Already covered above, but worth repeating. A globally unique keyspace across tenants is a security bug pretending to be a schema decision.

Treating in_flight as a lock.

The status field is informational, not a mutex. If you block waiting for in_flight to flip to completed, you have built a distributed lock with no timeout, no fairness guarantees, and a connection pool that will exhaust under retry storms. Tell the client 409 + Retry-After and let them back off.

Allowing the body to drift.

Some clients add a client-side timestamp to every request. If that ends up in the body, every retry has a new hash and every retry creates a new key. Either strip volatile fields before hashing, or document the wire format clearly so SDKs know what to send.
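A sketch of the first option, assuming the volatile fields are known up front; the field names here are illustrative:

typescript
import crypto from "crypto"

// Fields a client is allowed to vary between retries of the same request.
// Everything else participates in the hash.
const VOLATILE_FIELDS = new Set(["client_timestamp", "trace_id"])

function stableRequestHash(method: string, path: string, body: Record<string, unknown>) {
  const filtered = Object.fromEntries(
    Object.entries(body)
      .filter(([field]) => !VOLATILE_FIELDS.has(field))
      .sort(([a], [b]) => a.localeCompare(b)) // key order must not affect the hash
  )
  return crypto
    .createHash("sha256")
    .update(method + path + JSON.stringify(filtered))
    .digest("hex")
}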

Skipping it for "internal" services.

Internal services retry too. Service meshes retry. Sidecar proxies retry. The boundary you think is safe is not. If the endpoint mutates state, it gets idempotency.

What you now have

A POST endpoint that survives client retries cleanly. A webhook receiver that survives provider retries cleanly. A schema that's queryable when an incident manager asks "did this run twice?" — and a clear answer in the form of a row with two timestamps.
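When that question comes up mid-incident, it is one query against the schema from earlier; the tenant and key values are placeholders:

sql
-- "Did this run twice?" One row per (tenant, key): created_at is the first
-- attempt, completed_at is when the stored response was written.
SELECT key, status, request_path, response_code, created_at, completed_at
FROM idempotency_keys
WHERE tenant_id = '00000000-0000-0000-0000-000000000000'
  AND key = 'order-1';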

More importantly, you have a pattern your team can apply to every new endpoint without thinking about it. The middleware is one decorator, one header check, one database call. Once it is in place, the cognitive overhead of adding it to a new endpoint is roughly zero.

That is the actual win. Not the bug you fixed today — the bug you will not write tomorrow.
