If your web application does anything beyond serving HTML, you need background jobs. Sending emails, processing uploads, generating reports, syncing with third-party APIs -- all of these should happen outside the request-response cycle. Yet in half the codebases we inherit, background processing is either nonexistent or bolted on with setTimeout calls and cron jobs nobody monitors.
A client's application crashed during a sales event because their order confirmation endpoint sent emails synchronously. When email delivery slowed under load, request timeouts cascaded, the connection pool was exhausted, and the entire application went down. Orders were lost. The fix was straightforward, but the damage was already done.
Our standard architecture has five components, set up on day one of every project.
The queue: BullMQ with Redis. Named queues, job priorities, delayed jobs, repeatable jobs replacing cron, rate limiting, and automatic retries with exponential backoff. Because jobs are persisted in Redis, a job that has been enqueued will be processed even if the worker crashes and restarts.
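As a sketch of the retry behavior, the options below mirror the setup described above (the specific numbers are illustrative, not our production values), and the helper computes the resulting retry schedule using the delay * 2^(attempt - 1) formula that BullMQ's built-in "exponential" backoff strategy applies:

```typescript
// Illustrative default job options in the shape BullMQ accepts.
const defaultJobOptions = {
  attempts: 3, // total tries before the job is considered failed
  backoff: { type: "exponential" as const, delay: 1000 }, // base delay in ms
};

// Compute the wait before each retry: base * 2^(attempt - 1).
function backoffDelays(baseMs: number, attempts: number): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= attempts; attempt++) {
    delays.push(baseMs * 2 ** (attempt - 1));
  }
  return delays;
}

// With a 1s base and 3 attempts the retries wait 1s, 2s, then 4s.
console.log(backoffDelays(defaultJobOptions.backoff.delay, defaultJobOptions.attempts));
```

The doubling schedule is the point: transient failures (a slow SMTP server, a flaky API) usually clear within a few seconds, and backing off keeps retries from hammering an already-struggling dependency.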
Workers: separate processes from the web server with their own resource limits. A background job consuming too much CPU or memory should never affect your API response times. We run workers as separate containers or Railway services that can be scaled independently.
Job schemas: every job has a Zod-typed payload validated both when enqueuing and when processing. This catches schema drift between the API code that creates jobs and the worker code that processes them.
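A minimal sketch of the pattern, with a hand-rolled validator standing in for the Zod schema (the job name and fields are hypothetical): the same function runs on the producer side before enqueueing and on the worker side before processing, so drift between the two codebases fails loudly instead of corrupting data.

```typescript
// Stand-in for a Zod schema; in real projects this would be z.object({...}).parse.
interface SendEmailPayload {
  to: string;
  subject: string;
}

function validateSendEmail(data: unknown): SendEmailPayload {
  const d = data as Record<string, unknown>;
  if (typeof d?.to !== "string" || !d.to.includes("@")) {
    throw new Error("sendEmail payload: invalid 'to' address");
  }
  if (typeof d?.subject !== "string" || d.subject.length === 0) {
    throw new Error("sendEmail payload: missing 'subject'");
  }
  return { to: d.to, subject: d.subject };
}

// Producer side: validate before the job ever reaches the queue.
const payload = validateSendEmail({ to: "user@example.com", subject: "Order confirmed" });

// Worker side: validate again on receipt. If the producer was deployed
// with a newer schema than the worker, this throws immediately instead of
// failing halfway through processing.
const checked = validateSendEmail(payload);
```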
Dead letter queues: jobs that exhaust all retry attempts land here with full payload, error message, and stack trace preserved. We review the dead letter queue regularly and can replay failed jobs with a single command once the underlying issue is fixed.
Observability: structured logs at three points -- enqueue, processing start, and completion or failure. We track queue depth, processing latency, and failure rate as metrics. Alerts fire if queue depth grows for fifteen minutes or failure rate exceeds five percent.
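The two alert conditions reduce to simple predicates. This sketch assumes queue depth is sampled once per minute, so fifteen consecutive samples cover the fifteen-minute window; the exact sampling cadence and thresholds are assumptions, not a prescription.

```typescript
// Alert if queue depth has grown over the last 15 samples: no sample dips
// below its predecessor, and the window ends higher than it started.
function queueDepthGrowing(depthSamples: number[]): boolean {
  if (depthSamples.length < 15) return false;
  const recent = depthSamples.slice(-15);
  const nonDecreasing = recent.every((d, i) => i === 0 || d >= recent[i - 1]);
  return nonDecreasing && recent[recent.length - 1] > recent[0];
}

// Alert if more than 5% of jobs in the window failed.
function failureRateTooHigh(failed: number, total: number): boolean {
  return total > 0 && failed / total > 0.05;
}
```

Requiring sustained growth rather than a single spike keeps the depth alert from firing on a normal burst of enqueued work that the workers are already draining.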
For email: enqueue with five-second delay, retry three times with exponential backoff. For file processing: presigned URL in the job payload, generous timeout, result stored in object storage. For third-party syncs: rate-limited queues respecting API limits, idempotency keys preventing duplicates, and resume-from-last-success timestamps.
Total setup time: about four hours on a new project. Four hours on day one saves weeks of firefighting later.
About the Author
Fordel Studios
AI-native app development for startups and growing teams. 14+ years of experience shipping production software.