Docker in Production: 15 Lessons We Learned the Hard Way

We have been running Docker in production since 2018 and have made every mistake possible. Here are fifteen lessons, in order of how painfully we learned them.

One: your image is too big. We audited a client's 1.2GB Node.js image: full Debian base, dev dependencies included, .git directory copied in. After we moved it to a multi-stage build on Alpine, it came to 142MB. Build and deploy times dropped 60%.

Two: always use multi-stage builds. Build stage installs everything and compiles. Production stage copies only compiled output and production dependencies.

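Three: pin your base image versions. A floating tag like node:18-alpine is repointed to new releases without warning, so two builds of the same Dockerfile can produce different images. Use a specific tag such as node:18.19.0-alpine3.19 and update deliberately.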

Four: do not run as root. Containers run as root unless the image says otherwise. Add a USER directive to your Dockerfile. Three lines of configuration that sharply limit what a compromised process can do.

Five: health checks are not optional. Without them, Docker only knows that the process is running, not whether your application is working. We have seen containers where the Node process was alive but deadlocked, serving zero requests.
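One way to make a health check reflect real work rather than process liveness is to have the endpoint run a cheap dependency probe under a timeout. The following is a minimal sketch, not our production code: the `/healthz` path, the `Probe` type, and the function names are all assumptions made for illustration. The probe would typically be something like a trivial database query.

```typescript
import * as http from "node:http";

// Hypothetical probe: resolves if the dependency answered, rejects otherwise.
type Probe = () => Promise<void>;

// Map a health result to an HTTP status: 200 when healthy, 503 otherwise.
function statusFor(healthy: boolean): number {
  return healthy ? 200 : 503;
}

// Run the probe and report false if it fails or exceeds the timeout.
function checkHealth(probe: Probe, timeoutMs: number): Promise<boolean> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve(false), timeoutMs);
    // Promise.resolve().then(...) also turns a synchronous throw into a rejection.
    Promise.resolve()
      .then(probe)
      .then(
        () => { clearTimeout(timer); resolve(true); },
        () => { clearTimeout(timer); resolve(false); },
      );
  });
}

// Build (but do not start) a server that answers only the health path.
function createHealthServer(probe: Probe, timeoutMs = 2000): http.Server {
  return http.createServer(async (req, res) => {
    if (req.url !== "/healthz") {
      res.writeHead(404);
      res.end();
      return;
    }
    const healthy = await checkHealth(probe, timeoutMs);
    res.writeHead(statusFor(healthy), { "Content-Type": "application/json" });
    res.end(JSON.stringify({ healthy }));
  });
}
```

A HEALTHCHECK instruction in the Dockerfile can then request this endpoint on an interval. If the event loop itself is blocked, the endpoint never answers, and the check's own timeout marks the container unhealthy.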

Six: handle SIGTERM. When a container is stopped, Docker sends SIGTERM, waits 10 seconds by default, then sends SIGKILL. Without a signal handler, that kill drops in-flight requests and leaks connections.
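A graceful shutdown handler for the point above might look like the following sketch. It assumes a server with a Node-style `close(callback)` method, such as `http.Server`. The names `createShutdown`, `Closeable`, and `graceMs` are illustrative, and the 8-second grace period is an assumption chosen to finish before the 10-second SIGKILL.

```typescript
// Anything with a Node-style close(callback), for example http.Server.
interface Closeable {
  close(callback: (err?: Error) => void): unknown;
}

// Build a shutdown function: stop accepting new work, wait for in-flight
// requests to drain, then exit. A timer forces the exit if draining stalls.
function createShutdown(
  server: Closeable,
  exit: (code: number) => void,
  graceMs = 8000, // assumption: kept below the 10 s default stop timeout
): () => void {
  let shuttingDown = false;
  return () => {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    const timer = setTimeout(() => exit(1), graceMs);
    server.close((err) => {
      clearTimeout(timer);
      exit(err ? 1 : 0);
    });
  };
}

// Wiring, left commented out so the sketch has no side effects when loaded:
// const server = http.createServer(handler).listen(3000);
// const shutdown = createShutdown(server, (code) => process.exit(code));
// process.on("SIGTERM", shutdown);
// process.on("SIGINT", shutdown);
```

Passing `exit` in as a parameter keeps the handler testable without terminating the test process.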

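Seven: use tini as the init process for proper signal forwarding and zombie process reaping. Running the container with docker run --init gives you the same behavior without changing the image.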

Eight: log to stdout, not files. Docker captures stdout natively, and its logging drivers can forward it to most logging platforms.
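Logging to stdout works best when each line is a self-contained JSON object that a collector can parse without custom rules. Here is a minimal sketch. The field names `ts`, `level`, and `msg` are a common convention that we assume here, not something Docker requires.

```typescript
type Level = "debug" | "info" | "warn" | "error";

// Serialize one log event as a single JSON line.
// The fixed fields come last so caller-supplied fields cannot overwrite them.
function formatLogLine(
  level: Level,
  msg: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({ ...fields, ts: now.toISOString(), level, msg });
}

// Write to stdout only; Docker's logging driver takes it from there.
function log(level: Level, msg: string, fields?: Record<string, unknown>): void {
  process.stdout.write(formatLogLine(level, msg, fields) + "\n");
}

log("info", "server started", { port: 3000 });
```

`JSON.stringify` escapes embedded newlines, so one event always stays on one line, even when a message contains a stack trace.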

Nine: secrets do not belong in images. Never COPY .env files into the image. Use runtime environment variables or a secrets manager.
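Reading secrets at runtime pairs well with failing fast when one is missing, so that a misconfigured container exits at startup rather than on the first request. In this sketch the variable names `DATABASE_URL` and `API_KEY` are hypothetical, as are the function names.

```typescript
type Env = Record<string, string | undefined>;

// Read a required setting from the environment, failing fast if it is absent.
// The error names the variable but never prints a value.
function requireEnv(name: string, env: Env = process.env): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Load everything once at startup so a bad deploy fails immediately.
function loadConfig(env: Env = process.env) {
  return {
    databaseUrl: requireEnv("DATABASE_URL", env), // hypothetical name
    apiKey: requireEnv("API_KEY", env),           // hypothetical name
  };
}
```

The values are supplied when the container starts, for example with `docker run -e` or `--env-file`, so the image itself contains no secrets.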

Ten: local Docker lies. Docker on Mac runs containers inside a Linux VM, so networking, filesystem behavior, and resource limits all differ from a production Linux host.

Eleven: Docker Compose is not a production orchestrator. No rolling updates, no health-based restarts, no scaling across hosts. Use Swarm at a minimum, or a managed platform.

Twelve: set memory and CPU limits. Without them, one container can starve everything else on the host.

Thirteen: optimize layer caching. Copy package.json and the lockfile and install dependencies before copying application code, so the dependency layer is rebuilt only when the dependencies change.

Fourteen: tag images with git commit SHA, not just "latest." Know exactly what code is deployed during incidents.

Fifteen: automate everything. SSH deploys are one typo from an outage. CI/CD with zero manual steps, every time.

These lessons took four years and multiple incidents. Implement them from day one.
