Most engineers who struggle with Kafka in production are using it like RabbitMQ. It is not RabbitMQ. Kafka is fundamentally different. Once you internalize the log model, the rest — replication, consumer groups, exactly-once — becomes obvious rather than magical.
What Problem Does Kafka Actually Solve?
Traditional message queues — RabbitMQ, SQS, ActiveMQ — are built around a consumption model: a message exists until a consumer acknowledges it, then it is gone. This works for task queues. It breaks down the moment you need two different systems to independently process the same stream of events, or when you need to replay historical data after a consumer bug, or when you need to process 1.5 million events per second without the broker becoming the bottleneck.
Kafka was built at LinkedIn to solve the data pipeline problem at scale: multiple producer systems generating events, multiple consumer systems each needing their own view of that data, at write volumes that would melt a conventional broker. The solution was not a smarter queue. It was a different primitive entirely.
The core insight: if you store events in an immutable, append-only log and let consumers track their own read position, the broker no longer needs to manage per-consumer state. The broker just manages the log. Consumers manage themselves. This separation is why thousands of independent consumers can read the same topic without the broker maintaining any delivery state for them.
What Is the Mental Model for Kafka?
Think of Kafka as a distributed filesystem where the only operation is append. Each topic is a logical namespace. Each topic is split into partitions. Each partition is an ordered, immutable sequence of records — a segment of the log. Records within a partition are addressed by a monotonically increasing integer called an offset.
That is the entire data model. Everything else — producers, consumers, replication, consumer groups — is built on top of this.
- Topic — logical stream name, contains 1 to N partitions
- Partition — ordered, immutable log segment; unit of parallelism and replication
- Offset — position of a record within a partition; assigned by the broker at write time
- Consumer Group — set of consumers that jointly consume a topic; each partition is owned by exactly one group member at any time
On disk, each partition is a directory of segment files. A segment file is named by the offset of its first record — 00000000000000000000.log, 00000000000001048576.log, and so on. Kafka does not use a random-access database. It uses sequential disk writes, which on modern SSDs and spinning disks alike are dramatically faster than random I/O.
This is not a detail. It is the reason Kafka can sustain millions of writes per second on hardware that would buckle under equivalent load from a PostgreSQL-backed queue.
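Offset-to-segment lookup is simple enough to sketch. Here is a simplified, hypothetical Python version; the real broker additionally keeps a sparse .index file per segment to seek within it rather than scanning from the segment's start:

```python
import bisect

def segment_for_offset(segment_base_offsets: list[int], offset: int) -> str:
    """Find the segment file containing `offset`.

    Segment files are named by the offset of their first record, so the
    containing segment is the one with the largest base offset <= offset.
    Simplified sketch of the broker's lookup.
    """
    i = bisect.bisect_right(segment_base_offsets, offset) - 1
    if i < 0:
        raise LookupError(f"offset {offset} is below the log start offset")
    return f"{segment_base_offsets[i]:020d}.log"

bases = [0, 1048576, 2097152]
print(segment_for_offset(bases, 1500000))  # 00000000000001048576.log
```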
How Does Kafka Replication Actually Work?
Every partition has exactly one leader and zero or more followers. The leader handles all reads and writes for that partition. Followers replicate by fetching from the leader, exactly as a consumer would — they issue fetch requests and apply the received records to their own local log.
The leader tracks which followers are keeping up via the in-sync replica set, commonly abbreviated ISR. A replica is considered in-sync if it has caught up to the leader's log end offset within the configured replica.lag.time.max.ms window (default: 30 seconds). If a follower falls behind — due to slow disk, GC pause, network partition — it is removed from the ISR.
This matters enormously for durability guarantees. The producer acks=all setting does not mean all replicas. It means all in-sync replicas. If two of your three replicas are lagging and fall out of the ISR, a single replica acknowledging the write is sufficient. You can restore the full replica count guarantee by setting min.insync.replicas=2 or more, which causes the broker to reject writes when the ISR shrinks below that threshold — trading availability for durability.
| Setting | Who Acknowledges | Durability | Latency Impact | Failure Risk |
|---|---|---|---|---|
| acks=0 | No one (fire and forget) | None | Lowest | Records lost silently on any broker failure |
| acks=1 | Partition leader only | Leader's local log | Low | Records lost if the leader fails before followers replicate |
| acks=all | All in-sync replicas | Strongest, bounded by min.insync.replicas | Highest | Writes rejected while the ISR is below min.insync.replicas |
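A minimal configuration sketch for a durability-first topic, assuming a replication factor of 3 (values are illustrative, not universal defaults):

```properties
# Producer side: require acknowledgement from all in-sync replicas,
# and deduplicate broker-side on retry
acks=all
enable.idempotence=true

# Broker/topic side: refuse writes when fewer than 2 replicas are in sync,
# so acks=all can never quietly degrade to a single-replica acknowledgement
min.insync.replicas=2
```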
The high-watermark is the offset of the last record that has been replicated to all in-sync replicas. Consumers can only read up to the high-watermark — not the leader end offset. This ensures a consumer never reads a record that could be rolled back due to a leader failure before replication completes. It is a critical correctness guarantee that confuses engineers who expect to immediately read what they just wrote.
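The rule is easy to state in code. A simplified sketch, treating each replica's log end offset as a single integer:

```python
def high_watermark(log_end_offsets: dict[str, int], isr: set[str]) -> int:
    """The high-watermark is the minimum log end offset across the ISR:
    every in-sync replica holds everything below it, so those records can
    survive a clean leader failover. Simplified sketch."""
    return min(log_end_offsets[r] for r in isr)

offsets = {"broker-1": 120, "broker-2": 117, "broker-3": 95}
# broker-3 has fallen behind and dropped out of the ISR
isr = {"broker-1", "broker-2"}
print(high_watermark(offsets, isr))  # 117: consumers cannot see the leader's 120 yet
```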
How Do Consumer Groups Actually Work?
A consumer group is a collection of consumers that collectively consume a topic. Kafka guarantees that each partition is assigned to at most one consumer within a group at any given time. This is the unit of horizontal scaling: add consumers to a group to parallelize processing, up to the number of partitions.
The orchestration of this assignment happens through the group coordinator — a broker elected to manage a specific consumer group based on a hash of the group ID. When a consumer starts, it sends a FindCoordinator request to locate its group coordinator, then sends a JoinGroup request. The coordinator elects one consumer as the group leader — not to be confused with the partition leader — which runs the partition assignor algorithm and sends the resulting assignment back through the coordinator to all group members.
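The coordinator lookup itself is just a hash. A simplified sketch of the mapping, reproducing Java's String.hashCode in Python (the real broker masks the sign bit rather than using Math.abs, which this glosses over):

```python
def java_string_hash(s: str) -> int:
    """Java's String.hashCode(), reproduced as a 32-bit signed integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def coordinator_partition(group_id: str, offsets_topic_partitions: int = 50) -> int:
    """The group's coordinator is the broker leading this partition of
    __consumer_offsets (50 partitions by default). Simplified sketch of
    the abs(groupId.hashCode) % numPartitions rule."""
    return abs(java_string_hash(group_id)) % offsets_topic_partitions

print(coordinator_partition("payments-service"))
```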
This is where the rebalance happens. During a rebalance, all consumers in the group stop processing — this is known as a stop-the-world rebalance. In practice, with large groups and slow consumers, rebalances can take 30 to 90 seconds. This is not a bug; it is the cost of the coordination protocol. Three events trigger a rebalance:
- Consumer joins or leaves the group — deploy, crash, or scale event
- Consumer exceeds max.poll.interval.ms between poll() calls — slow message processing or JVM GC pause
- Session timeout expires — consumer stops sending heartbeats due to network partition, OOM kill, or thread deadlock
Kafka 2.4 introduced incremental cooperative rebalancing via the CooperativeStickyAssignor. Instead of revoking all partition assignments at the start of a rebalance, only the partitions that need to move are revoked. This eliminates the stop-the-world characteristic for most rebalance scenarios. If you are still using the default RangeAssignor or RoundRobinAssignor, you are leaving significant availability on the table.
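The difference between the two protocols can be sketched in a few lines. A hypothetical comparison of what each consumer must revoke when a new member joins:

```python
def revoked_partitions(old: dict[str, set[int]],
                       new: dict[str, set[int]],
                       cooperative: bool) -> dict[str, set[int]]:
    """Partitions each existing consumer must stop processing during a
    rebalance. Eager protocols (Range, RoundRobin) revoke everything up
    front; cooperative protocols revoke only partitions that actually move
    to another consumer. Simplified sketch."""
    if not cooperative:
        return {c: set(parts) for c, parts in old.items()}
    return {c: old.get(c, set()) - new.get(c, set()) for c in old}

old = {"c1": {0, 1, 2}, "c2": {3, 4, 5}}
new = {"c1": {0, 1}, "c2": {3, 4}, "c3": {2, 5}}  # c3 just joined

print(revoked_partitions(old, new, cooperative=False))  # everything pauses
print(revoked_partitions(old, new, cooperative=True))   # only the two moving partitions
```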
How Does Exactly-Once Semantics Actually Work?
This is the part the Kafka documentation explains in the most confusing possible way. Here is the precise version.
Without any special configuration, Kafka gives you at-least-once delivery: if a producer retries after a timeout, the broker may write the same record twice. If a consumer crashes before committing its offset, it will re-process records it already processed. At-least-once is the default and the practical baseline for most workloads.
Idempotent producers (enable.idempotence=true) solve the producer-side duplication. Each producer is assigned a PID (producer ID) and attaches a monotonically increasing sequence number to each record. The broker deduplicates retries within the same session using this (PID, partition, sequence) tuple. This eliminates duplicate writes from producer retries — but only within a single producer session. If the producer restarts, it gets a new PID and deduplication is reset.
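The broker-side check can be sketched as follows: a simplified, single-partition model of the (PID, sequence) deduplication logic:

```python
class PartitionLog:
    """Sketch of the broker's sequence check for idempotent producers.
    The broker tracks the last appended sequence number per producer ID
    and silently acknowledges (but does not re-append) retried batches."""

    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer id -> last appended sequence number

    def append(self, pid: int, seq: int, value: bytes) -> bool:
        last = self.last_seq.get(pid, -1)
        if seq <= last:
            return False  # duplicate retry: acknowledge, do not append
        if seq != last + 1:
            raise RuntimeError("sequence gap")  # fatal error in real Kafka
        self.last_seq[pid] = seq
        self.records.append(value)
        return True

log = PartitionLog()
log.append(pid=7, seq=0, value=b"a")
log.append(pid=7, seq=1, value=b"b")
log.append(pid=7, seq=1, value=b"b")  # retry after a timeout: deduplicated
print(log.records)  # [b'a', b'b']
```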
Transactional producers extend this with a stable transactional.id that persists across producer restarts. The broker uses the transactional ID to fence zombie producers and guarantees that either all records in a transaction are committed or none are — even across multiple partitions. Consumers configured with isolation.level=read_committed will only read records from committed transactions.
The combination of transactional producers and read-committed consumers gives you exactly-once semantics for a Kafka-to-Kafka pipeline: read-process-write. It does not give you exactly-once against external systems like databases — that requires your consumer to participate in a two-phase commit or implement idempotent writes on the sink side. This distinction trips up engineers constantly.
“Exactly-once in Kafka means exactly-once within Kafka. The moment you write to Postgres, you are back to at-least-once unless you design your sink for idempotency.”
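A common way to get the sink-side idempotency the quote describes is to key writes by the record's coordinates. A minimal sketch, with a dict standing in for a database table that has a unique index on (topic, partition, offset):

```python
def write_idempotently(db: dict, topic: str, partition: int,
                       offset: int, row: dict) -> bool:
    """Upsert keyed by the record's (topic, partition, offset).
    Reprocessing after a consumer crash rewrites the same key instead of
    inserting a duplicate, so at-least-once delivery produces an
    effectively-once effect on the sink. Returns True on first write."""
    key = (topic, partition, offset)
    existed = key in db
    db[key] = row  # upsert
    return not existed

db = {}
write_idempotently(db, "orders", 0, 42, {"amount": 10})
write_idempotently(db, "orders", 0, 42, {"amount": 10})  # redelivery: no duplicate
print(len(db))  # 1
```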
Where Does the Performance Actually Come From?
Kafka achieves its throughput through four compounding design decisions that are invisible to most users but worth understanding deeply.
First: sequential I/O. Partitions are append-only, and the segment-file layout means the storage subsystem is almost always doing sequential writes. On modern NVMe, sequential throughput is 5 to 10 times random throughput. Kafka exploits this fully.
Second: the OS page cache. Kafka explicitly avoids caching data in JVM heap. It writes to disk and relies on the OS page cache to serve reads. Since recent writes are almost always in cache, most consumer reads never actually hit disk. The JVM heap stays small, GC pauses are short, and the OS handles cache eviction automatically using LRU — without Kafka needing to implement any of it.
Third: zero-copy transfer. When a consumer fetches data, Kafka uses the Linux sendfile(2) syscall — exposed in Java via FileChannel.transferTo — to copy data from the page cache directly to the network socket buffer, bypassing the application heap entirely. No serialization, no intermediate buffers, no user-space copy. This is how a single broker can saturate a 10Gbps NIC.
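The same syscall is reachable from Python, which makes the zero-copy path easy to demonstrate. A Linux-oriented sketch: os.sendfile moves bytes from the page cache into the socket buffer without them ever passing through this process's user-space heap.

```python
import os
import socket
import tempfile

payload = b"record-batch-bytes" * 64  # stand-in for a stored record batch

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name

left, right = socket.socketpair()
with open(path, "rb") as segment:
    sent = 0
    # sendfile(2): kernel copies file -> socket directly; no user-space copy
    while sent < len(payload):
        sent += os.sendfile(left.fileno(), segment.fileno(), sent,
                            len(payload) - sent)
left.close()

received = b""
while len(received) < len(payload):
    received += right.recv(4096)
right.close()
os.unlink(path)
print(received == payload)  # True
```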
Fourth: batching and compression. Producers batch records before sending. Brokers store batches as-is. Consumers decompress at read time. The batch is the unit of I/O throughout the entire pipeline. ZSTD compression at the batch level routinely achieves 4 to 10 times size reduction on structured event payloads.
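The effect is easy to reproduce. A sketch using stdlib gzip in place of ZSTD; the principle is identical: a batch of structured events shares keys and shapes, so it compresses far better than individual records would.

```python
import gzip
import json

# 500 structured events with repeated field names and similar values,
# serialized as one batch (newline-delimited JSON for illustration)
events = [
    {"user_id": i, "event": "page_view", "path": "/pricing", "ms": 12 + i % 7}
    for i in range(500)
]
batch = b"\n".join(json.dumps(e).encode() for e in events)

compressed = gzip.compress(batch)
print(f"{len(batch)} -> {len(compressed)} bytes "
      f"({len(batch) / len(compressed):.1f}x)")
```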
What Are the Failure Modes Nobody Talks About?
The happy path in Kafka is very happy. The failure modes require architectural preparation before the incident, not during it.
ISR shrink under load. Under sustained load or a network partition, followers fall behind and are removed from the ISR. If you have set min.insync.replicas=2 with a replication factor of 3 and two replicas go offline, your producers will receive NotEnoughReplicas exceptions. The cluster is healthy — it is doing exactly what you configured — but your producers are blocked. Decide the durability versus availability trade-off deliberately, and alert on ISR shrinkage so you see it before producers start failing.
Unclean leader election. If all in-sync replicas are unavailable and unclean.leader.election.enable=true (the default in older Kafka versions, now false), Kafka will elect an out-of-sync replica as leader to restore availability. Records the previous leader had already committed can vanish, and consumer offsets can be rewound past data that no longer exists. This is data loss by design. Set this to false on any durability-sensitive topic.
Consumer lag exceeding retention. Kafka retains data for a configured period (default 7 days) regardless of whether it has been consumed. If your consumer falls behind and your retention is configured purely by time (log.retention.hours), records are deleted even if unread. Engineers configure time-based retention without thinking about worst-case consumer downtime. Size-based retention (log.retention.bytes) is frequently the safer choice.
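Whether a lagged consumer will lose data to time-based retention is back-of-envelope arithmetic. A rough sketch, ignoring rate variance:

```python
def consumer_outruns_retention(lag_records: int, consume_rate_per_s: float,
                               produce_rate_per_s: float,
                               retention_s: float) -> bool:
    """Will time-based retention delete records before a lagging consumer
    reads them? The lag drains at (consume - produce) records/s; if
    catch-up takes longer than the retention window, the oldest unread
    records are deleted first. Back-of-envelope sketch."""
    drain_rate = consume_rate_per_s - produce_rate_per_s
    if drain_rate <= 0:
        return True  # never catches up: loss is eventually guaranteed
    return lag_records / drain_rate > retention_s

# 500M records behind, draining at a net 500 records/s, 7-day retention
print(consumer_outruns_retention(500_000_000, 10_500, 10_000, 7 * 86_400))  # True
```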
Log compaction edge cases. Compacted topics retain only the latest record per key — useful for changelog topics materializing state from events. Compaction runs asynchronously in the background. A consumer reading a compacted topic might still see old values for a key before compaction runs. More critically, tombstone records — null-value records signaling deletion — have their own configurable retention window via delete.retention.ms. Read a tombstone after it expires and you will never know a deletion happened. This is a correctness issue for consumers rebuilding state from scratch.
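Compaction's core rule fits in a few lines. A simplified sketch that keeps the latest value per key; the tombstone-expiry behavior described above is what this model deliberately omits:

```python
def compact(log):
    """Log compaction sketch: keep only the latest record per key, in
    first-seen key order. Tombstones (None values) survive here; the real
    broker also drops tombstones older than delete.retention.ms, which is
    what makes late readers miss deletions."""
    latest = {}
    for key, value in log:
        latest[key] = value
    return list(latest.items())

log = [("user:1", b"v1"), ("user:2", b"v1"), ("user:1", b"v2"), ("user:2", None)]
print(compact(log))  # [('user:1', b'v2'), ('user:2', None)]
```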
Rebalance storms. A consumer that consistently exceeds max.poll.interval.ms due to slow processing will be kicked from the group, triggering a rebalance. If the slow processing is then retried by the newly assigned consumer, the same timeout occurs again. The result is a rebalance loop. The fix is to reduce the batch size via max.poll.records, increase max.poll.interval.ms, or move slow processing off the consumer thread.
- Set replication.factor=3 and min.insync.replicas=2 for any durable topic
- Set unclean.leader.election.enable=false explicitly
- Configure retention by size (log.retention.bytes) if consumer downtime is a realistic failure scenario
- Use enable.idempotence=true on all producers — it is the default since Kafka 3.0 and has no meaningful overhead
- Use CooperativeStickyAssignor to eliminate stop-the-world rebalances
- Monitor consumer group lag with Burrow or expose consumer_offsets metrics to your observability stack
- Set delete.retention.ms explicitly on compacted topics — do not rely on the default
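Translated into configuration, the checklist above might look like this (values are illustrative; property names are the broker and Java client names):

```properties
# Broker / topic durability defaults
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false
# 100 GiB per partition; size this to worst-case consumer downtime
log.retention.bytes=107374182400

# Producer
enable.idempotence=true
acks=all

# Consumer (Java client)
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
```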
What Changed with KRaft? Is ZooKeeper Actually Gone?
ZooKeeper served as Kafka's external metadata store from the beginning: topic configurations, broker registrations, ISR tracking, controller election. As of Kafka 3.3, KRaft mode (Kafka Raft Metadata mode) is production-ready. As of Kafka 4.0, ZooKeeper is fully removed from the codebase.
In KRaft, metadata is stored in a dedicated internal Kafka topic called __cluster_metadata using Raft consensus. A subset of brokers are designated as controllers and participate in the Raft quorum. The active controller leader handles all metadata changes. This eliminates the ZooKeeper dependency and its associated operational complexity: separate deployment, separate JVMs, separate monitoring, separate upgrade path, separate security configuration.
More importantly, KRaft eliminates the controller failover bottleneck that plagued large clusters. In ZooKeeper mode, the Kafka controller cached a snapshot of ZooKeeper state in memory. A controller failover required reloading that state from ZooKeeper — which meant a proportionally long recovery time as partition count scaled. Engineers regularly reported 30-plus second blackouts on clusters with 100,000 or more partitions during controller failovers. KRaft snapshots eliminate this: controller recovery is bounded by the snapshot size, not partition count. This enables partition counts an order of magnitude larger than were previously practical.
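For reference, a minimal KRaft node configuration might look like the sketch below (hypothetical hostnames; this is a combined broker-plus-controller node, which larger clusters would split into dedicated roles):

```properties
# KRaft mode: this node is both a broker and a Raft quorum controller
process.roles=broker,controller
node.id=1

# The controller quorum: id@host:port for each voter
controller.quorum.voters=1@kafka-1:9093,2@kafka-2:9093,3@kafka-3:9093
controller.listener.names=CONTROLLER
listeners=PLAINTEXT://kafka-1:9092,CONTROLLER://kafka-1:9093
```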
Is Kafka the Right Tool? When Should You Avoid It?
| Dimension | Kafka | RabbitMQ / SQS | Apache Pulsar | NATS JetStream |
|---|---|---|---|---|
| Storage model | Partitioned, replayable log | Destructive queue: consume then delete | Segmented log on BookKeeper | Lightweight persisted stream |
| Replay | Built in, via offsets | Not supported (SQS) or limited | Built in | Built in |
| Routing | Partition key only | Rich: exchanges, priorities, dead-lettering | Topics plus subscription modes | Hierarchical subjects |
| Operational footprint | High | Low (SQS) to moderate (RabbitMQ) | High: brokers plus a BookKeeper ensemble | Low: single binary |
Use Kafka when: you need multiple independent consumers reading the same stream; you need replay capability for debugging, backfilling downstream systems, or audit requirements; you need to decouple producers from consumers at high write volume; you are building a change data capture pipeline from a database.
Avoid Kafka when: your team is fewer than five engineers and cannot own its operational surface; you need complex routing logic such as priority queues, dead-letter routing, or conditional routing — RabbitMQ's exchange model is genuinely better here; your workload is request-reply style; your event volume is under 10K per second and replay is not a requirement.
The most common Kafka mistake I see at Fordel is teams adopting it for task queue workloads — sending 50 job requests per minute through a topic with 12 partitions and a consumer group of 3 workers. They get all the operational complexity of Kafka with none of the benefits. SQS or a Postgres-backed queue would serve them better for a fraction of the cognitive overhead.
Is Kafka Still Worth It in 2026?
Kafka is a precision instrument. The log abstraction is genuinely elegant: immutable, ordered, replayable. The replication model is correct and auditable. The consumer group protocol, once understood, is a clean solution to distributed read coordination.
But it operationalizes everything onto the engineer — partition count decisions, replication factor, retention policy, ISR configuration, rebalance protocol choice. These decisions made wrong compound quietly until they fail loudly in production.
Read the Kafka documentation. Then read the Confluent internals blog series. Then read the KIP (Kafka Improvement Proposal) for any feature you depend on in production. The architecture is excellent. The defaults are not always right for your workload. That gap is where most Kafka war stories are born.
“Kafka does not hide complexity. It relocates it — from the broker into your configuration and your consumer code. That is a good trade if you are ready for it.”



