← All writing
Paper Breakdown

Kafka: what the original paper actually says

Reading Kreps, Narkhede, and Rao (NetDB 2011) after a rolling deploy caused a 90-second consumer blackout that shouldn't have happened.

The deploy was routine — a new consumer binary, rolled out one instance at a time across 12 pods. Total restart time per pod: 8 seconds. Expected consumer downtime: zero, since at any given moment only one pod was restarting.

What actually happened: each restart triggered a consumer group rebalance. Each rebalance stopped all consumption for the entire group for 15–30 seconds while ZooKeeper coordinated partition reassignment. Twelve restarts × 30 seconds each = 6 minutes of cumulative downtime across a 90-second deploy window. No consumer processed messages for the full 90 seconds. Lag accumulated. Downstream alerts fired.

The root cause wasn't a bug. It was a fundamental property of how the original Kafka consumer group protocol worked, explicitly described in the 2011 paper: when any consumer in a group changes, every consumer in the group stops, releases its partition ownerships, and reruns the assignment algorithm from scratch. The design assumed restarts were rare events, not routine deploy operations.

"Kafka: a Distributed Messaging System for Log Processing", Jay Kreps, Neha Narkhede, Jun Rao, LinkedIn, NetDB 2011, is the paper behind every Kafka tutorial, architecture diagram, and streaming pipeline you've used. It is a remarkably short paper — six pages — and it describes a system with design decisions that are deliberately weaker than what most engineers assume when they reach for Kafka. The paper's contribution is demonstrating that a log-structured, pull-based, stateless-broker messaging system can deliver throughput orders of magnitude higher than contemporary alternatives, at the cost of specific guarantees that the paper is explicit about not providing.

The problem the paper is actually solving

LinkedIn in 2011 had a specific problem: collecting and processing large volumes of activity data — user clicks, page views, ad impressions, search queries — for both real-time dashboards and offline batch processing. The data volume was high. The consumers were heterogeneous: Hadoop jobs ran on hourly batches, real-time monitoring systems consumed continuously, and search indexers needed to process each event within seconds.

Traditional message queues (ActiveMQ, RabbitMQ) weren't designed for this workload. They tracked per-message delivery state at the broker, acknowledged messages individually, and deleted messages once a consumer acknowledged them. This made them reliable for point-to-point work queues but expensive at high volume — the broker's memory pressure grew with message counts, not just consumer throughput.

The paper's observation: LinkedIn's use case didn't need the guarantees traditional queues provided. Individual event loss was acceptable. Reprocessing was acceptable (the source data was always GFS). What wasn't acceptable was throughput constraints that required sharding across dozens of broker instances for a single log stream.

The design decision follows from this directly: build a log that is fast to write and read, that multiple independent consumers can traverse at different positions simultaneously, and that doesn't track per-message delivery state at the broker. Accept that this means at-most-once producer semantics and at-least-once consumer semantics. Build the rest around those constraints.

The log as a data structure

The paper's central abstraction is the append-only log. A topic is divided into partitions. Each partition is a log: messages are appended to the end, and consumers read forward from a position in the log.

Messages are identified not by a sequence number or UUID but by their byte offset in the partition log. If a message starts at byte 10240, its offset is 10240. The offset of the next message is 10240 + length-of-current-message. This is not a sequential integer counter. It's a cursor into a file.

The broker stores a partition as a set of segment files, each approximately 1 GB. Only the last segment receives new writes. Previous segments are immutable. The broker maintains an in-memory sorted list of the starting offset of each segment file. Finding a message by offset is a binary search over this list to locate the segment, then a sequential scan from the segment's start offset to the target.

This design choice — offset as byte position, not integer ID — means the broker needs no auxiliary random-access index structure. Seeking is O(log N) where N is the number of segments (usually small), followed by a sequential file read. The paper's benchmarks show this is sufficient to saturate a 1 Gbps network link.

Messages are not deleted when consumed. They're deleted on a time-based retention schedule — the paper describes 7 days as a typical configuration. This is what enables multiple consumer groups: each group maintains its own offset, and Kafka doesn't care how far behind any given group is as long as its offsets are within the retention window.

Why the throughput numbers are what they are

The paper benchmarks Kafka against ActiveMQ and RabbitMQ on 10 million messages of 200 bytes each, on hardware with a 1 Gbps network link.

Producer throughput:

  • Kafka at batch size 1: 50,000 messages/second
  • Kafka at batch size 50: 400,000 messages/second
  • "Orders of magnitude higher than ActiveMQ" and "at least 2x higher than RabbitMQ"

Consumer throughput:

  • Kafka: 22,000 messages/second
  • ActiveMQ and RabbitMQ: approximately 5,500 messages/second each

The batch size 1 → batch size 50 difference is nearly 10×. This isn't about parallelism — it's about amortizing the cost of network round-trips and disk flush triggers across more messages per call. A batch of 50 messages is sent in a single MessageSet with one network round-trip. Without batching, 50 messages require 50 round-trips.

The performance gap relative to ActiveMQ has a less obvious cause: storage overhead per message. The paper reports Kafka using 9 bytes of overhead per message versus 144 bytes for ActiveMQ. ActiveMQ tracks per-message delivery status, acknowledgment state, and routing metadata. Kafka stores none of this. At 200-byte messages, ActiveMQ's overhead is 72% of total message size. Kafka's overhead is 4%.

The throughput performance also comes from what the paper calls the filesystem page cache strategy: Kafka stores nothing in the JVM heap. All message data sits in the OS page cache. Reads go directly from page cache to the network socket via the Linux sendfile system call, bypassing the application process entirely. The standard path — read from kernel buffer into application buffer, then copy to socket buffer — requires 4 data copies and 4 context switches. sendfile reduces this to 2 copies and 2 context switches. The paper explicitly cites this as a primary source of consumer throughput efficiency.

The catch: sendfile optimization disappears when you enable TLS. TLS requires the broker to decrypt and re-encrypt each byte, breaking the zero-copy path. Production Kafka deployments where TLS is enabled (all of them, in most organizations) pay the full copy cost.

The producer design: fire and forget

The 2011 producer has no acknowledgment mechanism. The paper states this directly: messages are sent to the broker as fast as the broker accepts them, with no confirmation that they were written to disk.

The rationale is explicit: LinkedIn's activity data use case tolerates small amounts of message loss. "As long as the number of dropped messages is relatively small, the unacknowledged sends achieve much better performance." The producers who needed durability guarantees — events that must not be lost — were expected to write to the source database first, not publish directly to Kafka.

The producer sends a batch of messages as a MessageSet to the partition's current leader. Partition selection is either random (default, for load balancing) or via a user-provided partition key and hash function. If you always want all events for a given user on the same partition — to maintain per-user ordering — you hash the user ID to a partition. If you don't care about ordering across users, you let the producer round-robin.

The producer-side acknowledgment model came later: acks=0 (no acknowledgment, original 2011 behavior), acks=1 (leader acknowledges), and acks=all (full ISR acknowledgment) were introduced as the replication model was added in 2012-2013.

The consumer design: pull, not push

Consumers pull from the broker. The paper gives two reasons for this choice over push.

First, rate independence: a pull-based consumer retrieves messages "at the maximum rate it can sustain." A push-based broker might send faster than a slow consumer can process, requiring the broker to buffer per-consumer state. With pull, the slow consumer simply doesn't request new messages until it's ready. The broker holds no per-consumer buffer.

Second, rewind: a pull-based consumer that wants to re-read old messages issues a pull request with an earlier offset. In a push-based system, the broker has already sent and potentially deleted those messages. In Kafka, as long as the offset falls within the retention window, the consumer can request any offset it chooses. This is what makes Kafka useful for error recovery — fix a processing bug, reset the consumer offset to before the bad messages, replay. This would require special broker support in a push model. In Kafka it's just a different offset in the pull request.

The downside is CPU waste when nothing is available: the consumer's pull loop blocks, waiting for new messages. The paper addresses this with a configurable blocking wait — the consumer issues a pull request with a minimum byte count, and the broker holds the connection until at least that much data is available. This avoids busy-polling at the cost of one broker thread per blocked consumer.

Consumer groups and the partition ownership model

This is the section the paper describes with deceptive simplicity and where most production operational problems originate.

A consumer group is a named set of consumer processes that collectively consume a topic. Each partition in the topic is owned by exactly one consumer in the group at a time. If the group has 12 consumers and the topic has 24 partitions, each consumer owns 2 partitions. Messages are delivered to exactly one consumer per group (point-to-point), but to all groups (publish-subscribe fan-out across groups).

The partition is the unit of parallelism because it eliminates per-message locking. If two consumers in the same group could read the same partition, they'd need coordination to avoid processing the same message twice. By enforcing single-consumer ownership of each partition, coordination only happens at partition assignment time, not per-message.

The original ownership protocol runs through ZooKeeper:

  1. Consumers register themselves in the ZooKeeper Consumer Registry (ephemeral node — auto-deletes on disconnect).
  2. Consumers watch the Broker and Consumer registries for changes.
  3. When any broker or consumer joins or leaves, all consumers receive a watch notification.
  4. All consumers simultaneously release their current partition ownerships and rerun the assignment algorithm.
  5. The assignment is deterministic: sort the partition list, sort the consumer list, range-partition and round-robin assign.
  6. If two consumers race to claim the same partition (both computed the same assignment), one writes first and the other detects a conflict, releases all partitions, and retries.

The paper notes that this "often stabilizes after only a few retries" and that the ZooKeeper ownership mechanism "relies on timing assumptions and is not entirely foolproof." The practical implication: any event that changes the consumer group membership — a consumer crash, a scale-up, a rolling deploy — triggers a full-group rebalance. During a rebalance, no partition is being consumed by anyone.

This is the source of the rolling deploy outage I described at the start. Each pod restart was a consumer leave + rejoin event, triggering a rebalance for the entire group. The 2011 paper's rebalance model wasn't designed for deployments where consumer membership changes continuously.

The modern fix, introduced years later, is the incremental cooperative rebalance (KIP-429): instead of revoking all partition assignments globally, consumers only revoke partitions that need to move, and keep processing the rest during the transition. The 2011 protocol stopped everything. The cooperative protocol stops only what must stop.

Consumer offsets in ZooKeeper

In the original design, consumers write their committed offset to ZooKeeper after processing each batch. The ZooKeeper Offset Registry stores one persistent node per (consumer group, topic, partition) triple. Persistent nodes survive consumer crashes — when a consumer restarts, it reads its last committed offset from ZooKeeper and resumes from there.

The at-least-once delivery property follows directly: the consumer processes a batch, then commits the offset. If it crashes between processing and committing, it re-processes those messages on restart. If it commits before processing completes, it loses those messages on restart. The safe default is commit after processing, which gives at-least-once.

This offset-in-ZooKeeper design created a second operational problem at scale: ZooKeeper was not designed for high-frequency writes. A consumer processing 10,000 messages/second and committing every 100ms sends 10 offset writes/second per partition to ZooKeeper. With 100 partitions and 100 consumer groups, that's 100,000 ZooKeeper writes/second — well beyond ZooKeeper's designed capacity, which is optimized for infrequent configuration changes, not throughput workloads.

The __consumer_offsets internal Kafka topic, introduced in Kafka 0.9, moved offset storage from ZooKeeper to a compacted Kafka topic. Offset commits became Kafka writes, not ZooKeeper writes. This removed the ZooKeeper bottleneck and enabled fine-grained offset commit frequency. The 2011 design couldn't support this because offsets needed to outlive both the consumer and the broker — ZooKeeper's persistence model was the only available option at the time.

The absence of replication

The most operationally significant fact about the original Kafka paper is stated plainly in the limitations section: "If a broker goes down, any message not yet consumed becomes unavailable."

The 2011 design had no replication. Each partition had a single copy on a single broker. If that broker failed, the partition was unavailable until the broker came back. Messages that had been produced but not yet consumed were stuck on the failed broker's disk. If the broker's disk was corrupted, those messages were lost.

The paper lists replication as planned future work: "We plan to add built-in replication in Kafka to redundantly store each message on multiple brokers."

The ISR (In-Sync Replicas) model that production Kafka uses today — where each partition has a leader and followers, and acknowledgment requires the message to be written to all in-sync replicas — came in 2012-2013 and was not part of the original design. The performance numbers in the paper are from a system without the overhead of replication. Modern Kafka with acks=all and RF=3 incurs significantly more I/O and network overhead than the 2011 benchmark numbers suggest.

Production tradeoffs the benchmark tables don't mention

Partition count is a permanent decision. Once you create a topic with N partitions, changing to N+M requires either creating a new topic and migrating consumers (handling the ordering discontinuity), or accepting that messages are reshuffled across partitions when you add partitions (breaking per-key ordering guarantees you may have relied on). The 2011 design never addresses partition count as a problem because LinkedIn's use cases tolerated reshuffling. Production workloads with strict per-key ordering requirements are stuck with the original partition count or need a migration plan.

The broker is stateless but segment deletion is not instant. Kafka retains messages for the configured retention period even if all consumer groups have consumed past them. Retention is per-topic, not per-consumer-group. If you have one consumer group that processes in real time and another that runs weekly batch jobs, your 1-week retention is set by the batch consumer, and all topics retain a full week of data regardless of whether the real-time consumer would benefit from a shorter retention.

Consumer lag is an aggregate metric, not a per-partition metric. Consumer monitoring tools typically report total lag across all partitions as a single number. A total lag of 10,000 messages could be 10 partitions each 1,000 behind (healthy, catchable) or 1 partition 10,000 behind (one dead consumer thread, not catching up at all). Alerting on aggregate lag misses the pathological case. Production monitoring needs per-partition lag with per-partition consumer throughput to detect stuck partitions.

The pull model doesn't prevent producer backpressure. Kafka's pull model means consumers control their own consumption rate. It does not mean producers get backpressure when consumers are slow. Producers write to the log at their rate. The log grows. If consumers are slow enough that the lag exceeds the retention window, messages expire before they're consumed. Kafka doesn't signal the producer to slow down. This is by design — the log buffers the rate mismatch — but in systems where message loss is unacceptable, unchecked producer rates against a fixed retention window need active monitoring.

Failure modes in practice

Unclean leader election. The ISR (added post-2011) defines in-sync replicas as replicas that are within a configurable lag of the leader. When the leader fails, the controller elects a new leader from the ISR. If the ISR is empty — all replicas were lagging — and unclean.leader.election.enable=true, Kafka elects the least-lagging out-of-sync replica. This guarantees availability at the cost of durability: the new leader may not have some messages the old leader acknowledged. With unclean.leader.election.enable=false, the partition is unavailable until a previously in-sync replica recovers. Most production deployments set this to false for financial or event data and accept the unavailability.

Rebalance storms during slow consumers. When a consumer is slow to process a batch, it takes longer than the max.poll.interval.ms timeout to call poll() again. Kafka interprets this as the consumer having left the group (regardless of whether it's actually alive), triggers a rebalance, and reassigns the partition. The slow consumer comes back, triggers another rebalance, and the cycle repeats. The symptom is a consumer group that rebalances continuously without making progress. The fix is either to increase max.poll.interval.ms or to reduce the per-batch processing time. The 2011 paper didn't have this timeout mechanism at all — it was added as a safeguard against silent consumer failures, but the threshold interacts badly with legitimately slow processing.

Cross-partition ordering is not guaranteed. The paper states this explicitly and it's routinely violated by producers who use multiple partitions with a key-based partitioner and then expect globally ordered consumption. If event A and event B both have user ID 12345 but go to different partitions (because the partition count changed and the key hash maps differently), a consumer that reads partitions in sequence will see them out of order. Global ordering in Kafka requires a single partition, which eliminates all parallelism.

When not to use Kafka

When you need per-message acknowledgment. Traditional work queues (RabbitMQ, SQS) delete a message only after the consumer explicitly acknowledges it. This makes them the right tool for job queues where each job must be processed exactly once: the message exists until a consumer claims it and acks it. Kafka's model is wrong for this — Kafka deletes messages by time, not by ack. If you use Kafka as a job queue and a consumer falls behind its retention window, jobs are lost silently.

When your topic has a handful of consumers and you need per-consumer delivery guarantees. Kafka's partition count limits consumer parallelism per group. A topic with 4 partitions can support at most 4 consumers in a group consuming in parallel. If you need 8 parallel consumers, you need at least 8 partitions — and adding partitions post-creation is disruptive. Traditional queues allow arbitrary consumer scaling without this constraint.

When you need sub-second end-to-end latency with small message volumes. Kafka's batching — the mechanism that delivers its throughput advantage — adds latency. The linger.ms producer config (default 0ms in modern Kafka, but often set to 5–20ms in production for throughput) means messages sit in the producer buffer waiting for a batch to fill. For a workload producing a few hundred messages per second, the latency added by batching may exceed your requirements. A direct RPC or a lower-latency queue is more appropriate.

When message ordering across partitions is required. Kafka guarantees ordering within a partition, not across partitions. Any workload requiring a strict global total order over all messages must use a single partition — at which point you've eliminated horizontal scalability entirely.

When your consumers need exactly-once semantics without application-level idempotency. Kafka's transactional producer and EOS (exactly-once semantics) were added in Kafka 0.11, years after the 2011 paper. They work, but they require end-to-end coordination between producer transactions, consumer offset commits, and sink writes. Getting EOS right in a non-trivial pipeline is significantly harder than the documentation suggests. If you can make your consumers idempotent (process the same message twice without side effects), at-least-once is simpler and just as correct.

What the paper actually gives you

The 2011 Kafka paper is a proof that a messaging system built on a simple, immutable, append-only log with stateless brokers and pull-based consumers can outperform traditional message queues by an order of magnitude at the cost of weaker per-message guarantees.

The tradeoff is explicit throughout: 9 bytes of per-message overhead instead of 144; no acknowledgment instead of per-message ack; time-based retention instead of consumer-driven deletion; consumer-managed offsets instead of broker-managed delivery state. Every performance advantage traces to something the broker doesn't do.

The limitations the paper acknowledges — no replication, at-least-once only, ZooKeeper coordination assumptions — are not oversights. They're deliberate choices for a use case (log processing, activity data aggregation) where some data loss was acceptable and throughput was the binding constraint.

Modern Kafka added replication, exactly-once semantics, cooperative rebalancing, consumer offset storage in Kafka itself, and Kafka Raft (KRaft) to replace ZooKeeper. These additions addressed the paper's stated limitations. They also added complexity, configuration surface area, and new failure modes that the original 6-page paper couldn't anticipate.

The deploy outage: we switched from the eager rebalance protocol to cooperative rebalancing (partition.assignment.strategy=CooperativeStickyAssignor), added a pre-stop lifecycle hook to the pod spec to delay pod shutdown until the consumer committed its offsets, and reduced max.poll.interval.ms to match our actual processing time rather than using the default. The next rolling deploy: zero consumer downtime, no lag accumulation, no alerts.


Kafka: a Distributed Messaging System for Log Processing — Jay Kreps, Neha Narkhede, Jun Rao. LinkedIn Corp. NetDB Workshop, VLDB 2011.

Related reading

  • MapReduce: What the Google Paper Actually Says

    The 2004 Google paper that gave us Hadoop — and everything that replaced it — is worth reading not for the map/reduce abstraction itself, but for the fault tolerance model and the straggler insight. The failure modes are still the failure modes.

  • Pregel: What the Large-Scale Graph Processing Paper Actually Says

    PageRank in MapReduce is O(iterations × full dataset reloads). Pregel fixes this by keeping the graph in memory across iterations and replacing disk I/O with message passing. The 'think like a vertex' model is the insight — BSP is the implementation.

  • Cassandra: What the Paper Actually Says

    We had a Cassandra cluster where DELETE operations made reads progressively slower until queries timed out. Adding more disk space made it worse. The root cause is described precisely in the 2009 paper — but only if you understand that Cassandra cannot actually delete data.

← All writing