Apache Kafka

Distributed, partitioned, replicated commit log. The backbone of modern event-driven architectures.

At a glance

| Field | Value |
| --- | --- |
| Category | Streaming / Event Broker |
| Difficulty | Advanced (operationally) |
| When to use | High-throughput events, durable log, multiple consumers, replay required |
| When not to use | Small volume, single consumer, no operational budget |
| Alternatives | Redis Streams, AWS SQS/Kinesis, Google Pub/Sub, RabbitMQ |

The mental model: a distributed commit log

Kafka is not a message queue. It’s a log. Producers append records; consumers read them at their own pace and track their own position. Messages are not deleted when read — they stick around for the retention window.

This single choice drives everything else Kafka does well: multiple independent consumer groups, replay of historical data, durable storage, horizontal scalability. It’s also why Kafka is heavier than a queue: you’re operating a distributed log with replication and coordination.
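The log model is small enough to sketch in a few lines. This toy single-partition, in-memory log (an illustration, not any real Kafka API) shows the two properties everything else builds on: reads don't delete, and every consumer group owns its own offset:

```python
# Toy "commit log" sketch: append-only storage, per-group read positions.
class Log:
    def __init__(self):
        self.records = []   # the append-only log; never trimmed on read
        self.offsets = {}   # consumer-group name -> next offset to read

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1   # the new record's offset

    def poll(self, group, max_records=10):
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)   # commit new position
        return batch

    def seek(self, group, offset):
        # "Replay" is nothing more than moving a group's offset backwards.
        self.offsets[group] = offset

log = Log()
for event in ["order.placed", "order.paid", "order.shipped"]:
    log.append(event)

analytics = log.poll("analytics")   # reads all three records
billing = log.poll("billing")       # independent position; also reads all three
log.seek("analytics", 0)            # rewind...
replayed = log.poll("analytics")    # ...and replay history
```

Both groups see the full stream without coordinating, and the rewind costs nothing, because reading never mutated the log.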

Core concepts

Topics — named streams. You publish to a topic, you subscribe to a topic.

Partitions — each topic is split into N partitions. A partition is the unit of parallelism and ordering. Records within a partition are strictly ordered; across partitions, there is no global order.

Producers send records. If the record has a key, Kafka hashes the key and picks a partition — same key, same partition, preserved order. No key, and records round-robin across partitions.
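The keyed-record rule can be sketched directly. Kafka's Java producer hashes keys with murmur2; `zlib.crc32` below is a stand-in with the same shape: a deterministic hash of the key bytes, modulo the partition count.

```python
import zlib

def choose_partition(key, num_partitions):
    # Sketch of the producer's partitioning rule. The real client uses
    # murmur2; crc32 is a stand-in to show the shape of the computation.
    if key is None:
        raise ValueError("keyless records are distributed round-robin instead")
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Same key always lands on the same partition, preserving per-key order...
p1 = choose_partition("user-42", 6)
p2 = choose_partition("user-42", 6)
# ...but changing the partition count can reshuffle the key elsewhere,
# which is why growing a topic later breaks per-key ordering.
p3 = choose_partition("user-42", 12)
```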

Consumers read records. They track their position via an offset per partition. Offsets are committed back to Kafka, which is how a restarted consumer resumes from its last committed position.

Consumer groups are the trick that turns topics into load-balanced streams. A group is a logical consumer; its members share the partitions of a topic, so each partition is processed by exactly one member at a time. Scale out by adding members — up to the number of partitions. Beyond that, extra consumers sit idle. Partitions are the cap on parallelism — pick the count carefully, because increasing it later reshuffles keys.
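The assignment rule is easy to sketch (a simplified stand-in for Kafka's built-in assignors such as range, round-robin, and sticky):

```python
def assign_partitions(members, num_partitions):
    # Sketch of assignment within one consumer group: each partition goes
    # to exactly one member; members beyond the partition count get
    # nothing and sit idle.
    assignment = {m: [] for m in members}
    for p in range(num_partitions):
        assignment[members[p % len(members)]].append(p)
    return assignment

two = assign_partitions(["c1", "c2"], 4)
# {'c1': [0, 2], 'c2': [1, 3]} -- both members share the work
five = assign_partitions(["c1", "c2", "c3", "c4", "c5"], 4)
# 'c5' gets [] -- the fifth consumer is idle; 4 partitions cap parallelism at 4
```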

Brokers are the servers that hold the partitions. Each partition has a leader and N-1 followers for replication. If the leader dies, a follower in the in-sync replica (ISR) set takes over.
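The failover rule worth internalizing: the new leader must come from the ISR, never from a lagging follower. A minimal sketch (function name and shape are illustrative, not a Kafka API):

```python
def elect_leader(isr, failed_leader):
    # Only in-sync replicas are eligible; a lagging follower would serve
    # stale data. (Kafka can be told to elect one anyway via
    # unclean.leader.election.enable, at the cost of possible data loss.)
    candidates = [broker for broker in isr if broker != failed_leader]
    if not candidates:
        raise RuntimeError("no in-sync replica available; partition is offline")
    return candidates[0]

new_leader = elect_leader(isr=[1, 3], failed_leader=1)  # broker 3 takes over
```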

Delivery semantics

  • At-most-once. Commit offset before processing. Crash loses the record. Rarely what you want.
  • At-least-once. Commit after processing. Crash causes reprocessing. This is the default most systems target — pair it with idempotent consumers.
  • Exactly-once. Kafka supports this via idempotent producers + transactions, end-to-end with Kafka Streams or carefully written consumer code. It’s real but costly; most teams should achieve effective exactly-once with at-least-once + idempotency keys in the downstream system.
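The first two semantics differ only in where the commit sits relative to processing, which a small simulation makes concrete (pure Python; the "crash" is an exception raised between the two steps):

```python
def handle_batch(records, committed, processed, commit_first, crash_on):
    i = committed["offset"]
    while i < len(records):
        r = records[i]
        if commit_first:
            committed["offset"] = i + 1          # at-most-once: commit first...
            if r == crash_on["key"]:
                crash_on["key"] = None
                raise RuntimeError("crashed between commit and process")
            processed.append(r)                  # ...then process
        else:
            processed.append(r)                  # at-least-once: process first...
            if r == crash_on["key"]:
                crash_on["key"] = None
                raise RuntimeError("crashed between process and commit")
            committed["offset"] = i + 1          # ...then commit
        i += 1

def run(commit_first):
    committed, processed = {"offset": 0}, []
    crash_on = {"key": "b"}                      # crash once while handling "b"
    for _ in range(2):                           # first pass crashes; second resumes
        try:
            handle_batch(["a", "b", "c"], committed, processed,
                         commit_first, crash_on)
        except RuntimeError:
            pass                                 # restart from last committed offset
    return processed

lost = run(commit_first=True)    # ['a', 'c']: "b" was committed but never processed
redo = run(commit_first=False)   # ['a', 'b', 'b', 'c']: "b" was processed twice
```

The duplicated "b" in the second run is exactly why at-least-once needs idempotent consumers downstream.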

Kafka vs Redis Streams vs SQS

| Dimension | Kafka | Redis Streams | SQS |
| --- | --- | --- | --- |
| Throughput | Hundreds of MB/s per broker | Tens of MB/s, limited by RAM | High, but per-message cost |
| Retention | Days to forever | Until memory pressure | Up to 14 days |
| Replay | First-class (reset offsets) | By ID range | Not supported |
| Consumer fan-out | Many independent groups | Consumer groups, similar model | One consumer set per queue (fan-out via SNS) |
| Ops burden | High (brokers, KRaft, schemas) | Low (already running Redis) | Zero (AWS-managed) |
| Ordering | Per-partition | Per-stream | FIFO queues only |
| Best fit | Data platforms, cross-team backbone | In-service fan-out, moderate volume | App-level work queues |

The short heuristic: SQS for simple work queues. Redis Streams if Redis is already there and volume is modest. Kafka when you’re building a data platform that outlives any one service.

When Ephizen reaches for Kafka

  • Cross-service events that multiple teams consume — the product service emits order.placed, and the ML, analytics, and billing teams all read independently without coordinating with us.
  • Event sourcing and CDC — Debezium → Kafka → downstream stores (Snowflake, OpenSearch, feature store).
  • Feature pipelines that need sub-minute freshness and eventual delivery guarantees.

We do not use Kafka for:

  • Ephemeral pub/sub inside a single service (Redis is enough).
  • Request/response or RPC (that’s what HTTP and gRPC are for).
  • Background job queues where we just need retries and a DLQ (SQS wins).

Operational gotchas

  • Under-partitioning. If the topic only has 4 partitions, you can’t scale past 4 consumers in a group. Increasing partitions later rehashes keys and breaks per-key ordering. Overestimate on day one.
  • Hot partitions. Pick keys with high cardinality. country_code will concentrate 30% of your traffic on US.
  • Consumer lag that no one watches. Lag is the single most important health metric. Alert on it. Once lag grows past the retention window, unread records expire before they are ever consumed, and that data is simply gone.
  • Rebalances from long poll loops. If your consumer takes longer than max.poll.interval.ms to process a batch, Kafka thinks it’s dead and triggers a rebalance. Shrink the batch or extend the timeout.
  • Assuming global ordering. Ordering is per-partition only. If your system truly needs global order, you’re limited to one partition — which means one consumer, which means no horizontal scale.
  • Under-replicated partitions in prod. Always run replication.factor=3 and min.insync.replicas=2. Anything less is a data-loss incident waiting to happen.
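Lag itself is simple arithmetic: per partition, the log-end offset minus the group's committed offset. In practice you read both numbers from the admin API or `kafka-consumer-groups.sh --describe`; the offsets below are invented for illustration.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    # Per-partition lag: how far the group's committed position trails
    # the newest record in the partition.
    return {
        p: log_end_offsets[p] - committed_offsets.get(p, 0)
        for p in log_end_offsets
    }

lag = consumer_lag(
    log_end_offsets={0: 1_000, 1: 1_200, 2: 980},   # made-up example values
    committed_offsets={0: 1_000, 1: 700, 2: 975},
)
# {0: 0, 1: 500, 2: 5} -- partition 1 is falling behind; alert if it keeps growing
```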