# Apache Kafka
Distributed, partitioned, replicated commit log. The backbone of modern event-driven architectures.
## At a glance
| Field | Value |
|---|---|
| Category | Streaming / Event Broker |
| Difficulty | Advanced (operationally) |
| When to use | High-throughput events, durable log, multiple consumers, replay required |
| When not to use | Small volume, single consumer, no operational budget |
| Alternatives | Redis Streams, AWS SQS/Kinesis, Google Pub/Sub, RabbitMQ |
## The mental model: a distributed commit log
Kafka is not a message queue. It’s a log. Producers append records; consumers read them at their own pace and track their own position. Messages are not deleted when read — they stick around for the retention window.
This single choice drives everything else Kafka does well: multiple independent consumer groups, replay of historical data, durable storage, horizontal scalability. It’s also why Kafka is heavier than a queue: you’re operating a distributed log with replication and coordination.
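The append-and-track-your-own-offset model can be sketched in a few lines. This is a toy, single-partition, in-memory illustration of the idea, not Kafka's API; the `Log` class and event names are invented for the example:

```python
from collections import defaultdict

class Log:
    """Toy single-partition log: append-only; each reader tracks its own offset."""
    def __init__(self):
        self.records = []                 # records are never deleted on read
        self.offsets = defaultdict(int)   # consumer name -> next offset to read

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1      # offset of the appended record

    def poll(self, consumer, max_records=10):
        start = self.offsets[consumer]
        batch = self.records[start:start + max_records]
        self.offsets[consumer] += len(batch)   # "commit" the new position
        return batch

log = Log()
for e in ["order.placed", "order.paid", "order.shipped"]:
    log.append(e)

print(log.poll("analytics"))   # ['order.placed', 'order.paid', 'order.shipped']
print(log.poll("billing"))     # same three records: reading does not delete
```

Note that `billing` sees every record even though `analytics` already read them — that independence is exactly what a queue cannot give you.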
## Core concepts
Topics — named streams. You publish to a topic, you subscribe to a topic.

Partitions — each topic is split into N partitions. A partition is the unit of parallelism and ordering. Records within a partition are strictly ordered; across partitions, there is no global order.
Producers send records. If the record has a key, Kafka hashes the key and picks a partition — same key, same partition, preserved order. No key, and records round-robin across partitions.
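The key-to-partition rule is just a stable hash modulo the partition count. A minimal sketch of the idea — note that real Kafka producers use murmur2 for keyed records; `crc32` here is a stand-in, and `pick_partition` is an invented name:

```python
import zlib
from itertools import count
from typing import Optional

NUM_PARTITIONS = 6
_round_robin = count()   # counter for unkeyed records

def pick_partition(key: Optional[bytes]) -> int:
    """Keyed records hash to a stable partition; unkeyed records rotate.
    (Kafka's default partitioner uses murmur2; crc32 is illustrative.)"""
    if key is None:
        return next(_round_robin) % NUM_PARTITIONS
    return zlib.crc32(key) % NUM_PARTITIONS

# Same key always lands on the same partition, preserving per-key order:
assert pick_partition(b"user-42") == pick_partition(b"user-42")
```

This is also why changing the partition count later breaks per-key ordering: the modulus changes, so existing keys remap to different partitions.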
Consumers read records. They track their position via an offset per partition. Offsets are committed back to Kafka.
Consumer groups are the trick that turns topics into load-balanced streams. A group is a logical consumer; its members share the partitions of a topic, so each partition is processed by exactly one member at a time. Scale out by adding members — up to the number of partitions. Beyond that, extra consumers sit idle. Partitions are the cap on parallelism — pick the count carefully, because increasing it later reshuffles keys.
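The partition-sharing rule is easy to see in code. A sketch of round-robin assignment (roughly what Kafka's RoundRobinAssignor does; the function name is invented), showing why members beyond the partition count sit idle:

```python
def assign_partitions(partitions: int, members: list) -> dict:
    """Each partition goes to exactly one member; extra members get nothing."""
    assignment = {m: [] for m in members}
    for p in range(partitions):
        assignment[members[p % len(members)]].append(p)
    return assignment

# 4 partitions, 6 consumers: two consumers are assigned nothing.
print(assign_partitions(4, [f"c{i}" for i in range(6)]))
# {'c0': [0], 'c1': [1], 'c2': [2], 'c3': [3], 'c4': [], 'c5': []}
```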
Brokers are the servers that hold the partitions. Each partition has one leader and, at replication factor N, N-1 followers. If the leader dies, a follower from the in-sync replica (ISR) set takes over.
## Delivery semantics
- At-most-once. Commit offset before processing. Crash loses the record. Rarely what you want.
- At-least-once. Commit after processing. Crash causes reprocessing. This is the default most systems target — pair it with idempotent consumers.
- Exactly-once. Kafka supports this via idempotent producers + transactions, end-to-end with Kafka Streams or carefully written consumer code. It’s real but costly; most teams should achieve effective exactly-once with at-least-once + idempotency keys in the downstream system.
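The "at-least-once + idempotency keys" pattern from the last bullet can be sketched as follows. This is illustrative only — `handle` and the in-memory `processed` set are invented; in production the dedup store would be a database unique constraint or similar:

```python
processed = set()   # in prod: a unique constraint, not process memory

def handle(event_id: str, apply_effect) -> bool:
    """At-least-once delivery + an idempotency key = effective exactly-once.
    Redeliveries of the same event_id are skipped instead of re-applied."""
    if event_id in processed:
        return False            # duplicate: effect already applied
    apply_effect()
    processed.add(event_id)     # record only AFTER the effect succeeds
    return True

charges = []
handle("evt-1", lambda: charges.append(100))
handle("evt-1", lambda: charges.append(100))   # redelivery: no double charge
print(charges)   # [100]
```

If the consumer crashes between `apply_effect()` and the commit, the record is redelivered and skipped on the second pass — which is why recording the key after the effect, not before, is the safe order.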
## Kafka vs Redis Streams vs SQS
| Dimension | Kafka | Redis Streams | SQS |
|---|---|---|---|
| Throughput | Hundreds of MB/s per broker | Tens of MB/s, limited by RAM | High, but per-message cost |
| Retention | Days to forever | Until memory pressure | Up to 14 days |
| Replay | First-class (reset offsets) | By ID range | Not supported |
| Consumer fan-out | Many independent groups | Consumer groups, similar model | One queue = one consumer set (fan out via SNS) |
| Ops burden | High (brokers, KRaft, schemas) | Low (already running Redis) | Zero (AWS-managed) |
| Ordering | Per-partition | Per-stream | FIFO queues only |
| Best fit | Data platforms, cross-team backbone | In-service fan-out, moderate volume | App-level work queues |
The short heuristic: SQS for simple work queues. Redis Streams if Redis is already there and volume is modest. Kafka when you’re building a data platform that outlives any one service.
## When Ephizen reaches for Kafka
- Cross-service events that multiple teams consume — the product service emits `order.placed`, and the ML, analytics, and billing teams all read independently without coordinating with us.
- Event sourcing and CDC — Debezium → Kafka → downstream stores (Snowflake, OpenSearch, feature store).
- Feature pipelines that need sub-minute freshness and eventual delivery guarantees.
We do not use Kafka for:
- Ephemeral pub/sub inside a single service (Redis is enough).
- Request/response or RPC (that’s what HTTP and gRPC are for).
- Background job queues where we just need retries and a DLQ (SQS wins).
## Operational gotchas
- Under-partitioning. If the topic only has 4 partitions, you can’t scale past 4 consumers in a group. Increasing partitions later rehashes keys and breaks per-key ordering. Overestimate on day one.
- Hot partitions. Pick keys with high cardinality. A key like `country_code` will concentrate 30% of your traffic on `US`.
- Consumer lag that no one watches. Lag is the single most important health metric. Alert on it. Untreated lag eventually becomes unrecoverable.
- Rebalances from long poll loops. If your consumer takes longer than `max.poll.interval.ms` to process a batch, Kafka thinks it’s dead and triggers a rebalance. Shrink the batch or extend the timeout.
- Assuming global ordering. Ordering is per-partition only. If your system truly needs global order, you’re limited to one partition — which means one consumer, which means no horizontal scale.
- Under-replicated partitions in prod. Always run `replication.factor=3` and `min.insync.replicas=2`. Anything less is a data-loss incident waiting to happen.
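Lag itself is simple arithmetic: per partition, the log-end offset minus the group's committed offset, summed across partitions. In practice you'd fetch these numbers via the AdminClient or `kafka-consumer-groups.sh`; this sketch (with an invented function name) just shows the calculation you should be alerting on:

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> int:
    """Total lag = sum over partitions of (log-end offset - committed offset).
    A partition with no committed offset counts as fully behind."""
    return sum(log_end_offsets[p] - committed_offsets.get(p, 0)
               for p in log_end_offsets)

# Partition 0 is caught up; partition 1 is 500 records behind.
lag = consumer_lag({0: 1000, 1: 2000}, {0: 1000, 1: 1500})
print(lag)   # 500
```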