
Apache Kafka Did Not Win as a Messaging System. It Won as an ETL Backbone.


Apache Kafka is marketed as a messaging system, but most teams run it as an ETL backbone and central data hub. That is why Kafka clusters store tens of terabytes, sit between every database and engine, and show up in platform cost and reliability discussions instead of messaging design reviews. This article explains how Kafka actually gets used in modern stacks, what problems that creates for integration, execution and governance, and what concrete steps leaders can take to simplify architectures, control Kafka-related costs and prepare for a federated execution layer on top.

Kafka solved integration, not messaging

On paper, Kafka sits next to message brokers. In practice, it earned adoption because it fixed an integration problem. As organisations added more systems that needed to exchange data, point-to-point ETL pipelines multiplied beyond control.

With ten source systems and ten destinations, you quickly end up with something close to one hundred individual jobs. Each pipeline carries its own schema handling, scheduling and failure modes. As the number of sources (N) and destinations (M) grows, this model stops scaling.

Kafka changed the shape of that problem. Instead of connecting every system directly to every other system, teams connected sources and sinks to a single hub. Producers wrote once to Kafka topics. Consumers subscribed to those topics. Integrations dropped from N times M to N plus M, and teams standardised on a single API for data movement and schema evolution.
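
A minimal sketch of the hub pattern, assuming a local broker and a hypothetical orders.events topic: the source system writes each event once, and any number of downstream systems subscribe to the same topic with their own consumer groups instead of getting dedicated pipelines.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Write once to the hub. Warehouse loaders, search indexers and stream
        // jobs each read the same topic independently; no bespoke pipeline per pair.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders.events", "order-42", "{\"status\":\"CREATED\"}"));
            producer.flush();
        }
    }
}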

Stanislav Kozlovski explains this N times M to N plus M reduction with a concrete ten by ten example. His public thread at https://x.com/BdKozlovski/status/1992962689052856514 and the follow-up article at https://www.linkedin.com/posts/stanislavkozlovski... describe Kafka as the hub that eliminated bespoke pipelines. These posts do not present Kafka as a messaging system but as an integration backbone.

This is not a messaging use case. It is integration and ETL. That is the foundation for how Kafka is used today.

What leaders can do at this layer

  • Identify Kafka explicitly as an integration and ETL backbone in architecture documents.
  • Retire remaining point-to-point pipelines and unify them behind topics.
  • Centralise schema policies so producers and consumers evolve safely (a sketch follows this list).
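
One concrete option for the last point, assuming a Confluent Schema Registry reachable at a hypothetical schema-registry:8081, is to pin the compatibility mode per subject so that incompatible producer changes are rejected before they break consumers. A minimal sketch against the registry's REST config endpoint:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SchemaPolicy {
    public static void main(String[] args) throws Exception {
        // Hypothetical registry URL and subject name; adjust for your environment.
        String registry = "http://schema-registry:8081";
        String subject = "orders.events-value";

        // Require new schemas to stay backward compatible with the latest registered one,
        // so existing consumers keep working while producers evolve.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(registry + "/config/" + subject))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .PUT(HttpRequest.BodyPublishers.ofString("{\"compatibility\": \"BACKWARD\"}"))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}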

Kafka became the ETL backbone of the stack

Once Kafka was in place as a hub, systems defaulted to connecting through it. Change data capture landed in Kafka. Stream processors read from Kafka and wrote enriched events back. Warehouses, search systems, data lakes, feature stores and monitoring pipelines subscribed to topics.

Practitioners describe this pattern as the central nervous system of their platforms.

Sina Riyahi describes Kafka as the central nervous system for modern enterprises in his technical walkthrough at https://www.linkedin.com/posts/sina-riyahi.... Kai Waehner uses the same framing for automotive and manufacturing data flows in his write-up at https://kai-waehner.medium.com/apache-kafka....

Kafka did not replace RabbitMQ or JMS. It replaced ad hoc ETL buses and custom integration layers.

What leaders can do here

  • Map all systems reading from or writing to Kafka. This usually reveals the centrality of the log (a starting point is sketched after this list).
  • Distinguish between event-driven flows and batch-style movement implemented on top of Kafka.
  • Define boundaries for when Kafka is appropriate and when it is not.
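
A starting point for that mapping, assuming the Kafka Java AdminClient and a reachable broker, is simply to enumerate topics and consumer groups and then cross-reference them with owning teams and systems:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupListing;

import java.util.Properties;

public class TopologyInventory {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Every topic is a data product someone owns; every consumer group
            // is a downstream system that depends on the log.
            System.out.println("Topics:");
            admin.listTopics().names().get().forEach(t -> System.out.println("  " + t));

            System.out.println("Consumer groups:");
            for (ConsumerGroupListing g : admin.listConsumerGroups().all().get()) {
                System.out.println("  " + g.groupId());
            }
        }
    }
}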

Operational evidence that Kafka behaves like ETL infrastructure

Kafka clusters are sized and operated like data infrastructure, not like messaging servers. Terabytes of retained data and large replication factors are common.

One public cost analysis describes a six-broker cluster ingesting thirty-five megabytes per second with seven-day retention and a replication factor of three. This results in more than sixty terabytes of stored data for a single week.
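
As a back-of-the-envelope check, those figures are consistent: 35 MB/s × 86,400 s/day × 7 days ≈ 21 TB of raw data, and a replication factor of three brings that to roughly 63 TB on disk before indexes and overhead.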

The details are laid out in the thread at https://www.reddit.com/r/apachekafka/comments/1fscdmm. A related discussion in r/dataengineering at https://www.reddit.com/r/dataengineering/comments/1ftw5is shows how storage and replication dominate cloud bills once Kafka is used as a durable event store rather than a transient messaging system.

Questions on Stack Overflow and Reddit focus on ingestion throughput, lag, retention, compaction, schema compatibility, data loss and recovery. These are data infrastructure concerns rather than messaging semantics.

Reducing Kafka-related operational cost

  • Review retention per topic and align it with compliance needs (a sketch follows this list).
  • Move historical data to object storage and keep Kafka focused on recent windows.
  • Partition topics based on access patterns to contain compaction and load.
  • Use tiered storage to keep cold segments on cheaper media.
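
A minimal sketch of tightening retention on one topic, assuming the Kafka Java AdminClient and a hypothetical orders.events topic. Because retention.ms is set per topic, hot operational topics can keep a few days in the brokers while long-term history moves to object storage:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class RetentionPolicy {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders.events");

            // Keep three days in the brokers; older data lives in object storage.
            ConfigEntry retention = new ConfigEntry("retention.ms",
                    String.valueOf(Duration.ofDays(3).toMillis()));

            admin.incrementalAlterConfigs(Map.of(
                    topic, List.of(new AlterConfigOp(retention, AlterConfigOp.OpType.SET))
            )).all().get();
        }
    }
}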

The new bottleneck: execution and governance, not ingestion

Kafka solved movement. The bottleneck moved upward. Modern stacks use multiple engines: Kafka, stream processors, warehouses, lakehouses, OLAP engines and batch systems. Each adds its own scheduling, metadata and failure modes.

Practitioners describe the result as tool sprawl that exceeds the value delivered.

The survey at https://medium.com/@community_md101... shows broad agreement that the current data stack is too complex. The DZone write-up at https://dzone.com/articles/modern-data-stack-is-overrated describes the operational drag from Airflow, dbt and other components when combined without a coherent design.

Meanwhile, table formats and lakes introduced new metadata workloads. There are cases where metadata dwarfs the actual data by orders of magnitude.

An example is documented in https://www.reddit.com/r/dataengineering/comments/18yzbp8 and in the Stack Overflow question at https://stackoverflow.com/questions/77762787. In this case roughly thirteen gigabytes of data generated more than sixty terabytes of metadata in S3 due to many small writes without compaction.
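
If the tables in question are Iceberg, one hedged illustration of the lifecycle policies that keep this in check is regular snapshot expiration (alongside compacting the small files themselves). A minimal sketch with the Iceberg Java API, assuming a Hadoop catalog at a hypothetical warehouse path and table name:

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;

import java.util.concurrent.TimeUnit;

public class TableMaintenance {
    public static void main(String[] args) {
        // Hypothetical catalog location and table name.
        HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "s3a://lake/warehouse");
        Table table = catalog.loadTable(TableIdentifier.of("events", "clicks"));

        // Drop snapshot metadata older than seven days but keep a few recent
        // snapshots for rollback. Without this, every small write keeps piling
        // manifests and metadata files into object storage.
        long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
        table.expireSnapshots()
                .expireOlderThan(cutoff)
                .retainLast(5)
                .commit();
    }
}

Snapshot expiration only trims the metadata trail; compacting small files addresses the write pattern that creates it in the first place.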

Governance also shifted. Teams now coordinate access control, lineage and quality across Kafka, warehouses, lakes and AI systems.

The article at https://www.moderndata101.com/blogs/accelerating-development... notes that enterprise AI agents require unified governance and discovery rather than disparate systems. DataHub’s CONTEXT 2025 recap at https://datahub.com/blog/context-2025-highlights/ explains that metadata for humans is insufficient and that connected context is needed for automated agents.

Reducing execution and governance complexity

  • Limit the number of engines that run core workloads.
  • Define data contracts, ownership and retention for each domain.
  • Introduce lifecycle policies for table formats to control metadata growth.
  • Invest in a catalog that spans Kafka, warehouses and lakes.

Kafka plus compute: real world friction

Kafka is often blamed for lag and instability that actually originate in downstream compute layers. Flink jobs that read from Kafka and write to databases or object stores frequently stall due to backpressure and checkpoint issues.

In the Stack Overflow question at https://stackoverflow.com/questions/61311010, enabling checkpoints immediately produces backpressure and growing Kafka lag. Another case at https://stackoverflow.com/questions/79424289 shows Flink FileSink to S3 causing persistent backpressure. A related discussion in r/apacheflink at https://www.reddit.com/r/apacheflink/comments/1o2k1t0 highlights how Iceberg sinks inflate checkpoint latency.

These issues confirm that Kafka sits at the front of ETL pipelines whose performance and reliability depend on downstream systems, not just the brokers.

Aligning engines with Kafka

  • Profile downstream sinks before scaling brokers. Most lag originates after Kafka.
  • Use asynchronous sinks and batching in Flink and Spark to avoid backpressure.
  • Monitor checkpoint size and duration. These are leading indicators of ETL failure (a tuning sketch follows this list).
  • Keep compute state small unless stateful operations are required.
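
As an illustration, and without claiming these values fit any particular job, the Flink checkpoint knobs that most often interact with Kafka lag can be set like this:

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuning {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint once a minute; more frequent checkpoints against a slow sink
        // just stack up and surface as growing Kafka consumer lag.
        env.enableCheckpointing(60_000);

        CheckpointConfig config = env.getCheckpointConfig();
        config.setMinPauseBetweenCheckpoints(30_000);   // let the sink drain between checkpoints
        config.setCheckpointTimeout(10 * 60_000);       // slow S3/Iceberg commits need headroom
        config.setTolerableCheckpointFailureNumber(3);  // do not fail the job on one slow commit
        config.enableUnalignedCheckpoints();            // barriers overtake buffers under backpressure

        // ... sources, transformations and sinks are defined here ...
    }
}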

What comes after the ETL backbone

If Kafka is the ETL backbone of a platform, then the next challenge is executing work across multiple engines consistently. AI-driven agents, analytics and operational workloads touch Kafka, processors, warehouses, lakes and APIs in a single flow.

Industry posts at moderndata101 (https://www.moderndata101...) and DataHub (https://datahub.com/blog/context-2025-highlights/) emphasise that agents span many systems and that coordination is the limiting factor, not model performance.

This points to a need for a logical execution layer above Kafka. Engineers describe the intent of a workflow, and a federated planner determines where each step executes across the available engines.
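
No standard API exists for this yet, so the following is purely illustrative: a hypothetical way to express a workflow as logical steps, leaving the decision of which engine runs each step to a planner.

import java.util.List;

public class LogicalWorkflowSketch {

    // A purely hypothetical step description: what should happen, not where.
    record Step(String name, String inputDataset, String outputDataset, String transform) {}

    // A purely hypothetical planner interface; a real implementation would map
    // each step to Kafka/Flink, a warehouse, a lakehouse engine or a batch system.
    interface FederatedPlanner {
        void run(List<Step> workflow);
    }

    public static void main(String[] args) {
        List<Step> workflow = List.of(
                new Step("ingest-orders", "orders.events", "lake.orders_raw", "append"),
                new Step("enrich-orders", "lake.orders_raw", "lake.orders_enriched", "join customers"),
                new Step("publish-metrics", "lake.orders_enriched", "warehouse.order_metrics", "aggregate daily")
        );

        // An engineer declares intent; the planner decides placement.
        FederatedPlanner planner = steps ->
                steps.forEach(s -> System.out.println("plan " + s.name() + " on the cheapest capable engine"));
        planner.run(workflow);
    }
}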

Preparing for a federated execution layer

  • Separate logical workflows from engine specific implementations.
  • Identify workloads that depend on a specific engine only out of habit.
  • Consolidate redundant jobs across engines into unified patterns.
  • Align observability and lineage so that runs can be traced across systems.

Leadership checklist

Kafka won because it solved an integration and ETL problem. That success reshaped data architectures and shifted bottlenecks into execution, governance and cost. Leaders who recognise Kafka’s true role can simplify their stacks and prepare for the next abstraction layer.

  • Define whether Kafka is part of messaging, integration or data platform strategy.
  • Audit retention, topology and usage to match Kafka’s actual ETL role.
  • Reduce tool sprawl on top of Kafka to avoid recreating N times M complexity in another layer.
  • Invest in governance and metadata that span Kafka, warehouses and lakes.
  • Plan for federated execution where logical workflows span multiple engines.

If you need help with distributed systems, backend engineering, or data platforms, check my Services.
