
Apache Kafka Did Not Win as a Messaging System. It Won as an ETL Backbone.


Apache Kafka is marketed as a messaging system, but most teams run it as an ETL backbone and central data hub. That is why Kafka clusters store tens of terabytes, sit between every database and engine, and show up in platform cost and reliability discussions instead of messaging design reviews. This article explains how Kafka actually gets used in modern stacks, what problems that creates for integration, execution and governance, and what concrete steps leaders can take to simplify architectures, control Kafka-related costs and prepare for a federated execution layer on top.

Kafka solved integration, not messaging

On paper, Kafka sits next to message brokers. In practice, it earned adoption because it fixed an integration problem. As organisations added more systems that needed to exchange data, point-to-point ETL pipelines multiplied beyond control.

With ten source systems and ten destinations, you quickly end up with something close to one hundred individual jobs. Each pipeline carries its own schema handling, scheduling and failure modes. As the number of sources (N) and destinations (M) grows, this model stops scaling.

Kafka changed the shape of that problem. Instead of connecting every system directly to every other system, teams connected sources and sinks to a single hub. Producers wrote once to Kafka topics. Consumers subscribed to those topics. Integrations dropped from N times M to N plus M, and teams standardised on a single API for data movement and schema evolution.
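
A minimal sketch of the hub pattern, assuming a local broker and a hypothetical orders.events topic: the source system writes each event once, and any number of downstream systems subscribe to the same topic with their own consumer groups instead of getting dedicated pipelines.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Write once to the hub. Warehouse loaders, search indexers and stream
        // jobs each read the same topic independently; no bespoke pipeline per pair.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders.events", "order-42", "{\"status\":\"CREATED\"}"));
            producer.flush();
        }
    }
}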

Stanislav Kozlovski explains this N times M to N plus M reduction with a concrete ten by ten example. His public thread at https://x.com/BdKozlovski/status/1992962689052856514 and the follow-up article at https://www.linkedin.com/posts/stanislavkozlovski... describe Kafka as the hub that eliminated bespoke pipelines. These posts do not present Kafka as a messaging system but as an integration backbone.

This is not a messaging use case. It is integration and ETL. That is the foundation for how Kafka is used today.

What leaders can do at this layer

  • Identify Kafka explicitly as an integration and ETL backbone in architecture documents.
  • Retire remaining point-to-point pipelines and unify them behind topics.
  • Centralise schema policies so producers and consumers evolve safely (a sketch follows this list).
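
One concrete option for the last point, assuming a Confluent Schema Registry reachable at a hypothetical schema-registry:8081, is to pin the compatibility mode per subject so that incompatible producer changes are rejected before they break consumers. A minimal sketch against the registry's REST config endpoint:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SchemaPolicy {
    public static void main(String[] args) throws Exception {
        // Hypothetical registry URL and subject name; adjust for your environment.
        String registry = "http://schema-registry:8081";
        String subject = "orders.events-value";

        // Require new schemas to stay backward compatible with the latest registered one,
        // so existing consumers keep working while producers evolve.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(registry + "/config/" + subject))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .PUT(HttpRequest.BodyPublishers.ofString("{\"compatibility\": \"BACKWARD\"}"))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}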

Kafka became the ETL backbone of the stack

Once Kafka was in place as a hub, systems defaulted to connecting through it. Change data capture landed in Kafka. Stream processors read from Kafka and wrote enriched events back. Warehouses, search systems, data lakes, feature stores and monitoring pipelines subscribed to topics.

Practitioners describe this pattern as the central nervous system of their platforms.

Sina Riyahi describes Kafka as the central nervous system for modern enterprises in his technical walkthrough at https://www.linkedin.com/posts/sina-riyahi.... Kai Waehner uses the same framing for automotive and manufacturing data flows in his write-up at https://kai-waehner.medium.com/apache-kafka....

Kafka did not replace RabbitMQ or JMS. It replaced ad hoc ETL buses and custom integration layers.

What leaders can do here

  • Map all systems reading from or writing to Kafka. This usually reveals the centrality of the log (a starting point is sketched after this list).
  • Distinguish between event-driven flows and batch-style movement implemented on top of Kafka.
  • Define boundaries for when Kafka is appropriate and when it is not.
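
A starting point for that mapping, assuming the Kafka Java AdminClient and a reachable broker, is simply to enumerate topics and consumer groups and then cross-reference them with owning teams and systems:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupListing;

import java.util.Properties;

public class TopologyInventory {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Every topic is a data product someone owns; every consumer group
            // is a downstream system that depends on the log.
            System.out.println("Topics:");
            admin.listTopics().names().get().forEach(t -> System.out.println("  " + t));

            System.out.println("Consumer groups:");
            for (ConsumerGroupListing g : admin.listConsumerGroups().all().get()) {
                System.out.println("  " + g.groupId());
            }
        }
    }
}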

Operational evidence that Kafka behaves like ETL infrastructure

Kafka clusters are sized and operated like data infrastructure, not like messaging servers. Terabytes of retained data and large replication factors are common.

One public cost analysis describes a six-broker cluster ingesting thirty-five megabytes per second with seven-day retention and a replication factor of three. This results in more than sixty terabytes of stored data for a single week.
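
As a back-of-the-envelope check, those figures are consistent: 35 MB/s × 86,400 s/day × 7 days ≈ 21 TB of raw data, and a replication factor of three brings that to roughly 63 TB on disk before indexes and overhead.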

The details are laid out in the thread at https://www.reddit.com/r/apachekafka/comments/1fscdmm. A related discussion in r/dataengineering at https://www.reddit.com/r/dataengineering/comments/1ftw5is shows how storage and replication dominate cloud bills once Kafka is used as a durable event store rather than a transient messaging system.

Questions on Stack Overflow and Reddit focus on ingestion throughput, lag, retention, compaction, schema compatibility, data loss and recovery. These are data infrastructure concerns rather than messaging semantics.

Reducing Kafka-related operational cost

  • Review retention per topic and align it with compliance needs (a sketch follows this list).
  • Move historical data to object storage and keep Kafka focused on recent windows.
  • Partition topics based on access patterns to contain compaction and load.
  • Use tiered storage to keep cold segments on cheaper media.
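
A minimal sketch of tightening retention on one topic, assuming the Kafka Java AdminClient and a hypothetical orders.events topic. Because retention.ms is set per topic, hot operational topics can keep a few days in the brokers while long-term history moves to object storage:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class RetentionPolicy {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders.events");

            // Keep three days in the brokers; older data lives in object storage.
            ConfigEntry retention = new ConfigEntry("retention.ms",
                    String.valueOf(Duration.ofDays(3).toMillis()));

            admin.incrementalAlterConfigs(Map.of(
                    topic, List.of(new AlterConfigOp(retention, AlterConfigOp.OpType.SET))
            )).all().get();
        }
    }
}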

The new bottleneck: execution and governance, not ingestion

Kafka solved movement. The bottleneck moved upward. Modern stacks use multiple engines: Kafka, stream processors, warehouses, lakehouses, OLAP engines and batch systems. Each adds its own scheduling, metadata and failure modes.

Practitioners describe the result as tool sprawl that exceeds the value delivered.

The survey at https://medium.com/@community_md101... shows broad agreement that the current data stack is too complex. The DZone write-up at https://dzone.com/articles/modern-data-stack-is-overrated describes the operational drag from Airflow, dbt and other components when combined without a coherent design.

Meanwhile, table formats and lakes introduced new metadata workloads. There are cases where metadata dwarfs the actual data by orders of magnitude.

An example is documented in https://www.reddit.com/r/dataengineering/comments/18yzbp8 and in the Stack Overflow question at https://stackoverflow.com/questions/77762787. In this case roughly thirteen gigabytes of data generated more than sixty terabytes of metadata in S3 due to many small writes without compaction.
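
If the tables in question are Iceberg, one hedged illustration of the lifecycle policies that keep this in check is regular snapshot expiration (alongside compacting the small files themselves). A minimal sketch with the Iceberg Java API, assuming a Hadoop catalog at a hypothetical warehouse path and table name:

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;

import java.util.concurrent.TimeUnit;

public class TableMaintenance {
    public static void main(String[] args) {
        // Hypothetical catalog location and table name.
        HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "s3a://lake/warehouse");
        Table table = catalog.loadTable(TableIdentifier.of("events", "clicks"));

        // Drop snapshot metadata older than seven days but keep a few recent
        // snapshots for rollback. Without this, every small write keeps piling
        // manifests and metadata files into object storage.
        long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
        table.expireSnapshots()
                .expireOlderThan(cutoff)
                .retainLast(5)
                .commit();
    }
}

Snapshot expiration only trims the metadata trail; compacting small files addresses the write pattern that creates it in the first place.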

Governance also shifted. Teams now coordinate access control, lineage and quality across Kafka, warehouses, lakes and AI systems.

The article at https://www.moderndata101.com/blogs/accelerating-development... notes that enterprise AI agents require unified governance and discovery rather than disparate systems. DataHub’s CONTEXT 2025 recap at https://datahub.com/blog/context-2025-highlights/ explains that metadata for humans is insufficient and that connected context is needed for automated agents.

Reducing execution and governance complexity

  • Limit the number of engines that run core workloads.
  • Define data contracts, ownership and retention for each domain.
  • Introduce lifecycle policies for table formats to control metadata growth.
  • Invest in a catalog that spans Kafka, warehouses and lakes.

Kafka plus compute: real world friction

Kafka is often blamed for lag and instability that actually originate in downstream compute layers. Flink jobs that read from Kafka and write to databases or object stores frequently stall due to backpressure and checkpoint issues.

In the Stack Overflow question at https://stackoverflow.com/questions/61311010, enabling checkpoints immediately produces backpressure and growing Kafka lag. Another case at https://stackoverflow.com/questions/79424289 shows Flink FileSink to S3 causing persistent backpressure. A related discussion in r/apacheflink at https://www.reddit.com/r/apacheflink/comments/1o2k1t0 highlights how Iceberg sinks inflate checkpoint latency.

These issues confirm that Kafka sits at the front of ETL pipelines whose performance and reliability depend on downstream systems, not just the brokers.

Aligning engines with Kafka

  • Profile downstream sinks before scaling brokers. Most lag originates after Kafka.
  • Use asynchronous sinks and batching in Flink and Spark to avoid backpressure.
  • Monitor checkpoint size and duration. These are leading indicators of ETL failure (a tuning sketch follows this list).
  • Keep compute state small unless stateful operations are required.
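
As an illustration, and without claiming these values fit any particular job, the Flink checkpoint knobs that most often interact with Kafka lag can be set like this:

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuning {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint once a minute; more frequent checkpoints against a slow sink
        // just stack up and surface as growing Kafka consumer lag.
        env.enableCheckpointing(60_000);

        CheckpointConfig config = env.getCheckpointConfig();
        config.setMinPauseBetweenCheckpoints(30_000);   // let the sink drain between checkpoints
        config.setCheckpointTimeout(10 * 60_000);       // slow S3/Iceberg commits need headroom
        config.setTolerableCheckpointFailureNumber(3);  // do not fail the job on one slow commit
        config.enableUnalignedCheckpoints();            // barriers overtake buffers under backpressure

        // ... sources, transformations and sinks are defined here ...
    }
}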

What comes after the ETL backbone

If Kafka is the ETL backbone of a platform, then the next challenge is executing work across multiple engines consistently. AI-driven agents, analytics and operational workloads touch Kafka, processors, warehouses, lakes and APIs in a single flow.

Industry posts at moderndata101 (https://www.moderndata101...) and DataHub (https://datahub.com/blog/context-2025-highlights/) emphasise that agents span many systems and that coordination is the limiting factor, not model performance.

This points to a need for a logical execution layer above Kafka. Engineers describe the intent of a workflow, and a federated planner determines where each step executes across the available engines.
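
No standard API exists for this yet, so the following is purely illustrative: a hypothetical way to express a workflow as logical steps, leaving the decision of which engine runs each step to a planner.

import java.util.List;

public class LogicalWorkflowSketch {

    // A purely hypothetical step description: what should happen, not where.
    record Step(String name, String inputDataset, String outputDataset, String transform) {}

    // A purely hypothetical planner interface; a real implementation would map
    // each step to Kafka/Flink, a warehouse, a lakehouse engine or a batch system.
    interface FederatedPlanner {
        void run(List<Step> workflow);
    }

    public static void main(String[] args) {
        List<Step> workflow = List.of(
                new Step("ingest-orders", "orders.events", "lake.orders_raw", "append"),
                new Step("enrich-orders", "lake.orders_raw", "lake.orders_enriched", "join customers"),
                new Step("publish-metrics", "lake.orders_enriched", "warehouse.order_metrics", "aggregate daily")
        );

        // An engineer declares intent; the planner decides placement.
        FederatedPlanner planner = steps ->
                steps.forEach(s -> System.out.println("plan " + s.name() + " on the cheapest capable engine"));
        planner.run(workflow);
    }
}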

Preparing for a federated execution layer

  • Separate logical workflows from engine specific implementations.
  • Identify workloads that depend on a specific engine only out of habit.
  • Consolidate redundant jobs across engines into unified patterns.
  • Align observability and lineage so that runs can be traced across systems.

Leadership checklist

Kafka won because it solved an integration and ETL problem. That success reshaped data architectures and shifted bottlenecks into execution, governance and cost. Leaders who recognise Kafka’s true role can simplify their stacks and prepare for the next abstraction layer.

  • Define whether Kafka is part of messaging, integration or data platform strategy.
  • Audit retention, topology and usage to match Kafka’s actual ETL role.
  • Reduce tool sprawl on top of Kafka to avoid recreating N times M complexity in another layer.
  • Invest in governance and metadata that span Kafka, warehouses and lakes.
  • Plan for federated execution where logical workflows span multiple engines.

If you need help with distributed systems, backend engineering, or data platforms, check my Services.
