SQL on Streaming Data Does Not Require a Streaming Engine

Most teams do not need continuous stream processing for day-to-day Kafka questions. Kafka data is already written as immutable log segments, and those segments can live in object storage. For bounded queries like tailing recent events, time-window inspection, and key-based debugging, a SQL interface that plans against segment boundaries can replace an Apache Flink or ksqlDB cluster, with clearer costs and less operational overhead.

Stream processing engines solved a real problem: continuous computation over unbounded data. Flink, ksqlDB, and Apache Kafka Streams gave teams a way to run SQL-like queries against event streams without writing custom consumers. The operational cost of that solution is widely acknowledged even by vendors and practitioners: you are adopting a distributed runtime with state, checkpoints, and cluster operations.

For a large share of the questions teams ask their Kafka data, a simpler architecture exists: SQL on immutable segments in object storage.


The streaming ecosystem assumed that querying event data requires streaming infrastructure. That assumption made sense when Kafka was the only place the data lived. Brokers held segments on local disk. If you wanted to query that data with SQL, you needed a system that understood Kafka topics and offsets.

The assumption changes when the same data sits in object storage as immutable segment files. Kafka already writes data as append-only segment files and rolls them by size or time. The default segment size is 1 GiB, and the default retention is 168 hours. These are not exotic settings. They are what most clusters run.
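
For reference, those defaults map to a handful of broker settings. The snippet below shows the stock values as they would appear in server.properties; individual clusters may of course override them:

# roll a new segment at 1 GiB (default)
log.segment.bytes=1073741824
# or roll after 7 days, whichever comes first (default)
log.roll.hours=168
# keep closed segments for 7 days (default)
log.retention.hours=168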

The question is not whether streaming engines are powerful. They are. The question is whether you need that power for the queries you actually run.

1. The 80% nobody talks about

Most interactions with Kafka data in production are not continuous aggregations or complex event processing. They are simple questions engineers and analysts ask every day:

What is in this topic right now? A developer deploys a new producer and wants to verify messages are flowing. They need the last 10 to 20 messages. This is a bounded read against the tail of the log.

What happened between 9:00 and 9:05? An alert fired. The on-call engineer needs to see messages during a short window. This is a bounded time-range scan.

Where is the message with this key? A customer reports a missing order. Support needs to find the message and trace it through the system. This is a bounded lookup, often constrained by time.

Which partitions have data flowing? An operator suspects a stuck consumer. They need partition-level facts: counts, latest timestamps, and recent offsets. This is metadata plus a small scan.

These questions share the same shape: bounded queries over immutable data. They have a clear start and end. They do not need to run continuously. They do not need to maintain state between executions.

What teams actually ask their Kafka data

Simple lookups and debugging (most queries): tail reads, time-range scans, key lookups, partition health.

Continuous aggregations (sometimes needed): windowed counts, rolling metrics.

Complex event processing (rare): joins, patterns, stateful ML.

Most teams need a streaming interface, not a streaming engine.

2. Kafka data is already written in immutable chunks

There is a misconception that streaming data is fundamentally different from batch data. Kafka does not persist records as individual writes. It appends to log segment files and rolls segments based on configured thresholds. Those segments are immutable once rolled, which is exactly why retention deletes whole segments and why compaction rewrites segments.

Kafka also maintains sparse indexes so readers can seek efficiently. The default index interval is 4096 bytes, trading index size for faster seeks.
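
Concretely, a partition's log directory is nothing more than rolled segments plus their companion indexes. The listing below is illustrative (the base offsets are made up), but the file layout is exactly what the broker maintains:

orders-7/
├── 00000000000000000000.log        rolled, immutable segment (~1 GiB)
├── 00000000000000000000.index      sparse offset index
├── 00000000000000000000.timeindex  sparse timestamp index
├── 00000000000001042650.log
├── 00000000000001042650.index
├── 00000000000001042650.timeindex
├── 00000000000002085300.log        active segment, still being appended to
├── 00000000000002085300.index
└── 00000000000002085300.timeindex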

Once you accept that Kafka logs are already a sequence of immutable chunks plus indexes, putting those chunks in object storage becomes a storage choice, not a semantic change. You still have partitions. You still have offsets. You still have timestamps. You still have the same bounded questions.

3. The streaming engine tax

Continuous query engines carry an operational cost for capabilities that bounded queries never use:

State management. Aggregations, joins, and windowing require state that must be stored, checkpointed, and restored.

Checkpoint coordination. Exactly-once pipelines rely on coordinated snapshots and careful tuning. That is real engineering work.

Cluster deployment. You are operating a runtime, not just a query interface. That includes scaling, upgrades, and incident response.

Event-time mechanics. Watermarks, late arrivals, and window close behavior are necessary for continuous semantics, and irrelevant for bounded historical reads.

If your daily reality is tailing, key lookups, and time-bounded debugging, you are paying for a distributed system you are not using.
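
To make that list concrete, here is a rough sketch of the kind of configuration a Flink deployment asks you to own just to keep state and checkpoints healthy. The keys are illustrative of recent Flink releases and the values are placeholders; none of this exists in a bounded-query setup:

# flink-conf.yaml (illustrative excerpt)
state.backend: rocksdb
state.checkpoints.dir: s3://example-checkpoints/flink
execution.checkpointing.interval: 60s
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.timeout: 10min
restart-strategy: exponential-delay
high-availability: kubernetes
high-availability.storageDir: s3://example-ha/flink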

4. Pull queries on object storage

The alternative architecture is conceptually simple: treat segment files as the query target, and expose Kafka-native columns as first-class fields:

_topic, _partition, _offset, _ts, _key, _value, _headers

With that model, the common questions become straightforward SQL, plus a little ergonomic sugar for interactive use:

-- What is in this topic? (tail)
SELECT * FROM orders TAIL 10;

-- What happened at 9am? (time range)
SELECT * FROM orders
WHERE _ts BETWEEN '2026-01-08 09:00:00' AND '2026-01-08 09:05:00';

-- Find a specific message (key lookup, usually time-bounded)
SELECT * FROM orders
WHERE _key = 'order-12345' LAST 24h;

-- Partition health check (bounded)
SELECT _partition, count(*), max(_ts) AS latest
FROM orders LAST 5m
GROUP BY _partition;

This is the same core idea behind SQL-on-files systems. DuckDB is a good mental model: a single process queries data in S3 directly, and you distribute the work only when you actually need to.
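
As a rough sketch of that model (the bucket path and column names are placeholders, and credentials are assumed to be configured already), a single DuckDB process can answer a bounded question straight from object storage:

-- illustrative: bounded time-range query over files in S3
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'eu-central-1';

SELECT status, count(*) AS n
FROM read_parquet('s3://example-bucket/orders/*.parquet')
WHERE ts BETWEEN TIMESTAMP '2026-01-08 09:00:00'
             AND TIMESTAMP '2026-01-08 09:05:00'
GROUP BY status;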

5. Latency expectations

A bounded SQL query on object storage is not a sub-millisecond system. That is fine because most of these queries are interactive debugging and near-time analysis, not fraud scoring.

In practice, object storage latency is dominated by request round trips and first-byte time. Real-world S3 benchmarks show time-to-first-byte in the tens of milliseconds for in-region GETs, with variance by region and access pattern.

The trade is simple: use streaming engines for continuous, stateful, sub-second problems. Use SQL on segments for bounded lookups, debugging, and time-window inspection where seconds-level latency is acceptable.

6. Cost transparency, before you run the query

Object storage cost is transparent if you surface it. S3 charges per request and, depending on where the query runs, for data transfer, so the query interface should be honest about what it will read before execution.

At minimum, an EXPLAIN should show:

kfs=> EXPLAIN SELECT * FROM orders WHERE _key = 'order-12345' LAST 24h;

Query Plan:
├── Topic: orders (24 partitions)
├── Time range: 2026-01-07T09:00:00Z to 2026-01-08T09:00:00Z
├── Segments: 127 of 3,842 (3.3%)
├── Estimated bytes: 4.2 GB
├── Index usage: key index on 89 segments
└── Estimated S3 cost: $0.0004

kfs=> EXPLAIN SELECT * FROM orders;

WARNING: Query has no time bounds. Use LAST, TAIL, or WHERE _ts to limit scan.
Run with SCAN FULL to override.

A guardrail should exist for unbounded scans. If a query has no time filter, require explicit opt-in to scan the full topic history. S3 request pricing is region-specific but typically a fraction of a cent per 1,000 GETs. Surface this in the planner, not as a surprise on the bill.
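
As a back-of-the-envelope check on the plan above, assuming us-east-1 pricing of about $0.0004 per 1,000 GET requests and ranged reads of roughly 8 MiB each (both assumptions, not planner internals):

4.2 GB read in 8 MiB ranges    ≈ 500 GET requests
one index read per segment     ≈ 127 GET requests
total                          ≈ 600-700 requests ≈ $0.0003

The exact figure depends on range size and index hits, but the order of magnitude matches the planner's estimate: a bounded, index-assisted query costs a fraction of a cent.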

7. Where streaming engines still belong

This is not an argument against streaming engines. Complex event processing, continuous aggregations, windowed joins, and real-time feature serving are exactly what systems like Flink were designed for.

The point is separation of concerns. Most teams adopt streaming infrastructure just to answer bounded questions because that was the only SQL interface available for Kafka data. If your data is already durable as immutable segments in object storage, you can decouple the SQL interface from the streaming runtime.

Right tool for the job

Streaming engine (Flink, Kafka Streams, ksqlDB): continuous fraud scoring · windowed aggregations with exactly-once · real-time feature serving · complex event patterns · stream-table joins · sub-second alerting.

SQL on segments (kafsql): "What's in this topic?" · "What happened at 9am?" · "Find this message" · partition health checks · ad-hoc debugging · historical analysis · dashboard queries.

Rule: streaming engines for streaming problems, SQL on files for lookup problems.

kafsql

KafScale stores Kafka data as .kfs segments in S3. The format is open and documented. Brokers are stateless; segments are immutable. This architecture addresses the operational pain of traditional Kafka: no partition rebalancing, no disk capacity planning, no broker recovery.

kafsql is a SQL interface for that data. It speaks the Postgres wire protocol, so any Postgres client works out of the box: psql, DBeaver, pgAdmin, JDBC drivers, Python's psycopg2, Go's pgx. No custom driver. No new query language.

$ psql -h kafsql.novatechflow.dev.svc -p 5432 -d kfs

kfs=> \dt
        List of topics
   Name    │ Partitions │ Retention
───────────┼────────────┼───────────
 orders    │         24 │ 7 days
 shipments │         12 │ 7 days
 events    │         48 │ 30 days

kfs=> SELECT * FROM orders TAIL 5;
kfs=> SELECT * FROM orders WHERE _ts >= now() - INTERVAL '15 minutes';
kfs=> SELECT _partition, count(*) FROM orders LAST 5m GROUP BY 1;

The kafsql processor runs as a standalone binary or Kubernetes deployment. It reads segment metadata from etcd or directly from S3 prefix listing. It plans queries against segment boundaries, using timestamp and key indexes when available. Results stream back over the Postgres protocol.

For external access, a proxy handles TLS termination and access control. ACL rules map OIDC groups to topic patterns. Audit logs track who queried what. Internal pods hold S3 credentials; the proxy validates identity before forwarding queries.
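
A rule set for that proxy could look something like the following. The format is purely hypothetical, shown only to make the OIDC-group-to-topic-pattern mapping tangible:

# hypothetical ACL configuration, not the actual syntax
acl:
  - group: payments-oncall
    topics: ["orders", "payments.*"]
    allow: [SELECT, EXPLAIN]
  - group: data-analysts
    topics: ["events"]
    allow: [SELECT]
    max_scan_bytes: 50GiB    # hypothetical per-query guardrail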

Architecture: kafscale.io/architecture
Source: github.com/kafscale/platform


