December 17, 2025
Kafscale: Kafka-Compatible Streaming with Stateless Brokers and S3 Storage
Why Kafscale exists
Kafscale started three months ago, when I went looking for a stateless Kafka alternative for a streaming intelligence platform. The existing options looked promising, but production use revealed friction that compounded over time.
Control plane lock-in. Most stateless Kafka alternatives require a vendor-hosted control plane, so your brokers depend on someone else's infrastructure for consensus. For regulated industries and strict data sovereignty requirements, that external dependency is a dealbreaker.
No S3 resilience. Object storage is not infallible. When PutObject calls started timing out, brokers would retry indefinitely or fail opaquely. No backpressure, no health states, no way for operators to automate responses.
Pricing floor. Entry at $3,000/month excludes dev environments, staging clusters, and smaller workloads. We needed 100GB/day for under $150/month.
So I built it. Go for the broker runtime. gRPC and Protocol Buffers for internal communication. etcd as a self-hosted metadata store. And a proper S3 health state machine that surfaces degradation through Kafka protocol errors and Prometheus metrics.
Open source, sponsored by Scalytics.
Who Kafscale is for
Good fit: Platform teams running Kafka as plumbing. Data engineers using Kafka for CDC, log aggregation, or event sourcing with 100ms+ latency tolerance. Organizations where Kafka ops burden exceeds the value of unused features. Greenfield projects wanting Kafka compatibility without Kafka complexity.
Not a fit: Trading systems, real-time bidding, workloads requiring exactly-once semantics, compacted topics, or single-digit millisecond latency.
Architecture
Brokers accept Kafka protocol connections, buffer writes, flush segments to S3, serve reads with caching and read-ahead, and coordinate consumer groups. etcd stores metadata: topic configuration, partition state, consumer group membership, and committed offsets. S3 stores immutable segment and index objects that represent the message log.
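As a sketch, the broker's external dependencies reduce to three settings. Field names below are illustrative assumptions, not Kafscale's actual configuration schema:

```go
package main

import "fmt"

// BrokerConfig wires the three external dependencies described above.
// All field names are illustrative, not Kafscale's real schema.
type BrokerConfig struct {
	KafkaListenAddr string   // where Kafka protocol clients connect
	EtcdEndpoints   []string // self-hosted metadata store
	S3Bucket        string   // immutable segment and index storage
	S3Region        string
}

func main() {
	cfg := BrokerConfig{
		KafkaListenAddr: "0.0.0.0:9092",
		EtcdEndpoints:   []string{"etcd-0:2379", "etcd-1:2379", "etcd-2:2379"},
		S3Bucket:        "kafscale-prod-eu-central-1",
		S3Region:        "eu-central-1",
	}
	fmt.Println(cfg.S3Bucket)
}
```

Everything stateful lives in etcd and S3; the broker process itself holds only caches and in-flight buffers, which is what makes it safe to kill and reschedule.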
Scope
In scope
- Kafka protocol compatibility for core producer and consumer workflows
- Produce and fetch paths backed by immutable segment storage
- Consumer groups, membership, heartbeats, and committed offsets
- Topic administration needed for everyday platform use
- Kubernetes operator integration via CRDs for cluster and topic lifecycle
Explicit non-goals
- Exactly-once semantics and transactions
- Compacted topics
- Kafka internal replication and ISR protocols
- Embedding stream processing inside the broker
Kafscale only does durable message transport. Stream processing remains the responsibility of compute engines such as Apache Flink, Apache Wayang, or any other stack that reads from Kafka topics. This keeps the broker surface area small and preserves compatibility with the Kafka ecosystem.
Storage and data model
Each topic is partitioned. Each partition is represented as an ordered sequence of immutable segment files plus a sparse index file used for offset-to-position lookup. Segment keys are based on base offsets so storage remains append-friendly and retention is handled with S3 lifecycle policies.
Topics and partitions
S3 key layout
s3://{bucket}/{namespace}/{topic}/{partition}/segment-{base_offset}.kfs
s3://{bucket}/{namespace}/{topic}/{partition}/segment-{base_offset}.index
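A key builder for this layout (producing the object key within the bucket) might look like the following. Zero-padding the base offset is my assumption, chosen so that ListObjects returns segments in offset order under S3's lexicographic key sorting:

```go
package main

import "fmt"

// segmentKey builds the S3 object key for a partition's segment file.
// The zero-padded offset is an assumption: it keeps keys in offset
// order under S3's lexicographic ListObjects ordering.
func segmentKey(namespace, topic string, partition int, baseOffset int64) string {
	return fmt.Sprintf("%s/%s/%d/segment-%020d.kfs", namespace, topic, partition, baseOffset)
}

// indexKey names the sparse index that pairs with the segment.
func indexKey(namespace, topic string, partition int, baseOffset int64) string {
	return fmt.Sprintf("%s/%s/%d/segment-%020d.index", namespace, topic, partition, baseOffset)
}

func main() {
	fmt.Println(segmentKey("kafscale-prod", "events", 3, 1048576))
}
```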
Segment file format
Each segment is a self-contained file with messages and metadata. The format includes a header for identification and versioning, message batches containing the actual records, and a footer with checksums for integrity verification.
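In Go, a sketch of that structure could look like this. The exact field layout of the .kfs format is not documented here, so these names and widths are assumptions; only the header/batches/footer shape comes from the description above:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// Hypothetical field layout for the header and footer described
// above; the real .kfs format may differ.
type segmentHeader struct {
	Magic      uint32 // fixed value identifying a Kafscale segment
	Version    uint16 // format version for compatibility checks
	BaseOffset int64  // offset of the first record in this segment
}

type segmentFooter struct {
	RecordCount uint32 // records contained in the segment
	BodyCRC     uint32 // CRC-32 over the message-batch bytes
}

// verifySegmentBody checks the footer checksum against the batch bytes.
func verifySegmentBody(body []byte, f segmentFooter) bool {
	return crc32.ChecksumIEEE(body) == f.BodyCRC
}

func main() {
	body := []byte("message batches")
	f := segmentFooter{RecordCount: 1, BodyCRC: crc32.ChecksumIEEE(body)}
	fmt.Println(verifySegmentBody(body, f))
}
```

Because segments are immutable, integrity only needs to be verified once per download; a corrupt object can be re-fetched rather than repaired.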
Write path
When a producer sends messages, the broker validates ownership, buffers the data, assigns offsets, and eventually flushes to S3. The acks setting controls when the producer receives confirmation.
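A minimal sketch of the buffering step, assuming contiguous offset assignment at append time and a size-based flush trigger (both reasonable for an append-only log, but the names and policy are mine, not Kafscale's):

```go
package main

import "fmt"

// partitionBuffer is a simplified model of the in-memory produce
// buffer: it assigns contiguous offsets at append time and signals
// a flush once a size threshold is crossed.
type partitionBuffer struct {
	nextOffset int64
	records    [][]byte
	byteSize   int
}

// append stores a record and returns the offset assigned to it.
func (b *partitionBuffer) append(rec []byte) int64 {
	off := b.nextOffset
	b.nextOffset++
	b.records = append(b.records, rec)
	b.byteSize += len(rec)
	return off
}

// shouldFlush reports whether the buffer reached the segment size target.
func (b *partitionBuffer) shouldFlush(maxBytes int) bool {
	return b.byteSize >= maxBytes
}

func main() {
	b := &partitionBuffer{nextOffset: 100}
	fmt.Println(b.append([]byte("a")), b.append([]byte("b")), b.shouldFlush(2))
}
```

Under this model, acks=1 can be answered once the record sits in the buffer, while a stricter acknowledgement waits for the S3 flush to succeed.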
Read path
When a consumer fetches messages, the broker locates the relevant segment, checks the cache, and retrieves data from S3 if needed. Read-ahead prefetching improves performance for sequential consumers.
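Because segment keys encode base offsets, locating the right segment reduces to a binary search over the sorted base offsets. A sketch:

```go
package main

import (
	"fmt"
	"sort"
)

// locateSegment returns the base offset of the segment that contains
// target, or -1 when target precedes the first segment. baseOffsets
// must be sorted ascending, mirroring the S3 key layout.
func locateSegment(baseOffsets []int64, target int64) int64 {
	// find the first segment that starts after target...
	i := sort.Search(len(baseOffsets), func(i int) bool {
		return baseOffsets[i] > target
	})
	if i == 0 {
		return -1
	}
	// ...then step back one: that segment covers target.
	return baseOffsets[i-1]
}

func main() {
	fmt.Println(locateSegment([]int64{0, 1000, 2000}, 1500)) // prints 1000
}
```

The sparse index file then maps the target offset to a byte position inside that segment, so only the needed range has to be fetched or served from cache.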
S3 resiliency and backpressure
Kafscale deliberately avoids persistent local queues. When S3 misbehaves, the system surfaces it through protocol-native backpressure and operator automation instead of inventing new operational knobs.
Every broker tracks S3 health as Healthy, Degraded, or Unavailable based on sliding-window PutObject latency and error metrics. The same health monitor wraps the fetch path so degraded buckets slow read-ahead and emit REQUEST_TIMED_OUT; unavailability raises UNKNOWN_SERVER_ERROR immediately so consumers understand the outage.
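The classification step can be sketched as a pure function over the sliding-window stats. The thresholds below are illustrative, not Kafscale's defaults:

```go
package main

import (
	"fmt"
	"time"
)

type s3State int

const (
	Healthy s3State = iota
	Degraded
	Unavailable
)

// classify maps sliding-window PutObject stats to a health state.
// Thresholds here are assumptions for illustration only.
func classify(errorRate float64, p99 time.Duration) s3State {
	switch {
	case errorRate >= 0.50: // majority of calls failing
		return Unavailable
	case errorRate >= 0.05 || p99 >= 2*time.Second: // elevated errors or latency
		return Degraded
	default:
		return Healthy
	}
}

func main() {
	fmt.Println(classify(0.01, 150*time.Millisecond)) // prints 0 (Healthy)
}
```

Keeping the decision a pure function of window stats makes the state machine easy to test and keeps hysteresis policy (how fast to recover) separate from detection.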
Operator guardrails
The Kubernetes operator watches broker health via control-plane RPCs or Prometheus. When any broker is Degraded, rollouts are paused. If a quorum reports Unavailable, the operator halts HPA decisions, emits alerts, and optionally rechecks IAM credentials and endpoints before resuming.
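A sketch of the quorum check, assuming "quorum" means a strict majority of brokers (the exact rule is an assumption on my part):

```go
package main

import "fmt"

type s3State int

const (
	Healthy s3State = iota
	Degraded
	Unavailable
)

// haltScaling reports whether a strict majority of brokers is
// Unavailable, in which case the operator freezes HPA decisions.
func haltScaling(states []s3State) bool {
	unavailable := 0
	for _, s := range states {
		if s == Unavailable {
			unavailable++
		}
	}
	return unavailable*2 > len(states)
}

func main() {
	fmt.Println(haltScaling([]s3State{Unavailable, Unavailable, Healthy})) // prints true
}
```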
Surfacing state
The broker exposes /metrics with Prometheus-style gauges. BrokerControl.GetStatus returns a sentinel partition named __s3_health whose state field reflects the current S3 state; operators or HPAs can watch either interface to gate rollouts or trigger alerts. For ops teams that prefer push semantics, the broker also opens a StreamMetrics gRPC stream that continuously emits the latest health snapshot plus derived latency and error stats, so automation can react without scraping delays.
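For illustration, the health gauge could appear on /metrics like this. The metric name and the integer encoding of the state are assumptions:

```go
package main

import "fmt"

// renderS3HealthGauge renders the S3 health state in Prometheus
// exposition format. Metric name and encoding are assumed, not
// taken from Kafscale's actual /metrics output.
func renderS3HealthGauge(state int) string {
	return fmt.Sprintf("# TYPE kafscale_s3_health gauge\nkafscale_s3_health %d\n", state)
}

func main() {
	fmt.Print(renderS3HealthGauge(1)) // e.g. 1 for the Degraded state
}
```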
Consumer group protocol
Kafscale implements the standard Kafka consumer group protocol. Groups transition through states as members join, leave, or fail heartbeats. The broker handles coordination, assignment, and offset tracking.
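A simplified model of the coordinator's transitions, using the standard Kafka group states (Dead omitted). This is a sketch of the protocol's behavior, not Kafscale's implementation:

```go
package main

import "fmt"

// The standard Kafka group coordinator states, minus Dead.
type groupState string

const (
	Empty               groupState = "Empty"
	PreparingRebalance  groupState = "PreparingRebalance"
	CompletingRebalance groupState = "CompletingRebalance"
	Stable              groupState = "Stable"
)

// onMemberJoin: any JoinGroup triggers a rebalance unless one is
// already being prepared.
func onMemberJoin(s groupState) groupState {
	if s == PreparingRebalance {
		return s
	}
	return PreparingRebalance
}

// onMemberFailure: losing the last member empties the group;
// otherwise the remaining members rebalance.
func onMemberFailure(s groupState, remaining int) groupState {
	if remaining == 0 {
		return Empty
	}
	return PreparingRebalance
}

func main() {
	fmt.Println(onMemberJoin(Stable)) // prints PreparingRebalance
}
```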
Operational defaults
- Bucket naming: kafscale-{environment}-{region} to isolate IAM and retention policies
- Region affinity: bucket region matches the Kubernetes cluster region to avoid cross-region cost and latency
- Encryption: SSE-KMS with a customer-managed CMK when provided; SSE-S3 fallback with a warning
- Lifecycle retention: operator-managed prefix rules derived from topic retention configuration
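The naming convention, plus the S3 rules it must respect, can be sketched as a small helper. The validation shown (lowercase, max 63 characters) covers the S3 constraints most likely to bite here; it is my sketch, not an operator function:

```go
package main

import (
	"fmt"
	"strings"
)

// bucketName applies the kafscale-{environment}-{region} convention
// and enforces the S3 naming rules that matter for it: bucket names
// must be lowercase and at most 63 characters long.
func bucketName(environment, region string) (string, error) {
	name := fmt.Sprintf("kafscale-%s-%s", environment, region)
	if name != strings.ToLower(name) || len(name) > 63 {
		return "", fmt.Errorf("invalid bucket name %q", name)
	}
	return name, nil
}

func main() {
	name, _ := bucketName("prod", "eu-central-1")
	fmt.Println(name) // prints kafscale-prod-eu-central-1
}
```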
Current development status
Kafscale is open source and under active development. We are working toward production readiness with compatibility regression testing, fault injection coverage, and repeatable benchmarks.
Full technical documentation, including quickstart guides, an API reference, and configuration options, is available at kafscale.io. This page focuses on the architecture decisions and use cases.
How to use Kafscale in an architecture
Kafscale is intended to be used as a Kafka-compatible transport layer. Producers and consumers connect using standard Kafka client libraries. Downstream compute engines such as Flink or Wayang read from Kafscale topics using their existing Kafka connectors. The platform focus remains durable delivery and replay, not embedded processing.
Related resources
If you are evaluating Kafscale or similar architectures for your organization:
- Distributed systems architecture leadership - architectural patterns for stateless services and object storage backends
- Apache Flink architecture leadership - stream processing design that pairs well with Kafscale as a source
- Iceberg data platform architecture - lakehouse patterns where Kafscale can serve as the ingestion layer
I help teams assess streaming architectures, reduce operational burden, and design cost-effective data platforms.
→ See how I work with teams or book a call.
If you need help with distributed systems, backend engineering, or data platforms, check my Services.