Problem
Kafka is designed for high-throughput message streaming, but brokers enforce strict limits on message size (roughly 1 MB by default via message.max.bytes).
Many workloads produce large payloads (documents, JSON blobs, binary exports, telemetry batches) that exceed these limits or degrade broker performance.
Teams often need a way to handle these larger messages without destabilizing the Kafka cluster.
Objective
Build a connector that allows applications to publish large messages through Kafka by offloading the payload to object storage, while keeping Kafka responsible only for lightweight references.
Solution Overview
The kaf-s3-connector solves this by uploading large message bodies to Amazon S3 (or compatible storage) and sending a compact Kafka message containing:
- the S3 object key
- metadata (size, checksum, etc.)
- optional contextual fields
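As a rough illustration, such a reference envelope might look like the following; the field names are hypothetical, not the connector's exact schema:

```python
import json

# Illustrative reference envelope; field names and values are hypothetical.
reference = {
    "s3_bucket": "large-payloads",                # assumed bucket name
    "s3_key": "orders/2024/05/17/3f9c0d7e.json",  # deterministic object key
    "size_bytes": 7340032,
    "checksum_sha256": "9b2a...",                 # integrity check for the payload
    "content_type": "application/json",
    "context": {"tenant": "acme", "source": "export-job"},  # optional contextual fields
}

# Only this small envelope travels through Kafka, serialized as JSON.
kafka_value = json.dumps(reference).encode("utf-8")
```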
A consumer can then:
- read the Kafka reference
- fetch the payload from S3
- process it as required
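A minimal consumer-side sketch of this pattern, written against kafka-python and boto3 directly rather than the connector's own API (topic name, bootstrap servers, bucket, and envelope fields are assumptions):

```python
import json

import boto3
from kafka import KafkaConsumer

def process(payload: bytes) -> None:
    # Placeholder for downstream ETL logic (parse, transform, load).
    print(f"received {len(payload)} bytes")

# Assumed topic name, bootstrap servers, and envelope field names.
consumer = KafkaConsumer(
    "large-payload-refs",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
s3 = boto3.client("s3")

for message in consumer:
    ref = message.value
    # Resolve the reference: fetch the actual payload from object storage.
    obj = s3.get_object(Bucket=ref["s3_bucket"], Key=ref["s3_key"])
    payload = obj["Body"].read()
    process(payload)
```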
Architecture and Technologies
- Python implementation
- Kafka producer/consumer
- S3 or compatible storage (MinIO, etc.)
- Configurable batching and payload-handling logic
- Lightweight JSON metadata envelope
Repository:
https://github.com/2pk03/kaf-s3
PyPI:
https://pypi.org/project/kaf-s3-connector/
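Install:
pip install kaf-s3-connector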
Implementation Notes
- The producer accepts payloads of arbitrary size.
- Each payload is uploaded to S3 under a deterministic key structure.
- Only the lightweight metadata envelope is written to Kafka, keeping messages small and the cluster stable.
- Consumers retrieve the referenced object and process it as part of an ETL flow.
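For illustration, here is a producer-side sketch of the flow described above, again using boto3 and kafka-python directly; the bucket name, topic, and key scheme are assumptions rather than the connector's actual defaults:

```python
import hashlib
import json
from datetime import datetime, timezone

import boto3
from kafka import KafkaProducer

s3 = boto3.client("s3")
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_large_message(payload: bytes, topic: str = "large-payload-refs") -> None:
    # Deterministic key structure: topic / UTC date / content hash.
    # The connector's actual key scheme may differ.
    digest = hashlib.sha256(payload).hexdigest()
    now = datetime.now(timezone.utc)
    key = f"{topic}/{now:%Y/%m/%d}/{digest}.bin"

    # Offload the payload itself to object storage.
    s3.put_object(Bucket="large-payloads", Key=key, Body=payload)

    # Only the compact reference envelope is sent through Kafka.
    producer.send(topic, {
        "s3_bucket": "large-payloads",
        "s3_key": key,
        "size_bytes": len(payload),
        "checksum_sha256": digest,
    })
    producer.flush()
```

Keying objects by content hash, as in this sketch, also makes re-publishing an identical payload idempotent on the storage side.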
Benefits
- Removes message size limitations from Kafka.
- Stabilizes Kafka clusters by preventing oversized messages.
- Enables downstream ETL or data-lake ingestion directly from object storage.
- Works with existing Kafka ecosystems with minimal change.
Use Cases
- Ingesting files, documents, or binary data into pipelines.
- Integrating with S3-based data lakes.
- Pre-processing before writing to Iceberg or other formats.
If you need help with distributed systems, backend engineering, or data platforms, check my Services.