Problem
Kafka is designed for high-throughput message streaming, but brokers enforce strict limits on message size (roughly 1 MB by default via message.max.bytes).
Many workloads produce large payloads (documents, JSON blobs, binary exports, telemetry batches) that exceed these limits or degrade broker performance.
Teams often need a way to handle these larger messages without destabilizing the Kafka cluster.
Objective
Build a connector that allows applications to publish large messages through Kafka by offloading the payload to object storage, while keeping Kafka responsible only for lightweight references.
Solution Overview
The kaf-s3-connector solves this by uploading large message bodies to Amazon S3 (or compatible storage) and sending a compact Kafka message containing:
- the S3 object key
- metadata (size, checksum, etc.)
- optional contextual fields
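As a rough illustration, such a reference envelope might look like the following; the field names are hypothetical, not the connector's exact schema:

```python
import json

# Illustrative reference envelope; field names and values are hypothetical.
reference = {
    "s3_bucket": "large-payloads",                # assumed bucket name
    "s3_key": "orders/2024/05/17/3f9c0d7e.json",  # deterministic object key
    "size_bytes": 7340032,
    "checksum_sha256": "9b2a...",                 # integrity check for the payload
    "content_type": "application/json",
    "context": {"tenant": "acme", "source": "export-job"},  # optional contextual fields
}

# Only this small envelope travels through Kafka, serialized as JSON.
kafka_value = json.dumps(reference).encode("utf-8")
```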
A consumer can then:
- read the Kafka reference
- fetch the payload from S3
- process it as required
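A minimal consumer-side sketch of this pattern, written against kafka-python and boto3 directly rather than the connector's own API (topic name, bootstrap servers, bucket, and envelope fields are assumptions):

```python
import json

import boto3
from kafka import KafkaConsumer

def process(payload: bytes) -> None:
    # Placeholder for downstream ETL logic (parse, transform, load).
    print(f"received {len(payload)} bytes")

# Assumed topic name, bootstrap servers, and envelope field names.
consumer = KafkaConsumer(
    "large-payload-refs",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
s3 = boto3.client("s3")

for message in consumer:
    ref = message.value
    # Resolve the reference: fetch the actual payload from object storage.
    obj = s3.get_object(Bucket=ref["s3_bucket"], Key=ref["s3_key"])
    payload = obj["Body"].read()
    process(payload)
```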
Architecture and Technologies
- Python implementation
- Kafka producer/consumer
- S3 or compatible storage (MinIO, etc.)
- Configurable batching and payload-handling logic
- Lightweight JSON metadata envelope
Repository:
https://github.com/2pk03/kaf-s3
PyPI:
https://pypi.org/project/kaf-s3-connector/
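Install:
pip install kaf-s3-connector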
Implementation Notes
- The producer accepts payloads of arbitrary size.
- Each payload is uploaded to S3 under a deterministic key structure.
- Only the lightweight metadata envelope is written to Kafka, keeping messages small and the cluster stable.
- Consumers retrieve the referenced object and process it as part of an ETL flow.
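For illustration, here is a producer-side sketch of the flow described above, again using boto3 and kafka-python directly; the bucket name, topic, and key scheme are assumptions rather than the connector's actual defaults:

```python
import hashlib
import json
from datetime import datetime, timezone

import boto3
from kafka import KafkaProducer

s3 = boto3.client("s3")
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_large_message(payload: bytes, topic: str = "large-payload-refs") -> None:
    # Deterministic key structure: topic / UTC date / content hash.
    # The connector's actual key scheme may differ.
    digest = hashlib.sha256(payload).hexdigest()
    now = datetime.now(timezone.utc)
    key = f"{topic}/{now:%Y/%m/%d}/{digest}.bin"

    # Offload the payload itself to object storage.
    s3.put_object(Bucket="large-payloads", Key=key, Body=payload)

    # Only the compact reference envelope is sent through Kafka.
    producer.send(topic, {
        "s3_bucket": "large-payloads",
        "s3_key": key,
        "size_bytes": len(payload),
        "checksum_sha256": digest,
    })
    producer.flush()
```

Keying objects by content hash, as in this sketch, also makes re-publishing an identical payload idempotent on the storage side.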
Benefits
- Removes message size limitations from Kafka.
- Stabilizes Kafka clusters by preventing oversized messages.
- Enables downstream ETL or data-lake ingestion directly from object storage.
- Works with existing Kafka ecosystems with minimal change.
Use Cases
- Ingesting files, documents, or binary data into pipelines.
- Integrating with S3-based data lakes.
- Pre-processing before writing to Iceberg or other formats.
If you need help with distributed systems, backend engineering, or data platforms, check my Services.