This page lists selected projects and systems I have built or contributed to. Each entry includes a short, factual description and links to source code or live documentation where available.
KafScale (creator)
A Kafka-protocol-compatible streaming platform that uses S3 as its durable message store.
Repository: github.com/novatechflow/kafscale
Website: kafscale.io
Architecture rationale: novatechflow.com/p/kafscale.html
KafScale targets the 80% of Kafka workloads that function as durable pipes—producers write, consumers read, teams rely on replay—without requiring sub-millisecond latency, exactly-once transactions, or compacted topics.
The architecture separates concerns cleanly: stateless broker pods handle Kafka protocol traffic, S3 stores immutable log segments (11 nines durability), and etcd manages metadata, offsets, and consumer group state. Brokers are ephemeral compute; data remains durable externally.
Written in Go with a Kubernetes-native operator. Supports 21 Kafka APIs including Produce, Fetch, Metadata, and full consumer group coordination. Apache 2.0 licensed.
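Because KafScale speaks the Kafka wire protocol, stock clients work against it unchanged. The sketch below uses kafka-python; the broker address, topic name, and consumer group are placeholders, not anything shipped with KafScale.

```python
# Produce to and replay from a KafScale broker with an off-the-shelf client.
# Broker address, topic, and group id below are illustrative placeholders.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="kafscale-broker:9092")
producer.send("orders", value=b'{"order_id": 42, "status": "created"}')
producer.flush()  # wait until the batch is acknowledged by the broker

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="kafscale-broker:9092",
    group_id="etl-replay",          # consumer-group state is kept in etcd on the KafScale side
    auto_offset_reset="earliest",   # replay the topic from the start of the log
)
for record in consumer:
    print(record.offset, record.value)
    break
```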
Stack: Go, gRPC, Protocol Buffers, S3, etcd, Kubernetes
Lucendex
A neutral, non-custodial execution layer for XRPL settlement.
Repository: github.com/2pk03/lucendex
Website: lucendex.com
Lucendex is a non-custodial, deterministic routing engine for the XRPL decentralized exchange. It indexes AMM pool and order book data, evaluates available paths, and produces quotes using a deterministic QuoteHash mechanism.
The service uses PostgreSQL and PL/pgSQL for indexing and routing logic, and provides Ed25519-authenticated API access.
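To give a feel for the idea behind a deterministic quote identifier, here is a small hedged sketch: serialize the quote inputs in a canonical order and hash them, so the same inputs always yield the same identifier. The field names and the choice of SHA-256 are illustrative, not Lucendex's actual QuoteHash scheme.

```python
# Illustrative only: a deterministic hash over canonically serialized quote
# fields. Field names and hashing choices are hypothetical.
import hashlib
import json

def quote_hash(path: list[str], amount_in: str, amount_out: str, ledger_index: int) -> str:
    canonical = json.dumps(
        {
            "path": path,
            "amount_in": amount_in,
            "amount_out": amount_out,
            "ledger_index": ledger_index,
        },
        sort_keys=True,         # fixed key order
        separators=(",", ":"),  # no whitespace variance
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Identical inputs always produce the same identifier, so a quote can be re-derived and verified.
print(quote_hash(["XRP", "USD.rExampleGateway"], "100", "98.7", 87654321))
```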
Stack: PostgreSQL, PL/pgSQL, Ed25519 authentication
docAI Toolkit
A Python toolkit for document analysis workflows.
Repository: github.com/2pk03/docai
PyPI: pypi.org/project/docai-toolkit/
Provides utilities for loading documents, splitting and preprocessing text, integrating embeddings or other ML-based processing, and preparing inputs for AI pipelines and downstream processing.
Available as a published package on PyPI.
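To illustrate the kind of preprocessing involved, here is a generic chunking sketch in plain Python; it intentionally does not mirror docai-toolkit's own API, which is documented in the repository.

```python
# Generic illustration of splitting a document into overlapping chunks for
# embedding or other downstream AI processing; not docai-toolkit's API.
def split_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some context overlap
    return chunks

document = "Quarterly report. " * 200   # placeholder document text
for chunk in split_text(document):
    pass  # hand each chunk to an embedding model or an AI pipeline
```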
Stack: Python
kaf-s3 Connector
A Kafka-to-S3 connector implemented in Python.
Repository: github.com/2pk03/kaf-s3
PyPI: pypi.org/project/kaf-s3-connector/
Case study: Kafka-to-S3 Connector: Large Message Offloading and Scalable ETL
Consumes records from Kafka topics, batches them, and writes them to S3 or S3-compatible object storage, with configurable batching parameters and storage formats.
Published as a package on PyPI.
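The core consume, batch, and upload loop looks roughly like the sketch below, written directly against kafka-python and boto3 rather than with the connector's own configuration; topic, bucket, and batch thresholds are placeholders.

```python
# Rough sketch of the consume -> batch -> upload pattern; the published
# connector wraps this behind configurable parameters and storage formats.
import time

import boto3
from kafka import KafkaConsumer

s3 = boto3.client("s3")
consumer = KafkaConsumer(
    "events",                            # placeholder topic
    bootstrap_servers="broker:9092",
    group_id="kaf-s3-sketch",
)

batch, last_flush = [], time.time()
for record in consumer:
    batch.append(record.value.decode("utf-8"))
    # Flush on size or age, the two usual batching knobs.
    if len(batch) >= 500 or time.time() - last_flush > 60:
        key = f"events/{int(time.time())}.jsonl"
        s3.put_object(
            Bucket="my-archive-bucket",  # placeholder bucket
            Key=key,
            Body="\n".join(batch).encode("utf-8"),
        )
        batch, last_flush = [], time.time()
```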
Stack: Python, Apache Kafka, S3
Scalytics-Federated / Schema→Iceberg Application
An internal system combining data ingestion, schema normalization, and processing pipelines to produce "AI-ready" data views for analytics or ML.
Organization: scalytics.io
Case study: Apache Wayang Federated Multi-Engine Processing Case Study
The architecture ingests data from arbitrary source systems or message topics, normalizes schemas, and writes unified results into an Iceberg-based data lakehouse. Processing runs on Apache Flink, which covers both streaming and batch workloads.
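On the sink side, the Flink-to-Iceberg handoff can be sketched with the PyFlink Table API, assuming the Flink Iceberg connector is available at runtime; the catalog name, warehouse path, table schema, and the normalized source view are placeholders.

```python
# Sketch of writing normalized records into an Iceberg table from Flink.
# Requires the Iceberg Flink runtime jar; names and paths are placeholders.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register an Iceberg catalog backed by object storage (placeholder warehouse).
t_env.execute_sql("""
    CREATE CATALOG lakehouse WITH (
        'type' = 'iceberg',
        'catalog-type' = 'hadoop',
        'warehouse' = 's3://my-warehouse/'
    )
""")
t_env.execute_sql("CREATE DATABASE IF NOT EXISTS lakehouse.analytics")

# Unified target table for the normalized, AI-ready view.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.analytics.events (
        event_id STRING,
        source   STRING,
        payload  STRING,
        ts       TIMESTAMP(3)
    )
""")

# normalized_events stands in for a previously registered source or view
# produced by the ingestion and normalization stages.
t_env.execute_sql(
    "INSERT INTO lakehouse.analytics.events SELECT * FROM normalized_events"
)
```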
Stack: Apache Flink, Apache Iceberg, Apache Wayang, data lakehouse architecture
Apache Wayang (PMC member and committer)
Project: Apache Wayang
Apache committer profile: wayang.apache.org/docs/community/team — listed as PMC, Committer (Apache ID: aloalt)
What Apache Wayang is:
Wayang is a unified data processing framework that allows developers to write data workflows in a platform-agnostic way. It translates logical plans into an intermediate representation (WayangPlan), then optimizes them and executes them across one or more processing engines—relational databases, batch engines, stream engines—without requiring users to write engine-specific code.
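The mechanism can be pictured with a deliberately tiny, conceptual sketch (not Wayang's actual API, which is Java/Scala): a logical operator has several engine-specific candidates, and a cost estimate decides which engine runs it.

```python
# Conceptual toy only: illustrates cost-based engine selection in the spirit
# of Wayang's optimizer. Engine names and the cost model are hypothetical.
from dataclasses import dataclass

@dataclass
class EngineChoice:
    engine: str
    estimated_cost: float  # arbitrary cost units

def choose_engine(cardinality: int) -> EngineChoice:
    # Small inputs stay on a single JVM; large inputs justify a distributed engine.
    candidates = [
        EngineChoice("java-streams", cardinality * 1.0),
        EngineChoice("spark", 5_000 + cardinality * 0.1),
    ]
    return min(candidates, key=lambda c: c.estimated_cost)

print(choose_engine(1_000))        # -> java-streams
print(choose_engine(10_000_000))   # -> spark
```

In Wayang itself this decision is made per operator across the whole WayangPlan, so a single workflow can span several engines.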
Why Wayang matters:
- Enables cross-platform execution: same data flow code can run on different engines (PostgreSQL, Spark, Flink, etc.) depending on workload and environment.
- Provides cost and performance optimization: its optimizer selects the most efficient execution plan across platforms.
- Supports federated or distributed data scenarios and heterogeneous data infrastructures, useful when data lives in multiple, different storage or processing systems.
My contributions:
I contribute as a committer and PMC member. My work spans development, architecture, and integration: bridging Wayang's core engine with batch and streaming data-processing pipelines, building data-ingestion integrations, and developing use cases for data lakes and AI-ready datasets.
Detailed write-up: What are performance implications of distributed data processing across multiple engines
Stack: Java, Scala, Apache Spark, Apache Flink, PostgreSQL, cross-platform optimization
If you need help with distributed systems, backend engineering, or data platforms, check my Services.