Portfolio

This page lists selected projects and systems I have built or contributed to. Each entry includes a short, factual description and links to source code or live documentation where available.


KafScale (creator)

A Kafka-protocol-compatible streaming platform that uses S3 for durable message storage.

Repository: github.com/novatechflow/kafscale
Website: kafscale.io
Architecture rationale: novatechflow.com/p/kafscale.html

KafScale targets the 80% of Kafka workloads that function as durable pipes—producers write, consumers read, teams rely on replay—without requiring sub-millisecond latency, exactly-once transactions, or compacted topics.

The architecture separates concerns cleanly: stateless broker pods handle Kafka protocol traffic, S3 stores immutable log segments (11 nines durability), and etcd manages metadata, offsets, and consumer group state. Brokers are ephemeral compute; data remains durable externally.

Written in Go with a Kubernetes-native operator. Supports 21 Kafka APIs including Produce, Fetch, Metadata, and full consumer group coordination. Apache 2.0 licensed.
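
Because the wire protocol is standard Kafka, a stock client should be able to talk to a cluster without changes. A minimal sketch with the kafka-python client; the broker address, topic, and group id are placeholder values:

```python
# Minimal sketch: a stock Kafka client talking to a KafScale cluster.
# Broker address, topic, and group id are placeholder values.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="kafscale-broker:9092")
# Produced messages end up as immutable log segments in S3;
# the broker pod handling this request holds no durable state.
producer.send("events", b"hello from a vanilla Kafka client")
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="kafscale-broker:9092",
    group_id="replay-demo",        # group offsets are tracked in etcd
    auto_offset_reset="earliest",  # replay the log from the beginning
)
for record in consumer:
    print(record.offset, record.value)
    break
```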

Stack: Go, gRPC, Protocol Buffers, S3, etcd, Kubernetes


Lucendex

A neutral, non-custodial execution layer for XRPL settlement.

Repository: github.com/2pk03/lucendex
Website: lucendex.com

Lucendex is a non-custodial, deterministic routing engine for the XRPL decentralized exchange. It indexes AMM pools and orderbook data, evaluates available paths, and produces quotes using a deterministic QuoteHash mechanism.

The service uses PostgreSQL and PL/pgSQL for indexing and routing logic, and provides Ed25519-authenticated API access.
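
To illustrate the two mechanisms side by side (deliberately not Lucendex's actual wire format; the field names and serialization below are invented), a deterministic hash requires a canonical serialization, and an Ed25519 key can sign the same bytes:

```python
# Hypothetical illustration only: field names and serialization are
# invented, not Lucendex's actual wire format.
import hashlib
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

quote = {"pair": "XRP/USD", "amount_in": "100", "amount_out": "52.31"}

# Canonical serialization (sorted keys, no whitespace) makes the digest
# deterministic: identical quotes always hash to the same value.
canonical = json.dumps(quote, sort_keys=True, separators=(",", ":")).encode()
quote_hash = hashlib.sha256(canonical).hexdigest()

# Ed25519 signature over the same canonical bytes, as a client might
# sign an authenticated API request.
signing_key = Ed25519PrivateKey.generate()
signature = signing_key.sign(canonical)

print(quote_hash, signature.hex())
```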

Stack: PostgreSQL, PL/pgSQL, Ed25519 authentication


docAI Toolkit

A Python toolkit for document analysis workflows.

Repository: github.com/2pk03/docai
PyPI: pypi.org/project/docai-toolkit/

Provides utilities for loading documents, splitting and preprocessing text, integrating embeddings or ML-based processing, and preparing inputs for AI pipelines and other downstream processing.
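
The snippet below is not the toolkit's API, just a plain-Python sketch of the kind of chunking step such pipelines perform before embedding:

```python
# Plain-Python sketch of overlapping text chunking, a typical
# preprocessing step; this is not docai-toolkit's actual API.
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters, each sharing
    `overlap` characters with the previous window."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]

document = "lorem ipsum " * 200          # stand-in for loaded document text
chunks = chunk_text(document, size=200, overlap=20)
# Each chunk can now be embedded or fed into a downstream AI pipeline.
print(len(chunks), len(chunks[0]))
```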

Available as a published package on PyPI.

Stack: Python


kaf-s3 Connector

A Kafka-to-S3 connector implemented in Python.

Repository: github.com/2pk03/kaf-s3
PyPI: pypi.org/project/kaf-s3-connector/
Case study: Kafka-to-S3 Connector: Large Message Offloading and Scalable ETL

Consumes records from Kafka topics, batches them, and writes them to S3 (or compatible object storage) using configurable batching parameters and storage formats.
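
The core loop is consume, batch, flush. A simplified sketch with kafka-python and boto3; the topic, bucket, and batch size are illustrative values, not the connector's actual configuration surface:

```python
# Simplified consume-batch-flush loop in the spirit of the connector.
# Topic, bucket, and batch size are illustrative values.
import time

import boto3
from kafka import KafkaConsumer

consumer = KafkaConsumer("events", bootstrap_servers="kafka:9092",
                         group_id="s3-offload")
s3 = boto3.client("s3")

batch, max_batch = [], 1000
for record in consumer:
    batch.append(record.value)
    if len(batch) >= max_batch:
        # One object per batch; newline-delimited records.
        key = f"events/batch-{int(time.time())}.ndjson"
        s3.put_object(Bucket="example-archive-bucket", Key=key,
                      Body=b"\n".join(batch))
        batch.clear()
```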

Published as a package on PyPI.

Stack: Python, Apache Kafka, S3


Scalytics-Federated / Schema→Iceberg Application

Internal system combining data ingestion, schema normalization, and processing pipelines to produce "AI-ready" data views for analytics or ML.

Organization: scalytics.io
Case study: Apache Wayang Federated Multi-Engine Processing Case Study

The architecture integrates data from arbitrary source systems or message topics, normalizes schemas, and writes unified results into an Iceberg-based data lakehouse. Processing runs on Apache Flink, which supports both streaming and batch workloads.
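
A minimal PyFlink sketch of the Iceberg write path; the catalog settings, warehouse path, and schema here are assumptions for illustration, not the production configuration:

```python
# Minimal PyFlink sketch of the Iceberg write path. Catalog settings,
# warehouse path, and schema are assumptions for illustration; the
# iceberg-flink-runtime jar must be on the classpath.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE CATALOG lake WITH (
        'type' = 'iceberg',
        'catalog-type' = 'hadoop',
        'warehouse' = 's3://example-warehouse/'
    )
""")
t_env.execute_sql("CREATE DATABASE IF NOT EXISTS lake.analytics")
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.events (
        event_id STRING,
        payload  STRING,
        ts       TIMESTAMP(3)
    )
""")

# Stand-in source: in production this would be the normalized output of
# the ingestion and schema layer, e.g. a Kafka-backed table.
t_env.execute_sql("""
    CREATE TEMPORARY TABLE normalized_events (
        event_id STRING,
        payload  STRING,
        ts       TIMESTAMP(3)
    ) WITH ('connector' = 'datagen')
""")

t_env.execute_sql(
    "INSERT INTO lake.analytics.events SELECT * FROM normalized_events")
```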

Stack: Apache Flink, Apache Iceberg, Apache Wayang, data lakehouse architecture


Apache Wayang (PMC member and committer)

Project: Apache Wayang
Apache committer profile: wayang.apache.org/docs/community/team — listed as PMC, Committer (Apache ID: aloalt)

What Apache Wayang is:

Wayang is a unified data processing framework that allows developers to write data workflows in a platform-agnostic way. It translates logical plans into an intermediate representation (WayangPlan), then optimizes them and executes them across one or more processing engines—relational databases, batch engines, stream engines—without requiring users to write engine-specific code.
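
To make the optimizer's job concrete, here is a toy sketch of cost-based platform selection. This is deliberately not Wayang's API; it only mimics the shape of the decision the optimizer automates:

```python
# Toy illustration of cross-platform plan selection. This is NOT
# Wayang's API; it only mimics the decision its optimizer automates.
from dataclasses import dataclass

@dataclass
class PlatformEstimate:
    name: str
    startup_cost: float   # fixed cost of launching a job on the engine
    cost_per_row: float   # marginal cost of processing one record

PLATFORMS = [
    PlatformEstimate("local-java-streams", startup_cost=0.0, cost_per_row=1.0),
    PlatformEstimate("spark-cluster", startup_cost=10_000.0, cost_per_row=0.01),
    PlatformEstimate("flink-cluster", startup_cost=8_000.0, cost_per_row=0.02),
]

def pick_platform(estimated_rows: int) -> str:
    """Return the engine with the lowest total estimated cost."""
    best = min(PLATFORMS,
               key=lambda p: p.startup_cost + p.cost_per_row * estimated_rows)
    return best.name

print(pick_platform(100))         # small input: local execution wins
print(pick_platform(10_000_000))  # large input: a cluster engine wins
```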

Why Wayang matters:

  • Enables cross-platform execution: same data flow code can run on different engines (PostgreSQL, Spark, Flink, etc.) depending on workload and environment.
  • Provides cost and performance optimization: its optimizer selects the most efficient execution plan across platforms.
  • Supports federated or distributed data scenarios and heterogeneous data infrastructures, useful when data lives in multiple, different storage or processing systems.

My contributions:

I contribute as a committer and PMC member. My work spans development, architecture, and integration: bridging Wayang's core engine with batch and stream data-processing pipelines, building data-ingestion integrations, and developing use cases for data-lake and AI-ready datasets.

Detailed write-up: What are performance implications of distributed data processing across multiple engines

Stack: Java, Scala, Apache Spark, Apache Flink, PostgreSQL, cross-platform optimization

If you need help with distributed systems, backend engineering, or data platforms, check my Services.

Most read articles

Why Is Customer Obsession Disappearing?

Many companies trade real customer obsession for automated, low-empathy support. Through examples from Coinbase, PayPal, GO Telecommunications and AT&T, this article shows how reliance on AI chatbots, outsourced call centers, and KPI-driven workflows erodes trust, NPS and customer retention. It argues that human-centric support—treating support as strategic investment instead of cost—is still a core growth engine in competitive markets.

How to scale MySQL perfectly

When MySQL reaches its limits, scaling cannot rely on hardware alone. This article explains how strategic techniques such as caching, sharding and operational optimisation can drastically reduce load and improve application responsiveness. It outlines how in-memory systems like Redis or Memcached offload repeated reads, how horizontal sharding mechanisms distribute data for massive scale, and how tools such as Vitess, ProxySQL and HAProxy support routing, failover and cluster management. The summary also highlights essential practices including query tuning, indexing, replication and connection management. Together these approaches form a modern DevOps strategy that transforms MySQL from a single bottleneck into a resilient, scalable data layer able to grow with your application.

What the Heck is Superposition and Entanglement?

This post is about superposition and interference in simple, intuitive terms. It describes how quantum states combine, how probability amplitudes add, and why interference patterns appear in systems such as electrons, photons and waves. The goal is to give a clear, non-mathematical understanding of how quantum behavior emerges from the rules of wave functions and measurement.