Apache Wayang® is a cross-platform data processing framework that lets developers write a single logical pipeline and execute it across engines such as Apache Spark, Apache Flink, Java Streams, PostgreSQL, or GraphChi. An abstraction layer and a cost-based optimizer select the execution platform for each operator, which fundamentally changes the performance behavior of distributed data pipelines. Wayang reduces manual data movement by deciding where each operator runs, but crossing platform boundaries still introduces serialization cost, shifts in locality, different memory strategies, and new tuning constraints. Hardware heterogeneity, cluster topology, and partitioning all influence whether Wayang's plan performs well or degrades throughput. Understanding these dynamics is essential before adopting cross-platform execution at scale.
Wayang in One Sentence
Wayang lets you write one logical pipeline and execute it across multiple heterogeneous engines, guided by a cost-based optimizer that tries to minimize runtime and data movement. Supported backends include:
- Big data engines: Spark, Flink
- Local or JVM engines: Java, Java Streams
- Query engines: PostgreSQL, Java SQL
- Graph processing: GraphChi
- ML or custom backends through plugins
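To make the idea concrete, here is a toy model of per-operator platform selection, not Wayang's actual API or cost numbers: every name and cost below is invented for illustration. The key mechanic it shows is that a switch penalty can keep an operator on a "worse" engine because staying put avoids a boundary crossing.

```python
# Toy model of cost-based platform selection per operator.
# All operator names, platforms, and costs are invented placeholders.

# Estimated cost (arbitrary units) of each operator on each platform.
COSTS = {
    ("filter", "postgresql"): 1,   # pushdown: index scan is cheap
    ("filter", "spark"): 5,
    ("join", "spark"): 10,         # distributed join scales out
    ("join", "java"): 100,         # a single JVM would thrash
    ("map", "java"): 2,            # small local transform
    ("map", "spark"): 4,
}
SWITCH_COST = 3  # serialization + data movement when the platform changes


def plan(operators):
    """Greedily pick the cheapest platform per operator,
    charging a switch cost whenever the engine changes."""
    chosen, prev = [], None
    for op in operators:
        candidates = {p: c for (o, p), c in COSTS.items() if o == op}
        platform = min(
            candidates,
            key=lambda p: candidates[p] + (SWITCH_COST if prev and p != prev else 0),
        )
        chosen.append((op, platform))
        prev = platform
    return chosen
```

Running `plan(["filter", "join", "map"])` keeps the final map on Spark even though Java's per-operator cost is lower: the switch penalty outweighs the saving. Wayang's real optimizer reasons over whole subplans rather than greedily, but the trade-off is the same.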
Why Cross-Platform Execution Changes Performance
In single-engine systems, performance is mostly shaped by:
- partitioning
- operator placement
- memory pressure
- network shuffle patterns
- hardware characteristics of the cluster

Cross-platform execution adds several new factors on top of these.
1. Engine Boundaries
Each platform switch introduces:
- serialization
- schema translation
- different memory layouts
- fragmented locality
- duplicated buffers
- different scheduling strategies
Spark → Flink is fundamentally different from Spark → PostgreSQL or Java → GraphChi.
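A back-of-envelope sketch helps build intuition for what a boundary costs. The throughput numbers below are invented placeholders, not measurements; the point is that crossing an engine boundary pays for serialization, transfer, and deserialization of the entire intermediate result.

```python
def boundary_cost_s(bytes_moved, ser_mb_s=200, net_mb_s=100, deser_mb_s=250):
    """Rough time (seconds) to cross an engine boundary:
    serialize, ship over the network, then deserialize.
    Throughput rates are illustrative placeholders."""
    mb = bytes_moved / 1e6
    return mb / ser_mb_s + mb / net_mb_s + mb / deser_mb_s


# A 1 GB intermediate result at these rates costs ~19 s just to move,
# before the next engine does any useful work.
cost = boundary_cost_s(1e9)
```

In practice the rates depend on the serialization format, the network path, and whether the two engines share nodes, which is exactly why the same logical plan can behave very differently across deployments.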
2. Platform Capabilities
Each backend has strengths and weaknesses Wayang must respect:
- Spark: high-throughput batch + ML
- Flink: streaming + iterative workloads
- PostgreSQL: filtering, projections, local joins
- Java Streams: CPU-bound fast local operations
- GraphChi: efficient graph algorithms
3. Data Movement Across Engines
Wayang tries to minimize it, but when movement is required:
- network latency grows
- data locality resets
- intermediate results must be marshalled across formats
- partition sizes and counts change
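The last point is easy to underestimate: moving from a many-partition engine to a single-node store and back destroys the original partitioning. A minimal sketch of that locality reset (illustrative only, not how any engine implements it):

```python
def repartition(partitions, target_count):
    """Flatten all records and re-split them round-robin into
    target_count partitions. Models the locality reset at an engine
    boundary: the original partition boundaries are lost."""
    records = [r for part in partitions for r in part]
    out = [[] for _ in range(target_count)]
    for i, record in enumerate(records):
        out[i % target_count].append(record)
    return out


# e.g. 200 Spark partitions -> 1 "partition" for a single-node store,
# then back to 200 if a later stage returns to Spark.
collapsed = repartition([[1, 2], [3, 4], [5]], 1)
```

Any co-partitioning a previous join established is gone after this round trip, so a later join may need a full shuffle that the single-engine plan would have avoided.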
4. Cost Model Interactions
Wayang estimates costs based on:
- cardinality propagation
- operator profiles
- platform-specific execution characteristics
- data placement
- intermediate sizes
- network topology (implicit through platform selection)
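Cardinality propagation is the foundation the other estimates build on: each operator scales the estimated row count by a selectivity factor, and errors compound down the pipeline. A toy version, with invented selectivity factors:

```python
# Invented per-operator selectivity factors (fraction of rows surviving,
# or expansion factor for joins). Real optimizers estimate these from
# statistics and operator profiles.
SELECTIVITY = {"filter": 0.1, "join": 1.5, "map": 1.0}


def propagate(input_rows, operators):
    """Propagate an estimated row count through a pipeline.
    Returns the estimated cardinality before and after each operator."""
    sizes = [input_rows]
    for op in operators:
        sizes.append(int(sizes[-1] * SELECTIVITY[op]))
    return sizes


# 1M input rows through filter -> join -> map:
estimates = propagate(1_000_000, ["filter", "join", "map"])
```

Because platform choice depends on these intermediate sizes, a selectivity estimate that is off by 10x can flip the optimizer's decision about where a subplan should run.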
Concrete Example: Spark → PostgreSQL → Java Pipeline
A pipeline may look like this:
- Spark stage: Distributed join + aggregation
- PostgreSQL stage: Filter + index-based lookup
- Java stage: Custom CPU-heavy transformation

Wayang assigns each stage to the engine that suits it:
- Spark for large-scale transformations
- PostgreSQL for selective filtering
- Java for fine-grained transformations that do not benefit from distributed overhead
Performance implications:
Spark → PostgreSQL
- Requires repartitioning to a single-node store
- Requires serialization into a row-based format
- Can reduce parallelism sharply
PostgreSQL → Java
- Low overhead because execution becomes local
- Still requires materializing intermediate results
Java → Spark (if returning)
- High overhead due to deserialization
- Requires repartitioning across the Spark cluster
- Adds a new scheduler cycle and possible shuffle
Wayang reduces the manual engineering effort, but performance is shaped by exactly where and when these transitions occur.
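The transitions above can be summed in a toy model. The costs here are invented relative units reflecting the qualitative discussion, not measurements:

```python
# Invented relative costs for the boundary crossings discussed above.
TRANSITION_COST = {
    ("spark", "postgresql"): 8,   # repartition to one node + row serialization
    ("postgresql", "java"): 1,    # local handoff, just materialization
    ("java", "spark"): 10,        # deserialize + reshuffle + new scheduler cycle
}


def transition_overhead(platform_sequence):
    """Sum the boundary costs along a plan's platform sequence.
    Unknown transitions are treated as free for simplicity."""
    return sum(
        TRANSITION_COST.get(pair, 0)
        for pair in zip(platform_sequence, platform_sequence[1:])
    )


overhead = transition_overhead(["spark", "postgresql", "java", "spark"])
```

Comparing this overhead against the operator-level savings of each specialized engine is, in miniature, the calculation Wayang's optimizer performs when it decides whether a detour through another platform is worth it.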
Abstraction Layer vs Real Overhead
The abstraction hides the mechanics of switching engines, but the overhead is real:
- platform switches
- data marshaling
- format conversions
- loss of partitioning
- new scheduling cycles
- potential mismatch in parallelism level
Why Cloud Environments Matter
Wayang itself is cloud-agnostic, but cloud environments still matter because the underlying engines depend on:
- network topology (AZ placement, VPC layout)
- storage layers (S3, Blob Storage, GCS)
- node families (memory-optimized vs compute-optimized)
- cluster managers (EMR, Dataproc, Kubernetes)
Cloud Impact Examples
- Spark on S3 behaves differently from Spark on HDFS
- PostgreSQL on a small VM interacts poorly with a high-throughput Spark stage
- Flink on Kubernetes can exhibit variability based on pod placement
- GraphChi performance depends strongly on local disk throughput
- Java Streams may outperform distributed engines for low-cardinality subplans
The optimizer can exploit these differences when placing operators:
- filtering early reduces Spark cost
- local Java execution replaces expensive cluster jobs
- SQL filters run directly in PostgreSQL instead of Spark
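Filter pushdown is the clearest win among these. A minimal sketch of the data-volume argument, with an assumed average row width of 100 bytes (a placeholder, like the other numbers here):

```python
ROW_BYTES = 100  # assumed average row width, illustrative only


def boundary_bytes(total_rows, selectivity, pushdown):
    """Bytes that must cross the database -> Spark boundary.
    With pushdown, only rows matching the filter leave the database;
    without it, every row is shipped and filtered on the other side."""
    rows = int(total_rows * selectivity) if pushdown else total_rows
    return rows * ROW_BYTES


# 10M rows with a 2% selective filter: pushdown ships 20 MB
# across the boundary instead of 1 GB.
with_pushdown = boundary_bytes(10_000_000, 0.02, pushdown=True)
without = boundary_bytes(10_000_000, 0.02, pushdown=False)
```

This is why Wayang favors running selective filters and projections inside PostgreSQL: the boundary cost scales with what crosses it, not with what the source table holds.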
Performance Challenges Unique to Wayang
1. Platform switching frequency

Optimization Principles for Wayang Pipelines

1. Minimize platform transitions

FAQ
Why does the same pipeline behave differently across engines?
Each engine has different execution models and memory strategies. Wayang exposes these differences as soon as a subplan moves from one engine to another.

What is the biggest performance cost in cross-platform pipelines?
Data movement between engines. Inter-platform shuffles increase cost and break locality.

Does Wayang always pick the fastest engine?
No. It picks the cheapest based on the estimated cost of the subplan.

Why can memory usage increase in cross-platform plans?
Platform switches may duplicate buffers, increase serialization and cause mismatched parallelism.

When is cross-platform execution worth it?
When different engines complement each other. For example:
- SQL filters in PostgreSQL
- big joins in Spark
- ML tasks in Python
- graph tasks in GraphChi

Can Wayang pipelines be tuned for performance?
Yes, by minimizing engine switches, tuning partitioning and using optimizer hints.