The SQL-on-Hadoop Era: What We Learned and What Replaced It
In the early 2010s, Apache Hadoop became the backbone of large-scale data processing. As businesses demanded interactive analytics on top of HDFS, a wave of SQL engines emerged. The goal: bring familiar relational querying to a distributed storage layer originally designed for MapReduce batch jobs.
By 2015, SQL-on-Hadoop was the hottest category in big data. Today, in 2025, most of those systems have disappeared, evolved, or been replaced by lakehouse architectures and cloud query engines. This article revisits the technologies from that era and provides a modern perspective on what happened—and why.
The “SQL-on-Hadoop” Engines of the Time
Below is the list from the original 2015 post, updated with each project's status as of 2025.
Apache Hive
Originally built for batch SQL on MapReduce, then extended with Tez and LLAP for interactive queries. In 2025, Hive is mostly legacy; its metastore survives as a central catalog in many ecosystems, but Spark SQL and Trino have overtaken it for execution.
Apache Drill
A schema-on-read, ANSI SQL engine. Innovative but struggled to gain long-term adoption. Archived in 2024; effectively sunset.
Apache Spark SQL
The survivor of the era. Spark SQL evolved into a foundational component of the data engineering ecosystem and later the lakehouse movement via Delta Lake. Still widely used in 2025.
Apache Phoenix
SQL layer on HBase. Served niche use cases requiring secondary indexing on top of a NoSQL ordered store. Still maintained but limited to environments that retained HBase.
Presto (now Trino)
One of the biggest success stories. Facebook’s Presto split into PrestoDB and PrestoSQL; PrestoSQL became Trino, one of the most important distributed query engines today, used heavily for federated analytics and lakehouse querying.
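To make "federated analytics" concrete, here is a minimal sketch of a Trino query that joins a table stored as Parquet on object storage with a live table in PostgreSQL. The catalog names (hive, postgresql), schemas, tables, and columns are hypothetical placeholders, not a real deployment.

```sql
-- Hypothetical Trino federated query: catalog, schema, table, and column
-- names are placeholders for illustration only.
SELECT
    o.customer_id,
    c.region,
    SUM(o.amount) AS total_spend
FROM hive.sales.orders AS o           -- Parquet data on object storage
JOIN postgresql.crm.customers AS c    -- operational table in PostgreSQL
    ON o.customer_id = c.id
WHERE o.order_date >= DATE '2025-01-01'
GROUP BY o.customer_id, c.region
ORDER BY total_spend DESC
LIMIT 20;
```

The point is that one engine plans and executes across both sources; neither table has to be copied into the other system first.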
VoltDB
A high-performance in-memory relational database. Still exists, but used mostly for niche transactional workloads. Not part of the Hadoop ecosystem anymore.
MapR SQL (SQL-on-Hadoop)
MapR as a company no longer exists; HPE acquired the assets. The SQL components faded with the platform.
Apache Kylin
OLAP cube engine for Hadoop. Useful for ultra-fast aggregated reporting. Still maintained, but overshadowed by modern systems like Druid, Pinot, and cloud-native OLAP services.
Apache Tajo
Ambitious distributed MPP SQL engine. Eventually archived; did not survive the shift to Spark and Trino.
Cascading Lingual
Provided a JDBC abstraction over Hadoop workflows. Innovative but discontinued.
Commercial SQL-on-Hadoop Engines (Historical)
Splice Machine
A hybrid transactional/analytical system built on HBase + Derby. Pivoted multiple times; no longer mainstream.
Pivotal HAWQ
An MPP SQL engine adapted to Hadoop. Eventually open sourced as Apache HAWQ, then archived.
Cloudera Impala
An MPP SQL engine with low-latency performance on HDFS. It still ships with Cloudera Data Platform, mainly serving workloads carried over from legacy CDH deployments; over time, cloud warehouses and Trino overtook its role.
Impala delivered fast OLAP queries by bypassing MapReduce and reading HDFS data directly through its own long-running daemons. It excelled at ad-hoc analytics but lacked the fault tolerance needed for heavy ETL, where Hive and Spark remained dominant.
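For a sense of what "ad-hoc analytics" meant in practice, the kind of query Impala served interactively was a low-latency aggregation over partitioned Parquet data. The table and columns below are hypothetical, era-typical examples.

```sql
-- Hypothetical interactive aggregation of the sort Impala handled well;
-- web_events is an illustrative Parquet table partitioned by event_date (string).
SELECT
    event_type,
    COUNT(*)        AS events,
    AVG(latency_ms) AS avg_latency_ms
FROM web_events
WHERE event_date BETWEEN '2015-01-01' AND '2015-01-31'
GROUP BY event_type
ORDER BY events DESC;
```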
Why SQL-on-Hadoop Faded Away
Several structural factors led to the decline of SQL-on-Hadoop:
- HDFS was not designed for low-latency interactive queries.
- Metadata fragmentation across Hive, HBase, and proprietary catalogs caused friction.
- Operational overhead of running HBase, HiveServer2, and distributed MPP engines was enormous.
- Cloud object storage (S3, GCS, ADLS) replaced HDFS as the dominant data layer.
- Lakehouse formats (Iceberg, Delta, Hudi) standardized table behavior beyond Hadoop.
- Cloud-native engines (BigQuery, Athena, Snowflake) changed user expectations permanently.
In short: the world moved from “Bring SQL to Hadoop” toward “Bring computation to a transactional table format on cheap, elastic cloud storage.”
What Replaced SQL-on-Hadoop (2025)
Modern data platforms rely on a completely different stack:
1. Lakehouse Table Formats
- Apache Iceberg
- Delta Lake
- Apache Hudi
These formats brought schema evolution, ACID transactions, time travel, and metadata pruning—features that SQL-on-Hadoop engines struggled to implement cleanly.
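A short Spark SQL sketch against an Iceberg table shows two of these features, schema evolution and time travel. The catalog name (lake), table, snapshot id, and timestamp are assumptions for illustration.

```sql
-- Sketch of Iceberg schema evolution and time travel via Spark SQL;
-- catalog, table, snapshot id, and timestamp are hypothetical.

-- Schema evolution: add a column without rewriting existing data files.
ALTER TABLE lake.db.events ADD COLUMNS (country STRING);

-- Time travel: query the table as of an earlier snapshot id...
SELECT COUNT(*) FROM lake.db.events VERSION AS OF 4348872038493001;

-- ...or as of a point in time.
SELECT COUNT(*) FROM lake.db.events TIMESTAMP AS OF '2025-06-01 00:00:00';
```

Delivering this on plain HDFS directories would have required every engine to agree on file layout and commit semantics, which is exactly what the SQL-on-Hadoop generation never managed to standardize.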
2. Distributed SQL Engines
- Trino
- Spark SQL
- Flink SQL (real-time SQL)
These engines made SQL-on-Hadoop obsolete by decoupling computation from HDFS and supporting high-performance querying on object storage.
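As a sketch of that decoupling, the Trino DDL below registers Parquet files that already sit in an object-store bucket and queries them in place; the catalog, schema, bucket path, and columns are placeholders.

```sql
-- Hypothetical Trino table over existing Parquet files on object storage;
-- catalog, schema, bucket, and column names are illustrative only.
CREATE TABLE hive.web.clickstream (
    user_id  BIGINT,
    url      VARCHAR,
    ts       TIMESTAMP
)
WITH (
    external_location = 's3://example-bucket/clickstream/',
    format = 'PARQUET'
);

SELECT url, COUNT(*) AS hits
FROM hive.web.clickstream
WHERE ts >= TIMESTAMP '2025-01-01 00:00:00'
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```

No HDFS cluster is involved: storage scales with the bucket, and compute scales with the query cluster, independently of each other.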
3. Cloud-Native Warehouses and Query Engines
- Snowflake
- BigQuery
- Athena / Redshift Spectrum
They offered near-infinite elasticity, operational simplicity, and ecosystem integration that Hadoop distributions could not match.
What We Learned from the Era
SQL-on-Hadoop was an important transitional technology. It introduced entire industries to:
- distributed analytical execution
- columnar formats (Parquet, ORC)
- federated querying
- separation of storage and compute (before it became mainstream)
The lessons from that ecosystem directly shaped today’s lakehouse architectures and modern SQL engines.
Conclusion
Looking back from 2025, the SQL-on-Hadoop era feels like an evolutionary bridge between early big data systems and modern lakehouse platforms. Many projects faded, a few evolved, but all contributed to the distributed SQL tooling we rely on today.
Understanding this history helps platform architects make better decisions—recognizing why certain patterns failed, why others persisted, and how today’s systems build on a decade of innovation.
If you need help with distributed systems, backend engineering, or data platforms, check my Services.