From Presto to Trino: A Look Back at the Early Days of Distributed SQL
In late 2013, Facebook released Presto as an open-source distributed SQL engine. At the time, SQL on Hadoop mostly meant Hive on MapReduce or Hive on Tez, both bound to batch-oriented execution models, with early Impala as the main low-latency alternative. Presto offered something radically different from the Hive stack: a low-latency, MPP-style SQL engine designed for interactive analytics at petabyte scale.
This article is a 2025 retrospective based on a hands-on write-up from 2014, preserving early installation notes and benchmark results while reflecting on how the technology evolved into Trino, one of the most important components in modern data platforms.
For additional background, see my related articles:
- The SQL-on-Hadoop Era: What We Learned and What Replaced It
- Linux & Kernel Tuning for Hadoop and Distributed Systems
Presto in 2014: A New Approach to Interactive SQL
When Facebook open-sourced Presto, it immediately stood out:
- Interactive SQL over Hive, HDFS, and Cassandra
- An extensible connector architecture
- A fully in-memory execution model (no MapReduce)
- Petabyte-scale internal deployments at Facebook
Back then, Presto shipped with a minimal JDBC driver, early Hive connector support, and limited documentation—but it was already game-changing.
Installation in 2014: Simple, but with Caveats
Setting up Presto 0.68 required attention to a few sharp edges:
- Java 7 only
- No Kerberos support
- No built-in authentication or authorization
- Manual configuration of node.properties and config.properties
A typical setup looked like this:
uuidgen > node.id
datasources=jmx,hive
export PATH=/usr/jdk64/jdk1.7.0_45/bin:$PATH && presto-server-0.68/bin/launcher start
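As a rough illustration of what those two files contained, here is a minimal Python sketch that renders them. Only `datasources=jmx,hive` comes from the setup above. The other property names (`node.environment`, `node.data-dir`, `coordinator`, `http-server.http.port`, `discovery-server.enabled`, `discovery.uri`) follow my recollection of the Presto documentation of that era. The environment name, data directory, port, and URI are hypothetical values, so treat the whole thing as an assumption rather than a verified 0.68 configuration.

```python
import uuid


def render_node_properties(environment: str, data_dir: str) -> str:
    """Return node.properties text with a freshly generated node id."""
    node_id = uuid.uuid4()  # plays the role of `uuidgen` in the shell setup
    return (
        f"node.environment={environment}\n"
        f"node.id={node_id}\n"
        f"node.data-dir={data_dir}\n"
    )


def render_config_properties(port: int, discovery_uri: str) -> str:
    """Return config.properties text for a single-node coordinator (assumed keys)."""
    return (
        "coordinator=true\n"
        "datasources=jmx,hive\n"  # the only line taken from the original notes
        f"http-server.http.port={port}\n"
        "discovery-server.enabled=true\n"
        f"discovery.uri={discovery_uri}\n"
    )


# Hypothetical values, chosen only for illustration.
node_properties = render_node_properties("test", "/var/presto/data")
config_properties = render_config_properties(8080, "http://localhost:8080")

print(node_properties)
print(config_properties)
```

Each node needs its own stable `node.id`, which is why the original setup generated one with `uuidgen` instead of hard-coding it.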
The Web UI showed splits, stages, and query execution paths—primitive compared to modern tooling, but revolutionary at the time.
The CLI was a standalone self-executing JAR:
export PATH=/usr/jdk64/jdk1.7.0_45/bin:$PATH && \
/software/presto --server host:port --catalog hive --schema default
Benchmarking Presto Against Hive and Tez (2014)
To evaluate performance, I generated sample datasets and executed a set of analytical queries. Here are two representative examples:
1. Finding Highest Gainers
select id, sum(amount) as amount
from (
select sender as id, amount * -1 as amount from transactions
union all
select recipient as id, amount from transactions
) t
group by id
order by amount desc
limit 10;
Results
Hive: 39.078 seconds
Tez: 18.227 seconds
Presto: 2 seconds
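To make the logic of this query concrete, here is a small sketch that runs the same statement against an in-memory SQLite table. SQLite stands in for Presto purely for illustration, and the four rows are hypothetical. The sketch says nothing about performance; it only shows how the `union all` turns each transfer into a debit for the sender and a credit for the recipient.

```python
import sqlite3

# Hypothetical sample rows: (sender, recipient, amount, time)
rows = [
    ("alice", "bob", 100, 1),
    ("bob", "carol", 50, 3),
    ("bob", "dave", 30, 4),
    ("carol", "alice", 10, 20),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table transactions "
    "(sender text, recipient text, amount integer, time integer)"
)
conn.executemany("insert into transactions values (?, ?, ?, ?)", rows)

# Same statement as in the benchmark: net balance per account, highest first.
GAINERS_SQL = """
select id, sum(amount) as amount
from (
  select sender as id, amount * -1 as amount from transactions
  union all
  select recipient as id, amount from transactions
) t
group by id
order by amount desc
limit 10
"""

top_gainers = conn.execute(GAINERS_SQL).fetchall()
for account, net in top_gainers:
    print(account, net)
```

With these rows, carol ends up with the highest net balance (received 50, sent 10) and alice with the lowest (sent 100, received 10).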
2. Finding Fraudsters (self-join with time window)
select count(*)
from (
select a.sender, a.recipient, b.recipient as c
from transactions a
join transactions b on a.recipient = b.sender
where a.time < b.time and b.time - a.time < 5
) t;
Results
Hive: 208 seconds
Tez: 101 seconds
Presto: 62 seconds
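The self-join is easier to follow on a handful of rows. The sketch below runs the same statement in an in-memory SQLite database, used here only as a stand-in for Presto, with hypothetical data. It counts two-hop chains (A pays B, then B pays C) where the second transfer follows the first by fewer than 5 time units.

```python
import sqlite3

# Hypothetical sample rows: (sender, recipient, amount, time)
rows = [
    ("alice", "bob", 100, 1),
    ("bob", "carol", 50, 3),    # 2 units after alice -> bob: counted
    ("bob", "dave", 30, 4),     # 3 units after alice -> bob: counted
    ("carol", "alice", 10, 20), # 17 units after bob -> carol: outside the window
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table transactions "
    "(sender text, recipient text, amount integer, time integer)"
)
conn.executemany("insert into transactions values (?, ?, ?, ?)", rows)

# Same statement as in the benchmark: chains a -> b -> c within the time window.
CHAINS_SQL = """
select count(*)
from (
  select a.sender, a.recipient, b.recipient as c
  from transactions a
  join transactions b on a.recipient = b.sender
  where a.time < b.time and b.time - a.time < 5
) t
"""

chain_count = conn.execute(CHAINS_SQL).fetchone()[0]
print(chain_count)
```

Because the join condition matches every incoming transfer with every outgoing one for the same account, the intermediate result can grow far faster than the table itself, which is a plausible reason this query was the slowest on all three engines.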
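These benchmarks showed why Presto gained traction so quickly in large organizations: on the aggregation query it was roughly 20x faster than Hive on MapReduce and about 9x faster than Tez, and even on the join-heavy query it was still more than 3x faster than Hive.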
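The speedup factors implied by the timings reported above can be checked with a few lines of arithmetic. This sketch only divides the published numbers; the dictionary layout and function name are my own.

```python
# Wall-clock timings in seconds, copied from the results above.
timings = {
    "highest gainers": {"Hive": 39.078, "Tez": 18.227, "Presto": 2.0},
    "fraud self-join": {"Hive": 208.0, "Tez": 101.0, "Presto": 62.0},
}


def speedups(query_timings: dict) -> dict:
    """Return how many times faster Presto was than each other engine."""
    presto = query_timings["Presto"]
    return {
        engine: seconds / presto
        for engine, seconds in query_timings.items()
        if engine != "Presto"
    }


for query, query_timings in timings.items():
    for engine, factor in speedups(query_timings).items():
        print(f"{query}: Presto {factor:.1f}x faster than {engine}")
```

The ratios come out at about 19.5x (Hive) and 9.1x (Tez) for the aggregation, and about 3.4x (Hive) and 1.6x (Tez) for the self-join.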
How Presto Evolved Into Trino
Over the following years, the Presto project split into two lines:
- PrestoDB → the original Facebook-maintained version
- PrestoSQL → spun out by the original creators
In 2020, PrestoSQL became Trino, now one of the most important engines in modern data platforms.
What the engine gained on the way from early Presto to today's Trino:
- Kerberos, LDAP, OAuth2 authentication
- Fine-grained authorization (SQL grants, Ranger integration)
- Iceberg, Delta Lake, Hudi connectors
- Object storage support (S3, ADLS, GCS)
- Fault-tolerant execution
- Kubernetes-native deployments
- Massive connector ecosystem
What started as a fast SQL engine for Hive tables became a universal analytics engine capable of querying dozens of data sources with ANSI SQL.
Modern Perspective (2025)
Looking back, Presto’s early design decisions reshaped an entire generation of data systems:
- Interactive SQL became a baseline expectation
- Federated queries reduced the need for many ETL-heavy workflows
- MPP execution became standard for analytical engines
- The connector abstraction influenced modern lakehouse engines
The SQL-on-Hadoop ecosystem around Presto, Impala, Hive+Tez, Drill, and Tajo eventually gave way to the lakehouse ecosystem—Spark SQL, Trino, Flink SQL, Iceberg, Delta Lake, and cloud-native engines like BigQuery and Athena.
Presto may have started as a rough-edged engine running on Java 7 with no built-in security, but it paved the way for the interactive analytical engines we rely on today.
Conclusion
Presto’s impact on the evolution of distributed SQL cannot be overstated. The 2014 benchmarks already showed the future: low-latency, MPP-style execution on open data formats. A decade later, this architecture has become the foundation of modern data platforms, with Trino standing as the direct descendant of the ideas introduced by early Presto.
Understanding this evolution helps modern data architects design systems that balance performance, openness, and ecosystem integration—principles that remain just as relevant today.