Facebook's Presto Evolution into Trino and Starburst

This retrospective revisits the early days of Facebook’s Presto engine in 2014, including installation quirks, missing security features, and benchmark comparisons with Hive and Tez. It explains how Presto’s performance and connector architecture reshaped SQL-on-Hadoop and ultimately led to the creation of Trino, the modern distributed SQL engine used today across large-scale data platforms.

From Presto to Trino: A Look Back at the Early Days of Distributed SQL

In late 2013, Facebook released Presto as an open-source distributed SQL engine. At the time, the dominant SQL engines on Hadoop, Hive on MapReduce and Hive on Tez, were still bound to batch-oriented execution models, and Impala was only in its early releases. Presto took a different path from Hive: a low-latency, MPP-style SQL engine designed for interactive analytics at petabyte scale.

This article is a 2025 retrospective based on a hands-on write-up from 2014, preserving early installation notes and benchmark results while reflecting on how the technology evolved into Trino, one of the most important components in modern data platforms.

Presto in 2014: A New Approach to Interactive SQL

When Facebook open-sourced Presto, it immediately stood out:

  • Interactive SQL over Hive, HDFS, and Cassandra
  • An extensible connector architecture
  • A fully in-memory execution model (no MapReduce)
  • Petabyte-scale internal deployments at Facebook

Back then, Presto shipped with a minimal JDBC driver, early Hive connector support, and limited documentation—but it was already game-changing.

Installation in 2014: Simple, but with Caveats

Setting up Presto 0.68 required attention to a few sharp edges:

  • Java 7 only
  • No Kerberos support
  • No built-in authentication or authorization
  • Manual configuration of node.properties and config.properties

A typical setup looked like this:

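# generate a unique ID for this node (the value for node.id in node.properties)
uuidgen > node.id
# config.properties: enable the JMX and Hive data sources
datasources=jmx,hive
# put Java 7 on the PATH, then start the server
export PATH=/usr/jdk64/jdk1.7.0_45/bin:$PATH && presto-server-0.68/bin/launcher start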
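The manual configuration step can also be sketched as a small script that renders the two properties files. This is an illustrative sketch, not the original 2014 setup: `node.id` and `datasources` come from the fragments above, while the `node.environment` value and the helper name `render_properties` are assumptions added for the example.

```python
import uuid


def render_properties(props):
    """Render a dict as Java-style .properties text, one key=value per line."""
    return "".join(f"{key}={value}\n" for key, value in props.items())


# node.properties: every node needs its own unique ID (the uuidgen step).
node_properties = render_properties({
    "node.environment": "test",    # assumption: placeholder environment name
    "node.id": str(uuid.uuid4()),  # Python equivalent of running `uuidgen`
})

# config.properties: the data sources line from the setup shown earlier.
config_properties = render_properties({
    "datasources": "jmx,hive",
})

print(node_properties, end="")
print(config_properties, end="")
```

Generating the ID once and writing it to the file matters because the node is expected to keep the same identity across restarts.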

The Web UI showed splits, stages, and query execution paths—primitive compared to modern tooling, but revolutionary at the time.

The CLI was a standalone self-executing JAR:

export PATH=/usr/jdk64/jdk1.7.0_45/bin:$PATH && \
/software/presto --server host:port --catalog hive --schema default

Benchmarking Presto Against Hive and Tez (2014)

To evaluate performance, I generated sample datasets and executed a set of analytical queries. Here are two representative examples:

1. Finding Highest Gainers


select id, sum(amount) as amount 
from (
  select sender as id, amount * -1 as amount from transactions
  union all
  select recipient as id, amount from transactions
) t
group by id
order by amount desc
limit 10;

Results

Hive: 39.078 seconds
Tez:  18.227 seconds
Presto: 2 seconds
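
To make the logic of this query concrete, here is a minimal sketch that runs it against a tiny in-memory SQLite table. The four rows and the column types are made up for illustration and have nothing to do with the 2014 benchmark data. SQLite stands in only because it ships with Python.

```python
import sqlite3

# Hypothetical sample data: (sender, recipient, amount, time)
ROWS = [
    ("alice", "bob", 50, 1),
    ("bob", "carol", 20, 3),
    ("carol", "alice", 5, 4),
    ("bob", "dave", 10, 20),
]

HIGHEST_GAINERS = """
select id, sum(amount) as amount
from (
  select sender as id, amount * -1 as amount from transactions
  union all
  select recipient as id, amount from transactions
) t
group by id
order by amount desc
limit 10
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table transactions "
    "(sender text, recipient text, amount integer, time integer)"
)
conn.executemany("insert into transactions values (?, ?, ?, ?)", ROWS)

# Each transfer counts negatively for the sender and positively for the
# recipient, so the sum per id is that account's net gain.
gainers = conn.execute(HIGHEST_GAINERS).fetchall()
print(gainers)
# → [('bob', 20), ('carol', 15), ('dave', 10), ('alice', -45)]
```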

2. Finding Fraudsters (self-join with time window)


select count(*) 
from (
  select a.sender, a.recipient, b.recipient as c
  from transactions a 
  join transactions b on a.recipient = b.sender
  where a.time < b.time and b.time - a.time < 5
) t;

Results

Hive: 208 seconds
Tez:  101 seconds
Presto: 62 seconds
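
The self-join can be checked the same way on a toy dataset. This is a sketch under the same assumptions as before: the rows are invented, the time column is treated as a plain integer, and SQLite is only a convenient stand-in for the engines benchmarked here.

```python
import sqlite3

# Hypothetical sample data: (sender, recipient, amount, time)
ROWS = [
    ("alice", "bob", 50, 1),
    ("bob", "carol", 20, 3),
    ("carol", "alice", 5, 4),
    ("bob", "dave", 10, 20),
]

FORWARDED_TRANSFERS = """
select count(*)
from (
  select a.sender, a.recipient, b.recipient as c
  from transactions a
  join transactions b on a.recipient = b.sender
  where a.time < b.time and b.time - a.time < 5
) t
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table transactions "
    "(sender text, recipient text, amount integer, time integer)"
)
conn.executemany("insert into transactions values (?, ?, ?, ?)", ROWS)

# Counts chains sender -> recipient -> c where the second transfer
# follows the first within fewer than 5 time units.
(chain_count,) = conn.execute(FORWARDED_TRANSFERS).fetchone()
print(chain_count)
# → 2  (alice -> bob -> carol, and bob -> carol -> alice)
```

The join condition pairs every transfer with every later transfer made by its recipient, which is why this query is far more expensive than the aggregation and why the gap between engines narrows.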

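These benchmarks showed why Presto gained instant traction in large organizations: on the aggregation query it ran roughly 20 times faster than Hive on MapReduce and about 9 times faster than Tez. On the join-heavy query the gap narrowed to roughly 3x over Hive and 1.6x over Tez, but Presto was still the fastest of the three.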
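
The speedups implied by the timings above can be worked out directly. The snippet below is just arithmetic on the reported numbers. The dictionary layout and the helper name `speedups_vs_presto` are illustrative choices.

```python
# Reported wall-clock times in seconds, copied from the results above.
TIMINGS = {
    "highest gainers": {"Hive": 39.078, "Tez": 18.227, "Presto": 2.0},
    "fraudsters self-join": {"Hive": 208.0, "Tez": 101.0, "Presto": 62.0},
}


def speedups_vs_presto(times):
    """Return how many times slower each engine was than Presto."""
    presto = times["Presto"]
    return {
        engine: round(seconds / presto, 1)
        for engine, seconds in times.items()
        if engine != "Presto"
    }


for query, times in TIMINGS.items():
    print(query, speedups_vs_presto(times))
# → highest gainers {'Hive': 19.5, 'Tez': 9.1}
# → fraudsters self-join {'Hive': 3.4, 'Tez': 1.6}
```

Because the Presto time for the first query is only reported to the nearest second, the first pair of ratios should be read as approximate.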

How Presto Evolved Into Trino

Over the following years, the Presto project split into two lines:

  • PrestoDB → the original Facebook-maintained version
  • PrestoSQL → spun out by the original creators

In 2020, PrestoSQL became Trino, now one of the most important engines in modern data platforms.

What the project gained on the way from early Presto to today’s Trino:

  • Kerberos, LDAP, OAuth2 authentication
  • Fine-grained authorization (SQL grants, Ranger integration)
  • Iceberg, Delta Lake, Hudi connectors
  • Object storage support (S3, ADLS, GCS)
  • Fault-tolerant execution
  • Kubernetes-native deployments
  • Massive connector ecosystem

What started as a fast SQL engine for Hive tables became a universal analytics engine capable of querying dozens of data sources with ANSI SQL.

Modern Perspective (2025)

Looking back, Presto’s early design decisions reshaped an entire generation of data systems:

  • Interactive SQL became a baseline expectation
  • Federated analytics reduced the need for ETL-heavy workflows
  • MPP execution became standard for analytical engines
  • The connector abstraction influenced modern lakehouse engines

The SQL-on-Hadoop ecosystem around Presto, Impala, Hive+Tez, Drill, and Tajo eventually gave way to the lakehouse ecosystem—Spark SQL, Trino, Flink SQL, Iceberg, Delta Lake, and cloud-native engines like BigQuery and Athena.

In 2014, open-source Presto was a young release with rough edges, running on Java 7 and lacking basic security features, but it paved the way for the interactive analytical engines we rely on today.

Conclusion

Presto’s impact on the evolution of distributed SQL cannot be overstated. The 2014 benchmarks already showed the future: low-latency, MPP-style execution on open data formats. A decade later, this architecture has become the foundation of modern data platforms, with Trino standing as the direct descendant of the ideas introduced by early Presto.

Understanding this evolution helps modern data architects design systems that balance performance, openness, and ecosystem integration—principles that remain just as relevant today.
