novatechflow | Alexander Alten


Showing posts from January, 2016

Hive on Tez: Why It Was Faster and Why Manual CDH Integration Is Now Legacy

Apache Tez was introduced as a faster, DAG-based execution engine for Hive and other Hadoop workloads, delivering 30–50% speedups over classic MapReduce in many ETL pipelines. This article explains what Tez brought to Hive, how it fit into CDH-era deployments, and why the old practice of hand-compiling Tez against CDH 5.x is now a legacy pattern rather than a recommended approach.

Apache Tez was designed as a low-latency, DAG-based execution engine for Hadoop. It replaced many of the heavyweight MapReduce patterns used by early Hive deployments with more efficient execution graphs, reusing containers and avoiding unnecessary materialization steps. In practical terms, switching Hive from MapReduce to Tez often yielded 30–50% faster ETL and reporting jobs on the same hardware, especially for complex multi-stage queries.

What Tez Brought to Hive

DAG execution: instead of chaining MapReduce jobs, Tez represents the query plan as a directed acyclic graph of tasks.

Contai...
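
The DAG idea described above can be illustrated with a small toy model. The sketch below is not Tez or Hive code: the stage names, the `plan` dictionary, and both helper functions are hypothetical, and it assumes, as a deliberate simplification, that a chained-jobs engine persists the output of every stage while a DAG engine persists only the final result.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical query plan: each stage maps to the set of stages it depends on.
plan = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "sort": {"aggregate"},
}


def run_waves(plan):
    """Group stages into waves; stages in the same wave have no unmet
    dependencies and could run concurrently."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


def materializations(plan, chained):
    """Count stage outputs persisted to shared storage in this toy model.

    chained=True:  every stage is treated as its own job and writes its output.
    chained=False: only stages that nothing else depends on (the final
                   results) write their output.
    """
    if chained:
        return len(plan)
    consumed = set().union(*plan.values())
    return len(set(plan) - consumed)


if __name__ == "__main__":
    for number, wave in enumerate(run_waves(plan), start=1):
        print(f"wave {number}: {', '.join(wave)}")
    print("persisted outputs, chained jobs:", materializations(plan, chained=True))
    print("persisted outputs, single DAG:  ", materializations(plan, chained=False))
```

In this model the two scans land in the same wave, and the single-DAG count drops because intermediate results are handed from stage to stage rather than written out between jobs. Real engines group operators differently, so the numbers only indicate the direction of the effect.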