Posts

Showing posts from 2014

Linux & Kernel Tuning for Hadoop and Large Distributed Systems (2025 Update)

This guide explains the essential Linux, kernel, memory, and network tuning techniques required to operate high-performance Hadoop and distributed systems. It covers modern configuration practices for swappiness, transparent huge pages, overcommit behavior, socket and port tuning, file descriptor limits, disk behavior, and DNS resolution. Legacy options are included where still relevant, with updated recommendations for modern kernels and systemd-based Linux distributions. Running Hadoop or any large distributed system at scale requires more than good cluster design. Performance and stability depend heavily on the underlying Linux configuration. This guide revisits the classic Hadoop tuning principles from a modern 2025 perspective, explains what still matters, and documents what has changed in recent kernel versions. These tuning practices apply not just to Hadoop, but also to Kafka, HBase, Zookeeper, Flink, object storage gateways, and high-ingest distributed systems wh...
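The settings the guide names can be collected into a config fragment. Below is a minimal sketch for a sysctl drop-in plus per-user descriptor limits; the specific values are illustrative assumptions for a typical Hadoop worker, not benchmarked recommendations from the post, and should be tuned per workload:

```shell
# /etc/sysctl.d/90-hadoop.conf -- illustrative values, adjust per workload
vm.swappiness = 1                           # keep JVM heaps out of swap
vm.overcommit_memory = 1                    # let forks of large JVMs succeed
net.core.somaxconn = 1024                   # deeper accept queue for busy daemons
net.ipv4.ip_local_port_range = 10000 65000  # wider ephemeral port range
fs.file-max = 6815744                       # system-wide file descriptor ceiling

# Apply without a reboot:
#   sysctl --system
# Disable transparent huge pages at runtime (persist via a systemd unit):
#   echo never > /sys/kernel/mm/transparent_hugepage/enabled

# /etc/security/limits.d/hadoop.conf -- raise per-user descriptor limits
# hdfs  -  nofile  65536
# yarn  -  nofile  65536
```

The sysctl.d drop-in survives reboots on systemd-based distributions, whereas echoing into /sys does not, hence the note about persisting the THP setting separately.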

Why HiveServer2 Replaced the Hive CLI (and Why It Still Matters)

HiveServer2 replaced the old Hive CLI because the CLI bypassed all security and governance layers, could not support multi-user concurrency, and created operational risks that modern data platforms cannot tolerate. This updated version explains the historical context, what changed in today’s Hadoop and Hive environments, and why Beeline and JDBC remain the only correct way to access Hive securely and predictably. When Hive 0.11 introduced HiveServer2 (HS2), it marked a necessary break with the legacy Hive CLI model. While the original post explained this transition for early Hadoop distributions, the underlying reasons remain valid even in modern Hive deployments. Today the Hive CLI is effectively obsolete, and all secure or governed environments require HS2 as the mandatory entry point.

Why the Hive CLI Had to Die

1. The CLI Bypassed All Security

The original Hive CLI talked directly to the Hive Metastore and launched MapReduce or Tez jobs without going through a controlled ser...
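The HS2-only access path described above means all clients connect over JDBC, typically via Beeline. A short sketch of the common invocations; hostname, port, user, and Kerberos principal are placeholders for your cluster, not values from the post:

```shell
# Connect through HiveServer2 with Beeline instead of the legacy CLI.
# hs2-host.example.com, etl_user, and the principal are placeholders.

# Simple (non-Kerberos) connection:
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default" -n etl_user

# Kerberized cluster: the HS2 service principal goes into the JDBC URL
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default;principal=hive/hs2-host.example.com@EXAMPLE.COM"

# Run a single statement non-interactively:
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default" -n etl_user -e "SHOW DATABASES;"
```

Because every one of these paths terminates at HS2, authentication, authorization, and auditing all happen in one controlled place, which is exactly what the direct-to-Metastore CLI could never guarantee.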

Understanding HDFS Extended Attributes (XAttr) in Modern Hadoop

HDFS extended attributes (XAttr) allow files and directories to carry custom metadata such as encryption markers, checksums, lineage tags or security labels. Introduced years ago and now fully stable in Hadoop 3.x, they provide a flexible way for governance tools and applications to attach structured or free-form information directly to filesystem objects. This updated version explains how namespaces work, how limits are configured and how to read and write attributes using current HDFS commands. Extended Attributes (XAttr), familiar from UNIX-like filesystems, allow HDFS to store custom metadata alongside files and directories. Modern data platforms use these attributes for encryption tagging, data classification, backup markers, application metadata and security frameworks.

How HDFS Stores Extended Attributes

HDFS supports four namespaces, aligned with Linux kernel semantics:

user – user-defined metadata
security – security-related attributes (superuser only)
sy...
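Reading and writing attributes uses the standard setfattr/getfattr subcommands of the HDFS shell. A brief sketch against a placeholder path (/data/reports and the attribute name are assumptions for illustration):

```shell
# Set a user-namespace attribute on an HDFS path:
hdfs dfs -setfattr -n user.classification -v confidential /data/reports

# Dump all attribute names and values on the path:
hdfs dfs -getfattr -d /data/reports

# Read one named attribute:
hdfs dfs -getfattr -n user.classification /data/reports

# Remove the attribute again:
hdfs dfs -setfattr -x user.classification /data/reports
```

Only the user namespace is freely writable by the file owner; security and system attributes are reserved for privileged components, matching the Linux semantics the post describes.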

Remove HDP and Ambari completely

It's a bit hard to remove HDP and Ambari completely, so I'm sharing my removal script here. It works perfectly for me; just adjust the HDFS directory. In my case it was /hadoop.

#!/bin/bash
echo "==> Stop Ambari and Hue"
ambari-server stop && ambari-agent stop
/etc/init.d/hue stop
sleep 10

echo "==> Erase HDP and Ambari completely"
yum -y erase ambari-agent ambari-server ambari-log4j hadoop libconfuse nagios ganglia sqoop hcatalog\* hive\* hbase\* zookeeper\* oozie\* pig\* snappy\* hadoop-lzo\* knox\* hadoop\* storm\* hue\*

# remove configs
rm -rf /var/lib/ambari-*/keys /etc/hadoop/ /etc/hive /etc/hbase/ /etc/oozie/ /etc/zookeeper/ /etc/falcon/ /etc/ambari-* /etc/hue/

# remove Ambari's default HDFS dir
rm -rf /hadoop

echo "==> Remove HDP and Ambari Repo"
rm -rf /etc/yum.repos.d/HDP.repo /etc/yum.repos.d/ambari.repo

# delete all HDP related users
echo "==> Delete the user accounts"
userdel -f hdfs && userdel -...

Facebook's Presto Evolution into Trino and Starburst

This retrospective revisits the early days of Facebook’s Presto engine in 2014, including installation quirks, missing security features, and benchmark comparisons with Hive and Tez. It explains how Presto’s performance and connector architecture reshaped SQL-on-Hadoop and ultimately led to the creation of Trino, the modern distributed SQL engine used today across large-scale data platforms. From Presto to Trino: A Look Back at the Early Days of Distributed SQL In late 2013, Facebook released Presto as an open-source distributed SQL engine. At the time, Hadoop’s dominant SQL engines—Hive (MapReduce), Hive+Tez, and early Impala—were still bound to batch-oriented execution models. Presto introduced something radically different: a low-latency, MPP-style SQL engine designed for interactive analytics at petabyte scale. This article is a 2025 retrospective based on a hands-on write-up from 2014, preserving early installation notes and benchmark results while reflecting on...

Cloudera Manager fails to upgrade Sqoop2 when parcels are enabled

Cloudera Manager fails to update the generic Sqoop2 connectors when parcels are enabled, and the Sqoop2 server won't start anymore. The logs show an error like: Caused by: org.apache.sqoop.common.SqoopException: JDBCREPO_0026:Upgrade required but not allowed - Connector: generic-jdbc-connector. The issue can be fixed by adding two properties to the Sqoop service safety valve: org.apache.sqoop.connector.autoupgrade=true and org.apache.sqoop.framework.autoupgrade=true. The root cause is that Cloudera Manager does not auto-upgrade the default Sqoop connectors. After the properties are added, the Sqoop2 server is able to upgrade the connectors and will start successfully.
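For reference, the two safety-valve entries end up in the Sqoop2 server's properties file. A minimal sketch of the resulting fragment (the surrounding file contents and exact deployment path are assumptions, since Cloudera Manager regenerates the file on each restart):

```shell
# Appended via the Sqoop2 service safety valve in Cloudera Manager.
# Lets the server auto-upgrade the bundled connector metadata on
# startup instead of aborting with JDBCREPO_0026.
org.apache.sqoop.connector.autoupgrade=true
org.apache.sqoop.framework.autoupgrade=true
```

Because Cloudera Manager owns the generated configuration, the safety valve is the supported way to inject these lines; editing the file by hand would be overwritten on the next redeploy.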

Who Really Led the Hadoop Market? A Look Back at the 2014 Forrester Wave

In 2014 every Hadoop vendor claimed to be the market leader, but the Forrester Wave told a different story: the ecosystem was crowded, overlapping, and full of marketing noise. Looking back from 2025, it’s clear that none of the commercial players won—open source won, and the industry evolved far beyond the Hadoop vendors of that era. In early 2014, Forrester Research published its well-known Forrester Wave: Big Data Hadoop Solutions, Q1 2014. The report evaluated the major players of that time—Cloudera, Hortonworks, MapR, IBM, Teradata—and declared them all “leaders.” Not surprisingly, each vendor immediately launched a marketing campaign claiming to be the one true leader. From the outside it looked almost comedic: five companies staring at the same chart, each insisting the dot representing them was the real champion. The reality? The Hadoop distribution market was crowded, competitive, and full of overlapping capabilities. Nobody led decisively—and that matters. Th...