
Posts

Showing posts from 2015

Building RPMs with Maven for Reliable Software Deployment

This article explains how to package applications as revisionable RPM artifacts using Maven, a technique that remains valuable for DevOps and platform engineering teams who operate in controlled, reproducible, or air-gapped environments. It covers prerequisites, rpm-maven-plugin configuration, directory mappings, permissions, and how to integrate RPM creation into CI pipelines to deliver consistent deployment units for Java services.

Modern DevOps Packaging: Building RPMs with Maven for Reliable Software Deployment

In modern DevOps and platform engineering, one of the most underrated tools is still the RPM. Even with the rise of containers, many organizations rely on RPM-based delivery to manage internal services, JVM applications, and deployment flows in secure or air-gapped environments. A revisionable, reproducible, OS-native package is often the cleanest way to promote artifacts through development, staging, and production. Back in 2015, the motivation was simple: ...
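As a rough sketch of the kind of rpm-maven-plugin configuration the article walks through (the /opt/myservice paths, the myservice user and group, and the plugin version here are illustrative assumptions, not values from the original post):

<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>rpm-maven-plugin</artifactId>
  <version>2.2.0</version>
  <executions>
    <execution>
      <id>build-rpm</id>
      <goals>
        <!-- attaches the .rpm to the build so it is published alongside the jar -->
        <goal>attached-rpm</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <group>Applications/System</group>
    <release>1</release>
    <mappings>
      <!-- application jar installed under /opt, owned by a dedicated service account -->
      <mapping>
        <directory>/opt/myservice/lib</directory>
        <filemode>644</filemode>
        <username>myservice</username>
        <groupname>myservice</groupname>
        <sources>
          <source>
            <location>target/${project.build.finalName}.jar</location>
          </source>
        </sources>
      </mapping>
      <!-- configuration files marked as %config with tighter permissions -->
      <mapping>
        <directory>/etc/myservice</directory>
        <filemode>640</filemode>
        <username>root</username>
        <groupname>myservice</groupname>
        <configuration>true</configuration>
      </mapping>
    </mappings>
  </configuration>
</plugin>

With something like this in place, mvn package emits a versioned .rpm next to the jar, which a CI pipeline can publish to a yum repository and promote through environments like any other artifact.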

How Spark Integrates with Hive Today (and Why Early CDH Versions Required Manual Setup)

Modern Spark integrates with Hive through the SparkSession catalog, allowing unified access to Hive tables without manual classpath or configuration hacks. Earlier CDH 5.x deployments required copying hive-site.xml and adjusting classpaths because Hive on Spark was not fully supported. This updated guide explains the current approach and provides historical context for engineers maintaining legacy clusters.

In modern Hadoop and Spark deployments, Spark connects to Hive through the SparkSession catalog. Hive metastore integration is stable, supported, and no longer requires manual configuration steps such as copying hive-site.xml or modifying executor classpaths.

Using Hive from Spark Today

Create a SparkSession with Hive support enabled:

val spark = SparkSession.builder()
  .appName("SparkHive")
  .enableHiveSupport()
  .getOrCreate()

Once enabled, Spark can query Hive tables directly:

spark.sql("SELECT COUNT(*) FROM sample_07").show()

Spark hand...
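A short follow-on sketch in the same vein, writing a result back to the metastore (the column names assume the classic Cloudera sample_07 schema, and the sample_07_top10 table name is hypothetical):

// assumes `spark` was built with enableHiveSupport() as shown above
val top10 = spark.sql(
  "SELECT code, description, salary FROM sample_07 ORDER BY salary DESC LIMIT 10")

// saveAsTable registers the result as a managed table in the Hive metastore,
// so other metastore clients can see it without extra configuration
top10.write.mode("overwrite").saveAsTable("sample_07_top10")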

Setting Up MIT Kerberos ↔ Active Directory Cross-Realm Trust for Secure Hadoop Clusters

This post explains how to configure a secure cross-realm Kerberos trust between an MIT KDC and Active Directory for Hadoop environments. It covers modern Kerberos settings, realm definitions, encryption choices, KDC configuration, AD trust creation, and Hadoop’s auth_to_local mapping rules. A final section preserves legacy compatibility for older Windows Server versions, ensuring the article can be used across mixed enterprise environments.

Integrating Hadoop with enterprise identity systems often requires establishing a cross-realm Kerberos trust between a local MIT KDC and an Active Directory (AD) domain. This setup allows Hadoop services to authenticate users from AD while maintaining a separate Hadoop-managed realm. We walk through a full MIT Kerberos ↔ AD trust configuration using a modern setup, while preserving legacy notes for older Windows environments still found in long-lived clusters.

Example Realms

Replace these with your actual realms and hosts:

ALO.LOCA...
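To make the moving parts concrete, here is a minimal sketch of the MIT-side pieces, using HADOOP.EXAMPLE.COM and AD.EXAMPLE.COM as placeholder realms and hosts (they are not the realms from the original post):

# /etc/krb5.conf on the Hadoop nodes (excerpt)
[realms]
  HADOOP.EXAMPLE.COM = {
    kdc = kdc01.hadoop.example.com
    admin_server = kdc01.hadoop.example.com
  }
  AD.EXAMPLE.COM = {
    kdc = dc01.ad.example.com
    admin_server = dc01.ad.example.com
  }

[domain_realm]
  .hadoop.example.com = HADOOP.EXAMPLE.COM
  .ad.example.com = AD.EXAMPLE.COM

On the Hadoop side, an auth_to_local rule in core-site.xml strips the AD realm so incoming AD principals map to short local user names:

<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[1:$1@$0](.*@AD\.EXAMPLE\.COM)s/@.*//
    DEFAULT
  </value>
</property>

The trust itself rests on matching krbtgt/HADOOP.EXAMPLE.COM@AD.EXAMPLE.COM principals with identical passwords and encryption types on both KDCs, which is what the AD trust creation steps in the full article establish.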

The Rise and Fall of SQL-on-Hadoop: What Happened and What Replaced It

SQL-on-Hadoop once promised interactive analytics on distributed storage and transformed early big data architectures. Many engines emerged (Hive, Impala, Drill, Phoenix, Presto, Spark SQL, Kylin, and others), each attempting to bridge the gap between Hadoop’s batch-processing roots and the need for low-latency SQL. This article revisits that era, explains why most of these systems faded, and outlines the modern successors that dominate today’s lakehouse and distributed SQL landscape.

The SQL-on-Hadoop Era: What We Learned and What Replaced It

In the early 2010s, Apache Hadoop became the backbone of large-scale data processing. As businesses demanded interactive analytics on top of HDFS, a wave of SQL engines emerged. The goal: bring familiar relational querying to a distributed storage layer originally designed for MapReduce batch jobs. By 2015, SQL-on-Hadoop was the hottest category in big data. Today, in 2025, most of those systems have disappeared, evolved, or been replac...

How to Locate an HBase Region for a Row Key and Trigger a Major Compaction

This guide explains how to inspect a row key, find the region responsible for that row, and perform a targeted major compaction using modern HBase shell commands. Region-level compaction is useful for maintenance, skewed regions, and cleanup of deleted data, but should be used carefully due to its I/O impact.

Inspecting a Row Key

To view a sample of rows from a table:

scan 'your_table', { LIMIT => 5 }

To inspect a specific row key:

get 'your_table', "\x00\x01"

Locate the Region for a Specific Row

Modern HBase shells allow you to query region boundaries directly:

locate_region 'your_table', "\x00\x01"

This returns the region name, start key, end key, and hosting RegionServer. You can also list all regions for the table:

list_regions 'your_table'

Triggering a Major Compaction on a Region

Once you know the region name (e.g. your_table,,1712087434000.abc123), you can run:

major_compact 'your_table,,1...
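One optional follow-up (a small sketch; the compaction_state shell command has been available in HBase for some time, but check your version):

compaction_state 'your_table'

This reports NONE, MINOR, MAJOR, or MAJOR_AND_MINOR, which makes it easy to confirm that the major compaction you triggered has finished before moving on to the next region.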