Posts

Showing posts from October, 2011

Syncing Hadoop Clusters Safely Using Modern DistCp and Retention Workflows

This updated guide shows how to synchronize data between Hadoop clusters using modern DistCp practices, HA NameNode paths, correct retention handling, and safe permissions. It modernizes a legacy 2011 workflow into a reliable daily rebase pattern for development and staging clusters without overloading production systems. When developing new pipelines, it is often necessary to rebase a development or staging Hadoop environment with data from production. Historically this was done with simple DistCp scripts. Today, DistCp v2, YARN scheduling, and improved HDFS tooling allow for safer, more performant cluster synchronization while avoiding operational pitfalls. Below is an updated version of a classic rebase workflow: copying the previous day's log data from a production cluster to a development cluster and applying retention by removing older datasets.

1. Variables and runtime setup

COPYDATE=$(date -d '-1 day' +"%Y-%m-%d")
DELDATE=$(date -d '-3 day...
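
A minimal end-to-end sketch of that daily rebase, assuming HA nameservices named prod-ns and dev-ns, a /data/logs/<date> partition layout, and a three-day retention window on the development side; the mapper count and bandwidth cap are illustrative and should be tuned so the copy does not overload production:

#!/usr/bin/env bash
set -euo pipefail

# Dates: the partition to copy (yesterday) and the partition falling out
# of the assumed three-day retention window.
COPYDATE=$(date -d '-1 day' +"%Y-%m-%d")
DELDATE=$(date -d '-3 day' +"%Y-%m-%d")

SRC="hdfs://prod-ns/data/logs/${COPYDATE}"   # production HA nameservice (assumed name)
DST="hdfs://dev-ns/data/logs/${COPYDATE}"    # development HA nameservice (assumed name)

# DistCp v2: -update skips files that already match, -pb preserves block size,
# and the mapper/bandwidth caps keep the load on production predictable.
hadoop distcp -update -pb -m 20 -bandwidth 50 "${SRC}" "${DST}"

# Retention on the development side: drop the expired partition if present.
hdfs dfs -rm -r -skipTrash "hdfs://dev-ns/data/logs/${DELDATE}" || true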

Hardening Hadoop Clusters with Active Directory, Kerberos, and SELinux

This guide combines legacy experience from early Hadoop deployments with modern best practices to harden clusters using Active Directory, Kerberos, and SELinux. It explains how to integrate Linux hosts with AD, run MapReduce tasks under real user identities, and enforce OS-level controls with SELinux, while highlighting where today’s YARN-based stacks and tools differ from the classic JobTracker and LinuxTaskController era. Securing a Hadoop cluster has always been more than just flipping a switch. It touches authentication, operating system hardening, networking, and the way jobs are executed. This article merges and modernizes two older posts from 2011, walking through:

- Integrating Hadoop nodes with a Windows Active Directory forest using Kerberos and LDAP
- Running MapReduce tasks under real user identities (legacy LinuxTaskController)
- Hardening nodes with SELinux in targeted or strict mode and building custom policies
- How these patterns map to modern Hadoop (YARN, ...
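
As a starting point, here is a minimal sketch of the verification and SELinux steps, assuming the node has already been joined to the AD domain (for example via realmd/SSSD), that EXAMPLE.COM and aduser are placeholder realm and account names, and that denials from Hadoop daemons have already been logged to /var/log/audit/audit.log:

#!/usr/bin/env bash
set -euo pipefail

# 1. Verify the AD/Kerberos integration end to end.
kinit aduser@EXAMPLE.COM   # obtain a TGT from the AD KDC (realm and user are placeholders)
klist                      # confirm the ticket cache holds the TGT
id aduser                  # confirm the account resolves through SSSD/LDAP

# 2. Check the SELinux mode; targeted policy in Enforcing mode is the usual baseline.
getenforce
sudo setenforce 1          # switch to Enforcing for testing (not persistent across reboots)

# 3. Build a small custom policy module from logged AVC denials and load it.
#    Review the generated hadoop_local.te before installing the module.
sudo grep 'avc:.*denied' /var/log/audit/audit.log | audit2allow -M hadoop_local
sudo semodule -i hadoop_local.pp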

Apache Sqoop and Microsoft SQL Server: Updated Integration Guide

This guide updates an older 2011 post on using Microsoft’s SQL Server–Hadoop Connector with Sqoop. It explains how the legacy connector worked, how to configure Sqoop with the current Microsoft JDBC driver for SQL Server, and what to consider now that Apache Sqoop is retired and modern data stacks often use other ingestion tools. Back in 2011, Microsoft shipped a dedicated SQL Server–Hadoop Connector that plugged into Apache Sqoop as an additional connector package. It automated import and export between SQL Server, HDFS, and Hive, and required a separate download plus the Microsoft JDBC driver. Today, the situation is different:

- Apache Sqoop is retired and has been moved to the Apache Attic (June 2021). It still works but is no longer actively developed or recommended for new projects.
- The original SQL Server–Hadoop Connector tarball is effectively legacy; most distributions simply use Sqoop’s generic JDBC support plus the Micros...
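
For the generic-JDBC path, a minimal sketch of what an import might look like, assuming the Microsoft JDBC driver jar has already been copied into Sqoop's lib directory and using placeholder host, database, table, and column names:

sqoop import \
  --connect "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB" \
  --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
  --username hadoop_etl \
  --password-file /user/etl/sqlserver.password \
  --table Orders \
  --target-dir /data/staging/orders \
  --num-mappers 4 \
  --split-by OrderID

Using --password-file keeps credentials out of the shell history, and --split-by should name a reasonably uniform, indexed column so the parallel mappers divide the table evenly.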

From Scribe to Apache Flume: How Early Log Pipelines Scaled Beyond a Terabyte a Day

In the early 2010s, large-scale web companies needed reliable pipelines capable of moving terabytes of log data per day. Facebook’s Scribe pioneered this space, and Apache Flume became one of the first open, community-driven pipelines for distributed log collection. This refreshed article modernizes the original example and explains how Flume’s agents, collectors, and failover chains enabled durable ingestion into HDFS long before today’s streaming platforms. Before today’s streaming stacks existed, Facebook’s Scribe was one of the first systems that could reliably manage multi-terabyte daily log volumes. Soon after, Apache Flume emerged as an incubator project and quickly became a standard solution for large-scale log transport into Hadoop environments. Flume introduced a simple, extensible model: sources pull or receive log events, channels buffer them reliably, and sinks deliver them into storage systems such as HDFS. For the Hadoop era, this was a breakthrough—fully open sour...
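
To make the source/channel/sink model concrete, here is a minimal single-agent sketch: a netcat source for testing, a durable file channel, and an HDFS sink. The agent name, ports, and paths are illustrative placeholders, and a production layout would typically chain edge agents into collector agents rather than writing straight from the edge:

#!/usr/bin/env bash
set -euo pipefail

# Write a minimal Flume agent configuration (source -> channel -> sink).
cat > /tmp/flume-logs.conf <<'EOF'
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Test source: accepts newline-delimited events over TCP.
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# Durable file channel so buffered events survive an agent restart.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/lib/flume/checkpoint
a1.channels.c1.dataDirs = /var/lib/flume/data

# HDFS sink writing date-partitioned plain-text files.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://dev-ns/data/flume/logs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Wire the pieces together.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF

# Start the agent with that configuration.
flume-ng agent --conf /etc/flume/conf --conf-file /tmp/flume-logs.conf --name a1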