
Posts

Showing posts from 2011

Exporting HDFS via FUSE and Samba: Experiment and Performance Results (Updated)

This article walks through exporting HDFS via FUSE and serving it through Samba so that clients can mount it over CIFS. It shows the configuration on RHEL, how to mount HDFS, how to test read and write performance with rsync and dd, and compares the results to direct SCP transfers. The setup is historical but still useful to understand the trade-offs and overhead of exposing HDFS as a network filesystem.

Some years ago I experimented with exposing HDFS to non-Hadoop clients. After playing with libhdfs and NFS and not getting the results I wanted, the next idea was: why not try Samba? Samba 3.x was stable, and most operating systems could mount a CIFS share without additional software. The main goal was to understand the setup and performance characteristics of this scenario, because Samba has many tuning options. The environment used in this experiment was RHEL 5.7 with the stock Samba packages.

1. Environment and packages

# rpm -qa | grep samba
samba-3.0.33-3.29.el5_7.4....
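To give a concrete picture of the Samba side of that setup, here is a minimal sketch of an smb.conf share over the FUSE mountpoint and a client-side CIFS mount; the share name, the /hdfs-mount path, and the hdfsuser account are illustrative assumptions, not values from the original measurements.

# /etc/samba/smb.conf (sketch)
[hdfs]
    comment = HDFS exported via FUSE
    path = /hdfs-mount
    browseable = yes
    read only = no
    valid users = hdfsuser

# on a client: mount the share over CIFS (hostname and mountpoint are placeholders)
mount -t cifs //sambahost/hdfs /mnt/hdfs -o username=hdfsuser

With the share in place, read and write throughput can be compared against direct SCP using the same rsync and dd tests described in the article.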

Mounting HDFS via FUSE and Exporting Over NFS

This guide shows how to mount an HDFS filesystem using FUSE and then export part of it over NFS so that remote systems can access HDFS like a local filesystem. The approach is based on classic Hadoop and Linux tools, and includes notes on security, user mapping, and kernel limitations.

In some environments, it can be useful to make an HDFS filesystem available across networks as an exported share. This walkthrough describes a working scenario using Linux and Hadoop with tools that are typically included in older Hadoop distributions. The setup uses hadoop-fuse-dfs and libhdfs to mount HDFS locally, and then exports that mount over NFS. Replace namenode.local and <PORT> with values appropriate for your cluster.

1. Install FUSE and libhdfs

yum install hadoop-0.20-fuse.x86_64 hadoop-0.20-libhdfs.x86_64

2. Create a mountpoint

mkdir /hdfs-mount

3. Test mounting HDFS via FUSE

hadoop-fuse-dfs dfs://namenode.local:<PORT> /hdfs-mount -d

If the mount succee...
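To sketch where the walkthrough is heading, a persistent FUSE mount and an NFS export might look roughly like the following; the fstab options, the exported subdirectory, and the fsid value are assumptions for illustration, not the exact lines from the original post.

# /etc/fstab: persistent FUSE mount of HDFS
hadoop-fuse-dfs#dfs://namenode.local:<PORT> /hdfs-mount fuse allow_other,usetrash,rw 2 0

# /etc/exports: export a subtree of the FUSE mount over NFS
# (an explicit fsid is commonly needed for FUSE-backed exports)
/hdfs-mount/data *(ro,sync,no_root_squash,fsid=1)

# reload the export table
exportfs -ra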

Debugging Hadoop Performance Issues: Legacy JobTracker + Modern YARN Techniques

This article explains how to debug runtime issues in classical Hadoop MapReduce clusters using JobTracker stack traces, jps, vmstat, and thread-level CPU analysis. Updated notes show how these same debugging principles apply in modern YARN-based clusters, including ResourceManager, NodeManager and NameNode troubleshooting, improved commands, JMX endpoints, and best practices for memory, networking, and virtualization.

In early Hadoop deployments (MRv1), one of the most effective ways to diagnose cluster issues was to inspect JobTracker stack traces and JVM thread states. While modern Hadoop clusters use YARN, the same root causes (network latency, RPC timeouts, NameNode memory pressure, GC stalls, and overloaded system services) still apply today.

1. Legacy Method: Inspect JobTracker Stack Traces

In MRv1 clusters, you could view all active JobTracker JVM threads via:

http://jobtracker:50030/stacks

Example thread dump:

Process Thread Dump: 43 active threads
Thread 3203101 ...
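As a quick reference for the thread-level CPU analysis mentioned above, the following shell sequence correlates a hot thread with its stack trace; the PID and TID placeholders are to be filled in, and the daemon you inspect (JobTracker, ResourceManager, NameNode) depends on the symptom.

# list Hadoop JVMs and their PIDs
jps -l

# overall system view: run queue, swapping, I/O wait
vmstat 1 5

# per-thread CPU usage inside the chosen JVM
top -H -p <PID>

# convert the busiest thread id to hex and find it in a stack dump
printf 'nid=0x%x\n' <TID>
jstack <PID> | grep -A 20 'nid=0x<hex-TID>'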

Syncing Hadoop Clusters Safely Using Modern DistCp and Retention Workflows

This updated guide shows how to synchronize data between Hadoop clusters using modern DistCp practices, HA NameNode paths, correct retention handling, and safe permissions. It modernizes a legacy 2011 workflow into a reliable daily rebase pattern for development and staging clusters without overloading production systems.

When developing new pipelines, it is often necessary to rebase a development or staging Hadoop environment with data from production. Historically this was done with simple DistCp scripts. Today, DistCp v2, YARN scheduling, and improved HDFS tooling allow for safer, more performant cluster synchronization while avoiding operational pitfalls. Below is an updated version of a classic rebase workflow: copying the previous day's log data from a production cluster to a development cluster and applying retention by removing older datasets.

1. Variables and runtime setup

COPYDATE=$(date -d '-1 day' +"%Y-%m-%d")
DELDATE=$(date -d '-3 day...
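A compact sketch of such a daily rebase script is shown below; the HA nameservice IDs (prodns, devns), the /logs/<date> layout, and the three-day retention window are assumptions for illustration.

#!/bin/bash
# dates for copy and retention
COPYDATE=$(date -d '-1 day' +"%Y-%m-%d")
DELDATE=$(date -d '-3 day' +"%Y-%m-%d")

# copy yesterday's logs from production to development,
# preserving user, group, and permissions
hadoop distcp -update -pugp \
  hdfs://prodns/logs/${COPYDATE} \
  hdfs://devns/logs/${COPYDATE}

# apply retention on the development cluster
hdfs dfs -rm -r -skipTrash hdfs://devns/logs/${DELDATE}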

Hardening Hadoop Clusters with Active Directory, Kerberos, and SELinux

This guide combines legacy experience from early Hadoop deployments with modern best practices to harden clusters using Active Directory, Kerberos, and SELinux. It explains how to integrate Linux hosts with AD, run MapReduce tasks under real user identities, and enforce OS-level controls with SELinux, while highlighting where today’s YARN-based stacks and tools differ from the classic JobTracker and LinuxTaskController era.

Securing a Hadoop cluster has always been more than just flipping a switch. It touches authentication, operating system hardening, networking, and the way jobs are executed. This article merges and modernizes two older posts from 2011, walking through:

- Integrating Hadoop nodes with a Windows Active Directory forest using Kerberos and LDAP
- Running MapReduce tasks under real user identities (legacy LinuxTaskController)
- Hardening nodes with SELinux in targeted or strict mode and building custom policies
- How these patterns map to modern Hadoop (YARN, ...
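To give a flavor of the Kerberos piece on a modern stack, enabling Hadoop security in core-site.xml looks roughly like this; the EXAMPLE.COM realm and the auth_to_local rule are illustrative assumptions, not the original cluster's values.

<!-- core-site.xml (sketch) -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[1:$1@$0](.*@EXAMPLE.COM)s/@.*//
    DEFAULT
  </value>
</property>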

Apache Sqoop and Microsoft SQL Server: Updated Integration Guide

This guide updates an older 2011 post on using Microsoft’s SQL Server–Hadoop Connector with Sqoop. It explains how the legacy connector worked, how to configure Sqoop with the current Microsoft JDBC driver for SQL Server, and what to consider now that Apache Sqoop is retired and modern data stacks often use other ingestion tools.

Back in 2011, Microsoft shipped a dedicated SQL Server–Hadoop Connector that plugged into Apache Sqoop as an additional connector package. It automated import and export between SQL Server, HDFS, and Hive, and required a separate download plus the Microsoft JDBC driver. Today, the situation is different:

- Apache Sqoop is retired and has been moved to the Apache Attic (June 2021). It still works but is no longer actively developed or recommended for new projects.
- The original SQL Server–Hadoop Connector tarball is effectively legacy; most distributions simply use Sqoop’s generic JDBC support plus the Micros...
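For reference, a Sqoop import against SQL Server using the Microsoft JDBC driver and Sqoop's generic JDBC path looks roughly like the following; the host, database, table, and credentials are placeholders, and the SQL Server JDBC JAR is assumed to be on Sqoop's classpath (for example in its lib directory).

sqoop import \
  --connect "jdbc:sqlserver://sqlhost:1433;databaseName=SALESDB" \
  --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
  --table CUSTOMERS \
  --username SQOOPUSER \
  --password PASSWORD \
  --target-dir /data/sqlserver/customers \
  --num-mappers 4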

From Scribe to Apache Flume: How Early Log Pipelines Scaled Beyond a Terabyte a Day

In the early 2010s, large-scale web companies needed reliable pipelines capable of moving terabytes of log data per day. Facebook’s Scribe pioneered this space, and Apache Flume became the first open, community-driven pipeline for distributed log collection. This refreshed article modernizes the original example and explains how Flume’s agents, collectors and failover chains enabled durable ingestion into HDFS long before today’s streaming platforms.

Before today’s streaming stacks existed, Facebook’s Scribe was one of the first systems that could reliably manage multi-terabyte daily log volumes. Soon after, Apache Flume emerged as an incubator project and quickly became a standard solution for large-scale log transport into Hadoop environments. Flume introduced a simple, extensible model: sources pull or receive log events, channels buffer them reliably, and sinks deliver them into storage systems such as HDFS. For the Hadoop era, this was a breakthrough: fully open sour...
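The source/channel/sink model described above maps directly to a small agent configuration. A minimal sketch, assuming a tail-style source, a durable file channel, and an HDFS sink (all paths and names are assumptions):

# flume.conf (sketch)
agent1.sources  = r1
agent1.channels = c1
agent1.sinks    = k1

# source: follow an application log file
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /var/log/app/access.log
agent1.sources.r1.channels = c1

# channel: durable, file-backed buffering
agent1.channels.c1.type = file
agent1.channels.c1.checkpointDir = /var/lib/flume/checkpoint
agent1.channels.c1.dataDirs = /var/lib/flume/data

# sink: write events into date-partitioned HDFS directories
agent1.sinks.k1.type = hdfs
agent1.sinks.k1.channel = c1
agent1.sinks.k1.hdfs.path = hdfs://namenode.local:<PORT>/flume/events/%Y-%m-%d
agent1.sinks.k1.hdfs.fileType = DataStream
agent1.sinks.k1.hdfs.useLocalTimeStamp = true

The agent is then started with flume-ng agent --conf-file flume.conf --name agent1.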

Optimizing Hadoop JobHistory Retention: Why Old Clusters Slowed Down and How to Fix It

Early Hadoop installations often struggled with slow-loading JobHistory pages because history files accumulated for weeks on busy clusters. This article explains how JobHistory retention worked before Hadoop 0.21, how the maxage setting introduced configurable cleanup behavior, and how administrators could safely tune or automate log pruning to keep job tracking responsive and storage under control.

In early Hadoop versions, administrators frequently noticed that the MapReduce JobHistory UI (/jobhistory.jsp) loaded very slowly on high-traffic clusters. The root cause was simple: the JobTracker kept far too many history files, sometimes accumulating tens of gigabytes of metadata that had to be parsed when rendering the page.

Why JobHistory Became Slow on Pre-Hadoop 0.21 Clusters

Before Hadoop 0.21, the retention policy for jobhistory logs was hardcoded to 30 days. On active clusters this produced enormous history directories; 20 GB or more was common. With such volume...
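As an illustration of automated pruning on a legacy cluster, a nightly cron job can simply delete history files older than the desired window from the JobTracker's history directory; the path and the seven-day window below are assumptions. On modern clusters the JobHistory Server handles this itself via mapreduce.jobhistory.max-age-ms.

# run nightly from cron: prune JobTracker history files older than 7 days
# (path and retention window are assumptions, not values from the article)
find /var/log/hadoop/history -type f -mtime +7 -delete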

Analyzing IIS Logs with Hadoop and Hive: From Ingestion to Partitioned Traffic Analysis

This guide explains how to collect IIS W3C logs from Windows servers, ingest them into HDFS, define Hive tables for structured analysis, and run partitioned queries that calculate traffic per IP. The workflow reflects an early big-data pattern: mount, copy, structure, partition and analyze. The refreshed version clarifies each step and highlights how automated partition creation made near-real-time traffic analytics possible.

Apache-driven webfarms have always been easy to integrate into Hadoop and Hive, but many teams struggled with the same question: How do we collect and analyze IIS logs at scale? This walkthrough shows how IIS W3C logs can be exported, ingested into HDFS, structured in Hive, and processed into partitioned traffic tables for efficient reporting.

Configuring IIS for W3C Logging

Inside IIS Manager, under a website’s configuration, choose:

Logging → Format: W3C

From there you can configure:

- Log output path
- Fields to capture (IP, URI, user agent...
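To illustrate the Hive side of that workflow, here is a minimal sketch of an external table over raw W3C logs and a dynamic-partition insert that aggregates traffic per IP; the column set, paths, and table names are assumptions, since the selected IIS fields vary per site.

-- raw W3C log lines copied into HDFS (space-delimited fields)
CREATE EXTERNAL TABLE iis_raw (
  log_date      STRING,
  log_time      STRING,
  c_ip          STRING,
  cs_method     STRING,
  cs_uri_stem   STRING,
  sc_status     INT,
  cs_user_agent STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/data/iis/raw';

-- aggregated traffic per IP, partitioned by day
CREATE TABLE traffic_per_ip (
  c_ip STRING,
  hits BIGINT
)
PARTITIONED BY (log_day STRING);

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE traffic_per_ip PARTITION (log_day)
SELECT c_ip, COUNT(*) AS hits, log_date AS log_day
FROM iis_raw
GROUP BY c_ip, log_date;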

Optimizing Sqoop Exports: Generating and Tuning Custom Job JARs

Sqoop was the standard tool for moving data between relational databases and Hadoop. One of its most useful capabilities was generating a custom job JAR for optimizing export performance. This guide explains how to create the JAR, inspect the generated classes, and rerun Sqoop with your precompiled job code to achieve faster, more stable export pipelines.

Apache Sqoop (SQL-to-Hadoop) bridged traditional databases and Hadoop ecosystems. A lesser-known feature allowed developers to generate a standalone job JAR directly from an export command, enabling performance tuning and customizations.

Generating a Sqoop Export Job JAR

Example export command that produces a JAR file:

sqoop export \
  --connect jdbc:RDBMS:thin:@HOSTNAME:PORT:DBNAME \
  --table TABLENAME \
  --username USERNAME \
  --password PASSWORD \
  --export-dir HDFS_DIR \
  --direct \
  --fields-terminated-by ',' \
  --package-name JOBNAME.IDENTIFIER \
  --outdir OUTPUT_DIR \
  --bindir BIN_DIR

After ru...
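Once the generated classes and JAR exist under BIN_DIR, the export can be rerun against the precompiled code instead of regenerating it on every run. A sketch, where the JAR file name and class name follow Sqoop's usual table/package naming and are assumptions rather than exact values from the article:

sqoop export \
  --connect jdbc:RDBMS:thin:@HOSTNAME:PORT:DBNAME \
  --table TABLENAME \
  --username USERNAME \
  --password PASSWORD \
  --export-dir HDFS_DIR \
  --fields-terminated-by ',' \
  --jar-file BIN_DIR/TABLENAME.jar \
  --class-name JOBNAME.IDENTIFIER.TABLENAME

The --jar-file and --class-name options tell Sqoop to skip code generation and reuse the supplied classes.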

Configuring a MySQL Metastore for Hive: From Embedded Derby to Multi-User Clusters

Hive’s embedded Derby database is fine for local testing, but it breaks down as soon as multiple users and services need to share metadata. This guide shows how to move Hive from the default single-user Derby setup to a shared MySQL metastore: configuring MySQL, creating the Hive schema, wiring Hive to the external database, and distributing drivers and configuration across a Hadoop cluster.

Apache Hive provides a SQL-like query language (HiveQL) on top of HDFS, making large-scale data analysis accessible to anyone with SQL experience. By default, however, Hive uses an embedded Derby database for its metastore, which is not suited for multi-user or multi-service environments. To run Hive in a real cluster, you need an external metastore database. The database stores metadata about:

- Databases and tables
- Partitions and storage descriptors
- SerDes, column definitions and privileges

This article walks through configuring a MySQL-based Hive metastore, suitable for...
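The core of the change is the JDBC configuration in hive-site.xml. A minimal sketch, assuming a MySQL host named metastore-db.local, a database called metastore, and a hive service account (all assumptions):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-db.local:3306/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>CHANGEME</value>
</property>

The MySQL JDBC driver JAR must also be available in Hive's lib directory on every node that runs Hive.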

Don’t Put Hadoop or HBase Temp Directories in /tmp: Protecting Your Metadata

Many early Hadoop installations lost HDFS metadata because temporary directories were left under /tmp, which is routinely cleaned by Linux on reboot. This article explains why hadoop.tmp.dir and hbase.tmp.dir must point to a stable path, how this impacts NameNode edit logs and checkpoints, and what a safe configuration looks like today for both Hadoop and HBase clusters.

In early Hadoop 0.20 clusters, it was common to leave the default hadoop.tmp.dir configuration unchanged. On many Linux systems, however, /tmp is automatically cleaned during reboot. If your Hadoop or HBase temp directories live there, you risk losing critical metadata when the operating system wipes that directory. For the SecondaryNameNode and related components, this can mean the loss of:

- Recent edit logs
- Checkpointed fsimage files
- Other temporary metadata needed for recovery

If these files are removed, the NameNode may not be able to replay its edits after a crash, leading to a failed restart or a ...
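A safe configuration simply points both properties at a persistent location; the /var/lib paths below are illustrative assumptions, and any stable, adequately sized local filesystem works.

<!-- core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/lib/hadoop/tmp</value>
</property>

<!-- hbase-site.xml -->
<property>
  <name>hbase.tmp.dir</name>
  <value>/var/lib/hbase/tmp</value>
</property>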