
Posts

Showing posts from September, 2011

Optimizing Hadoop JobHistory Retention: Why Old Clusters Slowed Down and How to Fix It

Early Hadoop installations often struggled with slow-loading JobHistory pages because history files accumulated for weeks on busy clusters. This article explains how JobHistory retention worked before Hadoop 0.21, how the maxage setting made cleanup behavior configurable, and how administrators could safely tune or automate log pruning to keep job tracking responsive and storage under control.

In early Hadoop versions, administrators frequently noticed that the MapReduce JobHistory UI (/jobhistory.jsp) loaded very slowly on high-traffic clusters. The root cause was simple: the JobTracker kept far too many history files, sometimes accumulating tens of gigabytes of metadata that had to be parsed when rendering the page.

Why JobHistory Became Slow on Pre-Hadoop 0.21 Clusters

Before Hadoop 0.21, the retention policy for jobhistory logs was hardcoded to 30 days. On active clusters this produced enormous history directories; 20 GB or more was common. With such volume...
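On pre-0.21 clusters, pruning can also be automated outside Hadoop entirely, for example with a cron-driven cleanup script. The sketch below is illustrative only: it assumes the history files sit on the JobTracker's local filesystem under /var/log/hadoop/history and uses a seven-day retention window; substitute whatever hadoop.job.history.location resolves to on your cluster.

#!/bin/bash
# Prune JobHistory files older than RETENTION_DAYS.
# HISTORY_DIR and RETENTION_DAYS are illustrative values; point HISTORY_DIR
# at the directory your hadoop.job.history.location setting actually uses.
HISTORY_DIR=/var/log/hadoop/history
RETENTION_DAYS=7

# Delete job history and conf files whose modification time is past the window.
find "$HISTORY_DIR" -type f -mtime +"$RETENTION_DAYS" -print -delete

Run it nightly from cron so the history directory, and with it /jobhistory.jsp rendering time, stays bounded regardless of cluster traffic.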

Analyzing IIS Logs with Hadoop and Hive: From Ingestion to Partitioned Traffic Analysis

This guide explains how to collect IIS W3C logs from Windows servers, ingest them into HDFS, define Hive tables for structured analysis, and run partitioned queries that calculate traffic per IP. The workflow reflects an early big-data pattern: mount, copy, structure, partition, and analyze. The refreshed version clarifies each step and highlights how automated partition creation made near-real-time traffic analytics possible.

Apache-driven web farms have always been easy to integrate with Hadoop and Hive, but many teams struggled with the same question: how do we collect and analyze IIS logs at scale? This walkthrough shows how IIS W3C logs can be exported, ingested into HDFS, structured in Hive, and processed into partitioned traffic tables for efficient reporting.

Configuring IIS for W3C Logging

Inside IIS Manager, under a website’s configuration, choose Logging → Format: W3C. From there you can configure:

Log output path
Fields to capture (IP, URI, user agent...
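The ingest-and-partition step can be scripted along the following lines. This is a minimal sketch, not the original post's code: the staging path, HDFS layout, date handling, and the iis_logs table name (assumed here to be an external Hive table partitioned by a dt string column) are all illustrative.

#!/bin/bash
# Push one day's IIS W3C logs into HDFS and register the matching Hive partition.
# Paths, table name, and date format are assumptions; adapt them to your layout.
LOG_DATE=$(date -d "yesterday" +%Y-%m-%d)
LOCAL_DIR=/mnt/iis-logs/$LOG_DATE        # share mounted from the IIS hosts
HDFS_DIR=/data/iis_logs/dt=$LOG_DATE

# Copy the raw W3C files into a date-based HDFS directory.
hadoop fs -mkdir -p "$HDFS_DIR"
hadoop fs -put "$LOCAL_DIR"/*.log "$HDFS_DIR"/

# Attach the new directory to the (assumed) external table as a partition, so
# traffic-per-IP queries can filter on dt instead of scanning every log file.
hive -e "ALTER TABLE iis_logs ADD IF NOT EXISTS PARTITION (dt='$LOG_DATE') LOCATION '$HDFS_DIR';"

Scheduling a script like this right after the IIS log rollover is what makes the "near-real-time" partitioned analysis described above practical.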

Optimizing Sqoop Exports: Generating and Tuning Custom Job JARs

Sqoop was the standard tool for moving data between relational databases and Hadoop. One of its most useful capabilities was generating a custom job JAR for optimizing export performance. This guide explains how to create the JAR, inspect the generated classes, and rerun Sqoop with your precompiled job code to achieve faster, more stable export pipelines.

Apache Sqoop (SQL-to-Hadoop) bridged traditional databases and Hadoop ecosystems. A lesser-known feature allowed developers to generate a standalone job JAR directly from an export command, enabling performance tuning and customization.

Generating a Sqoop Export Job JAR

Example export command that produces a JAR file:

sqoop export \
  --connect jdbc:RDBMS:thin:@HOSTNAME:PORT:DBNAME \
  --table TABLENAME \
  --username USERNAME \
  --password PASSWORD \
  --export-dir HDFS_DIR \
  --direct \
  --fields-terminated-by ',' \
  --package-name JOBNAME.IDENTIFIER \
  --outdir OUTPUT_DIR \
  --bindir BIN_DIR

After ru...
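Once the JAR exists, a follow-up run can reuse it instead of regenerating code every time. The command below is a sketch assuming Sqoop 1.x's --jar-file and --class-name code-generation options; the jar path and class name shown are placeholders whose actual values depend on the --package-name and --bindir settings used above, so check BIN_DIR for the real file and class names.

# Rerun the export with the previously generated job code (placeholder names).
sqoop export \
  --connect jdbc:RDBMS:thin:@HOSTNAME:PORT:DBNAME \
  --table TABLENAME \
  --username USERNAME \
  --password PASSWORD \
  --export-dir HDFS_DIR \
  --fields-terminated-by ',' \
  --jar-file BIN_DIR/TABLENAME.jar \
  --class-name JOBNAME.IDENTIFIER.TABLENAME

Skipping code generation on every invocation trims startup time, and it lets you inspect or adjust the generated record class before recompiling and re-exporting.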