In early Hadoop versions, administrators frequently noticed that the MapReduce JobHistory UI
(/jobhistory.jsp) loaded very slowly on high-traffic clusters.
The root cause was simple: the JobTracker kept far too many history files, sometimes accumulating tens of gigabytes of metadata that had to be parsed when rendering the page.
Why JobHistory Became Slow on Pre-Hadoop 0.21 Clusters
Before Hadoop 0.21, the retention policy for jobhistory logs was hardcoded to 30 days. On active clusters this produced enormous history directories—20 GB or more was common. With such volume, loading the history UI became unreliable and audits of long-running jobs were often impractical.
This made operational visibility painful: the JobTracker scanned all retained logs when generating the UI, creating latency spikes and unnecessary disk pressure.
Hadoop 0.21: Configurable JobHistory Retention
Starting with Hadoop 0.21, retention became configurable via:
Key: mapreduce.jobtracker.jobhistory.maxage
Default: 7 * 24 * 60 * 60 * 1000L (one week)
To reduce retention to three days, set the value in milliseconds: 3 × 24 × 60 × 60 × 1000 = 259,200,000. (The Java-style expression shown for the default is how the value appears in the source code; a configuration file must contain the literal number.)

mapreduce.jobtracker.jobhistory.maxage = 259200000

The factors behind this value are:
• 3 days
• 24 hours per day
• 60 minutes per hour
• 60 seconds per minute
• 1000 milliseconds per second
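On clusters that pick this key up from mapred-site.xml, the setting might look like the following sketch (the exact file and restart procedure depend on your distribution):

```xml
<!-- mapred-site.xml: keep three days of job history (259,200,000 ms) -->
<property>
  <name>mapreduce.jobtracker.jobhistory.maxage</name>
  <value>259200000</value>
</property>
```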
With a shorter window, the UI loads significantly faster because far fewer job metadata files need to be parsed and rendered.
Alternative Cleanup for Legacy Distributions
Some environments that could not upgrade used a simple cron-based cleanup job to remove old history files:
find /var/log/hadoop-0.20/history/done/ -type f -mtime +1 | xargs rm -f
While this worked, it was a workaround rather than a fix: the bare `| xargs rm -f` pipeline mishandles filenames containing whitespace, and the job had to be coordinated carefully to avoid removing history files still referenced by the JobTracker.
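A slightly safer variant of the same idea can be sketched as a small cron-friendly script. The path and retention window below are assumptions to adapt per cluster; using find's -delete action instead of piping to xargs sidesteps the whitespace pitfalls of the one-liner above:

```shell
#!/bin/sh
# Hypothetical nightly cleanup for legacy clusters; the history path and
# retention window are assumptions -- adjust them to your deployment.
HISTORY_DIR="${1:-/var/log/hadoop-0.20/history/done}"
RETENTION_DAYS="${2:-3}"

# -mtime +N matches files last modified more than N*24h ago; -delete removes
# them in place, avoiding the fragility of "| xargs rm -f".
find "$HISTORY_DIR" -type f -mtime +"$RETENTION_DAYS" -delete
```

Running this from cron once a day keeps the directory bounded, but it still cannot tell whether the JobTracker holds a reference to a file, so a generous retention window is the safer choice.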
Why Retention Still Matters Today
Even in modern Hadoop and YARN-based deployments, log retention remains a core operational consideration. Large jobhistory directories impact performance, audit workflows, storage quotas and incident investigations. Keeping retention tight and aligned with compliance requirements ensures cluster responsiveness and predictable disk usage.
If you need help with distributed systems, backend engineering, or data platforms, check my Services.