
Posts

Showing posts from 2013

How HDFS Protects Your Data: Modern Reliability Patterns in Hadoop

HDFS is still one of the most battle-tested storage layers for large-scale data platforms. It combines replication (and erasure coding in newer Hadoop versions), rack-aware placement, continuous checksum verification, and high-availability metadata services to detect failures early and repair them automatically. This makes HDFS a solid foundation for modern data platform engineering and distributed systems work, not just a legacy Hadoop component. Teams still ask how HDFS protects data and what mechanisms exist to prevent corruption or silent data loss. The durability model of HDFS has been described in detail in books like Hadoop Operations by Eric Sammer, and most of the ideas are still relevant for modern Hadoop 3.x clusters. Beyond the built-in mechanisms described below, many organizations also operate a second cluster or a remote backup target (for example using snapshots and distcp) to protect against human mistakes, such as accidentally deleting important data sets. ...
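A minimal sketch of the snapshot-plus-distcp backup pattern mentioned above, assuming a snapshottable directory /data/important and hypothetical NameNode hosts activenn and backupnn (adjust paths, snapshot names and addresses to your clusters):

  # allow snapshots on the source directory (requires HDFS admin rights)
  hdfs dfsadmin -allowSnapshot /data/important

  # create a point-in-time snapshot before copying (name is a placeholder)
  hdfs dfs -createSnapshot /data/important backup-20130601

  # copy the immutable snapshot contents to the backup cluster
  hadoop distcp \
    hdfs://activenn:8020/data/important/.snapshot/backup-20130601 \
    hdfs://backupnn:8020/backups/important/backup-20130601

Because the snapshot is a read-only, point-in-time view, distcp reads a stable source even if the live directory keeps changing during the copy.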

How to Tune Sqoop Export for High-Volume RDBMS Loads

Sqoop export performance depends on the number of parallel mappers, JDBC batching, and how many rows are grouped into each INSERT and transaction. This updated guide explains how to safely tune these parameters without overwhelming the target database, and how to apply them through Sqoop’s -D configuration flags. Sqoop is still widely used in existing Hadoop environments for exporting data from HDFS or Hive back into relational databases. When exporting more than a few thousand rows, tuning the export settings can significantly improve throughput and reduce load on the target RDBMS.

Parallelism: --num-mappers
This parameter controls how many parallel processes Sqoop uses for the export. Each mapper opens its own JDBC connection and writes a slice of the data. Higher values increase throughput but risk overloading the RDBMS. Lower values reduce pressure but slow down the export. Always verify the database’s connection limits and transaction log capacity before rai...
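A rough sketch of how these knobs combine in practice; the connection string, table name and the numeric values below are placeholders, and safe settings depend entirely on the target database:

  sqoop export \
    -Dsqoop.export.records.per.statement=100 \
    -Dsqoop.export.statements.per.transaction=100 \
    --connect jdbc:mysql://dbhost:3306/sales \
    --username etl_user -P \
    --table orders_export \
    --export-dir /user/hive/warehouse/orders \
    --num-mappers 4 \
    --batch

With these illustrative values, each of the four mappers groups 100 rows into one INSERT and commits every 100 statements, which keeps individual transactions bounded while still batching work on the JDBC connection.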

Connecting SQuirreL SQL to HiveServer2 with Kerberos (Updated)

Kerberos-secured HiveServer2 environments require proper JAAS configuration, a valid Kerberos ticket and the correct Hive JDBC driver. This updated guide explains how to configure SQuirreL SQL with current Hive JDBC drivers, how Kerberos authentication works today and how to avoid the outdated manual classpath and script edits used in early Hadoop distributions. SQuirreL SQL remains a lightweight and reliable JDBC client for connecting to HiveServer2, especially in on-prem Kerberized clusters. While older Hadoop versions required assembling dozens of JAR dependencies manually, modern Hive distributions ship a shaded JDBC driver that simplifies configuration significantly.

Prerequisites for Kerberos Authentication
Before launching SQuirreL, ensure that you obtain a valid Kerberos ticket:

  kinit your_user@YOUR.REALM

Most Hadoop distributions automatically pick up krb5.conf from system paths (/etc/krb5.conf on Linux, /Library/Preferences on macOS). If you need to over...
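A minimal sketch of the client-side pieces, assuming a ticket in the default credential cache; host, port, database and realm below are placeholders:

  # obtain and verify a Kerberos ticket before starting SQuirreL
  kinit your_user@YOUR.REALM
  klist

  # JDBC URL for the SQuirreL alias, pointing at the HiveServer2 service principal
  jdbc:hive2://hs2.example.com:10000/default;principal=hive/_HOST@YOUR.REALM

The principal parameter names the HiveServer2 service principal, not your own user; the driver authenticates with the ticket already present in the cache.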

Understanding HBase Cross-Cluster Replication for Disaster Recovery

HBase cross-cluster replication provides asynchronous disaster recovery by streaming WAL edits from a source cluster to one or more destination clusters. It is not a high-availability solution; applications must handle failover logic. This updated guide explains replication modes, requirements, configuration steps and operational considerations for modern HBase deployments. HBase offers built-in multi-site replication for disaster recovery (DR). Replication streams write-ahead log (WAL) edits from one cluster to another cluster or a set of clusters. Because replication is asynchronous, it does not provide automatic failover or zero-data-loss guarantees; applications must handle HA logic at the architectural level.

Replication Topologies
Modern HBase supports several replication patterns:

Master → Slave: a primary cluster replicates edits to one or more secondary clusters. Simple and widely used for DR.

Master ↔ Master: two clusters replicate to each other. HBase pre...
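A hedged sketch of the usual HBase shell steps on the source cluster; the peer id, ZooKeeper quorum, table and column-family names are placeholders:

  # register the DR cluster as replication peer "1"
  add_peer '1', CLUSTER_KEY => "zk1.dr.example.com,zk2.dr.example.com,zk3.dr.example.com:2181:/hbase"

  # replicate the column family (REPLICATION_SCOPE => 1 enables replication)
  alter 'orders', {NAME => 'cf', REPLICATION_SCOPE => 1}

  # verify peers and replication status
  list_peers
  status 'replication'

Older HBase releases require disabling the table before the alter, while newer ones apply it online, so check the behaviour of your version first.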

List Hive Table Sizes in HDFS with a Single Shell Command

This quick tip shows how to list all Hive tables in a database together with their HDFS locations and human-readable sizes using a single bash one-liner. It still works on classic Hive CLI setups and can be adapted easily for Beeline or modern Hive deployments. When you run benchmarks, clean up old data or just want to understand how much space each Hive table consumes, it is useful to see HDFS locations and sizes side by side. Instead of clicking through UIs, you can ask Hive for every table location and then call hdfs dfs -du -h on each path.

The Hive + HDFS one-liner
The following bash one-liner queries Hive for table locations, extracts the HDFS paths and then prints a human-readable size for each table directory:

  for file in $(hive -S -e "SHOW TABLE EXTENDED LIKE '\*'" \
    | grep "location:" \
    | awk 'BEGIN { FS=":" } { printf("hdfs:%s:%s\n",$3,$4) }'); do
    hdfs dfs -du -h "$file"
  done

Typical outp...
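The excerpt notes that the one-liner can be adapted for Beeline; one possible adaptation is sketched below, assuming a non-Kerberized HiveServer2 URL and Beeline output flags that you should verify against your version:

  for file in $(beeline -u "jdbc:hive2://hs2.example.com:10000/default" \
      --silent=true --showHeader=false --outputformat=tsv2 \
      -e "SHOW TABLE EXTENDED LIKE '*'" \
    | grep "location:" \
    | awk 'BEGIN { FS=":" } { printf("hdfs:%s:%s\n",$3,$4) }'); do
    hdfs dfs -du -h "$file"
  done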

Querying HBase Data with Impala via Hive’s HBaseStorageHandler

This is a legacy but still useful walkthrough that shows how to expose HBase-resident data to Impala by going through Hive’s Metastore and the HBaseStorageHandler. Using US census ZIP code income data, we create an HBase table, map it with an external Hive table, bulk load the CSV data with Pig and finally query it from Impala. The pattern is mainly relevant today if you are keeping old CDH clusters alive or planning a migration away from Impala-on-HBase towards Parquet or Iceberg tables.

Note (2025): This article describes an older CDH/Impala/HBase pattern based on Hive’s HBaseStorageHandler. It is useful if you still maintain legacy Impala-on-HBase workloads or need to understand how such systems were wired. For new designs you will usually land data in Parquet or Iceberg tables and query them with Impala, Trino or Spark instead of reading directly from HBase.

Context: Impala, Hive Metastore and HBase
Impala uses the Hive Metastore Service to discover tables and their un...
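A hedged illustration of the mapping step; the table, row key and column names below are made up for the sketch and do not match the census schema from the original walkthrough:

  hive -e "
  CREATE EXTERNAL TABLE zip_income_hbase (
    zip    STRING,
    income INT
  )
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:income')
  TBLPROPERTIES ('hbase.table.name' = 'zip_income');
  "

Once the external table exists in the Metastore, issuing INVALIDATE METADATA from impala-shell makes it visible to Impala, which then reads the HBase rows through the same column mapping.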

Tuning HBase GC: CMS vs ParallelGC and MSLAB

This short guide explains how legacy HBase clusters were tuned with JVM options for CMS and ParallelGC and how MemStore-Local Allocation Buffers (MSLAB) help reduce GC pauses caused by old-generation fragmentation. Use this as a reference if you still run JDK7/8-based HBase 0.92-era installations or are migrating them, not as guidance for modern Java and HBase versions.

Note (2025): The GC flags in this article apply to older Java 7/8 HBase deployments where CMS and ParallelGC were common choices. In newer JVMs, the Concurrent Mark-Sweep (CMS) collector is deprecated and removed in favour of collectors like G1GC. Treat the configuration below as legacy tuning for old HBase stacks, not as recommendations for new clusters.

Why HBase GC tuning mattered so much
HBase region servers keep a lot of data in memory (memstores and block cache). Without careful JVM tuning, garbage collection can easily cause long pauses, region server timeouts and client-visible outages. Adjusting HBA...
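For reference only, a sketch of what that legacy tuning typically looked like; heap size and the occupancy threshold are illustrative values for an old JDK7/8 region server, not recommendations for current stacks:

  # hbase-env.sh (legacy CMS setup, illustrative values)
  export HBASE_REGIONSERVER_OPTS="-Xms8g -Xmx8g \
    -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
    -XX:CMSInitiatingOccupancyFraction=70 \
    -XX:+CMSParallelRemarkEnabled"

  <!-- hbase-site.xml: enable MSLAB to fight old-generation fragmentation -->
  <property>
    <name>hbase.hregion.memstore.mslab.enabled</name>
    <value>true</value>
  </property>

MSLAB allocates memstore cells from fixed-size chunks, so flushing a memstore frees whole chunks instead of scattering small holes across the old generation.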

Flume 1.3.1 Windows binary release online

Andy Blozhou, a Chinese Flume enthusiast, provides precompiled Windows binaries of Flume 1.3.1, including a startup .bat and an Avro client .bat. You can grab this build from his website http://abloz.com/flume/windows_download.html :

======== snip ========

This is the flume-ng 1.3.1 windows version for download: apache-flume-1.3.1-bin.zip

Simple usage: unzip apache-flume-1.3.1-bin.zip, then run bin/flume.bat for the agent and bin/flume-avroclient.bat for the Avro client. Modify them for your own environment.

Detail (to compile flume-ng on Windows, please reference http://mapredit.blogspot.com/2012/07/run-flume-13x-on-windows.html or my Chinese version http://abloz.com/2013/02/18/compile-under-windows-flume-1-3-1.html):

1. Download the Windows version of Flume 1.3.1 (apache-flume-1.3.1-bin.zip) from http://abloz.com/flume/windows_download.html
2. Unzip apache-flume-1.3.1-bin.zip to a directory.
3. Install JDK 1.6 from Oracle and set JAVA_HOME...

Fixing Oozie LZO ClassNotFound Errors and ShareLib Issues

When LZO compression is enabled in Hadoop, Oozie may fail to launch MapReduce jobs with a ClassNotFoundException for the LzoCodec. This article explains why it happens, how to fix the missing hadoop-lzo classes on the Oozie server, how to correctly deploy the Oozie sharelib and how to enable the legacy uber-jar feature for MapReduce actions in CDH-era clusters.

Note (2025): This article documents behaviour from the CDH 4.x/5.x Oozie + MapReduce stack. Oozie is legacy and many teams migrate to Apache Airflow, Dagster, Argo or cloud-native schedulers, but troubleshooting old clusters still requires understanding these patterns. The solutions here apply to Hadoop MR/LZO deployments still in maintenance or migration mode.

Symptom: Oozie fails after enabling LZO compression
When you add LZO codecs to core-site.xml, Oozie may suddenly fail to start any MapReduce job. Typical configuration looks like:

  <property>
    <name>io.compression.codecs</name>
    <value...
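A hedged sketch of the sharelib side of the fix; the jar path, sharelib location and Oozie URL are placeholders, and the exact directory layout and admin commands depend on the Oozie version shipped with your distribution:

  # make the hadoop-lzo jar available to launcher jobs via the sharelib in HDFS
  hdfs dfs -put /opt/hadoop/lib/hadoop-lzo.jar /user/oozie/share/lib/

  # Oozie 4.x+: refresh and inspect the sharelib without restarting the server
  oozie admin -oozie http://oozie-host:11000/oozie -sharelibupdate
  oozie admin -oozie http://oozie-host:11000/oozie -shareliblist

  # in the workflow's job.properties, let actions pick up the system libpath
  oozie.use.system.libpath=true

If the codec classes are missing on the Oozie server itself rather than in the launcher, the jar also has to be placed in the server's libext directory (or your distribution's equivalent) before the web application is rebuilt.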