
Posts

Showing posts from 2012

Historical Impala Setup on RHEL 6 with Kerberos (Impala 0.3 Era)

This is a historical walkthrough from the early Impala 0.3 beta days, showing how Impala was installed manually on RHEL 6, wired to Hive/HDFS configs, integrated with Kerberos using service principals and keytabs, and started via simple scripts. Treat it as a reference for understanding Impala’s components and legacy Kerberos patterns, not as a modern installation guide. Note (2025): The commands, package names and versions in this article describe an Impala 0.3 beta setup on RHEL/CentOS 6 with Oracle JDK 6 and Cloudera’s early repos. Modern Impala deployments use different packaging, Java versions and security defaults. Use this only for maintaining or understanding legacy CDH-era clusters. What Impala is (in this historical context) Impala provides fast, interactive SQL directly on data stored in Apache Hadoop, primarily HDFS and HBase. It reuses: The Hive Metastore and table metadata Hive-compatible SQL syntax ODBC/JDBC drivers and UI components (e.g. Hue Bee...

Using Hive’s HBaseStorageHandler with Existing HBase Tables

Hive’s HBaseStorageHandler lets you expose HBase tables as Hive tables so you can run SELECT and INSERT statements over HBase data. This article shows how to configure Hive to talk to HBase, create a new HBase table through Hive and attach Hive EXTERNAL TABLE definitions to existing HBase tables using the correct column family mappings. It reflects a Hive 0.9 / HBase 0.92 era setup and is mainly useful for legacy clusters and migrations. Note (2025): This guide describes the original HBase integration introduced around HIVE-705 for Hive 0.9 and HBase 0.92. Modern Hadoop stacks often favour storing analytics data in Parquet or Iceberg and querying it via engines like Hive, Impala, Trino or Spark. Use this pattern primarily when you need to understand or maintain existing Hive-on-HBase tables, not for new designs. Hive–HBase integration in a nutshell Hive can read from and write to HBase tables using a dedicated storage handler. Once configured, you can: Run SELECT que...
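
As an illustration of the mapping syntax described above, here is a minimal sketch of a Hive EXTERNAL TABLE attached to an existing HBase table; the table name, column families and qualifiers are placeholders you would replace with your own:

CREATE EXTERNAL TABLE hbase_events (rowkey STRING, status STRING, payload STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,meta:status,data:payload")
TBLPROPERTIES ("hbase.table.name" = "events");

Because the table is EXTERNAL, dropping it in Hive removes only the Hive metadata; the underlying HBase table is left untouched.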

Fixing Hanging Hive DROP TABLE on PostgreSQL Metastore

On some older Hive deployments with PostgreSQL as the metastore database, DROP TABLE can hang while PostgreSQL shows UPDATE locks on metastore tables. This often happens when certain privilege tables and indexes were not created correctly during an upgrade or manual schema setup. This note shows a legacy DDL patch you can apply to add the missing tables and indexes so DROP TABLE completes successfully again. Always back up your metastore before running any DDL. Important legacy note: The SQL below matches a specific generation of the Hive metastore schema from around 2013. You should only apply it if you have confirmed that these tables and indexes are missing in your metastore and that the definitions match your Hive version. Always test on a non-production copy of your metastore first. Symptom When using PostgreSQL as the Hive metastore database, a statement like: DROP TABLE xyz; may hang indefinitely. On the PostgreSQL side, you see long-running transactions and loc...
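
Before applying any DDL patch, it helps to confirm that DROP TABLE really is blocked on locks inside the metastore database. A quick, version-agnostic check on the PostgreSQL side (a sketch; run it in the metastore database with sufficient privileges):

-- show lock requests that are currently waiting
SELECT relation::regclass AS locked_table, mode, granted, pid
FROM pg_locks
WHERE NOT granted;

If the waiting locks point at metastore privilege tables, compare your schema against the reference DDL for your Hive version before applying the legacy patch described in the post.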

Fixing Hive “Too Many Counters” MapReduce Failures

When Hive queries use many operators, MapReduce can hit its default counter limit and fail with a “Too many counters” exception. This short note explains why it happens (Hive creates multiple counters per operator), how to raise mapreduce.job.counters.max safely, and how to estimate how many operators your query uses with EXPLAIN so you can tune the setting without guessing. Symptom A Hive job fails with an error like: Ended Job = job_xxxxxx with exception 'org.apache.hadoop.mapreduce.counters.LimitExceededException (Too many counters: 201 max=200)' FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask Intercepting System.exit(1) The query might be complex, but otherwise looks syntactically fine. The problem is not your SQL, it’s the number of counters that MapReduce is willing to track. Why this happens: operators and counters Hive uses counters to track statistics for built-in operators (see the Hive LanguageManual UDF / op...
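
A minimal Hive session sketch of the tuning described above; the limit of 500 and the query are only examples, and on some Hadoop versions the limit must also be raised in mapred-site.xml on the cluster:

-- allow more counters for this session
SET mapreduce.job.counters.max=500;

-- count the operator entries in the plan to estimate how many counters the query will need
EXPLAIN SELECT dept, COUNT(*) FROM employees GROUP BY dept;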

Practical Memory Sizing for Apache Flume Sources, Sinks and File Channels

Apache Flume still appears in many legacy data estates, and most operational issues come from undersized heap or direct memory. This updated guide explains how to estimate memory requirements for Flume sources, sinks and file channels, how batch sizing impacts heap usage, and how replay behavior can drastically increase memory demand. The goal is to give operators a reliable sizing baseline instead of trial-and-error tuning. Memory Requirements for Flume Sources and Sinks The dominant memory cost for each event comes from its body plus a small overhead for headers (typically around 100 bytes, depending on the transaction agent). To estimate memory for a batch: Take the average or p90 event size. Add a buffer for headers and variability. Multiply by the maximum batch size. This result approximates the memory required to hold a batch in a Source or Sink. A Sink needs memory for one batch at a time. A Source needs memory for one batch multiplied by the number of ...
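
As a quick illustration of the sizing rule (all numbers are assumptions, not measurements): with a p90 event body of 1,024 bytes, roughly 100 bytes of header overhead and a batch size of 1,000, a Sink needs on the order of (1,024 + 100) × 1,000 ≈ 1.1 MB of heap to hold one batch in flight. A Source that keeps, say, 10 such batches in flight at once would need roughly 10 × 1.1 MB ≈ 11 MB, plus general JVM overhead.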

Automating HBase Major Compactions with Cron and Kerberos

Major compactions in HBase can be scheduled during low-traffic hours to reduce load on RegionServers. This guide shows how to trigger a compaction from the HBase shell using a simple Ruby script and how to wrap it in a Kerberos-aware cron job. It reflects common operational practice in legacy Hadoop clusters where maintenance windows still matter. Why Schedule Major Compactions? Major compactions rewrite all store files of an HBase table, improving read performance but putting additional pressure on the cluster. Many administrators run them during off-peak windows. HBase itself does not provide built-in scheduling, so automation is typically handled with cron or at . Ruby Script for HBase Shell HBase shell executes commands through JRuby, so a simple script triggers the compaction: # m_compact.rb major_compact 't1' exit Cron-Compatible Shell Wrapper Below is an example daily_compact script that refreshes a Kerberos ticket and runs the compaction via the HBase...
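
A sketch of such a wrapper, where the principal, keytab path, script location and schedule are examples rather than fixed conventions:

#!/bin/bash
# daily_compact - refresh the Kerberos ticket, then run the compaction script
kinit -kt /etc/security/keytabs/hbase.keytab hbase/$(hostname -f)@YOUR.REALM
hbase shell /opt/scripts/m_compact.rb

# crontab entry: run every night at 03:00 and keep a log
# 0 3 * * * /opt/scripts/daily_compact >> /var/log/daily_compact.log 2>&1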

Lightweight Kerberos Setup for a Secure Hadoop Cluster (Updated Guide)

Kerberos remains the most reliable way to secure Hadoop services without pulling in a full enterprise identity system. This updated guide shows how to bootstrap a minimal MIT Kerberos setup for a small or lab Hadoop cluster, how to configure principals and keytabs, and how to enable authentication in HDFS and YARN. It is designed for simple security needs, rapid testing environments and isolated data platforms. Many engineers still ask for a simple and fast way to secure a Hadoop cluster without integrating Active Directory or enterprise-wide authentication. This guide modernizes the classic lightweight approach: deploy a small MIT Kerberos KDC, create only the required Hadoop service principals, and enable Kerberos-based authentication across HDFS and YARN. If you need a deeper dive into multi-realm trust, security architecture or production-grade Kerberos setups, see the extended guide: Hadoop and Trusted MITv5 Kerberos . 1. Install and Configure a MIT Kerberos KDC Insta...
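
The following sketch shows the general shape of such a bootstrap on RHEL/CentOS; the realm name, hostnames and keytab paths are examples only:

# install the MIT Kerberos KDC, admin server and client tools
yum install krb5-server krb5-workstation

# create the realm database and an admin principal
kdb5_util create -s -r HADOOP.LOCAL
kadmin.local -q "addprinc admin/admin@HADOOP.LOCAL"

# create a Hadoop service principal and export its keytab
kadmin.local -q "addprinc -randkey nn/namenode.hadoop.local@HADOOP.LOCAL"
kadmin.local -q "xst -k /etc/security/keytabs/nn.service.keytab nn/namenode.hadoop.local@HADOOP.LOCAL"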

Enabling JMX Monitoring for Apache Flume Agents

Apache Flume exposes operational metrics through Ganglia and JMX, but JMX remains the most flexible option for modern monitoring stacks. This updated guide shows how to enable JMX inside Flume’s environment configuration, choose a safe port, and connect with jconsole or any JMX-capable monitoring system. A simple JVM flag change gives full visibility into Flume performance and resource usage. Why Use JMX for Flume Monitoring? While legacy setups often used Ganglia, most modern environments integrate JVM metrics into Prometheus, Datadog or similar platforms. JMX is the simplest and most universal way to expose Flume internals such as channel size, event throughput, memory usage and sink performance. Enable JMX in flume-env.sh Edit $FLUME_HOME/conf/flume-env.sh and add or update the JAVA_OPTS line. Choose a port that is free and accessible from your monitoring host. JAVA_OPTS=" -Xms100m -Xmx200m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.por...
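
A complete sketch of the JAVA_OPTS line from above, assuming port 54321 is free and the monitoring network is trusted (authentication and SSL are disabled here purely for simplicity):

JAVA_OPTS="-Xms100m -Xmx200m \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=54321 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"

# then connect from a workstation, for example with:
jconsole flume-host:54321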

BigData - an Overview

(This article is also available as a slideshow:  http://www.slideshare.net/mapredit/big-data-mit-apache-hadoop ) More and more, BigData is pushing its way into the trade press as a nebulous term. What is clear: anyone who wants to keep up in business and successfully bring innovative future projects to completion cannot avoid the topic. But why can it not be avoided? What is the motivation for collecting huge amounts of data? The path there is quite simple and has been followed by many companies for years already, only with far greater expenditure of manpower and financial investment. An example: log files are consolidated across huge data farms; jobs running for weeks churn through terabytes of the collected and prepared data. If an error occurs somewhere in the chain, the run ideally restarts at the point of interruption - or from the beginning. But until then, a long process chain has to be followed in order to obtain usable data for this one job ...

Flume 1.2.0 released

Yesterday the Apache Flume team released the next major version, 1.2.0. Here is an overview of the fixes and additions (thanks Mike, I copied your overview): Apache Flume 1.2.0 is the third release under the auspices of Apache of the so-called "NG" codeline, and our first release as a top-level Apache project! Flume 1.2.0 has been put through many stress and regression tests, is stable, production-ready software, and is backwards-compatible with Flume 1.1.0. Four months of very active development went into this release: a whopping 192 patches were committed since 1.1.0, representing many features, enhancements, and bug fixes. While the full change log can be found in the link below, here are a few new feature highlights: * New durable file channel  * New client API  * New HBase sinks (two different implementations)  * New Interceptor interface (a plugin processing API)  * New JMX-based monitoring support With this release - the first after e...

Get Apache Flume 1.3.x running on Windows

Since we have seen increasing interest in the Flume community in getting Apache Flume running on Windows systems again, I spent some time figuring out how to get there. The good news: Apache Flume runs on Windows. You just need a few tweaks to get it running. Prerequisites Build system: Maven 3.x, git, JDK 1.6.x, WinRAR (or a similar program) Apache Flume agent: JDK 1.6.x, WinRAR (or a similar program), Ultraedit++ or a similar text editor Tweak the Windows build box 1. Download and install JDK 1.6.x from Oracle 2. Set the environment variables => Start - type "env" into the search box, select "Edit system environment variables", click Environment Variables, select "New" in the "System variables" box, type "JAVA_HOME" into "Variable name" and the path to your JDK installation into "Variable value" (Example: C:\Program Files (x86)\Java\jdk1.6.0_33) 3. Download Maven from Apache 4. Set...
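
If you prefer the command line over the GUI dialog, the JAVA_HOME variable from step 2 can also be set from a command prompt (the path is an example and must match your actual installation); open a new prompt afterwards so the change is picked up:

setx JAVA_HOME "C:\Program Files (x86)\Java\jdk1.6.0_33"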

Using the Apache Flume HBase Sink: How the Integration Works and How to Configure It

The first Apache Flume HBase sink introduced a simple way to stream events directly into HBase tables. This modernized walkthrough explains how the sink works, what its limitations are, how Flume resolves HBase configuration files, and how to set up a minimal but functional Flume-to-HBase pipeline. Although this feature originated in early Flume versions, many legacy Hadoop deployments still rely on it today. Overview The HBase sink was added to the Flume trunk and provided direct write support from Flume channels into HBase tables. It relies on synchronous HBase client operations and requires that HBase table metadata already exists. The sink handles flushes, transactions and rollbacks, allowing Flume to treat HBase as a durable storage target. Building Flume from Trunk In early versions the HBase sink was only available in the trunk source. The following sequence checks out Flume and builds it using Maven: git clone git://git.apache.org/flume.git cd flume git checkout tr...
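
To complement the build steps, here is a minimal sketch of how the sink is typically wired in flume.conf; the agent, channel and table names are examples, the HBase table and column family must already exist, and hbase-site.xml must be on the agent's classpath:

# source definition omitted; only channel and HBase sink shown
agent.sources = src1
agent.channels = ch1
agent.sinks = hbaseSink

agent.channels.ch1.type = memory

agent.sinks.hbaseSink.type = org.apache.flume.sink.hbase.HBaseSink
agent.sinks.hbaseSink.table = flume_events
agent.sinks.hbaseSink.columnFamily = log
agent.sinks.hbaseSink.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
agent.sinks.hbaseSink.channel = ch1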

How to Filter Multiple Columns in HBase Using SingleColumnValueFilter

Filtering multiple columns in HBase requires combining several SingleColumnValueFilter instances inside a FilterList. By enforcing boolean AND logic across column families, only rows where all required columns contain valid values are returned. This pattern helps control scan output precisely even in older HBase deployments. HBase is a column-oriented database, storing data by column family and qualifier. When executing a scan, filters help reduce the returned data set to only rows matching specific criteria. A frequent challenge is filtering on more than one column simultaneously . For example, you may require that two or more specific columns must contain valid values before a row qualifies. The practical solution is to use multiple SingleColumnValueFilter objects combined in a FilterList . This gives you boolean AND logic across all defined filters. List<Filter> list = new ArrayList<Filter>(2); // Filter on family "fam1", qualifier "VALUE1...
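
A small self-contained sketch of the pattern, using the HBase 0.92-era client API; the families, qualifiers and values (fam1:col1 = "val1", fam2:col2 = "val2") are illustrative only:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiColumnScan {
  public static Scan buildScan() {
    List<Filter> filters = new ArrayList<Filter>(2);

    // row must have fam1:col1 == "val1" ...
    SingleColumnValueFilter f1 = new SingleColumnValueFilter(
        Bytes.toBytes("fam1"), Bytes.toBytes("col1"),
        CompareOp.EQUAL, Bytes.toBytes("val1"));
    f1.setFilterIfMissing(true); // skip rows where the column is absent
    filters.add(f1);

    // ... and fam2:col2 == "val2"
    SingleColumnValueFilter f2 = new SingleColumnValueFilter(
        Bytes.toBytes("fam2"), Bytes.toBytes("col2"),
        CompareOp.EQUAL, Bytes.toBytes("val2"));
    f2.setFilterIfMissing(true);
    filters.add(f2);

    // MUST_PASS_ALL = boolean AND across all filters in the list
    FilterList all = new FilterList(FilterList.Operator.MUST_PASS_ALL, filters);

    Scan scan = new Scan();
    scan.setFilter(all);
    return scan;
  }
}

Without setFilterIfMissing(true), rows that lack one of the columns would still be emitted, which is usually not what "all columns must contain valid values" intends.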

How to Stop New Hadoop MapReduce Jobs Using Queue ACLs

This article shows how to temporarily stop new Hadoop MapReduce jobs from being submitted by enabling ACLs and configuring mapred-queue-acls.xml. Existing jobs continue to run, which makes this pattern useful for maintenance windows or decommissioning work on a classic MapReduce cluster. In a classic Hadoop MapReduce (MRv1) cluster, there are situations where you want to stop accepting new MapReduce jobs while allowing already running jobs to finish. This is especially useful during maintenance, node decommissioning or cluster reconfiguration. One simple way to achieve this is to enable ACLs on the MapReduce job queue and then configure the submission ACL so that effectively nobody is allowed to submit new jobs. 1. Enable ACLs for MapReduce queues First, configure queue ACLs in $HADOOP/conf/mapred-queue-acls.xml . A typical configuration might allow a set of users and groups to submit jobs and a smaller set of admins to manage them: <configuration> <property...
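
A sketch of the "nobody may submit" state described above; the admin user is an example, and the single-space value is what classic MapReduce interprets as an empty ACL:

<!-- mapred-site.xml: turn on queue ACL checking -->
<property>
  <name>mapred.acls.enabled</name>
  <value>true</value>
</property>

<!-- mapred-queue-acls.xml: a single space means no user or group may submit -->
<property>
  <name>mapred.queue.default.acl-submit-job</name>
  <value> </value>
</property>
<property>
  <name>mapred.queue.default.acl-administer-jobs</name>
  <value>hadoopadmin</value>
</property>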

How to Access Kerberized Hadoop Web UIs Using SPNEGO

Kerberized Hadoop clusters use SPNEGO for browser authentication. To sign into UIs such as NameNode, ResourceManager, Oozie or HiveServer2, your browser must support SPNEGO, your client must have a valid Kerberos ticket and DNS and realm mappings must match. This guide explains how to enable SPNEGO for modern Firefox, Chrome and Edge. Most Hadoop Web UIs rely on SPNEGO (Simple and Protected GSSAPI Negotiation) to authenticate users through Kerberos. When a browser accesses a Kerberos-protected endpoint such as: http://namenode-host:9870 http://rm-host:8088 http://oozie-host:11000/oozie the server expects the browser to negotiate Kerberos credentials automatically. If the browser is not configured correctly, the user will see repeated login prompts or 401: Unauthorized . Prerequisites You must have a valid Kerberos ticket: kinit your_user@YOUR.REALM DNS and reverse DNS for the Hadoop services must be correct The SPN for the UI must match: HTTP/hostname@...
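
As a sketch of the browser side (the domain .hadoop.example.com is a placeholder for your cluster's DNS domain):

# Firefox: open about:config and set the trusted-URI list
network.negotiate-auth.trusted-uris = .hadoop.example.com

# Chrome on Linux/macOS: pass an allow-list at startup
# (older Chrome versions use --auth-server-whitelist instead)
google-chrome --auth-server-allowlist="*.hadoop.example.com"

# Chrome/Edge on Windows are normally configured via the AuthServerAllowlist policy
# rather than a command-line flag.

# and before opening the UI, make sure a ticket exists:
kinit your_user@YOUR.REALM
klist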

Getting Started with Apache Flume NG: Flows, Agents and Syslog-to-HDFS Examples

Apache Flume NG replaced the original master/collector architecture with lightweight agents that can be wired together to form flexible data flows. This guide explains what changed with Flume NG, how the agent–channel–sink model works, and walks through simple configurations for syslog ingestion to a console logger and to HDFS. It’s aimed at engineers who still operate Flume in legacy estates or need to understand it for migrations. From Flume to Flume NG Apache Flume is a distributed log and event collection service. With Flume NG, the project moved away from the original master/client and node/collector design and adopted a simpler, more robust architecture based on standalone agents. Key changes introduced by Flume NG: No external coordination service required for basic operation. No master/client or node/collector roles—only agents . Agents can be chained together to build arbitrary flows and fan-in/fan-out patterns. Lightweight runtime; small heap sizes are s...
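
A minimal sketch of a syslog-to-console flow like the ones the post walks through; this is not the post's exact configuration, and the agent name, port and capacity are examples:

agent1.sources = sysSrc
agent1.channels = memCh
agent1.sinks = console

agent1.sources.sysSrc.type = syslogtcp
agent1.sources.sysSrc.host = 0.0.0.0
agent1.sources.sysSrc.port = 5140
agent1.sources.sysSrc.channels = memCh

agent1.channels.memCh.type = memory
agent1.channels.memCh.capacity = 1000

agent1.sinks.console.type = logger
agent1.sinks.console.channel = memCh

# start the agent
flume-ng agent -n agent1 -c conf -f flume.conf

Swapping the sink type to hdfs and adding an hdfs.path gives the HDFS variant of the same flow.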

How to Use Snappy Compression with Hive and Hadoop (Updated)

Snappy is a fast compression codec widely used in Hadoop ecosystems. This updated guide shows how to generate data, upload it to HDFS, process it with Hive using Snappy compression, verify the output files, and load Snappy-compressed data back into Hive using modern Hadoop commands. Snappy is a high-performance compression and decompression library originally developed at Google. It is optimized for speed rather than maximum compression ratio, making it a preferred codec in many Hadoop and Hive pipelines. Snappy is integrated in all modern Hadoop distributions (Hadoop 2.x and 3.x) and works with Hive, MapReduce, and increasingly Tez or Spark-based Hive deployments. 1. Create sample input data Generate a small test file: $ seq 1 1000 | awk '{OFS="\001";print $1, $1 % 10}' > test_input.hive $ head -5 test_input.hive 1^A1 2^A2 3^A3 4^A4 5^A5 2. Upload the data into HDFS Updated command: Hadoop now uses hdfs dfs instead of hadoop dfs . $ hdfs dfs ...
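
A sketch of the Hive session settings used to produce Snappy-compressed output; the table names are placeholders and the settings apply per session:

SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

INSERT OVERWRITE TABLE target_table
SELECT * FROM source_table;

Afterwards, hdfs dfs -ls on the table's warehouse directory should show Snappy-compressed output files (for text output, they end in .snappy).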