Using Hive’s HBaseStorageHandler with Existing HBase Tables

Hive’s HBaseStorageHandler lets you expose HBase tables as Hive tables so you can run SELECT and INSERT statements over HBase data. This article shows how to configure Hive to talk to HBase, create a new HBase table through Hive, and attach Hive EXTERNAL TABLE definitions to existing HBase tables using the correct column family mappings. It reflects a Hive 0.9 / HBase 0.92 era setup and is mainly useful for legacy clusters and migrations.

Note (2025): This guide describes the original HBase integration added to Hive through HIVE-705, as it worked with Hive 0.9 and HBase 0.92. Modern Hadoop stacks often favour storing analytics data in Parquet or Iceberg and querying it via engines like Hive, Impala, Trino or Spark. Use this pattern primarily when you need to understand or maintain existing Hive-on-HBase tables, not for new designs.

Hive–HBase integration in a nutshell

Hive can read from and write to HBase tables using a dedicated storage handler. Once configured, you can:

  • Run SELECT queries over HBase-backed tables.
  • Use INSERT to write data into HBase via Hive.
  • Expose existing HBase tables to Hive using CREATE EXTERNAL TABLE.

The feature is implemented by HBaseStorageHandler and was introduced in HIVE-705. At the time of writing, Hive 0.9 required HBase 0.92 or newer for this integration to work.

Prerequisites and storage handler

The HBase storage handler ships with Hive and should be available in the Hive library directory:

$HIVE_HOME/lib/hive-hbase-handler*.jar

In that era, the handler required:

  • Hadoop 0.20.x or later
  • ZooKeeper 3.3.4 or later
  • Matching HBase and Hive versions (e.g. HBase 0.92 with Hive 0.9)

Configuring Hive to see HBase

Hive needs to know where to find the HBase configuration so that it can locate ZooKeeper and, through it, the HBase master and region servers. One simple way is to add HBase’s configuration directory to the auxiliary path in hive-site.xml:

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///etc/hbase/conf</value>
</property>

After editing hive-site.xml, distribute both hbase-site.xml and hive-site.xml to every Hive client and to every node that runs HiveServer or the Hive CLI. This ensures that all Hive components resolve the same HBase/ZooKeeper configuration.
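Note that hive.aux.jars.path is normally used for a list of JAR files, so whether a configuration directory is picked up this way can depend on the Hive build. If it is not, the Hive documentation's approach of setting hbase.zookeeper.quorum directly is an alternative. The sketch below does this for a single Hive CLI session; the three host names are placeholders, not values from this setup, and a per-session SET is an assumption that is convenient for testing rather than a replacement for the distributed configuration files.

```sql
-- Point this Hive session at the ZooKeeper ensemble used by HBase.
-- zk1/zk2/zk3.example.com are hypothetical host names; replace with your own.
SET hbase.zookeeper.quorum=zk1.example.com,zk2.example.com,zk3.example.com;

-- Print the value back to confirm what the session will use.
SET hbase.zookeeper.quorum;
```

The same property can be placed in hive-site.xml once the value is confirmed to work.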

Create a new HBase table from Hive

You can define a Hive table that is backed by a new HBase table using HBaseStorageHandler and the hbase.columns.mapping property. For example:

CREATE TABLE hbase_test (
  key1 STRING,
  col1 STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf1:c1"
)
TBLPROPERTIES (
  "hbase.table.name" = "hive_test"
);

This statement tells Hive to:

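  • Create an HBase table named hive_test. Because this is a managed (non-EXTERNAL) table, the statement fails if hive_test already exists in HBase; use CREATE EXTERNAL TABLE for that case.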
  • Map:
    • key1 to the HBase row key (denoted by :key).
    • col1 to column c1 in column family cf1.

In HBase shell, the table then looks like:

hbase(main):001:0> describe 'hive_test'
DESCRIPTION                                                                 ENABLED
{NAME => 'hive_test',
 FAMILIES => [{NAME => 'cf1', BLOOMFILTER => 'NONE',
 REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE',
 MIN_VERSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536',
 IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}                             true
1 row(s) in 0.1190 seconds

From here, you can run Hive queries like SELECT key1, col1 FROM hbase_test and Hive will read directly from HBase.
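A minimal sketch of writing to and reading from this table follows. It assumes a hypothetical native Hive table named staging_events with two STRING columns, id and payload; neither the table nor the row key value is part of the original setup. With HBase-backed tables, INSERT OVERWRITE does not clear the HBase table first: existing rows are only replaced when an incoming row has the same key.

```sql
-- staging_events(id STRING, payload STRING) is a hypothetical source table.
-- Each selected row becomes an HBase row: id -> row key, payload -> cf1:c1.
INSERT OVERWRITE TABLE hbase_test
SELECT id, payload
FROM staging_events;

-- Read the data back through the same mapping.
-- 'row-001' is an illustrative key value.
SELECT key1, col1
FROM hbase_test
WHERE key1 = 'row-001';
```

After the insert, a scan of 'hive_test' in the HBase shell should show the same rows under cf1:c1.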

Attaching Hive to an existing HBase table

More often, you already have data in HBase and want to query it from Hive. In that case, you use CREATE EXTERNAL TABLE so Hive does not own the HBase table lifecycle.

Inspect the HBase schema first

Start by inspecting the HBase table to find its column families and qualifiers:

hbase(main):003:0> describe 't1'
DESCRIPTION                                                                 ENABLED
{NAME => 't1',
 FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
 COMPRESSION => 'NONE', VERSIONS => '1', TTL => '2147483647',
 MIN_VERSIONS => '0', BLOCKSIZE => '65536', IN_MEMORY => 'false',
 BLOCKCACHE => 'true'}]}                                                    true
1 row(s) in 0.0700 seconds

In this example, the table t1 has a single column family f1. We’ll map a Hive column to one of its qualifiers, for example f1:c1.

Create an EXTERNAL table in Hive

Hive does not support ALTER TABLE on non-native (HBase-backed) tables in this old integration, so you cannot attach an HBase mapping to a Hive table after the fact. Instead, you define an EXTERNAL table from the start:

CREATE EXTERNAL TABLE hbase_test2 (
  key1 STRING,
  col1 STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,f1:c1"
)
TBLPROPERTIES (
  "hbase.table.name" = "t1"
);

Here we:

  • Map key1 to the HBase row key (:key).
  • Map col1 to column c1 in family f1.
  • Tell Hive the underlying HBase table is t1 via hbase.table.name.

From Hive’s perspective, the table looks like a regular two-column table:

hive> describe hbase_test2;
OK
key1    string  from deserializer
col1    string  from deserializer
Time taken: 0.106 seconds

You can now run SELECT queries on hbase_test2, and Hive will read from the existing HBase table t1. INSERT statements against hbase_test2 write into t1 through the same mapping, so treat them with the same care as direct HBase writes.
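The sketch below shows a query against the mapping above and, as a variation, a second external table that maps the whole column family f1 to a Hive MAP instead of a single qualifier. The table name hbase_t1_family and the column name f1_cols are illustrative assumptions; the "f1:" mapping form (family name with an empty qualifier) is the one described in the Hive HBase integration documentation.

```sql
-- Query the existing HBase table t1 through the single-qualifier mapping.
SELECT key1, col1
FROM hbase_test2
LIMIT 10;

-- Variation: expose every qualifier in family f1 as one MAP column.
-- hbase_t1_family and f1_cols are hypothetical names.
CREATE EXTERNAL TABLE hbase_t1_family (
  key1 STRING,
  f1_cols MAP<STRING, STRING>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,f1:"
)
TBLPROPERTIES (
  "hbase.table.name" = "t1"
);

-- Look up a single qualifier by name from the map.
SELECT key1, f1_cols['c1']
FROM hbase_t1_family
LIMIT 10;
```

Both Hive tables point at the same HBase table, so the map form is useful when the set of qualifiers is not known up front.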

Things to keep in mind

  • This integration is tied to specific Hive/HBase versions (Hive 0.9 with HBase 0.92 in this example).
  • Schema evolution is limited: non-native tables cannot be freely altered.
  • Performance depends heavily on HBase table design (row key, regions, compression and block cache).
  • For new workloads, prefer modern storage formats (Parquet/Iceberg) and engines that optimize for analytical queries.
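Because ALTER is not available for these tables, one way to change a mapping is to drop the EXTERNAL table and define it again. The sketch below adds a second column; the qualifier f1:c2 and the Hive column col2 are hypothetical and only illustrate the pattern. Dropping an EXTERNAL table removes the Hive metadata only, so the HBase table t1 and its data are left in place.

```sql
-- Remove only the Hive-side definition; HBase table t1 is untouched
-- because hbase_test2 was created as EXTERNAL.
DROP TABLE hbase_test2;

-- Recreate it with an extra mapped column.
-- col2 / f1:c2 are hypothetical additions for illustration.
CREATE EXTERNAL TABLE hbase_test2 (
  key1 STRING,
  col1 STRING,
  col2 STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,f1:c1,f1:c2"
)
TBLPROPERTIES (
  "hbase.table.name" = "t1"
);
```

Do not use this pattern on a managed (non-EXTERNAL) HBase-backed table unless you intend to delete the HBase table as well.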

If you still run legacy Hive-on-HBase tables, this pattern provides a clear way to expose them to SQL without rewriting your storage layer immediately. It also helps when planning a migration by making HBase data visible in SQL for validation and export.
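As a rough sketch of the validation and export step, the statements below count the rows visible through the mapping and copy them into a native Hive table. The target name t1_export is an assumption, and the copy uses Hive's default file format since newer formats were not available in Hive 0.9.

```sql
-- Sanity check: how many rows does Hive see through the HBase mapping?
SELECT COUNT(*) FROM hbase_test2;

-- Copy the mapped columns into a native Hive table for export.
-- t1_export is a hypothetical table name.
CREATE TABLE t1_export AS
SELECT key1, col1
FROM hbase_test2;

-- Compare the copy against the source count above.
SELECT COUNT(*) FROM t1_export;
```

From the native copy, the data can be converted to another format on a newer cluster without touching HBase again.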
