Configuring a MySQL Metastore for Hive: From Embedded Derby to Multi-User Clusters

Hive’s embedded Derby database is fine for local testing, but it breaks down as soon as multiple users and services need to share metadata. This guide shows how to move Hive from the default single-user Derby setup to a shared MySQL metastore: configuring MySQL, creating the Hive schema, wiring Hive to the external database, and distributing drivers and configuration across a Hadoop cluster.

Apache Hive provides a SQL-like query language (HiveQL) on top of HDFS, making large-scale data analysis accessible to anyone with SQL experience. By default, however, Hive uses an embedded Derby database for its metastore, which is not suited for multi-user or multi-service environments.

To run Hive in a real cluster, you need an external metastore database. The database stores metadata about:

  • Databases and tables
  • Partitions and storage descriptors
  • SerDes, column definitions and privileges

This article walks through configuring a MySQL-based Hive metastore, suitable for multi-user clusters and for replication or failover setups.

Why Move from Derby to MySQL?

The limitations of the embedded Derby metastore include:

  • Single-user access: not safe for concurrent writes
  • No realistic HA story
  • Harder to manage and back up in production

Using MySQL (or another supported RDBMS) provides:

  • Proper concurrency and isolation
  • Flexible backup and replication options
  • Centralized metadata for all Hive clients in the cluster

Example MySQL Configuration for a Hive Metastore

Install MySQL server (version 5.1+ in the original setup) and configure it with InnoDB and binary logging. A simplified example of /etc/my.cnf:

[mysqld_safe]
socket = /var/lib/mysql/mysql.sock

[mysqld]
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/lib/mysql/mysql.sock
log-error = /var/log/mysqld.log
datadir = /opt/hadoop/mysql

default_storage_engine = InnoDB
skip-bdb = 1
skip_name_resolve

# Timeouts
connect_timeout = 30
wait_timeout = 30
interactive_timeout = 100

# Cache and memory settings (example values)
key_buffer = 128M
thread_concurrency = 4
thread_cache = 16
thread_stack = 256K
table_cache = 512
tmp_table_size = 64M
max_heap_table_size = 64M

# Binary logging for replication or PITR
server-id = 1001
log_bin = /var/log/mysql/mysqlserver-bin.log
expire_logs_days = 3
max_binlog_size = 256M

# InnoDB tuning
innodb_file_per_table = 1
innodb_data_file_path = ibdata1:10M:autoextend
innodb_buffer_pool_size = 512M
innodb_log_file_size = 16M
innodb_flush_log_at_trx_commit = 1

# Slow query logging and query cache (legacy setup)
long_query_time = 2
log_slow_queries = /var/log/mysql/mysql-slow.log
query_cache_size = 64M
query_cache_type = 1
query_cache_limit = 4M

This configuration uses:

  • InnoDB as the default storage engine
  • Binary logs to support replication or recovery
  • Basic tuning for memory, caches and slow query diagnostics

Creating the Hive Metastore Database and User

After restarting MySQL (for example, service mysqld restart), create a dedicated user and database for Hive:

mysql> CREATE DATABASE hive_live;
mysql> CREATE USER 'hive'@'%' IDENTIFIED BY 'strong_password';
mysql> GRANT ALL PRIVILEGES ON hive_live.* TO 'hive'@'%' WITH GRANT OPTION;

For a dedicated metastore host, this simple privilege model is acceptable, though in modern setups you would typically grant more restrictive privileges and use stricter network controls.
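
For reference, a tighter variant might look like the following sketch. The hostname is a placeholder, and the privilege list is an assumption about what the metastore needs (DML at runtime, plus DDL for schema creation and upgrades), so verify it against your Hive version:

```sql
-- Hypothetical locked-down variant: no GRANT OPTION, and the account
-- is restricted to the host running the Hive metastore service.
CREATE USER 'hive'@'hiveserver.example.com' IDENTIFIED BY 'strong_password';
GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, ALTER, INDEX
  ON hive_live.* TO 'hive'@'hiveserver.example.com';
FLUSH PRIVILEGES;
```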

Configuring Hive to Use MySQL as Metastore

Next, configure Hive to point to the MySQL backend. In classic deployments this was placed in hive-site.xml or hive-default.xml:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysqlserver:3306/hive_live?createDatabaseIfNotExist=true</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>strong_password</value>
</property>

These properties tell the Hive metastore service how to connect to MySQL and which database to use for metadata.
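
A side note for larger deployments: when the metastore runs as a standalone Thrift service, only that host needs the JDBC credentials above, and clients point at the service instead of connecting to MySQL directly. A sketch of the client-side property (the host is a placeholder; 9083 is the conventional metastore port):

```xml
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>
```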

Distributing the MySQL JDBC Driver and Hive Configuration

Hive requires the MySQL JDBC driver on every node that runs Hive services or clients. Download the MySQL Connector/J driver and distribute it across the cluster, for example:

for i in $(cat /etc/hadoop-0.20/conf/slaves); do
  scp -r /usr/lib/hive/lib/mysql-connector-java-5.1.11-bin.jar \
    "$i":/usr/lib/hive/lib/
done

Then copy the Hive configuration to the same hosts:

for i in $(cat /etc/hadoop-0.20/conf/slaves); do
  scp -r /etc/hive/conf/hive-* \
    "$i":/etc/hive/conf.dist/
done

The exact paths will vary between distributions, but the core idea is consistent: every Hive node needs the same hive-site.xml and the JDBC driver.
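
The two loops above can be folded into one small helper. This is a sketch, not a standard tool; the paths and the slaves-file location are the assumptions used throughout this article, and `DRY_RUN=1` makes it print the scp commands instead of running them, which is handy for checking the host list first:

```shell
#!/bin/sh
# Push a file or directory to every host listed in a slaves file.
# DRY_RUN=1 only prints the scp commands instead of executing them.
push_to_hosts() {
  hosts_file=$1
  src=$2
  dest=$3
  while read -r host; do
    [ -z "$host" ] && continue          # skip blank lines
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "scp -r $src $host:$dest"
    else
      scp -r "$src" "$host:$dest"
    fi
  done < "$hosts_file"
}

# Example (dry run against a throwaway host list):
printf 'node1\nnode2\n' > /tmp/slaves.example
DRY_RUN=1 push_to_hosts /tmp/slaves.example \
  /usr/lib/hive/lib/mysql-connector-java-5.1.11-bin.jar /usr/lib/hive/lib/
```

In a real run you would pass `/etc/hadoop-0.20/conf/slaves` (or your distribution's equivalent) as the host list and drop `DRY_RUN`.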

Verifying the Metastore Schema

The first time Hive connects to the new database (for example, on your first query), the metastore creates its schema automatically. You can verify this by checking the tables:

mysql> USE hive_live;
mysql> SHOW TABLES;
+-------------------------+
| Tables_in_hive_live     |
+-------------------------+
| BUCKETING_COLS          |
| COLUMNS                 |
| DATABASE_PARAMS         |
| DBS                     |
| DB_PRIVS                |
| GLOBAL_PRIVS            |
| IDXS                    |
| INDEX_PARAMS            |
| PARTITIONS              |
| PARTITION_KEYS          |
| PARTITION_KEY_VALS      |
| PARTITION_PARAMS        |
| PART_COL_PRIVS          |
| PART_PRIVS              |
| ROLES                   |
| ROLE_MAP                |
| SDS                     |
| SD_PARAMS               |
| SEQUENCE_TABLE          |
| SERDES                  |
| SERDE_PARAMS            |
| SORT_COLS               |
| TABLE_PARAMS            |
| TBLS                    |
| TBL_COL_PRIVS           |
| TBL_PRIVS               |
| TYPES                   |
| TYPE_FIELDS             |
+-------------------------+

These tables are the internal structures Hive uses to track databases, tables, partitions, storage descriptors and privileges.
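
As a further sanity check: Hive registers a built-in database named default when the schema is initialized, so after the first session the DBS table should contain at least that row:

```sql
mysql> SELECT DB_ID, NAME FROM DBS;
```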

Upgrading the Hive Metastore Schema

When upgrading Hive, you may need to upgrade the metastore schema as well. Hive ships SQL scripts for this purpose, typically located under:

/var/lib/hive/metastore/scripts/upgrade/mysql/

To upgrade the schema, connect to MySQL and source the appropriate script. Note the naming convention: hive-schema-<version>.mysql.sql scripts initialize a fresh schema, while the upgrade-<from>-to-<to>.mysql.sql scripts migrate an existing one. For example, for a historical version:

mysql> USE hive_live;
mysql> SOURCE /var/lib/hive/metastore/scripts/upgrade/mysql/hive-schema-0.7.0.mysql.sql;

Always perform a full database backup before applying upgrades.
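
A minimal backup sketch using mysqldump; --single-transaction assumes the metastore tables use InnoDB (as configured above), so the dump is a consistent snapshot without locking Hive out of the metastore:

```shell
mysqldump --single-transaction -u root -p hive_live > hive_live-backup.sql
```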

How This Pattern Maps to Modern Setups

While modern deployments often use managed metastores, Postgres or cloud-native services, the core pattern remains the same:

  • External RDBMS for shared metadata
  • Strong guarantees for concurrency and durability
  • Schema upgrade scripts managed alongside Hive versions

Understanding this MySQL-based metastore configuration helps when maintaining legacy clusters and when migrating toward newer Hive and lakehouse architectures.

Reference: MySQL Connector/J
