Configuring a MySQL Metastore for Hive: From Embedded Derby to Multi-User Clusters

Hive’s embedded Derby database is fine for local testing, but it breaks down as soon as multiple users and services need to share metadata. This guide shows how to move Hive from the default single-user Derby setup to a shared MySQL metastore: configuring MySQL, creating the Hive schema, wiring Hive to the external database, and distributing drivers and configuration across a Hadoop cluster.

Apache Hive provides a SQL-like query language (HiveQL) on top of HDFS, making large-scale data analysis accessible to anyone with SQL experience. By default, however, Hive uses an embedded Derby database for its metastore, which is not suited for multi-user or multi-service environments.

To run Hive in a real cluster, you need an external metastore database. The database stores metadata about:

  • Databases and tables
  • Partitions and storage descriptors
  • SerDes, column definitions and privileges
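
Each of these items ends up as rows in the metastore's backing tables. As a rough illustration, a single HiveQL DDL statement (the table and columns here are hypothetical, shown only as a sketch) touches most of them:

-- Hypothetical example table: creating it records the table itself (TBLS),
-- its columns (COLUMNS), the partition key (PARTITION_KEYS) and the
-- storage descriptor and SerDe (SDS, SERDES) in the metastore.
CREATE TABLE page_views (
  view_time INT,
  user_id   BIGINT,
  page_url  STRING
)
PARTITIONED BY (view_date STRING)
STORED AS TEXTFILE;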

This article walks through configuring a MySQL-based Hive metastore, suitable for multi-user clusters and for replication or failover setups.

Why Move from Derby to MySQL?

The limitations of the embedded Derby metastore include:

  • Single-user access: not safe for concurrent writes
  • No realistic HA story
  • Harder to manage and back up in production

Using MySQL (or another supported RDBMS) provides:

  • Proper concurrency and isolation
  • Flexible backup and replication options
  • Centralized metadata for all Hive clients in the cluster

Example MySQL Configuration for a Hive Metastore

Install MySQL server (version 5.1+ in the original setup) and configure it with InnoDB and binary logging. A simplified example of /etc/my.cnf:

[mysqld_safe]
socket = /var/lib/mysql/mysql.sock

[mysqld]
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/lib/mysql/mysql.sock
log-error = /var/log/mysqld.log
datadir = /opt/hadoop/mysql

default-storage-engine = InnoDB
skip-bdb
skip_name_resolve

# Timeouts
connect_timeout = 30
wait_timeout = 30
interactive_timeout = 100

# Cache and memory settings (example values)
key_buffer = 128M
thread_concurrency = 4
thread_cache = 16
thread_stack = 256K
table_cache = 512
tmp_table_size = 64M
max_heap_table_size = 64M

# Binary logging for replication or PITR
server-id = 1001
log_bin = /var/log/mysql/mysqlserver-bin.log
expire_logs_days = 3
max_binlog_size = 256M

# InnoDB tuning
innodb_file_per_table = 1
innodb_data_file_path = ibdata1:10M:autoextend
innodb_buffer_pool_size = 512M
innodb_log_file_size = 16M
innodb_flush_log_at_trx_commit = 1

# Slow query logging and query cache (legacy setup)
long_query_time = 2
log_slow_queries = /var/log/mysql/mysql-slow.log
query_cache_size = 64M
query_cache_type = 1
query_cache_limit = 4M

This configuration uses:

  • InnoDB as the default storage engine
  • Binary logs to support replication or recovery
  • Basic tuning for memory, caches and slow query diagnostics
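
Once the server restarts with this configuration, the settings can be confirmed directly; a minimal sanity check, assuming local root access via the mysql client:

mysql -u root -p -e "SHOW VARIABLES LIKE '%storage_engine%';"  # should report InnoDB
mysql -u root -p -e "SHOW MASTER STATUS;"                      # non-empty output means binary logging is on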

Creating the Hive Metastore Database and User

After restarting MySQL (for example, service mysqld restart), create a dedicated database and user for Hive:

mysql> CREATE DATABASE hive_live;
mysql> CREATE USER 'hive'@'%' IDENTIFIED BY 'strong_password';
mysql> GRANT ALL PRIVILEGES ON hive_live.* TO 'hive'@'%' WITH GRANT OPTION;

For a dedicated metastore host, this simple privilege model is acceptable, though in modern setups you would typically grant more restrictive privileges and use stricter network controls.
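
A sketch of such a tighter model, assuming the metastore connects from a known subnet (the 10.0.0.% host pattern and the exact privilege list are illustrative; the metastore still needs DDL rights to create and upgrade its schema):

mysql> CREATE USER 'hive'@'10.0.0.%' IDENTIFIED BY 'strong_password';
mysql> GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, ALTER, INDEX, DROP ON hive_live.* TO 'hive'@'10.0.0.%';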

Configuring Hive to Use MySQL as Metastore

Next, configure Hive to point to the MySQL backend. These properties belong in hive-site.xml, which overrides the defaults shipped in hive-default.xml:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysqlserver:3306/hive_live?createDatabaseIfNotExist=true</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>strong_password</value>
</property>

These properties tell the Hive metastore service how to connect to MySQL and which database to use for metadata.
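
Before starting any Hive service, it is worth confirming that a Hive node can actually reach MySQL with these credentials. A quick check with the mysql client, using the hostname and database from the JDBC URL above:

mysql -h mysqlserver -u hive -p hive_live -e "SELECT 1;"

If this prompts for the password and returns a single row, the JDBC connection string should work as well.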

Distributing the MySQL JDBC Driver and Hive Configuration

Hive requires the MySQL JDBC driver on every node that runs Hive services or clients. Download the MySQL Connector/J driver and distribute it across the cluster, for example:

for i in $(cat /etc/hadoop-0.20/conf/slaves); do
  scp /usr/lib/hive/lib/mysql-connector-java-5.1.11-bin.jar \
    "$i":/usr/lib/hive/lib/
done

Then copy the Hive configuration to the same hosts:

for i in $(cat /etc/hadoop-0.20/conf/slaves); do
  scp /etc/hive/conf/hive-* \
    "$i":/etc/hive/conf.dist/
done

The exact paths will vary between distributions, but the core idea is consistent: every Hive node needs the same hive-site.xml and the JDBC driver.
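
A loop in the same style can confirm that no node was missed (same illustrative paths as above):

for i in $(cat /etc/hadoop-0.20/conf/slaves); do
  ssh "$i" "ls /usr/lib/hive/lib/mysql-connector-java-*.jar" \
    || echo "JDBC driver missing on $i"
done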

Verifying the Metastore Schema

The metastore creates its schema in the MySQL database on first use, for example when the first Hive query runs. You can verify this by checking the tables:

mysql> USE hive_live;
mysql> SHOW TABLES;
+-------------------------+
| Tables_in_hive_live     |
+-------------------------+
| BUCKETING_COLS          |
| COLUMNS                 |
| DATABASE_PARAMS         |
| DBS                     |
| DB_PRIVS                |
| GLOBAL_PRIVS            |
| IDXS                    |
| INDEX_PARAMS            |
| PARTITIONS              |
| PARTITION_KEYS          |
| PARTITION_KEY_VALS      |
| PARTITION_PARAMS        |
| PART_COL_PRIVS          |
| PART_PRIVS              |
| ROLES                   |
| ROLE_MAP                |
| SDS                     |
| SD_PARAMS               |
| SEQUENCE_TABLE          |
| SERDES                  |
| SERDE_PARAMS            |
| SORT_COLS               |
| TABLE_PARAMS            |
| TBLS                    |
| TBL_COL_PRIVS           |
| TBL_PRIVS               |
| TYPES                   |
| TYPE_FIELDS             |
+-------------------------+

These tables are the internal structures Hive uses to track databases, tables, partitions, storage descriptors and privileges.
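
They can also be queried directly for quick diagnostics. For example, joining DBS and TBLS (column names as in this classic schema) lists every Hive table together with its database:

mysql> SELECT d.NAME, t.TBL_NAME FROM DBS d JOIN TBLS t ON t.DB_ID = d.DB_ID;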

Upgrading the Hive Metastore Schema

When upgrading Hive, you may need to upgrade the metastore schema as well. Hive ships SQL scripts for this purpose, typically located under:

/var/lib/hive/metastore/scripts/upgrade/mysql/

To apply a schema change, connect to MySQL and source the appropriate script: the hive-schema-<version>.mysql.sql files initialize a fresh schema, while the upgrade-<from>-to-<to> scripts in the same directory migrate an existing one. The historical example below initializes a 0.7.0 schema:

mysql> USE hive_live;
mysql> SOURCE /var/lib/hive/metastore/scripts/upgrade/mysql/hive-schema-0.7.0.mysql.sql;

Always perform a full database backup before applying upgrades.
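
A minimal backup sketch using mysqldump; with the InnoDB-only setup above, --single-transaction takes a consistent snapshot without locking the metastore tables:

mysqldump -u root -p --single-transaction hive_live > hive_live_backup.sql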

How This Pattern Maps to Modern Setups

While modern deployments often use managed metastores, PostgreSQL or cloud-native services, the core pattern remains the same:

  • External RDBMS for shared metadata
  • Strong guarantees for concurrency and durability
  • Schema upgrade scripts managed alongside Hive versions

Understanding this MySQL-based metastore configuration helps when maintaining legacy clusters and when migrating toward newer Hive and lakehouse architectures.

Reference: MySQL Connector/J
