Apache Hive provides a SQL-like query language (HiveQL) on top of HDFS, making large-scale data analysis accessible to anyone with SQL experience. By default, however, Hive uses an embedded Derby database for its metastore, which is not suited for multi-user or multi-service environments.
To run Hive in a real cluster, you need an external metastore database. The database stores metadata about:
- Databases and tables
- Partitions and storage descriptors
- SerDes, column definitions and privileges
This article walks through configuring a MySQL-based Hive metastore, suitable for multi-user clusters and for replication or failover setups.
Why Move from Derby to MySQL?
The limitations of the embedded Derby metastore include:
- Single-user access: not safe for concurrent writes
- No realistic HA story
- Harder to manage and back up in production
Using MySQL (or another supported RDBMS) provides:
- Proper concurrency and isolation
- Flexible backup and replication options
- Centralized metadata for all Hive clients in the cluster
Example MySQL Configuration for a Hive Metastore
Install MySQL server (version 5.1+ in the original setup) and configure it with InnoDB and binary logging.
A simplified example of /etc/my.cnf:
[mysqld_safe]
socket = /var/lib/mysql/mysql.sock
[mysqld]
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/lib/mysql/mysql.sock
log-error = /var/log/mysqld.log
datadir = /opt/hadoop/mysql
default-storage-engine = InnoDB
skip-bdb = 1
skip_name_resolve
# Timeouts
connect_timeout = 30
wait_timeout = 30
interactive_timeout = 100
# Cache and memory settings (example values)
key_buffer = 128M
thread_concurrency = 4
thread_cache = 16
thread_stack = 256K
table_cache = 512
tmp_table_size = 64M
max_heap_table_size = 64M
# Binary logging for replication or PITR
server-id = 1001
log_bin = /var/log/mysql/mysqlserver-bin.log
expire_logs_days = 3
max_binlog_size = 256M
# InnoDB tuning
innodb_file_per_table = 1
innodb_data_file_path = ibdata1:10M:autoextend
innodb_buffer_pool_size = 512M
innodb_log_file_size = 16M
innodb_flush_log_at_trx_commit = 1
# Slow query logging and query cache (legacy setup)
long_query_time = 2
log_slow_queries = /var/log/mysql/mysql-slow.log
query_cache_size = 64M
query_cache_type = 1
query_cache_limit = 4M
This configuration uses:
- InnoDB as the default storage engine
- Binary logs to support replication or recovery
- Basic tuning for memory, caches and slow query diagnostics
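After restarting mysqld, it is worth confirming that the settings actually took effect; a typo in my.cnf or a forgotten include file can fail silently. A quick check from the mysql prompt (the wildcard on the engine variable is deliberate: MySQL 5.1 calls it storage_engine, later versions default_storage_engine):

```sql
-- Confirm the storage engine, binary logging and buffer pool settings.
SHOW VARIABLES LIKE '%storage_engine%';
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
```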
Creating the Hive Metastore Database and User
After restarting MySQL (for example, service mysqld restart), create a dedicated user and database for Hive:
mysql> CREATE DATABASE hive_live;
mysql> CREATE USER 'hive'@'%' IDENTIFIED BY 'strong_password';
mysql> GRANT ALL PRIVILEGES ON hive_live.* TO 'hive'@'%' WITH GRANT OPTION;
For a dedicated metastore host, this simple privilege model is acceptable, though in modern setups you would typically grant more restrictive privileges and use stricter network controls.
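As a sketch of what a tighter grant might look like: the privilege list below is an assumption about what the metastore needs for DDL and DML on its own tables, and 10.0.0.% is a placeholder for your cluster subnet.

```sql
-- Limit the hive user to a concrete privilege list and to connections
-- from the cluster subnet rather than from any host ('%').
CREATE USER 'hive'@'10.0.0.%' IDENTIFIED BY 'strong_password';
GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, ALTER, INDEX, LOCK TABLES
  ON hive_live.* TO 'hive'@'10.0.0.%';
```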
Configuring Hive to Use MySQL as Metastore
Next, configure Hive to point to the MySQL backend. These overrides belong in hive-site.xml; hive-default.xml holds the shipped defaults and should not be edited directly:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysqlserver:3306/hive_live?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>strong_password</value>
</property>
These properties tell the Hive metastore service how to connect to MySQL and which database to use for metadata.
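A quick way to sanity-check the configuration is to pull a single property back out of hive-site.xml. The get_prop helper below is our own sketch, not part of Hive, and the demo runs against an inline sample rather than a real /etc/hive/conf/hive-site.xml:

```shell
# Extract one <property> value from a hive-site.xml-style file with awk.
get_prop() {
  # $1 = property name, $2 = path to the XML file
  awk -v want="$1" '
    /<name>/  { gsub(/.*<name>|<\/name>.*/, "");  name = $0 }
    /<value>/ { gsub(/.*<value>|<\/value>.*/, ""); if (name == want) print }
  ' "$2"
}

# Demo against an inline sample mirroring the properties shown above:
cat > /tmp/hive-site-sample.xml <<'EOF'
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysqlserver:3306/hive_live?createDatabaseIfNotExist=true</value>
</property>
EOF

get_prop javax.jdo.option.ConnectionURL /tmp/hive-site-sample.xml
```

The last line prints the ConnectionURL value from the sample, which makes it easy to spot a wrong hostname or database name before starting any Hive service.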
Distributing the MySQL JDBC Driver and Hive Configuration
Hive requires the MySQL JDBC driver on every node that runs Hive services or clients. Download the MySQL Connector/J driver and distribute it across the cluster, for example:
for i in $(cat /etc/hadoop-0.20/conf/slaves); do
scp -r /usr/lib/hive/lib/mysql-connector-java-5.1.11-bin.jar \
"$i":/usr/lib/hive/lib/
done
Then copy the Hive configuration to the same hosts:
for i in $(cat /etc/hadoop-0.20/conf/slaves); do
scp -r /etc/hive/conf/hive-* \
"$i":/etc/hive/conf.dist/
done
The exact paths will vary between distributions, but the core idea is consistent: every Hive node needs the same hive-site.xml and the JDBC driver.
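Before starting Hive on a node, a small presence check can save debugging time. The paths below follow the examples above and are assumptions that will differ between distributions:

```shell
# Report whether the JDBC driver and the Hive configuration are present
# on this node; prints one OK/MISSING line per expected file.
for f in /usr/lib/hive/lib/mysql-connector-java-*.jar \
         /etc/hive/conf/hive-site.xml; do
  if [ -e "$f" ]; then
    echo "OK:      $f"
  else
    echo "MISSING: $f"
  fi
done
```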
Verifying the Metastore Schema
On the first Hive query, the metastore will create its schema in the MySQL database. You can verify this by checking the tables:
mysql> USE hive_live;
mysql> SHOW TABLES;
+-------------------------+
| Tables_in_hive_live |
+-------------------------+
| BUCKETING_COLS |
| COLUMNS |
| DATABASE_PARAMS |
| DBS |
| DB_PRIVS |
| GLOBAL_PRIVS |
| IDXS |
| INDEX_PARAMS |
| PARTITIONS |
| PARTITION_KEYS |
| PARTITION_KEY_VALS |
| PARTITION_PARAMS |
| PART_COL_PRIVS |
| PART_PRIVS |
| ROLES |
| ROLE_MAP |
| SDS |
| SD_PARAMS |
| SEQUENCE_TABLE |
| SERDES |
| SERDE_PARAMS |
| SORT_COLS |
| TABLE_PARAMS |
| TBLS |
| TBL_COL_PRIVS |
| TBL_PRIVS |
| TYPES |
| TYPE_FIELDS |
+-------------------------+
These tables are the internal structures Hive uses to track databases, tables, partitions, storage descriptors and privileges.
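To see how these tables fit together, a query along the following lines lists each Hive table with its database and HDFS location (column names match the schema of this Hive era and can shift between versions):

```sql
-- Each row in TBLS references its database (DBS) and its storage
-- descriptor (SDS), which holds the HDFS location of the table data.
SELECT d.NAME  AS db_name,
       t.TBL_NAME,
       s.LOCATION
FROM   TBLS t
JOIN   DBS  d ON t.DB_ID = d.DB_ID
JOIN   SDS  s ON t.SD_ID = s.SD_ID;
```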
Upgrading the Hive Metastore Schema
When upgrading Hive, you may need to upgrade the metastore schema as well. Hive ships SQL scripts for this purpose, typically located under:
/var/lib/hive/metastore/scripts/upgrade/mysql/
To upgrade the schema, connect to MySQL and source the appropriate upgrade script. Note the naming convention: hive-schema-X.Y.Z.mysql.sql scripts create a fresh schema, while upgrade-X.Y.Z-to-A.B.C.mysql.sql scripts migrate an existing one (example for a historical version):
mysql> USE hive_live;
mysql> SOURCE /var/lib/hive/metastore/scripts/upgrade/mysql/upgrade-0.6.0-to-0.7.0.mysql.sql;
Always take a full backup first (for example, mysqldump --single-transaction hive_live > backup.sql) so the metastore can be restored if an upgrade script fails partway through.
How This Pattern Maps to Modern Setups
While modern deployments often use managed metastores, Postgres or cloud-native services, the core pattern remains the same:
- External RDBMS for shared metadata
- Strong guarantees for concurrency and durability
- Schema upgrade scripts managed alongside Hive versions
Understanding this MySQL-based metastore configuration helps when maintaining legacy clusters and when migrating toward newer Hive and lakehouse architectures.
Reference: MySQL Connector/J