novatechflow | Alexander Alten

Showing posts from August, 2011

Configuring a MySQL Metastore for Hive: From Embedded Derby to Multi-User Clusters

Hive’s embedded Derby database is fine for local testing, but it breaks down as soon as multiple users and services need to share metadata. This guide shows how to move Hive from the default single-user Derby setup to a shared MySQL metastore: configuring MySQL, creating the Hive schema, wiring Hive to the external database, and distributing drivers and configuration across a Hadoop cluster. Apache Hive provides a SQL-like query language (HiveQL) on top of HDFS, making large-scale data analysis accessible to anyone with SQL experience. By default, however, Hive uses an embedded Derby database for its metastore, which is not suited for multi-user or multi-service environments. To run Hive in a real cluster, you need an external metastore database. The database stores metadata about databases and tables; partitions and storage descriptors; and SerDes, column definitions and privileges. This article walks through configuring a MySQL-based Hive metastore, suitable for...
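As a rough illustration of the "wiring Hive to the external database" step, the Python sketch below renders the JDBC connection settings as a hive-site.xml fragment. The four javax.jdo.option.* property names are Hive's standard metastore connection settings; the host, database name, user and password are placeholders chosen for the example, and the helper functions are hypothetical, not part of Hive.

```python
from xml.sax.saxutils import escape


def metastore_properties(host, database, user, password):
    """Return the JDBC settings Hive needs to reach a MySQL metastore.

    All four arguments are placeholders supplied by the caller.
    """
    return {
        "javax.jdo.option.ConnectionURL": f"jdbc:mysql://{host}/{database}",
        # Driver class of MySQL Connector/J 5.x; newer Connector/J
        # releases use com.mysql.cj.jdbc.Driver instead.
        "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": user,
        "javax.jdo.option.ConnectionPassword": password,
    }


def render_hive_site(properties):
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    lines = ["<configuration>"]
    for name, value in properties.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    props = metastore_properties(
        "metastore-db.example.com", "hive_metastore", "hive", "change-me"
    )
    print(render_hive_site(props))
```

Generating the fragment from one place is one way to keep the same settings consistent on every node that needs them; the MySQL JDBC driver jar still has to be available on Hive's classpath separately.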

Don’t Put Hadoop or HBase Temp Directories in /tmp: Protecting Your Metadata

Many early Hadoop installations lost HDFS metadata because temporary directories were left under /tmp, which is routinely cleaned by Linux on reboot. This article explains why hadoop.tmp.dir and hbase.tmp.dir must point to a stable path, how this impacts NameNode edit logs and checkpoints, and what a safe configuration looks like today for both Hadoop and HBase clusters. In early Hadoop 0.20 clusters, it was common to leave the default hadoop.tmp.dir configuration unchanged. On many Linux systems, however, /tmp is automatically cleaned during reboot. If your Hadoop or HBase temp directories live there, you risk losing critical metadata when the operating system wipes that directory. For the SecondaryNameNode and related components, this can mean the loss of recent edit logs, checkpointed fsimage files, and other temporary metadata needed for recovery. If these files are removed, the NameNode may not be able to replay its edits after a crash, leading to a failed restart or a ...
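To make the risk concrete, here is a minimal Python sketch that scans the text of a Hadoop-style configuration file (such as core-site.xml or hbase-site.xml) and reports whether hadoop.tmp.dir or hbase.tmp.dir is set to a path under /tmp. The function name and the sample paths are assumptions made for this example. The check only looks at plain paths written explicitly in the file, so a property that is absent and falls back to its default is not detected.

```python
import xml.etree.ElementTree as ET

# Properties named in the article that should not live under /tmp.
TMP_PROPERTIES = ("hadoop.tmp.dir", "hbase.tmp.dir")


def find_risky_tmp_dirs(config_xml, risky_prefix="/tmp"):
    """Return {property: value} for temp-dir properties set under risky_prefix.

    config_xml is the text of a <configuration> document.
    """
    root = ET.fromstring(config_xml)
    risky = {}
    for prop in root.findall("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        under_prefix = value == risky_prefix or value.startswith(risky_prefix + "/")
        if name in TMP_PROPERTIES and under_prefix:
            risky[name] = value
    return risky


if __name__ == "__main__":
    # Hypothetical configuration used only to demonstrate the check.
    sample = """
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/tmp/hadoop-hdfs</value>
      </property>
      <property>
        <name>hbase.tmp.dir</name>
        <value>/var/lib/hbase/tmp</value>
      </property>
    </configuration>
    """
    for name, value in find_risky_tmp_dirs(sample).items():
        print(f"WARNING: {name} points to {value}")
```

A check like this could run as part of cluster provisioning, so that a temp directory left under /tmp is caught before the first reboot rather than after it.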