In modern Hadoop and Spark deployments, Spark connects to Hive through the SparkSession catalog. Hive metastore integration is stable, supported and no longer requires manual configuration steps such as copying hive-site.xml or modifying executor classpaths.
Using Hive from Spark Today
Create a SparkSession with Hive support enabled:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkHive")
  .enableHiveSupport()  // wires the session catalog to the Hive metastore
  .getOrCreate()
Once enabled, Spark can query Hive tables directly:
spark.sql("SELECT COUNT(*) FROM sample_07").show()
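The same catalog is available through the DataFrame API. A minimal sketch, assuming the salary column of the standard sample_07 demo table; the output table name sample_07_high_earners is hypothetical:

import org.apache.spark.sql.functions.col

// Hive tables appear as ordinary tables in the session catalog
val sample = spark.table("sample_07")

// "salary" is assumed from the standard sample_07 demo schema
val highEarners = sample.filter(col("salary") > 100000)
highEarners.show()

// saveAsTable registers the result as a new table in the Hive metastore;
// "sample_07_high_earners" is a hypothetical name used only for illustration
highEarners.write.mode("overwrite").saveAsTable("sample_07_high_earners")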
Spark handles:
- Hive metastore connectivity (a configuration sketch follows this list)
- Classpaths for Hive libraries
- Execution through Spark SQL (not Hive execution engines)
- Security integration (Kerberos, Ranger) through the environment
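When the environment does not already provide hive-site.xml, the metastore location can also be set on the session itself. A minimal sketch, assuming a Thrift metastore; the host and port below are placeholders:

import org.apache.spark.sql.SparkSession

// hive.metastore.uris is the standard Hive client setting for the
// metastore endpoint; "metastore-host:9083" is a placeholder
val spark = SparkSession.builder()
  .appName("SparkHiveExplicitMetastore")
  .config("hive.metastore.uris", "thrift://metastore-host:9083")
  .enableHiveSupport()
  .getOrCreate()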
Important Clarification
Spark does not execute HiveQL through Hive’s execution engines (MR, Tez, LLAP). Spark parses HiveQL using its own Catalyst engine and executes through Spark’s physical planning layer.
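One way to verify this is to ask Spark for the physical plan; it shows Spark operators rather than MapReduce or Tez stages:

// The printed plan comes from Catalyst and Spark's physical planner:
// expect operators such as HashAggregate and FileScan, not MR/Tez stages
spark.sql("SELECT COUNT(*) FROM sample_07").explain()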
Historical Context: Why CDH 5.x Needed Manual Steps
Around the CDH 5.3 era, Spark's Hive integration (often loosely called "Hive on Spark", though that name properly refers to Hive's Spark execution engine) was experimental and not officially supported. Engineers had to manually:
- Copy hive-site.xml into Spark's configuration directory
- Insert the correct Hive libraries into the driver and executor classpaths
- Ensure Spark gateway nodes were also Hive gateway nodes
- Reapply configuration after client redeployments or CM upgrades
A typical workaround involved launching the Spark shell with extended classpaths, for example:
spark-shell \
--master yarn-client \
--driver-class-path "/opt/cloudera/parcels/CDH-VERSION/lib/hive/lib/*" \
--conf spark.executor.extraClassPath="/opt/.../hive/lib/*"
Then creating a HiveContext:
val hive = new org.apache.spark.sql.hive.HiveContext(sc)
hive.sql("SELECT COUNT(*) FROM sample_07").collect()
This approach is no longer required or recommended and applies only to legacy CDH 5.x clusters.
Modern Best Practices
- Use SparkSession.enableHiveSupport() instead of HiveContext (a migration sketch follows this list)
- Avoid manual classpath manipulation
- Ensure Hive metastore (HCatalog) connectivity is properly configured in the environment
- Upgrade any surviving CDH 5.x workloads to supported distributions
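As a concrete sketch of the first point, legacy HiveContext code maps almost mechanically onto SparkSession (the table name mirrors the legacy example earlier in this post; the app name is hypothetical):

import org.apache.spark.sql.SparkSession

// Before (legacy, CDH 5.x era):
//   val hive = new org.apache.spark.sql.hive.HiveContext(sc)
//   hive.sql("SELECT COUNT(*) FROM sample_07").collect()

// After (modern Spark):
val spark = SparkSession.builder()
  .appName("MigratedJob")   // hypothetical application name
  .enableHiveSupport()      // replaces HiveContext
  .getOrCreate()

spark.sql("SELECT COUNT(*) FROM sample_07").collect()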
This rewrite preserves the historical steps for reference while providing accurate, supported guidance for today's Spark–Hive integration.