
Lightweight Kerberos Setup for a Secure Hadoop Cluster (Updated Guide)

Kerberos is the authentication mechanism behind Hadoop's secure mode, and it can be deployed without pulling in a full enterprise identity system. This updated guide shows how to bootstrap a minimal MIT Kerberos setup for a small or lab Hadoop cluster, how to configure principals and keytabs, and how to enable authentication in HDFS and YARN. It is designed for simple security needs, rapid testing environments, and isolated data platforms.

Many engineers still ask for a simple and fast way to secure a Hadoop cluster without integrating Active Directory or enterprise-wide authentication. This guide modernizes the classic lightweight approach: deploy a small MIT Kerberos KDC, create only the required Hadoop service principals, and enable Kerberos-based authentication across HDFS and YARN.

If you need a deeper dive into multi-realm trust, security architecture or production-grade Kerberos setups, see the extended guide: Hadoop and Trusted MITv5 Kerberos.

1. Install and Configure an MIT Kerberos KDC

Install the Kerberos server packages and adjust the default configuration in /etc/krb5.conf. The example below uses the realm ALO.ALT and the KDC host hadoop1.alo.alt; replace both with your own values:

[libdefaults]
  default_realm = ALO.ALT
  dns_lookup_realm = false
  dns_lookup_kdc = false

[realms]
  ALO.ALT = {
    kdc = HADOOP1.ALO.ALT:88
    admin_server = HADOOP1.ALO.ALT:749
    default_domain = ALO.ALT
  }

[domain_realm]
  .alo.alt = ALO.ALT
  alo.alt  = ALO.ALT

[logging]
  kdc = FILE:/var/log/krb5kdc.log
  admin_server = FILE:/var/log/kadmin.log
  default = FILE:/var/log/krb5lib.log

Ensure that DNS or /etc/hosts resolves every node to the same fully qualified name on all hosts, because the service principals embed that name:

192.168.56.101  hadoop1.alo.alt  hadoop1
172.22.2.130    hadoop2.alo.alt  hadoop2
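
One way to check this on each node is to resolve the name forward and then back again. The sketch below assumes a Linux system where getent is available; check_resolution is a hypothetical helper name, not part of Kerberos or Hadoop.

```shell
#!/bin/sh
# Hypothetical helper: verify that HOST resolves to an address and that
# the address resolves back to the same fully qualified name.
check_resolution() {
  host=$1
  # The first column of "getent hosts NAME" is the address.
  ip=$(getent hosts "$host" | awk '{print $1; exit}')
  if [ -z "$ip" ]; then
    echo "no address found for $host" >&2
    return 1
  fi
  # The second column of "getent hosts ADDRESS" is the canonical name.
  back=$(getent hosts "$ip" | awk '{print $2; exit}')
  if [ "$back" != "$host" ]; then
    echo "$ip resolves back to '$back', expected '$host'" >&2
    return 1
  fi
  echo "$host <-> $ip OK"
}

# Example: check_resolution hadoop1.alo.alt
```

Running it for every cluster hostname on every node catches the common case where one machine maps a name to a short alias or to a different address.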

Allow administrative principals:

# cat /var/kerberos/krb5kdc/kadm5.acl
*/admin@ALO.ALT *

Initialize your realm:

kdb5_util create -s
systemctl start krb5kdc kadmin

2. Add Hadoop Service Principals

Use kadmin.local on the KDC host to create the service principals for HDFS, MapReduce, YARN, HBase and HTTP (SPNEGO), followed by a user principal:

addprinc -randkey hdfs/hadoop1.alo.alt@ALO.ALT
addprinc -randkey mapred/hadoop1.alo.alt@ALO.ALT
addprinc -randkey yarn/hadoop1.alo.alt@ALO.ALT
addprinc -randkey hbase/hadoop1.alo.alt@ALO.ALT
addprinc -randkey HTTP/hadoop1.alo.alt@ALO.ALT

addprinc <USERNAME>@ALO.ALT

Your user principal will require a password. After creation, test authentication using:

su - <USERNAME> -c kinit
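
For a cluster with more than one node, the addprinc commands can be generated instead of typed by hand. This is a minimal sketch: gen_addprinc is a hypothetical helper, and the service list simply mirrors the principals created above.

```shell
#!/bin/sh
# Hypothetical helper: print one "addprinc -randkey" command per service
# and host. Usage: gen_addprinc REALM HOST [HOST...]
gen_addprinc() {
  realm=$1
  shift
  for host in "$@"; do
    for svc in hdfs mapred yarn hbase HTTP; do
      printf 'addprinc -randkey %s/%s@%s\n' "$svc" "$host" "$realm"
    done
  done
}

# Example, run as root on the KDC host:
#   gen_addprinc ALO.ALT hadoop1.alo.alt hadoop2.alo.alt |
#     while IFS= read -r cmd; do kadmin.local -q "$cmd" </dev/null; done
```

Printing the commands first makes it easy to review them before anything is written to the KDC database.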

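3. Export Keytabs for Hadoop Services

Still in kadmin.local, export one keytab per service. The -norandkey option, which only kadmin.local supports, keeps the existing keys. Without it, each xst call would generate new keys for HTTP/hadoop1.alo.alt and invalidate the copy stored in the keytab exported before it.

xst -norandkey -k hdfs.keytab   hdfs/hadoop1.alo.alt@ALO.ALT HTTP/hadoop1.alo.alt@ALO.ALT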
xst -norandkey -k mapred.keytab mapred/hadoop1.alo.alt@ALO.ALT HTTP/hadoop1.alo.alt@ALO.ALT
xst -norandkey -k yarn.keytab   yarn/hadoop1.alo.alt@ALO.ALT  HTTP/hadoop1.alo.alt@ALO.ALT

Restrict ownership and permissions on each keytab, then copy the keytabs to the Hadoop configuration directory on the node:

chown hdfs:hadoop hdfs.keytab && chmod 400 hdfs.keytab
chown mapred:hadoop mapred.keytab && chmod 400 mapred.keytab
chown yarn:hadoop yarn.keytab && chmod 400 yarn.keytab
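
Before starting any daemon it is worth confirming that each keytab is readable by its owner only and contains the expected principals. The sketch below assumes GNU stat, as found on Linux; keytab_mode_ok is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the keytab has mode 400,
# that is, readable by its owner and by nobody else.
keytab_mode_ok() {
  mode=$(stat -c '%a' "$1") || return 1
  if [ "$mode" != "400" ]; then
    echo "$1 has mode $mode, expected 400" >&2
    return 1
  fi
}

# Example, run in the directory holding the keytabs:
#   for kt in hdfs.keytab mapred.keytab yarn.keytab; do
#     keytab_mode_ok "$kt" && klist -k -t "$kt"   # list principals and timestamps
#   done
```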

4. Enable Kerberos in HDFS

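Set the following properties in hdfs-site.xml. They are shown as name = value pairs, and <PATH> stands for the directory that holds the keytab: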

dfs.block.access.token.enable = true
dfs.namenode.keytab.file = <PATH>/hdfs.keytab
dfs.namenode.kerberos.principal = hdfs/_HOST@ALO.ALT
dfs.namenode.kerberos.internal.spnego.principal = HTTP/_HOST@ALO.ALT
dfs.secondary.namenode.keytab.file = <PATH>/hdfs.keytab
dfs.secondary.namenode.kerberos.principal = hdfs/_HOST@ALO.ALT
dfs.datanode.keytab.file = <PATH>/hdfs.keytab
dfs.datanode.kerberos.principal = hdfs/_HOST@ALO.ALT
dfs.web.authentication.kerberos.principal = HTTP/_HOST@ALO.ALT
dfs.web.authentication.kerberos.keytab = <PATH>/hdfs.keytab
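
In the actual file each setting is a <property> element with <name> and <value> children. Secure mode also has to be switched on in core-site.xml through hadoop.security.authentication, which the listing above does not show. The sketch below converts the name = value notation into the XML form; props_to_xml is a hypothetical helper, and the core-site.xml lines in the usage comment are an addition based on Hadoop's secure-mode settings rather than part of the original listing.

```shell
#!/bin/sh
# Hypothetical helper: read "name = value" lines on stdin and print them
# as a Hadoop configuration document. Lines without "=" are dropped.
props_to_xml() {
  echo '<configuration>'
  # \1 captures the property name, \2 the value with surrounding blanks trimmed.
  sed -n 's|^[[:space:]]*\([^=[:space:]]*\)[[:space:]]*=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*$|  <property><name>\1</name><value>\2</value></property>|p'
  echo '</configuration>'
}

# Example: the two core-site.xml switches that turn on secure mode.
#   printf '%s\n' \
#     'hadoop.security.authentication = kerberos' \
#     'hadoop.security.authorization = true' | props_to_xml
```

The output is meant to be reviewed and merged into the existing configuration file by hand, not to overwrite it.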

Start the NameNode, obtain a ticket with kinit, and validate access:

hadoop fs -ls /

Authenticate as the hdfs superuser from its keytab and make /tmp world-writable with the sticky bit set:

sudo -u hdfs kinit -k -t <PATH>/hdfs.keytab hdfs/hadoop1.alo.alt@ALO.ALT
sudo -u hdfs hadoop fs -chmod 1777 /tmp

5. Enable Kerberos in MapReduce and YARN

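Example properties for mapred-site.xml (these are the MRv1 JobTracker and TaskTracker names; a YARN cluster configures the ResourceManager and NodeManager in yarn-site.xml instead):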

mapreduce.jobtracker.kerberos.principal = mapred/_HOST@ALO.ALT
mapreduce.jobtracker.keytab.file = <PATH>/mapred.keytab
mapreduce.tasktracker.kerberos.principal = mapred/_HOST@ALO.ALT
mapreduce.tasktracker.keytab.file = <PATH>/mapred.keytab
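
For the YARN daemons themselves, the corresponding settings go in yarn-site.xml, and the MapReduce JobHistory Server takes its own pair in mapred-site.xml. The fragment below is a sketch in the same name = value notation, assuming Hadoop 2 or later and the keytabs exported in step 3. None of these lines appear in the original listing, and the group name hadoop is an assumption taken from the chown commands above.

```text
# yarn-site.xml
yarn.resourcemanager.principal = yarn/_HOST@ALO.ALT
yarn.resourcemanager.keytab = <PATH>/yarn.keytab
yarn.nodemanager.principal = yarn/_HOST@ALO.ALT
yarn.nodemanager.keytab = <PATH>/yarn.keytab
yarn.nodemanager.container-executor.class = org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor
yarn.nodemanager.linux-container-executor.group = hadoop

# mapred-site.xml (MapReduce JobHistory Server)
mapreduce.jobhistory.principal = mapred/_HOST@ALO.ALT
mapreduce.jobhistory.keytab = <PATH>/mapred.keytab
```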

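Ensure the TaskTracker runs tasks under the correct group and directories. The file below is the MRv1 task-controller configuration; a YARN NodeManager reads the equivalent banned.users and min.user.id settings from container-executor.cfg: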

# /etc/hadoop/conf/taskcontroller.cfg
hadoop.log.dir=/var/log/hadoop-mapreduce/
mapred.local.dir=/opt/hadoop/hdfs/mapred/local
mapreduce.tasktracker.group=mapred
banned.users=mapred,hdfs,bin
min.user.id=500

6. Validate the Secure Cluster

Restart the YARN and MapReduce daemons, obtain a ticket with kinit, and check its validity with klist. Then run a simple job, such as the pi example, to confirm that secure submission works end to end.
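
These checks can be scripted. The sketch below rests on several assumptions: secure_smoke_test is a hypothetical helper, the location of the MapReduce examples jar depends on the distribution and is therefore passed in as an argument, and the pi arguments 2 and 10 (number of maps and samples per map) are arbitrary small values.

```shell
#!/bin/sh
# Hypothetical helper: confirm a valid ticket exists, list the HDFS root,
# and submit the pi example job. Usage: secure_smoke_test EXAMPLES_JAR
secure_smoke_test() {
  jar=$1
  # "klist -s" prints nothing and exits non-zero when no valid ticket is cached.
  if ! klist -s; then
    kinit || return 1
  fi
  klist                          # show the principal and ticket expiry
  hadoop fs -ls / || return 1    # HDFS access with the Kerberos ticket
  hadoop jar "$jar" pi 2 10      # end-to-end job submission
}

# Example (the jar path is an assumption, adjust it to your installation):
#   secure_smoke_test "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar
```

If the listing works but the job fails, the problem usually lies in the YARN or MapReduce configuration and not in the KDC.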

For more advanced Kerberos trust and production hardening techniques, refer to: Hadoop and Trusted MITv5 Kerberos.

