
Secure your hadoop cluster, Part I


Use mapred with Active Directory (basic auth)

Most of the cases I handled in the past weeks concerned Hadoop security: firewalling, SELinux, authentication and user management. It is usually a difficult process that depends on the company's security profile and processes.

So I start with the most interesting part: authentication. In all cases I worked on, the main authentication system was a Windows Active Directory (AD) forest. Since Hadoop ships with several task controller classes, we can use LinuxTaskController. I use a RHEL5 server, but the steps can be adapted to other installations.

To enable the UNIX services in Windows Server versions newer than 2003 you have to extend the existing schema with the UNIX templates delivered by Microsoft. After that you have to install "Identity Management for UNIX"; in Windows Server 2008 it is located under Server Manager => Roles => AD DS => common tasks => Add Role => Select Role Services. Install the software, restart your server and it should be done. Now create a default bind account, configure the AD server and create a test user with the group hdfs.

For testing we use these settings:
Binding-Acc: main
Group: hdfs

Now we take a closer look at the Red Hat box. Since Kerberos 5 is fully supported, the task is really simple. Three problems can occur: time differences, DNS and a wrong schema on the AD server(s).
Set up the LDAP authentication with:
# authconfig-tui

Authentication => Use LDAP (use MD5 Password + Use shadow Password + Use Kerberos) => Next.

Server (FQDN or IP) + Base DN (dc=hadoop,dc=company,dc=local) => Next.

Kerberos Settings => REALM in uppercase + KDC and admin server (FQDN or IP of AD Server) + Use DNS to resolve hosts to realms + Use DNS to locate KDCs for realms => OK
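
If you prefer a scripted setup over the dialog, the same choices can be passed to authconfig on the command line. The following is only a sketch: the host name, base DN and realm are placeholders, and the flag names are the ones I know from authconfig on RHEL5, so please compare them with `authconfig --help` on your box. The script only prints the command, it does not change anything.

```shell
#!/bin/sh
# Placeholder values (assumptions), replace them with your own.
AD_SERVER="ad01.hadoop.company.local"   # hypothetical AD host name
BASE_DN="dc=hadoop,dc=company,dc=local"
REALM="HADOOP.COMPANY.LOCAL"            # realm in uppercase

# Build the authconfig call that mirrors the dialog choices:
# LDAP for user information, Kerberos for authentication,
# MD5 + shadow passwords, DNS lookups for realms and KDCs.
# Arguments: $1 = AD server, $2 = base DN, $3 = realm
build_authconfig_cmd() {
    echo authconfig \
        --enableshadow --enablemd5 \
        --enableldap \
        --ldapserver="$1" --ldapbasedn="$2" \
        --enablekrb5 --krb5realm="$3" \
        --krb5kdc="$1" --krb5adminserver="$1" \
        --enablekrb5kdcdns --enablekrb5realmdns \
        --update
}

# Print the command for review; run it as root once the values are right.
build_authconfig_cmd "$AD_SERVER" "$BASE_DN" "$REALM"
```

Printing first and running later keeps the step repeatable across many cluster nodes without touching a node by accident.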

AD does not allow anonymous connections, so you have to use the bind account created above in /etc/ldap.conf.
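
Before touching any config file it can help to check that the bind account is really allowed to search the directory and that the UNIX attributes exist. A minimal sketch, assuming the OpenLDAP client tools are installed; the host name, the bind account notation (user@domain) and the user testuser are my placeholders, not values from a real setup. The script only prints the ldapsearch call.

```shell
#!/bin/sh
# All values are placeholders (assumptions), adjust them to your environment.
AD_SERVER="ad01.hadoop.company.local"   # hypothetical AD host name
BIND_DN="main@hadoop.company.local"     # assumption: bind account written as user@domain
BASE_DN="dc=hadoop,dc=company,dc=local"

# Print an ldapsearch call that binds as the bind account and asks for the
# UNIX attributes of one user.
# -x = simple bind, -W = prompt for the bind password, -b = search base
# Argument: $1 = sAMAccountName of the user to look up
ldap_lookup_cmd() {
    printf 'ldapsearch -x -H ldap://%s -D "%s" -W -b "%s" "(sAMAccountName=%s)" uidNumber gidNumber unixHomeDirectory\n' \
        "$AD_SERVER" "$BIND_DN" "$BASE_DN" "$1"
}

ldap_lookup_cmd testuser
```

If the search returns the user but without uidNumber, gidNumber or unixHomeDirectory, the schema extension or the UNIX attributes of that user are the first place to look.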

In /etc/nsswitch.conf, add the ldap service after files:
passwd:     files ldap
shadow:     files ldap
group:      files ldap

Now edit the /etc/ldap.conf:

base dc=hadoop,dc=company,dc=local
uri ldap://
bindpw <PASSWORD>
scope sub
ssl no
nss_base_passwd dc=hadoop,dc=company,dc=local?sub
nss_base_shadow dc=hadoop,dc=company,dc=local?sub
nss_base_group dc=hadoop,dc=company,dc=local?sub
nss_map_objectclass posixAccount user
nss_map_objectclass shadowAccount user
nss_map_objectclass posixGroup group
nss_map_attribute gecos cn
nss_map_attribute homeDirectory unixHomeDirectory
nss_map_attribute uniqueMember member
nss_map_objectclass posixGroup Group
tls_cacertdir /etc/openldap/cacerts
pam_password md5
pam_login_attribute sAMAccountName
pam_filter objectclass=User

After you have written the file, test your configuration:
# getent passwd
If the AD users are listed and you get no errors, everything works as expected.
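
On a large directory the full getent listing is long, so checking one user is often quicker. The following sketch resolves a single user through NSS and checks the group membership; the function names and the user testuser are made up for illustration.

```shell
#!/bin/sh
# Return 0 if the space separated group list in $1 (as printed by `id -Gn`)
# contains exactly the group named in $2.
# Note: group names that contain spaces are not handled by this simple split.
has_group() {
    for g in $1; do
        [ "$g" = "$2" ] && return 0
    done
    return 1
}

# check_user USER GROUP: resolve the user through NSS, then check its groups.
check_user() {
    getent passwd "$1" >/dev/null || { echo "user $1 is not resolvable"; return 1; }
    has_group "$(id -Gn "$1")" "$2" || { echo "user $1 is not in group $2"; return 1; }
    echo "user $1 resolves and is in group $2"
}

# Example call with a hypothetical AD test user:
# check_user testuser hdfs
```

A user that resolves but lacks the group usually points to the group mapping in /etc/ldap.conf or to missing UNIX attributes on the AD group.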

Add to /etc/pam.d/system-auth:
session      required pam_mkhomedir.so skel=/etc/skel umask=0022

Now it is time to use mapred with your AD. For that we use the shipped class org.apache.hadoop.mapred.LinuxTaskController; the configuration is done in mapred-site.xml.
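
As an illustration, such a configuration could look like the fragment below. This is a sketch and not the exact listing from my cluster: the property names are the ones I know from Hadoop 0.20.20x / 1.x, so please verify them against the documentation of your release. The group value hdfs is the test group from above.

```xml
<!-- Sketch only: goes inside <configuration> in mapred-site.xml. -->
<!-- Property names as in Hadoop 0.20.20x / 1.x; verify for your release. -->
<property>
  <name>mapred.task.tracker.task-controller</name>
  <value>org.apache.hadoop.mapred.LinuxTaskController</value>
</property>
<property>
  <!-- Assumption: the AD group created for the test setup -->
  <name>mapreduce.tasktracker.group</name>
  <value>hdfs</value>
</property>
```

As far as I know, the setuid task-controller binary also reads its own taskcontroller.cfg, and the group configured there has to match the one in mapred-site.xml.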


Now jobs will be submitted in the given user context via PAM. Keep in mind that the group has to be set to the group you set up in your AD.

Known issues mostly depend on your setup. Make sure that the time is synchronized in your network (usually done with ntpd), that your DNS infrastructure works and that the users and groups are known in AD.
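
The time and DNS checks can be scripted as well. A minimal sketch, assuming dig (bind-utils) and ntpq are installed; the domain is a placeholder and the function names are made up. Kerberos refuses tickets when the clocks differ by more than the allowed skew (five minutes by default), and the "Use DNS to locate KDCs" option only works if the SRV records are resolvable.

```shell
#!/bin/sh
# Placeholder (assumption): replace with your AD DNS domain.
DOMAIN="hadoop.company.local"

# SRV record names that Kerberos and LDAP clients look up in an AD domain.
# Argument: $1 = DNS domain
srv_names() {
    echo "_kerberos._tcp.$1"
    echo "_ldap._tcp.$1"
}

# Query each SRV record; an empty answer means DNS based discovery will fail.
check_dns() {
    for name in $(srv_names "$1"); do
        answer=$(dig +short -t SRV "$name")
        if [ -n "$answer" ]; then
            echo "OK   $name"
        else
            echo "FAIL $name"
        fi
    done
}

# List the NTP peers of this host to see whether it is synchronized.
check_time() {
    ntpq -p
}

# Example calls:
# check_dns "$DOMAIN"
# check_time
```

Running both checks on every node before the first job is submitted rules out two of the three usual problems.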

