Secure your Hadoop cluster, Part I

Use mapred with Active Directory (basic auth)

Most of the cases I handled in the past weeks concerned Hadoop security, that is, firewalling, SELinux, authentication, and user management. It is usually a difficult process, tied to the company's security profile and processes.

So I will start with the most interesting part: authentication. In all cases I worked on, the main authentication system was a Windows Active Directory forest (AD). Since Hadoop ships with several TaskController classes, we can use the LinuxTaskController. I use a RHEL 5 server, but the steps can be adapted to similar installations.

To enable the UNIX services in Windows Server 2003 and later, you have to extend the existing schema with the UNIX templates delivered by Microsoft. After that, install "Identity Management for UNIX"; in Server 2008 it is located under Server Manager => Roles => AD DS => Common Tasks => Add Role => Select Role Services. Install the software, restart your server, and the extension is done. Now create a default bind account, configure the AD server, and create a test user with the group hdfs.
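Once the schema is extended, the test user needs the RFC 2307 attributes that the UNIX templates add. Normally you would set them through the "UNIX Attributes" tab in Active Directory Users and Computers, but as an LDIF sketch it looks roughly like this (the user name, IDs, and paths below are placeholders, not values from a real setup):

```
dn: CN=testuser,CN=Users,DC=hadoop,DC=company,DC=local
changetype: modify
add: uidNumber
uidNumber: 10001
-
add: gidNumber
gidNumber: 10000
-
add: unixHomeDirectory
unixHomeDirectory: /home/testuser
-
add: loginShell
loginShell: /bin/bash
```

The gidNumber has to match the UNIX attributes of the hdfs group you created.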

For testing we use these settings:
Domain: hadoop.company.local
Bind account: main
Group: hdfs
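Before touching any Hadoop service it is worth verifying the bind account manually with ldapsearch; the hostname and the test user below are placeholders for your own values:

```shell
# Bind as the service account and look up the test user.
# -x forces simple (basic) authentication, -W prompts for the password.
ldapsearch -x -H ldap://hadoop.company.local \
  -D "main@hadoop.company.local" -W \
  -b "dc=hadoop,dc=company,dc=local" \
  "(sAMAccountName=testuser)" cn uidNumber gidNumber
```

If the search returns the user entry with its UNIX attributes, the bind account and the schema extension are working.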

Now we take a closer look at the Red Hat box. Since Kerberos 5 is fully supported, the task is really simple. Three problems could occur: time skew, DNS, and a wrong schema on the AD server(s).
Set up the LDAP authentication with:
# authconfig-tui

Authentication => Use LDAP (use MD5 Password + Use shadow Password + Use Kerberos) => Next.
Server (FQDN or IP) + Base DN (dc=hadoop,dc=company,dc=local) => Next.
Kerberos Settings => REALM in uppercase + KDC and admin server (FQDN or IP of AD server) + Use DNS to resolve hosts to realms + Use DNS to locate KDCs for realms => OK
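authconfig writes these Kerberos settings to /etc/krb5.conf. A fragment matching the test settings above would look roughly like this (hostnames are the placeholders used throughout this post):

```
[libdefaults]
 default_realm = HADOOP.COMPANY.LOCAL
 dns_lookup_realm = true
 dns_lookup_kdc = true

[realms]
 HADOOP.COMPANY.LOCAL = {
  kdc = hadoop.company.local
  admin_server = hadoop.company.local
 }

[domain_realm]
 .hadoop.company.local = HADOOP.COMPANY.LOCAL
 hadoop.company.local = HADOOP.COMPANY.LOCAL
```

Note that the realm name is the AD domain in uppercase, while the KDC entries point at your domain controller(s).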
AD does not allow anonymous binds, so you have to use the bind account in /etc/ldap.conf (see above).

Add the ldap service after files in /etc/nsswitch.conf:
passwd:     files ldap
shadow:     files ldap
group:      files ldap

Now edit the /etc/ldap.conf:

base dc=hadoop,dc=company,dc=local
uri ldap://hadoop.company.local/
binddn main@hadoop.company.local
bindpw <PASSWORD>
scope sub
ssl no
nss_base_passwd dc=hadoop,dc=company,dc=local?sub
nss_base_shadow dc=hadoop,dc=company,dc=local?sub
nss_base_group dc=hadoop,dc=company,dc=local?sub?&(objectCategory=group)(gidnumber=*)
nss_map_objectclass posixAccount user
nss_map_objectclass shadowAccount user
nss_map_objectclass posixGroup group
nss_map_attribute gecos cn
nss_map_attribute homeDirectory unixHomeDirectory
nss_map_attribute uniqueMember member
tls_cacertdir /etc/openldap/cacerts
pam_password md5
pam_login_attribute sAMAccountName
pam_filter objectclass=User

After you have written the file, test your configuration:
# getent passwd
and
# kinit <YOUR USERNAME>@HADOOP.COMPANY.LOCAL
If you get no errors, everything works as expected. Note that kinit expects the Kerberos realm in uppercase.
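Kerberos realms are case-sensitive, and an AD realm is simply the domain name in uppercase. A tiny hypothetical helper (not part of any tool, just an illustration) makes the mapping from domain to principal explicit:

```shell
# to_principal: build a Kerberos principal from a user name and a
# DNS domain, uppercasing the realm part as Kerberos/AD requires.
to_principal() {
  local user="$1" realm="$2"
  printf '%s@%s\n' "$user" "$(printf '%s' "$realm" | tr '[:lower:]' '[:upper:]')"
}

to_principal testuser hadoop.company.local
# → testuser@HADOOP.COMPANY.LOCAL
```

This is the form kinit expects; passing the lowercase domain as the realm is one of the most common reasons for "Cannot find KDC for requested realm" errors.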

Add to /etc/pam.d/system-auth:
session      required      pam_mkhomedir.so skel=/etc/skel umask=0022
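For orientation, the session block of a typical RHEL 5 /etc/pam.d/system-auth then looks roughly like this (the surrounding lines vary with your installation; only the pam_mkhomedir.so line is the addition):

```
session     optional      pam_keyinit.so revoke
session     required      pam_limits.so
session     required      pam_mkhomedir.so skel=/etc/skel umask=0022
session     required      pam_unix.so
```

With this in place, a home directory is created automatically the first time an AD user logs in to the box.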

Now it is time to use mapred with your AD. For that we use the shipped class org.apache.hadoop.mapred.LinuxTaskController; the configuration is done in mapred-site.xml:

<property>
  <name>mapred.task.tracker.task-controller</name>
  <value>org.apache.hadoop.mapred.LinuxTaskController</value>
</property>
<property>
  <name>mapreduce.tasktracker.group</name>
  <value>hdfs</value>
</property>
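The LinuxTaskController additionally relies on the setuid task-controller binary and its own config file, usually conf/taskcontroller.cfg next to mapred-site.xml. The exact keys vary by Hadoop version, but a sketch matching the setup above would be:

```
mapred.local.dir=/var/lib/hadoop/mapred
hadoop.log.dir=/var/log/hadoop
mapreduce.tasktracker.group=hdfs
```

The paths are placeholders and have to match your actual mapred.local.dir and log directory. The task-controller binary itself must be owned by root, group-owned by the configured group (here hdfs), and setuid, e.g. mode 4750, or the TaskTracker will refuse to start.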

Jobs will now be submitted in the given user context via PAM. Keep in mind that the group has to be set to the group you set up in your AD.

Known issues mostly depend on your setup. Make sure the time in your network is synchronized (usually done with ntpd), your DNS infrastructure works, and the users and groups are known in AD.
