Skip to main content

Get an Hadoop cluster running in 20 minutes

Listen:

Since Cloudera was starting to build a complete stack it is more easy to get a cluster up and running. They build rpm's for Redhat based systems, I use CentOS, the newest CDH build is CDH2, which comes with a lot nice addons. Thanks for the work, guys!

What hardware we should use? Simple, moderate servers are pretty nice for. 2 CPU, 8 GB RAM, 4x750GB HDD will be enough. The master should have 4 CPU, 8 or more GB RAM and 100 GB RAID 5, we need that hardware twice.Why? The master is the most important server, hosting the namenode, jobtracker and, if you setup, ganglia. The hardware you use depends on your usecases, what I describe there works for eventanalysis and weblogs. But it will be a good start to see what power hadoop can deliver. The systems are located in the same rack so we can rack awareness moving in background. Use lvm and create a large directory, I use /opt/hadoop:


# df -h /opt/hadoop/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/myvg-hadoopvol
                      1.4T  213G  1.1T  17% /opt/hadoop


Important: Grab from java.sun.com the jdk-rpm (1.6.x) and install it.

Hint: A good idea will be to deploy a SSH key to all nodes, so we can authenticate without a password. Done with cp rsa_id.pub from masternode to slaves /root/.ssh/authorized_keys

All servers are running, first get the repo working in your box. Simply add a file in /etc/yum.repos.d/:

# cat cloudera-cdh2.repo
[cloudera-cdh2]
name=Cloudera's Distribution for Hadoop, Version 2
mirrorlist=http://archive.cloudera.com/redhat/cdh/2/mirrors
gpgkey = http://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera
gpgcheck=1
enabled=1

deploy that file on all nodes.

Now we give a bit more limits:
# cat /etc/security/limits.conf
 hdfs            soft     nofile         5000
 hdfs            hard     nofile         5000
 mapred          hard     nofile         5000
 mapred          soft     nofile         5000
 hadoop          hard     nofile         5000
 hadoop          soft     nofile         5000

# cat /etc/sysctl.conf
 fs.file-max=200000

Setup the namenode and jobtracker on the master-box:
"yum install hadoop-0.20-pipes hadoop-0.20-native hadoop-0.20-jobtracker hadoop-0.20 hadoop-zookeeper hadoop-0.20-namenode -y && mkdir -p /opt/hadoop/name && mkdir -p /opt/hadoop/hdfs/mapred/local && chown -R hdfs:hadoop /opt/hadoop/hdfs/ && chown -R mapred:hadoop /opt/hadoop/hdfs/mapred"

To prevent the configs from a unwanted update we use alternatives here:
cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.used
alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.used 50

I use port 9000 for hdfs, but that depends on your environment.

Disable IPv6:
edit hadoop-env.sh:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

Edit the main config files:

core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://<your-server-name>:9000</value>
</property>

hdfs-site.xml:
<property>
    <name>dfs.data.dir</name>
    <value>/opt/hadoop/hdfs/data</value>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/opt/hadoop/hdfs/name</value>
</property>
<property>
  <name>dfs.secondary.http.address</name>
  <value>namenode2:50070</value>
</property>

mapred-site.xml:
<property>
<name>mapred.local.dir</name>
  <value>/opt/hadoop/hdfs/mapred/local</value>
</property>
<name>mapred.job.tracker</name>
  <value><your-server-name>:54311</value>
</property>

Describe your cluster. Here we have to edit 2 files, master and slave:
# cat /etc/hadoop-0.20/conf.used/masters
<your-namenode-server>

# cat /etc/hadoop-0.20/conf.used/slaves
datanode1
datanode2
datanode3
datanode4
datanode5
datanode6

Setup the secondary namenode:
# yum install hadoop-0.20-secondarynamenode -y

Format the namenode:
# sudo -u hdfs hadoop namenode -format
You have to wait, simple watch the logs (tail -f /var/log/hadoop-0.20/*.log)

and start:
# for x in /etc/init.d/hadoop-0.20-* ; do $x restart ; done

create the mapred-dirs in hdfs:
# sudo -u hdfs hadoop fs -mkdir /mapred/system
# sudo -u hdfs hadoop fs -chown -R mapred /mapred

Login at the secondary namenode and restart also:
# for x in /etc/init.d/hadoop-0.20-* ; do $x restart ; done

Watch the logs for errors.

Now let us install the datanodes and tasktrackers. Before you let run the script be sure all nodes are via DNS available, SSH keys deployed and hosts are known by the SSH subsystem.

# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do ssh $i 'yum install hadoop-0.20-datanode.noarch hadoop-0.20-native.x86_64 hadoop-0.20-tasktracker.noarch -y && \ 
mkdir -p /opt/hadoop/hdfs/name && mkdir -p /opt/hadoop/hdfs/mapred/local && chown -R hdfs:hadoop /opt/hadoop/hdfs/ && chown -R mapred:hadoop /opt/hadoop/hdfs/mapred'; done

copy the config:
# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do scp -r /etc/hadoop-0.20/conf.used $i:/etc/hadoop-0.20/ ; done

activate via alternatives:
# for i in $(cat /etc/hadoop-0.20/conf/hadoop_slaves); \ 
do ssh $i alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.as24 50; done

Now it is time to get the cluster running the first time:
# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do ssh $i 'for x in /etc/init.d/hadoop-0.20-* ; do $x restart; done'; done

Again, watch the logs for errors, and also you can check the whole nodes with:
# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do echo $i && ssh $i 'for x in /etc/init.d/hadoop-0.20-* ; do $x status; done'; done

If all goes well you can see your running cluster via web:
http://namenode:50070/dfshealth.jsp

Another good idea will be to edit your hosts according your cluster-nodes:

# cat /etc/hosts
...

# hadoop nodes
IP datanode1.FQDN datanode1
IP datanode2.FQDN datanode2
IP datanode3.FQDN datanode3
IP datanode4.FQDN datanode4
IP datanode5.FQDN datanode5
IP datanode6.FQDN datanode6
#hadoop master
IP namenode1.FQDN namenode1
#hadoop secondary
IP namenode2.FQDN namenode2

Enjoy!

created: 25.November 2010

Comments

Popular posts from this blog

Why Is Customer Obsession Disappearing?

 It's wild that even with all the cool tech we've got these days, like AI solving complex equations and doing business across time zones in a flash, so many companies are still struggling with the basics: taking care of their customers.The drama around Coinbase's customer support is a prime example of even tech giants messing up. And it's not just Coinbase — it's a big-picture issue for the whole industry. At some point, the idea of "customer obsession" got replaced with "customer automation," and now we're seeing the problems that came with it. "Cases" What Not to Do Coinbase, as main example, has long been synonymous with making cryptocurrency accessible. Whether you’re a first-time buyer or a seasoned trader, their platform was once the gold standard for user experience. But lately, their customer support practices have been making headlines for all the wrong reasons: Coinbase - Stuck in the Loop:  Users have reported being caugh...

MySQL Scaling in 2024

When your MySQL database reaches its performance limits, vertical scaling through hardware upgrades provides a temporary solution. Long-term growth, though, requires a more comprehensive approach. This involves optimizing the database strategically and integrating complementary technologies. Caching The implementation of a caching layer, such as Memcached or Redis , can result in a notable reduction in the load and an increase ni performance at MySQL. In-memory stores cache data that is accessed frequently, enabling near-instantaneous responses and freeing the database for other tasks. For applications with heavy read traffic on relatively static data (e.g. product catalogues, user profiles), caching represents a low-effort, high-impact solution. Consider a online shop product catalogue with thousands of items. With each visit to the website, the application queries the database in order to retrieve product details. By using caching, the retrieved details can be stored in Memcached (a...

Deal with corrupted messages in Apache Kafka

Under some strange circumstances, it can happen that a message in a Kafka topic is corrupted. This often happens when using 3rd party frameworks with Kafka. In addition, Kafka < 0.9 does not have a lock on Log.read() at the consumer read level, but does have a lock on Log.write(). This can lead to a rare race condition as described in KAKFA-2477 [1]. A likely log entry looks like this: ERROR Error processing message, stopping consumer: (kafka.tools.ConsoleConsumer$) kafka.message.InvalidMessageException: Message is corrupt (stored crc = xxxxxxxxxx, computed crc = yyyyyyyyyy Kafka-Tools Kafka stores the offset of each consumer in Zookeeper. To read the offsets, Kafka provides handy tools [2]. But you can also use zkCli.sh, at least to display the consumer and the stored offsets. First we need to find the consumer for a topic (> Kafka 0.9): bin/kafka-consumer-groups.sh --zookeeper management01:2181 --describe --group test Prior to Kafka 0.9, the only way to get this in...