Skip to main content

Get an Hadoop cluster running in 20 minutes

Listen:

Since Cloudera was starting to build a complete stack it is more easy to get a cluster up and running. They build rpm's for Redhat based systems, I use CentOS, the newest CDH build is CDH2, which comes with a lot nice addons. Thanks for the work, guys!

What hardware we should use? Simple, moderate servers are pretty nice for. 2 CPU, 8 GB RAM, 4x750GB HDD will be enough. The master should have 4 CPU, 8 or more GB RAM and 100 GB RAID 5, we need that hardware twice.Why? The master is the most important server, hosting the namenode, jobtracker and, if you setup, ganglia. The hardware you use depends on your usecases, what I describe there works for eventanalysis and weblogs. But it will be a good start to see what power hadoop can deliver. The systems are located in the same rack so we can rack awareness moving in background. Use lvm and create a large directory, I use /opt/hadoop:


# df -h /opt/hadoop/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/myvg-hadoopvol
                      1.4T  213G  1.1T  17% /opt/hadoop


Important: Grab from java.sun.com the jdk-rpm (1.6.x) and install it.

Hint: A good idea will be to deploy a SSH key to all nodes, so we can authenticate without a password. Done with cp rsa_id.pub from masternode to slaves /root/.ssh/authorized_keys

All servers are running, first get the repo working in your box. Simply add a file in /etc/yum.repos.d/:

# cat cloudera-cdh2.repo
[cloudera-cdh2]
name=Cloudera's Distribution for Hadoop, Version 2
mirrorlist=http://archive.cloudera.com/redhat/cdh/2/mirrors
gpgkey = http://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera
gpgcheck=1
enabled=1

deploy that file on all nodes.

Now we give a bit more limits:
# cat /etc/security/limits.conf
 hdfs            soft     nofile         5000
 hdfs            hard     nofile         5000
 mapred          hard     nofile         5000
 mapred          soft     nofile         5000
 hadoop          hard     nofile         5000
 hadoop          soft     nofile         5000

# cat /etc/sysctl.conf
 fs.file-max=200000

Setup the namenode and jobtracker on the master-box:
"yum install hadoop-0.20-pipes hadoop-0.20-native hadoop-0.20-jobtracker hadoop-0.20 hadoop-zookeeper hadoop-0.20-namenode -y && mkdir -p /opt/hadoop/name && mkdir -p /opt/hadoop/hdfs/mapred/local && chown -R hdfs:hadoop /opt/hadoop/hdfs/ && chown -R mapred:hadoop /opt/hadoop/hdfs/mapred"

To prevent the configs from a unwanted update we use alternatives here:
cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.used
alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.used 50

I use port 9000 for hdfs, but that depends on your environment.

Disable IPv6:
edit hadoop-env.sh:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

Edit the main config files:

core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://<your-server-name>:9000</value>
</property>

hdfs-site.xml:
<property>
    <name>dfs.data.dir</name>
    <value>/opt/hadoop/hdfs/data</value>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/opt/hadoop/hdfs/name</value>
</property>
<property>
  <name>dfs.secondary.http.address</name>
  <value>namenode2:50070</value>
</property>

mapred-site.xml:
<property>
<name>mapred.local.dir</name>
  <value>/opt/hadoop/hdfs/mapred/local</value>
</property>
<name>mapred.job.tracker</name>
  <value><your-server-name>:54311</value>
</property>

Describe your cluster. Here we have to edit 2 files, master and slave:
# cat /etc/hadoop-0.20/conf.used/masters
<your-namenode-server>

# cat /etc/hadoop-0.20/conf.used/slaves
datanode1
datanode2
datanode3
datanode4
datanode5
datanode6

Setup the secondary namenode:
# yum install hadoop-0.20-secondarynamenode -y

Format the namenode:
# sudo -u hdfs hadoop namenode -format
You have to wait, simple watch the logs (tail -f /var/log/hadoop-0.20/*.log)

and start:
# for x in /etc/init.d/hadoop-0.20-* ; do $x restart ; done

create the mapred-dirs in hdfs:
# sudo -u hdfs hadoop fs -mkdir /mapred/system
# sudo -u hdfs hadoop fs -chown -R mapred /mapred

Login at the secondary namenode and restart also:
# for x in /etc/init.d/hadoop-0.20-* ; do $x restart ; done

Watch the logs for errors.

Now let us install the datanodes and tasktrackers. Before you let run the script be sure all nodes are via DNS available, SSH keys deployed and hosts are known by the SSH subsystem.

# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do ssh $i 'yum install hadoop-0.20-datanode.noarch hadoop-0.20-native.x86_64 hadoop-0.20-tasktracker.noarch -y && \ 
mkdir -p /opt/hadoop/hdfs/name && mkdir -p /opt/hadoop/hdfs/mapred/local && chown -R hdfs:hadoop /opt/hadoop/hdfs/ && chown -R mapred:hadoop /opt/hadoop/hdfs/mapred'; done

copy the config:
# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do scp -r /etc/hadoop-0.20/conf.used $i:/etc/hadoop-0.20/ ; done

activate via alternatives:
# for i in $(cat /etc/hadoop-0.20/conf/hadoop_slaves); \ 
do ssh $i alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.as24 50; done

Now it is time to get the cluster running the first time:
# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do ssh $i 'for x in /etc/init.d/hadoop-0.20-* ; do $x restart; done'; done

Again, watch the logs for errors, and also you can check the whole nodes with:
# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do echo $i && ssh $i 'for x in /etc/init.d/hadoop-0.20-* ; do $x status; done'; done

If all goes well you can see your running cluster via web:
http://namenode:50070/dfshealth.jsp

Another good idea will be to edit your hosts according your cluster-nodes:

# cat /etc/hosts
...

# hadoop nodes
IP datanode1.FQDN datanode1
IP datanode2.FQDN datanode2
IP datanode3.FQDN datanode3
IP datanode4.FQDN datanode4
IP datanode5.FQDN datanode5
IP datanode6.FQDN datanode6
#hadoop master
IP namenode1.FQDN namenode1
#hadoop secondary
IP namenode2.FQDN namenode2

Enjoy!

created: 25.November 2010

Comments

Popular posts from this blog

Deal with corrupted messages in Apache Kafka

Under some strange circumstances it can happen that a message in a Kafka topic is corrupted. This happens often by using 3rd party frameworks together with Kafka. Additionally, Kafka < 0.9 has no lock at Log.read() at the consumer read level, but has a lock on Log.write(). This can cause a rare race condition, as described in KAKFA-2477 [1]. Probably a log entry looks like: ERROR Error processing message, stopping consumer: (kafka.tools.ConsoleConsumer$) kafka.message.InvalidMessageException: Message is corrupt (stored crc = xxxxxxxxxx, computed crc = yyyyyyyyyy Kafka-Tools Kafka stores the offset of every consumer in Zookeeper. To read out the offsets, Kafka provides handy tools [2]. But also zkCli.sh can be used, at least to display the consumer and the stored offsets. First we need to find the consumer for a topic (> Kafka 0.9): bin/kafka-consumer-groups.sh --zookeeper management01:2181 --describe --group test Prior to Kafka 0.9 the only possibility to get this inform

Hive query shows ERROR "too many counters"

A hive job face the odd " Too many counters:"  like Ended Job = job_xxxxxx with exception 'org.apache.hadoop.mapreduce.counters.LimitExceededException(Too many counters: 201 max=200)' FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask Intercepting System.exit(1) These happens when operators are used in queries ( Hive Operators ). Hive creates 4 counters per operator, max upto 1000, plus a few additional counters like file read/write, partitions and tables. Hence the number of counter required is going to be dependent upon the query.  To avoid such exception, configure " mapreduce.job.counters.max " in mapreduce-site.xml to a value above 1000. Hive will fail when he is hitting the 1k counts, but other MR jobs not. A number around 1120 should be a good choice. Using " EXPLAIN EXTENDED " and " grep -ri operators | wc -l " print out the used numbers of operators. Use this value to tweak the MR s

Life hacks for your startup with OpenAI and Bard prompts

OpenAI and Bard   are the most used GenAI tools today; the first one has a massive Microsoft investment, and the other one is an experiment from Google. But did you know that you can also use them to optimize and hack your startup?  For startups, reating pitch scripts, sales emails, and elevator pitches with one (or both) of them helps you not only save time but also validate your marketing and wording. Curios? Here a few prompt hacks for startups to create / improve / validate buyer personas, your startups mission / vision statements, and USP definitions. First Step: Introduce yourself and your startup Introduce yourself, your startup, your website, your idea, your position, and in a few words what you are doing to the chatbot: Prompt : I'm NAME and our startup NAME, with website URL, is doing WHATEVER. With PRODUCT NAME, we aim to change or disrupt INDUSTRY. Bard is able to pull information from your website. I'm not sure if ChatGPT can do that, though. But nevertheless, now