Skip to main content


Showing posts from 2011

Export HDFS over CIFS (Samba3)

Three weeks ago I played with libhdfs and NFS, but I did not get the results I expected. Then my next idea was, why not to use Samba? Samba3x is stable and most OS can mount an exported share. The main task was to research the performance and setup of this scenario, because samba has a lot of tuning mechanisms inside. Let's go! I used a RHEL 5.7 and the delivered RPMs:   #> rpm -qa|grep samba  samba-3.0.33-3.29.el5_7.4.x86_64  samba-common-3.0.33-3.29.el5_7.4.x86_64 Like I described in " NFS exported HDFS " I mounted hdfs over fuse into the directory /123 via /etc/fstab:   #> cat /etc/fstab  [..]  hadoop-fuse-dfs#dfs://NAMENODE:9000 /123/hdfs fuse usetrash,rw 0 0 and checked it:   #> mount  [..]  fuse on /123/hdfs type fuse (rw,nosuid,nodev,allow_other,default_permissions)   #> ls -la /123  total 16  drwxr-xr-x  3 root root   4096 Dec  9 16:36 .  drwxr-xr-x 27 root root   4096 Dec  9 12:11 ..  drwxr-xr-x  5 hdfs nobody 4096 Dec  9 02:14

NFS exported HDFS (CDH3)

For some reasons it could be a good idea to make a hdfs filesystem available across networks as a exported share. Here I describe a working scenario with linux and hadoop with tools both have on board. I used fuse and libhdfs to mount a hdfs filesystem. Change namenode.local and <PORT> to fit your environment. Install:   yum install hadoop-0.20-fuse.x86_64 hadoop-0.20-libhdfs.x86_64 Create a mountpoint:   mkdir /hdfs-mount Mount your hdfs (testing):   hadoop-fuse-dfs dfs://namenode.local:<PORT> /hdfs-mount -d You will show like that:   INFO fuse_options.c:162 Adding FUSE arg /hdfs-mount  INFO fuse_options.c:110 Ignoring option -d  unique: 1, opcode: INIT (26), nodeid: 0, insize: 56  INIT: 7.10  flags=0x0000000b  max_readahead=0x00020000  INFO fuse_init.c:101 Mounting namenode.local:<PORT>  INIT: 7.8  flags=0x00000001  max_readahead=0x00020000  max_write=0x00020000  unique: 1, error: 0 (Success), outsize: 40 Hit crtl-C after you see "Su

All in one HDFS Cluster for your pocket

Update 1 (Nov 21, 2011): - added 3rd interface as host-only-adapter (hadoop1) - enabled trusted device eth2 About one year ago, I created a small XEN-environment for my engineering pourposes. When I was traveling for hours it was very helpful to track some issues or test new features. The problem was that I had to carry 2 notebooks with me. That was the reason I switched to VirtualBox [1] which runs on OSX, Linux and Windows as well. I could play with my servers and when I did, they configured to death and I reimported them into a clean setup. I think that will also be a good start for new people who have to find into the hadoop ecosystem to see the power without the harm of configuration in a multi-node environment. The appliance is created with VirtualBox, because it runs on OSX and Windows very easily. The idea behind it is to check new settings in a small environment rather easily; the appliance is designed for research, not for development and really not for production. The a

HDFS debugging scenario

The first step to debug issues in a running hadoop - environment to take a look at the stacktraces, easy accessible over jobtracker/stacks and let you show all running stacks in a jstack view. You will see the running processes, as an example I discuss a lab testing scenario, see below. http://jobtracker:50030/stacks Process Thread Dump:  43 active threads Thread 3203101 (IPC Client (47) connection to NAMENODE/IP:9000 from hdfs):   State: TIMED_WAITING   Blocked count: 6   Waited count: 7   Stack:     java.lang.Object.wait(Native Method)     org.apache.hadoop.ipc.Client$Connection.waitForWork(     org.apache.hadoop.ipc.Client$ In that case the RPC connection has a state "TIMED_WAIT" in a block and waited count. That means, the namenode could not answer the RPC request fast enough. The problem belongs the setup as I see often in production environments. For demonstration I use a ESX Cluster with a VM for the namen

Syncing hdfs-clusters

Mostly it is a good idea to test new code on a reference cluster with a nearly live dataset. To sync files from a cluster to another use the hadoop builtin tool distcp [1]. With a small script I "rebase" a development cluster with logfiles we collected over the past day. COPYDATE=`date -d '-1 Day' +"%Y-%m-%d"` DELDATE=`date -d '-3 Day' +"%Y-%m-%d"` SNAMENODE=namenode1 TNAMENODE=namenode2 PATH="/user/flume/logs" LOG="/var/log/jobs/sync.log" #logging exec >> $LOG 2>&1 echo -e "\n ------- sync $COPYDATE ------- \n" /usr/bin/hadoop distcp -i -m 100 hdfs://$SNAMENODE:9000/$PATH/$COPYDATE hdfs://$TNAMENODE:9000/$PATH/$COPYDATE/ sleep 60 echo -e "\n ------- delete $DELDATE ------- \n" /usr/bin/hadoop dfs -rmr /$PATH/$DELDATE /usr/bin/hadoop dfs -rmr /$PATH/_distcp_logs* sleep 60 /usr/bin/hadoop dfs -chmod -R 777 /$PATH/ The script copy logfiles from the past day and the giv

Secure your hadoop cluster, Part II

To get absolutely safe you need a bit more time, coffee and Aspirin. You will get headaches, for sure. First the good news, hadoop and the ecosystem run out of the box with an enabled SELinux system in targeting mode. You have to consider a performance loss of 5 - 10%. To enable SELiux on a box use setenforce 1,  to check the system use sestatus . # sestatus  SELinux status:                 enabled SELinuxfs mount:                /selinux Current mode:                   enforcing Mode from config file:          enforcing Policy version:                 21 Policy from config file:        targeted Fine. Thats all. Now we enable SELinux at boot time: # cat /etc/selinux/config SELINUX=enforcing SELINUXTYPE=targeted SETLOCALDEFS=0 If you use fuse-hdfs check [1] for a valid rule. The best way to get a system running is always to use SELINUXTYPE=targeted. But in some environments it is neccessary to protect the systems much more (Healthcare, Bank, Military etc.), here we us

Sqoop and Microsoft SQL Server

From Microsoft's technet: With SQL Server-Hadoop Connector [1], you import data from: Tables in SQL Server to delimited text files on HDFS Tables in SQL Server to SequenceFiles files on HDFS Tables in SQL Server to tables in Hive* Queries executed in SQL Server to delimited text files on HDFS Queries executed in SQL Server to SequenceFiles files on HDFS Queries executed in SQL Server to tables in Hive*   With SQL Server-Hadoop Connector, you can export data from: Delimited text files on HDFS to SQL Server SequenceFiles on HDFS to SQL Server Hive Tables* to tables in SQL Server But before it works you have to setup the connector. First get the MS JDBC driver [2]: You have just to download the driver, unpack them and copy the driver (sqljdbc4.jar) file to the $SQOOP_HOME/lib/ directory. Now download the connector (.tar.gz) from [1], unpack them and set the MSSQL_CONNECTOR_HOME into that directory. Let's assume you unpack into /usr/sqoop/connector/mssql , do: # ex

Centralized logfile management across networks with flume

Facebooks's scribe was the first available service for managing a hughe amount on logfiles. We didn't talk over 2 GB / day or so, I mean more as 1 TB per day. Compressed. Now, a new apache incubator project is flume [1]. It is a pretty nice piece of software, so I love it. It is reliable, fast, safe and has no proprietary stack inside. And you can create really cool logging tasks. If you use Clouderas Distribution you get flume easy with a " yum install flume-master " on the master and " yum install flume-node " on a node. Check [2] for more infos about. Flume has a lot of sources to get logfiles: - from a text-file - as a tail (one or more files) - syslog UDP or TCP - synthetic sources Flume's design belongs to a large logfile distribution process. Let's assume, we have a 100 Node Webcluster and incoming traffic around 3 GB/s. The farm produce 700 MB raw weblogs per minute. Through the processing over flume we can compress the files, so

Secure your hadoop cluster, Part I

Use mapred with Active Directory (basic auth) The most cases I managed in past weeks concerned hadoop security. That means firewalling, SELinux, authentication and user management. It is usually a difficult process, related to the companys security profile and processes. So I start with the most interesting part - authentication. And, in all cases I worked on the main authentication system was a Windows Active Directory Forest (AD). Since hadoop is shipped with more taskcontroller-classes we can use LinuxTaskController. I use RHEL5 server, but it can be adapted similar to other installations. To enable the UNIX services in Windows Server > 2003 you have to extend the existing schema with UNIX templates, delivered from Microsoft. After that you have to install the "Identity Management for UNIX", in 2008 located in Server Manager => Roles => AD DS => common tasks => Add Role => Select Role Services. Install the software, restart your server and it shoul

hadoop log retention

Some people ask me for a "issue" in mapreduce-jobhistory (/jobhistory.jsp) - the history tooks a while to load the site on high-traffic clusters. For that I'll explain the mechanism: The history-files will be available for 30 days (hardcoded in pre-h21). That produce a lot of logs and waste also space on the hadoop-jobtracker. So I have some installations which hold 20GB on logs in history, as a dependecy a audit of long running jobs isn't really useable. Beginning from h21 the cleanup is configurable: Key: mapreduce.jobtracker.jobhistory.maxage Default: 7 * 24 * 60 * 60 * 1000L (one week) to set the store into a 3-day period use: mapreduce.jobtracker.jobhistory.maxage 3 * 24 * 60 * 60 * 1000L That means 3 Days, 24 hours, 60 minutes, 60 seconds and a cache size of 1000. a other way, but more a hack via crond.d: find /var/log/hadoop-0.20/history/done/ -type f -mtime +1 |xargs rm -f

Analyze your IIS Logs with hive

As you know, it's really easy to collect logs from a apache driven webfarm into a hive-cluster and analyze them. But how it'll work for IIS?  Okay, lets do a view inside. IIS let us collect logs in W3C format by checking over the administraion console, register "website", "Active log format". Here you can setup the path where the logs will be stored, the fields you'll logging and much more. After a restart you should see the logs in the desired path. A good idea will be a split into hours, so you can run the jobs every hour on a fresh dataset. A really easy way will be for a small farm to export the path as a windows shared drive, connect your hive server with the samba-utils: mount -t cifs //Windows-Server/share -o user=name,password=passwd /mountpoint Copy the file into hdfs: hadoop dfs -copyFromLocal /mountpoint/filename <hdfs-dir> (we assume iislog) Now you can proceed with analysis, we use hive here. Lets assume you want to know whic

Speedup Sqoop

Sqoop [1] (sql to hadoop) lets easy connect RDBMS into a hadoop infrastructure. Newest plugin comes from Microsoft and let us connect MS-SQL Server and hadoop each together. As a cool feature you can create a jar-file from your job, its pretty easy, just here a line: sqoop export --connect jdbc:<RDBMS>:thin:@<HOSTNAME>:<PORT>:<DB-NAME>  --ta ble<TABLENAME> --username<USERNAME> --password<PASSWORD> --export-dir <HDFS DIR WHICH CONTAINS DATA> --direct --fields-terminated-by '<TERMINATOR (Java)>' --package-name <JOBNAME>.<IDENTIFIER> --outdir <WHERE THE JAR SHOULD WRITTEN> --bindir <BIN_DIR> After you fired up you'll find a jar-package in --outdir, unzip it and you find your java-code and the precompiled class,so you can start to tune them. Now lets start the job again, but use the precompiled class: sqoop export --connect jdbc:<RDBMS>:thin:@<HOSTNAME>:<PORT>:<DB-NAME&g

Using MySQL as a Hive backend database

Hive let us use a SQL like (HiveQL) style to analyse large datasets with ad-hoc queries, and comes as a service on top of hdfs. It is easy to use and most SQL programmers can instant write some queries. The lack of the installation are the included derby DB, which is running on the node locally. For that Hive is not really multiuser-capable. To use Hive with more than one user you have to setup a backend database. The database will hold all metainformations regarding your tables, partitions, splits and rows. For that the database should be safe (maybe replication) or a HVA installation. I use 2 MySQL servers in a ESX Cluster environment with enabled binary logs (Active/Standby). Setup a server and install mysql-server version 5.1 and up. To get absolute safe you can setup a MySQL cluster ;) Let us configure the mysql-database: # cat /etc/my.cnf [mysqld_safe] socket      = /var/lib/mysql/mysql.sock [mysqld] user        = mysql pid-file    = /var/run/mysqld/ socke

Secondary namenode data loss

Yes, that happen if you not configure your installation well. I got some mails from our customers regarding that problem.  The secondary namenode hadoop.tmp.dir have to redefined in core-site.xml to a directory outside of /tmp, because the most linux-servers cleanup /tmp when a server reboot. That causes a lost of last edit logs and fsimage, in fact the namenode could not be replayed at a server crash. Simply add a new property into core-site.xml: <property> <name>hadoop.tmp.dir</name> <value>/path/for/node/${ }</value> </proper ty> restart the secondary namenode and you'll be save.  The same you should do with your hbase-configuration (hbase-site.xml): <property> <name>hbase.tmp.dir</name> <value>/path/for/node/${ }</value> </proper ty>

Use hive to catch grabber

Get the logs from the farm via flume & syslog, mapreduce them in hive for  IP, how often / second, bytes, item and compare with "human" profiles. Get the data on the fly via sqlstream, processes back into Oracle and from there a loadbalancer could get the IPs for a smooth redirect and I process the data into a graphing system (connection from that IP): Hourly I check geolocation, whois, provider. Using pig.latin. Ready for first testing in our labs. And, of course, not a really performant task (yet) ;-)