
Export HDFS over CIFS (Samba3)


Three weeks ago I experimented with libhdfs and NFS, but I did not get the results I expected. My next idea: why not use Samba? Samba 3.x is stable and most operating systems can mount an exported share.
The main task was to research the setup and performance of this scenario, because Samba offers a lot of tuning knobs. Let's go!

I used RHEL 5.7 with the shipped RPMs:
 #> rpm -qa|grep samba

As described in "NFS exported HDFS", I mounted HDFS over FUSE into the directory /123 via /etc/fstab:

 #> cat /etc/fstab
 hadoop-fuse-dfs#dfs://NAMENODE:9000 /123/hdfs fuse usetrash,rw 0 0

and checked it:
 #> mount
 fuse on /123/hdfs type fuse (rw,nosuid,nodev,allow_other,default_permissions)

 #> ls -la /123
 total 16
 drwxr-xr-x  3 root root   4096 Dec  9 16:36 .
 drwxr-xr-x 27 root root   4096 Dec  9 12:11 ..
 drwxr-xr-x  5 hdfs nobody 4096 Dec  9 02:14 hdfs

The first step afterwards is to configure Samba. This is the config I ended up with:

#> cat /etc/samba/smb.conf
[global]
        bind interfaces only = yes
        deadtime = 15
        default case = lower
        disable netbios = yes
        interfaces = eth0
        dns proxy = no
        workgroup = HDFS
        server string = Samba Server Version %v
        socket options = TCP_NODELAY IPTOS_LOWDELAY SO_RCVBUF=65536 SO_SNDBUF=65536
        load printers = no
        max connections = 30
        strict sync = no
        sync always = no
        syslog = 1
        syslog only = yes
        security = user
        smb passwd file = /etc/samba/smbpasswd

[hdfs]
        comment = HDFS
        path = /123/hdfs
        public = yes
        writable = yes
        printable = no
        create mask = 0744
        force user = hdfs
        force group = nobody
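Before restarting the daemon, it's worth validating the file. Samba ships a parser for exactly this; a quick check, assuming the stock samba RPMs are installed:

```
 #> testparm -s /etc/samba/smb.conf
```

testparm reports unknown or mistyped parameters and dumps the effective configuration, which is handy after hand-editing a config with this many options.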

I created a user and password; here I used the hdfs system user (id=hdfs, group=nobody):

 #> smbpasswd -a username

Finally, I started the server:
 #> service smb restart
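To have Samba come up after a reboot as well, the usual RHEL 5 runlevel switch can be set (a small sketch, standard sysvinit tooling, not from the original setup):

```
 #> chkconfig smb on
```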

Test cases
For testing I used another RHEL 5.7 server and mounted the exported share at /test:
 #> mount -t cifs -o username=hdfs,rw //SAMBASERVER/hdfs /test
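To make the client mount persistent across reboots, an /etc/fstab entry works here too, mirroring the server-side approach. A sketch, with a hypothetical credentials file /etc/cifs.cred (mode 600) so the password stays out of fstab:

```
//SAMBASERVER/hdfs  /test  cifs  credentials=/etc/cifs.cred,rw  0 0
```

where /etc/cifs.cred contains the usual `username=hdfs` and `password=...` lines expected by mount.cifs.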

 #> ls -la /test/
 total 8
 drwxr-xr-x  5 hdfs nobody    0 Dec  9 02:14 .
 drwxr-xr-x 25 root root   4096 Dec  9 15:03 ..
 drwxr-xr-x  3 hdfs nobody    0 Dec  9 02:12 mapred
 drwxr-xr-x  3 hdfs nobody    0 Dec  9 02:13 opt
 drwxr-xr-x  6 hdfs nobody    0 Dec  9 15:56 user

Now the HDFS from my test cluster is exported via Samba. So far, so good.

My first test concerned read performance; here I chose an rsync of a smaller logfile collection:
 #> cd /tmp/rsync-test
 #> rsync -av /test/hdfs/user/flume/weblogs/2011-12-07/ .
 sent 20478888644 bytes  received 92606 bytes  17377158.46 bytes/sec
 total size is 20475835998
 (19GB, 16 MB/s) 
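The throughput rsync reports is in bytes per second; converting the number from the summary above confirms the rough 16 MB/s figure:

```shell
# rsync reported 17377158.46 bytes/sec; convert to MiB/s.
awk 'BEGIN { printf "%.1f MiB/s\n", 17377158.46 / 1024 / 1024 }'
# -> 16.6 MiB/s
```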

How many files did I sync?
 #> find . -type f |wc -l

Okay, that worked. Then I tested the write speed; here I used a plain file I created with

 #> dd if=/dev/zero of=/tmp/13GB bs=128M count=100
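The dd parameters give exactly the byte count that shows up in the HDFS listing later: 100 blocks of 128 MiB each (about 12.5 GiB, hence the "13GB" label):

```shell
# 100 blocks of 128 MiB each, in bytes:
echo $((100 * 128 * 1024 * 1024))
# -> 13421772800
```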

and copied it into the CIFS mount, timing the copy with "time":
 #> time cp /tmp/13GB /test/hdfs/user/
 real 7m57.864s
 user 0m0.328s
 sys 0m20.602s

= around 27 MB/s
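That figure can be reproduced from the numbers above: 13421772800 bytes copied in 7m57.864s, i.e. 477.864 seconds:

```shell
# bytes from the HDFS listing, seconds from the `time real` value
awk 'BEGIN {
  bytes = 13421772800
  secs  = 7 * 60 + 57.864
  printf "%.1f MiB/s\n", bytes / secs / 1024 / 1024
}'
# -> 26.8 MiB/s
```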

Then I checked for correct permissions and group ownership on HDFS:

 hdfs#> hadoop dfs -ls /user
 Found 1 item
 -rw-r--r--   3 hdfs supergroup 13421772800 2011-12-09 15:56 /user/13GB

To compare, I ran an scp write test:
 #> scp /tmp/13GB hdfs@SAMBASERVER:/123/hdfs/user

and got:
 13GB  100%   13GB  47.8MB/s   04:28

which is much faster. The Samba overhead certainly costs some performance.

It is possible to export an HDFS filesystem over libhdfs and Samba to clients and get acceptable results. That makes some tasks easier, including using HDFS as (limited) cluster storage.



  1. I found your post saying you were having some issues at the start of December 2011, how are you finding the FUSE module now? Is it much more reliable, or are there settings that help with reliability/performance?

  2. With CIFS I got good results; here you have the opportunity to tune Samba a lot. It runs stably, but you have to expect lower performance than raw writes.

    1. Thanks for your blog and sharing.

      Have you tried connecting from MS Windows clients (2008 R2) to the Samba server?
      I've tried; login and read work fine, but writes fail.
      Do you have any results on this?

  3. Not really. I have tested from a Windows 7 client, Mac OS X, and Linux. What's the error message in the event log?

  4. I'm still having problems.

    The error code in Windows 2008 and Windows 7 is 0x8007045d when uploading.

    Logging in and downloading are OK, but uploading fails.

    I've tested with Windows XP SP3, Windows 7, and Windows 2008 R2 clients.
    With the same Samba configuration without HDFS, uploading and downloading work.

    Samba version 3.0.33-3.29.el5_7.4
    Hadoop 0.20.2
    My email address is
    If you want to look at my configuration, I can open up my servers.

    Thanks for your response.


    That depends on the NTLM auth from Win7. The client doesn't have the right to write, and that comes from the missing Kerberos setup. Windows uses the given username; if you run as administrator you will be sent as uid 0, and uid 0 is prohibited.
    Set up your cluster with Kerberos auth, or add the user in Windows and work in that user's context.

  6. Does samba3 have any kernel requirement like NFS?

  7. No; it needs xattr and ext3, as far as I know.

