
Export HDFS over CIFS (Samba3)


Three weeks ago I experimented with libhdfs and NFS, but I did not get the results I expected. My next idea: why not use Samba? Samba 3.x is stable and most operating systems can mount an exported share.
The main task was to research the setup and performance of this scenario, because Samba offers a lot of tuning knobs. Let's go!

I used RHEL 5.7 with the shipped RPMs:
 #> rpm -qa|grep samba

As described in "NFS exported HDFS", I mounted HDFS over FUSE into the directory /123 via /etc/fstab:

 #> cat /etc/fstab
 hadoop-fuse-dfs#dfs://NAMENODE:9000 /123/hdfs fuse usetrash,rw 0 0

and checked it:
 #> mount
 fuse on /123/hdfs type fuse (rw,nosuid,nodev,allow_other,default_permissions)

 #> ls -la /123
 total 16
 drwxr-xr-x  3 root root   4096 Dec  9 16:36 .
 drwxr-xr-x 27 root root   4096 Dec  9 12:11 ..
 drwxr-xr-x  5 hdfs nobody 4096 Dec  9 02:14 hdfs

The first step afterwards is to configure Samba. This is the config I ended up with:

#> cat /etc/samba/smb.conf
[global]
        bind interfaces only = yes
        deadtime = 15
        default case = lower
        disable netbios = yes
        interfaces = eth0
        dns proxy = no
        workgroup = HDFS
        server string = Samba Server Version %v
        socket options = TCP_NODELAY IPTOS_LOWDELAY SO_RCVBUF=65536 SO_SNDBUF=65536
        load printers = no
        max connections = 30
        strict sync = no
        sync always = no
        syslog = 1
        syslog only = yes
        security = user
        smb passwd file = /etc/samba/smbpasswd

[hdfs]
        comment = HDFS
        path = /123/hdfs
        public = yes
        writable = yes
        printable = no
        create mask = 0744
        force user = hdfs
        force group = nobody
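Before restarting the daemon, it's worth validating the file. Samba ships a parser for exactly this; a quick check, assuming the stock samba RPMs are installed:

```
 #> testparm -s /etc/samba/smb.conf
```

testparm reports unknown or mistyped parameters and dumps the effective configuration, which is handy after hand-editing a config with this many options.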

I created a user and password; here I used the hdfs system user (id=hdfs, group=nobody):

 #> smbpasswd -a username

Finally, I started the server:
 #> service smb restart
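To have Samba come up after a reboot as well, the usual RHEL 5 runlevel switch can be set (a small sketch, standard sysvinit tooling, not from the original setup):

```
 #> chkconfig smb on
```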

Test cases
For testing I used another RHEL 5.7 server and mounted the exported share at /test:
 #> mount -t cifs -o username=hdfs,rw //SAMBASERVER/hdfs /test
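To make the client mount persistent across reboots, an /etc/fstab entry works here too, mirroring the server-side approach. A sketch, with a hypothetical credentials file /etc/cifs.cred (mode 600) so the password stays out of fstab:

```
//SAMBASERVER/hdfs  /test  cifs  credentials=/etc/cifs.cred,rw  0 0
```

where /etc/cifs.cred contains the usual `username=hdfs` and `password=...` lines expected by mount.cifs.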

 #> ls -la /test/
 total 8
 drwxr-xr-x  5 hdfs nobody    0 Dec  9 02:14 .
 drwxr-xr-x 25 root root   4096 Dec  9 15:03 ..
 drwxr-xr-x  3 hdfs nobody    0 Dec  9 02:12 mapred
 drwxr-xr-x  3 hdfs nobody    0 Dec  9 02:13 opt
 drwxr-xr-x  6 hdfs nobody    0 Dec  9 15:56 user

Now the HDFS from my test cluster is exported via Samba. So far, so good.

My first test concerned read performance; here I chose an rsync of a smaller logfile collection:
 #> cd /tmp/rsync-test
 #> rsync -av /test/hdfs/user/flume/weblogs/2011-12-07/ .
 sent 20478888644 bytes  received 92606 bytes  17377158.46 bytes/sec
 total size is 20475835998
 (19GB, 16 MB/s) 
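The throughput rsync reports is in bytes per second; converting the number from the summary above confirms the rough 16 MB/s figure:

```shell
# rsync reported 17377158.46 bytes/sec; convert to MiB/s.
awk 'BEGIN { printf "%.1f MiB/s\n", 17377158.46 / 1024 / 1024 }'
# -> 16.6 MiB/s
```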

How many files did I sync?
 #> find . -type f |wc -l

Okay, that worked. Then I tested the write speed; here I used a plain file I created with

 #> dd if=/dev/zero of=/tmp/13GB bs=128M count=100
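The dd parameters give exactly the byte count that shows up in the HDFS listing later: 100 blocks of 128 MiB each (about 12.5 GiB, hence the "13GB" label):

```shell
# 100 blocks of 128 MiB each, in bytes:
echo $((100 * 128 * 1024 * 1024))
# -> 13421772800
```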

and copied it into the CIFS mount, timing the copy with "time":
 #> time cp /tmp/13GB /test/hdfs/user/
 real 7m57.864s
 user 0m0.328s
 sys 0m20.602s

= around 27 MB/s
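That figure can be reproduced from the numbers above: 13421772800 bytes copied in 7m57.864s, i.e. 477.864 seconds:

```shell
# bytes from the HDFS listing, seconds from the `time real` value
awk 'BEGIN {
  bytes = 13421772800
  secs  = 7 * 60 + 57.864
  printf "%.1f MiB/s\n", bytes / secs / 1024 / 1024
}'
# -> 26.8 MiB/s
```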

Then I checked for correct permissions and group ownership on HDFS:

 hdfs#> hadoop dfs -ls /user
 Found 1 item
 -rw-r--r--   3 hdfs supergroup 13421772800 2011-12-09 15:56 /user/13GB

To compare, I ran an scp write test:
 #> scp /tmp/13GB hdfs@SAMBASERVER:/123/hdfs/user

and got:
 13GB  100%   13GB  47.8MB/s   04:28

which is much faster. The Samba overhead certainly costs some performance.

It is possible to export an HDFS filesystem over libhdfs and Samba to clients and get acceptable results. That makes some tasks easier, including using HDFS as (limited) cluster storage.



  1. I found your post saying you were having some issues at the start of December 2011, how are you finding the FUSE module now? Is it much more reliable, or are there settings that help with reliability/performance?

  2. With CIFS I got good results; here you have the opportunity to tune Samba a lot. It runs stably, but you have to expect lower performance than raw writes.

    1. Thanks for your blog and sharing.

      Have you tried connecting from MS Windows clients (2008 R2) to the Samba server?
      I've tried; login and read work fine, but writes fail.
      Do you have any results on this?

  3. Not really. I have tested from a Windows 7 client, Mac OS X, and Linux. What's the error message in the event log?

  4. I'm still having problems.

    The error code in Windows 2008 and Windows 7 is 0x8007045d when uploading.

    Logging in and downloading are OK, but uploading fails.

    I've tested with Windows XP SP3, Windows 7, and Windows 2008 R2 clients.
    With the same Samba configuration without HDFS, uploading and downloading work.

    Samba version 3.0.33-3.29.el5_7.4
    Hadoop 0.20.2
    My email address is
    If you want to look at my configuration, I can open up my servers.

    Thanks for your response.


    That depends on the NTLM auth from Win7. The client doesn't have the right to write, and that comes from the missing Kerberos setup. Windows uses the given username; if you run as administrator you will be sent as uid 0, and uid 0 is prohibited.
    Set up your cluster with Kerberos auth, or add the user in Windows and work in that user's context.

  6. Does samba3 have any kernel requirement like NFS?

  7. No; it needs xattr and ext3, as far as I know.

