Some years ago I experimented with exposing HDFS to non-Hadoop clients. After playing with libhdfs and NFS and not getting the results I wanted, the next idea was: why not try Samba? Samba 3.x was stable, and most operating systems could mount a CIFS share without additional software.
The main goal was to understand the setup and performance characteristics of this scenario, because Samba has many tuning options. The environment used in this experiment was RHEL 5.7 with the stock Samba packages.
1. Environment and packages
# rpm -qa | grep samba
samba-3.0.33-3.29.el5_7.4.x86_64
samba-common-3.0.33-3.29.el5_7.4.x86_64
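If the packages are not present, the stock versions can be pulled in from the standard RHEL repositories (shown as a sketch; exact versions depend on the release):
# yum install samba samba-common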
2. Mounting HDFS via FUSE
As described in an earlier post about NFS-exported HDFS, HDFS is first mounted locally using hadoop-fuse-dfs. The mount point used here was /123/hdfs, configured in /etc/fstab:
# cat /etc/fstab
[...]
hadoop-fuse-dfs#dfs://NAMENODE:9000 /123/hdfs fuse usetrash,rw 0 0
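With this entry in place, the FUSE mount can be activated straight from fstab without a reboot (assuming the hadoop-fuse-dfs helper from the Hadoop packages is installed on the node):
# mount /123/hdfs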
Check the mount:
# mount
[...]
fuse on /123/hdfs type fuse (rw,nosuid,nodev,allow_other,default_permissions)
And list the directory:
# ls -la /123
total 16
drwxr-xr-x 3 root root 4096 Dec 9 16:36 .
drwxr-xr-x 27 root root 4096 Dec 9 12:11 ..
drwxr-xr-x 5 hdfs nobody 4096 Dec 9 02:14 hdfs
At this point HDFS is available on the local node under /123/hdfs via FUSE.
3. Samba configuration
The next step is to configure Samba to export this FUSE-mounted HDFS path to remote clients.
# cat /etc/samba/smb.conf
[global]
bind interfaces only = yes
deadtime = 15
default case = lower
disable netbios = yes
interfaces = eth0
dns proxy = no
workgroup = HDFS
server string = Samba Server Version %v
socket options = TCP_NODELAY IPTOS_LOWDELAY SO_RCVBUF=65536 SO_SNDBUF=65536
load printers = no
max connections = 30
strict sync = no
sync always = no
syslog = 1
syslog only = yes
security = user
smb passwd file = /etc/samba/smbpasswd
[hdfs]
comment = HDFS
path = /123/hdfs
public = yes
writable = yes
printable = no
create mask = 0744
force user = hdfs
force group = nobody
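Before restarting anything, the configuration can be validated with testparm, which ships with Samba and prints the parsed parameters (and complains about typos or unknown options):
# testparm -s /etc/samba/smb.conf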
For authentication, I created a Samba user and password. In this setup I reused the HDFS system user (hdfs, group nobody):
# smbpasswd -a hdfs
Finally, restart the Samba service:
# service smb restart
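As a quick sanity check before involving a remote client, smbclient (from the samba-client package) can list the shares exported by the local server:
# smbclient -L localhost -U hdfs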
4. Mounting the share from a client
On another RHEL 5.7 server, the Samba share is mounted to /test using CIFS:
# mount -t cifs -o username=hdfs,rw //SAMBASERVER/hdfs /test
Password: HERE_THE_PASSWORD
Verify the mount:
# ls -la /test/
total 8
drwxr-xr-x 5 hdfs nobody 0 Dec 9 02:14 .
drwxr-xr-x 25 root root 4096 Dec 9 15:03 ..
drwxr-xr-x 3 hdfs nobody 0 Dec 9 02:12 mapred
drwxr-xr-x 3 hdfs nobody 0 Dec 9 02:13 opt
drwxr-xr-x 6 hdfs nobody 0 Dec 9 15:56 user
Now the HDFS filesystem from the test cluster is exported via Samba and accessible to CIFS clients. So far, so good.
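To make the mount survive a reboot, an equivalent entry can go into /etc/fstab on the client, with the credentials kept in a root-readable file instead of on the command line (a sketch; adjust paths and permissions as needed):
# cat /etc/fstab
[...]
//SAMBASERVER/hdfs /test cifs credentials=/etc/samba/hdfs.cred,rw,_netdev 0 0
# cat /etc/samba/hdfs.cred
username=hdfs
password=HERE_THE_PASSWORD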
5. Read performance test with rsync
The first performance test focused on read throughput. I used rsync to copy a collection of smaller log files from HDFS (via Samba) to the local filesystem:
# cd /tmp/rsync-test
# rsync -av /test/hdfs/user/flume/weblogs/2011-12-07/ .
sent 20478888644 bytes received 92606 bytes 17377158.46 bytes/sec
total size is 20475835998
(about 19 GB at roughly 16 MB/s)
Number of files synced:
# find . -type f | wc -l
4665
The result shows that reading many small files via Samba on top of HDFS FUSE achieved roughly 16 MB/s on this setup.
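Many small files are close to the worst case for this stack, since every file open has to pass through Samba, FUSE and a NameNode lookup. For a rough comparison, a sequential read of one large file could be timed with dd; this is only a sketch with a placeholder path and was not part of the measured run:
# dd if=/test/hdfs/user/SOME_LARGE_FILE of=/dev/null bs=128M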
6. Write performance test with dd and cp
For write performance, I created a large test file (100 blocks of 128 MB, about 13 GB):
# dd if=/dev/zero of=/tmp/13GB bs=128M count=100
Then copied it into the CIFS-mounted share, measuring the time:
# time cp /tmp/13GB /test/hdfs/user/
real 7m57.864s
user 0m0.328s
sys 0m20.602s
Copying about 13.4 GB in just under eight minutes corresponds to roughly 27 MB/s write throughput via Samba on top of FUSE-mounted HDFS.
Check permissions and ownership in HDFS:
# hadoop dfs -ls /user
Found 1 item
-rw-r--r-- 3 hdfs supergroup 13421772800 2011-12-09 15:56 /user/13GB
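An alternative way to measure writes, without staging the file in /tmp first, is to let dd write directly into the CIFS mount and force a sync before the timer stops, so client-side caching does not inflate the number. The target file name is just an example and this was not part of the original run:
# time sh -c 'dd if=/dev/zero of=/test/hdfs/user/ddtest bs=128M count=100 && sync'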
7. Comparison with direct SCP
To compare, I used scp to write the same file directly to the node where HDFS was mounted:
# scp /tmp/13GB hdfs@SAMBASERVER:/123/hdfs/user
13GB 100% 13GB 47.8MB/s 04:28
Direct SCP to the FUSE mount was significantly faster (~47.8 MB/s) than writing through Samba (~27 MB/s). The additional overhead from Samba clearly costs performance.
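To separate the Samba overhead from the FUSE overhead, the same copy can also be timed directly on the Samba server, writing into the FUSE mount point without CIFS in between (a sketch, assuming a local copy of the test file exists on that node; this was not part of the measured run):
# time cp /tmp/13GB /123/hdfs/user/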
Conclusion
This experiment shows that it is technically possible to export a FUSE-mounted HDFS filesystem via Samba and access it from CIFS clients with acceptable read/write performance. For some operational and ad-hoc tasks, this can make it easier to work with HDFS as a (limited) cluster filesystem from systems that are not HDFS-aware.
However, the additional layers (FUSE + Samba) introduce overhead and complexity. For production workloads or high-throughput scenarios, native HDFS clients, WebHDFS, HDFS NFS gateways, object stores, or modern data access layers are usually the better choice.
For further tuning ideas, see Samba optimization guides such as: https://calomel.org/samba_optimize.html
If you need help with distributed systems, backend engineering, or data platforms, check my Services.