
Posts

Showing posts with the label hdfs

XAttrs are coming to HDFS

HDFS-2006 [1] describes the use of extended attributes. XAttrs, known from *NIX operating systems, attach descriptive metadata to physically stored data beyond the attributes strictly defined by the filesystem. They are mostly used to carry additional information such as a hash, checksum or encoding, or security-relevant information such as a signature or the author / creator. According to the source code [2], the use of xattrs can be configured through dfs.namenode.fs-limits.max-xattrs-per-inode and dfs.namenode.fs-limits.max-xattr-size in hdfs-default.xml. The default for dfs.namenode.fs-limits.max-xattrs-per-inode is 32; for dfs.namenode.fs-limits.max-xattr-size the default is 16384. Within HDFS, extended user attributes are stored under a namespaced identifier. As in the Linux kernel implementation, there are four namespaces: security, system, trusted and user. Only the superuser can access the trusted namespace, while the system and security namespaces are reserved for internal use. The xattr def...
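As a minimal sketch (assuming a release that already ships the xattr feature and a hypothetical file /user/alo/file.txt), an attribute in the user namespace can be set and read back with the FsShell, and the limits named above can be overridden in hdfs-site.xml:

  #> hdfs dfs -setfattr -n user.checksum -v "md5:3858f62230ac3c915f300c664312c63f" /user/alo/file.txt
  #> hdfs dfs -getfattr -d /user/alo/file.txt

  <!-- hdfs-site.xml: override the xattr limits (example values only) -->
  <property>
    <name>dfs.namenode.fs-limits.max-xattrs-per-inode</name>
    <value>32</value>
  </property>
  <property>
    <name>dfs.namenode.fs-limits.max-xattr-size</name>
    <value>16384</value>
  </property>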

How HDFS protects your data

Often we get questions about how HDFS protects data and what the mechanisms are to prevent data corruption. Eric Sammer explains this in detail in Hadoop Operations. In addition to the points below, you can also keep a second cluster to sync the files to, simply to guard against human error, like deleting a subset of the data. If you have enough space in your cluster, enabling the trash in core-site.xml and setting it to a value higher than one day helps too:

  <property>
    <name>fs.trash.interval</name>
    <value>1440</value>
    <description>Number of minutes after which the checkpoint gets deleted.
    If zero, the trash feature is disabled. 1440 means 1 day.</description>
  </property>
  <property>
    <name>fs.trash.checkpoint.interval</name>
    <value>15</value>
    <description>Number of minutes between trash checkpoints. Should be smaller
    or equal to fs.trash.interval. Every time the checkpointer runs it creates a new check...
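As a quick illustration of how the trash behaves once enabled (a sketch, assuming a user alo and a hypothetical file /user/alo/data.csv), a normal delete only moves the file into the per-user .Trash directory, from where it can be moved back until the checkpoint expires:

  #> hadoop fs -rm /user/alo/data.csv
  #> hadoop fs -mv /user/alo/.Trash/Current/user/alo/data.csv /user/alo/data.csv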

Secure your cluster with Kerberos

Many times I get questions about a safe and fast way to secure a cluster without big steps like integrating AD structures, simply to prevent unauthorized access. I created this writeup to walk you through the steps you need. I used CentOS 6.3 and CDH 4.0.1, but other distributions work as well. Setup KDC on a Linux box: install the kerberos5-related packages, including kadmin. The first thing you have to do is replace EXAMPLE.COM, which is delivered as the default, with your own realm. I used ALO.ALT here. Example config:

  # hadoop1> cat /etc/krb5.conf
  [libdefaults]
   default_realm = ALO.ALT
   dns_lookup_realm = false
   dns_lookup_kdc = false
  [realms]
   ALO.ALT = {
    kdc = HADOOP1.ALO.ALT:88
    admin_server = HADOOP1.ALO.ALT:749
    default_domain = HADOOP1.ALO.ALT
   }
  [domain_realm]
   .alo.alt = ALO.ALT
   alo.alt = ALO.ALT
  [logging]
   kdc = FILE:/var/log/krb5kdc.log
   admin_server = FILE:/var/log/kadmin.lo...
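A minimal sketch of the next steps on the KDC host (assuming the ALO.ALT realm from above and the standard CentOS 6 krb5-server packages; the principal names are only examples): create the KDC database, add an admin and a test principal, then start the daemons:

  # hadoop1> kdb5_util create -s -r ALO.ALT
  # hadoop1> kadmin.local -q "addprinc root/admin@ALO.ALT"
  # hadoop1> kadmin.local -q "addprinc alo@ALO.ALT"
  # hadoop1> service krb5kdc start
  # hadoop1> service kadmin start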

Export HDFS over CIFS (Samba3)

Three weeks ago I played with libhdfs and NFS, but I did not get the results I expected. My next idea was: why not use Samba? Samba3x is stable and most operating systems can mount an exported share. The main task was to research the setup and performance of this scenario, because Samba brings a lot of built-in tuning mechanisms. Let's go! I used RHEL 5.7 and the delivered RPMs:

  #> rpm -qa|grep samba
  samba-3.0.33-3.29.el5_7.4.x86_64
  samba-common-3.0.33-3.29.el5_7.4.x86_64

As I described in "NFS exported HDFS", I mounted HDFS over fuse into the directory /123 via /etc/fstab:

  #> cat /etc/fstab
  [..]
  hadoop-fuse-dfs#dfs://NAMENODE:9000 /123/hdfs fuse usetrash,rw 0 0

and checked it:

  #> mount
  [..]
  fuse on /123/hdfs type fuse (rw,nosuid,nodev,allow_other,default_permissions)
  #> ls -la /123
  total 16
  drwxr-xr-x  3 root root   4096 Dec  9 16:36 .
  drwxr-xr-x 27 root root ...
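A minimal smb.conf share definition for that fuse mountpoint could look like the sketch below (the share name hdfs and the guest/security settings are assumptions; the original post tunes more parameters than shown here):

  [global]
   workgroup = WORKGROUP
   security = user

  [hdfs]
   comment = HDFS exported via fuse
   path = /123/hdfs
   writeable = yes
   browseable = yes
   guest ok = no

A client could then mount it with, for example, mount -t cifs //SAMBASERVER/hdfs /mnt/hdfs -o username=alo.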

NFS exported HDFS (CDH3)

For some use cases it can be a good idea to make an HDFS filesystem available across the network as an exported share. Here I describe a working scenario with Linux and Hadoop, using only tools both have on board. I used fuse and libhdfs to mount an HDFS filesystem. Change namenode.local and <PORT> to fit your environment. Install:

  yum install hadoop-0.20-fuse.x86_64 hadoop-0.20-libhdfs.x86_64

Create a mountpoint:

  mkdir /hdfs-mount

Mount your HDFS (testing):

  hadoop-fuse-dfs dfs://namenode.local:<PORT> /hdfs-mount -d

You will see output like this:

  INFO fuse_options.c:162 Adding FUSE arg /hdfs-mount
  INFO fuse_options.c:110 Ignoring option -d
  unique: 1, opcode: INIT (26), nodeid: 0, insize: 56
  INIT: 7.10
  flags=0x0000000b
  max_readahead=0x00020000
  INFO fuse_init.c:101 Mounting namenode.local:<PORT>
  INIT: 7.8
  flags=0x00000001
  max_readahead=0x00020000
  max_write=0x00020000
  uniqu...
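To then export the mountpoint over NFS, a minimal sketch (assuming the standard Linux kernel NFS server and a client network of 10.0.0.0/24; both are assumptions, not from the original post) is an /etc/exports entry on the fuse host plus a client-side mount:

  # on the host with the fuse mount
  #> echo "/hdfs-mount 10.0.0.0/24(rw,sync,no_root_squash)" >> /etc/exports
  #> exportfs -ra
  #> service nfs start

  # on the client
  #> mount -t nfs namenode.local:/hdfs-mount /mnt/hdfs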