It is usually a good idea to test new code on a reference cluster with a nearly live dataset. To sync files from one cluster to another, use Hadoop's built-in tool distcp [1]. With a small script I "rebase" a development cluster with the logfiles we collected over the past day:

```bash
#!/bin/bash
COPYDATE=`date -d '-1 Day' +"%Y-%m-%d"`
DELDATE=`date -d '-3 Day' +"%Y-%m-%d"`
SNAMENODE=namenode1
TNAMENODE=namenode2
# don't call this variable PATH: that would clobber the shell's
# search path and break commands like sleep
HDFSPATH="/user/flume/logs"
LOG="/var/log/jobs/sync.log"

# redirect all output of this script into the logfile
exec >> $LOG 2>&1

echo -e "\n ------- sync $COPYDATE ------- \n"
# copy yesterday's logs from the source to the target cluster,
# ignoring failures (-i) and using up to 100 mappers (-m 100)
/usr/bin/hadoop distcp -i -m 100 hdfs://$SNAMENODE:9000$HDFSPATH/$COPYDATE hdfs://$TNAMENODE:9000$HDFSPATH/$COPYDATE/
sleep 60

echo -e "\n ------- delete $DELDATE ------- \n"
# remove logs older than three days as well as distcp's own logs
/usr/bin/hadoop dfs -rmr $HDFSPATH/$DELDATE
/usr/bin/hadoop dfs -rmr $HDFSPATH/_distcp_logs*
sleep 60

# open up the permissions so development jobs can work with the data
/usr/bin/hadoop dfs -chmod -R 777 $HDFSPATH/
```

The script copies the logfiles from the past day to the development cluster, deletes logs older than three days together with distcp's log directories, and finally relaxes the permissions on the whole path.
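To run the rebase automatically every night, the script can be scheduled via cron. A minimal sketch follows; the file location, the 3 a.m. schedule, the hdfs user, and the script path are assumptions for illustration, not part of the setup above:

```bash
# hypothetical /etc/cron.d/hdfs-sync entry:
# run the rebase script daily at 03:00 as the hdfs user
# (adjust schedule, user, and script path to your environment)
0 3 * * * hdfs /usr/local/bin/rebase-devcluster.sh
```

Since the script already redirects its own output to /var/log/jobs/sync.log, there is no need to capture cron's output separately.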