Mostly it is a good idea to test new code on a reference cluster with a nearly live dataset. To sync files from one cluster to another, use Hadoop's built-in tool distcp [1]. With a small script I "rebase" a development cluster with the logfiles we collected over the past day. The -i flag tells distcp to ignore failures on individual files, and -m 100 caps the job at 100 map tasks.

#!/bin/bash

COPYDATE=`date -d '-1 day' +"%Y-%m-%d"`
DELDATE=`date -d '-3 day' +"%Y-%m-%d"`
SNAMENODE=namenode1
TNAMENODE=namenode2
# do not name this variable PATH: that would clobber the shell's search path
HDFSPATH="/user/flume/logs"
LOG="/var/log/jobs/sync.log"

# logging
exec >> $LOG 2>&1

echo -e "\n ------- sync $COPYDATE ------- \n"
/usr/bin/hadoop distcp -i -m 100 hdfs://$SNAMENODE:9000$HDFSPATH/$COPYDATE hdfs://$TNAMENODE:9000$HDFSPATH/$COPYDATE/
sleep 60

echo -e "\n ------- delete $DELDATE ------- \n"
/usr/bin/hadoop dfs -rmr $HDFSPATH/$DELDATE
/usr/bin/hadoop dfs -rmr $HDFSPATH/_distcp_logs*
sleep 60
/usr/bin/hadoop dfs -chmod -R 777 $HDFSPATH/

The script copies the logfiles from the past day to the development cluster, deletes data older than three days along with distcp's own log directories, and opens up the permissions on the target path.
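To run the rebase automatically every night, the script can be scheduled via cron. A minimal sketch; the script location /usr/local/bin/rebase-dev.sh, the run time, and the hadoop user are my assumptions, not from the original setup:

# /etc/cron.d/rebase-dev -- run the sync nightly at 03:00 as the hadoop user
# (assumes the script above is saved as /usr/local/bin/rebase-dev.sh and is executable)
0 3 * * * hadoop /usr/local/bin/rebase-dev.sh

Running it shortly after midnight gives the log collection time to close out the previous day's directory before the copy starts.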