Snappy is a high-performance compression and decompression library originally developed at Google and now widely used as a codec across the Hadoop ecosystem. It is optimized for speed rather than maximum compression ratio, which makes it a preferred codec in many Hadoop and Hive pipelines. Snappy is integrated into all modern Hadoop distributions (Hadoop 2.x and 3.x) and works with Hive, MapReduce, and, increasingly, Tez- and Spark-based Hive deployments.

This updated guide shows how to generate test data, upload it to HDFS, process it with Hive using Snappy compression, verify the output files, and load Snappy-compressed data back into Hive using modern Hadoop commands.

1. Create sample input data

Generate a small test file:

$ seq 1 1000 | awk '{OFS="\001"; print $1, $1 % 10}' > test_input.hive
$ head -5 test_input.hive
1^A1
2^A2
3^A3
4^A4
5^A5

2. Upload the data into HDFS

Updated command: Hadoop now uses hdfs dfs instead of the deprecated hadoop dfs.

$ hdfs dfs ...
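As a sketch of what a typical upload looks like, the commands below create a target directory and copy the file into it. The path /user/hive/snappy_demo is a placeholder chosen for this walkthrough, not a location prescribed by the guide:

$ hdfs dfs -mkdir -p /user/hive/snappy_demo
$ hdfs dfs -put test_input.hive /user/hive/snappy_demo/
$ hdfs dfs -ls /user/hive/snappy_demo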
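Looking ahead to the Hive processing step, the snippet below is a minimal sketch of the session settings that tell Hive to write Snappy-compressed output. These are standard Hadoop/Hive configuration properties, assuming a Hive CLI or Beeline session:

-- Compress the final output of Hive jobs
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
-- Select Snappy as the output codec
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

Because these are session-level SET commands, they only affect queries run afterward in the same session; cluster-wide defaults would instead go in the Hadoop or Hive configuration files.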