Snappy is a high-performance compression and decompression library originally developed at Google. It is optimized for speed rather than maximum compression ratio, making it a preferred codec in many Hadoop and Hive pipelines.
Snappy is integrated into all modern Hadoop distributions (Hadoop 2.x and 3.x) and works with Hive whether it runs on MapReduce, Tez, or Spark.
1. Create sample input data
Generate a small test file:
$ seq 1 1000 | awk '{OFS="\001";print $1, $1 % 10}' > test_input.hive
$ head -5 test_input.hive
1^A1
2^A2
3^A3
4^A4
5^A5
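The ^A sequences are how the non-printable \001 delimiter is rendered. If you want to confirm the delimiter at the byte level, a standard coreutils check works (output omitted here; look for the 001 bytes between the two fields):
$ od -c test_input.hive | head -2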
2. Upload the data into HDFS
Modern Hadoop uses hdfs dfs rather than the deprecated hadoop dfs command.
$ hdfs dfs -mkdir -p /tmp/hivetest
$ hdfs dfs -put test_input.hive /tmp/hivetest
$ hdfs dfs -ls /tmp/hivetest
Found 1 items
-rw-r--r-- 3 hdfs supergroup 5893 ... /tmp/hivetest/test_input.hive
3. Create an external Hive table
An external Hive table points at an existing HDFS directory instead of managing its own storage:
hive> CREATE EXTERNAL TABLE hivetest1 (a INT, b INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
LOCATION '/tmp/hivetest';
hive> SELECT * FROM hivetest1 LIMIT 1;
1 1
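To confirm that the table really is external and points at the expected directory, DESCRIBE FORMATTED prints the table metadata; look for Table Type: EXTERNAL_TABLE and the /tmp/hivetest location in its output:
hive> DESCRIBE FORMATTED hivetest1;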
4. Enable Snappy in Hive
Before writing compressed output, enable Snappy via session settings:
hive> SET hive.exec.compress.output=true;
hive> SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
Note: In Hadoop 2.x and 3.x, the modern property names start with mapreduce.output. instead of mapred.output.
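Optionally, multi-stage queries can also compress the intermediate data passed between stages. These are standard Hive/Hadoop properties; whether they help depends on your workload:
hive> SET hive.exec.compress.intermediate=true;
hive> SET mapreduce.map.output.compress=true;
hive> SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;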
Now create a managed table for compressed output:
hive> CREATE TABLE hivetest2 (a INT, b INT);
5. Insert data into the Snappy-compressed table
Hive will launch a MapReduce (or Tez) job that writes Snappy-compressed output files:
hive> INSERT OVERWRITE TABLE hivetest2
SELECT * FROM hivetest1;
Total jobs = 1
... job output omitted ...
1000 Rows loaded to hivetest2
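A quick sanity check that all 1000 input rows made it into the compressed table (the count query launches its own small job):
hive> SELECT COUNT(*) FROM hivetest2;
1000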
6. Validate the compressed data file
Verify that Hive created a Snappy file:
$ hdfs dfs -ls /user/hive/warehouse/hivetest2
Found 1 items
-rw-r--r-- 3 hdfs supergroup 4021 ... /user/hive/warehouse/hivetest2/000000_0.snappy
Read the Snappy file (it will appear as binary with readable records):
$ hdfs dfs -cat /user/hive/warehouse/hivetest2/000000_0.snappy | head -5
...binary header...
1^A1
2^A2
3^A3
4^A4
5^A5
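Because -cat streams the raw bytes, a cleaner way to inspect the contents is hdfs dfs -text, which recognizes common codecs by file extension and prints the decompressed records:
$ hdfs dfs -text /user/hive/warehouse/hivetest2/000000_0.snappy | head -5
1^A1
2^A2
3^A3
4^A4
5^A5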
7. Load Snappy data into another Hive table
You can also load existing Snappy files using LOAD DATA. Be aware that LOAD DATA INPATH moves (rather than copies) the source file, so after this statement the file is gone from hivetest2's directory; a copy-based alternative is sketched after the query output below:
hive> CREATE TABLE hivetest3 (a INT, b INT);
hive> LOAD DATA INPATH '/user/hive/warehouse/hivetest2/000000_0.snappy'
INTO TABLE hivetest3;
hive> SELECT * FROM hivetest3 LIMIT 5;
1 1
2 2
3 3
4 4
5 5
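If hivetest2 should keep its data, copy the file to a staging directory first and load from there. The staging path below is arbitrary; any HDFS location the Hive user can read will do:
$ hdfs dfs -mkdir -p /tmp/snappy_staging
$ hdfs dfs -cp /user/hive/warehouse/hivetest2/000000_0.snappy /tmp/snappy_staging/
hive> LOAD DATA INPATH '/tmp/snappy_staging/000000_0.snappy' INTO TABLE hivetest3;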
Notes for Modern Hadoop Versions
- Snappy support ships with Hadoop 2.x+ and is usable as long as the native libraries are installed.
- On Hive-on-Tez or Spark, compression is still controlled via SET statements or table properties.
- Modern Hive deployments typically use ORC or Parquet, which handle compression (including Snappy) through table properties; see the sketch after this list.
- This example remains valid for teaching raw file-based compression workflows.
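For comparison with the file-based workflow above, here is a minimal sketch of declaring Snappy on a columnar table. The table name is illustrative; orc.compress is the standard ORC property (Parquet tables use parquet.compression instead):
hive> CREATE TABLE hivetest_orc (a INT, b INT)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");
hive> INSERT OVERWRITE TABLE hivetest_orc SELECT * FROM hivetest1;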
Snappy remains a preferred codec due to its balance of speed and reasonable compression ratio. Always check that your Hadoop installation has the native Snappy libraries available; depending on the version, a missing native library either causes SnappyCodec to fail outright or forces a fallback to a slower Java implementation.
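A quick way to verify native codec support is Hadoop's built-in checknative tool; look for a line reading snappy: true in its output:
$ hadoop checknative -a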