Apache-based web farms have always been easy to integrate with Hadoop and Hive, but many teams struggled with the same question: how do we collect and analyze IIS logs at scale? This walkthrough shows how IIS W3C logs can be exported, ingested into HDFS, structured in Hive, and processed into partitioned traffic tables for efficient reporting.
Configuring IIS for W3C Logging
Inside IIS Manager, under a website’s configuration, choose:
Logging → Format: W3C
From there you can configure:
- Log output path
- Fields to capture (IP, URI, user agent, status, etc.)
- Log rotation and splitting (hourly rotation recommended)
After applying the configuration and restarting IIS, new logs will begin to appear in the chosen directory.
Mounting IIS Logs on a Linux/Hive Node
For small to medium environments, one simple method is to export the IIS log directory as a Windows share and mount it on the Hive or ingestion node:
mount -t cifs -o username=name,password=passwd \
    //Windows-Server/share /mountpoint
Copy the logs into HDFS for downstream analysis:
hadoop dfs -copyFromLocal /mountpoint/filename <hdfs-dir>
For this example, assume the logs land in the directory iislog.
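For recurring ingestion, the copy itself can be scripted. Below is a minimal sketch that pushes the most recently closed hourly log into HDFS; it assumes IIS's default u_ex<yymmddhh>.log naming for hourly rotation, the /mountpoint path from above, and the iislog target directory, so adjust all three for your environment:
#!/bin/bash
# Sketch: copy the previous hour's IIS log from the CIFS mount into HDFS.
# The u_ex<yymmddhh>.log pattern, /mountpoint and iislog are assumptions.
HOUR=$(date -d '-1 hour' +"%y%m%d%H")
SRC="/mountpoint/u_ex${HOUR}.log"

if [ ! -f "$SRC" ]; then
    echo "==> No log found at $SRC"
    exit 1
fi

hadoop dfs -copyFromLocal "$SRC" iislog \
    && echo "==> Copied $SRC into HDFS" \
    || { echo "==> Copy failed for $SRC"; exit 1; }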
Defining Hive Tables for IIS Log Parsing
To analyze IIS logs, start by defining the raw table structure. The schema must match the W3C field layout, and the fields are space-separated: Hive treats a purely numeric delimiter string such as '32' as an ASCII code, so FIELDS TERMINATED BY '32' in the DDL below means the fields are terminated by spaces.
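For reference, the raw table below mirrors the W3C #Fields directive; with the field selection used here, the directive looks roughly like this (the exact list depends on which fields were enabled in IIS Manager):
#Fields: date time s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) cs(Cookie) cs(Referer) sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken
Keep in mind that IIS also writes #Software, #Version, #Date and #Fields header lines at the top of each log file; these will appear as rows in the raw table unless they are filtered out downstream.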
Raw IIS Log Table
hive> CREATE TABLE iislog (
sdate STRING,
stime STRING,
ssitename STRING,
scomputername STRING,
sip STRING,
csmethod STRING,
csuristem STRING,
csuriquery STRING,
sport INT,
csusername STRING,
cip STRING,
csversion STRING,
csuseragent STRING,
csCookie STRING,
csReferer STRING,
scstatus INT,
scsubstatus INT,
scwin32status INT,
scbyte INT,
csbytes INT,
timetaken INT
)
PARTITIONED BY (time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '32'
STORED AS TEXTFILE;
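Because iislog is partitioned by time, raw files copied into HDFS still need to be attached to a partition before they are visible to queries. One way to do that is LOAD DATA INPATH; the sketch below is illustrative only, with a made-up file name and partition value:
# Attach an already-ingested HDFS file to a per-minute partition of iislog.
# The file name and the partition value are placeholders.
hive -e "LOAD DATA INPATH 'iislog/u_ex11100710.log'
         INTO TABLE iislog PARTITION (time='201110071000')"
LOAD DATA INPATH moves the file from its current HDFS location into the partition's warehouse directory, so the copy step above only needs to land the file somewhere in HDFS.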
Aggregated Traffic Table
hive> CREATE TABLE iptraffic (
sdate STRING,
cip STRING,
traffic INT,
hits INT,
appid STRING,
scsuseragent STRING
)
PARTITIONED BY (time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '124'
STORED AS TEXTFILE;
This table will store summarized traffic per client IP, per minute, with user agent and application metadata. Its delimiter, '124', is the ASCII code for the pipe character, which keeps the output files easy to read.
Automating Partition Creation and ETL
To perform rolling analysis, the workflow creates a Hive partition for the minute that lies two minutes in the past (the lag gives the logs for that minute time to arrive), then runs an INSERT OVERWRITE statement into iptraffic.
Below is an example automation script:
#!/bin/bash
TABLE=IPTRAFFIC
DATEPAR=$(date -d '-2 min' +"%Y%m%d%H%M")
DATEPATH=$(date -d '-2 min' +"%Y-%m-%d/%H00/%M")
SDATE=$(date -d '-2 min' +"%Y-%m-%d")
STIME=$(date -d '-2 min' +"%H:%M")
# Create the partition
hive -e "ALTER TABLE iptraffic ADD IF NOT EXISTS PARTITION (time='$DATEPAR')"
if [ $? -ne 0 ]; then
echo "Couldn't create partition"
exit 1
else
echo "==> PARTITION (time='$DATEPAR') created"
fi
# Insert summarized traffic
hive -e "INSERT OVERWRITE TABLE iptraffic PARTITION (time='$DATEPAR')
SELECT concat('$SDATE ','$STIME:00'),
cip,
sum(csbytes)/1024 as counter,
count(1) as hits,
ssitename,
csuseragent
FROM iislog
WHERE iislog.time='$DATEPAR'
AND NOT (iislog.cip LIKE '192.%')
GROUP BY cip,
concat('$SDATE ','$STIME:00'),
csuseragent,
ssitename"
if [ $? -ne 0 ]; then
echo '==> An error occurred in analysis'
exit 1
else
echo '==> Insert analysis successful'
fi
The script performs three core tasks:
- Calculate timestamps for the relevant partition
- Create the Hive partition if missing
- Aggregate traffic and write the result into the warehouse directory
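To keep the rolling window current, the script has to run once a minute. A minimal crontab sketch, assuming it is saved as /usr/local/bin/iptraffic_etl.sh (the path and the log file are placeholders):
# Run the ETL script every minute and append its output to a log file.
* * * * * /usr/local/bin/iptraffic_etl.sh >> /var/log/iptraffic_etl.log 2>&1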
Inspecting Raw and Processed Data
View a sample from the raw IIS log table:
hive> SELECT * FROM iislog LIMIT 10;
Example output snippet:
20110928130000 2011-09-28 10:59:06 W3SVC2 IISTEST xxx.xxx.xxx.xxx GET /images/bluebox.gif - 80 - \
xxx.xxx.xxx.xxx HTTP/1.1 Mozilla/5.0 ... 6313985 NULL 200 0 0 551 1689 201109281300
Now check the aggregated traffic partition:
hadoop dfs -cat /user/hive/warehouse/iptraffic/time=201110071059/* | less
Example output:
2011-10-07 10:59:00|xxx.xxx.xxx.xxx|18|2|W3SVC5|Mozilla/5.0 (MSIE 9.0 ...)
2011-10-07 10:59:00|xxx.xxx.xxx.xxx|1|2|W3SVC7|Mozilla/5.0 (Chrome/14 ...)
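From here, reporting is plain HiveQL over the aggregated table. As a quick example, the sketch below ranks client IPs by traffic within a single minute; the partition value is illustrative:
hive -e "SELECT cip, SUM(traffic) AS total_kb, SUM(hits) AS total_hits
         FROM iptraffic
         WHERE time='201110071059'
         GROUP BY cip
         ORDER BY total_kb DESC
         LIMIT 10"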
The dataset is now structured, partitioned and ready for dashboarding, reporting or downstream machine learning workloads.
Why This Pattern Still Matters
Although modern data pipelines often use tools like Fluent Bit, Kafka, Spark, Iceberg and cloud-native ingestion services, the principles illustrated here remain foundational:
- Normalize logs into structured tables
- Use time-based partitioning for fast filtering
- Apply lightweight automation to maintain rolling windows of data
This workflow helped many teams operationalize IIS analytics long before contemporary lakehouse tooling existed.
If you need help with distributed systems, backend engineering, or data platforms, check my Services.