Back in 2011, Microsoft shipped a dedicated SQL Server–Hadoop Connector that plugged into Apache Sqoop as an additional connector package. It automated import and export between SQL Server, HDFS, and Hive, and required a separate download plus the Microsoft JDBC driver.
Today, the situation is different:
- Apache Sqoop is retired and has been moved to the Apache Attic (June 2021). It still works but is no longer actively developed or recommended for new projects.
- The original SQL Server–Hadoop Connector tarball is effectively legacy; most distributions simply use Sqoop’s generic JDBC support plus the Microsoft JDBC driver.
- Modern stacks often use Kafka Connect, Spark-based ingestion, or managed services (e.g. Azure Data Factory, Sqoop on HDInsight) to move data between SQL Server and data lakes.
However, many clusters still run Sqoop, and a lot of people land here from search. So this article keeps the core idea—Sqoop <=> SQL Server—but refreshes the details and points out what is legacy versus current practice.
1. Legacy SQL Server–Hadoop Connector (Historical Context)
Originally, Microsoft provided a separate package named something like sqoop-sqlserver-1.0.tar.gz. It contained:
- install.sh – installer that dropped the connector jar and config into your Sqoop installation
- lib/ – Sqoop connector jar
- conf/ – connector configuration
- PDF user guide and license files
The setup looked roughly like this:
# 1. Install Microsoft JDBC driver
# (e.g. sqljdbc4.jar) into $SQOOP_HOME/lib
# 2. Unpack the connector
tar -zxvf sqoop-sqlserver-1.0.tar.gz
cd sqoop-sqlserver-1.0
# 3. Set MSSQL_CONNECTOR_HOME and run installer
export MSSQL_CONNECTOR_HOME=/usr/sqoop/connector/mssql
sh ./install.sh
On modern systems, you will rarely use this legacy connector. Sqoop’s generic JDBC support plus the SQL Server JDBC driver is enough for most use cases; vendor distributions that still ship Sqoop often embed any required SQL Server integration out of the box.
2. Modern Setup: Sqoop + Microsoft JDBC Driver
If you still run Sqoop and want to integrate with SQL Server, the minimal and current approach is:
- Install Sqoop (usually as part of a Hadoop distribution such as HDP, CDH, or CDP).
- Download the latest Microsoft JDBC Driver for SQL Server.
- Copy the appropriate driver jar (for Java 8, 11, etc.) into $SQOOP_HOME/lib.
Example (simplified):
# Download and unpack the JDBC driver (version/name will vary;
# recent releases ship jars named mssql-jdbc-<version>.jreN.jar)
tar -zxvf sqljdbc_<version>_enu.tar.gz
cp sqljdbc_<version>/enu/mssql-jdbc-<version>.jre8.jar $SQOOP_HOME/lib
# Verify Sqoop sees the driver
sqoop list-databases \
--connect "jdbc:sqlserver://<HOST>:1433;database=<DB>" \
--username dbuser \
--password dbpasswd
At this point, Sqoop uses the Microsoft JDBC driver via the standard connector mechanism; no extra Microsoft “Sqoop adapter” is strictly required.
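Recent Sqoop 1.4.x releases map jdbc:sqlserver:// URLs to a built-in SQL Server connection manager. If yours does not, or you want to force the generic JDBC path, you can name the driver class explicitly. A minimal connectivity check as a sketch, assuming the driver jar is already in $SQOOP_HOME/lib:
# Quick connectivity check through the generic JDBC path;
# passing --driver makes Sqoop fall back to its GenericJdbcManager
sqoop eval \
--connect "jdbc:sqlserver://<HOST>:1433;database=<DB>" \
--driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--username dbuser \
--password dbpasswd \
--query "SELECT @@VERSION"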
3. Importing from SQL Server to HDFS with Sqoop
The original article showed a simple example that split work across three mappers. A modernized version looks like this:
sqoop import \
--connect "jdbc:sqlserver://<HOST>:1433;database=<DB>" \
--username dbuser \
--password dbpasswd \
--table <TABLE> \
--target-dir /path/to/hdfs/dir \
--split-by <PRIMARY_KEY_COLUMN> \
-m 3
Notes:
- --split-by should be a numeric, evenly distributed column (typically the primary key).
- The number of mappers (-m) should match available cluster capacity and database limits.
- For large tables, consider incremental imports (--incremental, --check-column, --last-value); a sketch follows below.
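As a sketch of an append-style incremental import (the check column and starting value are placeholders you would adapt):
# Incremental append: only rows where <ID_COLUMN> is greater than the
# recorded --last-value are imported on each run
sqoop import \
--connect "jdbc:sqlserver://<HOST>:1433;database=<DB>" \
--username dbuser \
--password dbpasswd \
--table <TABLE> \
--target-dir /path/to/hdfs/dir \
--incremental append \
--check-column <ID_COLUMN> \
--last-value 0 \
-m 3
Wrapping this in a saved job (sqoop job --create ...) lets Sqoop track the last imported value between runs instead of you passing --last-value by hand.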
The original PoC reported roughly one minute for ~1.3 GB over three mappers. Real-world performance depends heavily on network bandwidth, database tuning, transaction isolation, and how busy the cluster is.
4. Exporting from HDFS back to SQL Server
The reverse direction—writing processed data back to SQL Server—uses sqoop export:
sqoop export \
--connect "jdbc:sqlserver://<HOST>:1433;database=<DB>" \
--username dbuser \
--password dbpasswd \
--table <TARGET_TABLE> \
--export-dir /path/from/hdfs/dir \
--batch
The original example used --direct, which leveraged SQL Server bulk APIs through the connector. On some modern distributions and driver combinations this is no longer supported or behaves differently. When in doubt:
- Start with a standard JDBC-based export (no --direct).
- Use --batch for better throughput, but monitor transaction logs and locking.
- Make sure the HDFS files and the table schema (columns, types, nullability) align closely; a sketch follows below.
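As a sketch of that last point, an export can state its expectations explicitly; the column list, field delimiter, and null tokens here are assumptions you would adapt to your data:
# Spell out columns, field delimiter, and null handling so schema
# mismatches fail loudly instead of loading corrupted rows
sqoop export \
--connect "jdbc:sqlserver://<HOST>:1433;database=<DB>" \
--username dbuser \
--password dbpasswd \
--table <TARGET_TABLE> \
--export-dir /path/from/hdfs/dir \
--columns "id,name,updated_at" \
--input-fields-terminated-by '\t' \
--input-null-string '\\N' \
--input-null-non-string '\\N' \
--batch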
5. Security and Connectivity Considerations
For anything beyond quick PoCs:
- Use encrypted connections to SQL Server (JDBC connection string with SSL/TLS options).
- Consider Windows/AD authentication via Kerberos if supported by your environment and driver.
- Lock down Sqoop hosts so the JDBC URL, user, and password are not world-readable (no plain-text passwords on the command line or in shared shell histories); a sketch follows this list.
- Use dedicated service logins with restricted privileges instead of full dbo access.
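A minimal sketch combining the encryption and credential points above; the password file location and the TLS properties in the connection string are assumptions you would adjust to your environment:
# Keep the password in a restricted file instead of on the command line
echo -n 'dbpasswd' > sqoop.password
hdfs dfs -put sqoop.password /user/sqoop/sqoop.password
hdfs dfs -chmod 400 /user/sqoop/sqoop.password

# encrypt=true requests TLS from the SQL Server JDBC driver;
# --password-file keeps the secret out of shell history and process lists
sqoop import \
--connect "jdbc:sqlserver://<HOST>:1433;database=<DB>;encrypt=true;trustServerCertificate=false" \
--username dbuser \
--password-file /user/sqoop/sqoop.password \
--table <TABLE> \
--target-dir /path/to/hdfs/dir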
6. Where This Fits in Modern Architectures
Apache Sqoop’s mission was to be the bulk data transfer tool between Hadoop and relational databases. That mission is effectively complete: Sqoop is now in the Apache Attic and no longer actively developed.
In modern lakehouse and cloud-native architectures, you will see more of:
- Kafka Connect JDBC for continuous CDC or streaming-style ingestion.
- Spark / Flink connectors for SQL Server and other RDBMS.
- Azure Data Factory, Synapse pipelines, or HDInsight Sqoop wrappers for managed bulk transfers with monitoring and orchestration.
However, many on-premises or long-lived Hadoop clusters still rely on Sqoop jobs. For those environments, the pattern in this article remains a pragmatic way to move data between SQL Server and HDFS, as long as you understand that you are building on top of a retired but still functional tool.
Summary
The original 2011 post focused on Microsoft’s SQL Server–Hadoop Connector and showed how to move data between SQL Server and HDFS via Sqoop. This refreshed version:
- Explains that Sqoop is now retired and the Microsoft connector is legacy.
- Shows how to integrate Sqoop with modern SQL Server JDBC drivers.
- Updates the import/export examples and flags (--batch, careful use of --direct).
- Places the pattern in the context of modern ingestion options.
If you are maintaining long-lived Hadoop clusters, this pattern is still useful. For completely new architectures, you should also evaluate more modern, managed, and streaming-friendly ingestion options.
If you need help with distributed systems, backend engineering, or data platforms, check my Services.