Sqoop [1] ("SQL to Hadoop") makes it easy to connect an RDBMS to a Hadoop infrastructure. The newest connector comes from Microsoft and links MS SQL Server and Hadoop together. A cool feature is that Sqoop can generate a jar file from your job; it's pretty easy, just one line:
sqoop export --connect jdbc:<RDBMS>:thin:@<HOSTNAME>:<PORT>:<DB-NAME> --table <TABLENAME> --username <USERNAME> --password <PASSWORD> --export-dir <HDFS DIR WHICH CONTAINS DATA> --direct --fields-terminated-by '<TERMINATOR (Java)>' --package-name <JOBNAME>.<IDENTIFIER> --outdir <WHERE THE JAR SHOULD BE WRITTEN> --bindir <BIN_DIR>
After the job has run, you'll find a jar package in --outdir. Unzip it and you'll find your Java code and the precompiled class, so you can start to tune them.
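Filled in with hypothetical values (the host, schema, credentials, and paths below are made up for illustration, not from a real cluster), the code-generation run might look like this:

```shell
# All values below are placeholders for illustration only.
SQOOP_CMD="sqoop export \
  --connect jdbc:oracle:thin:@dbhost.example.com:1521:ORCL \
  --table SALES \
  --username scott --password tiger \
  --export-dir /user/hadoop/sales \
  --direct \
  --fields-terminated-by '\t' \
  --package-name export.sales \
  --outdir /tmp/sqoop-src \
  --bindir /tmp/sqoop-bin"
echo "$SQOOP_CMD"
```

With --package-name export.sales, the generated record class ends up in that Java package, and the sources and compiled artifacts land in --outdir and --bindir respectively.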
Now let's start the job again, but with the precompiled class:
sqoop export --connect jdbc:<RDBMS>:thin:@<HOSTNAME>:<PORT>:<DB-NAME> --table <TABLENAME> --username <USERNAME> --password <PASSWORD> --export-dir <HDFS DIR WHICH CONTAINS DATA> --direct --fields-terminated-by '<TERMINATOR (Java)>' --jar-file <PATH/TO/JAR> --class-name <JOBNAME>.<IDENTIFIER>.<CLASSNAME>
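With hypothetical values filled in (everything below is made up for illustration), rerunning against the generated jar could look like this — note that --jar-file and --class-name replace --package-name/--outdir/--bindir:

```shell
# All values below are placeholders for illustration only.
SQOOP_RERUN="sqoop export \
  --connect jdbc:oracle:thin:@dbhost.example.com:1521:ORCL \
  --table SALES \
  --username scott --password tiger \
  --export-dir /user/hadoop/sales \
  --direct \
  --fields-terminated-by '\t' \
  --jar-file /tmp/sqoop-bin/SALES.jar \
  --class-name export.sales.SALES"
echo "$SQOOP_RERUN"
```

The class name follows the <JOBNAME>.<IDENTIFIER>.<CLASSNAME> pattern from the template above: the package you chose at codegen time plus the generated record class.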
The step above can speed up the export of large datasets dramatically. For example, it cut an export of 100k records from HDFS into an Oracle DB from 16 seconds down to 8 seconds for me.