Example SQOOP code to work with Split-by and boundary-query:
//Split-by key : If we dont have any primary key in the table, we can go with split-by
sqoop import \
-connect jdbc:mysql://localhost:3306/retail_db \
-username root \
-password cloudera \
-table products \
-target-dir /user/cloudera/prod_slitt \
-delete-target-dir \
-split-by product_id
Retrieved 1345 records.
[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/prod_slitt/*
-rw-r--r-- 1 cloudera cloudera 0 2020-08-03 00:02 /user/cloudera/prod_slitt/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 41419 2020-08-03 00:02 /user/cloudera/prod_slitt/part-m-00000
-rw-r--r-- 1 cloudera cloudera 43660 2020-08-03 00:02 /user/cloudera/prod_slitt/part-m-00001
-rw-r--r-- 1 cloudera cloudera 42195 2020-08-03 00:02 /user/cloudera/prod_slitt/part-m-00002
-rw-r--r-- 1 cloudera cloudera 46719 2020-08-03 00:02 /user/cloudera/prod_slitt/part-m-00003
// do line count
$ hdfs dfs -cat /user/cloudera/prod_slitt/* | wc -l
1345
// split-by along with boundary-query to identify min and max values of product_id
sqoop import \
-connect jdbc:mysql://localhost:3306/retail_db \
-username root \
-password cloudera \
-table products \
-target-dir /user/cloudera/prod_slitt_v1 \
-delete-target-dir \
-boundary-query 'select min(product_id), max(product_id)from products where product_id > 100' \
-split-by product_id
BoundingValsQuery: select min(product_id), max(product_id)from products where product_id > 100
20/08/03 00:08:08 INFO db.IntegerSplitter: Split size: 311; Num splits: 4 from: 101 to: 1345
Retrieved 1245 records.
hdfs dfs -ls /user/cloudera/prod_slitt_v1
Found 5 items
-rw-r--r-- 1 cloudera cloudera 0 2020-08-03 00:09 /user/cloudera/prod_slitt_v1/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 37832 2020-08-03 00:08 /user/cloudera/prod_slitt_v1/part-m-00000
-rw-r--r-- 1 cloudera cloudera 40905 2020-08-03 00:08 /user/cloudera/prod_slitt_v1/part-m-00001
-rw-r--r-- 1 cloudera cloudera 38828 2020-08-03 00:08 /user/cloudera/prod_slitt_v1/part-m-00002
-rw-r--r-- 1 cloudera cloudera 43714 2020-08-03 00:08 /user/cloudera/prod_slitt_v1/part-m-00003
// do line count
hdfs dfs -cat /user/cloudera/prod_slitt_v1/* | wc -l
1245
No comments:
Post a Comment