Monday, 3 August 2020

Split-by and boundary-query example in SQOOP

Example SQOOP code to work with Split-by and boundary-query:

//Split-by key : If we dont have any primary key in the table, we can go with split-by
sqoop import \
-connect jdbc:mysql://localhost:3306/retail_db \
-username root \
-password cloudera \
-table products \
-target-dir /user/cloudera/prod_slitt \
-delete-target-dir \
-split-by product_id 

Retrieved 1345 records.

[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/prod_slitt/* 
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 00:02 /user/cloudera/prod_slitt/_SUCCESS
-rw-r--r--   1 cloudera cloudera      41419 2020-08-03 00:02 /user/cloudera/prod_slitt/part-m-00000
-rw-r--r--   1 cloudera cloudera      43660 2020-08-03 00:02 /user/cloudera/prod_slitt/part-m-00001
-rw-r--r--   1 cloudera cloudera      42195 2020-08-03 00:02 /user/cloudera/prod_slitt/part-m-00002
-rw-r--r--   1 cloudera cloudera      46719 2020-08-03 00:02 /user/cloudera/prod_slitt/part-m-00003

// do line count
$ hdfs dfs -cat /user/cloudera/prod_slitt/* | wc -l
1345


// split-by along with boundary-query to identify min and max values of product_id
sqoop import \
-connect jdbc:mysql://localhost:3306/retail_db \
-username root \
-password cloudera \
-table products \
-target-dir /user/cloudera/prod_slitt_v1 \
-delete-target-dir \
-boundary-query 'select min(product_id), max(product_id)from products where product_id > 100' \
-split-by product_id


 BoundingValsQuery: select min(product_id), max(product_id)from products where product_id > 100
20/08/03 00:08:08 INFO db.IntegerSplitter: Split size: 311; Num splits: 4 from: 101 to: 1345
 Retrieved 1245 records.
 
 hdfs dfs -ls /user/cloudera/prod_slitt_v1
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 00:09 /user/cloudera/prod_slitt_v1/_SUCCESS
-rw-r--r--   1 cloudera cloudera      37832 2020-08-03 00:08 /user/cloudera/prod_slitt_v1/part-m-00000
-rw-r--r--   1 cloudera cloudera      40905 2020-08-03 00:08 /user/cloudera/prod_slitt_v1/part-m-00001
-rw-r--r--   1 cloudera cloudera      38828 2020-08-03 00:08 /user/cloudera/prod_slitt_v1/part-m-00002
-rw-r--r--   1 cloudera cloudera      43714 2020-08-03 00:08 /user/cloudera/prod_slitt_v1/part-m-00003

// do line count
hdfs dfs -cat /user/cloudera/prod_slitt_v1/* | wc -l
1245

No comments:

Post a Comment

Flume - Simple Demo

// create a folder in hdfs : $ hdfs dfs -mkdir /user/flumeExa // Create a shell script which generates : Hadoop in real world <n>...