
Saturday, 8 August 2020

SQOOP Export with Staging table Example - HDFS to MySQL data export using SQOOP with staging-table

mysql> use retail_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A


Database changed
mysql> show tables;
+---------------------+
| Tables_in_retail_db |
+---------------------+
| categories          |
| customers           |
| departments         |
| order_items         |
| orders              |
| products            |
+---------------------+
6 rows in set (0.00 sec)


mysql> describe orders;
+-------------------+-------------+------+-----+---------+----------------+
| Field             | Type        | Null | Key | Default | Extra          |
+-------------------+-------------+------+-----+---------+----------------+
| order_id          | int(11)     | NO   | PRI | NULL    | auto_increment |
| order_date        | datetime    | NO   |     | NULL    |                |
| order_customer_id | int(11)     | NO   |     | NULL    |                |
| order_status      | varchar(45) | NO   |     | NULL    |                |
+-------------------+-------------+------+-----+---------+----------------+




// create a staging table with the same structure as the orders table


mysql> create table retail_db.orders_staging as select * from retail_db.orders where 1 = 0;
Query OK, 0 rows affected (0.10 sec)
Records: 0 Duplicates: 0 Warnings: 0


// create a new table, orders_exported, which has the same structure as the retail_db.orders table


mysql> create table retail_db.orders_exported as select * from retail_db.orders where 1 = 0;
Query OK, 0 rows affected (0.00 sec)
Records: 0 Duplicates: 0 Warnings: 0






// verify the structure of the newly created orders_staging table
mysql> describe orders_staging;
+-------------------+-------------+------+-----+---------+-------+
| Field             | Type        | Null | Key | Default | Extra |
+-------------------+-------------+------+-----+---------+-------+
| order_id          | int(11)     | NO   |     | 0       |       |
| order_date        | datetime    | NO   |     | NULL    |       |
| order_customer_id | int(11)     | NO   |     | NULL    |       |
| order_status      | varchar(45) | NO   |     | NULL    |       |
+-------------------+-------------+------+-----+---------+-------+
4 rows in set (0.00 sec)


// I already have the orders data in HDFS
$ hdfs dfs -ls /user/cloudera/orders
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-06 05:52 /user/cloudera/orders/_SUCCESS
-rw-r--r--   1 cloudera cloudera     741614 2020-08-06 05:52 /user/cloudera/orders/part-m-00000
-rw-r--r--   1 cloudera cloudera     753022 2020-08-06 05:52 /user/cloudera/orders/part-m-00001
-rw-r--r--   1 cloudera cloudera     752368 2020-08-06 05:52 /user/cloudera/orders/part-m-00002
-rw-r--r--   1 cloudera cloudera     752940 2020-08-06 05:52 /user/cloudera/orders/part-m-00003




// Export HDFS data into the MySQL table using the staging table approach
sqoop export \
--connect "jdbc:mysql://localhost/retail_db" \
--username root \
--password cloudera \
--table orders_exported \
--export-dir hdfs://localhost:8020/user/cloudera/orders/ \
--staging-table orders_staging \
--clear-staging-table \
--input-fields-terminated-by ','




20/08/08 00:24:45 INFO mapreduce.ExportJobBase: Exported 68883 records.
20/08/08 00:24:45 INFO mapreduce.ExportJobBase: Starting to migrate data from staging table to destination.
20/08/08 00:24:45 INFO manager.SqlManager: Migrated 68883 records from `orders_staging` to `orders_exported`




mysql> select * from retail_db.orders_exported limit 5;
+----------+---------------------+-------------------+-----------------+
| order_id | order_date          | order_customer_id | order_status    |
+----------+---------------------+-------------------+-----------------+
|        1 | 2013-07-25 00:00:00 |             11599 | CLOSED          |
|        2 | 2013-07-25 00:00:00 |               256 | PENDING_PAYMENT |
|        3 | 2013-07-25 00:00:00 |             12111 | COMPLETE        |
|        4 | 2013-07-25 00:00:00 |              8827 | CLOSED          |
|        5 | 2013-07-25 00:00:00 |             11318 | COMPLETE        |
+----------+---------------------+-------------------+-----------------+
5 rows in set (0.00 sec)


mysql> select count(1) from retail_db.orders_exported;
+----------+
| count(1) |
+----------+
|    68883 |
+----------+
1 row in set (0.01 sec)




Line count in HDFS:
$ hdfs dfs -cat hdfs://localhost:8020/user/cloudera/orders/* | wc -l
68883
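

After the migration, the staging table should be empty again (--clear-staging-table also empties it before each run). A quick check, a sketch assuming the same MySQL credentials used above:

$ mysql -uroot -pcloudera -e "select count(1) from retail_db.orders_staging"
// expected: 0, since every row was migrated into orders_exported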







Friday, 7 August 2020

Import all the tables from MySQL to Hive with Snappy Compression and Parquet file creation using SQOOP

mysql> use retail_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+---------------------+
| Tables_in_retail_db |
+---------------------+
| categories          |
| customers           |
| departments         |
| order_items         |
| orders              |
| products            |
+---------------------+
6 rows in set (0.00 sec)


Here we are going to import all the tables of the retail_db database of MySQL into Hive.

Additional options used: Snappy compression and saving as Parquet files.


// Drop the database in Hive if it already exists

hive> drop database ohm cascade;
OK
Time taken: 1.187 seconds

hive> create database ohm;
OK


sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--warehouse-dir /user/hive/warehouse \
-m 1 \
--hive-database ohm \
--hive-import \
--hive-overwrite \
--create-hive-table \
--compress \
--compression-codec snappy \
--as-parquetfile \
--outdir java_out





hive> use ohm;
OK
Time taken: 0.01 seconds

hive> show tables;
OK
categories
customers
departments
order_items
orders
products
Time taken: 0.029 seconds, Fetched: 6 row(s)


hive> describe database ohm;
OK
ohm hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db cloudera USER
Time taken: 0.011 seconds, Fetched: 1 row(s)
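

To confirm the imported tables really are Parquet files, a sketch that lists one table's files under the warehouse path reported above (the Snappy codec is recorded inside the Parquet pages, so check it with parquet-tools meta rather than the file name):

$ hdfs dfs -ls /user/hive/warehouse/ohm.db/orders
// data files should carry the .parquet extension
$ parquet-tools meta hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db/orders/<some-file>.parquet
// look for SNAPPY in the column chunk metadata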




The following shows how to import the tables one by one (see also the loop sketch after these commands):

// note: each import needs its own --target-dir (the staging directory for the Hive load)

sqoop import \
--connect "jdbc:mysql://localhost:3306/retail_db" \
--username root \
--password cloudera \
--table categories \
--target-dir /user/cloudera/retail_db/categories \
--hive-import \
--create-hive-table \
--hive-database ohm \
--hive-table categories;


sqoop import \
--connect "jdbc:mysql://localhost:3306/retail_db" \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/retail_db/customers \
--hive-import \
--create-hive-table \
--hive-database ohm \
--hive-table customers;


sqoop import \
--connect "jdbc:mysql://localhost:3306/retail_db" \
--username root \
--password cloudera \
--table departments \
--target-dir /user/cloudera/retail_db/dept \
--hive-import \
--create-hive-table \
--hive-database ohm \
--hive-table departments;


sqoop import \
--connect "jdbc:mysql://localhost:3306/retail_db" \
--username root \
--password cloudera \
--table order_items \
--target-dir /user/cloudera/retail_db/order_items \
--hive-import \
--create-hive-table \
--hive-database ohm \
--hive-table order_items;


sqoop import \
--connect "jdbc:mysql://localhost:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--target-dir /user/cloudera/retail_db/orders \
--hive-import \
--create-hive-table \
--hive-database ohm \
--hive-table orders;


sqoop import \
--connect "jdbc:mysql://localhost:3306/retail_db" \
--username root \
--password cloudera \
--table products \
--target-dir /user/cloudera/retail_db/products \
--hive-import \
--create-hive-table \
--hive-database ohm \
--hive-table products;
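

Since the six commands above differ only in the table name, they can be scripted; a minimal shell sketch assuming per-table target directories under /user/cloudera/retail_db:

for t in categories customers departments order_items orders products; do
  sqoop import \
  --connect "jdbc:mysql://localhost:3306/retail_db" \
  --username root \
  --password cloudera \
  --table "$t" \
  --target-dir "/user/cloudera/retail_db/$t" \
  --hive-import \
  --create-hive-table \
  --hive-database ohm \
  --hive-table "$t"
done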





// Or import everything in one shot; import-all-tables takes --warehouse-dir
// (each table lands in its own subdirectory) and no per-table --table/--hive-table options
sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--warehouse-dir /user/cloudera/retail_db \
--hive-import \
--create-hive-table \
--hive-database ohm


hive> show databases;
OK
default
ohm


hive> use ohm;
OK



hive> show tables;
OK
categories
cust_parquet
customer_parquet_snappy
customers
departments
order_items
person_fixed
products


 

Create, Execute, Delete SQOOP Jobs

// I have an ohm.person table in MySQL with the following records

mysql> select * from ohm.person;
+----+-----------------+---------------------+
| id | name            | last_mod_dt         |
+----+-----------------+---------------------+
|  1 | Raja            | 2020-08-07 01:17:17 |
|  2 | Ravi            | 2020-08-07 01:17:30 |
|  3 | Kalai           | 2020-08-07 01:17:34 |
|  4 | Sudha           | 2020-08-07 01:17:39 |
|  5 | Priya           | 2020-08-07 01:31:28 |
|  6 | Vanitha         | 2020-08-07 01:31:34 |
|  7 | Kasturi         | 2020-08-07 01:31:40 |
|  8 | Lakshmi         | 2020-08-07 01:31:45 |
|  9 | Suriya Devi     | 2020-08-07 01:31:51 |
| 10 | Nanjil Vijayan  | 2020-08-07 01:40:53 |
| 11 | Elizabeth Helen | 2020-08-07 01:41:14 |
| 12 | Peter Paul      | 2020-08-07 01:41:20 |
| 13 | Ravindran       | 2020-08-07 01:41:35 |
+----+-----------------+---------------------+




// Create a new Sqoop job:

$ sqoop job \
--create person_inc_job \
-- import \
--connect jdbc:mysql://localhost:3306/ohm \
--table person \
--username root \
--password cloudera \
--check-column last_mod_dt \
--incremental append \
--last-value '01:17:17' \
--target-dir /user/cloudera/person_test \
-m 1



// Display all job names already created

$ sqoop job --list

Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
20/08/07 02:05:54 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
Available jobs:
  person_inc_job


// delete the target folder if it exists

$ hdfs dfs -rm -r /user/cloudera/person_test
rm: `/user/cloudera/person_test': No such file or directory


// run the sqoop job for the first time

$ sqoop job --exec person_inc_job

Retrieved 13 records.



$ hdfs dfs -cat /user/cloudera/person_test/*
1,Raja,2020-08-07 01:17:17.0
2,Ravi,2020-08-07 01:17:30.0
3,Kalai,2020-08-07 01:17:34.0
4,Sudha,2020-08-07 01:17:39.0
5,Priya,2020-08-07 01:31:28.0
6,Vanitha,2020-08-07 01:31:34.0
7,Kasturi,2020-08-07 01:31:40.0
8,Lakshmi,2020-08-07 01:31:45.0
9,Suriya Devi,2020-08-07 01:31:51.0
10,Nanjil Vijayan,2020-08-07 01:40:53.0
11,Elizabeth Helen,2020-08-07 01:41:14.0
12,Peter Paul,2020-08-07 01:41:20.0
13,Ravindran,2020-08-07 01:41:35.0



// Now add some more records in MySQL

mysql> insert into ohm.person(name) values ('Anbu');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Sudha');
Query OK, 1 row affected (0.01 sec)

mysql> insert into ohm.person(name) values ('Aish');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Vijay');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Balaji');
Query OK, 1 row affected (0.01 sec)



mysql> select * from ohm.person;  -- MySQL console
+----+-----------------+---------------------+
| id | name            | last_mod_dt         |
+----+-----------------+---------------------+
|  1 | Raja            | 2020-08-07 01:17:17 |
|  2 | Ravi            | 2020-08-07 01:17:30 |
|  3 | Kalai           | 2020-08-07 01:17:34 |
|  4 | Sudha           | 2020-08-07 01:17:39 |
|  5 | Priya           | 2020-08-07 01:31:28 |
|  6 | Vanitha         | 2020-08-07 01:31:34 |
|  7 | Kasturi         | 2020-08-07 01:31:40 |
|  8 | Lakshmi         | 2020-08-07 01:31:45 |
|  9 | Suriya Devi     | 2020-08-07 01:31:51 |
| 10 | Nanjil Vijayan  | 2020-08-07 01:40:53 |
| 11 | Elizabeth Helen | 2020-08-07 01:41:14 |
| 12 | Peter Paul      | 2020-08-07 01:41:20 |
| 13 | Ravindran       | 2020-08-07 01:41:35 |  -- last value
| 14 | Anbu            | 2020-08-07 02:23:25 |  -- newly added records
| 15 | Sudha           | 2020-08-07 02:23:29 |
| 16 | Aish            | 2020-08-07 02:23:36 |
| 17 | Vijay           | 2020-08-07 02:23:44 |
| 18 | Balaji          | 2020-08-07 02:23:47 |
+----+-----------------+---------------------+
18 rows in set (0.00 sec)



// run the sqoop job again to fetch the incremental records

$ sqoop job --exec person_inc_job

Retrieved 5 records.


// Display the imported data available in HDFS

$ hdfs dfs -cat /user/cloudera/person_test/*
1,Raja,2020-08-07 01:17:17.0
2,Ravi,2020-08-07 01:17:30.0
3,Kalai,2020-08-07 01:17:34.0
4,Sudha,2020-08-07 01:17:39.0
5,Priya,2020-08-07 01:31:28.0
6,Vanitha,2020-08-07 01:31:34.0
7,Kasturi,2020-08-07 01:31:40.0
8,Lakshmi,2020-08-07 01:31:45.0
9,Suriya Devi,2020-08-07 01:31:51.0
10,Nanjil Vijayan,2020-08-07 01:40:53.0
11,Elizabeth Helen,2020-08-07 01:41:14.0
12,Peter Paul,2020-08-07 01:41:20.0
13,Ravindran,2020-08-07 01:41:35.0 -- loaded by the first (full) run
14,Anbu,2020-08-07 02:23:25.0   -- incremental records
15,Sudha,2020-08-07 02:23:29.0
16,Aish,2020-08-07 02:23:36.0
17,Vijay,2020-08-07 02:23:44.0
18,Balaji,2020-08-07 02:23:47.0
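

// Between runs, sqoop job --show prints the job's stored definition, including
// the incremental.last.value that Sqoop updates after every successful --exec:

$ sqoop job --show person_inc_job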



// To delete the sqoop job:

$ sqoop job --delete person_inc_job

Incremental Import in SQOOP with Date column

// Create a database named : ohm
// Create a table named : person, with id (auto increment), name (string), and last_mod_dt (defaults to current_timestamp)

mysql> create database if not exists ohm;
Query OK, 1 row affected (0.00 sec)

mysql> use ohm;

CREATE TABLE if not exists ohm.person (
  id BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
  name VARCHAR(500) NOT NULL,
  last_mod_dt TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id),
  UNIQUE KEY person (last_mod_dt)
);


mysql> describe ohm.person;
+-------------+---------------------+------+-----+-------------------+----------------+
| Field       | Type                | Null | Key | Default           | Extra          |
+-------------+---------------------+------+-----+-------------------+----------------+
| id          | bigint(20) unsigned | NO   | PRI | NULL              | auto_increment |
| name        | varchar(500)        | NO   |     | NULL              |                |
| last_mod_dt | timestamp           | NO   | UNI | CURRENT_TIMESTAMP |                |
+-------------+---------------------+------+-----+-------------------+----------------+



// Add 9 records

mysql> insert into ohm.person(name) values ('Raja');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Ravi');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Kalai');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Sudha');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Priya');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Vanitha');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Kasturi');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Lakshmi');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Suriya Devi');


mysql> select * from ohm.person;
+----+-------------+---------------------+
| id | name        | last_mod_dt         |
+----+-------------+---------------------+
|  1 | Raja        | 2020-08-07 01:17:17 |
|  2 | Ravi        | 2020-08-07 01:17:30 |
|  3 | Kalai       | 2020-08-07 01:17:34 |
|  4 | Sudha       | 2020-08-07 01:17:39 |
|  5 | Priya       | 2020-08-07 01:31:28 |
|  6 | Vanitha     | 2020-08-07 01:31:34 |
|  7 | Kasturi     | 2020-08-07 01:31:40 |
|  8 | Lakshmi     | 2020-08-07 01:31:45 |
|  9 | Suriya Devi | 2020-08-07 01:31:51 |
+----+-------------+---------------------+





Now do the initial SQOOP import (an incremental-append import that acts as the first load)


$ sqoop import \
--connect jdbc:mysql://localhost:3306/ohm \
--table person \
--username root \
--password cloudera \
--check-column last_mod_dt \
--incremental append \
--last-value '2020-08-07 01:17:17' \
--target-dir /user/cloudera/person_test


BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `person` WHERE ( `last_mod_dt` > '2020-08-07 01:17:17' AND `last_mod_dt` <= '2020-08-07 01:31:51.0' )



$ hdfs dfs -ls /user/cloudera/person_test
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-07 01:28 /user/cloudera/person_test/_SUCCESS
-rw-r--r--   1 cloudera cloudera         29 2020-08-07 01:28 /user/cloudera/person_test/part-m-00000
-rw-r--r--   1 cloudera cloudera         29 2020-08-07 01:28 /user/cloudera/person_test/part-m-00001
-rw-r--r--   1 cloudera cloudera         30 2020-08-07 01:28 /user/cloudera/person_test/part-m-00002
-rw-r--r--   1 cloudera cloudera         30 2020-08-07 01:28 /user/cloudera/person_test/part-m-00003


// 8 of the 9 rows were imported: only rows with last_mod_dt strictly greater
// than the given last-value qualify, so row 1 (exactly '2020-08-07 01:17:17') is excluded

$ hdfs dfs -cat /user/cloudera/person_test/*
2,Ravi,2020-08-07 01:17:30.0
3,Kalai,2020-08-07 01:17:34.0
4,Sudha,2020-08-07 01:17:39.0
5,Priya,2020-08-07 01:31:28.0
6,Vanitha,2020-08-07 01:31:34.0
7,Kasturi,2020-08-07 01:31:40.0
8,Lakshmi,2020-08-07 01:31:45.0
9,Suriya Devi,2020-08-07 01:31:51.0


 

// Adding new records in MySQL

mysql> insert into ohm.person(name) values ('Nanjil Vijayan');
Query OK, 1 row affected (0.01 sec)

mysql> insert into ohm.person(name) values ('Elizabeth Helen');
Query OK, 1 row affected (0.01 sec)

mysql> insert into ohm.person(name) values ('Peter Paul');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Ravindran');
Query OK, 1 row affected (0.00 sec)


mysql> select * from ohm.person;
+----+-----------------+---------------------+
| id | name            | last_mod_dt         |
+----+-----------------+---------------------+
|  1 | Raja            | 2020-08-07 01:17:17 |
|  2 | Ravi            | 2020-08-07 01:17:30 |
|  3 | Kalai           | 2020-08-07 01:17:34 |
|  4 | Sudha           | 2020-08-07 01:17:39 |
|  5 | Priya           | 2020-08-07 01:31:28 |
|  6 | Vanitha         | 2020-08-07 01:31:34 |
|  7 | Kasturi         | 2020-08-07 01:31:40 |
|  8 | Lakshmi         | 2020-08-07 01:31:45 |
|  9 | Suriya Devi     | 2020-08-07 01:31:51 |  -- last-value of the previous import
| 10 | Nanjil Vijayan  | 2020-08-07 01:40:53 |
| 11 | Elizabeth Helen | 2020-08-07 01:41:14 |
| 12 | Peter Paul      | 2020-08-07 01:41:20 |
| 13 | Ravindran       | 2020-08-07 01:41:35 |
+----+-----------------+---------------------+





Now do the SQOOP incremental import in append mode, feeding the previous run's last-value


$ sqoop import \
--connect jdbc:mysql://localhost:3306/ohm \
--table person \
--username root \
--password cloudera \
--check-column last_mod_dt \
--incremental append \
--last-value '2020-08-07 01:31:51' \
--target-dir /user/cloudera/person_test \
-m 1



$ hdfs dfs -ls /user/cloudera/person_test
Found 5 items
-rw-r--r--   1 cloudera cloudera         59 2020-08-07 01:37 /user/cloudera/person_test/part-m-00000
-rw-r--r--   1 cloudera cloudera         60 2020-08-07 01:38 /user/cloudera/person_test/part-m-00001
-rw-r--r--   1 cloudera cloudera         64 2020-08-07 01:38 /user/cloudera/person_test/part-m-00002
-rw-r--r--   1 cloudera cloudera         68 2020-08-07 01:38 /user/cloudera/person_test/part-m-00003
-rw-r--r--   1 cloudera cloudera        152 2020-08-07 01:43 /user/cloudera/person_test/part-m-00004  -- new file holding only the incremental records



$ hdfs dfs -cat /user/cloudera/person_test/*
2,Ravi,2020-08-07 01:17:30.0
3,Kalai,2020-08-07 01:17:34.0
4,Sudha,2020-08-07 01:17:39.0
5,Priya,2020-08-07 01:31:28.0
6,Vanitha,2020-08-07 01:31:34.0
7,Kasturi,2020-08-07 01:31:40.0
8,Lakshmi,2020-08-07 01:31:45.0
9,Suriya Devi,2020-08-07 01:31:51.0
10,Nanjil Vijayan,2020-08-07 01:40:53.0  -- incremental records
11,Elizabeth Helen,2020-08-07 01:41:14.0
12,Peter Paul,2020-08-07 01:41:20.0
13,Ravindran,2020-08-07 01:41:35.0
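

Append mode only picks up brand-new rows. If existing rows can also be updated in place (which this table's last_mod_dt would capture), Sqoop's lastmodified mode with a merge key is the usual alternative; a sketch, assuming id as the merge key:

$ sqoop import \
--connect jdbc:mysql://localhost:3306/ohm \
--table person \
--username root \
--password cloudera \
--check-column last_mod_dt \
--incremental lastmodified \
--last-value '2020-08-07 01:41:35' \
--merge-key id \
--target-dir /user/cloudera/person_test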



Wednesday, 5 August 2020

Import all tables from MySQL to HDFS (Exclude Specific tables while importing)

//Import all the tables from MySQL to HDFS using SQOOP and exclude specific tables while importing

mysql> use retail_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+---------------------+
| Tables_in_retail_db |
+---------------------+
| categories          |
| customers           |
| departments         |
| order_items         |
| orders              |
| products            |
+---------------------+
6 rows in set (0.00 sec)


// Import all tables from retail_db 

sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera


// imported into /user/cloudera/categories,
// /user/cloudera/customers,
// /user/cloudera/departments,
// /user/cloudera/order_items,
// /user/cloudera/orders,
// /user/cloudera/products

// Testing purpose - excluded all the tables 
$ sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera   \
--exclude-tables "categories,customers,departments,order_items,orders,products"

Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
20/08/05 09:05:02 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
20/08/05 09:05:02 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/08/05 09:05:05 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
Skipping table: categories
Skipping table: customers
Skipping table: departments
Skipping table: order_items
Skipping table: orders
Skipping table: products



// Only the categories table will be imported - every table except categories is named in the exclude-tables list.
sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera   \
--exclude-tables "customers,departments,order_items,orders,products"


Skipping table: customers
Skipping table: departments
Skipping table: order_items
Skipping table: orders
Skipping table: products


Use the --exclude-tables "table1,table2" option to skip table1 and table2.

Do NOT add whitespace between the table names (i.e. "table1, table2" will not match).


$ hdfs dfs -ls /user/cloudera/
Found 1 items
drwxr-xr-x   - cloudera supergroup          0 2020-08-05 09:12 /user/cloudera/categories

$ hdfs dfs -ls /user/cloudera/categories/*
-rw-r--r--   1 cloudera supergroup          0 2020-08-05 09:12 /user/cloudera/categories/_SUCCESS
-rw-r--r--   1 cloudera supergroup        271 2020-08-05 09:12 /user/cloudera/categories/part-m-00000
-rw-r--r--   1 cloudera supergroup        263 2020-08-05 09:12 /user/cloudera/categories/part-m-00001
-rw-r--r--   1 cloudera supergroup        266 2020-08-05 09:12 /user/cloudera/categories/part-m-00002
-rw-r--r--   1 cloudera supergroup        229 2020-08-05 09:12 /user/cloudera/categories/part-m-00003

Monday, 3 August 2020

SQOOP Export Example : Hive to MySQL using SQOOP Export

// We are going to export ohm.cust_from_mysql (Hive table) into cust_exported (MySQL)

hive> select * from ohm.cust_from_mysql limit 5;
OK
1 Richard Hernandez XXXXXXXXX XXXXXXXXX 6303 Heather Plaza Brownsville TX 78521 
2 Mary Barrett XXXXXXXXX XXXXXXXXX 9526 Noble Embers Ridge Littleton CO 80126
3 Ann Smith XXXXXXXXX XXXXXXXXX 3422 Blue Pioneer Bend Caguas PR 00725
4 Mary Jones XXXXXXXXX XXXXXXXXX 8324 Little Common San Marcos CA 92069
5 Robert Hudson XXXXXXXXX XXXXXXXXX 10 Crystal River Mall Caguas PR 00725

hive> select count(1) from ohm.cust_from_mysql;

12435 -- the Hive table has 12435 rows

// We must create the target table in the destination database before the export.
// Create an empty table in retail_db of MySQL:
mysql> create table retail_db.cust_exported as select * from customers where 1 = 0;

// Presently the retail_db.cust_exported table has zero rows.

$ sqoop export \
--connect "jdbc:mysql://localhost/retail_db" \
--username root \
--password cloudera \
--table cust_exported \
--hcatalog-table cust_from_mysql \
--hcatalog-database ohm

Exported 12435 records.


mysql> select * from cust_exported  limit 5;
+-------------+----------------+----------------+----------------+-------------------+------------------------+---------------+----------------+------------------+
| customer_id | customer_fname | customer_lname | customer_email | customer_password | customer_street        | customer_city | customer_state | customer_zipcode |
+-------------+----------------+----------------+----------------+-------------------+------------------------+---------------+----------------+------------------+
|        9327 | Donna          | Smith          | XXXXXXXXX      | XXXXXXXXX         | 4114 Clear Nectar Isle | Caguas        | PR             | 00725            |
|           1 | Richard        | Hernandez      | XXXXXXXXX      | XXXXXXXXX         | 6303 Heather Plaza     | Brownsville   | TX             | 78521            |
|        9328 | Mary           | Perez          | XXXXXXXXX      | XXXXXXXXX         | 376 Golden Orchard     | Moreno Valley | CA             | 92553            |
|        9329 | Eugene         | Powell         | XXXXXXXXX      | XXXXXXXXX         | 2161 Burning Maze      | Metairie      | LA             | 70003            |
|        9330 | Mary           | Conley         | XXXXXXXXX      | XXXXXXXXX         | 3046 Broad Sky Dale    | Caguas        | PR             | 00725            |
+-------------+----------------+----------------+----------------+-------------------+------------------------+---------------+----------------+------------------+

mysql> select count(1) from retail_db.cust_exported;
+----------+
| count(1) |
+----------+
|    12435 |
+----------+

SQOOP Export Example : HDFS to MySQL using SQOOP Export


The target table must exist in the database, so we create the empty table first:

mysql> create table retail_db.customers_exported as select * from customers where 1 = 0;

// Now the record count is zero for customers_exported table of retail_db;

hdfs dfs  -ls /user/cloudera/customers
Found 9 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 21:58 /user/cloudera/customers/_SUCCESS
-rw-r--r--   1 cloudera cloudera     118147 2020-08-02 21:57 /user/cloudera/customers/part-m-00000
-rw-r--r--   1 cloudera cloudera     119075 2020-08-02 21:57 /user/cloudera/customers/part-m-00001
-rw-r--r--   1 cloudera cloudera     119057 2020-08-02 21:57 /user/cloudera/customers/part-m-00002
-rw-r--r--   1 cloudera cloudera     118831 2020-08-02 21:57 /user/cloudera/customers/part-m-00003
-rw-r--r--   1 cloudera cloudera     118798 2020-08-02 21:57 /user/cloudera/customers/part-m-00004
-rw-r--r--   1 cloudera cloudera     119294 2020-08-02 21:57 /user/cloudera/customers/part-m-00005
-rw-r--r--   1 cloudera cloudera     119892 2020-08-02 21:58 /user/cloudera/customers/part-m-00006
-rw-r--r--   1 cloudera cloudera     120431 2020-08-02 21:58 /user/cloudera/customers/part-m-00007


sqoop export \
--connect "jdbc:mysql://localhost/retail_db" \
--username root \
--password cloudera \
--table customers_exported \
--export-dir /user/cloudera/customers \
--input-fields-terminated-by ','


mysql> select * from retail_db.customers_exported limit 10;
+-------------+----------------+----------------+----------------+-------------------+--------------------------+---------------+----------------+------------------+
| customer_id | customer_fname | customer_lname | customer_email | customer_password | customer_street          | customer_city | customer_state | customer_zipcode |
+-------------+----------------+----------------+----------------+-------------------+--------------------------+---------------+----------------+------------------+
|        4665 | Jessica        | York           | XXXXXXXXX      | XXXXXXXXX         | 3982 Middle Horse Line   | Escondido     | CA             | 92025            |
|        4666 | Mary           | Smith          | XXXXXXXXX      | XXXXXXXXX         | 340 Velvet Barn Gate     | Manati        | PR             | 00674            |
|        4667 | Cheryl         | Gonzalez       | XXXXXXXXX      | XXXXXXXXX         | 5375 Iron Cloud Mews     | Caguas        | PR             | 00725            |
|        4668 | Mary           | Schwartz       | XXXXXXXXX      | XXXXXXXXX         | 4263 Old Robin Mount     | Mechanicsburg | PA             | 17055            |
|        4669 | Robert         | Smith          | XXXXXXXXX      | XXXXXXXXX         | 4905 Lazy Stead          | Humacao       | PR             | 00791            |
|        4670 | Sharon         | Bailey         | XXXXXXXXX      | XXXXXXXXX         | 3890 Jagged Apple Dell   | Caguas        | PR             | 00725            |
|        4671 | Eric           | Diaz           | XXXXXXXXX      | XXXXXXXXX         | 281 Middle Blossom Ledge | Chandler      | AZ             | 85225            |
|        4672 | Mary           | Smith          | XXXXXXXXX      | XXXXXXXXX         | 7800 Tawny Creek Trace   | Detroit       | MI             | 48235            |
|        4673 | Wayne          | Hodges         | XXXXXXXXX      | XXXXXXXXX         | 7509 Iron Concession     | Caguas        | PR             | 00725            |
|        4674 | Mary           | Lynn           | XXXXXXXXX      | XXXXXXXXX         | 7694 Velvet Turnabout    | Jamaica       | NY             | 11434            |
+-------------+----------------+----------------+----------------+-------------------+--------------------------+---------------+----------------+------------------+
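
Note that re-running the export above would append duplicate rows. To refresh existing rows instead, Sqoop can key the export on the primary key; a sketch (allowinsert upserts, updateonly only updates existing rows):

sqoop export \
--connect "jdbc:mysql://localhost/retail_db" \
--username root \
--password cloudera \
--table customers_exported \
--export-dir /user/cloudera/customers \
--input-fields-terminated-by ',' \
--update-key customer_id \
--update-mode allowinsert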

SQOOP Import Problem and Solution #3

#Problem : 3

Connect to MySQL database using SQOOP,
import all customers whose street name contains "Plaza"

Ex:
Hazy Mountain Plaza
Tawny Fox Plaza

Data description:
A MySQL instance is running on the localhost node.
In that instance, you will find customers table that contains
customer's data.

Installation : localhost
Database name : retail_db
Table Name : Customers
User Name : root 
Password : cloudera

Output requirement:

Place the customers files in the HDFS directory
"/user/cloudera/problem1/customers/textdata"

Save output in text format with fields
separated by a "*" and lines terminated by a pipe.


Select the columns : customer_id,
customer_fname,customer_lname,
customer_street_name.



$ sqoop import \
--connect "jdbc:mysql://localhost/retail_db" \
--username root \
--password cloudera \
--table customers \
--where "customer_street like '%Plaza%'" \
--target-dir /user/cloudera/problem1/customers/textdata \
--columns "customer_id,customer_fname,customer_lname,customer_street"


$ hdfs dfs -ls /user/cloudera/problem1/customers/textdata
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 08:40 /user/cloudera/problem1/customers/textdata/_SUCCESS
-rw-r--r--   1 cloudera cloudera        974 2020-08-03 08:40 /user/cloudera/problem1/customers/textdata/part-m-00000
-rw-r--r--   1 cloudera cloudera       1165 2020-08-03 08:40 /user/cloudera/problem1/customers/textdata/part-m-00001
-rw-r--r--   1 cloudera cloudera        986 2020-08-03 08:40 /user/cloudera/problem1/customers/textdata/part-m-00002
-rw-r--r--   1 cloudera cloudera       1028 2020-08-03 08:40 /user/cloudera/problem1/customers/textdata/part-m-00003


$ hdfs dfs -cat /user/cloudera/problem1/customers/textdata/part-m-00000 | tail
1793,Mary,Burke,141 Dewy Plaza
1821,Mary,Trevino,6520 Quaking Quail Plaza
1891,Mary,Mora,1804 Misty Plaza
2053,Ronald,Smith,1025 Colonial Plaza
2247,Alice,Werner,1557 Heather Leaf Plaza
2517,Jose,Kane,5202 Emerald Island Plaza
2545,Larry,Patel,1688 Middle Panda Plaza
2575,Mary,Carroll,4979 Honey Plaza
2831,Mary,Padilla,3810 Crystal Plaza
2875,Mary,Roberts,2205 Velvet Plaza
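

Note that the run above kept Sqoop's default comma delimiter, as the output shows. To actually meet the "*" and pipe requirement, the import would add the delimiter options; a sketch, not reflected in the transcript above:

$ sqoop import \
--connect "jdbc:mysql://localhost/retail_db" \
--username root \
--password cloudera \
--table customers \
--where "customer_street like '%Plaza%'" \
--target-dir /user/cloudera/problem1/customers/textdata \
--columns "customer_id,customer_fname,customer_lname,customer_street" \
--fields-terminated-by '*' \
--lines-terminated-by '|'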


 

SQOOP Import Problem and Solution #2

#Problem : 2

Connect to the MySQL database using SQOOP,
import all customers who live in the 'CA' state.

Data description:
A MySQL instance is running on the localhost node.
In that instance, you will find customers table that contains
customer's data.

Installation : localhost
Database name : retail_db
Table Name : Customers
User Name : root 
Password : cloudera

Output requirement:

Place the customers files in the HDFS directory
"/user/cloudera/problem1/customers_selected/avrodata"

Use avro format with Snappy compression

Select the columns : customer_id,customer_fname,customer_lname,customer_state.



$ sqoop import \
--connect "jdbc:mysql://localhost/retail_db" \
--username root \
--password cloudera \
--table customers \
--where "customer_state='CA'" \
--target-dir /user/cloudera/problem1/customers_selected/avrodata \
--columns "customer_id,customer_fname,customer_lname,customer_state" \
--compress \
--compression-codec snappy \
--as-avrodatafile


$ hdfs dfs -ls /user/cloudera/problem1/customers_selected/avrodata
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 08:24 /user/cloudera/problem1/customers_selected/avrodata/_SUCCESS
-rw-r--r--   1 cloudera cloudera       6508 2020-08-03 08:24 /user/cloudera/problem1/customers_selected/avrodata/part-m-00000.avro
-rw-r--r--   1 cloudera cloudera       6427 2020-08-03 08:24 /user/cloudera/problem1/customers_selected/avrodata/part-m-00001.avro
-rw-r--r--   1 cloudera cloudera       6419 2020-08-03 08:24 /user/cloudera/problem1/customers_selected/avrodata/part-m-00002.avro
-rw-r--r--   1 cloudera cloudera       7093 2020-08-03 08:24 /user/cloudera/problem1/customers_selected/avrodata/part-m-00003.avro


$ avro-tools tojson hdfs://localhost/user/cloudera/problem1/customers_selected/avrodata/part-m-00003.avro | head
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
{"customer_id":{"int":9328},"customer_fname":{"string":"Mary"},"customer_lname":{"string":"Perez"},"customer_state":{"string":"CA"}}
{"customer_id":{"int":9333},"customer_fname":{"string":"Angela"},"customer_lname":{"string":"Mills"},"customer_state":{"string":"CA"}}
{"customer_id":{"int":9336},"customer_fname":{"string":"Janice"},"customer_lname":{"string":"Guzman"},"customer_state":{"string":"CA"}}
{"customer_id":{"int":9338},"customer_fname":{"string":"James"},"customer_lname":{"string":"Davis"},"customer_state":{"string":"CA"}}
{"customer_id":{"int":9340},"customer_fname":{"string":"Mary"},"customer_lname":{"string":"Smith"},"customer_state":{"string":"CA"}}
{"customer_id":{"int":9347},"customer_fname":{"string":"Mary"},"customer_lname":{"string":"Fuentes"},"customer_state":{"string":"CA"}}
{"customer_id":{"int":9352},"customer_fname":{"string":"Mary"},"customer_lname":{"string":"Lewis"},"customer_state":{"string":"CA"}}
{"customer_id":{"int":9357},"customer_fname":{"string":"Katherine"},"customer_lname":{"string":"Spence"},"customer_state":{"string":"CA"}}
{"customer_id":{"int":9363},"customer_fname":{"string":"Mary"},"customer_lname":{"string":"Simmons"},"customer_state":{"string":"CA"}}
{"customer_id":{"int":9375},"customer_fname":{"string":"George"},"customer_lname":{"string":"Smith"},"customer_state":{"string":"CA"}}

SQOOP Import problem and solution #1

SQOOP problem : #1
Connect to MySQL database using SQOOP, import all orders
from orders table whose order_status is COMPLETE.

Data Description:
A MySQL instance is running on the localhost. In that instance, you will find an orders table that
contains orders data.

Installation : LocalHost
Database Name : retail_db
Table Name : Orders
User Name : root
Password : cloudera

Output requirement:
Place the orders files in the HDFS directory:
/user/cloudera/problem1/orders/parquetdata
Use parquet format with tab delimiter and snappy compression.

NULL values are represented as -1 for numbers and "NA" for strings.


Solution:
$ sqoop import \
--connect "jdbc:mysql://localhost/retail_db" \
--username root \
--password cloudera \
--table orders \
--target-dir /user/cloudera/problem1/orders/parquetdata \
--delete-target-dir \
--fields-terminated-by "\t" \
--where "order_status='COMPLETE'" \
--null-string "NA" \
--null-non-string -1 \
--compress \
--compression-codec snappy \
--as-parquetfile

$ hdfs dfs -ls /user/cloudera/problem1/orders/parquetdata
Found 6 items
drwxr-xr-x   - cloudera cloudera          0 2020-08-03 08:09 /user/cloudera/problem1/orders/parquetdata/.metadata
drwxr-xr-x   - cloudera cloudera          0 2020-08-03 08:11 /user/cloudera/problem1/orders/parquetdata/.signals
-rw-r--r--   1 cloudera cloudera      50233 2020-08-03 08:11 /user/cloudera/problem1/orders/parquetdata/1fa590f9-281d-4d62-9650-ddfe04d7eae2.parquet
-rw-r--r--   1 cloudera cloudera      47508 2020-08-03 08:11 /user/cloudera/problem1/orders/parquetdata/217e22ee-25b2-41a6-8bf9-f6fee4c7d0c4.parquet
-rw-r--r--   1 cloudera cloudera      46231 2020-08-03 08:11 /user/cloudera/problem1/orders/parquetdata/826a8254-9700-4edd-afdb-b8fb58b2fd49.parquet
-rw-r--r--   1 cloudera cloudera      46406 2020-08-03 08:11 /user/cloudera/problem1/orders/parquetdata/84355f27-d1a4-4c1b-ac9d-84fd5139a912.parquet



// View the parquet file content using parquet-tools (order_date appears as epoch milliseconds because Sqoop stores datetime values as longs in Parquet):

[cloudera@quickstart ~]$ parquet-tools head -n5 hdfs://localhost/user/cloudera/problem1/orders/parquetdata/1fa590f9-281d-4d62-9650-ddfe04d7eae2.parquet
order_id = 51665
order_date = 1402729200000
order_customer_id = 8645
order_status = COMPLETE

order_id = 51670
order_date = 1402729200000
order_customer_id = 9322
order_status = COMPLETE

order_id = 51671
order_date = 1402729200000
order_customer_id = 9000
order_status = COMPLETE

order_id = 51673
order_date = 1402729200000
order_customer_id = 7538
order_status = COMPLETE

order_id = 51677
order_date = 1402729200000
order_customer_id = 6321
order_status = COMPLETE

Incremental append : MySQL to HDFS

#1. In the very first phase, we are doing FULL LOAD (init version)
#2. In the 2nd phase, we are doing INCREMENTAL APPEND 
$ mysql -uroot -pcloudera -hlocalhost

mysql> use ohm;

mysql> create table person (id int, name varchar(50));

mysql> insert into person (id,name) values (100,'Raja'),(101,'Siva'),(102,'Lal');

mysql> select * from ohm.person;
+------+------+
| id   | name |
+------+------+
|  100 | Raja |
|  101 | Siva |
|  102 | Lal  |
+------+------+
3 rows in set (0.00 sec)


// Initial full load
sqoop import \
--connect jdbc:mysql://localhost:3306/ohm \
--username root \
--password cloudera \
--table person \
--target-dir user/cloudera/person_incremental \
--split-by id


hdfs dfs -ls user/cloudera/person_incremental
Found 4 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 04:49 user/cloudera/person_incremental/_SUCCESS
-rw-r--r--   1 cloudera cloudera          9 2020-08-03 04:49 user/cloudera/person_incremental/part-m-00000
-rw-r--r--   1 cloudera cloudera          9 2020-08-03 04:49 user/cloudera/person_incremental/part-m-00001
-rw-r--r--   1 cloudera cloudera          8 2020-08-03 04:49 user/cloudera/person_incremental/part-m-00002

$ hdfs dfs -cat user/cloudera/person_incremental/*
100,Raja
101,Siva
102,Lal


// Adding 4 more records to ohm.person
mysql> insert into ohm.person (id,name) values (103,'Nila'),(104,'Kalai'),(105,'Bharani'),(106,'Sara');
Query OK, 4 rows affected (0.00 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> select * from ohm.person;
+------+---------+
| id   | name    |
+------+---------+
|  100 | Raja    |  -- existing row
|  101 | Siva    |  -- existing row
|  102 | Lal     |  -- existing row
|  103 | Nila    |  -- newly inserted row
|  104 | Kalai   |  -- newly inserted row
|  105 | Bharani |  -- newly inserted row
|  106 | Sara    |  -- newly inserted row
+------+---------+



// incremental append using SQOOP import
// feeding last-value as 102 and check-column as id
sqoop import \
--connect jdbc:mysql://localhost:3306/ohm \
--username root \
--password cloudera \
--table person \
--incremental append \
--check-column id \
--last-value 102 \
--target-dir user/cloudera/person_incremental \
--split-by id


$ hdfs dfs -ls  user/cloudera/person_incremental
Found 8 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 04:49 user/cloudera/person_incremental/_SUCCESS
-rw-r--r--   1 cloudera cloudera          9 2020-08-03 04:49 user/cloudera/person_incremental/part-m-00000
-rw-r--r--   1 cloudera cloudera          9 2020-08-03 04:49 user/cloudera/person_incremental/part-m-00001
-rw-r--r--   1 cloudera cloudera          8 2020-08-03 04:49 user/cloudera/person_incremental/part-m-00002
-rw-r--r--   1 cloudera cloudera          9 2020-08-03 05:10 user/cloudera/person_incremental/part-m-00003
-rw-r--r--   1 cloudera cloudera         10 2020-08-03 05:10 user/cloudera/person_incremental/part-m-00004
-rw-r--r--   1 cloudera cloudera         12 2020-08-03 05:10 user/cloudera/person_incremental/part-m-00005
-rw-r--r--   1 cloudera cloudera          9 2020-08-03 05:10 user/cloudera/person_incremental/part-m-00006


$ hdfs dfs -cat user/cloudera/person_incremental/*
100,Raja
101,Siva
102,Lal
103,Nila  -- Newly appended row
104,Kalai  -- Newly appended row
105,Bharani  -- Newly appended row
106,Sara  -- Newly appended row

List Databases, Tables and Display the Structure of a table using SQOOP


//Display all the databases

sqoop list-databases \
--connect "jdbc:mysql://localhost:3306/" \
--username root \
--password cloudera

firehose
hue
metastore
mysql
nav
navms
ohm
oozie
retail_db
rman
sentry


//Display all the tables in a database:

sqoop list-tables \
--connect "jdbc:mysql://localhost:3306/retail_db" \
--username root \
--password cloudera


categories
customers
departments
order_items
orders
products
suppliers


// Evaluate some queries

sqoop eval \
--connect "jdbc:mysql://localhost:3306/retail_db" \
--username root \
--password cloudera \
--query "describe customers"

---------------------------------------------------------------------------------------------------------
| Field                | Type                 | Null | Key | Default              | Extra                | 
---------------------------------------------------------------------------------------------------------
| customer_id          | int(11)              | NO  | PRI | (null)               | auto_increment       | 
| customer_fname       | varchar(45)          | NO  |     | (null)               |                      | 
| customer_lname       | varchar(45)          | NO  |     | (null)               |                      | 
| customer_email       | varchar(45)          | NO  |     | (null)               |                      | 
| customer_password    | varchar(45)          | NO  |     | (null)               |                      | 
| customer_street      | varchar(255)         | NO  |     | (null)               |                      | 
| customer_city        | varchar(45)          | NO  |     | (null)               |                      | 
| customer_state       | varchar(45)          | NO  |     | (null)               |                      | 
| customer_zipcode     | varchar(45)          | NO  |     | (null)               |                      | 
---------------------------------------------------------------------------------------------------------

sqoop eval \
--connect "jdbc:mysql://localhost:3306/retail_db" \
--username root \
--password cloudera \
--query "select min(customer_id) as Minn, max(customer_id) as Maxx from retail_db.customers"

-----------------------------
| Minn        | Maxx        | 
-----------------------------
| 1           | 12435       | 
-----------------------------

NULL Handling while SQOOP import

//Handling nulls while doing SQOOP import

use retail_db;

CREATE TABLE suppliers
( id int NOT NULL,
  name char(50),
  age int
);

mysql> describe suppliers;
+-------+----------+------+-----+---------+-------+
| Field | Type     | Null | Key | Default | Extra |
+-------+----------+------+-----+---------+-------+
| id    | int(11)  | NO   |     | NULL    |       |
| name  | char(50) | YES  |     | NULL    |       |
| age   | int(11)  | YES  |     | NULL    |       |
+-------+----------+------+-----+---------+-------+

insert into retail_db.suppliers (id,name,age) values (100,NULL,NULL),(101,'Raja',NULL),(103,NULL,NULL),(104,'Manoj',NULL), (105,NULL,30), (106,NULL,NULL);
 
mysql>  select * from retail_db.suppliers;
+-----+-------+------+
| id  | name  | age  |
+-----+-------+------+
| 100 | NULL  | NULL |
| 101 | Raja  | NULL |
| 103 | NULL  | NULL |
| 104 | Manoj | NULL |
| 105 | NULL  |   30 |
| 106 | NULL  | NULL |
+-----+-------+------+

// String columns that are NULL will be replaced with 'xxxx'
// Non-string columns that are NULL will be replaced with 'yyy'

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table suppliers \
--target-dir /user/cloudera/suppliers_null_tr \
--delete-target-dir \
--null-string "xxxx" \
--null-non-string 'yyy' \
--split-by id

Retrieved 6 records.


 hdfs dfs -ls /user/cloudera/suppliers_null_tr 
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 02:02 /user/cloudera/suppliers_null_tr/_SUCCESS
-rw-r--r--   1 cloudera cloudera         28 2020-08-03 02:02 /user/cloudera/suppliers_null_tr/part-m-00000
-rw-r--r--   1 cloudera cloudera         14 2020-08-03 02:02 /user/cloudera/suppliers_null_tr/part-m-00001
-rw-r--r--   1 cloudera cloudera         15 2020-08-03 02:02 /user/cloudera/suppliers_null_tr/part-m-00002
-rw-r--r--   1 cloudera cloudera         26 2020-08-03 02:02 /user/cloudera/suppliers_null_tr/part-m-00003

$ hdfs dfs -cat /user/cloudera/suppliers_null_tr/* 
100,xxxx,yyy
101,Raja,yyy
103,xxxx,yyy
104,Manoj,yyy
105,xxxx,30
106,xxxx,yyy
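

The same tokens matter in the reverse direction: if these files are later exported back to MySQL, Sqoop has to be told which strings stand for NULL. A sketch, assuming a hypothetical compatible target table named suppliers_exported:

sqoop export \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table suppliers_exported \
--export-dir /user/cloudera/suppliers_null_tr \
--input-null-string "xxxx" \
--input-null-non-string 'yyy'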


Importing data from MySQL to HDFS using SQOOP based on conditions Where, Query, Columns options

Import data from MySQL to HDFS using SQOOP with conditional data importing

//Conditional import using Where  

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_nm_mary \
--delete-target-dir \
--where 'customer_fname="Mary"'

Retrieved 4741 records.

hdfs dfs -ls /user/cloudera/cust_nm_mary 
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 00:23 /user/cloudera/cust_nm_mary/_SUCCESS
-rw-r--r--   1 cloudera cloudera      89617 2020-08-03 00:23 /user/cloudera/cust_nm_mary/part-m-00000
-rw-r--r--   1 cloudera cloudera      88116 2020-08-03 00:23 /user/cloudera/cust_nm_mary/part-m-00001
-rw-r--r--   1 cloudera cloudera      89590 2020-08-03 00:23 /user/cloudera/cust_nm_mary/part-m-00002
-rw-r--r--   1 cloudera cloudera      90219 2020-08-03 00:23 /user/cloudera/cust_nm_mary/part-m-00003

// Note: every record has the first name Mary, and all columns were selected
hdfs dfs -cat /user/cloudera/cust_nm_mary/part-m-00000 | head
2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble Embers Ridge,Littleton,CO,80126
4,Mary,Jones,XXXXXXXXX,XXXXXXXXX,8324 Little Common,San Marcos,CA,92069
6,Mary,Smith,XXXXXXXXX,XXXXXXXXX,3151 Sleepy Quail Promenade,Passaic,NJ,07055
9,Mary,Perez,XXXXXXXXX,XXXXXXXXX,3616 Quaking Street,Caguas,PR,00725
11,Mary,Huffman,XXXXXXXXX,XXXXXXXXX,3169 Stony Woods,Caguas,PR,00725
13,Mary,Baldwin,XXXXXXXXX,XXXXXXXXX,7922 Iron Oak Gardens,Caguas,PR,00725
17,Mary,Robinson,XXXXXXXXX,XXXXXXXXX,1325 Noble Pike,Taylor,MI,48180
20,Mary,Ellis,XXXXXXXXX,XXXXXXXXX,4703 Old Route,West New York,NJ,07093
24,Mary,Smith,XXXXXXXXX,XXXXXXXXX,9417 Emerald Towers,Caguas,PR,00725
27,Mary,Vincent,XXXXXXXXX,XXXXXXXXX,1768 Sleepy Zephyr Place,Caguas,PR,00725


// Retrieve only specific columns, along with the where condition
sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_col_sel \
--delete-target-dir \
--columns "customer_fname, customer_lname, customer_city" \
--where 'customer_fname="Mary"'


 hdfs dfs -ls /user/cloudera/cust_col_sel
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 00:43 /user/cloudera/cust_col_sel/_SUCCESS
-rw-r--r--   1 cloudera cloudera      24361 2020-08-03 00:43 /user/cloudera/cust_col_sel/part-m-00000
-rw-r--r--   1 cloudera cloudera      23816 2020-08-03 00:43 /user/cloudera/cust_col_sel/part-m-00001
-rw-r--r--   1 cloudera cloudera      24280 2020-08-03 00:43 /user/cloudera/cust_col_sel/part-m-00002
-rw-r--r--   1 cloudera cloudera      24254 2020-08-03 00:43 /user/cloudera/cust_col_sel/part-m-00003

// Retrieved first name, last name, and city only
$ hdfs dfs -cat /user/cloudera/cust_col_sel/part-m-00000 | head
Mary,Barrett,Littleton
Mary,Jones,San Marcos
Mary,Smith,Passaic
Mary,Perez,Caguas
Mary,Huffman,Caguas
Mary,Baldwin,Caguas
Mary,Robinson,Taylor
Mary,Ellis,West New York
Mary,Smith,Caguas
Mary,Vincent,Caguas


mysql> select customer_fname, customer_lname, customer_city from retail_db.customers where customer_id between 100 and 110;
+----------------+----------------+-------------------+
| customer_fname | customer_lname | customer_city     |
+----------------+----------------+-------------------+
| George         | Barrett        | Caguas            |
| Mary           | Briggs         | Dorchester Center |
| Gregory        | Smith          | Caguas            |
| Mary           | Williams       | Broomfield        |
| Debra          | Wood           | Saint Paul        |
| Mary           | Smith          | Caguas            |
| Lauren         | Freeman        | Napa              |
| Mary           | Weaver         | Caguas            |
| Joshua         | Smith          | Bronx             |
| Mary           | Thompson       | Caguas            |
| David          | Miranda        | Beaverton         |
+----------------+----------------+-------------------+

// Cannot specify --query and --table together.
// AND \$CONDITIONS must be included as part of the SELECT query.
// Retrieve data based on the select query (see the note after the output below for what $CONDITIONS becomes).
sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--target-dir /user/cloudera/cust_qry_rstl \
--delete-target-dir \
--query "select customer_id,customer_fname, customer_lname, customer_city from retail_db.customers where customer_id between 100 and 110 AND \$CONDITIONS" \
--split-by customer_id



hdfs dfs -ls /user/cloudera/cust_qry_rstl
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 01:11 /user/cloudera/cust_qry_rstl/_SUCCESS
-rw-r--r--   1 cloudera cloudera         85 2020-08-03 01:11 /user/cloudera/cust_qry_rstl/part-m-00000
-rw-r--r--   1 cloudera cloudera         77 2020-08-03 01:11 /user/cloudera/cust_qry_rstl/part-m-00001
-rw-r--r--   1 cloudera cloudera         47 2020-08-03 01:11 /user/cloudera/cust_qry_rstl/part-m-00002
-rw-r--r--   1 cloudera cloudera         76 2020-08-03 01:11 /user/cloudera/cust_qry_rstl/part-m-00003

// display the data written in HDFS
$ hdfs dfs -cat /user/cloudera/cust_qry_rstl/* 
100,George,Barrett,Caguas
101,Mary,Briggs,Dorchester Center
102,Gregory,Smith,Caguas
103,Mary,Williams,Broomfield
104,Debra,Wood,Saint Paul
105,Mary,Smith,Caguas
106,Lauren,Freeman,Napa
107,Mary,Weaver,Caguas
108,Joshua,Smith,Bronx
109,Mary,Thompson,Caguas
110,David,Miranda,Beaverton
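
For context on \$CONDITIONS: at run time each mapper replaces the token with its own split predicate over the --split-by column, so one mapper's effective query would look roughly like this (illustrative bounds, not taken from the logs):

select customer_id, customer_fname, customer_lname, customer_city
from retail_db.customers
where customer_id between 100 and 110 AND ( customer_id >= 100 ) AND ( customer_id < 103 )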

Split-by and boundary-query example in SQOOP

Example SQOOP code to work with Split-by and boundary-query:

// Split-by key : if we don't have a primary key in the table, we can go with --split-by
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table products \
--target-dir /user/cloudera/prod_slitt \
--delete-target-dir \
--split-by product_id

Retrieved 1345 records.

[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/prod_slitt/* 
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 00:02 /user/cloudera/prod_slitt/_SUCCESS
-rw-r--r--   1 cloudera cloudera      41419 2020-08-03 00:02 /user/cloudera/prod_slitt/part-m-00000
-rw-r--r--   1 cloudera cloudera      43660 2020-08-03 00:02 /user/cloudera/prod_slitt/part-m-00001
-rw-r--r--   1 cloudera cloudera      42195 2020-08-03 00:02 /user/cloudera/prod_slitt/part-m-00002
-rw-r--r--   1 cloudera cloudera      46719 2020-08-03 00:02 /user/cloudera/prod_slitt/part-m-00003

// do line count
$ hdfs dfs -cat /user/cloudera/prod_slitt/* | wc -l
1345


// split-by along with boundary-query to set the min and max split values of product_id
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table products \
--target-dir /user/cloudera/prod_slitt_v1 \
--delete-target-dir \
--boundary-query 'select min(product_id), max(product_id) from products where product_id > 100' \
--split-by product_id


BoundingValsQuery: select min(product_id), max(product_id) from products where product_id > 100
20/08/03 00:08:08 INFO db.IntegerSplitter: Split size: 311; Num splits: 4 from: 101 to: 1345
Retrieved 1245 records.
// Only 1245 of the 1345 rows came across: the splits cover product_id 101..1345,
// so the 100 rows outside those bounds are never selected by any mapper.
 
 hdfs dfs -ls /user/cloudera/prod_slitt_v1
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 00:09 /user/cloudera/prod_slitt_v1/_SUCCESS
-rw-r--r--   1 cloudera cloudera      37832 2020-08-03 00:08 /user/cloudera/prod_slitt_v1/part-m-00000
-rw-r--r--   1 cloudera cloudera      40905 2020-08-03 00:08 /user/cloudera/prod_slitt_v1/part-m-00001
-rw-r--r--   1 cloudera cloudera      38828 2020-08-03 00:08 /user/cloudera/prod_slitt_v1/part-m-00002
-rw-r--r--   1 cloudera cloudera      43714 2020-08-03 00:08 /user/cloudera/prod_slitt_v1/part-m-00003

// do line count
hdfs dfs -cat /user/cloudera/prod_slitt_v1/* | wc -l
1245

Sunday, 2 August 2020

Compression methods in SQOOP : Gzip, Snappy, Deflate, bzip2, lz4

Compression methods:
Gzip, Snappy, Deflate, bzip2, lz4 

//Gzip Compression
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_gzip \
--delete-target-dir \
--compress

hdfs dfs -ls /user/cloudera/cust_gzip
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 23:06 /user/cloudera/cust_gzip/_SUCCESS
-rw-r--r--   1 cloudera cloudera      63937 2020-08-02 23:06 /user/cloudera/cust_gzip/part-m-00000.gz
-rw-r--r--   1 cloudera cloudera      63708 2020-08-02 23:05 /user/cloudera/cust_gzip/part-m-00001.gz
-rw-r--r--   1 cloudera cloudera      64171 2020-08-02 23:06 /user/cloudera/cust_gzip/part-m-00002.gz
-rw-r--r--   1 cloudera cloudera      63905 2020-08-02 23:05 /user/cloudera/cust_gzip/part-m-00003.gz



//Snappy Compression
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_snappy \
--delete-target-dir \
--compress \
--compression-codec snappy

$ hdfs dfs -ls /user/cloudera/cust_snappy
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 23:07 /user/cloudera/cust_snappy/_SUCCESS
-rw-r--r--   1 cloudera cloudera     110165 2020-08-02 23:07 /user/cloudera/cust_snappy/part-m-00000.snappy
-rw-r--r--   1 cloudera cloudera     109884 2020-08-02 23:07 /user/cloudera/cust_snappy/part-m-00001.snappy
-rw-r--r--   1 cloudera cloudera     110479 2020-08-02 23:07 /user/cloudera/cust_snappy/part-m-00002.snappy
-rw-r--r--   1 cloudera cloudera     110616 2020-08-02 23:07 /user/cloudera/cust_snappy/part-m-00003.snappy



//Deflate Compression
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_deflate \
--delete-target-dir \
--compress \
--compression-codec deflate

hdfs dfs -ls /user/cloudera/cust_deflate
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 23:16 /user/cloudera/cust_deflate/_SUCCESS
-rw-r--r--   1 cloudera cloudera      63925 2020-08-02 23:16 /user/cloudera/cust_deflate/part-m-00000.deflate
-rw-r--r--   1 cloudera cloudera      63696 2020-08-02 23:16 /user/cloudera/cust_deflate/part-m-00001.deflate
-rw-r--r--   1 cloudera cloudera      64159 2020-08-02 23:16 /user/cloudera/cust_deflate/part-m-00002.deflate
-rw-r--r--   1 cloudera cloudera      63893 2020-08-02 23:16 /user/cloudera/cust_deflate/part-m-00003.deflate


//Bzip2 Compression
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_bzip \
--delete-target-dir \
--compress \
--compression-codec bzip2


hdfs dfs -ls /user/cloudera/cust_bzip
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 23:09 /user/cloudera/cust_bzip/_SUCCESS
-rw-r--r--   1 cloudera cloudera      41632 2020-08-02 23:09 /user/cloudera/cust_bzip/part-m-00000.bz2
-rw-r--r--   1 cloudera cloudera      41703 2020-08-02 23:09 /user/cloudera/cust_bzip/part-m-00001.bz2
-rw-r--r--   1 cloudera cloudera      41869 2020-08-02 23:09 /user/cloudera/cust_bzip/part-m-00002.bz2
-rw-r--r--   1 cloudera cloudera      41910 2020-08-02 23:09 /user/cloudera/cust_bzip/part-m-00003.bz2



//Lz4 Compression
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_lz4 \
--delete-target-dir \
--compress \
--compression-codec Lz4


hdfs dfs -ls /user/cloudera/cust_lz4
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 23:13 /user/cloudera/cust_lz4/_SUCCESS
-rw-r--r--   1 cloudera cloudera      97878 2020-08-02 23:12 /user/cloudera/cust_lz4/part-m-00000.lz4
-rw-r--r--   1 cloudera cloudera      97610 2020-08-02 23:12 /user/cloudera/cust_lz4/part-m-00001.lz4
-rw-r--r--   1 cloudera cloudera      98155 2020-08-02 23:13 /user/cloudera/cust_lz4/part-m-00002.lz4
-rw-r--r--   1 cloudera cloudera      98349 2020-08-02 23:12 /user/cloudera/cust_lz4/part-m-00003.lz4
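
With all five imports on disk, their total sizes can be compared side by side; bzip2 should come out smallest and the speed-oriented codecs (snappy, lz4) largest:

// compare the total size of each compressed import
hdfs dfs -du -s -h /user/cloudera/cust_gzip /user/cloudera/cust_snappy /user/cloudera/cust_deflate /user/cloudera/cust_bzip /user/cloudera/cust_lz4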


GZip   : Good compression ratio, widely supported
         .gz files are not splittable, so one large gzip file is read by a
         single mapper even though HDFS stores it in 128 MB blocks

BZip2  : Very good compression ratio and the files are splittable,
         but compression is very slow and CPU intensive

Snappy : Very fast, modest compression ratio
         raw .snappy files are not splittable, so snappy works best inside
         container formats such as Avro, SequenceFile or Parquet

Deflate: Same algorithm as gzip but without the gzip file headers
         .deflate files are not splittable and the bare format is less widely used

LZ4    : Very fast, modest compression ratio; a good choice when
         throughput matters more than file size
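
Besides the short aliases used above, --compression-codec also accepts a fully qualified Hadoop codec class name. For example, the bzip2 import could equivalently be written as:

// same bzip2 import, with the codec class name instead of the alias
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_bzip \
--delete-target-dir \
--compress \
--compression-codec org.apache.hadoop.io.compress.BZip2Codec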


Create Avro, Parquet, Sequence files using the SQOOP output

Import data from MySQL and save the output as : Avro, Parquet, Sequence files

// Save as Avro file
$ sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_avro \
--as-avrodatafile \
--delete-target-dir
 

hdfs dfs -ls /user/cloudera/cust_avro
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 22:37 /user/cloudera/cust_avro/_SUCCESS
-rw-r--r--   1 cloudera cloudera     258086 2020-08-02 22:37 /user/cloudera/cust_avro/part-m-00000.avro
-rw-r--r--   1 cloudera cloudera     257862 2020-08-02 22:37 /user/cloudera/cust_avro/part-m-00001.avro
-rw-r--r--   1 cloudera cloudera     259118 2020-08-02 22:37 /user/cloudera/cust_avro/part-m-00002.avro
-rw-r--r--   1 cloudera cloudera     260893 2020-08-02 22:37 /user/cloudera/cust_avro/part-m-00003.avro
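
To peek inside the Avro output, one option is the avro-tools utility (an assumption here: that avro-tools is installed on the node; otherwise the same subcommand can be run from the avro-tools jar). A minimal sketch:

// copy one Avro part file locally and dump its records as JSON
hdfs dfs -get /user/cloudera/cust_avro/part-m-00000.avro .
avro-tools tojson part-m-00000.avro | head -3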


//Save as parquet file
$ sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_parquet \
--as-parquetfile \
--delete-target-dir


hdfs dfs -ls /user/cloudera/cust_parquet
Found 6 items
drwxr-xr-x   - cloudera cloudera          0 2020-08-02 22:38 /user/cloudera/cust_parquet/.metadata
drwxr-xr-x   - cloudera cloudera          0 2020-08-02 22:39 /user/cloudera/cust_parquet/.signals
-rw-r--r--   1 cloudera cloudera      88944 2020-08-02 22:39 /user/cloudera/cust_parquet/0395a40c-1223-45e3-bff7-2bc061edfb5c.parquet
-rw-r--r--   1 cloudera cloudera      89047 2020-08-02 22:39 /user/cloudera/cust_parquet/10c94461-e193-49fc-9f4e-c0c3be93c7e3.parquet
-rw-r--r--   1 cloudera cloudera      88762 2020-08-02 22:39 /user/cloudera/cust_parquet/1ac81c8a-7422-402f-ade5-01173f580a1a.parquet
-rw-r--r--   1 cloudera cloudera      89163 2020-08-02 22:39 /user/cloudera/cust_parquet/505b1ef1-fdde-4e25-9acb-ead834a6d4af.parquet
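
Similarly, if the parquet-tools utility is available, it can print the schema and the first records of a Parquet file (file name taken from the listing above):

// show the schema and a few records of one Parquet part file
hdfs dfs -get /user/cloudera/cust_parquet/0395a40c-1223-45e3-bff7-2bc061edfb5c.parquet .
parquet-tools schema 0395a40c-1223-45e3-bff7-2bc061edfb5c.parquet
parquet-tools head 0395a40c-1223-45e3-bff7-2bc061edfb5c.parquet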



//Save as sequence file
$ sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_sequence \
--as-sequencefile \
--delete-target-dir


hdfs dfs -ls /user/cloudera/cust_sequence
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 22:41 /user/cloudera/cust_sequence/_SUCCESS
-rw-r--r--   1 cloudera cloudera     315998 2020-08-02 22:41 /user/cloudera/cust_sequence/part-m-00000
-rw-r--r--   1 cloudera cloudera     315711 2020-08-02 22:41 /user/cloudera/cust_sequence/part-m-00001
-rw-r--r--   1 cloudera cloudera     315814 2020-08-02 22:41 /user/cloudera/cust_sequence/part-m-00002
-rw-r--r--   1 cloudera cloudera     315633 2020-08-02 22:41 /user/cloudera/cust_sequence/part-m-00003
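
hdfs dfs -text can normally render SequenceFiles, but Sqoop writes each record using the record class it generated during the import, so that class may need to be on the classpath first. A sketch, assuming the generated customers.jar from Sqoop's compile directory has been copied to the current directory:

// make the Sqoop-generated record class visible, then dump a few records
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$(pwd)/customers.jar
hdfs dfs -text /user/cloudera/cust_sequence/part-m-00000 | head -5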

Import Data from MySQL to HDFS using SQOOP - Revisiting SQOOP Commands

// Import Data from MySQL to HDFS using SQOOP
// number of mappers not specified, so the default of 4 mappers is used
// target-dir not specified, so the data lands under the user's home directory: /user/cloudera/customers
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table customers

SELECT t.* FROM `customers` AS t LIMIT 1
Num splits: 4 from: 1 to: 12435
Retrieved 12435 records.
URL to check the log : http://quickstart.cloudera:8088/proxy/application_1596428369104_0001/
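
The "Num splits: 4 from: 1 to: 12435" line comes from Sqoop's boundary query: it asks MySQL for the minimum and maximum of the split column (the primary key, customer_id, by default) and divides that range evenly among the mappers, so each of the 4 mappers imports roughly 3109 ids. The query it issues looks like:

-- boundary query Sqoop runs before splitting the work
SELECT MIN(customer_id), MAX(customer_id) FROM customers;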

Result in hdfs:

hdfs dfs -ls /user/cloudera/customers
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 21:42 /user/cloudera/customers/_SUCCESS
-rw-r--r--   1 cloudera cloudera     237145 2020-08-02 21:42 /user/cloudera/customers/part-m-00000
-rw-r--r--   1 cloudera cloudera     237965 2020-08-02 21:42 /user/cloudera/customers/part-m-00001
-rw-r--r--   1 cloudera cloudera     238092 2020-08-02 21:42 /user/cloudera/customers/part-m-00002
-rw-r--r--   1 cloudera cloudera     240323 2020-08-02 21:42 /user/cloudera/customers/part-m-00003

mysql> select count(1) from customers;
+----------+
| count(1) |
+----------+
|    12435 |
+----------+
1 row in set (0.02 sec)


// check the row count across the hdfs files
hdfs dfs -cat /user/cloudera/customers/*  | wc -l
12435

// get the first 10 lines
[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/customers/part-m-00000 | head
1,Richard,Hernandez,XXXXXXXXX,XXXXXXXXX,6303 Heather Plaza,Brownsville,TX,78521
2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble Embers Ridge,Littleton,CO,80126
3,Ann,Smith,XXXXXXXXX,XXXXXXXXX,3422 Blue Pioneer Bend,Caguas,PR,00725
4,Mary,Jones,XXXXXXXXX,XXXXXXXXX,8324 Little Common,San Marcos,CA,92069
5,Robert,Hudson,XXXXXXXXX,XXXXXXXXX,10 Crystal River Mall ,Caguas,PR,00725
6,Mary,Smith,XXXXXXXXX,XXXXXXXXX,3151 Sleepy Quail Promenade,Passaic,NJ,07055
7,Melissa,Wilcox,XXXXXXXXX,XXXXXXXXX,9453 High Concession,Caguas,PR,00725
8,Megan,Smith,XXXXXXXXX,XXXXXXXXX,3047 Foggy Forest Plaza,Lawrence,MA,01841
9,Mary,Perez,XXXXXXXXX,XXXXXXXXX,3616 Quaking Street,Caguas,PR,00725
10,Melissa,Smith,XXXXXXXXX,XXXXXXXXX,8598 Harvest Beacon Plaza,Stafford,VA,22554

// increase the number of mappers to 8:
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table customers \
-m 8

// We get the below error
Output directory hdfs://quickstart.cloudera:8020/user/cloudera/customers already exists

// delete the output folder
$ hdfs dfs -rm -r hdfs://quickstart.cloudera:8020/user/cloudera/customers
Deleted hdfs://quickstart.cloudera:8020/user/cloudera/customers




// re-run the import with -m 8; it now produces 8 part files
$ hdfs dfs -ls hdfs://quickstart.cloudera:8020/user/cloudera/customers 
Found 9 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 21:58 hdfs://quickstart.cloudera:8020/user/cloudera/customers/_SUCCESS
-rw-r--r--   1 cloudera cloudera     118147 2020-08-02 21:57 hdfs://quickstart.cloudera:8020/user/cloudera/customers/part-m-00000
-rw-r--r--   1 cloudera cloudera     119075 2020-08-02 21:57 hdfs://quickstart.cloudera:8020/user/cloudera/customers/part-m-00001
-rw-r--r--   1 cloudera cloudera     119057 2020-08-02 21:57 hdfs://quickstart.cloudera:8020/user/cloudera/customers/part-m-00002
-rw-r--r--   1 cloudera cloudera     118831 2020-08-02 21:57 hdfs://quickstart.cloudera:8020/user/cloudera/customers/part-m-00003
-rw-r--r--   1 cloudera cloudera     118798 2020-08-02 21:57 hdfs://quickstart.cloudera:8020/user/cloudera/customers/part-m-00004
-rw-r--r--   1 cloudera cloudera     119294 2020-08-02 21:57 hdfs://quickstart.cloudera:8020/user/cloudera/customers/part-m-00005
-rw-r--r--   1 cloudera cloudera     119892 2020-08-02 21:58 hdfs://quickstart.cloudera:8020/user/cloudera/customers/part-m-00006
-rw-r--r--   1 cloudera cloudera     120431 2020-08-02 21:58 hdfs://quickstart.cloudera:8020/user/cloudera/customers/part-m-00007

// reduce the number of mappers to 2
// --delete-target-dir removes the target directory first if it already exists
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table customers \
-m 2 \
--target-dir /user/cloudera/customer_new \
--delete-target-dir


// only 2 mappers
$ hdfs dfs -ls /user/cloudera/customer_new 
Found 3 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 22:12 /user/cloudera/customer_new/_SUCCESS
-rw-r--r--   1 cloudera cloudera     475038 2020-08-02 22:12 /user/cloudera/customer_new/part-m-00000
-rw-r--r--   1 cloudera cloudera     478487 2020-08-02 22:12 /user/cloudera/customer_new/part-m-00001
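
Fewer mappers means fewer, larger part files, but the total row count is unchanged; the same cat | wc -l check used earlier should again report 12435:

// verify the two part files still hold all the rows
hdfs dfs -cat /user/cloudera/customer_new/* | wc -l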

// re-running the same sqoop command will not raise an error even though
// the target directory already exists, because --delete-target-dir removes it first

Friday, 31 July 2020

Import a table from MySQL to Hbase using SQOOP Import

// Import into Hbase
sqoop import \
--connect jdbc:mysql://localhost:3306/ohm \
--driver com.mysql.jdbc.Driver \
--username root \
--password cloudera \
--table employee \
--hbase-create-table \
--hbase-table employee_details \
--column-family employees \
--hbase-row-key id \
-m 1

// single line statement for the same
sqoop import --connect jdbc:mysql://localhost:3306/ohm --driver com.mysql.jdbc.Driver --username root --password cloudera --table employee --hbase-create-table --hbase-table employee_details --column-family employees --hbase-row-key id -m 1

[cloudera@quickstart ~]$ hbase shell

scan 'employee_details'
// displays every row as a set of column/value pairs
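
Beyond a full scan, a couple of other hbase shell commands are handy for verifying the import: fetching a single row by its row key, and counting the rows:

// fetch the row whose row key is id 1
get 'employee_details', '1'

// count the rows in the table
count 'employee_details'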

Import a table from MySQL to Hive using SQOOP Import

// Start MySQL 
$ sudo mysql -uroot -pcloudera

mysql> create database ohm;
Query OK, 1 row affected (0.00 sec)

mysql> use ohm;
Database changed


CREATE TABLE employee 
(
id INT,
first_name VARCHAR(100),
last_name VARCHAR(100),
gender VARCHAR(10),
designation VARCHAR(20),
city VARCHAR(20),
country VARCHAR(20)
);


INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (1, 'Jervis', 'Roll', 'Male', 'Director of Sales', 'Thi Tran Lac', 'Vietnam');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (2, 'Gordon', 'Maltster', 'Male', 'Marketing Manager', 'Mabu', 'China');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (3, 'Griff', 'Godsafe', 'Male', 'Actuary', 'Kipit', 'Philippines');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (4, 'Gracie', 'Franken', 'Female', 'Assistant Manager', 'Xiabuji', 'China');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (5, 'Joelly', 'Wellbank', 'Female', 'Account Coordinator', 'Whitehorse', 'Canada');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (6, 'Bab', 'Havock', 'Female', 'Accountant II', 'Basyūn', 'Egypt');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (7, 'Carmine', 'Courage', 'Female', 'Account Coordinator', 'Boyeros', 'Cuba');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (8, 'Estella', 'Marvell', 'Female', 'Structural Analysis Engineer', 'Stettler', 'Canada');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (9, 'Celie', 'Trevaskiss', 'Female', 'Assistant Manager', 'Criuleni', 'Moldova');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (10, 'Madison', 'Ranyell', 'Male', 'Research Associate', 'Angatel', 'Philippines');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (11, 'Haydon', 'Faughey', 'Male', 'Safety Technician IV', 'Masalovka', 'Russia');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (12, 'Michele', 'Zarfai', 'Male', 'Legal Assistant', 'Karatau', 'Kazakhstan');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (13, 'Ruthi', 'Bowmer', 'Female', 'Analog Circuit Design manager', 'Peski', 'Russia');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (14, 'Adolphus', 'Pickthorne', 'Male', 'Senior Developer', 'Mae Fa Luang', 'Thailand');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (15, 'Kat', 'Dymocke', 'Female', 'Geological Engineer', 'Markópoulo Oropoú', 'Greece');


// Display databases
$ sqoop list-databases --connect jdbc:mysql://localhost --username root --password cloudera

information_schema
cm
firehose
hue
metastore
mysql
nav
navms
ohm
oozie
retail_db
rman

// Display tables
$ sqoop list-tables --connect jdbc:mysql://localhost/ohm --username root --password cloudera
employee
person

// Create a database in Hive
hive> create database ohm;
OK
Time taken: 2.041 seconds
hive> show tables;
OK
Time taken: 0.244 seconds


// Import a table from Mysql To Hive
// Import a table from MySQL to Hive
sqoop import \
--connect jdbc:mysql://localhost:3306/ohm \
--driver com.mysql.jdbc.Driver \
--username root \
--password cloudera \
--table employee \
--hive-import \
--split-by id \
--hive-table ohm.employee
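
A plain hive-import stores the table as text files under the Hive warehouse directory; the exact HDFS location can be confirmed from Hive itself:

// show the table's HDFS location along with other metadata
hive> describe formatted ohm.employee;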

 
hive> use ohm;
OK
Time taken: 0.017 seconds
hive> show tables;
OK
employee
Time taken: 0.029 seconds, Fetched: 1 row(s)
hive> select * from employee;
OK
1 Jervis Roll Male Director of Sales Thi Tran Lac Vietnam
2 Gordon Maltster Male Marketing Manager Mabu China
3 Griff Godsafe Male Actuary Kipit Philippines
4 Gracie Franken Female Assistant Manager Xiabuji China
5 Joelly Wellbank Female Account Coordinator Whitehorse Canada
6 Bab Havock Female Accountant II Basyūn Egypt
10 Madison Ranyell Male Research Associate Angatel Philippines
11 Haydon Faughey Male Safety Technician IV Masalovka Russia
12 Michele Zarfai Male Legal Assistant Karatau Kazakhstan
13 Ruthi Bowmer Female Analog Circuit Desig Peski Russia
14 Adolphus Pickthorne Male Senior Developer Mae Fa Luang Thailand
15 Kat Dymocke Female Geological Engineer Markópoulo Oropoú Greece
Time taken: 0.92 seconds, Fetched: 12 row(s)
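
Note that the designation in row 13 is cut off ("Analog Circuit Desig"): the value was truncated to 20 characters by MySQL on insert because the column was declared VARCHAR(20); Sqoop and Hive simply carried the truncated value over. If the full text is needed, widen the column before loading, e.g.:

-- widen the designation column, then re-insert and re-import
ALTER TABLE employee MODIFY designation VARCHAR(50);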


// Verify the record count in MySQL, HDFS and Hive
mysql> use ohm;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select count(1) from employee;
+----------+
| count(1) |
+----------+
|       12 |
+----------+
1 row in set (0.00 sec)



// record count across all the files in the Hive warehouse directory
hdfs dfs -cat /user/hive/warehouse/ohm.db/employee/* | wc -l
12

// record count from the Hive table
hive> select count(1) from employee;
OK
12

