
Saturday, 8 August 2020

SQOOP Export with Staging table Example - HDFS to MySQL data export using SQOOP with staging-table

mysql> use retail_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A


Database changed
mysql> show tables;
+---------------------+
| Tables_in_retail_db |
+---------------------+
| categories          |
| customers           |
| departments         |
| order_items         |
| orders              |
| products            |
+---------------------+
6 rows in set (0.00 sec)


mysql> describe orders;
+-------------------+-------------+------+-----+---------+----------------+
| Field             | Type        | Null | Key | Default | Extra          |
+-------------------+-------------+------+-----+---------+----------------+
| order_id          | int(11)     | NO   | PRI | NULL    | auto_increment |
| order_date        | datetime    | NO   |     | NULL    |                |
| order_customer_id | int(11)     | NO   |     | NULL    |                |
| order_status      | varchar(45) | NO   |     | NULL    |                |
+-------------------+-------------+------+-----+---------+----------------+




// create a staging table with the same structure as the orders table


mysql> create table retail_db.orders_staging as select * from retail_db.orders where 1 = 0;
Query OK, 0 rows affected (0.10 sec)
Records: 0 Duplicates: 0 Warnings: 0


// create a new table, orders_exported, which has the same structure as the retail_db.orders table


mysql> create table retail_db.orders_exported as select * from retail_db.orders where 1 = 0;
Query OK, 0 rows affected (0.00 sec)
Records: 0 Duplicates: 0 Warnings: 0






// verify the structure of the newly created orders_staging table
mysql> describe orders_staging;
+-------------------+-------------+------+-----+---------+-------+
| Field             | Type        | Null | Key | Default | Extra |
+-------------------+-------------+------+-----+---------+-------+
| order_id          | int(11)     | NO   |     | 0       |       |
| order_date        | datetime    | NO   |     | NULL    |       |
| order_customer_id | int(11)     | NO   |     | NULL    |       |
| order_status      | varchar(45) | NO   |     | NULL    |       |
+-------------------+-------------+------+-----+---------+-------+
4 rows in set (0.00 sec)


// I already have the orders data in HDFS
$ hdfs dfs -ls /user/cloudera/orders
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-06 05:52 /user/cloudera/orders/_SUCCESS
-rw-r--r--   1 cloudera cloudera     741614 2020-08-06 05:52 /user/cloudera/orders/part-m-00000
-rw-r--r--   1 cloudera cloudera     753022 2020-08-06 05:52 /user/cloudera/orders/part-m-00001
-rw-r--r--   1 cloudera cloudera     752368 2020-08-06 05:52 /user/cloudera/orders/part-m-00002
-rw-r--r--   1 cloudera cloudera     752940 2020-08-06 05:52 /user/cloudera/orders/part-m-00003




// Export HDFS data into the MySQL table using the staging table approach
sqoop export \
--connect "jdbc:mysql://localhost/retail_db" \
--username root \
--password cloudera \
--table orders_exported \
--export-dir hdfs://localhost:8020/user/cloudera/orders/ \
--staging-table orders_staging \
--clear-staging-table \
--input-fields-terminated-by ','




20/08/08 00:24:45 INFO mapreduce.ExportJobBase: Exported 68883 records.
20/08/08 00:24:45 INFO mapreduce.ExportJobBase: Starting to migrate data from staging table to destination.
20/08/08 00:24:45 INFO manager.SqlManager: Migrated 68883 records from `orders_staging` to `orders_exported`




mysql> select * from retail_db.orders_exported limit 5;
+----------+---------------------+-------------------+-----------------+
| order_id | order_date          | order_customer_id | order_status    |
+----------+---------------------+-------------------+-----------------+
|        1 | 2013-07-25 00:00:00 |             11599 | CLOSED          |
|        2 | 2013-07-25 00:00:00 |               256 | PENDING_PAYMENT |
|        3 | 2013-07-25 00:00:00 |             12111 | COMPLETE        |
|        4 | 2013-07-25 00:00:00 |              8827 | CLOSED          |
|        5 | 2013-07-25 00:00:00 |             11318 | COMPLETE        |
+----------+---------------------+-------------------+-----------------+
5 rows in set (0.00 sec)


mysql> select count(1) from retail_db.orders_exported;
+----------+
| count(1) |
+----------+
|    68883 |
+----------+
1 row in set (0.01 sec)




Line count in HDFS:
$ hdfs dfs -cat hdfs://localhost:8020/user/cloudera/orders/* | wc -l
68883
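

After the migration, the staging table should be empty again (--clear-staging-table also empties it before each run). A quick check, a sketch assuming the same MySQL credentials used above:

$ mysql -uroot -pcloudera -e "select count(1) from retail_db.orders_staging"
// expected: 0, since every row was migrated into orders_exported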







Friday, 7 August 2020

Import all the tables from MySQL to Hive with Snappy Compression and Parquet file creation using SQOOP

mysql> use retail_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+---------------------+
| Tables_in_retail_db |
+---------------------+
| categories          |
| customers           |
| departments         |
| order_items         |
| orders              |
| products            |
+---------------------+
6 rows in set (0.00 sec)


Here we are going to import all the tables of the retail_db database of MySQL into Hive.

Additional options used: Snappy compression and saving as Parquet files.


// Drop the database in Hive if it already exists

hive> drop database ohm cascade;
OK
Time taken: 1.187 seconds

hive> create database ohm;
OK


sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--warehouse-dir /user/hive/warehouse \
-m 1 \
--hive-database ohm \
--hive-import \
--hive-overwrite \
--create-hive-table \
--compress \
--compression-codec snappy \
--as-parquetfile \
--outdir java_out





hive> use ohm;
OK
Time taken: 0.01 seconds

hive> show tables;
OK
categories
customers
departments
order_items
orders
products
Time taken: 0.029 seconds, Fetched: 6 row(s)


hive> describe database ohm;
OK
ohm hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db cloudera USER
Time taken: 0.011 seconds, Fetched: 1 row(s)
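

To confirm the imported tables really are Parquet files, a sketch that lists one table's files under the warehouse path reported above (the Snappy codec is recorded inside the Parquet pages, so check it with parquet-tools meta rather than the file name):

$ hdfs dfs -ls /user/hive/warehouse/ohm.db/orders
// data files should carry the .parquet extension
$ parquet-tools meta hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db/orders/<some-file>.parquet
// look for SNAPPY in the column chunk metadata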




The following shows how to import the tables one by one (see also the loop sketch after these commands):

// note: each import needs its own --target-dir (the staging directory for the Hive load)

sqoop import \
--connect "jdbc:mysql://localhost:3306/retail_db" \
--username root \
--password cloudera \
--table categories \
--target-dir /user/cloudera/retail_db/categories \
--hive-import \
--create-hive-table \
--hive-database ohm \
--hive-table categories;


sqoop import \
--connect "jdbc:mysql://localhost:3306/retail_db" \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/retail_db/customers \
--hive-import \
--create-hive-table \
--hive-database ohm \
--hive-table customers;


sqoop import \
--connect "jdbc:mysql://localhost:3306/retail_db" \
--username root \
--password cloudera \
--table departments \
--target-dir /user/cloudera/retail_db/dept \
--hive-import \
--create-hive-table \
--hive-database ohm \
--hive-table departments;


sqoop import \
--connect "jdbc:mysql://localhost:3306/retail_db" \
--username root \
--password cloudera \
--table order_items \
--target-dir /user/cloudera/retail_db/order_items \
--hive-import \
--create-hive-table \
--hive-database ohm \
--hive-table order_items;


sqoop import \
--connect "jdbc:mysql://localhost:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--target-dir /user/cloudera/retail_db/orders \
--hive-import \
--create-hive-table \
--hive-database ohm \
--hive-table orders;


sqoop import \
--connect "jdbc:mysql://localhost:3306/retail_db" \
--username root \
--password cloudera \
--table products \
--target-dir /user/cloudera/retail_db/products \
--hive-import \
--create-hive-table \
--hive-database ohm \
--hive-table products;
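

Since the six commands above differ only in the table name, they can be scripted; a minimal shell sketch assuming per-table target directories under /user/cloudera/retail_db:

for t in categories customers departments order_items orders products; do
  sqoop import \
  --connect "jdbc:mysql://localhost:3306/retail_db" \
  --username root \
  --password cloudera \
  --table "$t" \
  --target-dir "/user/cloudera/retail_db/$t" \
  --hive-import \
  --create-hive-table \
  --hive-database ohm \
  --hive-table "$t"
done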





// Or import everything in one shot; import-all-tables takes --warehouse-dir
// (each table lands in its own subdirectory) and no per-table --table/--hive-table options
sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--warehouse-dir /user/cloudera/retail_db \
--hive-import \
--create-hive-table \
--hive-database ohm


hive> show databases;
OK
default
ohm


hive> use ohm;
OK



hive> show tables;
OK
categories
cust_parquet
customer_parquet_snappy
customers
departments
order_items
person_fixed
products


 

Create, Execute, Delete SQOOP Jobs

// I have an ohm.person table in MySQL with the following records

mysql> select * from ohm.person;
+----+-----------------+---------------------+
| id | name            | last_mod_dt         |
+----+-----------------+---------------------+
|  1 | Raja            | 2020-08-07 01:17:17 |
|  2 | Ravi            | 2020-08-07 01:17:30 |
|  3 | Kalai           | 2020-08-07 01:17:34 |
|  4 | Sudha           | 2020-08-07 01:17:39 |
|  5 | Priya           | 2020-08-07 01:31:28 |
|  6 | Vanitha         | 2020-08-07 01:31:34 |
|  7 | Kasturi         | 2020-08-07 01:31:40 |
|  8 | Lakshmi         | 2020-08-07 01:31:45 |
|  9 | Suriya Devi     | 2020-08-07 01:31:51 |
| 10 | Nanjil Vijayan  | 2020-08-07 01:40:53 |
| 11 | Elizabeth Helen | 2020-08-07 01:41:14 |
| 12 | Peter Paul      | 2020-08-07 01:41:20 |
| 13 | Ravindran       | 2020-08-07 01:41:35 |
+----+-----------------+---------------------+




// Create a new Sqoop job:

$ sqoop job \
--create person_inc_job \
-- import \
--connect jdbc:mysql://localhost:3306/ohm \
--table person \
--username root \
--password cloudera \
--check-column last_mod_dt \
--incremental append \
--last-value '01:17:17' \
--target-dir /user/cloudera/person_test \
-m 1



// Display all job names already created

$ sqoop job --list

Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
20/08/07 02:05:54 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
Available jobs:
  person_inc_job


// delete the target folder if it exists

$ hdfs dfs -rm -r /user/cloudera/person_test
rm: `/user/cloudera/person_test': No such file or directory


// run the sqoop job for the first time

$ sqoop job --exec person_inc_job

Retrieved 13 records.



$ hdfs dfs -cat /user/cloudera/person_test/*
1,Raja,2020-08-07 01:17:17.0
2,Ravi,2020-08-07 01:17:30.0
3,Kalai,2020-08-07 01:17:34.0
4,Sudha,2020-08-07 01:17:39.0
5,Priya,2020-08-07 01:31:28.0
6,Vanitha,2020-08-07 01:31:34.0
7,Kasturi,2020-08-07 01:31:40.0
8,Lakshmi,2020-08-07 01:31:45.0
9,Suriya Devi,2020-08-07 01:31:51.0
10,Nanjil Vijayan,2020-08-07 01:40:53.0
11,Elizabeth Helen,2020-08-07 01:41:14.0
12,Peter Paul,2020-08-07 01:41:20.0
13,Ravindran,2020-08-07 01:41:35.0



// Now add some more records in MySQL

mysql> insert into ohm.person(name) values ('Anbu');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Sudha');
Query OK, 1 row affected (0.01 sec)

mysql> insert into ohm.person(name) values ('Aish');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Vijay');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Balaji');
Query OK, 1 row affected (0.01 sec)



mysql> select * from ohm.person;  -- MySQL console
+----+-----------------+---------------------+
| id | name            | last_mod_dt         |
+----+-----------------+---------------------+
|  1 | Raja            | 2020-08-07 01:17:17 |
|  2 | Ravi            | 2020-08-07 01:17:30 |
|  3 | Kalai           | 2020-08-07 01:17:34 |
|  4 | Sudha           | 2020-08-07 01:17:39 |
|  5 | Priya           | 2020-08-07 01:31:28 |
|  6 | Vanitha         | 2020-08-07 01:31:34 |
|  7 | Kasturi         | 2020-08-07 01:31:40 |
|  8 | Lakshmi         | 2020-08-07 01:31:45 |
|  9 | Suriya Devi     | 2020-08-07 01:31:51 |
| 10 | Nanjil Vijayan  | 2020-08-07 01:40:53 |
| 11 | Elizabeth Helen | 2020-08-07 01:41:14 |
| 12 | Peter Paul      | 2020-08-07 01:41:20 |
| 13 | Ravindran       | 2020-08-07 01:41:35 |  -- last value
| 14 | Anbu            | 2020-08-07 02:23:25 |  -- newly added records
| 15 | Sudha           | 2020-08-07 02:23:29 |
| 16 | Aish            | 2020-08-07 02:23:36 |
| 17 | Vijay           | 2020-08-07 02:23:44 |
| 18 | Balaji          | 2020-08-07 02:23:47 |
+----+-----------------+---------------------+
18 rows in set (0.00 sec)



// run the sqoop job again to fetch the incremental records

$ sqoop job --exec person_inc_job

Retrieved 5 records.


// Display the imported data available in HDFS

$ hdfs dfs -cat /user/cloudera/person_test/*
1,Raja,2020-08-07 01:17:17.0
2,Ravi,2020-08-07 01:17:30.0
3,Kalai,2020-08-07 01:17:34.0
4,Sudha,2020-08-07 01:17:39.0
5,Priya,2020-08-07 01:31:28.0
6,Vanitha,2020-08-07 01:31:34.0
7,Kasturi,2020-08-07 01:31:40.0
8,Lakshmi,2020-08-07 01:31:45.0
9,Suriya Devi,2020-08-07 01:31:51.0
10,Nanjil Vijayan,2020-08-07 01:40:53.0
11,Elizabeth Helen,2020-08-07 01:41:14.0
12,Peter Paul,2020-08-07 01:41:20.0
13,Ravindran,2020-08-07 01:41:35.0 -- loaded by the first (full) run
14,Anbu,2020-08-07 02:23:25.0   -- incremental records
15,Sudha,2020-08-07 02:23:29.0
16,Aish,2020-08-07 02:23:36.0
17,Vijay,2020-08-07 02:23:44.0
18,Balaji,2020-08-07 02:23:47.0
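

// Between runs, sqoop job --show prints the job's stored definition, including
// the incremental.last.value that Sqoop updates after every successful --exec:

$ sqoop job --show person_inc_job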



// To delete the sqoop job:

$ sqoop job --delete person_inc_job

Incremental Import in SQOOP with Date column

// Create a database named : ohm
// Create a table named : person, with id (auto increment), name (string), and last_mod_dt (defaults to current_timestamp)

mysql> create database if not exists ohm;
Query OK, 1 row affected (0.00 sec)

mysql> use ohm;

CREATE TABLE if not exists ohm.person (
  id BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
  name VARCHAR(500) NOT NULL,
  last_mod_dt TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id),
  UNIQUE KEY person (last_mod_dt)
);


mysql> describe ohm.person;
+-------------+---------------------+------+-----+-------------------+----------------+
| Field       | Type                | Null | Key | Default           | Extra          |
+-------------+---------------------+------+-----+-------------------+----------------+
| id          | bigint(20) unsigned | NO   | PRI | NULL              | auto_increment |
| name        | varchar(500)        | NO   |     | NULL              |                |
| last_mod_dt | timestamp           | NO   | UNI | CURRENT_TIMESTAMP |                |
+-------------+---------------------+------+-----+-------------------+----------------+



// Add 9 records

mysql> insert into ohm.person(name) values ('Raja');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Ravi');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Kalai');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Sudha');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Priya');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Vanitha');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Kasturi');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Lakshmi');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Suriya Devi');


mysql> select * from ohm.person;
+----+-------------+---------------------+
| id | name        | last_mod_dt         |
+----+-------------+---------------------+
|  1 | Raja        | 2020-08-07 01:17:17 |
|  2 | Ravi        | 2020-08-07 01:17:30 |
|  3 | Kalai       | 2020-08-07 01:17:34 |
|  4 | Sudha       | 2020-08-07 01:17:39 |
|  5 | Priya       | 2020-08-07 01:31:28 |
|  6 | Vanitha     | 2020-08-07 01:31:34 |
|  7 | Kasturi     | 2020-08-07 01:31:40 |
|  8 | Lakshmi     | 2020-08-07 01:31:45 |
|  9 | Suriya Devi | 2020-08-07 01:31:51 |
+----+-------------+---------------------+





Now do the initial SQOOP import (an incremental-append import that acts as the first load)


$ sqoop import \
--connect jdbc:mysql://localhost:3306/ohm \
--table person \
--username root \
--password cloudera \
--check-column last_mod_dt \
--incremental append \
--last-value '2020-08-07 01:17:17' \
--target-dir /user/cloudera/person_test


BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `person` WHERE ( `last_mod_dt` > '2020-08-07 01:17:17' AND `last_mod_dt` <= '2020-08-07 01:31:51.0' )



$ hdfs dfs -ls /user/cloudera/person_test
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-07 01:28 /user/cloudera/person_test/_SUCCESS
-rw-r--r--   1 cloudera cloudera         29 2020-08-07 01:28 /user/cloudera/person_test/part-m-00000
-rw-r--r--   1 cloudera cloudera         29 2020-08-07 01:28 /user/cloudera/person_test/part-m-00001
-rw-r--r--   1 cloudera cloudera         30 2020-08-07 01:28 /user/cloudera/person_test/part-m-00002
-rw-r--r--   1 cloudera cloudera         30 2020-08-07 01:28 /user/cloudera/person_test/part-m-00003


// 8 of the 9 rows were imported: only rows with last_mod_dt strictly greater
// than the given last-value qualify, so row 1 (exactly '2020-08-07 01:17:17') is excluded

$ hdfs dfs -cat /user/cloudera/person_test/*
2,Ravi,2020-08-07 01:17:30.0
3,Kalai,2020-08-07 01:17:34.0
4,Sudha,2020-08-07 01:17:39.0
5,Priya,2020-08-07 01:31:28.0
6,Vanitha,2020-08-07 01:31:34.0
7,Kasturi,2020-08-07 01:31:40.0
8,Lakshmi,2020-08-07 01:31:45.0
9,Suriya Devi,2020-08-07 01:31:51.0


 

// Adding new records in MySQL

mysql> insert into ohm.person(name) values ('Nanjil Vijayan');
Query OK, 1 row affected (0.01 sec)

mysql> insert into ohm.person(name) values ('Elizabeth Helen');
Query OK, 1 row affected (0.01 sec)

mysql> insert into ohm.person(name) values ('Peter Paul');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ohm.person(name) values ('Ravindran');
Query OK, 1 row affected (0.00 sec)


mysql> select * from ohm.person;
+----+-----------------+---------------------+
| id | name            | last_mod_dt         |
+----+-----------------+---------------------+
|  1 | Raja            | 2020-08-07 01:17:17 |
|  2 | Ravi            | 2020-08-07 01:17:30 |
|  3 | Kalai           | 2020-08-07 01:17:34 |
|  4 | Sudha           | 2020-08-07 01:17:39 |
|  5 | Priya           | 2020-08-07 01:31:28 |
|  6 | Vanitha         | 2020-08-07 01:31:34 |
|  7 | Kasturi         | 2020-08-07 01:31:40 |
|  8 | Lakshmi         | 2020-08-07 01:31:45 |
|  9 | Suriya Devi     | 2020-08-07 01:31:51 |  -- last-value of the previous import
| 10 | Nanjil Vijayan  | 2020-08-07 01:40:53 |
| 11 | Elizabeth Helen | 2020-08-07 01:41:14 |
| 12 | Peter Paul      | 2020-08-07 01:41:20 |
| 13 | Ravindran       | 2020-08-07 01:41:35 |
+----+-----------------+---------------------+





Now do the SQOOP incremental import in append mode, feeding the previous run's last-value


$ sqoop import \
--connect jdbc:mysql://localhost:3306/ohm \
--table person \
--username root \
--password cloudera \
--check-column last_mod_dt \
--incremental append \
--last-value '2020-08-07 01:31:51' \
--target-dir /user/cloudera/person_test \
-m 1



$ hdfs dfs -ls /user/cloudera/person_test
Found 5 items
-rw-r--r--   1 cloudera cloudera         59 2020-08-07 01:37 /user/cloudera/person_test/part-m-00000
-rw-r--r--   1 cloudera cloudera         60 2020-08-07 01:38 /user/cloudera/person_test/part-m-00001
-rw-r--r--   1 cloudera cloudera         64 2020-08-07 01:38 /user/cloudera/person_test/part-m-00002
-rw-r--r--   1 cloudera cloudera         68 2020-08-07 01:38 /user/cloudera/person_test/part-m-00003
-rw-r--r--   1 cloudera cloudera        152 2020-08-07 01:43 /user/cloudera/person_test/part-m-00004  -- new file holding only the incremental records



$ hdfs dfs -cat /user/cloudera/person_test/*
2,Ravi,2020-08-07 01:17:30.0
3,Kalai,2020-08-07 01:17:34.0
4,Sudha,2020-08-07 01:17:39.0
5,Priya,2020-08-07 01:31:28.0
6,Vanitha,2020-08-07 01:31:34.0
7,Kasturi,2020-08-07 01:31:40.0
8,Lakshmi,2020-08-07 01:31:45.0
9,Suriya Devi,2020-08-07 01:31:51.0
10,Nanjil Vijayan,2020-08-07 01:40:53.0  -- incremental records
11,Elizabeth Helen,2020-08-07 01:41:14.0
12,Peter Paul,2020-08-07 01:41:20.0
13,Ravindran,2020-08-07 01:41:35.0
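

Append mode only picks up brand-new rows. If existing rows can also be updated in place (which this table's last_mod_dt would capture), Sqoop's lastmodified mode with a merge key is the usual alternative; a sketch, assuming id as the merge key:

$ sqoop import \
--connect jdbc:mysql://localhost:3306/ohm \
--table person \
--username root \
--password cloudera \
--check-column last_mod_dt \
--incremental lastmodified \
--last-value '2020-08-07 01:41:35' \
--merge-key id \
--target-dir /user/cloudera/person_test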



Wednesday, 5 August 2020

Import all tables from MySQL to HDFS (Exclude Specific tables while importing)

//Import all the tables from MySQL to HDFS using SQOOP and exclude specific tables while importing

mysql> use retail_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+---------------------+
| Tables_in_retail_db |
+---------------------+
| categories          |
| customers           |
| departments         |
| order_items         |
| orders              |
| products            |
+---------------------+
6 rows in set (0.00 sec)


// Import all tables from retail_db 

sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera


// imported into /user/cloudera/categories,
// /user/cloudera/customers,
// /user/cloudera/departments,
// /user/cloudera/order_items,
// /user/cloudera/orders,
// /user/cloudera/products

// Testing purpose - excluded all the tables 
$ sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera   \
--exclude-tables "categories,customers,departments,order_items,orders,products"

Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
20/08/05 09:05:02 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.0
20/08/05 09:05:02 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/08/05 09:05:05 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
Skipping table: categories
Skipping table: customers
Skipping table: departments
Skipping table: order_items
Skipping table: orders
Skipping table: products



// Only the categories table will be imported - every table except categories is named in the exclude-tables list.
sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera   \
--exclude-tables "customers,departments,order_items,orders,products"


Skipping table: customers
Skipping table: departments
Skipping table: order_items
Skipping table: orders
Skipping table: products


Use the --exclude-tables "table1,table2" option to skip table1 and table2.

Do NOT add whitespace between the table names (i.e. "table1, table2" will not match).


$ hdfs dfs -ls /user/cloudera/
Found 1 items
drwxr-xr-x   - cloudera supergroup          0 2020-08-05 09:12 /user/cloudera/categories

$ hdfs dfs -ls /user/cloudera/categories/*
-rw-r--r--   1 cloudera supergroup          0 2020-08-05 09:12 /user/cloudera/categories/_SUCCESS
-rw-r--r--   1 cloudera supergroup        271 2020-08-05 09:12 /user/cloudera/categories/part-m-00000
-rw-r--r--   1 cloudera supergroup        263 2020-08-05 09:12 /user/cloudera/categories/part-m-00001
-rw-r--r--   1 cloudera supergroup        266 2020-08-05 09:12 /user/cloudera/categories/part-m-00002
-rw-r--r--   1 cloudera supergroup        229 2020-08-05 09:12 /user/cloudera/categories/part-m-00003

Monday, 3 August 2020

SQOOP Export Example : Hive to MySQL using SQOOP Export

// We are going to export ohm.cust_from_mysql (Hive table) into cust_exported (MySQL)

hive> select * from ohm.cust_from_mysql limit 5;
OK
1 Richard Hernandez XXXXXXXXX XXXXXXXXX 6303 Heather Plaza Brownsville TX 78521 
2 Mary Barrett XXXXXXXXX XXXXXXXXX 9526 Noble Embers Ridge Littleton CO 80126
3 Ann Smith XXXXXXXXX XXXXXXXXX 3422 Blue Pioneer Bend Caguas PR 00725
4 Mary Jones XXXXXXXXX XXXXXXXXX 8324 Little Common San Marcos CA 92069
5 Robert Hudson XXXXXXXXX XXXXXXXXX 10 Crystal River Mall Caguas PR 00725

hive> select count(1) from ohm.cust_from_mysql;

12435 -- the Hive table has 12435 rows

// We must create the target table in the destination database before the export.
// Create an empty table in retail_db of MySQL:
mysql> create table retail_db.cust_exported as select * from customers where 1 = 0;

// Presently the retail_db.cust_exported table has zero rows.

$ sqoop export \
--connect "jdbc:mysql://localhost/retail_db" \
--username root \
--password cloudera \
--table cust_exported \
--hcatalog-table cust_from_mysql \
--hcatalog-database ohm

Exported 12435 records.


mysql> select * from cust_exported  limit 5;
+-------------+----------------+----------------+----------------+-------------------+------------------------+---------------+----------------+------------------+
| customer_id | customer_fname | customer_lname | customer_email | customer_password | customer_street        | customer_city | customer_state | customer_zipcode |
+-------------+----------------+----------------+----------------+-------------------+------------------------+---------------+----------------+------------------+
|        9327 | Donna          | Smith          | XXXXXXXXX      | XXXXXXXXX         | 4114 Clear Nectar Isle | Caguas        | PR             | 00725            |
|           1 | Richard        | Hernandez      | XXXXXXXXX      | XXXXXXXXX         | 6303 Heather Plaza     | Brownsville   | TX             | 78521            |
|        9328 | Mary           | Perez          | XXXXXXXXX      | XXXXXXXXX         | 376 Golden Orchard     | Moreno Valley | CA             | 92553            |
|        9329 | Eugene         | Powell         | XXXXXXXXX      | XXXXXXXXX         | 2161 Burning Maze      | Metairie      | LA             | 70003            |
|        9330 | Mary           | Conley         | XXXXXXXXX      | XXXXXXXXX         | 3046 Broad Sky Dale    | Caguas        | PR             | 00725            |
+-------------+----------------+----------------+----------------+-------------------+------------------------+---------------+----------------+------------------+

mysql> select count(1) from retail_db.cust_exported;
+----------+
| count(1) |
+----------+
|    12435 |
+----------+

SQOOP Export Example : HDFS to MySQL using SQOOP Export


The target table must exist in the database, so we create the empty table first:

mysql> create table retail_db.customers_exported as select * from customers where 1 = 0;

// Now the record count is zero for customers_exported table of retail_db;

hdfs dfs  -ls /user/cloudera/customers
Found 9 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 21:58 /user/cloudera/customers/_SUCCESS
-rw-r--r--   1 cloudera cloudera     118147 2020-08-02 21:57 /user/cloudera/customers/part-m-00000
-rw-r--r--   1 cloudera cloudera     119075 2020-08-02 21:57 /user/cloudera/customers/part-m-00001
-rw-r--r--   1 cloudera cloudera     119057 2020-08-02 21:57 /user/cloudera/customers/part-m-00002
-rw-r--r--   1 cloudera cloudera     118831 2020-08-02 21:57 /user/cloudera/customers/part-m-00003
-rw-r--r--   1 cloudera cloudera     118798 2020-08-02 21:57 /user/cloudera/customers/part-m-00004
-rw-r--r--   1 cloudera cloudera     119294 2020-08-02 21:57 /user/cloudera/customers/part-m-00005
-rw-r--r--   1 cloudera cloudera     119892 2020-08-02 21:58 /user/cloudera/customers/part-m-00006
-rw-r--r--   1 cloudera cloudera     120431 2020-08-02 21:58 /user/cloudera/customers/part-m-00007


sqoop export \
--connect "jdbc:mysql://localhost/retail_db" \
--username root \
--password cloudera \
--table customers_exported \
--export-dir /user/cloudera/customers \
--input-fields-terminated-by ','


mysql> select * from retail_db.customers_exported limit 10;
+-------------+----------------+----------------+----------------+-------------------+--------------------------+---------------+----------------+------------------+
| customer_id | customer_fname | customer_lname | customer_email | customer_password | customer_street          | customer_city | customer_state | customer_zipcode |
+-------------+----------------+----------------+----------------+-------------------+--------------------------+---------------+----------------+------------------+
|        4665 | Jessica        | York           | XXXXXXXXX      | XXXXXXXXX         | 3982 Middle Horse Line   | Escondido     | CA             | 92025            |
|        4666 | Mary           | Smith          | XXXXXXXXX      | XXXXXXXXX         | 340 Velvet Barn Gate     | Manati        | PR             | 00674            |
|        4667 | Cheryl         | Gonzalez       | XXXXXXXXX      | XXXXXXXXX         | 5375 Iron Cloud Mews     | Caguas        | PR             | 00725            |
|        4668 | Mary           | Schwartz       | XXXXXXXXX      | XXXXXXXXX         | 4263 Old Robin Mount     | Mechanicsburg | PA             | 17055            |
|        4669 | Robert         | Smith          | XXXXXXXXX      | XXXXXXXXX         | 4905 Lazy Stead          | Humacao       | PR             | 00791            |
|        4670 | Sharon         | Bailey         | XXXXXXXXX      | XXXXXXXXX         | 3890 Jagged Apple Dell   | Caguas        | PR             | 00725            |
|        4671 | Eric           | Diaz           | XXXXXXXXX      | XXXXXXXXX         | 281 Middle Blossom Ledge | Chandler      | AZ             | 85225            |
|        4672 | Mary           | Smith          | XXXXXXXXX      | XXXXXXXXX         | 7800 Tawny Creek Trace   | Detroit       | MI             | 48235            |
|        4673 | Wayne          | Hodges         | XXXXXXXXX      | XXXXXXXXX         | 7509 Iron Concession     | Caguas        | PR             | 00725            |
|        4674 | Mary           | Lynn           | XXXXXXXXX      | XXXXXXXXX         | 7694 Velvet Turnabout    | Jamaica       | NY             | 11434            |
+-------------+----------------+----------------+----------------+-------------------+--------------------------+---------------+----------------+------------------+
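
Note that re-running the export above would append duplicate rows. To refresh existing rows instead, Sqoop can key the export on the primary key; a sketch (allowinsert upserts, updateonly only updates existing rows):

sqoop export \
--connect "jdbc:mysql://localhost/retail_db" \
--username root \
--password cloudera \
--table customers_exported \
--export-dir /user/cloudera/customers \
--input-fields-terminated-by ',' \
--update-key customer_id \
--update-mode allowinsert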

SQOOP Import Problem and Solution #3

#Problem : 3

Connect to MySQL database using SQOOP,
import all customers whose street name contains "Plaza"

Ex:
Hazy Mountain Plaza
Tawny Fox Plaza

Data description:
A MySQL instance is running on the localhost node.
In that instance, you will find customers table that contains
customer's data.

Installation : localhost
Database name : retail_db
Table Name : Customers
User Name : root 
Password : cloudera

Output requirement:

Place the customers files in the HDFS directory
"/user/cloudera/problem1/customers/textdata"

Save output in text format with fields
separated by a "*" and lines terminated by a pipe.


Select the columns : customer_id,
customer_fname,customer_lname,
customer_street_name.



$ sqoop import \
--connect "jdbc:mysql://localhost/retail_db" \
--username root \
--password cloudera \
--table customers \
--where "customer_street like '%Plaza%'" \
--target-dir /user/cloudera/problem1/customers/textdata \
--columns "customer_id,customer_fname,customer_lname,customer_street"


$ hdfs dfs -ls /user/cloudera/problem1/customers/textdata
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 08:40 /user/cloudera/problem1/customers/textdata/_SUCCESS
-rw-r--r--   1 cloudera cloudera        974 2020-08-03 08:40 /user/cloudera/problem1/customers/textdata/part-m-00000
-rw-r--r--   1 cloudera cloudera       1165 2020-08-03 08:40 /user/cloudera/problem1/customers/textdata/part-m-00001
-rw-r--r--   1 cloudera cloudera        986 2020-08-03 08:40 /user/cloudera/problem1/customers/textdata/part-m-00002
-rw-r--r--   1 cloudera cloudera       1028 2020-08-03 08:40 /user/cloudera/problem1/customers/textdata/part-m-00003


$ hdfs dfs -cat /user/cloudera/problem1/customers/textdata/part-m-00000 | tail
1793,Mary,Burke,141 Dewy Plaza
1821,Mary,Trevino,6520 Quaking Quail Plaza
1891,Mary,Mora,1804 Misty Plaza
2053,Ronald,Smith,1025 Colonial Plaza
2247,Alice,Werner,1557 Heather Leaf Plaza
2517,Jose,Kane,5202 Emerald Island Plaza
2545,Larry,Patel,1688 Middle Panda Plaza
2575,Mary,Carroll,4979 Honey Plaza
2831,Mary,Padilla,3810 Crystal Plaza
2875,Mary,Roberts,2205 Velvet Plaza
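

Note that the run above kept Sqoop's default comma delimiter, as the output shows. To actually meet the "*" and pipe requirement, the import would add the delimiter options; a sketch, not reflected in the transcript above:

$ sqoop import \
--connect "jdbc:mysql://localhost/retail_db" \
--username root \
--password cloudera \
--table customers \
--where "customer_street like '%Plaza%'" \
--target-dir /user/cloudera/problem1/customers/textdata \
--columns "customer_id,customer_fname,customer_lname,customer_street" \
--fields-terminated-by '*' \
--lines-terminated-by '|'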


 

SQOOP Import Problem and Solution #2

#Problem : 2

Connect to the MySQL database using SQOOP,
import all customers who live in the 'CA' state.

Data description:
A MySQL instance is running on the localhost node.
In that instance, you will find customers table that contains
customer's data.

Installation : localhost
Database name : retail_db
Table Name : Customers
User Name : root 
Password : cloudera

Output requirement:

Place the customers files in the HDFS directory
"/user/cloudera/problem1/customers_selected/avrodata"

Use avro format with Snappy compression

Select the columns : customer_id,customer_fname,customer_lname,customer_state.



$ sqoop import \
--connect "jdbc:mysql://localhost/retail_db" \
--username root \
--password cloudera \
--table customers \
--where "customer_state='CA'" \
--target-dir /user/cloudera/problem1/customers_selected/avrodata \
--columns "customer_id,customer_fname,customer_lname,customer_state" \
--compress \
--compression-codec snappy \
--as-avrodatafile


$ hdfs dfs -ls /user/cloudera/problem1/customers_selected/avrodata
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 08:24 /user/cloudera/problem1/customers_selected/avrodata/_SUCCESS
-rw-r--r--   1 cloudera cloudera       6508 2020-08-03 08:24 /user/cloudera/problem1/customers_selected/avrodata/part-m-00000.avro
-rw-r--r--   1 cloudera cloudera       6427 2020-08-03 08:24 /user/cloudera/problem1/customers_selected/avrodata/part-m-00001.avro
-rw-r--r--   1 cloudera cloudera       6419 2020-08-03 08:24 /user/cloudera/problem1/customers_selected/avrodata/part-m-00002.avro
-rw-r--r--   1 cloudera cloudera       7093 2020-08-03 08:24 /user/cloudera/problem1/customers_selected/avrodata/part-m-00003.avro


$ avro-tools tojson hdfs://localhost/user/cloudera/problem1/customers_selected/avrodata/part-m-00003.avro | head
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
{"customer_id":{"int":9328},"customer_fname":{"string":"Mary"},"customer_lname":{"string":"Perez"},"customer_state":{"string":"CA"}}
{"customer_id":{"int":9333},"customer_fname":{"string":"Angela"},"customer_lname":{"string":"Mills"},"customer_state":{"string":"CA"}}
{"customer_id":{"int":9336},"customer_fname":{"string":"Janice"},"customer_lname":{"string":"Guzman"},"customer_state":{"string":"CA"}}
{"customer_id":{"int":9338},"customer_fname":{"string":"James"},"customer_lname":{"string":"Davis"},"customer_state":{"string":"CA"}}
{"customer_id":{"int":9340},"customer_fname":{"string":"Mary"},"customer_lname":{"string":"Smith"},"customer_state":{"string":"CA"}}
{"customer_id":{"int":9347},"customer_fname":{"string":"Mary"},"customer_lname":{"string":"Fuentes"},"customer_state":{"string":"CA"}}
{"customer_id":{"int":9352},"customer_fname":{"string":"Mary"},"customer_lname":{"string":"Lewis"},"customer_state":{"string":"CA"}}
{"customer_id":{"int":9357},"customer_fname":{"string":"Katherine"},"customer_lname":{"string":"Spence"},"customer_state":{"string":"CA"}}
{"customer_id":{"int":9363},"customer_fname":{"string":"Mary"},"customer_lname":{"string":"Simmons"},"customer_state":{"string":"CA"}}
{"customer_id":{"int":9375},"customer_fname":{"string":"George"},"customer_lname":{"string":"Smith"},"customer_state":{"string":"CA"}}

SQOOP Import problem and solution #1

SQOOP problem : #1
Connect to MySQL database using SQOOP, import all orders
from orders table whose order_status is COMPLETE.

Data Description:
A MySQL instance is running on the localhost. In that instance, you will find an orders table that
contains orders data.

Installation : LocalHost
Database Name : retail_db
Table Name : Orders
User Name : root
Password : cloudera

Output requirement:
Place the orders files in the HDFS directory:
/user/cloudera/problem1/orders/parquetdata
Use parquet format with tab delimiter and snappy compression.

NULL values are represented as -1 for numbers and "NA" for strings.


Solution:
$ sqoop import \
--connect "jdbc:mysql://localhost/retail_db" \
--username root \
--password cloudera \
--table orders \
--target-dir /user/cloudera/problem1/orders/parquetdata \
--delete-target-dir \
--fields-terminated-by "\t" \
--where "order_status='COMPLETE'" \
--null-string "NA" \
--null-non-string -1 \
--compress \
--compression-codec snappy \
--as-parquetfile

$ hdfs dfs -ls /user/cloudera/problem1/orders/parquetdata
Found 6 items
drwxr-xr-x   - cloudera cloudera          0 2020-08-03 08:09 /user/cloudera/problem1/orders/parquetdata/.metadata
drwxr-xr-x   - cloudera cloudera          0 2020-08-03 08:11 /user/cloudera/problem1/orders/parquetdata/.signals
-rw-r--r--   1 cloudera cloudera      50233 2020-08-03 08:11 /user/cloudera/problem1/orders/parquetdata/1fa590f9-281d-4d62-9650-ddfe04d7eae2.parquet
-rw-r--r--   1 cloudera cloudera      47508 2020-08-03 08:11 /user/cloudera/problem1/orders/parquetdata/217e22ee-25b2-41a6-8bf9-f6fee4c7d0c4.parquet
-rw-r--r--   1 cloudera cloudera      46231 2020-08-03 08:11 /user/cloudera/problem1/orders/parquetdata/826a8254-9700-4edd-afdb-b8fb58b2fd49.parquet
-rw-r--r--   1 cloudera cloudera      46406 2020-08-03 08:11 /user/cloudera/problem1/orders/parquetdata/84355f27-d1a4-4c1b-ac9d-84fd5139a912.parquet



// View the parquet file content using parquet-tools (order_date appears as epoch milliseconds because Sqoop stores datetime values as longs in Parquet):

[cloudera@quickstart ~]$ parquet-tools head -n5 hdfs://localhost/user/cloudera/problem1/orders/parquetdata/1fa590f9-281d-4d62-9650-ddfe04d7eae2.parquet
order_id = 51665
order_date = 1402729200000
order_customer_id = 8645
order_status = COMPLETE

order_id = 51670
order_date = 1402729200000
order_customer_id = 9322
order_status = COMPLETE

order_id = 51671
order_date = 1402729200000
order_customer_id = 9000
order_status = COMPLETE

order_id = 51673
order_date = 1402729200000
order_customer_id = 7538
order_status = COMPLETE

order_id = 51677
order_date = 1402729200000
order_customer_id = 6321
order_status = COMPLETE

Incremental append : MySQL to HDFS

#1. In the very first phase, we are doing FULL LOAD (init version)
#2. In the 2nd phase, we are doing INCREMENTAL APPEND 
$ mysql -uroot -pcloudera -hlocalhost

mysql> use ohm;

mysql> create table person (id int, name varchar(50));

mysql> insert into person (id,name) values (100,'Raja'),(101,'Siva'),(102,'Lal');

mysql> select * from ohm.person;
+------+------+
| id   | name |
+------+------+
|  100 | Raja |
|  101 | Siva |
|  102 | Lal  |
+------+------+
3 rows in set (0.00 sec)


// Initial full load
sqoop import \
--connect jdbc:mysql://localhost:3306/ohm \
--username root \
--password cloudera \
--table person \
--target-dir user/cloudera/person_incremental \
--split-by id


hdfs dfs -ls user/cloudera/person_incremental
Found 4 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 04:49 user/cloudera/person_incremental/_SUCCESS
-rw-r--r--   1 cloudera cloudera          9 2020-08-03 04:49 user/cloudera/person_incremental/part-m-00000
-rw-r--r--   1 cloudera cloudera          9 2020-08-03 04:49 user/cloudera/person_incremental/part-m-00001
-rw-r--r--   1 cloudera cloudera          8 2020-08-03 04:49 user/cloudera/person_incremental/part-m-00002

$ hdfs dfs -cat user/cloudera/person_incremental/*
100,Raja
101,Siva
102,Lal


// Adding 4 more records to ohm.person
mysql> insert into ohm.person (id,name) values (103,'Nila'),(104,'Kalai'),(105,'Bharani'),(106,'Sara');
Query OK, 4 rows affected (0.00 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> select * from ohm.person;
+------+---------+
| id   | name    |
+------+---------+
|  100 | Raja    |  -- existing row
|  101 | Siva    |  -- existing row
|  102 | Lal     |  -- existing row
|  103 | Nila    |  -- newly inserted row
|  104 | Kalai   |  -- newly inserted row
|  105 | Bharani |  -- newly inserted row
|  106 | Sara    |  -- newly inserted row
+------+---------+



// incremental append using SQOOP import
// feeding last-value as 102 and check-column as id
sqoop import \
--connect jdbc:mysql://localhost:3306/ohm \
--username root \
--password cloudera \
--table person \
--incremental append \
--check-column id \
--last-value 102 \
--target-dir user/cloudera/person_incremental \
--split-by id


$ hdfs dfs -ls  user/cloudera/person_incremental
Found 8 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 04:49 user/cloudera/person_incremental/_SUCCESS
-rw-r--r--   1 cloudera cloudera          9 2020-08-03 04:49 user/cloudera/person_incremental/part-m-00000
-rw-r--r--   1 cloudera cloudera          9 2020-08-03 04:49 user/cloudera/person_incremental/part-m-00001
-rw-r--r--   1 cloudera cloudera          8 2020-08-03 04:49 user/cloudera/person_incremental/part-m-00002
-rw-r--r--   1 cloudera cloudera          9 2020-08-03 05:10 user/cloudera/person_incremental/part-m-00003
-rw-r--r--   1 cloudera cloudera         10 2020-08-03 05:10 user/cloudera/person_incremental/part-m-00004
-rw-r--r--   1 cloudera cloudera         12 2020-08-03 05:10 user/cloudera/person_incremental/part-m-00005
-rw-r--r--   1 cloudera cloudera          9 2020-08-03 05:10 user/cloudera/person_incremental/part-m-00006


$ hdfs dfs -cat user/cloudera/person_incremental/*
100,Raja
101,Siva
102,Lal
103,Nila  -- Newly appended row
104,Kalai  -- Newly appended row
105,Bharani  -- Newly appended row
106,Sara  -- Newly appended row

List Databases, Tables and Display the Structure of a table using SQOOP


//Display all the databases

sqoop list-databases \
--connect "jdbc:mysql://localhost:3306/" \
--username root \
--password cloudera

firehose
hue
metastore
mysql
nav
navms
ohm
oozie
retail_db
rman
sentry


//Display all the tables in a database:

sqoop list-tables \
--connect "jdbc:mysql://localhost:3306/retail_db" \
--username root \
--password cloudera


categories
customers
departments
order_items
orders
products
suppliers


// Evaluate some queries

sqoop eval \
--connect "jdbc:mysql://localhost:3306/retail_db" \
--username root \
--password cloudera \
--query "describe customers"

---------------------------------------------------------------------------------------------------------
| Field                | Type                 | Null | Key | Default              | Extra                | 
---------------------------------------------------------------------------------------------------------
| customer_id          | int(11)              | NO  | PRI | (null)               | auto_increment       | 
| customer_fname       | varchar(45)          | NO  |     | (null)               |                      | 
| customer_lname       | varchar(45)          | NO  |     | (null)               |                      | 
| customer_email       | varchar(45)          | NO  |     | (null)               |                      | 
| customer_password    | varchar(45)          | NO  |     | (null)               |                      | 
| customer_street      | varchar(255)         | NO  |     | (null)               |                      | 
| customer_city        | varchar(45)          | NO  |     | (null)               |                      | 
| customer_state       | varchar(45)          | NO  |     | (null)               |                      | 
| customer_zipcode     | varchar(45)          | NO  |     | (null)               |                      | 
---------------------------------------------------------------------------------------------------------

sqoop eval \
--connect "jdbc:mysql://localhost:3306/retail_db" \
--username root \
--password cloudera \
--query "select min(customer_id) as Minn, max(customer_id) as Maxx from retail_db.customers"

-----------------------------
| Minn        | Maxx        | 
-----------------------------
| 1           | 12435       | 
-----------------------------

NULL Handling while SQOOP import

//Handling nulls while doing SQOOP import

use retail_db;

CREATE TABLE suppliers
( id int NOT NULL,
  name char(50),
  age int
);

mysql> describe suppliers;
+-------+----------+------+-----+---------+-------+
| Field | Type     | Null | Key | Default | Extra |
+-------+----------+------+-----+---------+-------+
| id    | int(11)  | NO   |     | NULL    |       |
| name  | char(50) | YES  |     | NULL    |       |
| age   | int(11)  | YES  |     | NULL    |       |
+-------+----------+------+-----+---------+-------+

insert into retail_db.suppliers (id,name,age) values (100,NULL,NULL),(101,'Raja',NULL),(103,NULL,NULL),(104,'Manoj',NULL), (105,NULL,30), (106,NULL,NULL);
 
mysql>  select * from retail_db.suppliers;
+-----+-------+------+
| id  | name  | age  |
+-----+-------+------+
| 100 | NULL  | NULL |
| 101 | Raja  | NULL |
| 103 | NULL  | NULL |
| 104 | Manoj | NULL |
| 105 | NULL  |   30 |
| 106 | NULL  | NULL |
+-----+-------+------+

// String columns that are NULL will be replaced with 'xxxx'
// Non-string columns that are NULL will be replaced with 'yyy'

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table suppliers \
--target-dir /user/cloudera/suppliers_null_tr \
--delete-target-dir \
--null-string "xxxx" \
--null-non-string 'yyy' \
--split-by id

Retrieved 6 records.


 hdfs dfs -ls /user/cloudera/suppliers_null_tr 
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 02:02 /user/cloudera/suppliers_null_tr/_SUCCESS
-rw-r--r--   1 cloudera cloudera         28 2020-08-03 02:02 /user/cloudera/suppliers_null_tr/part-m-00000
-rw-r--r--   1 cloudera cloudera         14 2020-08-03 02:02 /user/cloudera/suppliers_null_tr/part-m-00001
-rw-r--r--   1 cloudera cloudera         15 2020-08-03 02:02 /user/cloudera/suppliers_null_tr/part-m-00002
-rw-r--r--   1 cloudera cloudera         26 2020-08-03 02:02 /user/cloudera/suppliers_null_tr/part-m-00003

$ hdfs dfs -cat /user/cloudera/suppliers_null_tr/* 
100,xxxx,yyy
101,Raja,yyy
103,xxxx,yyy
104,Manoj,yyy
105,xxxx,30
106,xxxx,yyy
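

The same tokens matter in the reverse direction: if these files are later exported back to MySQL, Sqoop has to be told which strings stand for NULL. A sketch, assuming a hypothetical compatible target table named suppliers_exported:

sqoop export \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table suppliers_exported \
--export-dir /user/cloudera/suppliers_null_tr \
--input-null-string "xxxx" \
--input-null-non-string 'yyy'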


Importing data from MySQL to HDFS using SQOOP based on conditions Where, Query, Columns options

Import data from MySQL to HDFS using SQOOP with conditional data importing

//Conditional import using Where  

sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_nm_mary \
--delete-target-dir \
--where 'customer_fname="Mary"'

Retrieved 4741 records.

hdfs dfs -ls /user/cloudera/cust_nm_mary 
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 00:23 /user/cloudera/cust_nm_mary/_SUCCESS
-rw-r--r--   1 cloudera cloudera      89617 2020-08-03 00:23 /user/cloudera/cust_nm_mary/part-m-00000
-rw-r--r--   1 cloudera cloudera      88116 2020-08-03 00:23 /user/cloudera/cust_nm_mary/part-m-00001
-rw-r--r--   1 cloudera cloudera      89590 2020-08-03 00:23 /user/cloudera/cust_nm_mary/part-m-00002
-rw-r--r--   1 cloudera cloudera      90219 2020-08-03 00:23 /user/cloudera/cust_nm_mary/part-m-00003

// Note: every record has the first name Mary, and all columns were selected
hdfs dfs -cat /user/cloudera/cust_nm_mary/part-m-00000 | head
2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble Embers Ridge,Littleton,CO,80126
4,Mary,Jones,XXXXXXXXX,XXXXXXXXX,8324 Little Common,San Marcos,CA,92069
6,Mary,Smith,XXXXXXXXX,XXXXXXXXX,3151 Sleepy Quail Promenade,Passaic,NJ,07055
9,Mary,Perez,XXXXXXXXX,XXXXXXXXX,3616 Quaking Street,Caguas,PR,00725
11,Mary,Huffman,XXXXXXXXX,XXXXXXXXX,3169 Stony Woods,Caguas,PR,00725
13,Mary,Baldwin,XXXXXXXXX,XXXXXXXXX,7922 Iron Oak Gardens,Caguas,PR,00725
17,Mary,Robinson,XXXXXXXXX,XXXXXXXXX,1325 Noble Pike,Taylor,MI,48180
20,Mary,Ellis,XXXXXXXXX,XXXXXXXXX,4703 Old Route,West New York,NJ,07093
24,Mary,Smith,XXXXXXXXX,XXXXXXXXX,9417 Emerald Towers,Caguas,PR,00725
27,Mary,Vincent,XXXXXXXXX,XXXXXXXXX,1768 Sleepy Zephyr Place,Caguas,PR,00725


// Retrieve only specific columns, along with the where condition
sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_col_sel \
--delete-target-dir \
--columns "customer_fname, customer_lname, customer_city" \
--where 'customer_fname="Mary"'


 hdfs dfs -ls /user/cloudera/cust_col_sel
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 00:43 /user/cloudera/cust_col_sel/_SUCCESS
-rw-r--r--   1 cloudera cloudera      24361 2020-08-03 00:43 /user/cloudera/cust_col_sel/part-m-00000
-rw-r--r--   1 cloudera cloudera      23816 2020-08-03 00:43 /user/cloudera/cust_col_sel/part-m-00001
-rw-r--r--   1 cloudera cloudera      24280 2020-08-03 00:43 /user/cloudera/cust_col_sel/part-m-00002
-rw-r--r--   1 cloudera cloudera      24254 2020-08-03 00:43 /user/cloudera/cust_col_sel/part-m-00003

// Retrieved first name, last name, and city only
$ hdfs dfs -cat /user/cloudera/cust_col_sel/part-m-00000 | head
Mary,Barrett,Littleton
Mary,Jones,San Marcos
Mary,Smith,Passaic
Mary,Perez,Caguas
Mary,Huffman,Caguas
Mary,Baldwin,Caguas
Mary,Robinson,Taylor
Mary,Ellis,West New York
Mary,Smith,Caguas
Mary,Vincent,Caguas


mysql> select customer_fname, customer_lname, customer_city from retail_db.customers where customer_id between 100 and 110;
+----------------+----------------+-------------------+
| customer_fname | customer_lname | customer_city     |
+----------------+----------------+-------------------+
| George         | Barrett        | Caguas            |
| Mary           | Briggs         | Dorchester Center |
| Gregory        | Smith          | Caguas            |
| Mary           | Williams       | Broomfield        |
| Debra          | Wood           | Saint Paul        |
| Mary           | Smith          | Caguas            |
| Lauren         | Freeman        | Napa              |
| Mary           | Weaver         | Caguas            |
| Joshua         | Smith          | Bronx             |
| Mary           | Thompson       | Caguas            |
| David          | Miranda        | Beaverton         |
+----------------+----------------+-------------------+

// Cannot specify --query and --table together.
// AND \$CONDITIONS must be included as part of the SELECT query.
// Retrieve data based on the select query (see the note after the output below for what $CONDITIONS becomes).
sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--target-dir /user/cloudera/cust_qry_rstl \
--delete-target-dir \
--query "select customer_id,customer_fname, customer_lname, customer_city from retail_db.customers where customer_id between 100 and 110 AND \$CONDITIONS" \
--split-by customer_id



hdfs dfs -ls /user/cloudera/cust_qry_rstl
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 01:11 /user/cloudera/cust_qry_rstl/_SUCCESS
-rw-r--r--   1 cloudera cloudera         85 2020-08-03 01:11 /user/cloudera/cust_qry_rstl/part-m-00000
-rw-r--r--   1 cloudera cloudera         77 2020-08-03 01:11 /user/cloudera/cust_qry_rstl/part-m-00001
-rw-r--r--   1 cloudera cloudera         47 2020-08-03 01:11 /user/cloudera/cust_qry_rstl/part-m-00002
-rw-r--r--   1 cloudera cloudera         76 2020-08-03 01:11 /user/cloudera/cust_qry_rstl/part-m-00003

// display the data written in HDFS
$ hdfs dfs -cat /user/cloudera/cust_qry_rstl/* 
100,George,Barrett,Caguas
101,Mary,Briggs,Dorchester Center
102,Gregory,Smith,Caguas
103,Mary,Williams,Broomfield
104,Debra,Wood,Saint Paul
105,Mary,Smith,Caguas
106,Lauren,Freeman,Napa
107,Mary,Weaver,Caguas
108,Joshua,Smith,Bronx
109,Mary,Thompson,Caguas
110,David,Miranda,Beaverton
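
For context on \$CONDITIONS: at run time each mapper replaces the token with its own split predicate over the --split-by column, so one mapper's effective query would look roughly like this (illustrative bounds, not taken from the logs):

select customer_id, customer_fname, customer_lname, customer_city
from retail_db.customers
where customer_id between 100 and 110 AND ( customer_id >= 100 ) AND ( customer_id < 103 )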

Split-by and boundary-query example in SQOOP

Example SQOOP code to work with Split-by and boundary-query:

// Split-by key : if we don't have a primary key in the table, we can go with --split-by
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table products \
--target-dir /user/cloudera/prod_slitt \
--delete-target-dir \
--split-by product_id

Retrieved 1345 records.

[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/prod_slitt/* 
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 00:02 /user/cloudera/prod_slitt/_SUCCESS
-rw-r--r--   1 cloudera cloudera      41419 2020-08-03 00:02 /user/cloudera/prod_slitt/part-m-00000
-rw-r--r--   1 cloudera cloudera      43660 2020-08-03 00:02 /user/cloudera/prod_slitt/part-m-00001
-rw-r--r--   1 cloudera cloudera      42195 2020-08-03 00:02 /user/cloudera/prod_slitt/part-m-00002
-rw-r--r--   1 cloudera cloudera      46719 2020-08-03 00:02 /user/cloudera/prod_slitt/part-m-00003

// do line count
$ hdfs dfs -cat /user/cloudera/prod_slitt/* | wc -l
1345


// split-by along with boundary-query to set the min and max split values of product_id
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table products \
--target-dir /user/cloudera/prod_slitt_v1 \
--delete-target-dir \
--boundary-query 'select min(product_id), max(product_id) from products where product_id > 100' \
--split-by product_id


BoundingValsQuery: select min(product_id), max(product_id) from products where product_id > 100
20/08/03 00:08:08 INFO db.IntegerSplitter: Split size: 311; Num splits: 4 from: 101 to: 1345
Retrieved 1245 records.
// Only 1245 of the 1345 rows came across: the splits cover product_id 101..1345,
// so the 100 rows outside those bounds are never selected by any mapper.
 
 hdfs dfs -ls /user/cloudera/prod_slitt_v1
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-03 00:09 /user/cloudera/prod_slitt_v1/_SUCCESS
-rw-r--r--   1 cloudera cloudera      37832 2020-08-03 00:08 /user/cloudera/prod_slitt_v1/part-m-00000
-rw-r--r--   1 cloudera cloudera      40905 2020-08-03 00:08 /user/cloudera/prod_slitt_v1/part-m-00001
-rw-r--r--   1 cloudera cloudera      38828 2020-08-03 00:08 /user/cloudera/prod_slitt_v1/part-m-00002
-rw-r--r--   1 cloudera cloudera      43714 2020-08-03 00:08 /user/cloudera/prod_slitt_v1/part-m-00003

// do line count
hdfs dfs -cat /user/cloudera/prod_slitt_v1/* | wc -l
1245

Sunday, 2 August 2020

Compression methods in SQOOP : Gzip, Snappy, Deflate, bzip2, lz4

Compression methods:
Gzip, Snappy, Deflate, bzip2, lz4 

//Gzip Compression
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_gzip \
--delete-target-dir \
--compress

hdfs dfs -ls /user/cloudera/cust_gzip
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 23:06 /user/cloudera/cust_gzip/_SUCCESS
-rw-r--r--   1 cloudera cloudera      63937 2020-08-02 23:06 /user/cloudera/cust_gzip/part-m-00000.gz
-rw-r--r--   1 cloudera cloudera      63708 2020-08-02 23:05 /user/cloudera/cust_gzip/part-m-00001.gz
-rw-r--r--   1 cloudera cloudera      64171 2020-08-02 23:06 /user/cloudera/cust_gzip/part-m-00002.gz
-rw-r--r--   1 cloudera cloudera      63905 2020-08-02 23:05 /user/cloudera/cust_gzip/part-m-00003.gz



//Snappy Compression
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_snappy \
--delete-target-dir \
--compress \
--compression-codec snappy

$ hdfs dfs -ls /user/cloudera/cust_snappy
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 23:07 /user/cloudera/cust_snappy/_SUCCESS
-rw-r--r--   1 cloudera cloudera     110165 2020-08-02 23:07 /user/cloudera/cust_snappy/part-m-00000.snappy
-rw-r--r--   1 cloudera cloudera     109884 2020-08-02 23:07 /user/cloudera/cust_snappy/part-m-00001.snappy
-rw-r--r--   1 cloudera cloudera     110479 2020-08-02 23:07 /user/cloudera/cust_snappy/part-m-00002.snappy
-rw-r--r--   1 cloudera cloudera     110616 2020-08-02 23:07 /user/cloudera/cust_snappy/part-m-00003.snappy



//Deflate Compression
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_deflate \
--delete-target-dir \
--compress \
--compression-codec deflate

hdfs dfs -ls /user/cloudera/cust_deflate
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 23:16 /user/cloudera/cust_deflate/_SUCCESS
-rw-r--r--   1 cloudera cloudera      63925 2020-08-02 23:16 /user/cloudera/cust_deflate/part-m-00000.deflate
-rw-r--r--   1 cloudera cloudera      63696 2020-08-02 23:16 /user/cloudera/cust_deflate/part-m-00001.deflate
-rw-r--r--   1 cloudera cloudera      64159 2020-08-02 23:16 /user/cloudera/cust_deflate/part-m-00002.deflate
-rw-r--r--   1 cloudera cloudera      63893 2020-08-02 23:16 /user/cloudera/cust_deflate/part-m-00003.deflate


//Bzip2 Compression
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_bzip \
--delete-target-dir \
--compress \
--compression-codec bzip2


hdfs dfs -ls /user/cloudera/cust_bzip
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 23:09 /user/cloudera/cust_bzip/_SUCCESS
-rw-r--r--   1 cloudera cloudera      41632 2020-08-02 23:09 /user/cloudera/cust_bzip/part-m-00000.bz2
-rw-r--r--   1 cloudera cloudera      41703 2020-08-02 23:09 /user/cloudera/cust_bzip/part-m-00001.bz2
-rw-r--r--   1 cloudera cloudera      41869 2020-08-02 23:09 /user/cloudera/cust_bzip/part-m-00002.bz2
-rw-r--r--   1 cloudera cloudera      41910 2020-08-02 23:09 /user/cloudera/cust_bzip/part-m-00003.bz2



//Lz4 Compression
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_lz4 \
--delete-target-dir \
--compress \
--compression-codec Lz4


hdfs dfs -ls /user/cloudera/cust_lz4
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 23:13 /user/cloudera/cust_lz4/_SUCCESS
-rw-r--r--   1 cloudera cloudera      97878 2020-08-02 23:12 /user/cloudera/cust_lz4/part-m-00000.lz4
-rw-r--r--   1 cloudera cloudera      97610 2020-08-02 23:12 /user/cloudera/cust_lz4/part-m-00001.lz4
-rw-r--r--   1 cloudera cloudera      98155 2020-08-02 23:13 /user/cloudera/cust_lz4/part-m-00002.lz4
-rw-r--r--   1 cloudera cloudera      98349 2020-08-02 23:12 /user/cloudera/cust_lz4/part-m-00003.lz4
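
With all five imports on disk, their total sizes can be compared side by side; bzip2 should come out smallest and the speed-oriented codecs (snappy, lz4) largest:

// compare the total size of each compressed import
hdfs dfs -du -s -h /user/cloudera/cust_gzip /user/cloudera/cust_snappy /user/cloudera/cust_deflate /user/cloudera/cust_bzip /user/cloudera/cust_lz4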


GZip   : Good compression ratio, widely supported
         .gz files are not splittable, so one large gzip file is read by a
         single mapper even though HDFS stores it in 128 MB blocks

BZip2  : Very good compression ratio and the files are splittable,
         but compression is very slow and CPU intensive

Snappy : Very fast, modest compression ratio
         raw .snappy files are not splittable, so snappy works best inside
         container formats such as Avro, SequenceFile or Parquet

Deflate: Same algorithm as gzip but without the gzip file headers
         .deflate files are not splittable and the bare format is less widely used

LZ4    : Very fast, modest compression ratio; a good choice when
         throughput matters more than file size
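
Besides the short aliases used above, --compression-codec also accepts a fully qualified Hadoop codec class name. For example, the bzip2 import could equivalently be written as:

// same bzip2 import, with the codec class name instead of the alias
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_bzip \
--delete-target-dir \
--compress \
--compression-codec org.apache.hadoop.io.compress.BZip2Codec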


Create Avro, Parquet, Sequence files using the SQOOP output

Import data from MySQL and save the output as : Avro, Parquet, Sequence files

// Save as Avro file
$ sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_avro \
--as-avrodatafile \
--delete-target-dir
 

hdfs dfs -ls /user/cloudera/cust_avro
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 22:37 /user/cloudera/cust_avro/_SUCCESS
-rw-r--r--   1 cloudera cloudera     258086 2020-08-02 22:37 /user/cloudera/cust_avro/part-m-00000.avro
-rw-r--r--   1 cloudera cloudera     257862 2020-08-02 22:37 /user/cloudera/cust_avro/part-m-00001.avro
-rw-r--r--   1 cloudera cloudera     259118 2020-08-02 22:37 /user/cloudera/cust_avro/part-m-00002.avro
-rw-r--r--   1 cloudera cloudera     260893 2020-08-02 22:37 /user/cloudera/cust_avro/part-m-00003.avro
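
To peek inside the Avro output, one option is the avro-tools utility (an assumption here: that avro-tools is installed on the node; otherwise the same subcommand can be run from the avro-tools jar). A minimal sketch:

// copy one Avro part file locally and dump its records as JSON
hdfs dfs -get /user/cloudera/cust_avro/part-m-00000.avro .
avro-tools tojson part-m-00000.avro | head -3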


//Save as parquet file
$ sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_parquet \
--as-parquetfile \
--delete-target-dir


hdfs dfs -ls /user/cloudera/cust_parquet
Found 6 items
drwxr-xr-x   - cloudera cloudera          0 2020-08-02 22:38 /user/cloudera/cust_parquet/.metadata
drwxr-xr-x   - cloudera cloudera          0 2020-08-02 22:39 /user/cloudera/cust_parquet/.signals
-rw-r--r--   1 cloudera cloudera      88944 2020-08-02 22:39 /user/cloudera/cust_parquet/0395a40c-1223-45e3-bff7-2bc061edfb5c.parquet
-rw-r--r--   1 cloudera cloudera      89047 2020-08-02 22:39 /user/cloudera/cust_parquet/10c94461-e193-49fc-9f4e-c0c3be93c7e3.parquet
-rw-r--r--   1 cloudera cloudera      88762 2020-08-02 22:39 /user/cloudera/cust_parquet/1ac81c8a-7422-402f-ade5-01173f580a1a.parquet
-rw-r--r--   1 cloudera cloudera      89163 2020-08-02 22:39 /user/cloudera/cust_parquet/505b1ef1-fdde-4e25-9acb-ead834a6d4af.parquet
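
Similarly, if the parquet-tools utility is available, it can print the schema and the first records of a Parquet file (file name taken from the listing above):

// show the schema and a few records of one Parquet part file
hdfs dfs -get /user/cloudera/cust_parquet/0395a40c-1223-45e3-bff7-2bc061edfb5c.parquet .
parquet-tools schema 0395a40c-1223-45e3-bff7-2bc061edfb5c.parquet
parquet-tools head 0395a40c-1223-45e3-bff7-2bc061edfb5c.parquet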



//Save as sequence file
$ sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table customers \
--target-dir /user/cloudera/cust_sequence \
--as-sequencefile \
--delete-target-dir


hdfs dfs -ls /user/cloudera/cust_sequence
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 22:41 /user/cloudera/cust_sequence/_SUCCESS
-rw-r--r--   1 cloudera cloudera     315998 2020-08-02 22:41 /user/cloudera/cust_sequence/part-m-00000
-rw-r--r--   1 cloudera cloudera     315711 2020-08-02 22:41 /user/cloudera/cust_sequence/part-m-00001
-rw-r--r--   1 cloudera cloudera     315814 2020-08-02 22:41 /user/cloudera/cust_sequence/part-m-00002
-rw-r--r--   1 cloudera cloudera     315633 2020-08-02 22:41 /user/cloudera/cust_sequence/part-m-00003
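
hdfs dfs -text can normally render SequenceFiles, but Sqoop writes each record using the record class it generated during the import, so that class may need to be on the classpath first. A sketch, assuming the generated customers.jar from Sqoop's compile directory has been copied to the current directory:

// make the Sqoop-generated record class visible, then dump a few records
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$(pwd)/customers.jar
hdfs dfs -text /user/cloudera/cust_sequence/part-m-00000 | head -5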

Import Data from MySQL to HDFS using SQOOP - Revisiting SQOOP Commands

// Import Data from MySQL to HDFS using SQOOP
// number of mappers not specified, so the default of 4 mappers is used
// target-dir not specified, so the data lands under the user's home directory: /user/cloudera/customers
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table customers

SELECT t.* FROM `customers` AS t LIMIT 1
Num splits: 4 from: 1 to: 12435
Retrieved 12435 records.
URL to check the log : http://quickstart.cloudera:8088/proxy/application_1596428369104_0001/
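
The "Num splits: 4 from: 1 to: 12435" line comes from Sqoop's boundary query: it asks MySQL for the minimum and maximum of the split column (the primary key, customer_id, by default) and divides that range evenly among the mappers, so each of the 4 mappers imports roughly 3109 ids. The query it issues looks like:

-- boundary query Sqoop runs before splitting the work
SELECT MIN(customer_id), MAX(customer_id) FROM customers;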

Result in hdfs:

hdfs dfs -ls /user/cloudera/customers
Found 5 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 21:42 /user/cloudera/customers/_SUCCESS
-rw-r--r--   1 cloudera cloudera     237145 2020-08-02 21:42 /user/cloudera/customers/part-m-00000
-rw-r--r--   1 cloudera cloudera     237965 2020-08-02 21:42 /user/cloudera/customers/part-m-00001
-rw-r--r--   1 cloudera cloudera     238092 2020-08-02 21:42 /user/cloudera/customers/part-m-00002
-rw-r--r--   1 cloudera cloudera     240323 2020-08-02 21:42 /user/cloudera/customers/part-m-00003

mysql> select count(1) from customers;
+----------+
| count(1) |
+----------+
|    12435 |
+----------+
1 row in set (0.02 sec)


// check the row count across the hdfs files
hdfs dfs -cat /user/cloudera/customers/*  | wc -l
12435

// get the first 10 lines
[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/customers/part-m-00000 | head
1,Richard,Hernandez,XXXXXXXXX,XXXXXXXXX,6303 Heather Plaza,Brownsville,TX,78521
2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble Embers Ridge,Littleton,CO,80126
3,Ann,Smith,XXXXXXXXX,XXXXXXXXX,3422 Blue Pioneer Bend,Caguas,PR,00725
4,Mary,Jones,XXXXXXXXX,XXXXXXXXX,8324 Little Common,San Marcos,CA,92069
5,Robert,Hudson,XXXXXXXXX,XXXXXXXXX,10 Crystal River Mall ,Caguas,PR,00725
6,Mary,Smith,XXXXXXXXX,XXXXXXXXX,3151 Sleepy Quail Promenade,Passaic,NJ,07055
7,Melissa,Wilcox,XXXXXXXXX,XXXXXXXXX,9453 High Concession,Caguas,PR,00725
8,Megan,Smith,XXXXXXXXX,XXXXXXXXX,3047 Foggy Forest Plaza,Lawrence,MA,01841
9,Mary,Perez,XXXXXXXXX,XXXXXXXXX,3616 Quaking Street,Caguas,PR,00725
10,Melissa,Smith,XXXXXXXXX,XXXXXXXXX,8598 Harvest Beacon Plaza,Stafford,VA,22554

// increase the number of mappers to 8:
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table customers \
-m 8

// We get the below error
Output directory hdfs://quickstart.cloudera:8020/user/cloudera/customers already exists

// delete the output folder
$ hdfs dfs -rm -r hdfs://quickstart.cloudera:8020/user/cloudera/customers
Deleted hdfs://quickstart.cloudera:8020/user/cloudera/customers




// re-run the import with -m 8; it now produces 8 part files
$ hdfs dfs -ls hdfs://quickstart.cloudera:8020/user/cloudera/customers 
Found 9 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 21:58 hdfs://quickstart.cloudera:8020/user/cloudera/customers/_SUCCESS
-rw-r--r--   1 cloudera cloudera     118147 2020-08-02 21:57 hdfs://quickstart.cloudera:8020/user/cloudera/customers/part-m-00000
-rw-r--r--   1 cloudera cloudera     119075 2020-08-02 21:57 hdfs://quickstart.cloudera:8020/user/cloudera/customers/part-m-00001
-rw-r--r--   1 cloudera cloudera     119057 2020-08-02 21:57 hdfs://quickstart.cloudera:8020/user/cloudera/customers/part-m-00002
-rw-r--r--   1 cloudera cloudera     118831 2020-08-02 21:57 hdfs://quickstart.cloudera:8020/user/cloudera/customers/part-m-00003
-rw-r--r--   1 cloudera cloudera     118798 2020-08-02 21:57 hdfs://quickstart.cloudera:8020/user/cloudera/customers/part-m-00004
-rw-r--r--   1 cloudera cloudera     119294 2020-08-02 21:57 hdfs://quickstart.cloudera:8020/user/cloudera/customers/part-m-00005
-rw-r--r--   1 cloudera cloudera     119892 2020-08-02 21:58 hdfs://quickstart.cloudera:8020/user/cloudera/customers/part-m-00006
-rw-r--r--   1 cloudera cloudera     120431 2020-08-02 21:58 hdfs://quickstart.cloudera:8020/user/cloudera/customers/part-m-00007

// reduce the number of mappers to 2
// --delete-target-dir removes the target directory first if it already exists
sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--table customers \
-m 2 \
--target-dir /user/cloudera/customer_new \
--delete-target-dir


// only 2 mappers
$ hdfs dfs -ls /user/cloudera/customer_new 
Found 3 items
-rw-r--r--   1 cloudera cloudera          0 2020-08-02 22:12 /user/cloudera/customer_new/_SUCCESS
-rw-r--r--   1 cloudera cloudera     475038 2020-08-02 22:12 /user/cloudera/customer_new/part-m-00000
-rw-r--r--   1 cloudera cloudera     478487 2020-08-02 22:12 /user/cloudera/customer_new/part-m-00001
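
Fewer mappers means fewer, larger part files, but the total row count is unchanged; the same cat | wc -l check used earlier should again report 12435:

// verify the two part files still hold all the rows
hdfs dfs -cat /user/cloudera/customer_new/* | wc -l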

// re-running the same sqoop command will not raise an error even though
// the target directory already exists, because --delete-target-dir removes it first

Friday, 31 July 2020

Import a table from MySQL to Hbase using SQOOP Import

// Import into Hbase
sqoop import \
--connect jdbc:mysql://localhost:3306/ohm \
--driver com.mysql.jdbc.Driver \
--username root \
--password cloudera \
--table employee \
--hbase-create-table \
--hbase-table employee_details \
--column-family employees \
--hbase-row-key id \
-m 1

// single line statement for the same
sqoop import --connect jdbc:mysql://localhost:3306/ohm --driver com.mysql.jdbc.Driver --username root --password cloudera --table employee --hbase-create-table --hbase-table employee_details --column-family employees --hbase-row-key id -m 1

[cloudera@quickstart ~]$ hbase shell

scan 'employee_details'
// displays every row as a set of column/value pairs
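
Beyond a full scan, a couple of other hbase shell commands are handy for verifying the import: fetching a single row by its row key, and counting the rows:

// fetch the row whose row key is id 1
get 'employee_details', '1'

// count the rows in the table
count 'employee_details'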

Import a table from MySQL to Hive using SQOOP Import

// Start MySQL 
$ sudo mysql -uroot -pcloudera

mysql> create database ohm;
Query OK, 1 row affected (0.00 sec)

mysql> use ohm;
Database changed


CREATE TABLE employee 
(
id INT,
first_name VARCHAR(100),
last_name VARCHAR(100),
gender VARCHAR(10),
designation VARCHAR(20),
city VARCHAR(20),
country VARCHAR(20)
);


INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (1, 'Jervis', 'Roll', 'Male', 'Director of Sales', 'Thi Tran Lac', 'Vietnam');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (2, 'Gordon', 'Maltster', 'Male', 'Marketing Manager', 'Mabu', 'China');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (3, 'Griff', 'Godsafe', 'Male', 'Actuary', 'Kipit', 'Philippines');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (4, 'Gracie', 'Franken', 'Female', 'Assistant Manager', 'Xiabuji', 'China');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (5, 'Joelly', 'Wellbank', 'Female', 'Account Coordinator', 'Whitehorse', 'Canada');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (6, 'Bab', 'Havock', 'Female', 'Accountant II', 'Basyūn', 'Egypt');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (7, 'Carmine', 'Courage', 'Female', 'Account Coordinator', 'Boyeros', 'Cuba');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (8, 'Estella', 'Marvell', 'Female', 'Structural Analysis Engineer', 'Stettler', 'Canada');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (9, 'Celie', 'Trevaskiss', 'Female', 'Assistant Manager', 'Criuleni', 'Moldova');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (10, 'Madison', 'Ranyell', 'Male', 'Research Associate', 'Angatel', 'Philippines');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (11, 'Haydon', 'Faughey', 'Male', 'Safety Technician IV', 'Masalovka', 'Russia');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (12, 'Michele', 'Zarfai', 'Male', 'Legal Assistant', 'Karatau', 'Kazakhstan');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (13, 'Ruthi', 'Bowmer', 'Female', 'Analog Circuit Design manager', 'Peski', 'Russia');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (14, 'Adolphus', 'Pickthorne', 'Male', 'Senior Developer', 'Mae Fa Luang', 'Thailand');
INSERT INTO employee (id, first_name, last_name, gender, designation, city, country) VALUES (15, 'Kat', 'Dymocke', 'Female', 'Geological Engineer', 'Markópoulo Oropoú', 'Greece');


// Display databases
$ sqoop list-databases --connect jdbc:mysql://localhost --username root --password cloudera

information_schema
cm
firehose
hue
metastore
mysql
nav
navms
ohm
oozie
retail_db
rman

// Display tables
$ sqoop list-tables --connect jdbc:mysql://localhost/ohm --username root --password cloudera
employee
person

// Create a database in Hive
hive> create database ohm;
OK
Time taken: 2.041 seconds
hive> show tables;
OK
Time taken: 0.244 seconds


// Import a table from Mysql To Hive
// Import a table from MySQL to Hive
sqoop import \
--connect jdbc:mysql://localhost:3306/ohm \
--driver com.mysql.jdbc.Driver \
--username root \
--password cloudera \
--table employee \
--hive-import \
--split-by id \
--hive-table ohm.employee
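
A plain hive-import stores the table as text files under the Hive warehouse directory; the exact HDFS location can be confirmed from Hive itself:

// show the table's HDFS location along with other metadata
hive> describe formatted ohm.employee;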

 
hive> use ohm;
OK
Time taken: 0.017 seconds
hive> show tables;
OK
employee
Time taken: 0.029 seconds, Fetched: 1 row(s)
hive> select * from employee;
OK
1 Jervis Roll Male Director of Sales Thi Tran Lac Vietnam
2 Gordon Maltster Male Marketing Manager Mabu China
3 Griff Godsafe Male Actuary Kipit Philippines
4 Gracie Franken Female Assistant Manager Xiabuji China
5 Joelly Wellbank Female Account Coordinator Whitehorse Canada
6 Bab Havock Female Accountant II Basyūn Egypt
10 Madison Ranyell Male Research Associate Angatel Philippines
11 Haydon Faughey Male Safety Technician IV Masalovka Russia
12 Michele Zarfai Male Legal Assistant Karatau Kazakhstan
13 Ruthi Bowmer Female Analog Circuit Desig Peski Russia
14 Adolphus Pickthorne Male Senior Developer Mae Fa Luang Thailand
15 Kat Dymocke Female Geological Engineer Markópoulo Oropoú Greece
Time taken: 0.92 seconds, Fetched: 12 row(s)
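
Note that the designation in row 13 is cut off ("Analog Circuit Desig"): the value was truncated to 20 characters by MySQL on insert because the column was declared VARCHAR(20); Sqoop and Hive simply carried the truncated value over. If the full text is needed, widen the column before loading, e.g.:

-- widen the designation column, then re-insert and re-import
ALTER TABLE employee MODIFY designation VARCHAR(50);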


// Verify the record count in MySQL, HDFS and Hive
mysql> use ohm;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select count(1) from employee;
+----------+
| count(1) |
+----------+
|       12 |
+----------+
1 row in set (0.00 sec)



// record count across all the files in the Hive warehouse directory
hdfs dfs -cat /user/hive/warehouse/ohm.db/employee/* | wc -l
12

// record count from the Hive table
hive> select count(1) from employee;
OK
12

