Friday, 7 August 2020
Partitioned table creation in Hive
// How to create a partitioned table in Hive?
// This is the source table, which is not partitioned
CREATE EXTERNAL TABLE orders
(
order_id INT,
order_date BIGINT,
order_customer_id INT,
order_status STRING
)
row format delimited
fields terminated by ","
location '/user/cloudera/orders';
hive> describe orders;
OK
order_id int
order_date bigint
order_customer_id int
order_status string
Time taken: 0.051 seconds, Fetched: 4 row(s)
Location: hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db/orders
// find the unique statuses
hive> select distinct order_status from orders;
CANCELED
CLOSED
COMPLETE
ON_HOLD
PAYMENT_REVIEW
PENDING
PENDING_PAYMENT
PROCESSING
SUSPECTED_FRAUD
// set the properties
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
// Create the partitioned table based on status. Do not include the status column in the regular column list; it is declared as the partition column instead.
create external table orders_partitioned
(
order_id int,
order_date bigint,
order_customer_id int
)
partitioned by (status string)
row format delimited
fields terminated by ",";
# Partition Information
# col_name data_type comment
status string
Location: hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db/orders_partitioned
Table Parameters:
EXTERNAL TRUE
numPartitions 0 ---> no data has been loaded yet, so numPartitions is zero
transient_lastDdlTime 1596822311
// Populate the partitions with data taken from the source table
insert overwrite table orders_partitioned
partition (status)
select order_id, order_date,order_customer_id, order_status from orders;
Loading partition {status=PENDING}
Loading partition {status=COMPLETE}
Loading partition {status=CANCELED}
Loading partition {status=CLOSED}
Loading partition {status=PROCESSING}
Loading partition {status=PAYMENT_REVIEW}
Loading partition {status=SUSPECTED_FRAUD}
Loading partition {status=ON_HOLD}
Loading partition {status=PENDING_PAYMENT}
Table Parameters:
EXTERNAL TRUE
numPartitions 9 -- the data from the source table has been loaded, so numPartitions is now 9
transient_lastDdlTime 1596822311
Location: hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db/orders_partitioned
// Find the files in HDFS after populating the data:
hdfs dfs -ls hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db/orders_partitioned
Found 9 items
drwxrwxrwx - cloudera supergroup 0 2020-08-07 10:49 hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db/orders_partitioned/status=CANCELED
drwxrwxrwx - cloudera supergroup 0 2020-08-07 10:49 hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db/orders_partitioned/status=CLOSED
drwxrwxrwx - cloudera supergroup 0 2020-08-07 10:49 hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db/orders_partitioned/status=COMPLETE
drwxrwxrwx - cloudera supergroup 0 2020-08-07 10:49 hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db/orders_partitioned/status=ON_HOLD
drwxrwxrwx - cloudera supergroup 0 2020-08-07 10:49 hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db/orders_partitioned/status=PAYMENT_REVIEW
drwxrwxrwx - cloudera supergroup 0 2020-08-07 10:49 hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db/orders_partitioned/status=PENDING
drwxrwxrwx - cloudera supergroup 0 2020-08-07 10:49 hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db/orders_partitioned/status=PENDING_PAYMENT
drwxrwxrwx - cloudera supergroup 0 2020-08-07 10:49 hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db/orders_partitioned/status=PROCESSING
drwxrwxrwx - cloudera supergroup 0 2020-08-07 10:49 hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db/orders_partitioned/status=SUSPECTED_FRAUD
hdfs dfs -ls hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db/orders_partitioned/status=COMPLETE
Found 1 items
-rwxrwxrwx 1 cloudera supergroup 571301 2020-08-07 10:49 hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db/orders_partitioned/status=COMPLETE/000000_0
$ hdfs dfs -cat hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db/orders_partitioned/status=COMPLETE/000000_0 | head
3,1374735600000,12111
5,1374735600000,11318
6,1374735600000,7130
7,1374735600000,4530
15,1374735600000,2568
17,1374735600000,2667
22,1374735600000,333
26,1374735600000,7562
28,1374735600000,656
32,1374735600000,3960
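Notice in the `cat` output above that the data file holds only the three non-partition columns; the status value lives solely in the directory name. A minimal Python sketch of that layout, using toy data (this models the idea, not Hive's actual implementation):

```python
from collections import defaultdict

# Toy rows mirroring the source table: (order_id, order_date, customer_id, status).
orders = [
    (3, 1374735600000, 12111, "COMPLETE"),
    (5, 1374735600000, 11318, "COMPLETE"),
    (11, 1374735600000, 918, "PENDING"),
]

# Route each row into a "directory" named after its partition value, and drop
# the partition column from the stored row, as the cat output above shows.
partitions = defaultdict(list)
for order_id, order_date, customer_id, status in orders:
    partitions[f"status={status}"].append((order_id, order_date, customer_id))

for path, rows in sorted(partitions.items()):
    print(path, rows)
# status=COMPLETE [(3, 1374735600000, 12111), (5, 1374735600000, 11318)]
# status=PENDING [(11, 1374735600000, 918)]
```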
// Performance Analysis
// Find the line count
$ hdfs dfs -cat hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db/orders_partitioned/status=COMPLETE/000000_0 | wc -l
22899
// Find the row count from the source table, which is not partitioned
hive> select count(1) from orders where order_status ='COMPLETE';
OK
22899
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.95 sec HDFS Read: 45693 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 950 msec ---> ~6 seconds
// Find the row count from the partitioned table; filtering on the partition column lets Hive read only that partition's directory
hive> select count(1) from orders_partitioned where status ='COMPLETE';
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.38 sec HDFS Read: 579428 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 380 msec ----> ~3.4 seconds
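The improvement comes from partition pruning: a filter on the partition column only has to read the matching directory, while the non-partitioned table must examine every row. A toy Python sketch of the idea (an assumed simplified model, not Hive's actual query planner):

```python
# Toy partition map: directory name -> rows stored in that directory.
partitions = {
    "status=COMPLETE": [(3,), (5,), (6,)],
    "status=PENDING":  [(11,), (12,)],
    "status=CLOSED":   [(20,)],
}

def count_full_scan(parts, wanted):
    # Non-partitioned layout: every row is read, then filtered.
    rows_read = sum(len(rows) for rows in parts.values())
    matched = len(parts[f"status={wanted}"])
    return matched, rows_read

def count_pruned(parts, wanted):
    # Partitioned layout: only the one matching directory is read.
    rows = parts[f"status={wanted}"]
    return len(rows), len(rows)

print(count_full_scan(partitions, "COMPLETE"))  # (3, 6): 3 matches, 6 rows read
print(count_pruned(partitions, "COMPLETE"))     # (3, 3): 3 matches, 3 rows read
```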
hive> show partitions ohm.orders_partitioned;
OK
status=CANCELED
status=CLOSED
status=COMPLETE
status=ON_HOLD
status=PAYMENT_REVIEW
status=PENDING
status=PENDING_PAYMENT
status=PROCESSING
status=SUSPECTED_FRAUD
Time taken: 0.113 seconds, Fetched: 9 row(s)
String and Date Functions in Hive
// Here we are going to play with String and Date Functions in Hive
hive> create table cust as select customer_fname as firstname, customer_lname as lastname, customer_city as city from ohm.customers;
hive> select * from cust limit 10;
OK
Richard Hernandez Brownsville
Mary Barrett Littleton
Ann Smith Caguas
Mary Jones San Marcos
Robert Hudson Caguas
Mary Smith Passaic
Melissa Wilcox Caguas
Megan Smith Lawrence
Mary Perez Caguas
Melissa Smith Stafford
hive> describe cust;
OK
firstname string
lastname string
city string
Time taken: 0.041 seconds, Fetched: 3 row(s)
// If the CTAS had kept the original column name (customer_city), it could be renamed:
hive> alter table cust change customer_city city string;
OK
Time taken: 0.308 seconds
// Concat function
hive> select concat(firstname , "-", lastname, "-" , city ) from cust limit 2;
OK
Richard-Hernandez-Brownsville
Mary-Barrett-Littleton
hive> select concat(firstname , "|", lastname, "|" , city ) from cust limit 2;
OK
Richard|Hernandez|Brownsville
Mary|Barrett|Littleton
hive> select concat(firstname , "\t", lastname, "\t" , city ) from cust limit 2;
OK
Richard Hernandez Brownsville
Mary Barrett Littleton
hive> select concat_ws("\t",firstname,lastname,city ) from cust limit 2;
OK
Richard Hernandez Brownsville
Mary Barrett Littleton
Time taken: 0.138 seconds, Fetched: 2 row(s)
hive> select concat_ws("$",firstname,lastname,city ) from cust limit 2;
OK
Richard$Hernandez$Brownsville
Mary$Barrett$Littleton
Time taken: 0.057 seconds, Fetched: 2 row(s)
hive> select concat_ws("|",firstname,lastname,city ) from cust limit 2;
OK
Richard|Hernandez|Brownsville
Mary|Barrett|Littleton
Time taken: 0.04 seconds, Fetched: 2 row(s)
hive> select concat(firstname , "\t", lastname, "\t" , city ) from cust limit 10;
OK
Richard Hernandez Brownsville
Mary Barrett Littleton
Ann Smith Caguas
Mary Jones San Marcos
Robert Hudson Caguas
Mary Smith Passaic
Melissa Wilcox Caguas
Megan Smith Lawrence
Mary Perez Caguas
Melissa Smith Stafford
hive> select concat(firstname , "|", lastname, "|" , city ) from cust limit 10;
OK
Richard|Hernandez|Brownsville
Mary|Barrett|Littleton
Ann|Smith|Caguas
Mary|Jones|San Marcos
Robert|Hudson|Caguas
Mary|Smith|Passaic
Melissa|Wilcox|Caguas
Megan|Smith|Lawrence
Mary|Perez|Caguas
Melissa|Smith|Stafford
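One practical difference between the two functions: `concat` returns NULL when any argument is NULL, while `concat_ws` simply skips NULL arguments. A small Python sketch of that behaviour (my model of the semantics, hedged accordingly):

```python
def hive_concat(*args):
    # concat propagates NULL: any NULL argument makes the result NULL.
    if any(a is None for a in args):
        return None
    return "".join(args)

def hive_concat_ws(sep, *args):
    # concat_ws skips NULL arguments and joins the rest with the separator.
    return sep.join(a for a in args if a is not None)

print(hive_concat("Richard", "|", None))                     # None
print(hive_concat_ws("|", "Richard", None, "Brownsville"))   # Richard|Brownsville
```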
hive> select length(firstname) as Lenn , firstname from cust where length(firstname) > 8 limit 5;
OK
11 Christopher
9 Katherine
9 Stephanie
9 Katherine
9 Alexander
hive> select length("Raja Ram Mohan Rai"), "Raja Ram Mohan Rai";
OK
18 Raja Ram Mohan Rai
hive> select concat (length("S/o. Mahalakshmi") ,". ", "S/o. Mahalakshmi");
OK
16. S/o. Mahalakshmi
hive> select concat_ws(":","Kadavul","Amaithu","Vaitha","Medai");
OK
Kadavul:Amaithu:Vaitha:Medai
hive> select concat_ws("|","Kadavul","Amaithu","Vaitha","Medai");
OK
Kadavul|Amaithu|Vaitha|Medai
hive> select concat(firstname , " " , lastname) from cust where length(firstname) > 10 limit 5;
OK
Christopher Smith
Christopher Prince
Christopher Smith
Christopher Curtis
Christopher Johnson
Time taken: 0.044 seconds, Fetched: 5 row(s)
hive> select concat(firstname , " " , lastname) from cust where length(concat(firstname , " " , lastname)) > 20 limit 5;
OK
Christopher Richardson
Christopher Villarreal
Christopher Alexander
Time taken: 0.094 seconds, Fetched: 3 row(s)
// Case conversion
hive> select UPPER(firstname), LOWER(lastname), upper(city) from cust limit 2;
OK
RICHARD hernandez BROWNSVILLE
MARY barrett LITTLETON
Time taken: 0.343 seconds, Fetched: 2 row(s)
// Padding: left and right
hive> select lpad(firstname,12,"*"), rpad(lastname,12,"*") from cust limit 2;
OK
*****Richard Hernandez***
********Mary Barrett*****
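`lpad` and `rpad` pad out to the target length and truncate when the input is already longer. Rough Python equivalents, assuming a single-character pad string (a sketch, not Hive's implementation):

```python
def lpad(s, n, pad):
    # Pad on the left to length n; truncate to n if s is already longer.
    return (pad * n + s)[-n:] if len(s) < n else s[:n]

def rpad(s, n, pad):
    # Pad on the right to length n; slicing also handles the truncation case.
    return (s + pad * n)[:n]

print(lpad("Richard", 12, "*"))    # *****Richard
print(rpad("Hernandez", 12, "*"))  # Hernandez***
```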
hive> select upper("I love India"), lower("I love India");
OK
I LOVE INDIA i love india
Time taken: 0.031 seconds, Fetched: 1 row(s)
// Split function
hive> select split("Arun Raja Kama Raj"," ");
OK
["Arun","Raja","Kama","Raj"]
hive> select substr("Raja Raja Chozhan",11);
OK
Chozhan
hive> select substr(firstname,1,3), " ", firstname from cust limit 5;
OK
Ric Richard
Mar Mary
Ann Ann
Mar Mary
Rob Robert
hive> select regexp_replace("Raja Raja Chozhan","Raja","King");
OK
King King Chozhan
hive> select firstname ," ", regexp_replace(firstname,"Mary","Laila"), " " from cust limit 5;
OK
Richard Richard
Mary Laila
Ann Ann
Mary Laila
Robert Robert
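The three functions above map closely onto Python's standard library; the notable differences are that Hive's `substr` is 1-based (Python slicing is 0-based) and that both `split` and `regexp_replace` take regular expressions. A sketch of the equivalents:

```python
import re

def split(s, pattern):
    # Hive's split treats the separator as a regex.
    return re.split(pattern, s)

def substr(s, start, length=None):
    i = start - 1  # convert Hive's 1-based position to a 0-based index
    return s[i:] if length is None else s[i:i + length]

def regexp_replace(s, pattern, replacement):
    return re.sub(pattern, replacement, s)

print(split("Arun Raja Kama Raj", " "))                      # ['Arun', 'Raja', 'Kama', 'Raj']
print(substr("Raja Raja Chozhan", 11))                       # Chozhan
print(regexp_replace("Raja Raja Chozhan", "Raja", "King"))   # King King Chozhan
```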
hive> select current_date(), year(current_date()), month(current_date()) , date(current_date()), day(current_date()) ;
OK
2020-08-07 2020 8 2020-08-07 7
Time taken: 0.04 seconds, Fetched: 1 row(s)
hive> select current_timestamp(), hour(current_timestamp()), minute(current_timestamp()), second(current_timestamp());
OK
2020-08-07 08:32:41.383 8 32 41
hive> select datediff(current_date(), "1976-04-20");
OK
16180
hive> select datediff(current_date(), "2020-01-01");
OK
219
hive> select date_add(current_date(),100), date_sub(current_date(),219);
OK
2020-11-15 2020-01-01
hive> select months_between(current_date(),"2020-01-01");
OK
7.19354839
hive> select add_months("2020-01-01",7);
OK
2020-08-01
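The date arithmetic above can be reproduced with the standard library. The `months_between` fraction appears to be whole months plus the remaining days over 31, which matches the 7.19354839 result; that formula is an assumption on my part, as is the end-of-month clamping in `add_months`:

```python
from datetime import date
import calendar

def datediff(end, start):
    # Difference in whole days, as in Hive's datediff(end, start).
    return (end - start).days

def add_months(d, n):
    # Shift by n months, clamping the day to the target month's length.
    m = d.month - 1 + n
    year, month = d.year + m // 12, m % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def months_between(end, start):
    # Assumed formula: whole months plus a 31-day fraction of the day difference.
    whole = (end.year - start.year) * 12 + (end.month - start.month)
    return round(whole + (end.day - start.day) / 31, 8)

print(datediff(date(2020, 8, 7), date(2020, 1, 1)))        # 219
print(add_months(date(2020, 1, 1), 7))                     # 2020-08-01
print(months_between(date(2020, 8, 7), date(2020, 1, 1)))  # 7.19354839
```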
hive> select current_date(), current_timestamp();
OK
2020-08-07 2020-08-07 08:38:28.5
hive> select current_date(), current_timestamp(), unix_timestamp();
unix_timestamp(void) is deprecated. Use current_timestamp instead.
OK
2020-08-07 2020-08-07 08:38:53.775 1596814733
hive> select unix_timestamp(current_timestamp);
OK
1596814766
Database, Table commands in Hive
// Create a new database in Hive
hive> create database retail_db;
// create a new table with the data taken from some other table of some other database
hive> create table retail_db.customer_temp as select customer_id as id, customer_fname as fname, customer_lname as lname from ohm.customers;
hive> use retail_db;
OK
Time taken: 0.014 seconds
hive> show tables;
OK
customer_temp
Time taken: 0.01 seconds, Fetched: 1 row(s)
hive> describe customer_temp;
OK
id int
fname string
lname string
Time taken: 0.05 seconds, Fetched: 3 row(s)
hive> describe formatted customer_temp;
OK
# col_name data_type comment
id int
fname string
lname string
# Detailed Table Information
Database: retail_db
Owner: cloudera
CreateTime: Fri Aug 07 05:44:14 PDT 2020
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://quickstart.cloudera:8020/user/hive/warehouse/retail_db.db/customer_temp
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
numFiles 1
numRows 12435
rawDataSize 211716
totalSize 224151
transient_lastDdlTime 1596804254
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.066 seconds, Fetched: 33 row(s)
// Renaming a database is not supported in Hive (though a workaround exists)
ALTER DATABASE retail_db RENAME TO retail_new; // not possible in CDH 5.13
// Rename the table
hive> alter table customer_temp rename to cust;
OK
Time taken: 0.114 seconds
hive> show tables;
OK
cust
Time taken: 0.018 seconds, Fetched: 1 row(s)
hive> describe cust;
OK
id int
fname string
lname string
Time taken: 0.057 seconds, Fetched: 3 row(s)
// How to rename the column?
hive> alter table cust change id custid int;
OK
Time taken: 0.128 seconds
hive> describe cust;
OK
custid int
fname string
lname string
Time taken: 0.051 seconds, Fetched: 3 row(s)
// How to replace column names for set of columns?
hive> alter table cust replace columns (id int, firstname string, lastname string);
OK
Time taken: 0.088 seconds
hive> describe cust;
OK
id int
firstname string
lastname string
Time taken: 0.048 seconds, Fetched: 3 row(s)
// How to add new columns to the existing table?
hive> alter table cust add columns (age int, email string);
OK
Time taken: 0.069 seconds
hive> describe cust;
OK
id int
firstname string
lastname string
age int -------> New column
email string -------> New column
Time taken: 0.047 seconds, Fetched: 5 row(s)
// Newly added columns contain NULL values
hive> select * from cust limit 10;
1 Richard Hernandez NULL NULL
2 Mary Barrett NULL NULL
3 Ann Smith NULL NULL
4 Mary Jones NULL NULL
5 Robert Hudson NULL NULL
6 Mary Smith NULL NULL
7 Melissa Wilcox NULL NULL
8 Megan Smith NULL NULL
9 Mary Perez NULL NULL
10 Melissa Smith NULL NULL
Time taken: 0.063 seconds, Fetched: 10 row(s)
// How to change the file format of existing table?
hive> alter table cust set fileformat parquet;
OK
Time taken: 0.089 seconds
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
hive> alter table cust set fileformat avro;
OK
Time taken: 0.108 seconds
hive> describe formatted cust;
SerDe Library: org.apache.hadoop.hive.serde2.avro.AvroSerDe
InputFormat: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
Location: hdfs://quickstart.cloudera:8020/user/hive/warehouse/retail_db.db/cust
// How to change the location of a table?
hive> alter table cust set location '/user/cloudera/customer';
OK
hive> describe formatted cust;
Location: hdfs://quickstart.cloudera:8020/user/cloudera/customer
Import all the tables from MySQL to Hive with Snappy compression and Parquet file creation using Sqoop
Here we are going to import all the tables of the retail_db database of MySQL into Hive.
The additional options used are Snappy compression and saving as Parquet files.
// Delete the database in Hive
hive> drop database ohm cascade;
OK
Time taken: 1.187 seconds
hive> create database ohm;
OK
sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password cloudera \
--warehouse-dir /user/hive/warehouse \
-m 1 \
--hive-database ohm \
--hive-import \
--hive-overwrite \
--create-hive-table \
--compress \
--compression-codec snappy \
--as-parquetfile \
--outdir java_out ;
hive> use ohm;
OK
Time taken: 0.01 seconds
hive> show tables;
OK
categories
customers
departments
order_items
orders
products
Time taken: 0.029 seconds, Fetched: 6 row(s)
hive> describe database ohm;
OK
ohm hdfs://quickstart.cloudera:8020/user/hive/warehouse/ohm.db cloudera USER
Time taken: 0.011 seconds, Fetched: 1 row(s)