Monday, 10 December 2018

Hive SubQueries and Views

subqueries, views:
subquery : query within query
views : virtual table

set operations : Union - combine the results of multiple queries together
same column data types, same column names

Union removes all duplicate records.

Union All - preserves duplicates
, Minus*, Intersect* : not supported


load data local inpath 'products.csv' overwrite into table products;

select * from products;
select * from freshproducts;


select * from products
union
select * from freshproducts

Columns of the 2 result sets need to match exactly - same column names, same data types.


select * from products;
union
select id,title,cost from freshproducts

insert into freshproducts values('iphone7','iPhone 7',950,1234567890);

select * from products;
union all
select id,title,cost from freshproducts

Minus, intersects are currently not supported in Hive


Subqueries:
result set nested within other queries
queries are run on results in any tabular format

Tables created using the create table command, result set of a select statement

part of FROM, WHERE clause

FROM:
select * from  (
   select id as product_id from products
   union
   select id as product_id from freshproducts
   ) t;
 
 
IN, NOT IN, EXISTS, NOT EXISTS  ::: where clause


A typical query with an IN statement
A single list of values to check whether they match with the column values.
subquery should return a list of column values
subquery should not reference the parent query  ===> uncorelated

select id from customers where id in (1,2,3,4);

while doing subquery, please use alias names for the tables involved to resolve the errors.
==> select c.name from customers c where c.id in (select o.customer_id from orders o);
==> select c.name from customers c where c.id not in (select o.customer_id from orders o);

==> select c.id from customers c where c.id in (select o.customer_id from orders o);


EXISTS:

subquery should be correlated - reference the parent query

select c.id from customers c where exists (select o.customer_id from orders o where o.customer_id = c.id);

correlation must exists : checks if there is at least one record in this table



CREATE  a table and insert records using subquery immediately

=> create table allproducts as select brand,title as name, cost from products;

Views:
virtual table with a subset of data from a larger, more extensive table.
Stored as query in Hive's metastore.
Executed when used. (on demand)
Updated when data in the underlying data in physical table changes.
Containts data from single / multiple tables.
Frozen in time, not affected by table changes.

=> describe customers;
=> describe orders;

create view customer_purchases as select o.customer_id,o.product_id,c.address from customers c join orders o where c.id = o.customer_id;

=> show tables will display : customer_purchases view as table;
=> describe customer_purchases;

=> describe formatted customer_purchases;

=> select * from customer_purchases;


Views:
Advantages:
Reduce query complexity
Restrict access to sensive data
Hide sensitive data from other departments (Marketing, Revenue, Accounting)

Construct different logical tables from the same physical table.

Create multiple views of the same table
Alter a view

=> describe customer_purchases;
=> create view product_purchases as select product_id, quantity from orders;

select * from product_purchases;

Make changes in existing view:
alter view product_purchases as select product_id, quantity,amount from orders;

select * from product_purchases;

No comments:

Post a Comment

Flume - Simple Demo

// create a folder in hdfs : $ hdfs dfs -mkdir /user/flumeExa // Create a shell script which generates : Hadoop in real world <n>...