Monday, 10 December 2018

Pig Notes


a = LOAD '/home/cloudera/women.txt' using PigStorage (',')
To display the content of women.txt
cat /home/cloudera/women.txt
001,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad
002,Nandhini,Babu,28,Assistant Manager,Delhi
003,Madhuri,Nathan,51,VP,999333323,Hyderabad
004,Kavitha,Manoharan,45,AVP,992342344,Hyderabad
005,Vijaya,Kandhasamy,45,AVP,23452344,Noida
006,Aarthi,Raj,28,Assistant Manager,99234234,Chennai
007,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai
008,Meena,Baskar,51,VP,2342344444,Hyderabad
009,Gayathri,Raghu,22,Engineer,323423444,Chennai
010,Thenmozhi,Rajan,45,AVP,234234234,Noida


To run pig in local mode
pig -x local
grunt>

To run  pig in MapReduce mode:
pig
grunt>

To exit from pig grunt shell
grunt> quit


LOAD
FILTER
GROUP
AGGREGATE
DUMP
STORE


To load a local file into pig without giving schema info:
a = LOAD '/home/cloudera/women.txt' using PigStorage(',');

Display the output on screen:
DUMP a;
(001,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad)
(002,Nandhini,Babu,28,Assistant Manager,Delhi)
(003,Madhuri,Nathan,51,VP,999333323,Hyderabad)
(004,Kavitha,Manoharan,45,AVP,992342344,Hyderabad)
(005,Vijaya,Kandhasamy,45,AVP,23452344,Noida)
(006,Aarthi,Raj,28,Assistant Manager,99234234,Chennai)
(007,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai)
(008,Meena,Baskar,51,VP,2342344444,Hyderabad)
(009,Gayathri,Raghu,22,Engineer,323423444,Chennai)
(010,Thenmozhi,Rajan,45,AVP,234234234,Noida)

To load a local file into pig with  schema info:
grunt> a = LOAD '/home/cloudera/women.txt' using PigStorage(',')  AS (id:int,fname:chararray,lname:chararray,age:int,design:chararray,mobile:chararray,city:chararray);


Create a directory in hdfs:
hdfs dfs -mkdir /user/cloudera/pig

Renaming existing folder:
hdfs dfs -mv /user/cloudera/pig /user/cloudera/pigexa


Copy women.txt (local file) into hdfs:
hdfs dfs -put /home/cloudera/women.txt /user/cloudera/pigexa/women.txt

Display the content of a file stored in hdfs:
hdfs dfs -cat /user/cloudera/pigexa/women.txt

Load file into pig with schema in MapReduce Mode:
a = load '/user/cloudera/pigexa/women.txt' using PigStorage(',') AS (id:int,fname:chararray,lname:chararray,age:int,design:chararray,mobile:chararray,city:chararray);


To show the schema:
describe a;

a: {id: int,fname: chararray,lname: chararray,age: int,design: chararray,mobile: chararray,city: chararray}

To see the result of a relation:
dump a;

To store the result into a hdfs file:
store a into '/user/cloudera/pigexa/output.txt'

quit

create a script file in local linux:
gedit my1st.pig

a = load '/user/cloudera/pigexa/women.txt' using PigStorage(',') AS (id:int,fname:chararray,lname:chararray,age:int,design:chararray,mobile:chararray,city:chararray);
dump a;
store a into '/user/cloudera/pigexa/output.txt';
save


delete existing my1st.pig which is stored in hdfs:
hdfs dfs -rm /user/cloudera/pigexa/my1st.pig


copy my1st.pig script into hdfs:
hdfs dfs -put /home/cloudera/my1st.pig /user/cloudera/pigexa/my1st.pig


Run a pig script within pig (Grunt shell)
grunt> exec my1st.pig

Run a pig script file in local linux :
$ pig my1st.pig


grunt> ls
hdfs://quickstart.cloudera:8020/user/cloudera/pigexa <dir>
grunt> cd pigexa
grunt> ls
hdfs://quickstart.cloudera:8020/user/cloudera/pigexa/my1st.pig<r 1> 175
hdfs://quickstart.cloudera:8020/user/cloudera/pigexa/output.txt <dir>
hdfs://quickstart.cloudera:8020/user/cloudera/pigexa/women.txt<r 1> 479
grunt> cat output.txt
1 Nithya Duraisamy 31 Manager 9586452156 Hyderabad
2 Nandhini Babu 28 Assistant Manager Delhi
3 Madhuri Nathan 51 VP 999333323 Hyderabad
4 Kavitha Manoharan 45 AVP 992342344 Hyderabad
5 Vijaya Kandhasamy 45 AVP 23452344 Noida
6 Aarthi Raj 28 Assistant Manager 99234234 Chennai
7 Lavanya Sankar 23 Senior Engineer 9923444 Chennai
8 Meena Baskar 51 VP 2342344444 Hyderabad
9 Gayathri Raghu 22 Engineer 323423444 Chennai
10 Thenmozhi Rajan 45 AVP 234234234 Noida
grunt>




a = load '/user/cloudera/pigexa/women.txt' using PigStorage(',') as (id:int,fname:chararray,lname:chararray,age:int,design:chararray,mobile:chararray,city:chararray);
b = load '/user/cloudera/pigexa/employees.txt' using PigStorage(',') as (id:int,fname:chararray,lname:chararray,age:int,mobile:chararray,city:chararray);
dump a;
dump b;
grunt> illustrate a;

--------------------------------------------------------------------------------------------------------------------------------------------
| a     | id:int    | fname:chararray    | lname:chararray    | age:int    | design:chararray    | mobile:chararray    | city:chararray    |
--------------------------------------------------------------------------------------------------------------------------------------------
|       | 007       | Lavanya            | Sankar             | 23         | Senior Engineer     | 9923444             | Chennai           |
--------------------------------------------------------------------------------------------------------------------------------------------


illustrate b;
----------------------------------------------------------------------------------------------------------------------------
| b     | id:int     | fname:chararray     | lname:chararray     | age:int     | mobile:chararray     | city:chararray     |
----------------------------------------------------------------------------------------------------------------------------
|       | 003        | Rajesh              | Rakul               | 22          | 999333               | Kolkatta           |
----------------------------------------------------------------------------------------------------------------------------

b Relation doesn't have designation.

i = GROUP a by age;
dump i;

(22,{(9,Gayathri,Raghu,22,Engineer,323423444,Chennai)})
(23,{(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai)})
(28,{(6,Aarthi,Raj,28,Assistant Manager,99234234,Chennai),(2,Nandhini,Babu,28,Assistant Manager,Delhi,)})
(31,{(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad)})
(45,{(10,Thenmozhi,Rajan,45,AVP,234234234,Noida),(5,Vijaya,Kandhasamy,45,AVP,23452344,Noida),(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad)})
(51,{(8,Meena,Baskar,51,VP,2342344444,Hyderabad),(3,Madhuri,Nathan,51,VP,999333323,Hyderabad)})


j = GROUP b by age;
dump j;
(21,{(5,Trupthi,Mohanthy,21,988833,Bhuwaneshwar),(9,Bharathi,Nambiar,21,9888888,Chennai),(4,Preeti,Singh,21,9234243,Hyderabad),(1,Rajiv,Reddy,21,988383,Hyderabad)})
(22,{(8,Bharathi,Nambiar,22,9888888,Hyderabad),(3,Rajesh,Rakul,22,999333,Kolkatta),(2,Siddharth,Malhothra,22,9932323,Kolkatta)})
(23,{(7,Komal,Nayak,23,98888,Chennai),(5,Trupthi,Mohanthy,23,988833,Bhuwaneshwar)})
(24,{(6,Archana,Mishra,24,988323,Chennai)})


internal item is called Atom
ex : Bhuwaneshwar)

Tuple : Surrounded by ( )
ex : (5,Trupthi,Mohanti,21,9888833,Bhuwaneshwar)

Bag : surrounded by  { }

Atom is a single data
Tuple is a set of row
Bag is a set of rows


Group by multiple columns:
k = GROUP a by (age,city);
dump k;

((22,Chennai),{(9,Gayathri,Raghu,22,Engineer,323423444,Chennai)})
((23,Chennai),{(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai)})
((28,Chennai),{(6,Aarthi,Raj,28,Assistant Manager,99234234,Chennai)})
((28,),{(2,Nandhini,Babu,28,Assistant Manager,Delhi,)})
((31,Hyderabad),{(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad)})
((45,Noida),{(10,Thenmozhi,Rajan,45,AVP,234234234,Noida),(5,Vijaya,Kandhasamy,45,AVP,23452344,Noida)})
((45,Hyderabad),{(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad)})
((51,Hyderabad),{(8,Meena,Baskar,51,VP,2342344444,Hyderabad),(3,Madhuri,Nathan,51,VP,999333323,Hyderabad)})


s = GROUP a ALL;
dump s;
(all,{(,,,,,,),(10,Thenmozhi,Rajan,45,AVP,234234234,Noida),(9,Gayathri,Raghu,22,Engineer,323423444,Chennai),(8,Meena,Baskar,51,VP,2342344444,Hyderabad),(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai),(6,Aarthi,Raj,28,Assistant Manager,99234234,Chennai),(5,Vijaya,Kandhasamy,45,AVP,23452344,Noida),(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad),(3,Madhuri,Nathan,51,VP,999333323,Hyderabad),(2,Nandhini,Babu,28,Assistant Manager,Delhi,),(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad)})

m = COGROUP a by age, b by age;
dump m;

(21,{},{(5,Trupthi,Mohanthy,21,988833,Bhuwaneshwar),(9,Bharathi,Nambiar,21,9888888,Chennai),(4,Preeti,Singh,21,9234243,Hyderabad),(1,Rajiv,Reddy,21,988383,Hyderabad)})
(22,{(9,Gayathri,Raghu,22,Engineer,323423444,Chennai)},{(8,Bharathi,Nambiar,22,9888888,Hyderabad),(3,Rajesh,Rakul,22,999333,Kolkatta),(2,Siddharth,Malhothra,22,9932323,Kolkatta)})
(23,{(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai)},{(7,Komal,Nayak,23,98888,Chennai),(5,Trupthi,Mohanthy,23,988833,Bhuwaneshwar)})
(24,{},{(6,Archana,Mishra,24,988323,Chennai)})
(28,{(6,Aarthi,Raj,28,Assistant Manager,99234234,Chennai),(2,Nandhini,Babu,28,Assistant Manager,Delhi,)},{})
(31,{(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad)},{})
(45,{(10,Thenmozhi,Rajan,45,AVP,234234234,Noida),(5,Vijaya,Kandhasamy,45,AVP,23452344,Noida),(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad)},{})
(51,{(8,Meena,Baskar,51,VP,2342344444,Hyderabad),(3,Madhuri,Nathan,51,VP,999333323,Hyderabad)},{})


Joining of tables:

joinned = join a by age, b by age;

(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,8,Bharathi,Nambiar,22,9888888,Hyderabad)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,3,Rajesh,Rakul,22,999333,Kolkatta)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,2,Siddharth,Malhothra,22,9932323,Kolkatta)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai,7,Komal,Nayak,23,98888,Chennai)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai,5,Trupthi,Mohanthy,23,988833,Bhuwaneshwar)


c = join a by (age,city), b by (age,city);
dump c;

(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai,7,Komal,Nayak,23,98888,Chennai)


f = join a by age LEFT OUTER, b  by age;

(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,8,Bharathi,Nambiar,22,9888888,Hyderabad)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,3,Rajesh,Rakul,22,999333,Kolkatta)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,2,Siddharth,Malhothra,22,9932323,Kolkatta)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai,7,Komal,Nayak,23,98888,Chennai)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai,5,Trupthi,Mohanthy,23,988833,Bhuwaneshwar)
(6,Aarthi,Raj,28,Assistant Manager,99234234,Chennai,,,,,,)
(2,Nandhini,Babu,28,Assistant Manager,Delhi,,,,,,,)
(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad,,,,,,)
(10,Thenmozhi,Rajan,45,AVP,234234234,Noida,,,,,,)
(5,Vijaya,Kandhasamy,45,AVP,23452344,Noida,,,,,,)
(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad,,,,,,)
(8,Meena,Baskar,51,VP,2342344444,Hyderabad,,,,,,)
(3,Madhuri,Nathan,51,VP,999333323,Hyderabad,,,,,,)


 grunt> g = join a by age right , b  by age;

(,,,,,,,5,Trupthi,Mohanthy,21,988833,Bhuwaneshwar)
(,,,,,,,9,Bharathi,Nambiar,21,9888888,Chennai)
(,,,,,,,4,Preeti,Singh,21,9234243,Hyderabad)
(,,,,,,,1,Rajiv,Reddy,21,988383,Hyderabad)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,8,Bharathi,Nambiar,22,9888888,Hyderabad)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,3,Rajesh,Rakul,22,999333,Kolkatta)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,2,Siddharth,Malhothra,22,9932323,Kolkatta)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai,7,Komal,Nayak,23,98888,Chennai)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai,5,Trupthi,Mohanthy,23,988833,Bhuwaneshwar)
(,,,,,,,6,Archana,Mishra,24,988323,Chennai)

 h = join a by age FULL, b by age;

(,,,,,,,1,Rajiv,Reddy,21,988383,Hyderabad)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,8,Bharathi,Nambiar,22,9888888,Hyderabad)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,3,Rajesh,Rakul,22,999333,Kolkatta)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,2,Siddharth,Malhothra,22,9932323,Kolkatta)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai,7,Komal,Nayak,23,98888,Chennai)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai,5,Trupthi,Mohanthy,23,988833,Bhuwaneshwar)
(,,,,,,,6,Archana,Mishra,24,988323,Chennai)
(6,Aarthi,Raj,28,Assistant Manager,99234234,Chennai,,,,,,)
(2,Nandhini,Babu,28,Assistant Manager,Delhi,,,,,,,)
(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad,,,,,,)
(10,Thenmozhi,Rajan,45,AVP,234234234,Noida,,,,,,)
(5,Vijaya,Kandhasamy,45,AVP,23452344,Noida,,,,,,)
(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad,,,,,,)
(8,Meena,Baskar,51,VP,2342344444,Hyderabad,,,,,,)
(3,Madhuri,Nathan,51,VP,999333323,Hyderabad,,,,,,)


i = CROSS a,b;
dump i;

(,,,,,,,5,Trupthi,Mohanthy,21,988833,Bhuwaneshwar)
(,,,,,,,9,Bharathi,Nambiar,21,9888888,Chennai)
(,,,,,,,8,Bharathi,Nambiar,22,9888888,Hyderabad)
(,,,,,,,7,Komal,Nayak,23,98888,Chennai)
(,,,,,,,6,Archana,Mishra,24,988323,Chennai)
(,,,,,,,5,Trupthi,Mohanthy,23,988833,Bhuwaneshwar)
(,,,,,,,4,Preeti,Singh,21,9234243,Hyderabad)
(,,,,,,,3,Rajesh,Rakul,22,999333,Kolkatta)
(,,,,,,,2,Siddharth,Malhothra,22,9932323,Kolkatta)
(,,,,,,,1,Rajiv,Reddy,21,988383,Hyderabad)
(10,Thenmozhi,Rajan,45,AVP,234234234,Noida,5,Trupthi,Mohanthy,21,988833,Bhuwaneshwar)
(10,Thenmozhi,Rajan,45,AVP,234234234,Noida,9,Bharathi,Nambiar,21,9888888,Chennai)
(10,Thenmozhi,Rajan,45,AVP,234234234,Noida,8,Bharathi,Nambiar,22,9888888,Hyderabad)
(10,Thenmozhi,Rajan,45,AVP,234234234,Noida,7,Komal,Nayak,23,98888,Chennai)
(10,Thenmozhi,Rajan,45,AVP,234234234,Noida,6,Archana,Mishra,24,988323,Chennai)
(10,Thenmozhi,Rajan,45,AVP,234234234,Noida,5,Trupthi,Mohanthy,23,988833,Bhuwaneshwar)
(10,Thenmozhi,Rajan,45,AVP,234234234,Noida,4,Preeti,Singh,21,9234243,Hyderabad)
(10,Thenmozhi,Rajan,45,AVP,234234234,Noida,3,Rajesh,Rakul,22,999333,Kolkatta)
(10,Thenmozhi,Rajan,45,AVP,234234234,Noida,2,Siddharth,Malhothra,22,9932323,Kolkatta)
(10,Thenmozhi,Rajan,45,AVP,234234234,Noida,1,Rajiv,Reddy,21,988383,Hyderabad)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,5,Trupthi,Mohanthy,21,988833,Bhuwaneshwar)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,9,Bharathi,Nambiar,21,9888888,Chennai)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,8,Bharathi,Nambiar,22,9888888,Hyderabad)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,7,Komal,Nayak,23,98888,Chennai)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,6,Archana,Mishra,24,988323,Chennai)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,5,Trupthi,Mohanthy,23,988833,Bhuwaneshwar)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,4,Preeti,Singh,21,9234243,Hyderabad)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,3,Rajesh,Rakul,22,999333,Kolkatta)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,2,Siddharth,Malhothra,22,9932323,Kolkatta)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai,1,Rajiv,Reddy,21,988383,Hyderabad)
(8,Meena,Baskar,51,VP,2342344444,Hyderabad,5,Trupthi,Mohanthy,21,988833,Bhuwaneshwar)
(8,Meena,Baskar,51,VP,2342344444,Hyderabad,9,Bharathi,Nambiar,21,9888888,Chennai)
(8,Meena,Baskar,51,VP,2342344444,Hyderabad,8,Bharathi,Nambiar,22,9888888,Hyderabad)
(8,Meena,Baskar,51,VP,2342344444,Hyderabad,7,Komal,Nayak,23,98888,Chennai)
(8,Meena,Baskar,51,VP,2342344444,Hyderabad,6,Archana,Mishra,24,988323,Chennai)
(8,Meena,Baskar,51,VP,2342344444,Hyderabad,5,Trupthi,Mohanthy,23,988833,Bhuwaneshwar)
(8,Meena,Baskar,51,VP,2342344444,Hyderabad,4,Preeti,Singh,21,9234243,Hyderabad)
(8,Meena,Baskar,51,VP,2342344444,Hyderabad,3,Rajesh,Rakul,22,999333,Kolkatta)
(8,Meena,Baskar,51,VP,2342344444,Hyderabad,2,Siddharth,Malhothra,22,9932323,Kolkatta)
(8,Meena,Baskar,51,VP,2342344444,Hyderabad,1,Rajiv,Reddy,21,988383,Hyderabad)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai,5,Trupthi,Mohanthy,21,988833,Bhuwaneshwar)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai,9,Bharathi,Nambiar,21,9888888,Chennai)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai,8,Bharathi,Nambiar,22,9888888,Hyderabad)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai,7,Komal,Nayak,23,98888,Chennai)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai,6,Archana,Mishra,24,988323,Chennai)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai,5,Trupthi,Mohanthy,23,988833,Bhuwaneshwar)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai,4,Preeti,Singh,21,9234243,Hyderabad)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai,3,Rajesh,Rakul,22,999333,Kolkatta)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai,2,Siddharth,Malhothra,22,9932323,Kolkatta)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai,1,Rajiv,Reddy,21,988383,Hyderabad)
(6,Aarthi,Raj,28,Assistant Manager,99234234,Chennai,5,Trupthi,Mohanthy,21,988833,Bhuwaneshwar)
(6,Aarthi,Raj,28,Assistant Manager,99234234,Chennai,9,Bharathi,Nambiar,21,9888888,Chennai)
(6,Aarthi,Raj,28,Assistant Manager,99234234,Chennai,8,Bharathi,Nambiar,22,9888888,Hyderabad)
(6,Aarthi,Raj,28,Assistant Manager,99234234,Chennai,7,Komal,Nayak,23,98888,Chennai)
(6,Aarthi,Raj,28,Assistant Manager,99234234,Chennai,6,Archana,Mishra,24,988323,Chennai)
(6,Aarthi,Raj,28,Assistant Manager,99234234,Chennai,5,Trupthi,Mohanthy,23,988833,Bhuwaneshwar)
(6,Aarthi,Raj,28,Assistant Manager,99234234,Chennai,4,Preeti,Singh,21,9234243,Hyderabad)
(6,Aarthi,Raj,28,Assistant Manager,99234234,Chennai,3,Rajesh,Rakul,22,999333,Kolkatta)
(6,Aarthi,Raj,28,Assistant Manager,99234234,Chennai,2,Siddharth,Malhothra,22,9932323,Kolkatta)
(6,Aarthi,Raj,28,Assistant Manager,99234234,Chennai,1,Rajiv,Reddy,21,988383,Hyderabad)
(5,Vijaya,Kandhasamy,45,AVP,23452344,Noida,5,Trupthi,Mohanthy,21,988833,Bhuwaneshwar)
(5,Vijaya,Kandhasamy,45,AVP,23452344,Noida,9,Bharathi,Nambiar,21,9888888,Chennai)
(5,Vijaya,Kandhasamy,45,AVP,23452344,Noida,8,Bharathi,Nambiar,22,9888888,Hyderabad)
(5,Vijaya,Kandhasamy,45,AVP,23452344,Noida,7,Komal,Nayak,23,98888,Chennai)
(5,Vijaya,Kandhasamy,45,AVP,23452344,Noida,6,Archana,Mishra,24,988323,Chennai)
(5,Vijaya,Kandhasamy,45,AVP,23452344,Noida,5,Trupthi,Mohanthy,23,988833,Bhuwaneshwar)
(5,Vijaya,Kandhasamy,45,AVP,23452344,Noida,4,Preeti,Singh,21,9234243,Hyderabad)
(5,Vijaya,Kandhasamy,45,AVP,23452344,Noida,3,Rajesh,Rakul,22,999333,Kolkatta)
(5,Vijaya,Kandhasamy,45,AVP,23452344,Noida,2,Siddharth,Malhothra,22,9932323,Kolkatta)
(5,Vijaya,Kandhasamy,45,AVP,23452344,Noida,1,Rajiv,Reddy,21,988383,Hyderabad)
(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad,5,Trupthi,Mohanthy,21,988833,Bhuwaneshwar)
(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad,9,Bharathi,Nambiar,21,9888888,Chennai)
(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad,8,Bharathi,Nambiar,22,9888888,Hyderabad)
(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad,7,Komal,Nayak,23,98888,Chennai)
(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad,6,Archana,Mishra,24,988323,Chennai)
(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad,5,Trupthi,Mohanthy,23,988833,Bhuwaneshwar)
(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad,4,Preeti,Singh,21,9234243,Hyderabad)
(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad,3,Rajesh,Rakul,22,999333,Kolkatta)
(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad,2,Siddharth,Malhothra,22,9932323,Kolkatta)
(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad,1,Rajiv,Reddy,21,988383,Hyderabad)
(3,Madhuri,Nathan,51,VP,999333323,Hyderabad,5,Trupthi,Mohanthy,21,988833,Bhuwaneshwar)
(3,Madhuri,Nathan,51,VP,999333323,Hyderabad,9,Bharathi,Nambiar,21,9888888,Chennai)
(3,Madhuri,Nathan,51,VP,999333323,Hyderabad,8,Bharathi,Nambiar,22,9888888,Hyderabad)
(3,Madhuri,Nathan,51,VP,999333323,Hyderabad,7,Komal,Nayak,23,98888,Chennai)
(3,Madhuri,Nathan,51,VP,999333323,Hyderabad,6,Archana,Mishra,24,988323,Chennai)
(3,Madhuri,Nathan,51,VP,999333323,Hyderabad,5,Trupthi,Mohanthy,23,988833,Bhuwaneshwar)
(3,Madhuri,Nathan,51,VP,999333323,Hyderabad,4,Preeti,Singh,21,9234243,Hyderabad)
(3,Madhuri,Nathan,51,VP,999333323,Hyderabad,3,Rajesh,Rakul,22,999333,Kolkatta)
(3,Madhuri,Nathan,51,VP,999333323,Hyderabad,2,Siddharth,Malhothra,22,9932323,Kolkatta)
(3,Madhuri,Nathan,51,VP,999333323,Hyderabad,1,Rajiv,Reddy,21,988383,Hyderabad)
(2,Nandhini,Babu,28,Assistant Manager,Delhi,,5,Trupthi,Mohanthy,21,988833,Bhuwaneshwar)
(2,Nandhini,Babu,28,Assistant Manager,Delhi,,9,Bharathi,Nambiar,21,9888888,Chennai)
(2,Nandhini,Babu,28,Assistant Manager,Delhi,,8,Bharathi,Nambiar,22,9888888,Hyderabad)
(2,Nandhini,Babu,28,Assistant Manager,Delhi,,7,Komal,Nayak,23,98888,Chennai)
(2,Nandhini,Babu,28,Assistant Manager,Delhi,,6,Archana,Mishra,24,988323,Chennai)
(2,Nandhini,Babu,28,Assistant Manager,Delhi,,5,Trupthi,Mohanthy,23,988833,Bhuwaneshwar)
(2,Nandhini,Babu,28,Assistant Manager,Delhi,,4,Preeti,Singh,21,9234243,Hyderabad)
(2,Nandhini,Babu,28,Assistant Manager,Delhi,,3,Rajesh,Rakul,22,999333,Kolkatta)
(2,Nandhini,Babu,28,Assistant Manager,Delhi,,2,Siddharth,Malhothra,22,9932323,Kolkatta)
(2,Nandhini,Babu,28,Assistant Manager,Delhi,,1,Rajiv,Reddy,21,988383,Hyderabad)
(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad,5,Trupthi,Mohanthy,21,988833,Bhuwaneshwar)
(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad,9,Bharathi,Nambiar,21,9888888,Chennai)
(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad,8,Bharathi,Nambiar,22,9888888,Hyderabad)
(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad,7,Komal,Nayak,23,98888,Chennai)
(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad,6,Archana,Mishra,24,988323,Chennai)
(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad,5,Trupthi,Mohanthy,23,988833,Bhuwaneshwar)
(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad,4,Preeti,Singh,21,9234243,Hyderabad)
(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad,3,Rajesh,Rakul,22,999333,Kolkatta)
(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad,2,Siddharth,Malhothra,22,9932323,Kolkatta)
(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad,1,Rajiv,Reddy,21,988383,Hyderabad)


k = UNION a,b;
dump k;

(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad)
(2,Nandhini,Babu,28,Assistant Manager,Delhi,)
(3,Madhuri,Nathan,51,VP,999333323,Hyderabad)
(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad)
(5,Vijaya,Kandhasamy,45,AVP,23452344,Noida)
(6,Aarthi,Raj,28,Assistant Manager,99234234,Chennai)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai)
(8,Meena,Baskar,51,VP,2342344444,Hyderabad)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai)
(10,Thenmozhi,Rajan,45,AVP,234234234,Noida)
(,,,,,,)
(1,Rajiv,Reddy,21,988383,Hyderabad)
(2,Siddharth,Malhothra,22,9932323,Kolkatta)
(3,Rajesh,Rakul,22,999333,Kolkatta)
(4,Preeti,Singh,21,9234243,Hyderabad)
(5,Trupthi,Mohanthy,23,988833,Bhuwaneshwar)
(6,Archana,Mishra,24,988323,Chennai)
(7,Komal,Nayak,23,98888,Chennai)
(8,Bharathi,Nambiar,22,9888888,Hyderabad)
(9,Bharathi,Nambiar,21,9888888,Chennai)
(5,Trupthi,Mohanthy,21,988833,Bhuwaneshwar)


SPLIT a into x if age<28, y if age>28;

dump x;
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai)

dump y;
(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad)
(3,Madhuri,Nathan,51,VP,999333323,Hyderabad)
(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad)
(5,Vijaya,Kandhasamy,45,AVP,23452344,Noida)
(8,Meena,Baskar,51,VP,2342344444,Hyderabad)
(10,Thenmozhi,Rajan,45,AVP,234234234,Noida)


grunt> l = FILTER a BY city =='Chennai';
grunt> dump l;
(6,Aarthi,Raj,28,Assistant Manager,99234234,Chennai)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai)

SELECT QUERY EQUIVALENT
grunt> n = FOREACH a GENERATE fname,age,city;
grunt> dump n;

(Nithya,31,Hyderabad)
(Nandhini,28,)
(Madhuri,51,Hyderabad)
(Kavitha,45,Hyderabad)
(Vijaya,45,Noida)
(Aarthi,28,Chennai)
(Lavanya,23,Chennai)
(Meena,51,Hyderabad)
(Gayathri,22,Chennai)
(Thenmozhi,45,Noida)

grunt> o = ORDER a by age DESC;
grunt> dump o;
(8,Meena,Baskar,51,VP,2342344444,Hyderabad)
(3,Madhuri,Nathan,51,VP,999333323,Hyderabad)
(10,Thenmozhi,Rajan,45,AVP,234234234,Noida)
(5,Vijaya,Kandhasamy,45,AVP,23452344,Noida)
(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad)
(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad)
(6,Aarthi,Raj,28,Assistant Manager,99234234,Chennai)
(2,Nandhini,Babu,28,Assistant Manager,Delhi,)
(7,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai)
(9,Gayathri,Raghu,22,Engineer,323423444,Chennai)
(,,,,,,)


grunt> limitexa = LIMIT a 4;
grunt> dump limitexa;
(2,Nandhini,Babu,28,Assistant Manager,Delhi,)
(1,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad)
(3,Madhuri,Nathan,51,VP,999333323,Hyderabad)
(4,Kavitha,Manoharan,45,AVP,992342344,Hyderabad)

A = LOAD 'somefile' USING PigStorage(',')
AS (field1, field2, field3);

results = FOREACH a GENERATE field1;
DUMP results;


pig
ls
ls -l
help
grunt >  cd /user/hue
grunt > ls
grunt > cat apple.csv
grunt > mkdir demo
grunt > rm demo


apple.csv, nasdaq.csv

Load Data from HDFS
Write JOIN in Pig Latin
Store Results in HDFS


hdfs dfs -ls /user/thehenson
hdfs dfs -put /tmp/appl_daily.csv /user/thehenson/appl_daily.csv
hdfs dfs -put /tmp/nasdaq_daily.csv /user/thehenson/nasdaq_daily.csv
hdfs dfs -ls /user/thehenson


hdfs dfs -cp /user/thehenson/appledaily.csv /user/thehenson/nasdaqdaily.csv
pig

grunt>
apple = LOAD '/user/thehenson/appledaily.csv' using PigStorage(',') as (date:chararray, open:int, high:int, low:int, close:int, volume:int, adj_close:int);
DUMP apple;

nasdaq = LOAD '/user/thehenson/nasdaqdaily.csv' using PigStorage(',') as (date:chararray, open:int, high:int, low:int, close:int, volume:int, adj_close:int);
DUMP nasdaq;

apple_high = FOREACH apple GENERATE date,high;
dump apple_high;

nasdaq_high = FOREACH nasdaq GENERATE date,high;
dump nasdaq_high;

combined = JOIN apple_high BY date, nasdaq_high BY date;
dump combined;

STORE combined INTO '/user/thehenson/apple_nasdaq_highs' using PigStoarge(',');


Pig_Aru
-----------
$ pig

grunt> A = LOAD '/user/cloudera/pig/ds/user_info.txt' USING PigStorage(',') AS (user_id:chararray,firstname:chararray,lastname:chararray,date:chararray);


grunt> describe A;
A: {user_id: chararray,firstname: chararray,lastname: chararray,date: chararray}


grunt> A01 = LOAD '/user/cloudera/pig/ds/user_info.txt' using PigStorage(',');
grunt> describe A01;
Schema for A01 unknown.


grunt> dump A;
(A11,john,keith,2016-03-12)
(A12,aru,mugam,2016-03-13)
(A13,ram,manohar,2016-03-11)
(A14,mani,vasagam,2016-03-12)
(A21,suman,ganguly,2016-03-11)
(A22,arul,parithi,2016-03-12)
(A23,sachin,tendulkar,2016-03-13)
(A24,aritra,banarjee,2016-03-12)



grunt> B = filter A by date =='2016-03-11';

grunt> dump B;
(A13,ram,manohar,2016-03-11)
(A21,suman,ganguly,2016-03-11)

grunt> describe B;
B: {user_id: chararray,firstname: chararray,lastname: chararray,date: chararray}



grunt> C = foreach B generate user_id,firstname;

grunt> describe C;
C: {user_id: chararray,firstname: chararray}

grunt> dump C;
(A13,ram)
(A21,suman)

grunt> B1 = foreach A generate date;

grunt> dump B1;
(2016-03-12)
(2016-03-13)
(2016-03-11)
(2016-03-12)
(2016-03-11)
(2016-03-12)
(2016-03-13)
(2016-03-12)

grunt> D = distinct B1;
grunt> dump D;
(2016-03-11)
(2016-03-12)
(2016-03-13)

grunt> store D into '/user/cloudera/pig/ds/date_exa/';

[cloudera@quickstart ~]$ hdfs dfs -ls '/user/cloudera/pig/ds/date_exa/';
Found 2 items
-rw-r--r--   1 cloudera cloudera          0 2018-09-07 00:40 /user/cloudera/pig/ds/date_exa/_SUCCESS
-rw-r--r--   1 cloudera cloudera         33 2018-09-07 00:40 /user/cloudera/pig/ds/date_exa/part-r-00000
[cloudera@quickstart ~]$

[cloudera@quickstart ~]$ hdfs dfs -cat '/user/cloudera/pig/ds/date_exa/part-r-00000';
2016-03-11
2016-03-12
2016-03-13


A = LOAD '/user/cloudera/pig/ds/user_info.txt' USING PigStorage(',') AS (user_id:chararray,firstname:chararray,lastname:chararray,date:chararray);
B = filter A by date =='2016-03-11';
C = foreach B generate user_id,firstname;
B1 = foreach A generate date;
D = distinct B1;
E = group A by date;
dump E;
  (2016-03-11,{(A21,suman,ganguly,2016-03-11),(A13,ram,manohar,2016-03-11)})
(2016-03-12,{(A24,aritra,banarjee,2016-03-12),(A22,arul,parithi,2016-03-12),(A14,mani,vasagam,2016-03-12),(A11,john,keith,2016-03-12)})
(2016-03-13,{(A23,sachin,tendulkar,2016-03-13),(A12,aru,mugam,2016-03-13)})

grunt> F = group A all;
grunt> dump F;
(all,{(A24,aritra,banarjee,2016-03-12),
  (A23,sachin,tendulkar,2016-03-13),
  (A22,arul,parithi,2016-03-12),
  (A21,suman,ganguly,2016-03-11),
  (A14,mani,vasagam,2016-03-12),
  (A13,ram,manohar,2016-03-11),
  (A12,aru,mugam,2016-03-13),
  (A11,john,keith,2016-03-12)})


  A = LOAD '/user/cloudera/pig/ds/user_info.txt' USING PigStorage(',') AS (user_id:chararray,firstname:chararray,lastname:chararray,date:chararray);
  B = filter A by date =='2016-03-11';
  C = foreach B generate user_id,firstname;
  B1 = foreach A generate date;
  D = distinct B1;
  E = group A by date;
  A01 = LOAD '/user/cloudera/pig/ds/cogroup.txt' using PigStorage(',') AS (user_id:chararray,date:chararray);
  G = COGROUP A By date, A01 BY date;
  X01 = foreach G generate group,COUNT(A)+COUNT(A01);

  grunt> dump X01;
(2016-03-12,7)
(2016-03-13,5)
(2016-03-11,4)
(,0)


customers = LOAD '/user/cloudera/pig/ds/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray,salary:int);
orders = LOAD '/user/cloudera/pig/ds/orders.txt' USING PigStorage(',') as (oid:int, date:chararray,customer_id:int, amount:int);
customer_orders  = JOIN customers By id, orders By customer_id;
result = LIMIT customer_orders 10;
dump result;

(1,Richard,,XXXXXXXXX,,22945,2013-12-13 00:00:00,1,)
(2,Mary,,XXXXXXXXX,,33865,2014-02-18 00:00:00,2,)
(2,Mary,,XXXXXXXXX,,57963,2013-08-02 00:00:00,2,)
(2,Mary,,XXXXXXXXX,,67863,2013-11-30 00:00:00,2,)
(2,Mary,,XXXXXXXXX,,15192,2013-10-29 00:00:00,2,)
(3,Ann,,XXXXXXXXX,,35158,2014-02-26 00:00:00,3,)
(3,Ann,,XXXXXXXXX,,46399,2014-05-09 00:00:00,3,)
(3,Ann,,XXXXXXXXX,,56178,2014-07-15 00:00:00,3,)
(3,Ann,,XXXXXXXXX,,22646,2013-12-11 00:00:00,3,)
(3,Ann,,XXXXXXXXX,,23662,2013-12-19 00:00:00,3,)


How to connect Hive with Pig?
pig -useHCatalog  :: Start Pig with this -useHCatalog option


Start MySQL server in Local Linux:
sudo service mysqld start;


Run Hive:
hive
show databases;
use batch_may;

hive> show tables;
OK
student
student_part
student_part_stat

Pig continued here...
grunt> A = LOAD 'batch_may.student' using org.apache.hive.hcatalog.pig.HCatLoader();
grunt > dump A;
(102,saran,CHN)
(103,Maran,TRC)
(101,sara,MDR)

Why we connect Pig with Hive?
Pig eats anything. It supports structured, semi structured, unstructed datas. So, in order to handle semi or unstructured data in Hive,

we use Pig to transform semi / unstructured data to Hive.

create a hive table which is copy of student.
hive> create table mine like student;  /// newly created table name : mine but it is empty.
OK
Time taken: 0.364 seconds

HCatLoader() // To load hive table in Pig session
HCatStorer() // To write Pig relation into Hive table. Table must exists in Hive while running Storer.
Here
grunt> A = LOAD 'batch_may.student' using org.apache.hive.hcatalog.pig.HCatLoader();

grunt> STORE A into 'batch_may.mine' using org.apache.hive.hcatalog.pig.HCatStorer();

hive> select * from mine;
OK
102 saran CHN
103 Maran TRC
101 sara MDR
102 saran CHN
103 Maran TRC
101 sara MDR


creat hello.txt file in local and copy it into hdfs and the content is:
hdfs dfs -cat /user/cloudera/hello.txt
I love India
India is my I
I love India


Word count program in Pig:
---------------------------
A = LOAD '/user/cloudera/hello.txt' USING PigStorage() AS ( lines:chararray );

B = FOREACH A GENERATE FLATTEN ( TOKENIZE ( lines ) ) as word;

C = FOREACH B generate word, 1 as Count;

D = GROUP C BY word;

E = FOREACH D GENERATE group,SUM(C.Count);

store E into 'output1' using PigStorage(',');

Result:
(I,3)
(is,1)
(my,1)
(love,2)
(India,3)

[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/output1/part-r-00000
I,3
is,1
my,1
love,2
India,3

deleted that folder:
[cloudera@quickstart ~]$ hdfs dfs -rm -r /user/cloudera/output1
Deleted /user/cloudera/output1

run the .pig file in $ prompt and not in grunt> shell

creating wordcount.pig in $ prompt using cat.

[cloudera@quickstart ~]$ cat > wordcount.pig
A = LOAD '/user/cloudera/hello.txt' USING PigStorage() AS ( lines:chararray );

B = FOREACH A GENERATE FLATTEN ( TOKENIZE ( lines ) ) as word;

C = FOREACH B generate word, 1 as Count;

D = GROUP C BY word;

E = FOREACH D GENERATE group,SUM(C.Count);

store E into 'output1' using PigStorage(',');
^C

view the created file:

[cloudera@quickstart ~]$ cat wordcount.pig
A = LOAD '/user/cloudera/hello.txt' USING PigStorage() AS ( lines:chararray );

B = FOREACH A GENERATE FLATTEN ( TOKENIZE ( lines ) ) as word;

C = FOREACH B generate word, 1 as Count;

D = GROUP C BY word;

E = FOREACH D GENERATE group,SUM(C.Count);

store E into 'output1' using PigStorage(',');


Run pig file:
pig -f wordcount.pig

Input(s):
Successfully read 3 records (415 bytes) from: "/user/cloudera/hello.txt"

Output(s):
Successfully stored 5 records (29 bytes) in: "hdfs://quickstart.cloudera:8020/user/cloudera/output1"

Counters:
Total records written : 5
Total bytes written : 29


creat an external table in hive:
---------------------------------

hive> use batch_may;
OK
Time taken: 0.506 seconds
hive> create external table ipadr (
date1 string,
ipaddress string)
row format delimited
fields terminated by '|'
location '/user/cloudera/weblog/output2';
OK
Time taken: 0.371 seconds

Script to analyse weblog:
--------------------------
A = LOAD '/user/cloudera/pig/ds/Web_log.txt' using TextLoader() as (lines:chararray);

B = FOREACH A GENERATE REGEX_EXTRACT(lines,'(\\d*/\\w*/\\d*)',1) as Date,REGEX_EXTRACT(lines,'(\\d*.\\d*.\\d*.\\d*)',1) as Ips;

F = distinct B;

C = Group F by Date;

D = FOREACH C GENERATE group as Date,COUNT(F) as cnt;

E = ORDER D by cnt DESC;


dump E:
(04/Nov/2011,305)
(05/Nov/2011,286)
(08/Nov/2011,283)
(07/Nov/2011,266)
(06/Nov/2011,257)
(03/Nov/2011,253)
(02/Nov/2011,230)
(10/Nov/2011,229)
(09/Nov/2011,227)
(30/Oct/2011,217)
(28/Oct/2011,213)
(01/Nov/2011,210)
(27/Oct/2011,203)
(31/Oct/2011,200)
(29/Oct/2011,182)
(11/Nov/2011,70)
(26/Oct/2011,42)
(25/Oct/2011,3)

No comments:

Post a Comment

Flume - Simple Demo

// create a folder in hdfs : $ hdfs dfs -mkdir /user/flumeExa // Create a shell script which generates : Hadoop in real world <n>...