Sunday, 24 May 2020

CSV Import and Export into Hive using Spark with Scala

First attempt: reading the CSV without header information, so Spark assigns default column names:

scala> val df = spark.read.format("csv").load("hdfs://localhost:9000/SparkFiles/person.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 2 more fields]

scala> df.show()  // column names default to _c0, _c1, etc.
+-----+---+---+----+
|  _c0|_c1|_c2| _c3|
+-----+---+---+----+
| Ravi| 23|  M|5000|
|Rahul| 24|  M|6300|
| Siva| 22|  M|3200|
+-----+---+---+----+
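The `_c0`, `_c1`, … names come from Spark numbering columns by position when no header row is available. That behavior can be sketched in plain Scala (no Spark needed; the person.csv rows are inlined here purely for illustration):

```scala
object HeaderDemo {
  def main(args: Array[String]): Unit = {
    // The same rows as person.csv, inlined for illustration
    val lines = Seq("Ravi,23,M,5000", "Rahul,24,M,6300", "Siva,22,M,3200")
    val rows  = lines.map(_.split(","))

    // Without a header row, columns get positional names: _c0, _c1, ...
    val defaultNames = rows.head.indices.map(i => s"_c$i")
    println(defaultNames.mkString(","))   // _c0,_c1,_c2,_c3

    // With header=true, the first line of the file supplies the names instead
    val header = "Name,Age,Gender,Salary".split(",").toSeq
    val named  = rows.map(r => header.zip(r).toMap)
    println(named.map(_("Name")).mkString(","))   // Ravi,Rahul,Siva
  }
}
```

Note that re-uploading the file is not the only fix: within Spark itself, `df.toDF("Name", "Age", "Gender", "Salary")` renames the columns without touching the file.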

#Added a header row to the local file
hadoop@hadoop:~/scratch$ cat person.csv
Name,Age,Gender,Salary
Ravi,23,M,5000
Rahul,24,M,6300
Siva,22,M,3200

#Deleting the old file from HDFS
hadoop@hadoop:~/scratch$ hdfs dfs -rm /SparkFiles/person.csv
Deleted /SparkFiles/person.csv

hadoop@hadoop:~/scratch$ hdfs dfs -copyFromLocal person.csv /SparkFiles

hadoop@hadoop:~/scratch$ hdfs dfs -cat /SparkFiles/person.csv
Name,Age,Gender,Salary
Ravi,23,M,5000
Rahul,24,M,6300
Siva,22,M,3200


scala> val df = spark.read.format("csv").option("header","True").load("hdfs://localhost:9000/SparkFiles/person.csv")
df: org.apache.spark.sql.DataFrame = [Name: string, Age: string ... 2 more fields]

scala> df.show()
+-----+---+------+------+
| Name|Age|Gender|Salary|
+-----+---+------+------+
| Ravi| 23|     M|  5000|
|Rahul| 24|     M|  6300|
| Siva| 22|     M|  3200|
+-----+---+------+------+
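One thing to note in the schema above: Age and Salary still came back as string, because the header option only names the columns. If typed columns are wanted, Spark's inferSchema option makes an extra pass over the data and picks numeric types; a sketch:

```scala
scala> val df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("hdfs://localhost:9000/SparkFiles/person.csv")
```

With inferSchema enabled, Age and Salary should load as int rather than string, at the cost of one additional read of the file.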

#Write the DataFrame to a Hive table in the rocks database
#(by default saveAsTable errors if the table exists; chain .mode("overwrite") to replace it)
scala> df.write.saveAsTable("rocks.employee")

hadoop@hadoop:~$ hive

hive> show databases;
OK
default
rocks


hive> show tables;
OK
employee
person


hive> select * from employee;
OK
Ravi 23 M 5000
Rahul 24 M 6300
Siva 22 M 3200
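The round trip can also be verified from the Spark side, without leaving the shell (a quick sketch against the table created above):

```scala
scala> spark.sql("select * from rocks.employee").show()
```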

Earlier, we completed the following one-time setup:
#Downloaded the MySQL Connector/J jar and put it into the spark/jars folder

hadoop@hadoop:~/spark-3.0.0-preview2-bin-hadoop3.2/jars$ wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.20/mysql-connector-java-8.0.20.jar

Copied hive/conf/hive-site.xml into spark/conf:
cp /home/hadoop/apache-hive-3.1.2-bin/conf/hive-site.xml /home/hadoop/spark-3.0.0-preview2-bin-hadoop3.2/conf



hive-site.xml configuration used:
---------------------------------

<configuration>
   <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
      <description>metadata is stored in a MySQL server</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
      <description>MySQL JDBC driver class</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
      <description>user name for connecting to mysql server</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>mypassword</value>
      <description>password for connecting to mysql server</description>
   </property>
</configuration>
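One caveat with this configuration: com.mysql.jdbc.Driver is the legacy class name. MySQL Connector/J 8.x (the 8.0.20 jar downloaded above) renamed the driver class, and logs a deprecation warning until the property is updated to:

```xml
<property>
   <name>javax.jdo.option.ConnectionDriverName</name>
   <value>com.mysql.cj.jdbc.Driver</value>
   <description>MySQL JDBC driver class (Connector/J 8.x name)</description>
</property>
```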
