scala> val x:Int = 20
x: Int = 20
scala> val s:String = "Scala"
s: String = Scala
scala> val y = 1234.6
y: Double = 1234.6
// Type inference is done automatically
scala> val a = 35
a: Int = 35
scala> val p = "Spark"
p: String = Spark
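// Type inference also works for collections and expressions.
// A small illustrative sketch (these values are examples, not part of the session above):
val nums = List(1, 2, 3)    // inferred as List[Int]
val total = nums.sum        // inferred as Int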
Note: Only one SparkContext is allowed per application,
but multiple SparkSessions are allowed.
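In spark-shell the SparkContext is already available as sc (and, in Spark 2.x, the SparkSession as spark). In a standalone application they would be created roughly like this (a minimal sketch assuming Spark 2.x; the app name and master are arbitrary):
import org.apache.spark.sql.SparkSession

// one SparkContext per JVM, obtained through a SparkSession
val spark = SparkSession.builder()
  .appName("RddBasics")        // hypothetical application name
  .master("local[*]")          // run locally on all cores
  .getOrCreate()

val sc = spark.sparkContext    // the single underlying SparkContext

// additional sessions share the same SparkContext
val spark2 = spark.newSession()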
Create a text file: file1
gedit file1
spark scala
scala spark hadoop
mapreduce hive pig spark
// Create an RDD by reading a file
// read a file located in the local file system (CentOS)
scala> val rd1 = sc.textFile("file:///home/cloudera/Desktop/file1")
rd1: org.apache.spark.rdd.RDD[String] = file:///home/cloudera/Desktop/file1 MapPartitionsRDD[1] at textFile at <console>:27
// get number of rows
scala> rd1.count
res0: Long = 3
// display the RDD content
scala> rd1.collect
res1: Array[String] = Array(spark scala, scala spark hadoop, mapreduce hive pig spark)
// row by row
scala> rd1.collect.foreach(println)
spark scala
scala spark hadoop
mapreduce hive pig spark
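textFile splits the input into partitions; an optional second argument asks for a minimum number of partitions. A quick sketch (the actual counts depend on file size and configuration):
rd1.partitions.length                                              // number of partitions of rd1
val rd1b = sc.textFile("file:///home/cloudera/Desktop/file1", 4)   // request at least 4 partitions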
// do a transformation
// every row's content will be concatenated with the word " tech "
scala> val rd2 = rd1.map (x => x + " tech ")
rd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at map at <console>:31
// every line now ends with "tech" because of the above transformation
scala> rd2.collect.foreach(println)
spark scala tech
scala spark hadoop tech
mapreduce hive pig spark tech
// rd1 is the root RDD, which is the parent of rd2
// we applied only one transformation
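Transformations can be chained; each one adds a step to the lineage, which can be inspected with toDebugString. A sketch (rd3 is just an illustrative name):
val rd3 = rd2.map(x => x.toUpperCase)   // lineage: rd1 -> rd2 -> rd3
rd3.collect.foreach(println)            // same lines, upper-cased, still ending with "tech"
println(rd2.toDebugString)              // prints the lineage back to the textFile root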
// Write the RDD content as a text file to the local file system
scala> rd2.saveAsTextFile("file:///home/cloudera/Desktop/output1")
// output location is desktop
[cloudera@quickstart output1]$ pwd
/home/cloudera/Desktop/output1
[cloudera@quickstart output1]$ ls -l
total 8
-rw-r--r-- 1 cloudera cloudera 43 Jan 18 10:35 part-00000
-rw-r--r-- 1 cloudera cloudera 31 Jan 18 10:35 part-00001
-rw-r--r-- 1 cloudera cloudera 0 Jan 18 10:35 _SUCCESS
[cloudera@quickstart output1]$ cat part-00000
spark scala tech
scala spark hadoop tech
[cloudera@quickstart output1]$ cat part-00001
mapreduce hive pig spark tech
[cloudera@quickstart output1]$
We read a file from the local file system, applied a transformation,
and wrote the output back to the local file system.
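saveAsTextFile writes one part-xxxxx file per partition and fails if the output directory already exists. If a single output file is preferred, the RDD can be reduced to one partition first (a sketch with a hypothetical output path):
rd2.coalesce(1).saveAsTextFile("file:///home/cloudera/Desktop/output1_single")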
// copy file1 into hdfs
[cloudera@quickstart Desktop]$ hdfs dfs -copyFromLocal file1 /user/cloudera/file1
// check it
[cloudera@quickstart Desktop]$ hdfs dfs -ls /user/cloudera
-rw-r--r-- 1 cloudera cloudera 56 2019-01-18 10:38 /user/cloudera/file1
hdfs dfs -cat /user/cloudera/file1
spark scala
scala spark hadoop
mapreduce hive pig spark
hdfs dfs -cat hdfs://localhost/user/cloudera/file1
spark scala
scala spark hadoop
mapreduce hive pig spark
// load an HDFS file into an RDD
scala> val r1 = sc.textFile("hdfs://localhost/user/cloudera/file1")
r1: org.apache.spark.rdd.RDD[String] = hdfs://localhost/user/cloudera/file1 MapPartitionsRDD[5] at textFile at <console>:27
scala> r1.count
res5: Long = 3
scala> r1.collect.foreach(println)
spark scala
scala spark hadoop
mapreduce hive pig spark
// Write the output to the HDFS file system
scala> r1.saveAsTextFile("hdfs://localhost/user/cloudera/myOutput")
// Display the contents of the myOutput folder located in HDFS
hdfs dfs -ls hdfs://localhost/user/cloudera/myOutput
Found 3 items
-rw-r--r-- 1 cloudera cloudera 0 2019-01-18 10:43 hdfs://localhost/user/cloudera/myOutput/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 31 2019-01-18 10:43 hdfs://localhost/user/cloudera/myOutput/part-00000
-rw-r--r-- 1 cloudera cloudera 25 2019-01-18 10:43 hdfs://localhost/user/cloudera/myOutput/part-00001
hdfs dfs -cat hdfs://localhost/user/cloudera/myOutput/part-00000
spark scala
scala spark hadoop
hdfs dfs -cat hdfs://localhost/user/cloudera/myOutput/part-00001
mapreduce hive pig spark
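The whole output directory can be read back into an RDD; textFile accepts a directory and reads every part file inside it (a sketch):
val back = sc.textFile("hdfs://localhost/user/cloudera/myOutput")
back.collect.foreach(println)   // the same three lines (partition order may vary)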
// filter transformation
// keep only the lines that contain 'scala'
scala> val r2 = r1.filter (x => x.contains("scala"))
r2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at filter at <console>:31
scala> r2.collect.foreach(println)
spark scala
scala spark hadoop
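filter can be combined with other operations, for example counting the matching lines or inverting the condition (a sketch):
r1.filter(x => x.contains("spark")).count                         // lines containing "spark"
r1.filter(x => !x.contains("scala")).collect.foreach(println)     // lines without "scala"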