>>> ut.count()
>>> ut.first()
'dispatching_base_number,date,active_vehicles,trips'
>>> rows = ut.map(lambda line: line.split(","))
>>> rows.map(lambda row: row[0]).distinct().count()
>>> base02617 = rows.filter(lambda row: "B02617" in row)
>>> base02617.collect()
>>> base02617.filter(lambda row: int(row[3]) > 15000).map(lambda day: day[1]).distinct().count()
Spark Core Intro:
-----------------
Want to learn Spark
to know fundamentals of Spark
Evaluate Spark
Engine for efficient large-scale data processing. Faster than Hadoop MapReduce
Spark can complement existing Hadoop investments such as HDFS and Hive
Rich ecosystem including support for SQL, ML, and multiple language APIs: Java, Scala, Python
RDD - Resilient Distributed Datasets
Transformation
Actions
Spark Driver Programs and SparkContext
RDDs :
Primary abstraction for data interaction (lazy, in memory)
Immutable, distributed collection of elements separated into partitions
Multiple Types of RDDs
RDDs can be created from external data sets such as Hadoop InputFormats or text files on a variety of file systems,
or from existing RDDs via Spark Transformations
Transformations are RDD functions which return pointers to new RDDs (lazy)
map, flatMap, filter
A transformation creates a new RDD
Actions are RDD functions which return values to the driver
reduce, collect, count etc.
Transformations ==> RDDs ==> Actions ==> output results
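A minimal PySpark sketch of this flow (assuming an existing SparkContext named sc and a hypothetical numbers.txt file with one integer per line): map and filter only describe new RDDs; nothing executes until the count action is called.
nums = sc.textFile("numbers.txt")           # transformation: defines an RDD, nothing is read yet
ints = nums.map(lambda line: int(line))     # transformation: lazily parse each line as an int
evens = ints.filter(lambda n: n % 2 == 0)   # transformation: lazily keep even numbers
print(evens.count())                        # action: runs the whole pipeline, returns a value to the driver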
Spark Driver, Workers
Spark Driver ==> program that declares transformations and actions on RDDs of data
Driver submits the serialized RDD graph to the master where the master creates tasks.
These tasks are delegated to the workers for execution
Workers are where the tasks are actually executed.
Driver Program (Spark Context)
Cluster Manager
Worker Node [ Executor, Cache, Tasks ]
The Driver Program, using a SparkContext, interacts with the Cluster Manager and distributes the load across worker nodes
Parallel processing
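As a standalone sketch of the driver-side view (the app name and master URL below are placeholders): the driver builds a SparkContext, which talks to the cluster manager, and each action distributes tasks to executors on the worker nodes.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("driver-demo").setMaster("local[2]")   # placeholder master URL
sc = SparkContext(conf=conf)                # the driver program owns this SparkContext
rdd = sc.parallelize(range(1, 1001), 4)     # data is split into 4 partitions for the executors
total = rdd.sum()                           # the action ships tasks to workers; the result comes back to the driver
print(total)                                # 500500
sc.stop()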
RDDs support 2 types of operations:
Transformations
Creates new dataset from an existing one
Lazy; only computed when a result is required
Transformed RDDs are recomputed each time an action is run against it.
persist / cache to avoid recomputing
Actions
Returns a value to the driver program
ubercsv = sc.textFile("uber.csv") ==> New RDD Created
rows = ubercsv.map(lambda l: len(l)) ==> Another new RDD
totalRows = rows.reduce(lambda a,b:a+b) ==> RDD is now computed across different machines
rows.cache() ==> reuse without recomputing
Actions aggregate all the RDD elements using some function such as the previously seen Reduce
Returns the final result back to the driver program.
baby_names = sc.textFile("baby_names.csv")
rows = baby_names.map(lambda line: line.split(","))
for row in rows.take(rows.count()) : print(row[1])
First Name
DOMINIC
ADDISION.....
rows.filter(lambda line:"MICHAEL" in line).collect()
[u'2012',u'MICHAEL',u'KINGS',u'M',u'172']....
Algorithm difficulties
Optimization for existing Hadoop
100 times faster
Processing power, time, code: all shrink
Tinier code, increased readability
Expressiveness
Fast
Computation against disk - MapReduce
Computation against cached data in memory - Spark
Directly interact with data using the local machine
Scale up / scale out
Fault tolerant
Unify Big Data - batch, stream (real time), MLlib
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Basics").getOrCreate()
df = spark.read.json("people.json")
df.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
df.describe().show()
+-------+------------------+-------+
|summary| age| name|
+-------+------------------+-------+
| count| 2| 3|
| mean| 24.5| null|
| stddev|7.7781745930520225| null|
| min| 19| Andy|
| max| 30|Michael|
+-------+------------------+-------+
df.columns
['age', 'name']
df.columns[0]
'age'
df.columns[-1]
'name'
df.describe()
DataFrame[summary: string, age: string, name: string]
from pyspark.sql.types import (StructField,StringType,
IntegerType,StructType)
data_schema = [StructField("age",IntegerType(),True),
StructField("name",StringType(),True)]
final_struct = StructType(fields=data_schema)
df = spark.read.json("people.json",schema=final_struct)
df.printSchema()
root
|-- age: integer (nullable = true)
|-- name: string (nullable = true)
df.describe().show()
+-------+------------------+-------+
|summary| age| name|
+-------+------------------+-------+
| count| 2| 3|
| mean| 24.5| null|
| stddev|7.7781745930520225| null|
| min| 19| Andy|
| max| 30|Michael|
+-------+------------------+-------+
df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
df["age"]
Column<b'age'>
type(df["age"])
pyspark.sql.column.Column
df.select("age")
DataFrame[age: int]
df.select("age").show()
+----+
| age|
+----+
|null|
| 30|
| 19|
+----+
df.head(2)
[Row(age=None, name='Michael'), Row(age=30, name='Andy')]
df.head(2)[0]
Row(age=None, name='Michael')
df.head(2)[1]
Row(age=30, name='Andy')
df.select("age").show()
+----+
| age|
+----+
|null|
| 30|
| 19|
+----+
df.select(["age","name"])
DataFrame[age: int, name: string]
df.select(["age","name"]).show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
df.withColumn("nameAge",df["age"]).show()
+----+-------+-------+
| age| name|nameAge|
+----+-------+-------+
|null|Michael| null|
| 30| Andy| 30|
| 19| Justin| 19|
+----+-------+-------+
df.withColumn("DoubleAge",df["age"]*2).show()
+----+-------+---------+
| age| name|DoubleAge|
+----+-------+---------+
|null|Michael| null|
| 30| Andy| 60|
| 19| Justin| 38|
+----+-------+---------+
df.withColumnRenamed("age","my_new_age").show()
+----------+-------+
|my_new_age| name|
+----------+-------+
| null|Michael|
| 30| Andy|
| 19| Justin|
+----------+-------+
df.createOrReplaceTempView("people")
results = spark.sql("SELECT * FROM people")
results.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
results = spark.sql("SELECT * FROM people WHERE age = 30")
results.show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
spark.sql("SELECT * FROM people WHERE age = 30").show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ops').getOrCreate()
df = spark.read.csv("AAPL.csv",inferSchema=True,header=True)
df.head(2)[0]
Row(Date=datetime.datetime(2018, 2, 13, 0, 0), Open=161.949997, High=164.75, Low=161.649994, Close=164.339996, Adj Close=164.339996, Volume=32549200)
df.printSchema()
root
|-- Date: timestamp (nullable = true)
|-- Open: double (nullable = true)
|-- High: double (nullable = true)
|-- Low: double (nullable = true)
|-- Close: double (nullable = true)
|-- Adj Close: double (nullable = true)
|-- Volume: integer (nullable = true)
df.show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-13 00:00:00|161.949997| 164.75|161.649994|164.339996|164.339996|32549200|
|2018-02-14 00:00:00|163.039993|167.539993|162.880005|167.369995|167.369995|40644900|
|2018-02-15 00:00:00|169.789993|173.089996| 169.0|172.990005|172.990005|51147200|
|2018-02-16 00:00:00|172.360001|174.820007|171.770004|172.429993|172.429993|40176100|
|2018-02-20 00:00:00|172.050003|174.259995|171.419998|171.850006|171.850006|33930500|
|2018-02-21 00:00:00|172.830002|174.119995|171.009995|171.070007|171.070007|37471600|
|2018-02-22 00:00:00|171.800003|173.949997|171.710007| 172.5| 172.5|30991900|
|2018-02-23 00:00:00|173.669998|175.649994|173.539993| 175.5| 175.5|33812400|
|2018-02-26 00:00:00|176.350006|179.389999|176.210007|178.970001|178.970001|38162200|
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
|2018-02-28 00:00:00|179.259995|180.619995|178.050003|178.119995|178.119995|37782100|
|2018-03-01 00:00:00|178.539993|179.779999|172.660004| 175.0| 175.0|48802000|
|2018-03-02 00:00:00|172.800003|176.300003|172.449997|176.210007|176.210007|38454000|
|2018-03-05 00:00:00|175.210007|177.740005|174.520004|176.820007|176.820007|28401400|
|2018-03-06 00:00:00|177.910004| 178.25|176.130005|176.669998|176.669998|23788500|
|2018-03-07 00:00:00|174.940002|175.850006|174.270004|175.029999|175.029999|31703500|
|2018-03-08 00:00:00|175.479996|177.119995|175.070007|176.940002|176.940002|23774100|
|2018-03-09 00:00:00|177.960007| 180.0|177.389999|179.979996|179.979996|32185200|
|2018-03-12 00:00:00|180.289993|182.389999|180.210007|181.720001|181.720001|32162900|
+-------------------+----------+----------+----------+----------+----------+--------+
df.head(3)[2]
Row(Date=datetime.datetime(2018, 2, 15, 0, 0), Open=169.789993, High=173.089996, Low=169.0, Close=172.990005, Adj Close=172.990005, Volume=51147200)
df.filter("Close = 172.5").show()
+-------------------+----------+----------+----------+-----+---------+--------+
| Date| Open| High| Low|Close|Adj Close| Volume|
+-------------------+----------+----------+----------+-----+---------+--------+
|2018-02-22 00:00:00|171.800003|173.949997|171.710007|172.5| 172.5|30991900|
+-------------------+----------+----------+----------+-----+---------+--------+
df.filter("Close > 175").show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-23 00:00:00|173.669998|175.649994|173.539993| 175.5| 175.5|33812400|
|2018-02-26 00:00:00|176.350006|179.389999|176.210007|178.970001|178.970001|38162200|
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
|2018-02-28 00:00:00|179.259995|180.619995|178.050003|178.119995|178.119995|37782100|
|2018-03-02 00:00:00|172.800003|176.300003|172.449997|176.210007|176.210007|38454000|
|2018-03-05 00:00:00|175.210007|177.740005|174.520004|176.820007|176.820007|28401400|
|2018-03-06 00:00:00|177.910004| 178.25|176.130005|176.669998|176.669998|23788500|
|2018-03-07 00:00:00|174.940002|175.850006|174.270004|175.029999|175.029999|31703500|
|2018-03-08 00:00:00|175.479996|177.119995|175.070007|176.940002|176.940002|23774100|
|2018-03-09 00:00:00|177.960007| 180.0|177.389999|179.979996|179.979996|32185200|
|2018-03-12 00:00:00|180.289993|182.389999|180.210007|181.720001|181.720001|32162900|
+-------------------+----------+----------+----------+----------+----------+--------+
df.filter("Close > 175").select("High").show()
+----------+
| High|
+----------+
|175.649994|
|179.389999|
|180.479996|
|180.619995|
|176.300003|
|177.740005|
| 178.25|
|175.850006|
|177.119995|
| 180.0|
|182.389999|
+----------+
df.filter("Close > 175").select(["High","Low","Volume"]).show()
+----------+----------+--------+
| High| Low| Volume|
+----------+----------+--------+
|175.649994|173.539993|33812400|
|179.389999|176.210007|38162200|
|180.479996|178.160004|38928100|
|180.619995|178.050003|37782100|
|176.300003|172.449997|38454000|
|177.740005|174.520004|28401400|
| 178.25|176.130005|23788500|
|175.850006|174.270004|31703500|
|177.119995|175.070007|23774100|
| 180.0|177.389999|32185200|
|182.389999|180.210007|32162900|
+----------+----------+--------+
df.filter(df["Close"] > 175).show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-23 00:00:00|173.669998|175.649994|173.539993| 175.5| 175.5|33812400|
|2018-02-26 00:00:00|176.350006|179.389999|176.210007|178.970001|178.970001|38162200|
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
|2018-02-28 00:00:00|179.259995|180.619995|178.050003|178.119995|178.119995|37782100|
|2018-03-02 00:00:00|172.800003|176.300003|172.449997|176.210007|176.210007|38454000|
|2018-03-05 00:00:00|175.210007|177.740005|174.520004|176.820007|176.820007|28401400|
|2018-03-06 00:00:00|177.910004| 178.25|176.130005|176.669998|176.669998|23788500|
|2018-03-07 00:00:00|174.940002|175.850006|174.270004|175.029999|175.029999|31703500|
|2018-03-08 00:00:00|175.479996|177.119995|175.070007|176.940002|176.940002|23774100|
|2018-03-09 00:00:00|177.960007| 180.0|177.389999|179.979996|179.979996|32185200|
|2018-03-12 00:00:00|180.289993|182.389999|180.210007|181.720001|181.720001|32162900|
+-------------------+----------+----------+----------+----------+----------+--------+
df.filter(df["Close"] > 175).select(["Volume"]).show()
+--------+
| Volume|
+--------+
|33812400|
|38162200|
|38928100|
|37782100|
|38454000|
|28401400|
|23788500|
|31703500|
|23774100|
|32185200|
|32162900|
+--------+
df.filter((df["Close"] > 175) & (df["Volume"] >= 38454000)).show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
|2018-03-02 00:00:00|172.800003|176.300003|172.449997|176.210007|176.210007|38454000|
+-------------------+----------+----------+----------+----------+----------+--------+
df.filter(df["Low"] == 178.160004).show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
+-------------------+----------+----------+----------+----------+----------+--------+
result = df.filter(df["Low"] == 178.160004).collect()
result[0]
Row(Date=datetime.datetime(2018, 2, 27, 0, 0), Open=179.100006, High=180.479996, Low=178.160004, Close=178.389999, Adj Close=178.389999, Volume=38928100)
result[0].asDict()
{'Adj Close': 178.389999,
'Close': 178.389999,
'Date': datetime.datetime(2018, 2, 27, 0, 0),
'High': 180.479996,
'Low': 178.160004,
'Open': 179.100006,
'Volume': 38928100}
result[0].asDict().keys()
dict_keys(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'])
result[0].asDict()["Volume"]
38928100
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("aggs").getOrCreate()
df = spark.read.csv("sales_info.csv",inferSchema=True,header=True)
df.show()
+-------+-------+-----+
|Company| Person|Sales|
+-------+-------+-----+
| GOOG| Sam|200.0|
| GOOG|Charlie|120.0|
| GOOG| Frank|340.0|
| MSFT| Tina|600.0|
| MSFT| Amy|124.0|
| MSFT|Vanessa|243.0|
| FB| Carl|870.0|
| FB| Sarah|350.0|
| APPL| John|250.0|
| APPL| Linda|130.0|
| APPL| Mike|750.0|
| APPL| Chris|350.0|
+-------+-------+-----+
df.printSchema()
root
|-- Company: string (nullable = true)
|-- Person: string (nullable = true)
|-- Sales: double (nullable = true)
df.groupBy("Company")
<pyspark.sql.group.GroupedData at 0x7f59125082b0>
df.groupBy("Company").mean()
DataFrame[Company: string, avg(Sales): double]
df.groupBy("Company").mean().show()
+-------+-----------------+
|Company| avg(Sales)|
+-------+-----------------+
| APPL| 370.0|
| GOOG| 220.0|
| FB| 610.0|
| MSFT|322.3333333333333|
+-------+-----------------+
df.groupBy("Company").sum().show()
+-------+----------+
|Company|sum(Sales)|
+-------+----------+
| APPL| 1480.0|
| GOOG| 660.0|
| FB| 1220.0|
| MSFT| 967.0|
+-------+----------+
df.groupBy("Company").max().show()
+-------+----------+
|Company|max(Sales)|
+-------+----------+
| APPL| 750.0|
| GOOG| 340.0|
| FB| 870.0|
| MSFT| 600.0|
+-------+----------+
df.groupBy("Company").min().show()
+-------+----------+
|Company|min(Sales)|
+-------+----------+
| APPL| 130.0|
| GOOG| 120.0|
| FB| 350.0|
| MSFT| 124.0|
+-------+----------+
df.groupBy("Company").count().show()
+-------+-----+
|Company|count|
+-------+-----+
| APPL| 4|
| GOOG| 3|
| FB| 2|
| MSFT| 3|
+-------+-----+
df.agg({"Sales":"sum"}).show()
+----------+
|sum(Sales)|
+----------+
| 4327.0|
+----------+
df.agg({"Sales":"max"}).show()
+----------+
|max(Sales)|
+----------+
| 870.0|
+----------+
group_data = df.groupBy("Company")
group_data.agg({"Sales":"max"}).show()
+-------+----------+
|Company|max(Sales)|
+-------+----------+
| APPL| 750.0|
| GOOG| 340.0|
| FB| 870.0|
| MSFT| 600.0|
+-------+----------+
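For reference, several aggregations can also be requested in one pass with agg() and the functions module; a minimal sketch against the same sales DataFrame (the aliases below are just illustrative names):
from pyspark.sql import functions as F

df.groupBy("Company").agg(
    F.sum("Sales").alias("Total_Sales"),   # total per company
    F.max("Sales").alias("Max_Sale"),      # largest single sale per company
    F.count("Sales").alias("Num_Sales")    # number of sales rows per company
).show()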
from pyspark.sql.functions import (countDistinct,avg,stddev)
df.select(countDistinct("Sales")).show()
+---------------------+
|count(DISTINCT Sales)|
+---------------------+
| 11|
+---------------------+
df.select(avg("Sales")).show()
+-----------------+
| avg(Sales)|
+-----------------+
|360.5833333333333|
+-----------------+
df.select(avg("Sales").alias("Average Sales")).show()
+-----------------+
| Average Sales|
+-----------------+
|360.5833333333333|
+-----------------+
df.select(stddev("Sales")).show()
+------------------+
|stddev_samp(Sales)|
+------------------+
|250.08742410799007|
+------------------+
from pyspark.sql.functions import format_number
sales_std = df.select(stddev("Sales").alias("Std"))
sales_std.show()
+------------------+
| Std|
+------------------+
|250.08742410799007|
+------------------+
sales_std.select(format_number('std',2).alias("Standard Deviation")).show()
+------------------+
|Standard Deviation|
+------------------+
| 250.09|
+------------------+
df.orderBy("Sales").show()
+-------+-------+-----+
|Company| Person|Sales|
+-------+-------+-----+
| GOOG|Charlie|120.0|
| MSFT| Amy|124.0|
| APPL| Linda|130.0|
| GOOG| Sam|200.0|
| MSFT|Vanessa|243.0|
| APPL| John|250.0|
| GOOG| Frank|340.0|
| FB| Sarah|350.0|
| APPL| Chris|350.0|
| MSFT| Tina|600.0|
| APPL| Mike|750.0|
| FB| Carl|870.0|
+-------+-------+-----+
df.orderBy(df["Sales"].desc()).show()
+-------+-------+-----+
|Company| Person|Sales|
+-------+-------+-----+
| FB| Carl|870.0|
| APPL| Mike|750.0|
| MSFT| Tina|600.0|
| FB| Sarah|350.0|
| APPL| Chris|350.0|
| GOOG| Frank|340.0|
| APPL| John|250.0|
| MSFT|Vanessa|243.0|
| GOOG| Sam|200.0|
| APPL| Linda|130.0|
| MSFT| Amy|124.0|
| GOOG|Charlie|120.0|
+-------+-------+-----+
Missing Data:
Keep the missing data points as NULLs
Drop the missing points including entire row
Fill it with some other value
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Missing").getOrCreate()
df = spark.read.csv("ContainsNull.csv",inferSchema=True,header=True)
df.show()
+----+-----+-----+
| Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp2| null| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
df.na.drop().show()  # drops any row that contains at least one null value
+----+-----+-----+
| Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+
df.na.drop(thresh=2).show()  # thresh=2 keeps only rows that have at least 2 non-null values
+----+-----+-----+
| Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
df.na.drop(how="any").show()  # drops a row if any column is null (the default)
+----+-----+-----+
| Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+
df.na.drop(how="all").show()
+----+-----+-----+
| Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp2| null| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
df.na.drop(subset=["Sales"]).show()  # drops rows where the Sales column is null
+----+-----+-----+
| Id| Name|Sales|
+----+-----+-----+
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
df.printSchema()
root
|-- Id: string (nullable = true)
|-- Name: string (nullable = true)
|-- Sales: double (nullable = true)
df.na.fill("FILLER").show()  # fills nulls in string columns with "FILLER"
+----+------+-----+
| Id| Name|Sales|
+----+------+-----+
|emp1| John| null|
|emp2|FILLER| null|
|emp3|FILLER|345.0|
|emp4| Cindy|456.0|
+----+------+-----+
df.na.fill(0).show()  # fills nulls in numeric columns with 0
+----+-----+-----+
| Id| Name|Sales|
+----+-----+-----+
|emp1| John| 0.0|
|emp2| null| 0.0|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
df.na.fill("No Name",subset=["Name"]).show()  # fills nulls only in the Name column with 'No Name'
+----+-------+-----+
| Id| Name|Sales|
+----+-------+-----+
|emp1| John| null|
|emp2|No Name| null|
|emp3|No Name|345.0|
|emp4| Cindy|456.0|
+----+-------+-----+
from pyspark.sql.functions import mean
mean_val = df.select(mean(df["Sales"])).collect()
mean_val
[Row(avg(Sales)=400.5)]
mean_val[0][0]
400.5
mean_sales = mean_val[0][0]
df.na.fill(mean_sales,["Sales"]).show()
+----+-----+-----+
| Id| Name|Sales|
+----+-----+-----+
|emp1| John|400.5|
|emp2| null|400.5|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
Whatever the computed mean is will be filled in place of the NULLs; the same thing as a one-liner:
df.na.fill(df.select(mean(df["Sales"])).collect()[0][0],["Sales"]).show()
+----+-----+-----+
| Id| Name|Sales|
+----+-----+-----+
|emp1| John|400.5|
|emp2| null|400.5|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
---------------------------------------------------------------------
Spark Sql and Hql
[cloudera@quickstart ~]$ sudo find / -name 'hive-site.xml'
[cloudera@quickstart ~]$ sudo chmod -R 777 /usr/lib/spark/conf
[cloudera@quickstart ~]$ cp /etc/hive/conf.dist/hive-site.xml /usr/lib/spark/conf
_____________________________________
from hive-site.xml --> hive.metastore.warehouse.dir
from Spark 2.0.0 onwards the above option is deprecated;
use the following option instead:
------> spark.sql.warehouse.dir
_____________________________________________
[ tested in Cloudera 5.8, Spark version 1.6.0 ]
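A minimal PySpark sketch of setting that option when building a session (applies to Spark 2.x only; the warehouse path below is a placeholder):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("warehouse-demo")
         .config("spark.sql.warehouse.dir", "/user/hive/warehouse")  # placeholder path
         .enableHiveSupport()          # lets Spark SQL talk to the Hive metastore
         .getOrCreate())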
[cloudera@quickstart ~]$ ls /usr/lib/hue/apps/beeswax/data/sample_07.csv
[cloudera@quickstart ~]$ head -n 2 /usr/lib/hue/apps/beeswax/data/sample_07.csv
_____________________
val hq = new org.apache.spark.sql.hive.HiveContext(sc)
hq.sql("create database sparkdb")
hq.sql("CREATE TABLE sample_07 (code string, description string, total_emp int, salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TextFile")
[cloudera@quickstart ~]$ hadoop fs -mkdir sparks
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal /usr/lib/hue/apps/beeswax/data/sample_07.csv sparks
[cloudera@quickstart ~]$ hadoop fs -ls sparks
hq.sql("LOAD DATA INPATH '/user/cloudera/sparks/sample_07.csv' OVERWRITE INTO TABLE sample_07")
val df = hq.sql("SELECT * from sample_07")
__________________________________________
scala> df.filter(df("salary") > 150000).show()
+-------+--------------------+---------+------+
|   code|         description|total_emp|salary|
+-------+--------------------+---------+------+
|11-1011|    Chief executives|   299160|151370|
|29-1022|Oral and maxillof...|     5040|178440|
|29-1023|       Orthodontists|     5350|185340|
|29-1024|     Prosthodontists|      380|169360|
|29-1061|   Anesthesiologists|    31030|192780|
|29-1062|Family and genera...|   113250|153640|
|29-1063| Internists, general|    46260|167270|
|29-1064|Obstetricians and...|    21340|183600|
|29-1067|            Surgeons|    50260|191410|
|29-1069|Physicians and su...|   237400|155150|
+-------+--------------------+---------+------+
____________________________________________
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
[cloudera@quickstart ~]$ gedit json1
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal json1 sparks
[cloudera@quickstart ~]$ hadoop fs -cat sparks/json1
{"name":"Ravi","age":23,"sex":"M"}
{"name":"Rani","age":24,"sex":"F"}
{"name":"Mani","sex":"M"}
{"name":"Vani","age":34}
{"name":"Veni","age":29,"sex":"F"}
[cloudera@quickstart ~]$
scala> val df = sqlContext.read.json("/user/cloudera/sparks/json1")
scala> df.show()
+----+----+----+
| age|name| sex|
+----+----+----+
| 23|Ravi| M|
| 24|Rani| F|
|null|Mani| M|
| 34|Vani|null|
| 29|Veni| F|
+----+----+----+
scala> df.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
|-- sex: string (nullable = true)
scala> df.select("name").show()
+----+
|name|
+----+
|Ravi|
|Rani|
|Mani|
|Vani|
|Veni|
+----+
scala> df.select("age").show()
+----+
| age|
+----+
| 23|
| 24|
|null|
| 34|
| 29|
+----+
scala>
scala> df.select("name","age").show()
+----+----+
|name| age|
+----+----+
|Ravi| 23|
|Rani| 24|
|Mani|null|
|Vani| 34|
|Veni| 29|
+----+----+
scala> df.select("name","sex").show()
+----+----+
|name| sex|
+----+----+
|Ravi| M|
|Rani| F|
|Mani| M|
|Vani|null|
|Veni| F|
+----+----+
scala>
scala> df.select(df("name"), df("age")+10).show()
+----+----------+
|name|(age + 10)|
+----+----------+
|Ravi| 33|
|Rani| 34|
|Mani| null|
|Vani| 44|
|Veni| 39|
+----+----------+
scala> df.filter(df("age")<34).show()
+---+----+---+
|age|name|sex|
+---+----+---+
| 23|Ravi| M|
| 24|Rani| F|
| 29|Veni| F|
+---+----+---+
scala> df.filter(df("age")>=5 && df("age")<30).show()
+---+----+---+
|age|name|sex|
+---+----+---+
| 23|Ravi| M|
| 24|Rani| F|
| 29|Veni| F|
+---+----+---+
scala> df.groupBy("age").count().show()
+----+-----+
| age|count|
+----+-----+
| 34| 1|
|null| 1|
| 23| 1|
| 24| 1|
| 29| 1|
+----+-----+
scala> df.groupBy("sex").count().show()
+----+-----+
| sex|count|
+----+-----+
| F| 2|
| M| 2|
|null| 1|
+----+-----+
scala>
scala> df.registerTempTable("df")
scala> sqlContext.sql("select * from df").collect.foreach(println)
[23,Ravi,M]
[24,Rani,F]
[null,Mani,M]
[34,Vani,null]
[29,Veni,F]
scala> val mm = sqlContext.sql("select * from df")
mm: org.apache.spark.sql.DataFrame = [age: bigint, name: string, sex: string]
scala> mm.registerTempTable("mm")
scala> sqlContext.sql("select * from mm").collect.foreach(println)
[23,Ravi,M]
[24,Rani,F]
[null,Mani,M]
[34,Vani,null]
[29,Veni,F]
scala> mm.show()
+----+----+----+
| age|name| sex|
+----+----+----+
| 23|Ravi| M|
| 24|Rani| F|
|null|Mani| M|
| 34|Vani|null|
| 29|Veni| F|
+----+----+----+
scala> val x = mm
x: org.apache.spark.sql.DataFrame = [age:
bigint, name: string, sex: string]
scala>
scala> val aggr1 = df.groupBy("sex").agg( max("age"), min("age"))
aggr1: org.apache.spark.sql.DataFrame = [sex: string, max(age): bigint, min(age): bigint]
scala> aggr1.collect.foreach(println)
[F,29,24]
[M,23,23]
[null,34,34]
scala> aggr1.show()
+----+--------+--------+
| sex|max(age)|min(age)|
+----+--------+--------+
| F| 29| 24|
| M| 23| 23|
|null| 34| 34|
+----+--------+--------+
scala>
____________________
ex:
[cloudera@quickstart ~]$ cat > emp1
101,aaa,30000,m,11
102,bbbb,40000,f,12
103,cc,60000,m,12
104,dd,80000,f,11
105,cccc,90000,m,12
[cloudera@quickstart ~]$ cat > emp2
201,dk,90000,m,11
202,mm,100000,f,12
203,mmmx,80000,m,12
204,vbvb,70000,f,11
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal emp1 sparklab
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal emp2 sparklab
[cloudera@quickstart ~]$
scala> val emp1 = sc.textFile("/user/cloudera/sparklab/emp1")
scala> val emp2 = sc.textFile("/user/cloudera/sparklab/emp2")
scala> case class Emp(id:Int, name:String,
| sal:Int, sex:String, dno:Int)
scala> def toEmp(x:String) = {
| val w = x.split(",")
| Emp(w(0).toInt,
| w(1), w(2).toInt,
| w(3), w(4).toInt)
| }
toEmp: (x: String)Emp
scala> val e1 = emp1.map(x => toEmp(x))
e1: org.apache.spark.rdd.RDD[Emp] =
MapPartitionsRDD[43] at map at <console>:37
scala> val e2 = emp2.map(x => toEmp(x))
e2: org.apache.spark.rdd.RDD[Emp] =
MapPartitionsRDD[44] at map at <console>:37
scala>
scala> val df1 = e1.toDF
df1: org.apache.spark.sql.DataFrame = [id:
int, name: string, sal: int, sex: string,
dno: int]
scala> val df2 = e2.toDF
df2: org.apache.spark.sql.DataFrame = [id:
int, name: string, sal: int, sex: string,
dno: int]
scala>
scala> df1.registerTempTable("df1")
scala> df2.registerTempTable("df2")
scala> val df = sqlContext.sql("select * from df1 union all select * from df2")
scala> df.registerTempTable("df")
scala> val res = sqlContext.sql("select sex, sum(sal) as tot, count(*) as cnt from df group by sex")
scala>
scala> val wrres = res.map(x => x(0)+","+x(1)+","+x(2))
scala> wrres.saveAsTextFile("/user/cloudera/mytemp")
scala> hq.sql("create database park")
scala> hq.sql("use park")
scala> hq.sql("create table urres(sex string, tot int, cnt int) row format delimited fields terminated by ',' ")
scala> hq.sql("load data inpath '/user/cloudera/mytemp/part-00000' into table urres")
scala> val hiveres = hq.sql("select * from urres")
scala> hiveres.show()
_____________________________________
spark lab1 : Spark Aggregations : map, flatMap, sc.textFile(), reduceByKey(), groupByKey()
spark Lab1:
___________
[cloudera@quickstart ~]$ cat > comment
i love hadoop
i love spark
i love hadoop and spark
[cloudera@quickstart ~]$ hadoop fs -mkdir spark
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal comment spark
Word Count using spark:
scala> val r1 = sc.textFile("hdfs://quickstart.cloudera/user/cloudera/spark/comment")
scala> r1.collect.foreach(println)
scala> val r2 = r1.map(x => x.split(" "))
scala> val r3 = r2.flatMap(x => x)
Instead of writing r2 and r3 separately, flatMap can be applied directly:
scala> val words = r1.flatMap(x =>
| x.split(" ") )
scala> val wpair = words.map( x =>
| (x,1) )
scala> val wc = wpair.reduceByKey((x,y) => x+y)
scala> wc.collect
scala> val wcres = wc.map( x =>
| x._1+","+x._2 )
scala> wcres.saveAsTextFile("hdfs://quickstart.cloudera/user/cloudera/spark/results2")
[cloudera@quickstart ~]$ cat emp
101,aa,20000,m,11
102,bb,30000,f,12
103,cc,40000,m,11
104,ddd,50000,f,12
105,ee,60000,m,12
106,dd,90000,f,11
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal emp spark
[cloudera@quickstart ~]$
scala> val e1 = sc.textFile("/user/cloudera/spark/emp")
scala> val e2 = e1.map(_.split(","))
scala> val epair = e2.map( x=>
| (x(3), x(2).toInt ) )
scala> val res = epair.reduceByKey(_+_)
res: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[18] at reduceByKey at <console>:24
scala> res.collect.foreach(println)
(f,170000)
(m,120000)
scala> val resmax = epair.reduceByKey(
| (x,y) => Math.max(x,y))
scala> val resmin = epair.reduceByKey(Math.min(_,_))
scala> resmax.collect.foreach(println)
(f,90000)
(m,60000)
scala> resmin.collect.foreach(println)
(f,30000)
(m,20000)
scala> val grpd = epair.groupByKey()
scala> val resall = grpd.map(x =>
| (x._1, x._2.sum,x._2.size,x._2.max,x._2.min,x._2.sum/x._2.size) )
scala> resall.collect.foreach(println)
------------------------------------------------------
Spark Lab2
scala> val emp = sc.textFile("hdfs://quickstart.cloudera/user/cloudera/spark/emp")
scala> val earray = emp.map(x=> x.split(","))
earray: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[2] at map at <console>:14
scala> earray.collect
Array[Array[String]] = Array(Array(101, aa, 20000, m, 11), Array(102, bb, 30000, f, 12), Array(103, cc, 40000, m, 11), Array(104, ddd, 50000, f, 12), Array(105, ee, 60000, m, 12), Array(106, dd, 90000, f, 11))
scala> val epair = earray.map( x =>
| (x(4), x(2).toInt))
scala> val ressum = epair.reduceByKey(_+_)
scala> val resmax = epair.reduceByKey(Math.max(_,_))
scala> val resmin = epair.reduceByKey(Math.min(_,_))
scala> ressum.collect.foreach(println)
(12,140000)
(11,150000)
scala> val grpByDno = epair.groupByKey()
scala> grpByDno.collect
Array[(String, Iterable[Int])] = Array((12,CompactBuffer(30000, 50000, 60000)), (11,CompactBuffer(20000, 40000, 90000)))
scala> val resall = grpByDno.map(x =>
     |   x._1+"\t"+
     |   x._2.sum+"\t"+
     |   x._2.size+"\t"+
     |   x._2.sum/x._2.size+"\t"+
     |   x._2.max+"\t"+
     |   x._2.min )
scala> resall.collect.foreach(println)
12 140000 3 46666 60000 30000
11 150000 3 50000 90000 20000
scala> resall.saveAsTextFile("/user/cloudera/spark/today1")
[cloudera@quickstart ~]$ hadoop fs -cat spark/today1/part-00000
12 140000 3 46666 60000 30000
11 150000 3 50000 90000 20000
[cloudera@quickstart ~]$
____________________________________
aggregations by multiple grouping.
ex: equivalent sql/hql query:
select dno, sex , sum(sal) from emp
group by dno, sex;
---
scala> val DnoSexSalPair = earray.map(
| x => ((x(4),x(3)),x(2).toInt) )
scala> DnoSexSalPair.collect.foreach(println)
((11,m),20000)
((12,f),30000)
((11,m),40000)
((12,f),50000)
((12,m),60000)
((11,f),90000)
scala> val rsum = DnoSexSalPair.reduceByKey(_+_)
scala> rsum.collect.foreach(println)
((11,f),90000)
((12,f),80000)
((12,m),60000)
((11,m),60000)
scala> val rs = rsum.map( x =>
x._1._1+"\t"+x._1._2+"\t"+
x._2 )
scala> rs.collect.foreach(println)
11 f 90000
12 f 80000
12 m 60000
11 m 60000
_______________________________________
grouping by multiple columns, and multiple aggregations.
Assignment:
select dno, sex, sum(sal), max(sal) ,
min(sal), count(*), avg(sal)
from emp group by dno, sex;
val grpDnoSex = DnoSexSalPair.groupByKey()
val r = grpDnoSex.map( x =>
x._1._1+"\t"+
x._1._2+"\t"+
x._2.sum+"\t"+
x._2.max+"\t"+
x._2.min+"\t"+
x._2.size+"\t"+
x._2.sum/x._2.size )
r.collect.foreach(println)
11 f 90000 90000 90000 1 90000
12 f 80000 50000 30000 2 40000
12 m 60000 60000 60000 1 60000
11 m 60000 40000 20000 2 30000
______________________________________
spark sql with json and xml processing
-----------
Spark Sql
---------------
Spark SQL is a library to process Spark data objects using SQL select statements.
Spark SQL follows MySQL-style SQL syntax.
==============================================
Spark SQL provides two types of contexts:
i) SQLContext
ii) HiveContext
import org.apache.spark.sql.SQLContext
val sqlCon = new SQLContext(sc)
Using SQLContext, we can process Spark objects using select statements.
Using HiveContext, we can integrate Hive with Spark.
Hive is a data warehouse environment in the Hadoop framework,
so data is stored and managed in Hive tables.
Using HiveContext we can access the entire Hive environment (Hive tables) from Spark.
Difference between an HQL statement run from Hive and the same HQL run from Spark:
--> if HQL is executed from the Hive environment,
    the statement is converted into a MapReduce job.
--> if the same Hive is integrated with Spark
    and the HQL is submitted from Spark,
    it uses the DAG and in-memory computing models,
    which are much faster than MapReduce.
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
-----------------------------
Example of SQLContext:
val sqc = new SQLContext(sc)
file name --> file1
sample ---> 100,200,300
300,400,400
:
:
step1)
create case class for the data.
case class Rec(a:Int, b:Int, c:Int)
step2) create a function ,
to convert raw line into case object.
[function to provide schema ]
def makeRec(line:String)={
val w = line.split(",")
val a = w(0).toInt
val b = w(1).toInt
val c = w(2).toInt
val r = Rec(a, b,c)
r
}
--------
step3) load data.
val data = sc.textFile("/user/cloudera/sparklab/file1")
100,200,300
2000,340,456
:
:
step4) transform each record into case Object
val recs = data.map(x => makeRec(x))
step5) convert the RDD into a data frame.
val df = recs.toDF
step6) create table instance for the dataframe.
df.registerTempTable("samp")
step7) apply select statement of sql on temp table.
val r1 = sqc.sql("select a+b+c as tot from samp")
r1
------
tot
----
600
900
r1.registerTempTable("samp1")
val r2 = sqc.sql("select sum(tot) as gtot from samp1")
Once a "select" statement is applied on a temp table, the returned object will be a dataframe.
To apply SQL on the processed results,
we again need to register the dataframe as a temp table.
r1.registerTempTable("samp2")
val r2 = sqc.sql("select * from samp2 where tot>=200")
-----------------------------------
sales
--------------------
:
12/27/2016,10000,3,10
:
:
-------------------------
Steps involved in Spark SQL [sqlContext]
----------------------------
monthly sales report...
schema ---> date, price, qnt, discount
step1)
case class Sales(mon : Int, price:Int,
qnt :Int, disc: Int)
step2)
def toSales(line: String) = {
val w = line.split(",")
val mon = w(0).split("/")(0).toInt
val p = w(1).toInt
val q = w(2).toInt
val d = w(3).toInt
val srec = Sales(mon,p,q,d)
srec
}
step3)
val data = sc.textFile("/user/cloudera/mydata/sales.txt")
step4)
val strans = data.map(x => toSales(x))
step5)
val sdf = strans.toDF
sdf.show
step6)
sdf.registerTempTable("SalesTrans")
step7) // play with select
---> mon, price, qnt, disc
val res1 = sqlContext.sql("select mon ,
sum(
(price - price*disc/100)*qnt
) as tsales from SalesTrans
group by mon")
res1.show
res1.printSchema
-----------------------------------------
val res2 = res1
res1.registerTempTable("tab1")
res2.registerTempTable("tab2")
val res3 = sqlContext.sql("select l.mon as m1,
r.mon as m2, l.tsales as t1,
r.tsales as t2
from tab1 l join tab2 r
where (l.mon-r.mon)==1")
// 11 rows.
res3.registerTempTable("tab3")
------------------------------
val res4 = sqlContext.sql("select
m1, m2, t1, t2, ((t2-t1)*100)/t1 as sgrowth
from tab3")
res4.show()
------------------------------------------
json1.json
--------------------------
{"name":"Ravi","age":25,"city":"Hyd"}
{"name":"Rani","sex":"F","city":"Del"}
:
:
---------------------------------------
val df = sqlContext.read.json("/user/cloudera/mydata/json1.json")
df.show
------------------------
name age sex city
----------------------------------------
ravi 25 null hyd
rani null F del
:
:
--------------------------------------
json2.json
------------------
{"name":"Ravi","age":25,
"wife":{"name":"Rani","age":23},"city":"Hyd"}}
:
:
val df2 = sqlContext.read.json("/../json2.json")
df2
-----------------------
name age wife city
Ravi 25 {"name":"rani","age":23} HYd
:
---------------------
df2.registerTempTable("Info")
val df3 = sqlContext.sql("select name,
wife.name as wname,
age, wife.age as wage,
abs(age-wife.age) as diff,
city from Info")
----------------------------------------
xml data processing with spark sql.
--- Spark SQL does not have direct libraries for xml processing.
Two ways:
i) 3rd-party API [ ex: databricks ]
ii) using Hive integration.
The 2nd is best.
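For completeness, a sketch of option (i) using the third-party Databricks spark-xml package, shown in PySpark (assumes the package is on the classpath, e.g. started with --packages, that each record uses a <rec> row tag as in the samples below, and that the HDFS path is hypothetical):
df = (sqlContext.read
        .format("com.databricks.spark.xml")   # third-party data source, not part of core Spark
        .option("rowTag", "rec")              # each <rec>...</rec> element becomes one row
        .load("/user/cloudera/mydata/xml1.xml"))
df.show()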
How to integrate Hive with spark .
---Using HiveContext.
step1)
copy hive-site.xml file into,
/usr/lib/spark/conf directory.
What if hive-site.xml is not copied into the conf directory of Spark?
--- Spark cannot find Hive's metastore location [derby/mysql/oracle ...];
this info is available in hive-site.xml.
step2)
create hive Context object
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
step3) access Hive Environment from spark
hc.sql("create database mydb")
hc.sql("use mydb")
hc.sql("create table samp(line string)")
hc.sql("load data local inpath 'file1'
into table samp")
val df = hc.sql("select * from samp")
-------------------------------------
xml1.xml
----------------------------------
<rec><name>Ravi</name><age>25</age></rec>
<rec><name>Rani</name><sex>F</sex></rec>
:
:
------------------------------------------
hc.sql("create table raw(line string)")
hc.sql("load data local inpath 'xml1.xml'
into table raw")
hc.sql("create table info(name string,
age int, sex string)")
hc.sql("insert overwrite table info
select xpath_string(line,'rec/name'),
xpath_int(line, 'rec/age'),
xpath_string(line, 'rec/sex')
from raw")
----------------------------------------
xml2.xml
------------
<rec><name><fname>Ravi</fname><lname>kumar</lname><age>24</age><contact><email><personal>ravi@gmail.com</personal><official>ravi@ventech.com</official></email><phone><mobile>12345</mobile><office>123900</office><residence>127845</residence></phone></contact><city>Hyd</city></rec>
hc.sql("create table xraw(line string)")
hc.sql("load data local inpath 'xml2.xml'
into table xraw")
hc.sql("create table xinfo(fname string ,
lname string, age int,
personal_email string,
official_email string,
mobile String,
office_phone string ,
residence_phone string,
city string)")
hc.sql("insert overwrite table xinfo
select
xpath_string(line,'rec/name/fname'),
xpath_string(line,'rec/name/lname'),
xpath_int(line,'rec/age'),
xpath_string(line,'rec/contact/email/personal'),
xpath_string(line,'rec/contact/email/official'),
xpath_string(line,'rec/contact/phone/mobile'),
xpath_string(line,'rec/contact/phone/office'),
xpath_string(line,'rec/contact/phone/residence'),
xpath_string(line,'rec/city')
from xraw")
-------------------------
xml3.xml
----------------
<tr><cid>101</cid><pr>200</pr><pr>300</pr><pr>300</pr></tr>
<tr><cid>102</cid><pr>400</pr><pr>800</pr></tr>
<tr><cid>101</cid><pr>1000</pr></tr>
--------------------------------
hc.sql("create table sraw")
hc.sql("load data local inpath 'xml3.xml'
into table sraw")
hc.sql("create table raw2(cid int, pr array<String>)")
hc.sql("insert overwrite table raw2
select xpath_int(line, 'tr/cid'),
xpath(line,'tr/pr/text()')
from sraw")
hc.sql("select * from raw2").show
-------------------------------
cid   pr
101   [200,300,300]
102   [400,800]
101   [1000]
hc.sql("select explode(pr) as price from raw2").show
200
300
300
400
800
1000
hc.sql("select cid, explode(pr) as price from raw2").show
----> the above is invalid: explode() cannot be mixed with other columns in a plain select; use a lateral view instead.
hc.sql("create table raw3(cid int, pr int)")
hc.sql("Insert overwrite table raw3
select name, mypr from raw2
lateral view explode(pr) p as mypr")
hc.sql("select * from raw3").show
cid pr
101 200
101 300
101 300
102 400
102 800
101 1000
hc.sql("create table summary(cid int, totbill long)")
hc.sql("insert overwrite table summary
select cid , sum(pr) from raw3
group by cid")
--------------------
Spark Grouping Aggregations
demo grouping aggregations on structured data.
----------------------------------------------
[cloudera@quickstart ~]$ ls emp
emp
[cloudera@quickstart ~]$ cat emp
101,aaaa,40000,m,11
102,bbbbbb,50000,f,12
103,cccc,50000,m,12
104,dd,90000,f,13
105,ee,10000,m,12
106,dkd,40000,m,12
107,sdkfj,80000,f,13
108,iiii,50000,m,11
[cloudera@quickstart ~]$ hadoop fs -ls spLab
ls: `spLab': No such file or directory
[cloudera@quickstart ~]$ hadoop fs -mkdir spLab
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal emp spLab
scala> val data = sc.textFile("/user/cloudera/spLab/emp")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/spLab/emp MapPartitionsRDD[1] at textFile at <console>:27
scala> data.collect.foreach(println)
101,aaaa,40000,m,11
102,bbbbbb,50000,f,12
103,cccc,50000,m,12
104,dd,90000,f,13
105,ee,10000,m,12
106,dkd,40000,m,12
107,sdkfj,80000,f,13
108,iiii,50000,m,11
scala>
scala> val arr = data.map(_.split(","))
arr: org.apache.spark.rdd.RDD[Array[String]] =
MapPartitionsRDD[2] at map at <console>:29
scala> arr.collect
res1: Array[Array[String]] = Array(Array(101,
aaaa, 40000, m, 11), Array(102, bbbbbb, 50000,
f, 12), Array(103, cccc, 50000, m, 12), Array
(104, dd, 90000, f, 13), Array(105, ee, 10000,
m, 12), Array(106, dkd, 40000, m, 12), Array
(107, sdkfj, 80000, f, 13), Array(108, iiii,
50000, m, 11))
scala>
scala> val pair1 = arr.map(x => (x(3), x(2).toInt) )
pair1: org.apache.spark.rdd.RDD[(String, Int)] =
MapPartitionsRDD[3] at map at <console>:31
scala> // or
scala> val pair1 = arr.map{ x =>
| val sex = x(3)
| val sal = x(2).toInt
| (sex, sal)
| }
pair1: org.apache.spark.rdd.RDD[(String, Int)] =
MapPartitionsRDD[4] at map at <console>:31
scala>
scala> pair1.collect.foreach(println)
(m,40000)
(f,50000)
(m,50000)
(f,90000)
(m,10000)
(m,40000)
(f,80000)
(m,50000)
scala>
scala> // select sex, sum(sal) from emp group by sex
scala> val rsum = pair1.reduceByKey((a,b) => a+b)
rsum: org.apache.spark.rdd.RDD[(String, Int)] =
ShuffledRDD[5] at reduceByKey at <console>:33
scala> // or
scala> val rsum = pair1.reduceByKey(_+_)
rsum: org.apache.spark.rdd.RDD[(String, Int)] =
ShuffledRDD[6] at reduceByKey at <console>:33
scala> rsum.collect
res3: Array[(String, Int)] = Array((f,220000),
(m,190000))
scala>
// select sex, max(sal) from emp group by sex;
scala> val rmax = pair1.reduceByKey(Math.max(_,_))
rmax: org.apache.spark.rdd.RDD[(String, Int)] =
ShuffledRDD[7] at reduceByKey at <console>:33
scala> rmax.collect
res4: Array[(String, Int)] = Array((f,90000),
(m,50000))
scala>
// select sex, min(sal) from emp group by sex;
scala> val rmin = pair1.reduceByKey(Math.min(_,_))
rmin: org.apache.spark.rdd.RDD[(String, Int)] =
ShuffledRDD[8] at reduceByKey at <console>:33
scala> rmin.collect
res5: Array[(String, Int)] = Array((f,50000),
(m,10000))
scala>
// select sex, count(*) from emp group by sex
scala> pair1.collect
res6: Array[(String, Int)] = Array((m,40000),
(f,50000), (m,50000), (f,90000), (m,10000),
(m,40000), (f,80000), (m,50000))
scala> pair1.countByKey
res7: scala.collection.Map[String,Long] = Map(f
-> 3, m -> 5)
scala> val pair2 = pair1.map(x => (x._1 , 1) )
pair2: org.apache.spark.rdd.RDD[(String, Int)] =
MapPartitionsRDD[11] at map at <console>:33
scala> pair2.collect
res8: Array[(String, Int)] = Array((m,1), (f,1),
(m,1), (f,1), (m,1), (m,1), (f,1), (m,1))
scala> val rcnt = pair2.reduceByKey(_+_)
rcnt: org.apache.spark.rdd.RDD[(String, Int)] =
ShuffledRDD[12] at reduceByKey at <console>:35
scala> rcnt.collect
res9: Array[(String, Int)] = Array((f,3), (m,5))
scala>
// select sex, avg(sal) from emp group by sex;
scala> rsum.collect.foreach(println)
(f,220000)
(m,190000)
scala> rcnt.collect.foreach(println)
(f,3)
(m,5)
scala> val j = rsum.join(rcnt)
j: org.apache.spark.rdd.RDD[(String, (Int,
Int))] = MapPartitionsRDD[15] at join at
<console>:39
scala> j.collect
res12: Array[(String, (Int, Int))] = Array((f,
(220000,3)), (m,(190000,5)))
scala>
scala> j.collect
res13: Array[(String, (Int, Int))] = Array((f,
(220000,3)), (m,(190000,5)))
scala> val ravg = j.map{ x =>
| val sex = x._1
| val v = x._2
| val tot = v._1
| val cnt = v._2
| val avg = tot/cnt
| (sex, avg.toInt)
| }
ravg: org.apache.spark.rdd.RDD[(String, Int)] =
MapPartitionsRDD[17] at map at <console>:41
scala> ravg.collect
res15: Array[(String, Int)] = Array((f,73333),
(m,38000))
scala>
// select dno, range(sal) from emp group by dno;
--> range is the difference between max and min.
scala> val pair3 = arr.map(x => ( x(4), x(2).toInt ) )
pair3: org.apache.spark.rdd.RDD[(String, Int)] =
MapPartitionsRDD[18] at map at <console>:31
scala> pair3.collect.foreach(println)
(11,40000)
(12,50000)
(12,50000)
(13,90000)
(12,10000)
(12,40000)
(13,80000)
(11,50000)
scala>
scala> val dmax = pair3.reduceByKey(Math.max(_,_))
dmax: org.apache.spark.rdd.RDD[(String, Int)] =
ShuffledRDD[19] at reduceByKey at <console>:33
scala> val dmin = pair3.reduceByKey(Math.min(_,_))
dmin: org.apache.spark.rdd.RDD[(String, Int)] =
ShuffledRDD[20] at reduceByKey at <console>:33
scala> val dj = dmax.join(dmin)
dj: org.apache.spark.rdd.RDD[(String, (Int,
Int))] = MapPartitionsRDD[23] at join at
<console>:37
scala> val drange = dj.map{ x =>
| val dno = x._1
| val max = x._2._1
| val min = x._2._2
| val r = max-min
| (dno, r)
| }
drange: org.apache.spark.rdd.RDD[(String, Int)]
= MapPartitionsRDD[25] at map at <console>:39
scala> drange.collect.foreach(println)
(12,40000)
(13,10000)
(11,10000)
scala>
-------------------------------------
scala> // multiple aggregations.
scala> pair1.collect
res18: Array[(String, Int)] = Array((m,40000),
(f,50000), (m,50000), (f,90000), (m,10000),
(m,40000), (f,80000), (m,50000))
scala> val grp = pair1.groupByKey()
grp: org.apache.spark.rdd.RDD[(String, Iterable
[Int])] = ShuffledRDD[26] at groupByKey at
<console>:33
scala> grp.collect
res19: Array[(String, Iterable[Int])] = Array
((f,CompactBuffer(50000, 90000, 80000)),
(m,CompactBuffer(40000, 50000, 10000, 40000,
50000)))
scala> val r1 = grp.map(x => (x._1 , x._2.sum )
)
r1: org.apache.spark.rdd.RDD[(String, Int)] =
MapPartitionsRDD[27] at map at <console>:35
scala> r1.collect.foreach(println)
(f,220000)
(m,190000)
scala>
// select sex, sum(sal), count(*), avg(sal), max(sal), min(sal),
//   max(sal)-min(sal) as range from emp group by sex;
scala> val rall = grp.map{ x =>
| val sex = x._1
| val cb = x._2
| val tot = cb.sum
| val cnt = cb.size
| val avg = (tot/cnt).toInt
| val max = cb.max
| val min = cb.min
| val r = max-min
| (sex,tot,cnt,avg,max,min,r)
| }
rall: org.apache.spark.rdd.RDD[(String, Int,
Int, Int, Int, Int, Int)] = MapPartitionsRDD[28]
at map at <console>:35
scala> rall.collect.foreach(println)
(f,220000,3,73333,90000,50000,40000)
(m,190000,5,38000,50000,10000,40000)
--------------------------------------------------
Spark : Performing grouping Aggregations based on Multiple Keys and saving results
// performing aggregations grouping by multiple columns
sql:
select dno, sex, sum(sal) from emp group by dno, sex;
scala> val data = sc.textFile("/user/cloudera/spLab/emp")
data: org.apache.spark.rdd.RDD[String] =
/user/cloudera/spLab/emp MapPartitionsRDD[1] at
textFile at <console>:27
scala> data.collect.foreach(println)
101,aaaa,40000,m,11
102,bbbbbb,50000,f,12
103,cccc,50000,m,12
104,dd,90000,f,13
105,ee,10000,m,12
106,dkd,40000,m,12
107,sdkfj,80000,f,13
108,iiii,50000,m,11
scala> val arr = data.map(_.split(","))
arr: org.apache.spark.rdd.RDD[Array[String]] =
MapPartitionsRDD[2] at map at <console>:29
scala> arr.collect
res1: Array[Array[String]] = Array(Array(101,
aaaa, 40000, m, 11), Array(102, bbbbbb, 50000,
f, 12), Array(103, cccc, 50000, m, 12), Array
(104, dd, 90000, f, 13), Array(105, ee, 10000,
m, 12), Array(106, dkd, 40000, m, 12), Array
(107, sdkfj, 80000, f, 13), Array(108, iiii,
50000, m, 11))
scala>
scala> val pair = arr.map(x => ( (x(4),x(3)) , x(2).toInt) )
pair: org.apache.spark.rdd.RDD[((String,
String), Int)] = MapPartitionsRDD[3] at map at
<console>:31
scala> pair.collect.foreach(println)
((11,m),40000)
((12,f),50000)
((12,m),50000)
((13,f),90000)
((12,m),10000)
((12,m),40000)
((13,f),80000)
((11,m),50000)
scala>
//or
val pair = data.map{ x =>
val w = x.split(",")
val dno = w(4)
val sex = w(3)
val sal = w(2).toInt
val mykey = (dno,sex)
val p = (mykey , sal)
p
}
scala> val res = pair.reduceByKey(_+_)
scala> res.collect.foreach(println)
((12,f),50000)
((13,f),170000)
((12,m),100000)
((11,m),90000)
scala> val r = res.map(x => (x._1._1,x._1._2,x._2) )
scala> r.collect.foreach(println)
(12,f,50000)
(13,f,170000)
(12,m,100000)
(11,m,90000)
-------------------------------------
Spark reduceByKey() allows only a single key for grouping.
When you want grouping by multiple columns, make the multiple
columns into a tuple and keep that tuple as the key of the pair,
as in the PySpark sketch after this note.
---------------------------------------
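The same idea in PySpark, as a minimal sketch against the emp file used above: the composite key is just a tuple.
emp = sc.textFile("/user/cloudera/spLab/emp")
pair = emp.map(lambda line: line.split(",")) \
          .map(lambda w: ((w[4], w[3]), int(w[2])))    # key = (dno, sex), value = sal
totals = pair.reduceByKey(lambda a, b: a + b)          # one sum per (dno, sex) group
print(totals.collect())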
sql:--> multi grouping and multi aggregations.
select dno, sex, sum(sal), count(*),
avg(sal) , max(sal), min(sal) from emp
group by dno, sex;
scala> val grp = pair.groupByKey()
grp: org.apache.spark.rdd.RDD[((String, String),
Iterable[Int])] = ShuffledRDD[7] at groupByKey
at <console>:31
scala> grp.collect.foreach(println)
((12,f),CompactBuffer(50000))
((13,f),CompactBuffer(90000, 80000))
((12,m),CompactBuffer(50000, 10000, 40000))
((11,m),CompactBuffer(40000, 50000))
scala> val agr = grp.map{ x =>
val dno = x._1._1
val sex = x._1._2
val cb = x._2
val tot = cb.sum
val cnt = cb.size
val avg = (tot/cnt).toInt
val max = cb.max
val min = cb.min
val r = (dno,sex,tot,cnt,avg,max,min)
r
}
agr: org.apache.spark.rdd.RDD[(String, String,
Int, Int, Int, Int, Int)] = MapPartitionsRDD[8]
at map at <console>:37
scala>
scala> agr.collect.foreach(println)
(12,f,50000,1,50000,50000,50000)
(13,f,170000,2,85000,90000,80000)
(12,m,100000,3,33333,50000,10000)
(11,m,90000,2,45000,50000,40000)
scala> // to save results into file.
agr.saveAsTextFile("/user/cloudera/spLab/res1")
[cloudera@quickstart ~]$ hadoop fs -ls spLab
Found 2 items
-rw-r--r-- 1 cloudera cloudera 158
2017-05-01 20:17 spLab/emp
drwxr-xr-x - cloudera cloudera 0
2017-05-02 20:29 spLab/res1
[cloudera@quickstart ~]$ hadoop fs -ls spLab/res1
Found 2 items
-rw-r--r-- 1 cloudera cloudera 0
2017-05-02 20:29 spLab/res1/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 134
2017-05-02 20:29 spLab/res1/part-00000
[cloudera@quickstart ~]$ hadoop fs -cat spLab/res1/part-00000
(12,f,50000,1,50000,50000,50000)
(13,f,170000,2,85000,90000,80000)
(12,m,100000,3,33333,50000,10000)
(11,m,90000,2,45000,50000,40000)
[cloudera@quickstart ~]$
// here, the output is written in tuple shape,
// which is not a valid format for hive, rdbms, or other systems.
// before saving results, the following transformation should be done.
val r1 = agr.map{ x=>
x._1+","+x._2+","+x._3+","+
x._4+","+x._5+","+x._6+","+x._7
}
scala> val r1 = agr.map{ x=>
| x._1+","+x._2+","+x._3+","+
| x._4+","+x._5+","+x._6+","+x._7
| }
r1: org.apache.spark.rdd.RDD[String] =
MapPartitionsRDD[5] at map at <console>:35
scala> r1.collect.foreach(println)
12,f,50000,1,50000,50000,50000
13,f,170000,2,85000,90000,80000
12,m,100000,3,33333,50000,10000
11,m,90000,2,45000,50000,40000
scala>
// or
scala> val r2 = agr.map{ x =>
| val dno = x._1
| val sex = x._2
| val tot = x._3
| val cnt = x._4
| val avg = x._5
| val max = x._6
| val min = x._7
| Array(dno,sex,tot.toString,cnt.toString,
| avg.toString, max.toString,
min.toString).mkString("\t")
| }
r2: org.apache.spark.rdd.RDD[String] =
MapPartitionsRDD[6] at map at <console>:35
scala> r2.collect.foreach(println)
12    f    50000     1    50000    50000    50000
13    f    170000    2    85000    90000    80000
12    m    100000    3    33333    50000    10000
11    m    90000     2    45000    50000    40000
scala> r2.saveAsTextFile("/user/cloudera/spLab/res2")
[cloudera@quickstart ~]$ hadoop fs -ls spLab/res2
Found 2 items
-rw-r--r--   1 cloudera cloudera     0 2017-05-02 20:44 spLab/res2/_SUCCESS
-rw-r--r--   1 cloudera cloudera   126 2017-05-02 20:44 spLab/res2/part-00000
[cloudera@quickstart ~]$ hadoop fs -cat spLab/res2/part-00000
12    f    50000     1    50000    50000    50000
13    f    170000    2    85000    90000    80000
12    m    100000    3    33333    50000    10000
11    m    90000     2    45000    50000    40000
[cloudera@quickstart ~]$
-- these results can be directly exported into an RDBMS.
[cloudera@quickstart ~]$ mysql -u root -pcloudera
mysql> create database spres;
Query OK, 1 row affected (0.03 sec)
mysql> use spres;
Database changed
mysql> create table summary(dno int, sex char(1),
    -> tot int, cnt int, avg int, max int, min int);
Query OK, 0 rows affected (0.10 sec)
mysql> select * from summary;
Empty set (0.00 sec)
mysql>
[cloudera@quickstart ~]$ sqoop export --connect jdbc:mysql://localhost/spres --username root --password cloudera --table summary --export-dir '/user/cloudera/spLab/res2/part-00000' --input-fields-terminated-by '\t'
To use the Spark-written results from Hive:
hive> create table info(dno int, sex string,
tot int, cnt int, avg int, max int,
min int)
row format delimited
fields terminated by '\t';
hive> load data inpath '/user/cloudera/spLab/res2/part-00000' into table info;
mysql> select * from summary;
+------+------+--------+------+-------+-------+-------+
| dno  | sex  | tot    | cnt  | avg   | max   | min   |
+------+------+--------+------+-------+-------+-------+
|   12 | m    | 100000 |    3 | 33333 | 50000 | 10000 |
|   11 | m    |  90000 |    2 | 45000 | 50000 | 40000 |
|   12 | f    |  50000 |    1 | 50000 | 50000 | 50000 |
|   13 | f    | 170000 |    2 | 85000 | 90000 | 80000 |
+------+------+--------+------+-------+-------+-------+
4 rows in set (0.03 sec)
Spark : Entire Column Aggregations
Entire Column Aggregations:
sql:
select sum(sal) from emp;
scala> val emp = sc.textFile("/user/cloudera/spLab/emp")
emp: org.apache.spark.rdd.RDD[String] = /user/cloudera/spLab/emp MapPartitionsRDD[1] at textFile at <console>:27
scala> val sals = emp.map(x => x.split(",")(2).toInt)
sals: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at map at <console>:29
scala> sals.collect
res0: Array[Int] = Array(40000, 50000, 50000, 90000, 10000, 40000, 80000, 50000)
scala> sals.sum
res1: Double = 410000.0
scala> sals.reduce((a,b) => a + b)
res2: Int = 410000
scala>
---> reduce is computed across the cluster and returns an Int here.
---> sum is also an action; it combines per-partition results and returns a Double to the driver.
sql:
select sum(sal), count(*), avg(sal),
max(sal) , min(sal) from emp;
scala> val tot = sals.sum
tot: Double = 410000.0
scala> val cnt = sals.count
cnt: Long = 8
scala> val avg = sals.mean
avg: Double = 51250.0
scala> val max = sals.max
max: Int = 90000
scala> val min = sals.min
min: Int = 10000
scala> val m = sals.reduce(Math.max(_,_))
m: Int = 90000
scala>
scala> val res = (tot,cnt,avg,max,min)
res: (Double, Long, Double, Int, Int) = (410000.0,8,51250.0,90000,10000)
scala> tot
res3: Double = 410000.0
scala>
----------------------------------------
Spark : Handling CSV files .. Removing Headers
scala> val l = List(10,20,30,40,50,56,67)
scala> val r = sc.parallelize(l)
scala> val r2 = r.collect.reverse.take(3)
r2: Array[Int] = Array(67, 56, 50)
scala> val r2 = sc.parallelize(r.collect.reverse.take(3))
r2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:31
-------------------------------
handling CSV files [ the first line is a header ]
[cloudera@quickstart ~]$ gedit prods
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal prods spLab
scala> val raw = sc.textFile("/user/cloudera/spLab/prods")
raw: org.apache.spark.rdd.RDD[String] = /user/cloudera/spLab/prods MapPartitionsRDD[11] at textFile at <console>:27
scala> raw.collect.foreach(println)
"pid","name","price"
p1,Tv,50000
p2,Lap,70000
p3,Ipod,8000
p4,Mobile,9000
scala> raw.count
res18: Long = 5
To eliminate the first element, slice is used.
scala> l
res19: List[Int] = List(10, 20, 30, 40, 50, 50, 56, 67)
scala> l.slice(2,5)
res20: List[Int] = List(30, 40, 50)
scala> l.slice(1,l.size)
res21: List[Int] = List(20, 30, 40, 50, 50, 56, 67)
way1:
scala> raw.collect
res29: Array[String] = Array("pid","name","price", p1,Tv,50000, p2,Lap,70000, p3,Ipod,8000, p4,Mobile,9000)
scala> val data = sc.parallelize(raw.collect.slice(1,raw.collect.size))
data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[12] at parallelize at <console>:29
scala> data.collect.foreach(println)
p1,Tv,50000
p2,Lap,70000
p3,Ipod,8000
p4,Mobile,9000
scala>
Here, slice is not available on an RDD,
so the data has to be collected into the local client first and then slice applied.
If the RDD volume is large, the client cannot collect it and the flow will fail.
Way2:
------
val data = raw.filter(x =>
      !x.contains("pid"))
data.persist
--advantage : no need to collect the data into the client [local].
--disadvantage : to eliminate 1 row, all rows are scanned.
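Not in the original notes: a third sketch that drops the header without collecting to the client and without checking every row's content, using mapPartitionsWithIndex to skip only the first record of the first partition.
val data2 = raw.mapPartitionsWithIndex { (idx, it) =>
  if (idx == 0) it.drop(1) else it      // drop the header record in partition 0 only
}
data2.collect.foreach(println)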
-----------------------------------------
Spark : Conditional Transformations
Conditional Transformations:
val trans = emp.map{ x =>
val w = x.split(",");
val sal = w(2).toInt
val grade = if(sal>=70000) "A" else
if(sal>=50000) "B" else
if(sal>=30000) "C" else "D"
val tax = sal*10/100
val dno = w(4).toInt
val dname = dno match{
case 11 => "Marketing"
case 12 => "Hr"
case 13 => "Finance"
case _ => "Other"
}
var sex = w(3).toLowerCase
sex = if(sex=="m") "Male" else "Female"
val res = Array(w(0), w(1),
w(2),tax.toString, grade, sex, dname).mkString(",")
res
}
trans.saveAsTextFile("/user/cloudera/spLab/results4")
-----------------------------------------
Spark : Union and Distinct
Unions in spark.
val l1 = List(10,20,30,40,50)
val l2 = List(100,200,300,400,500)
val r1 = sc.parallelize(l1)
val r2 = sc.parallelize(l2)
val r = r1.union(r2)
scala> r.collect.foreach(println)
10
20
30
40
50
100
200
300
400
500
scala> r.count
res1: Long = 10
Spark union allows duplicates.
Merging can also be done using the ++ operator.
scala> val r3 = r1 ++ r2
r3: org.apache.spark.rdd.RDD[Int] = UnionRDD[3] at $plus$plus at <console>:35
scala> r3.collect
res4: Array[Int] = Array(10, 20, 30, 40, 50, 100, 200, 300, 400, 500)
scala>
Merging more than two sets:
scala> val rx = sc.parallelize(List(15,25,35))
scala> val rr = r1.union(r2).union(rx)
rr: org.apache.spark.rdd.RDD[Int] = UnionRDD[6] at union at <console>:37
scala> rr.count
res5: Long = 13
scala> rr.collect
res6: Array[Int] = Array(10, 20, 30, 40, 50, 100, 200, 300, 400, 500, 15, 25, 35)
scala>// or
scala> val rr = r1 ++ r2 ++ rx
rr: org.apache.spark.rdd.RDD[Int] = UnionRDD[8] at $plus$plus at <console>:37
scala> rr.collect
res7: Array[Int] = Array(10, 20, 30, 40, 50, 100, 200, 300, 400, 500, 15, 25, 35)
scala>
--- eliminating duplicates:
scala> val x = List(10,20,30,40,10,10,20)
x: List[Int] = List(10, 20, 30, 40, 10, 10, 20)
scala> x.distinct
res8: List[Int] = List(10, 20, 30, 40)
scala> val y = sc.parallelize(x)
y: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:29
scala> r1.collect
res14: Array[Int] = Array(10, 20, 30, 40, 50)
scala> y.collect
res15: Array[Int] = Array(10, 20, 30, 40, 10, 10, 20)
scala> val nodupes = (r1 ++ y).distinct
nodupes: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[13] at distinct at <console>:35
scala> nodupes.collect
res16: Array[Int] = Array(30, 50, 40, 20, 10)
scala>
---------------------------------------
[cloudera@quickstart ~]$ hadoop fs -cat spLab/emp
101,aaaa,40000,m,11
102,bbbbbb,50000,f,12
103,cccc,50000,m,12
104,dd,90000,f,13
105,ee,10000,m,12
106,dkd,40000,m,12
107,sdkfj,80000,f,13
108,iiii,50000,m,11
[cloudera@quickstart ~]$ hadoop fs -cat spLab/emp2
201,Ravi,80000,m,12
202,Varun,90000,m,11
203,Varuna,100000,f,13
204,Vanila,50000,f,12
205,Mani,30000,m,14
206,Manisha,30000,f,14
[cloudera@quickstart ~]$
scala> val branch1 = sc.textFile("/user/cloudera/spLab/emp")
branch1: org.apache.spark.rdd.RDD[String] = /user/cloudera/spLab/emp MapPartitionsRDD[15] at textFile at <console>:27
scala> val branch2 = sc.textFile("/user/cloudera/spLab/emp2")
branch2: org.apache.spark.rdd.RDD[String] = /user/cloudera/spLab/emp2 MapPartitionsRDD[17] at textFile at <console>:27
scala> val emp = branch1.union(branch2)
emp: org.apache.spark.rdd.RDD[String] = UnionRDD[18] at union at <console>:31
scala> emp.collect.foreach(println)
101,aaaa,40000,m,11
102,bbbbbb,50000,f,12
103,cccc,50000,m,12
104,dd,90000,f,13
105,ee,10000,m,12
106,dkd,40000,m,12
107,sdkfj,80000,f,13
108,iiii,50000,m,11
201,Ravi,80000,m,12
202,Varun,90000,m,11
203,Varuna,100000,f,13
204,Vanila,50000,f,12
205,Mani,30000,m,14
206,Manisha,30000,f,14
--------------------------------
distinct:
eliminates duplicates
based on an entire-row match.
Limitation: it cannot eliminate duplicates based on only some column(s) matching.
The solution for this:
by iterating the CompactBuffer
[ we will see this later; a quick alternative sketch follows below ]
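Not from the original notes: a minimal sketch of column-based de-duplication, assuming we key on column 0 (empid) of the merged emp RDD and keep the first row seen per key.
val byKey = emp.map(line => (line.split(",")(0), line))   // (empid, whole row)
val dedup = byKey.reduceByKey((a, b) => a).values         // keep one row per empid
dedup.collect.foreach(println)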
Grouping aggregation on the merged set:
scala> val pair = emp.map{ x =>
| val w = x.split(",")
| val dno = w(4).toInt
| val sal = w(2).toInt
| (dno, sal)
| }
pair: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[19] at map at <console>:35
scala> val eres = pair.reduceByKey(_+_)
eres: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[20] at reduceByKey at <console>:37
scala> eres.collect.foreach(println)
(14,60000)
(12,280000)
(13,270000)
(11,180000)
scala>
-- in this output we don't have separate totals for branch1 and branch2.
Spark : CoGroup And Handling Empty Compact Buffers
Co Grouping using Spark:-
-------------------------
scala> branch1.collect.foreach(println)
101,aaaa,40000,m,11
102,bbbbbb,50000,f,12
103,cccc,50000,m,12
104,dd,90000,f,13
105,ee,10000,m,12
106,dkd,40000,m,12
107,sdkfj,80000,f,13
108,iiii,50000,m,11
scala> branch2.collect.foreach(println)
201,Ravi,80000,m,12
202,Varun,90000,m,11
203,Varuna,100000,f,13
204,Vanila,50000,f,12
205,Mani,30000,m,14
206,Manisha,30000,f,14
scala> def toDnoSalPair(line:String) = {
val w = line.split(",")
val dno = w(4).toInt
val dname = dno match{
case 11 => "Marketing"
case 12 => "Hr"
case 13 => "Finance"
case _ => "Other"
}
val sal = w(2).toInt
(dname, sal)
}
toDnoSalPair: (line: String)(String, Int)
scala> toDnoSalPair("101,aaaaa,60000,m,12")
res22: (String, Int) = (Hr,60000)
scala>
scala> val pair1 = branch1.map(x => toDnoSalPair(x))
pair1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[21] at map at <console>:33
scala> val pair2 = branch2.map(x => toDnoSalPair(x))
pair2: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[22] at map at <console>:33
scala> pair1.collect.foreach(println)
(Marketing,40000)
(Hr,50000)
(Hr,50000)
(Finance,90000)
(Hr,10000)
(Hr,40000)
(Finance,80000)
(Marketing,50000)
scala> pair2.collect.foreach(println)
(Hr,80000)
(Marketing,90000)
(Finance,100000)
(Hr,50000)
(Other,30000)
(Other,30000)
scala>
scala> val cg = pair1.cogroup(pair2)
cg: org.apache.spark.rdd.RDD[(String, (Iterable[Int], Iterable[Int]))] = MapPartitionsRDD[24] at cogroup at <console>:39
scala> cg.collect.foreach(println)
(Hr,(CompactBuffer(50000, 50000, 10000, 40000),CompactBuffer(80000, 50000)))
(Other,(CompactBuffer(),CompactBuffer(30000, 30000)))
(Marketing,(CompactBuffer(40000, 50000),CompactBuffer(90000)))
(Finance,(CompactBuffer(90000, 80000),CompactBuffer(100000)))
scala>
scala> val res = cg.map{ x =>
val dname = x._1
val cb1 = x._2._1
val cb2 = x._2._2
val tot1 = cb1.sum
val tot2 = cb2.sum
val tot = tot1+tot2
(dname,tot1,tot2,tot)
}
scala> res.collect.foreach(println)
(Hr,150000,130000,280000)
(Other,0,60000,60000)
(Marketing,90000,90000,180000)
(Finance,170000,100000,270000)
From the above: the sum of an empty CompactBuffer and
the size of an empty CompactBuffer are both zero.
But we do get problems with
sum/size (for avg) and with max, min.
val res = cg.map{ x =>
val dname = x._1
val cb1 = x._2._1
val cb2 = x._2._2
val max1 = cb1.max
val max2 = cb2.max
(dname,max1,max2)
}
-- res.collect cannot execute:
max on an empty CompactBuffer throws an exception.
-- we get the same for min.
val res = cg.map{ x =>
val dname = x._1
val cb1 = x._2._1
val cb2 = x._2._2
val tot1 = cb1.sum
val tot2 = cb2.sum
val cnt1 = cb1.size
val cnt2 = cb2.size
(dname, (tot1,cnt1), (tot2,cnt2))
}
-- no problem with sum and size on empty compact buffer.
val res = cg.map{ x =>
val dname = x._1
val cb1 = x._2._1
val cb2 = x._2._2
val tot1 = cb1.sum
val tot2 = cb2.sum
val cnt1 = cb1.size
val cnt2 = cb2.size
val avg1 = (tot1/cnt1).toInt
val avg2 = (tot2/cnt2).toInt
(dname, avg1, avg2)
}
res.collect will fail,
because for avg the denominator can be zero (division by zero).
Solution:
----------
val res = cg.map{ x =>
val dname = x._1
val cb1 = x._2._1
val cb2 = x._2._2
val tot1 = cb1.sum
val tot2 = cb2.sum
val cnt1 = cb1.size
val cnt2 = cb2.size
var max1 = 0
var min1 = 0
var avg1 = 0
if (cnt1!=0){
avg1 = tot1/cnt1
max1 = cb1.max
min1 = cb1.min
}
var max2 = 0
var min2 = 0
var avg2 = 0
if (cnt2!=0){
avg2 = tot2/cnt2
max2 = cb2.max
min2 = cb2.min
}
(dname,(tot1,cnt1,avg1,max1,min1),
(tot2,cnt2,avg2,max2,min2)) }
scala> res.collect.foreach(println)
(Hr,(150000,4,37500,50000,10000),(130000,2,65000,80000,50000))
(Other,(0,0,0,0,0),(60000,2,30000,30000,30000))
(Marketing,(90000,2,45000,50000,40000),(90000,1,90000,90000,90000))
(Finance,(170000,2,85000,90000,80000),(100000,1,100000,100000,100000))
-----------------------------
Cogroup on more than two
scala> val p1 = sc.parallelize(List(("m",10000),("f",30000),("m",50000)))
p1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[30] at parallelize at <console>:27
scala> val p2 = sc.parallelize(List(("m",10000),("f",30000)))
p2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[31] at parallelize at <console>:27
scala> val p3 = sc.parallelize(List(("m",10000),("m",30000)))
p3: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[32] at parallelize at <console>:27
scala> val cg = p1.cogroup(p2,p3)
cg: org.apache.spark.rdd.RDD[(String, (Iterable[Int], Iterable[Int], Iterable[Int]))] = MapPartitionsRDD[34] at cogroup at <console>:33
scala> cg.collect.foreach(println)
(f,(CompactBuffer(30000),CompactBuffer(30000),CompactBuffer()))
(m,(CompactBuffer(10000, 50000),CompactBuffer(10000),CompactBuffer(10000, 30000)))
scala> val r = cg.map{x =>
| val sex = x._1
| val tot1 = x._2._1.sum
| val tot2 = x._2._2.sum
| val tot3 = x._2._3.sum
| (sex, tot1, tot2, tot3)
| }
r: org.apache.spark.rdd.RDD[(String, Int, Int, Int)] = MapPartitionsRDD[35] at map at <console>:37
scala> r.collect.foreach(println)
(f,30000,30000,0)
(m,60000,10000,40000)
scala>
Spark : Joins
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal emp spLab/e
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal dept spLab/d
[cloudera@quickstart ~]$ hadoop fs -cat spLab/e
101,aaaa,40000,m,11
102,bbbbbb,50000,f,12
103,cccc,50000,m,12
104,dd,90000,f,13
105,ee,10000,m,12
106,dkd,40000,m,12
107,sdkfj,80000,f,13
108,iiii,50000,m,11
109,jj,10000,m,14
110,kkk,20000,f,15
111,dddd,30000,m,15
[cloudera@quickstart ~]$ hadoop fs -cat spLab/d
11,marketing,hyd
12,hr,del
13,fin,del
21,admin,hyd
22,production,del
[cloudera@quickstart ~]$
val emp = sc.textFile("/user/cloudera/spLab/e")
val dept = sc.textFile("/user/cloudera/spLab/d")
val epair = emp.map{x =>
val w = x.split(",")
val dno = w(4).toInt
val sal = w(2).toInt
(dno, sal)
}
epair.collect.foreach(println)
(11,40000)
(12,50000)
(12,50000)
(13,90000)
(12,10000)
(12,40000)
(13,80000)
(11,50000)
(14,10000)
(15,20000)
(15,30000)
val dpair = dept.map{ x =>
val w = x.split(",")
val dno = w(0).toInt
val loc = w(2)
(dno, loc)
}
scala> dpair.collect.foreach(println)
(11,hyd)
(12,del)
(13,del)
(21,hyd)
(22,del)
-- inner join
val ij = epair.join(dpair)
ij.collect.foreach(println)
(13,(90000,del))
(13,(80000,del))
(11,(40000,hyd))
(11,(50000,hyd))
(12,(50000,del))
(12,(50000,del))
(12,(10000,del))
(12,(40000,del))
-- left outer join
val lj = epair.leftOuterJoin(dpair)
lj.collect.foreach(println)
(13,(90000,Some(del)))
(13,(80000,Some(del)))
(15,(20000,None))
(15,(30000,None))
(11,(40000,Some(hyd)))
(11,(50000,Some(hyd)))
(14,(10000,None))
(12,(50000,Some(del)))
(12,(50000,Some(del)))
(12,(10000,Some(del)))
(12,(40000,Some(del)))
-- right outer join
val rj = epair.rightOuterJoin(dpair)
rj.collect.foreach(println)
(13,(Some(90000),del))
(13,(Some(80000),del))
(21,(None,hyd))
(22,(None,del))
(11,(Some(40000),hyd))
(11,(Some(50000),hyd))
(12,(Some(50000),del))
(12,(Some(50000),del))
(12,(Some(10000),del))
(12,(Some(40000),del))
-- full outer join
val fj = epair.fullOuterJoin(dpair)
fj.collect.foreach(println)
(13,(Some(90000),Some(del)))
(13,(Some(80000),Some(del)))
(15,(Some(20000),None))
(15,(Some(30000),None))
(21,(None,Some(hyd)))
(22,(None,Some(del)))
(11,(Some(40000),Some(hyd)))
(11,(Some(50000),Some(hyd)))
(14,(Some(10000),None))
(12,(Some(50000),Some(del)))
(12,(Some(50000),Some(del)))
(12,(Some(10000),Some(del)))
(12,(Some(40000),Some(del)))
location based aggregations:
val locSal = fj.map{ x =>
val sal = x._2._1
val loc = x._2._2
val s = if(sal==None) 0 else sal.get
val l = if(loc==None) "NoCity" else loc.get
(l, s)
}
locSal.collect.foreach(println)
(del,90000)
(del,80000)
(NoCity,20000)
(NoCity,30000)
(hyd,0)
(del,0)
(hyd,40000)
(hyd,50000)
(NoCity,10000)
(del,50000)
(del,50000)
(del,10000)
(del,40000)
val locSummary = locSal.reduceByKey(_+_)
locSummary.collect.foreach(println)
(hyd,90000)
(del,320000)
(NoCity,60000)
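Not from the original notes: a small sketch that reuses the (sum, count) idea on the locSal pairs above to get the average salary per location, guarding against a zero count.
val avgByLoc = locSal.mapValues(s => (s, 1))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))               // (totalSal, rowCount) per location
  .mapValues { case (tot, cnt) => if (cnt != 0) tot / cnt else 0 }
avgByLoc.collect.foreach(println)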
-----------------
val stats = fj.map{ x =>
val sal = x._2._1
val loc = x._2._2
val stat = if(sal!=None && loc!=None) "Working" else
if(sal==None) "BenchProj" else "BenchTeam"
val s = if(sal==None) 0 else sal.get
(stat, s)
}
stats.collect.foreach(println)
(Working,90000)
(Working,80000)
(BenchTeam,20000)
(BenchTeam,30000)
(BenchProj,0)
(BenchProj,0)
(Working,40000)
(Working,50000)
(BenchTeam,10000)
(Working,50000)
(Working,50000)
(Working,10000)
(Working,40000)
val res = stats.reduceByKey(_+_)
res.collect.foreach(println)
(BenchTeam,60000)
(Working,410000)
(BenchProj,0)
Spark : Joins 2
Denormalizing datasets using Joins
[cloudera@quickstart ~]$ cat > children
c101,p101,Ravi,34
c102,p101,Rani,24
c103,p102,Mani,20
c104,p103,Giri,22
c105,p102,Vani,22
[cloudera@quickstart ~]$ cat > parents
p101,madhu,madhavi,hyd
p102,Sathya,Veni,Del
p103,Varma,Varuna,hyd
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal children spLab
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal parents spLab
[cloudera@quickstart ~]$
val children = sc.textFile("/user/cloudera/spLab/children")
val parents = sc.textFile("/user/cloudera/spLab/parents")
val chPair = children.map{ x =>
val w = x.split(",")
val pid = w(1)
val chInfo =Array(w(0), w(2), w(3)).
mkString(",")
(pid, chInfo)
}
chPair.collect.foreach(println)
(p101,c101,Ravi,34)
(p101,c102,Rani,24)
(p102,c103,Mani,20)
(p103,c104,Giri,22)
(p102,c105,Vani,22)
val PPair = parents.map{ x =>
val w = x.split(",")
val pid = w(0)
val pInfo = Array(w(1),w(2),w(3)).mkString(",")
(pid, pInfo)
}
PPair.collect.foreach(println)
(p101,madhu,madhavi,hyd)
(p102,Sathya,Veni,Del)
(p103,Varma,Varuna,hyd)
val family = chPair.join(PPair)
family.collect.foreach(println)
(p101,(c101,Ravi,34,madhu,madhavi,hyd))
(p101,(c102,Rani,24,madhu,madhavi,hyd))
(p102,(c103,Mani,20,Sathya,Veni,Del))
(p102,(c105,Vani,22,Sathya,Veni,Del))
(p103,(c104,Giri,22,Varma,Varuna,hyd))
val profiles = family.map{ x =>
val cinfo = x._2._1
val pinfo = x._2._2
val info = cinfo +","+ pinfo
info
}
profiles.collect.foreach(println)
c101,Ravi,34,madhu,madhavi,hyd
c102,Rani,24,madhu,madhavi,hyd
c103,Mani,20,Sathya,Veni,Del
c105,Vani,22,Sathya,Veni,Del
c104,Giri,22,Varma,Varuna,hyd
profiles.saveAsTextFile("/user/cloudera/spLab/profiles")
[cloudera@quickstart ~]$ hadoop fs -ls spLab/profiles
Found 2 items
-rw-r--r-- 1 cloudera cloudera 0 2017-05-08 21:02 spLab/profiles/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 150 2017-05-08 21:02 spLab/profiles/part-00000
[cloudera@quickstart ~]$ hadoop fs -cat spLab/profiles/part-00000
c101,Ravi,34,madhu,madhavi,hyd
c102,Rani,24,madhu,madhavi,hyd
c103,Mani,20,Sathya,Veni,Del
c105,Vani,22,Sathya,Veni,Del
c104,Giri,22,Varma,Varuna,hyd
[cloudera@quickstart ~]$
Spark : Spark streaming and Kafka Integration
steps:
1) start the zookeeper server
2) start Kafka brokers [ one or more ]
3) create a topic
4) start a console producer [ to write messages into the topic ]
5) start a console consumer [ to test whether messages are streamed ]
6) create a spark streaming context,
which streams from the kafka topic
7) perform transformations or aggregations
8) output operation : direct the results into another kafka topic
------------------------------------------
The following code was tested with
Spark 1.6.0 and Kafka 0.10.2.0.
kafka and spark streaming
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic spark-topic
bin/kafka-topics.sh --list --zookeeper localhost:2181
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic spark-topic
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic spark-topic --from-beginning
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
val ssc = new StreamingContext(sc, Seconds(5))
import org.apache.spark.streaming.kafka.KafkaUtils
//1.
val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181","spark-streaming-consumer-group", Map("spark-topic" -> 5))
val lines = kafkaStream.map(x => x._2.toUpperCase)
val words = lines.flatMap(x => x.split(" "))
val pair = words.map(x => (x,1))
val wc = pair.reduceByKey(_+_)
wc.print()
// use below code to write results into kafka topic
ssc.start
------------------------------
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic results1
// writing into kafka topic.
import org.apache.kafka.clients.producer.ProducerConfig
import java.util.HashMap
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerRecord
wc.foreachRDD(rdd =>
  rdd.foreachPartition(partition =>
    partition.foreach{ case (w, cnt) =>
      val x = w + "\t" + cnt
      val props = new HashMap[String, Object]()
      props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
        "localhost:9092")
      props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer")
      props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer")
      println(x)
      // note: a producer is created per record here for simplicity; reuse one per partition in real code
      val producer = new KafkaProducer[String, String](props)
      val message = new ProducerRecord[String, String]("results1", null, x)
      producer.send(message)
      producer.close()
    }))
-- execute above code before ssc.start.
--------------------------------------------
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic results1 --from-beginning
-------------------
val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181","spark-streaming-consumer-group", Map("spark-topic" -> 5))
1. KafkaUtils.createStream()
needs 4 arguments:
1st ---> streaming context
2nd ---> zookeeper details
3rd ---> consumer group id
4th ---> topics
Spark streaming can read from multiple topics.
Topics are given as key/value pairs of a Map object:
key ---> topic name
value ---> number of consumer threads.
To read from multiple topics,
the 4th argument should be as follows:
Map("t1"->2,"t2"->4,"t3"->1)
-------------------------
Each given number of consumer threads is applied to each partition of the kafka topic.
ex: the topic has 3 partitions,
consumer threads are 5,
so the total number of threads = 15.
But these 15 threads are not executed in parallel:
at a time, the 5 threads for one partition will be consuming data in parallel.
To make all (15) of them parallel, create multiple receiver streams:
val numParts = 3
val kstreams = (1 to numParts).map { _ =>
  KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer-group", Map("spark-topic" -> 5))
}
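A minimal follow-up sketch (not in the original notes): the receiver streams in kstreams are then merged into a single DStream before applying the transformations.
val unified = ssc.union(kstreams)                    // one DStream backed by several receivers
val upperLines = unified.map(x => x._2.toUpperCase)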
---------------------------------------------------------------------
scala> val x = sc.parallelize(List(1,2,3,4,5,6));
x: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> val times2 = x.map(x=>x*2);
times2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:29
scala> times2.foreach(println);
2
4
6
8
10
12
scala> times2.collect();
res5: Array[Int] = Array(2, 4, 6, 8, 10, 12)
code:
val x = sc.parallelize(List(1,2,3,4,5));
val times2 = x.map(_*2);
times2.collect();
O/P
Array[Int] = Array(2, 4, 6, 8, 10)
val rdd = sc.parallelize(List("Hi All","Welcome to India"));
val fm =rdd.flatMap(_.split(" ")); // word by word split
fm.collect();
output:
Array[String] = Array(Hi, All, Welcome, to, India)
scala> fm.foreach(println);
Hi
All
Welcome
to
India
val rdd = sc.parallelize(List("APPLE","BALL","CAT","DEER","CAN"));
val filtered = rdd.filter(_.contains("C"));
scala> filtered.collect();
res9: Array[String] = Array(CAT, CAN)
scala> filtered.foreach(println);
CAT
CAN
//distinct example
val r1 = sc.makeRDD(List(1,2,3,4,5,3,1));
println("\n distinct output");
r1.distinct().foreach(x => print(x + " "));
4 1 3 5 2
//union example - combined together
val r2 = sc.makeRDD(List(1,4,3,6,7));
r1.union(r2).foreach(x=>print(x+" "));
1 2 3 4 5 3 1 1 4 3 6 7
//common among them
r1.intersection(r2).foreach(x=>print(x + " "));
4 1 3
scala> r1.collect();
res19: Array[Int] = Array(1, 2, 3, 4, 5, 3, 1)
scala> r2.collect();
res20: Array[Int] = Array(1, 4, 3, 6, 7)
scala> r1.subtract(r2).foreach(x=>print(x+" "));
2 5 // elements of r1 that are not in r2
Cross join:
-----------
r1.cartesian(r2).foreach(x=>print(x+" "));
(1,1) (1,4) (1,3) (1,6) (1,7) (2,1) (2,4) (2,3) (2,6) (2,7) (3,1) (3,4) (3,3) (3,6) (3,7) (4,1) (4,4) (4,3) (4,6) (4,7) (5,1) (5,4) (5,3) (5,6) (5,7) (3,1) (3,4) (3,3) (3,6) (3,7) (1,1) (1,4) (1,3) (1,6) (1,7)
scala>
count:
--------
val rdd = sc.parallelize(List('A','B','C','D'));
rdd.count();
res24: Long = 4
Sum:
------
scala> val rdd = sc.parallelize(List(1,2,3,4));
scala> rdd.reduce(_+_)
res27: Int = 10
scala> val rdd = sc.parallelize(List("arun-1","kalai-2","siva-3","nalan-4","aruvi-5"));
rdd.first();
res33: String = arun-1
scala> rdd.take(3);
res34: Array[String] = Array(arun-1, kalai-2, siva-3)
scala> rdd.foreach(println);
arun-1
kalai-2
siva-3
nalan-4
aruvi-5
Dataset: Nasa_Webserver_log.tsv (from the Nasa_data_logs assignment folder)
val r1 = sc.textFile("file:///home/cloudera/Desktop/Nasa_Webserver_log.tsv");
val visitsCount = r1.filter (x => x.contains("countdown.html"));
visitsCount.count();
res0: Long = 8586
val IPAddress = r1.map (line => line.split("\t")).map(parts => parts.take(1))
val logTime = r1.map(l => l.split("\t")).map(p=>p.take(2));
IPAddress.collect()
logTime.collect()
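Not part of the original exercise: a small sketch that, assuming column 0 of the TSV is the client host/IP, counts requests per host and prints the 5 busiest ones.
val hitsPerIp = r1.map(line => (line.split("\t")(0), 1)).reduceByKey(_ + _)
hitsPerIp.map(_.swap).top(5).foreach(println)   // (count, host), highest counts first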
val v1 = sqlContext.read.format("json").load("file:///home/cloudera/Desktop/Files/customer_data.json")
scala> v1.printSchema
root
|-- email: string (nullable = true)
|-- first_name: string (nullable = true)
|-- gender: string (nullable = true)
|-- id: long (nullable = true)
|-- ip_address: string (nullable = true)
|-- last_name: string (nullable = true)
scala> v1.count
res1: Long = 1000
scala> v1.show(2)
+--------------------+----------+------+---+--------------+---------+
| email|first_name|gender| id| ip_address|last_name|
+--------------------+----------+------+---+--------------+---------+
| knattrass0@loc.gov| Kaspar| Male| 1| 244.159.51.76| Nattrass|
|rnulty1@multiply.com| Rosamund|Female| 2|237.123.21.130| Nulty|
+--------------------+----------+------+---+--------------+---------+
only showing top 2 rows
scala> v1.rdd.partitions.size
res3: Int = 1
scala> v1.columns
res4: Array[String] = Array(email, first_name, gender, id, ip_address, last_name)
scala> v1.columns.foreach(println)
email
first_name
gender
id
ip_address
last_name
scala> v1.show()
+--------------------+----------+------+---+---------------+---------+
| email|first_name|gender| id| ip_address|last_name|
+--------------------+----------+------+---+---------------+---------+
| knattrass0@loc.gov| Kaspar| Male| 1| 244.159.51.76| Nattrass|
|rnulty1@multiply.com| Rosamund|Female| 2| 237.123.21.130| Nulty|
|pglasbey2@deviant...| Pia|Female| 3| 80.11.243.170| Glasbey|
|dgowthorpe3@buzzf...| Dante| Male| 4| 197.253.81.98|Gowthorpe|
|wsprowson4@accuwe...| Willamina|Female| 5| 64.125.155.144| Sprowson|
| tbraunds5@ning.com| Trish|Female| 6| 38.111.102.64| Braunds|
|tmasey6@businessw...| Tybie|Female| 7| 44.87.135.133| Masey|
|lpapaccio7@howstu...| Leona|Female| 8| 64.233.173.104| Papaccio|
|hsaltrese8@cbsloc...| Hendrick| Male| 9| 179.21.162.161| Saltrese|
|mkingsnod9@archiv...| Marna|Female| 10| 66.254.243.50| Kingsnod|
| akybirda@mysql.com| Abram| Male| 11|166.179.168.234| Kybird|
| ktuiteb@ucoz.com| Kenneth| Male| 12| 132.182.90.153| Tuite|
|gberingerc@creati...| Gerhardt| Male| 13| 222.102.76.16| Beringer|
| adreweryd@hibu.com| Avictor| Male| 14| 196.191.41.114| Drewery|
| dupexe@myspace.com| Diahann|Female| 15| 226.50.117.72| Upex|
|dcoldbathef@wikip...| Daryl|Female| 16| 7.99.204.200|Coldbathe|
| gkestong@tamu.edu| Galven| Male| 17| 35.16.66.151| Keston|
|dilchenkoh@istock...| Daffi|Female| 18| 192.45.226.104| Ilchenko|
|lwychardi@sfgate.com| Ladonna|Female| 19| 94.194.233.152| Wychard|
| lsapirj@unblog.fr| Latrena|Female| 20|107.141.139.191| Sapir|
+--------------------+----------+------+---+---------------+---------+
only showing top 20 rows
scala> v1.show(false);
+------------------------------+----------+------+---+---------------+---------+
|email |first_name|gender|id |ip_address |last_name|
+------------------------------+----------+------+---+---------------+---------+
|knattrass0@loc.gov |Kaspar |Male |1 |244.159.51.76 |Nattrass |
|rnulty1@multiply.com |Rosamund |Female|2 |237.123.21.130 |Nulty |
|pglasbey2@deviantart.com |Pia |Female|3 |80.11.243.170 |Glasbey |
|dgowthorpe3@buzzfeed.com |Dante |Male |4 |197.253.81.98 |Gowthorpe|
|wsprowson4@accuweather.com |Willamina |Female|5 |64.125.155.144 |Sprowson |
|tbraunds5@ning.com |Trish |Female|6 |38.111.102.64 |Braunds |
|tmasey6@businessweek.com |Tybie |Female|7 |44.87.135.133 |Masey |
|lpapaccio7@howstuffworks.com |Leona |Female|8 |64.233.173.104 |Papaccio |
|hsaltrese8@cbslocal.com |Hendrick |Male |9 |179.21.162.161 |Saltrese |
|mkingsnod9@archive.org |Marna |Female|10 |66.254.243.50 |Kingsnod |
|akybirda@mysql.com |Abram |Male |11 |166.179.168.234|Kybird |
|ktuiteb@ucoz.com |Kenneth |Male |12 |132.182.90.153 |Tuite |
|gberingerc@creativecommons.org|Gerhardt |Male |13 |222.102.76.16 |Beringer |
|adreweryd@hibu.com |Avictor |Male |14 |196.191.41.114 |Drewery |
|dupexe@myspace.com |Diahann |Female|15 |226.50.117.72 |Upex |
|dcoldbathef@wikipedia.org |Daryl |Female|16 |7.99.204.200 |Coldbathe|
|gkestong@tamu.edu |Galven |Male |17 |35.16.66.151 |Keston |
|dilchenkoh@istockphoto.com |Daffi |Female|18 |192.45.226.104 |Ilchenko |
|lwychardi@sfgate.com |Ladonna |Female|19 |94.194.233.152 |Wychard |
|lsapirj@unblog.fr |Latrena |Female|20 |107.141.139.191|Sapir |
+------------------------------+----------+------+---+---------------+---------+
only showing top 20 rows
scala> v1.select("email","first_name").show
+--------------------+----------+
| email|first_name|
+--------------------+----------+
| knattrass0@loc.gov| Kaspar|
|rnulty1@multiply.com| Rosamund|
|pglasbey2@deviant...| Pia|
|dgowthorpe3@buzzf...| Dante|
|wsprowson4@accuwe...| Willamina|
| tbraunds5@ning.com| Trish|
|tmasey6@businessw...| Tybie|
|lpapaccio7@howstu...| Leona|
|hsaltrese8@cbsloc...| Hendrick|
|mkingsnod9@archiv...| Marna|
| akybirda@mysql.com| Abram|
| ktuiteb@ucoz.com| Kenneth|
|gberingerc@creati...| Gerhardt|
| adreweryd@hibu.com| Avictor|
| dupexe@myspace.com| Diahann|
|dcoldbathef@wikip...| Daryl|
| gkestong@tamu.edu| Galven|
|dilchenkoh@istock...| Daffi|
|lwychardi@sfgate.com| Ladonna|
| lsapirj@unblog.fr| Latrena|
+--------------------+----------+
only showing top 20 rows
scala> v1.select("email","first_name").show(2)
+--------------------+----------+
| email|first_name|
+--------------------+----------+
| knattrass0@loc.gov| Kaspar|
|rnulty1@multiply.com| Rosamund|
+--------------------+----------+
scala> v1.write.parquet("file:///home/cloudera/Desktop/Files/myParquet")
scala> val v2 = sqlContext.read.format("parquet").load("file:///home/cloudera/Desktop/Files/myParquet");
v2: org.apache.spark.sql.DataFrame = [email: string, first_name: string, gender: string, id: bigint, ip_address: string, last_name: string]
v2.printSchema
root
|-- email: string (nullable = true)
|-- first_name: string (nullable = true)
|-- gender: string (nullable = true)
|-- id: long (nullable = true)
|-- ip_address: string (nullable = true)
|-- last_name: string (nullable = true)
scala> v2.write.orc("file:///home/cloudera/Desktop/Files/myOrc");
scala> import com.databricks.spark.avro_
<console>:25: error: object avro_ is not a member of package com.databricks.spark
import com.databricks.spark.avro_
(the correct import is com.databricks.spark.avro._ and it requires the spark-avro package on the classpath)
Login as Admin
su
password : cloudera
cp /etc/hive/conf/hive-site.xml /etc/spark/conf/hive-site.xml
This gives Spark SQL direct access to Hive.
Here we create a database, a table, and a record in Hive; soon we will access the same from Spark SQL.
[cloudera@quickstart ~]$ hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> show databases;
OK
default
Time taken: 1.384 seconds, Fetched: 1 row(s)
hive> create database sara;
OK
Time taken: 13.242 seconds
hive> use sara;
OK
Time taken: 0.205 seconds
hive> create table mytable (id int, name string);
OK
Time taken: 0.712 seconds
hive> insert into mytable (id,name) values(101,'Raja');
hive> select * from mytable;
OK
101 Raja
hive> describe mytable;
OK
id int
name string
Here we use Spark SQL to access Hive (hive-site.xml is already copied into spark/conf, so nothing extra related to Hive needs to be configured). By default Hive is accessible when we use sqlContext.
scala> sqlContext.sql("show databases").show;
+-------+
| result|
+-------+
|default|
| sara|
+-------+
scala> sqlContext.sql("use sara");
res2: org.apache.spark.sql.DataFrame = [result: string]
scala> sqlContext.sql("show tables").show
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
| mytable| false|
+---------+-----------+
Here we import content from an external .orc file into Spark SQL and export it into Hive.
scala> val v3 = sqlContext.read.format("orc").load("file:///home/cloudera/Desktop/Files/myOrc")
v3: org.apache.spark.sql.DataFrame = [email: string, first_name: string, gender: string, id: bigint, ip_address: string, last_name: string]
scala> v3.printSchema
root
|-- email: string (nullable = true)
|-- first_name: string (nullable = true)
|-- gender: string (nullable = true)
|-- id: long (nullable = true)
|-- ip_address: string (nullable = true)
|-- last_name: string (nullable = true)
scala> v3.show(2);
+--------------------+----------+------+---+--------------+---------+
| email|first_name|gender| id| ip_address|last_name|
+--------------------+----------+------+---+--------------+---------+
| knattrass0@loc.gov| Kaspar| Male| 1| 244.159.51.76| Nattrass|
|rnulty1@multiply.com| Rosamund|Female| 2|237.123.21.130| Nulty|
+--------------------+----------+------+---+--------------+---------+
only showing top 2 rows
We register the imported ORC content as a temp table, which resides in Spark memory and not in Hive disk storage.
scala> v3.registerTempTable("customer_temp");
scala> sqlContext.sql("show tables").show
+-------------+-----------+
| tableName|isTemporary|
+-------------+-----------+
|customer_temp| true| // see here: customer_temp is flagged as isTemporary. It won't be available in Hive
| mytable| false|
+-------------+-----------+
hive> show tables;
OK
mytable
// Here Hive shows only mytable and does not display customer_temp, because it resides in memory
Log out from Hive:
hive> exit;
Log out from Spark:
scala> exit
Log on to Spark again:
$ spark-shell
scala> sqlContext.sql("use sara");
scala> sqlContext.sql("show tables").show
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
| mytable| false|
+---------+-----------+
customer_temp is not there because once we log out and log back in, the session ends and temp tables are destroyed.
val v3 = sqlContext.read.format("orc").load("file:///home/cloudera/Desktop/Files/myOrc");
To export the dataframe with its data into a Hive table (permanent table):
v3.saveAsTable("customer_per");
scala> sqlContext.sql("show tables").show
+------------+-----------+
| tableName|isTemporary|
+------------+-----------+
|customer_per| false|
| mytable| false|
+------------+-----------+
scala> sqlContext.sql("select * from customer_per").show /// long data will be shrinked and use ....
18/08/16 01:29:58 WARN parquet.CorruptStatistics: Ignoring statistics because created_by is null or empty! See PARQUET-251 and PARQUET-297
+--------------------+----------+------+---+---------------+---------+
| email|first_name|gender| id| ip_address|last_name|
+--------------------+----------+------+---+---------------+---------+
| knattrass0@loc.gov| Kaspar| Male| 1| 244.159.51.76| Nattrass|
|rnulty1@multiply.com| Rosamund|Female| 2| 237.123.21.130| Nulty|
|pglasbey2@deviant...| Pia|Female| 3| 80.11.243.170| Glasbey|
|dgowthorpe3@buzzf...| Dante| Male| 4| 197.253.81.98|Gowthorpe|
|wsprowson4@accuwe...| Willamina|Female| 5| 64.125.155.144| Sprowson|
| tbraunds5@ning.com| Trish|Female| 6| 38.111.102.64| Braunds|
|tmasey6@businessw...| Tybie|Female| 7| 44.87.135.133| Masey|
|lpapaccio7@howstu...| Leona|Female| 8| 64.233.173.104| Papaccio|
|hsaltrese8@cbsloc...| Hendrick| Male| 9| 179.21.162.161| Saltrese|
|mkingsnod9@archiv...| Marna|Female| 10| 66.254.243.50| Kingsnod|
| akybirda@mysql.com| Abram| Male| 11|166.179.168.234| Kybird|
| ktuiteb@ucoz.com| Kenneth| Male| 12| 132.182.90.153| Tuite|
|gberingerc@creati...| Gerhardt| Male| 13| 222.102.76.16| Beringer|
| adreweryd@hibu.com| Avictor| Male| 14| 196.191.41.114| Drewery|
| dupexe@myspace.com| Diahann|Female| 15| 226.50.117.72| Upex|
|dcoldbathef@wikip...| Daryl|Female| 16| 7.99.204.200|Coldbathe|
| gkestong@tamu.edu| Galven| Male| 17| 35.16.66.151| Keston|
|dilchenkoh@istock...| Daffi|Female| 18| 192.45.226.104| Ilchenko|
|lwychardi@sfgate.com| Ladonna|Female| 19| 94.194.233.152| Wychard|
| lsapirj@unblog.fr| Latrena|Female| 20|107.141.139.191| Sapir|
+--------------------+----------+------+---+---------------+---------+
only showing top 20 rows
scala> sqlContext.sql("select * from customer_per").show(false); // output will be displayed full long data (see there : no .... here)
+------------------------------+----------+------+---+---------------+----------
|email |first_name|gender|id |ip_address |last_name|
+------------------------------+----------+------+---+---------------+---------+
|knattrass0@loc.gov |Kaspar |Male |1 |244.159.51.76 |Nattrass |
|rnulty1@multiply.com |Rosamund |Female|2 |237.123.21.130 |Nulty |
|pglasbey2@deviantart.com |Pia |Female|3 |80.11.243.170 |Glasbey |
|dgowthorpe3@buzzfeed.com |Dante |Male |4 |197.253.81.98 |Gowthorpe|
|wsprowson4@accuweather.com |Willamina |Female|5 |64.125.155.144 |Sprowson |
|tbraunds5@ning.com |Trish |Female|6 |38.111.102.64 |Braunds |
|tmasey6@businessweek.com |Tybie |Female|7 |44.87.135.133 |Masey |
|lpapaccio7@howstuffworks.com |Leona |Female|8 |64.233.173.104 |Papaccio |
|hsaltrese8@cbslocal.com |Hendrick |Male |9 |179.21.162.161 |Saltrese |
|mkingsnod9@archive.org |Marna |Female|10 |66.254.243.50 |Kingsnod |
|akybirda@mysql.com |Abram |Male |11 |166.179.168.234|Kybird |
|ktuiteb@ucoz.com |Kenneth |Male |12 |132.182.90.153 |Tuite |
|gberingerc@creativecommons.org|Gerhardt |Male |13 |222.102.76.16 |Beringer |
|adreweryd@hibu.com |Avictor |Male |14 |196.191.41.114 |Drewery |
|dupexe@myspace.com |Diahann |Female|15 |226.50.117.72 |Upex |
|dcoldbathef@wikipedia.org |Daryl |Female|16 |7.99.204.200 |Coldbathe|
|gkestong@tamu.edu |Galven |Male |17 |35.16.66.151 |Keston |
|dilchenkoh@istockphoto.com |Daffi |Female|18 |192.45.226.104 |Ilchenko |
|lwychardi@sfgate.com |Ladonna |Female|19 |94.194.233.152 |Wychard |
|lsapirj@unblog.fr |Latrena |Female|20 |107.141.139.191|Sapir |
+------------------------------+----------+------+---+---------------+---------+
only showing top 20 rows
scala> v3.show
+--------------------+----------+------+---+---------------+---------+
| email|first_name|gender| id| ip_address|last_name|
+--------------------+----------+------+---+---------------+---------+
| knattrass0@loc.gov| Kaspar| Male| 1| 244.159.51.76| Nattrass|
|rnulty1@multiply.com| Rosamund|Female| 2| 237.123.21.130| Nulty|
|pglasbey2@deviant...| Pia|Female| 3| 80.11.243.170| Glasbey|
|dgowthorpe3@buzzf...| Dante| Male| 4| 197.253.81.98|Gowthorpe|
|wsprowson4@accuwe...| Willamina|Female| 5| 64.125.155.144| Sprowson|
| tbraunds5@ning.com| Trish|Female| 6| 38.111.102.64| Braunds|
|tmasey6@businessw...| Tybie|Female| 7| 44.87.135.133| Masey|
|lpapaccio7@howstu...| Leona|Female| 8| 64.233.173.104| Papaccio|
|hsaltrese8@cbsloc...| Hendrick| Male| 9| 179.21.162.161| Saltrese|
|mkingsnod9@archiv...| Marna|Female| 10| 66.254.243.50| Kingsnod|
| akybirda@mysql.com| Abram| Male| 11|166.179.168.234| Kybird|
| ktuiteb@ucoz.com| Kenneth| Male| 12| 132.182.90.153| Tuite|
|gberingerc@creati...| Gerhardt| Male| 13| 222.102.76.16| Beringer|
| adreweryd@hibu.com| Avictor| Male| 14| 196.191.41.114| Drewery|
| dupexe@myspace.com| Diahann|Female| 15| 226.50.117.72| Upex|
|dcoldbathef@wikip...| Daryl|Female| 16| 7.99.204.200|Coldbathe|
| gkestong@tamu.edu| Galven| Male| 17| 35.16.66.151| Keston|
|dilchenkoh@istock...| Daffi|Female| 18| 192.45.226.104| Ilchenko|
|lwychardi@sfgate.com| Ladonna|Female| 19| 94.194.233.152| Wychard|
| lsapirj@unblog.fr| Latrena|Female| 20|107.141.139.191| Sapir|
+--------------------+----------+------+---+---------------+---------+
only showing top 20 rows
scala> v3.show(2); // data frame syntax to view data
+--------------------+----------+------+---+--------------+---------+
| email|first_name|gender| id| ip_address|last_name|
+--------------------+----------+------+---+--------------+---------+
| knattrass0@loc.gov| Kaspar| Male| 1| 244.159.51.76| Nattrass|
|rnulty1@multiply.com| Rosamund|Female| 2|237.123.21.130| Nulty|
+--------------------+----------+------+---+--------------+---------+
only showing top 2 rows
scala> sqlContext.sql("select * from customer_per").show(2);
+--------------------+----------+------+---+--------------+---------+
| email|first_name|gender| id| ip_address|last_name|
+--------------------+----------+------+---+--------------+---------+
| knattrass0@loc.gov| Kaspar| Male| 1| 244.159.51.76| Nattrass|
|rnulty1@multiply.com| Rosamund|Female| 2|237.123.21.130| Nulty|
+--------------------+----------+------+---+--------------+---------+
only showing top 2 rows
To get the number of records:
scala> sqlContext.sql("select * from customer_per").count
res18: Long = 1000
scala> sqlContext.sql("select count(*) from customer_per").show
18/08/16 01:36:42 WARN hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+----+
| _c0|
+----+
|1000|
+----+
// DataFrame syntax to get the number of records
scala> v3.count
res21: Long = 1000
Resilient Distributed Dataset (RDD) into Data Frame (df):
to do this, we need to import
sqlContext.implicits._
scala> import sqlContext.implicits._
import sqlContext.implicits._
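Not from the original notes: a minimal sketch showing the same conversion with a case class, so the DataFrame gets named, typed columns without passing them to toDF (WordCount and the sample pairs are made up for illustration).
case class WordCount(word: String, cnt: Int)
val wcDF = sc.parallelize(Seq(("spark", 3), ("hive", 1))).map { case (w, c) => WordCount(w, c) }.toDF()
wcDF.printSchema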
create a wordcount.txt file in Files folder of Desktop
Launch terminal
gedit wordcount.txt
I love India
You know that or not
I love my India
You know that or not
I love Singapore
You know that or not
I love Bangalore
Why I am writing these here?
Kaipulla Thoongudha?
save this file in /home/cloudera/Desktop/Files/wordcount.txt
Now load this file into RDD:
scala> val r1 = sc.textFile("file:///home/cloudera/Desktop/Files/wordcount.txt");
scala> r1.collect.foreach(println);
I love India
You know that or not
I love my India
You know that or not
I love Singapore
You know that or not
I love Bangalore
Why I am writing these here?
Kaipulla Thoongudha?
scala> r1.partitions.size
res25: Int = 1
scala> r1.foreach(println);
I love India
You know that or not
I love my India
You know that or not
I love Singapore
You know that or not
I love Bangalore
Why I am writing these here?
Kaipulla Thoongudha?
scala> val r2 = r1.flatMap(l => l.split(" "))
r2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[69] at flatMap at <console>:32
scala> r2.foreach(println);
I
love
India
You
know
that
or
not
I
love
my
India
You
know
that
or
not
I
love
Singapore
You
know
that
or
not
I
love
Bangalore
Why
I
am
writing
these
here?
Kaipulla
Thoongudha?
scala> val r3 = r2.map (x => (x,1));
r3: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[70] at map at <console>:34
scala> r3.foreach(println);
(I,1)
(love,1)
(India,1)
(You,1)
(know,1)
(that,1)
(or,1)
(not,1)
(I,1)
(love,1)
(my,1)
(India,1)
(You,1)
(know,1)
(that,1)
(or,1)
(not,1)
(I,1)
(love,1)
(Singapore,1)
(You,1)
(know,1)
(that,1)
(or,1)
(not,1)
(I,1)
(love,1)
(Bangalore,1)
(Why,1)
(I,1)
(am,1)
(writing,1)
(these,1)
(here?,1)
(Kaipulla,1)
(Thoongudha?,1)
scala> val r4 = r3.reduceByKey((x,y) => x+y)
scala> r4.foreach(println);
(not,3)
(here?,1)
(writing,1)
(that,3)
(am,1)
(or,3)
(You,3)
(Thoongudha?,1)
(love,4)
(Bangalore,1)
(Singapore,1)
(I,5)
(know,3)
(Why,1)
(my,1)
(Kaipulla,1)
(these,1)
(India,2)
// r1, r2, r3, r4 are RDDs, but r5 is a DataFrame
toDF : is used to convert an RDD into a DataFrame
-----------------------------------------------
scala> val r5 = r3.reduceByKey((x,y) => x+y).toDF("word","word_count"); // word and word_count are column names specified here
r5: org.apache.spark.sql.DataFrame = [word: string, word_count: int]
scala> r5.show
+-----------+----------+
| word|word_count| // column names are here
+-----------+----------+
| not| 3|
| here?| 1|
| writing| 1|
| that| 3|
| am| 1|
| or| 3|
| You| 3|
|Thoongudha?| 1|
| love| 4|
| Bangalore| 1|
| Singapore| 1|
| I| 5|
| know| 3|
| Why| 1|
| my| 1|
| Kaipulla| 1|
| these| 1|
| India| 2|
+-----------+----------+
scala> val r6 = r3.reduceByKey((x,y) => x+y).toDF; // column names are missing
r6: org.apache.spark.sql.DataFrame = [_1: string, _2: int] // _1 and _2 are column names internally given by spark
scala> r6.show
+-----------+---+
| _1| _2| // _1 and _2 are column names internally given by spark
+-----------+---+
| not| 3|
| here?| 1|
| writing| 1|
| that| 3|
| am| 1|
| or| 3|
| You| 3|
|Thoongudha?| 1|
| love| 4|
| Bangalore| 1|
| Singapore| 1|
| I| 5|
| know| 3|
| Why| 1|
| my| 1|
| Kaipulla| 1|
| these| 1|
| India| 2|
+-----------+---+
scala> r5.columns
res3: Array[String] = Array(word, word_count)
scala> r5.columns.foreach(println);
word
word_count
Load the wordcount.txt file and do a word count using Spark with Scala:
val r1 = sc.textFile("file:///home/cloudera/Desktop/Files/wordcount.txt");
val r2 = r1.flatMap(l => l.split(" "))
val r3 = r2.map (x => (x,1));
val r4 = r3.reduceByKey((x,y) => x+y)
val r5 = r3.reduceByKey((x,y) => x+y).toDF("word","word_count");
scala> sqlContext.sql("use sara");
scala> r5.registerTempTable("wordcount_temp"); // create a temporary table in spark in memory
scala> r5.saveAsTable("wordcount_per"); // create a permanent table in hive disk storage
scala> sqlContext.sql("show tables").show
+--------------+-----------+
| tableName|isTemporary|
+--------------+-----------+
|wordcount_temp| true| // temporary
| customer_per| false| // permanent
| mytable| false|
| wordcount_per| false|
+--------------+-----------+
$ hive
hive> use sara;
OK
Time taken: 0.684 seconds
hive> show tables;
OK
customer_per
mytable
wordcount_per
Time taken: 0.425 seconds, Fetched: 3 row(s) // Here wordcount_temp is not there
Back to Spark...
scala> sqlContext.sql("select * from wordcount_temp").show
+-----------+----------+
| word|word_count|
+-----------+----------+
| not| 3|
| here?| 1|
| writing| 1|
| that| 3|
| am| 1|
| or| 3|
| You| 3|
|Thoongudha?| 1|
| love| 4|
| Bangalore| 1|
| Singapore| 1|
| I| 5|
| know| 3|
| Why| 1|
| my| 1|
| Kaipulla| 1|
| these| 1|
| India| 2|
+-----------+----------+
scala> sqlContext.sql("select * from wordcount_per").count
Long = 18
Load json directly into Data frame:
scala> val df = sqlContext.read.format("json").load("file:///home/cloudera/Desktop/Files/city_json");
df: org.apache.spark.sql.DataFrame = [abbr: string, district: string, id: bigint, name: string, population: bigint]
scala> df.printSchema();
root
|-- abbr: string (nullable = true)
|-- district: string (nullable = true)
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- population: long (nullable = true)
scala> df.select("name","population").show(2);
+--------+----------+
| name|population|
+--------+----------+
| Kabul| 1780000|
|Qandahar| 237500|
+--------+----------+
only showing top 2 rows
scala> df.select ("id","name","district","population","abbr").show(5);
+---+--------------+-------------+----------+----+
| id| name| district|population|abbr|
+---+--------------+-------------+----------+----+
| 1| Kabul| Kabol| 1780000| AFG|
| 2| Qandahar| Qandahar| 237500| AFG|
| 3| Herat| Herat| 186800| AFG|
| 4|Mazar-e-Sharif| Balkh| 127800| AFG|
| 5| Amsterdam|Noord-Holland| 731200| NLD|
+---+--------------+-------------+----------+----+
only showing top 5 rows
scala> df.select($"name".as("country"),($"population" * 0.10).alias("population")).show;
+----------------+------------------+
| country| population|
+----------------+------------------+
| Kabul| 178000.0|
| Qandahar| 23750.0|
| Herat| 18680.0|
| Mazar-e-Sharif| 12780.0|
| Amsterdam| 73120.0|
| Rotterdam|59332.100000000006|
| Haag| 44090.0|
| Utrecht|23432.300000000003|
| Eindhoven|20184.300000000003|
| Tilburg| 19323.8|
| Groningen|17270.100000000002|
| Breda|16039.800000000001|
| Apeldoorn| 15349.1|
| Nijmegen|15246.300000000001|
| Enschede|14954.400000000001|
| Haarlem| 14877.2|
| Almere| 14246.5|
| Arnhem| 13802.0|
| Zaanstad| 13562.1|
|´s-Hertogenbosch| 12917.0|
+----------------+------------------+
scala> df.filter($"population" > 100000).sort($"population".desc).show();
+----+--------------------+---+-----------------+----------+
|abbr| district| id| name|population|
+----+--------------------+---+-----------------+----------+
| BRA| São Paulo|206| São Paulo| 9968485|
| IDN| Jakarta Raya|939| Jakarta| 9604900|
| GBR| England|456| London| 7285000|
| EGY| Kairo|608| Cairo| 6789479|
| BRA| Rio de Janeiro|207| Rio de Janeiro| 5598953|
| CHL| Santiago|554|Santiago de Chile| 4703954|
| BGD| Dhaka|150| Dhaka| 3612850|
| EGY| Aleksandria|609| Alexandria| 3328196|
| AUS| New South Wales|130| Sydney| 3276207|
| ARG| Distrito Federal| 69| Buenos Aires| 2982146|
| ESP| Madrid|653| Madrid| 2879052|
| AUS| Victoria|131| Melbourne| 2865329|
| IDN| East Java|940| Surabaya| 2663820|
| ETH| Addis Abeba|756| Addis Abeba| 2495000|
| IDN| West Java|941| Bandung| 2429000|
| ZAF| Western Cape|712| Cape Town| 2352121|
| BRA| Bahia|208| Salvador| 2302832|
| EGY| Giza|610| Giza| 2221868|
| PHL|National Capital Reg|765| Quezon| 2173831|
| DZA| Alger| 35| Alger| 2168000|
+----+--------------------+---+-----------------+----------+
scala> df.groupBy("abbr").agg(sum("population").as("population")).show()
+----+----------+
|abbr|population|
+----+----------+
| BEL| 1609322|
| GRD| 4621|
| BEN| 968503|
| BRN| 21484|
| GEO| 1880900|
| GRL| 13445|
| BFA| 1229000|
| GLP| 75380|
| HTI| 1517338|
| ECU| 5744142|
| BLZ| 62915|
| DOM| 2438276|
| ARE| 1728336|
| ARG| 19996563|
| HND| 1287000|
| ARM| 1633100|
| GMB| 144926|
| ALB| 270000|
| FRO| 14542|
| BGD| 8569906|
+----+----------+
only showing top 20 rows
scala> df.select("abbr").distinct().sort("abbr").show
+----+
|abbr|
+----+
| ABW|
| AFG|
| AGO|
| AIA|
| ALB|
| AND|
| ANT|
| ARE|
| ARG|
| ARM|
| ASM|
| ATG|
| AUS|
| AZE|
| BDI|
| BEL|
| BEN|
| BFA|
| BGD|
| BGR|
+----+
only showing top 20 rows
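Not from the original notes: the same per-country aggregation expressed as SQL, assuming the df loaded from city_json above is registered as a temp table first.
df.registerTempTable("city")
sqlContext.sql("select abbr, sum(population) as population from city group by abbr").show(5)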
---------------------------------------------------------------------
scala> val r1 = List((11,10000),(11,20000),(12,30000),(12,40000),(13,50000))
r1: List[(Int, Int)] = List((11,10000), (11,20000), (12,30000), (12,40000), (13,50000))
scala> val r2 = List((11,"Hyd"),(12,"Del"),(13,"Hyd"))
r2: List[(Int, String)] = List((11,Hyd), (12,Del), (13,Hyd))
scala> val rdd1 = sc.parallelize(r1)
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:29
scala> rdd1.collect.foreach(println) // salary info
(11,10000)
(11,20000)
(12,30000)
(12,40000)
(13,50000)
scala> val rdd2 = sc.parallelize(r2)
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[1] at parallelize at <console>:29
scala> rdd2.collect.foreach(println) // location info
(11,Hyd)
(12,Del)
(13,Hyd)
scala> val j = rdd1.join(rdd2)
j: org.apache.spark.rdd.RDD[(Int, (Int, String))] = MapPartitionsRDD[4] at join at <console>:35
scala> j.collect.foreach(println)
(13,(50000,Hyd))
(11,(10000,Hyd))
(11,(20000,Hyd))
(12,(30000,Del))
(12,(40000,Del))
scala> var citySalPair = j.map { x =>
| val city = x._2._2
| val sal = x._2._1
| (city,sal)
| }
citySalPair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[5] at map at <console>:43
scala> citySalPair.collect.foreach(println)
(Hyd,50000)
(Hyd,10000)
(Hyd,20000)
(Del,30000)
(Del,40000)
scala> val res = citySalPair.reduceByKey(_+_)
res: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[6] at reduceByKey at <console>:45
scala> res.collect.foreach(println) // salary grouped by city aggregation
(Del,70000)
(Hyd,80000)
In the above example, we joined two RDDs whose tuples have only 2 fields.
If a tuple has more than 2 fields, how do we join?
scala> val e = List((11,30000,10000),(11,40000,20000),(12,50000,30000),(13,60000,20000),(12,80000,30000))
e: List[(Int, Int, Int)] = List((11,30000,10000), (11,40000,20000), (12,50000,30000), (13,60000,20000), (12,80000,30000))
scala> val ee = sc.parallelize(e)
ee: org.apache.spark.rdd.RDD[(Int, Int, Int)] = ParallelCollectionRDD[7] at parallelize at <console>:29
scala> ee.foreach(println)
(11,30000,10000)
(11,40000,20000)
(12,50000,30000)
(13,60000,20000)
(12,80000,30000)
scala> rdd2.collect.foreach(println)
(11,Hyd)
(12,Del)
(13,Hyd)
scala> val j2 = ee.join(rdd2)
<console>:35: error: value join is not a member of org.apache.spark.rdd.RDD[(Int, Int, Int)]
val j2 = ee.join(rdd2)
^
// while joining both should be key, value pairs
scala> ee.collect.foreach(println) // Here the below is not a key,value pair
(11,30000,10000)
(11,40000,20000)
(12,50000,30000)
(13,60000,20000)
(12,80000,30000)
we need to do one more transformation
scala> val e3 = ee.map { x =>
| val dno = x._1
| val sal = x._2
| val bonus = x._3
| (dno, (sal,bonus))
| }
e3: org.apache.spark.rdd.RDD[(Int, (Int, Int))] = MapPartitionsRDD[8] at map at <console>:37
scala> e3.collect.foreach(println) // Here we formed key,value pairs
(11,(30000,10000))
(11,(40000,20000))
(12,(50000,30000))
(13,(60000,20000))
(12,(80000,30000))
scala> val j4 = e3.join(rdd2)
j4: org.apache.spark.rdd.RDD[(Int, ((Int, Int), String))] = MapPartitionsRDD[14] at join at <console>:43
scala> j4.collect.foreach(println)
(13,((60000,20000),Hyd))
(11,((30000,10000),Hyd))
(11,((40000,20000),Hyd))
(12,((50000,30000),Del))
(12,((80000,30000),Del))
scala> val j3 = e3.join(rdd2)
j3: org.apache.spark.rdd.RDD[(Int, ((Int, Int), String))] = MapPartitionsRDD[12] at join at <console>:37
scala> j3.collect.foreach(println)
(13,((60000,20000),Hyd))
(11,((30000,10000),Hyd))
(11,((40000,20000),Hyd))
(12,((50000,30000),Del))
(12,((80000,30000),Del))
scala> val pair = j3.map { x =>
val sal = x._2._1._1
val bonus = x._2._1._2
val tot = sal+bonus
val city = x._2._2
(city,tot)
}
scala> pair.collect.foreach(println)
(Hyd,80000)
(Hyd,40000)
(Hyd,60000)
(Del,80000)
(Del,110000)
scala> val resultOfCityAgg = pair.reduceByKey(_+_)
resultOfCityAgg: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[14] at reduceByKey at <console>:41
scala> resultOfCityAgg.foreach(println)
(Del,190000)
(Hyd,180000)
Create the following files in local Linux, then copy them into HDFS:
-----------------------------------------------------------------
[cloudera@quickstart ~]$ cat > emp
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
104,dd,40000,m,12
105,cccc,70000,f,13
106,de,80000,f,13
107,io,90000,m,14
108,yu,100000,f,14
109,poi,30000,m,11
110,aaa,60000,f,14
123,djdj,900000,m,15
122,asasd,10000,m,15
[cloudera@quickstart ~]$ cat > dept
11,marketing,hyd
12,hr,del
13,finance,hyd
14,admin,del
15,accounts,hyd
Copy the files into Sparks (HDFS):
----------------------------------
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal emp Sparks
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal dept Sparks
[cloudera@quickstart ~]$
[cloudera@quickstart ~]$ hdfs dfs -ls Sparks
Found 2 items
-rw-r--r-- 1 cloudera cloudera 71 2018-10-09 02:06 Sparks/dept
-rw-r--r-- 1 cloudera cloudera 232 2018-10-09 02:06 Sparks/emp
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/dept
11,marketing,hyd
12,hr,del
13,finance,hyd
14,admin,del
15,accounts,hyd
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/emp
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
104,dd,40000,m,12
105,cccc,70000,f,13
106,de,80000,f,13
107,io,90000,m,14
108,yu,100000,f,14
109,poi,30000,m,11
110,aaa,60000,f,14
123,djdj,900000,m,15
122,asasd,10000,m,15
scala> val emp = sc.textFile("/user/cloudera/Sparks/emp")
scala> val dept = sc.textFile("/user/cloudera/Sparks/dept")
scala> emp.collect.foreach(println)
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
104,dd,40000,m,12
105,cccc,70000,f,13
106,de,80000,f,13
107,io,90000,m,14
108,yu,100000,f,14
109,poi,30000,m,11
110,aaa,60000,f,14
123,djdj,900000,m,15
122,asasd,10000,m,15
scala> dept.collect.foreach(println)
11,marketing,hyd
12,hr,del
13,finance,hyd
14,admin,del
15,accounts,hyd
To perform joins, we need to make key,value pairs
scala> val e = emp.map { x =>
| val w = x.split(",")
| val dno = w(4).toInt
| val id = w(0)
| val name = w(1)
| val sal = w(2).toInt
| val sex = w(3)
| val info = id +","+name+","+sal+","+sex
| (dno,info)
| }
e: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[20] at map at <console>:29
scala> e.collect.foreach(println)
(12,101,aaaa,70000,m) // internally (12, "101,aaaa,70000,m")
(12,102,bbbbb,90000,f)
(11,103,cc,10000,m)
(12,104,dd,40000,m)
(13,105,cccc,70000,f)
(13,106,de,80000,f)
(14,107,io,90000,m)
(14,108,yu,100000,f)
(11,109,poi,30000,m)
(14,110,aaa,60000,f)
(15,123,djdj,900000,m)
(15,122,asasd,10000,m)
scala> val d = dept.map { x =>
| val w = x.split(",")
| val dno = w(0).toInt
| val info = w(1)+","+w(2)
| (dno,info)
| }
d: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[21] at map at <console>:29
scala> d.collect.foreach(println)
(11,marketing,hyd) /// (11,"marketing,hyd") internally
(12,hr,del)
(13,finance,hyd)
(14,admin,del)
(15,accounts,hyd)
scala> val ed = e.join(d)
ed: org.apache.spark.rdd.RDD[(Int, (String, String))] = MapPartitionsRDD[24] at join at <console>:35
scala> ed.collect.foreach(println)
(13,(105,cccc,70000,f,finance,hyd)) // internally (13, ("105,cccc,70000,f", "finance,hyd"))
(13,(106,de,80000,f,finance,hyd))
(15,(123,djdj,900000,m,accounts,hyd))
(15,(122,asasd,10000,m,accounts,hyd))
(11,(103,cc,10000,m,marketing,hyd))
(11,(109,poi,30000,m,marketing,hyd))
(14,(107,io,90000,m,admin,del))
(14,(108,yu,100000,f,admin,del))
(14,(110,aaa,60000,f,admin,del))
(12,(101,aaaa,70000,m,hr,del))
(12,(102,bbbbb,90000,f,hr,del))
(12,(104,dd,40000,m,hr,del))
scala> val ed2 = ed.map { x =>
| val einfo = x._2._1
| val dinfo = x._2._2
| val info = einfo +","+dinfo
| info
| }
ed2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[25] at map at <console>:37
scala> ed2.collect.foreach(println)
105,cccc,70000,f,finance,hyd
106,de,80000,f,finance,hyd
123,djdj,900000,m,accounts,hyd
122,asasd,10000,m,accounts,hyd
103,cc,10000,m,marketing,hyd
109,poi,30000,m,marketing,hyd
107,io,90000,m,admin,del
108,yu,100000,f,admin,del
110,aaa,60000,f,admin,del
101,aaaa,70000,m,hr,del
102,bbbbb,90000,f,hr,del
104,dd,40000,m,hr,del
Write the RDD into HDFS as a file:
scala> ed2.saveAsTextFile("/user/cloudera/Sparks/res1")
[cloudera@quickstart ~]$ hdfs dfs -ls Sparks
Found 3 items
-rw-r--r-- 1 cloudera cloudera 71 2018-10-09 02:06 Sparks/dept
-rw-r--r-- 1 cloudera cloudera 232 2018-10-09 02:06 Sparks/emp
drwxr-xr-x - cloudera cloudera 0 2018-10-09 02:21 Sparks/res1
[cloudera@quickstart ~]$ hdfs dfs -ls Sparks/res1
Found 2 items
-rw-r--r-- 1 cloudera cloudera 0 2018-10-09 02:21 Sparks/res1/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 325 2018-10-09 02:21 Sparks/res1/part-00000
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/res1/part-00000
105,cccc,70000,f,finance,hyd
106,de,80000,f,finance,hyd
123,djdj,900000,m,accounts,hyd
122,asasd,10000,m,accounts,hyd
103,cc,10000,m,marketing,hyd
109,poi,30000,m,marketing,hyd
107,io,90000,m,admin,del
108,yu,100000,f,admin,del
110,aaa,60000,f,admin,del
101,aaaa,70000,m,hr,del
102,bbbbb,90000,f,hr,del
104,dd,40000,m,hr,del
scala> val emp = sc.textFile("/user/cloudera/Sparks/emp")
emp: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/emp MapPartitionsRDD[28] at textFile at <console>:27
scala> val dept = sc.textFile("/user/cloudera/Sparks/dept")
dept: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/dept MapPartitionsRDD[30] at textFile at <console>:27
scala> val ednosal = emp.map { x =>
| val w = x.split(",")
| val dno = w(4)
| val sal = w(2).toInt
| (dno,sal)
| }
ednosal: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[31] at map at <console>:29
scala> ednosal.collect.foreach(println)
(12,70000)
(12,90000)
(11,10000)
(12,40000)
(13,70000)
(13,80000)
(14,90000)
(14,100000)
(11,30000)
(14,60000)
(15,900000)
(15,10000)
scala> val dnoCity = dept.map { x =>
| val w = x.split(",")
| val dno = w(0)
| val city = w(2)
| (dno,city)
| }
dnoCity: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[32] at map at <console>:29
scala> dnoCity.collect.foreach(println)
(11,hyd)
(12,del)
(13,hyd)
(14,del)
(15,hyd)
Now ednosal and dnoCity both are key,value pairs
scala> val edjoin = ednosal.join(dnoCity)
edjoin: org.apache.spark.rdd.RDD[(String, (Int, String))] = MapPartitionsRDD[35] at join at <console>:35
scala> edjoin.collect.foreach(println)
(14,(90000,del))
(14,(100000,del))
(14,(60000,del))
(15,(900000,hyd))
(15,(10000,hyd))
(12,(70000,del))
(12,(90000,del))
(12,(40000,del))
(13,(70000,hyd))
(13,(80000,hyd))
(11,(10000,hyd))
(11,(30000,hyd))
scala> val citysal = edjoin.map { x =>
| val city = x._2._2
| val sal = x._2._1
| (city,sal)
| }
citysal: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[36] at map at <console>:37
scala> citysal.collect.foreach(println)
(del,90000)
(del,100000)
(del,60000)
(hyd,900000)
(hyd,10000)
(del,70000)
(del,90000)
(del,40000)
(hyd,70000)
(hyd,80000)
(hyd,10000)
(hyd,30000)
scala>
// performing city based aggregation
scala> val res = citysal.reduceByKey(_+_)
res: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[37] at reduceByKey at <console>:39
scala> res.collect.foreach(println)
(hyd,1100000)
(del,450000)
Resilient Distributed DataSets
RDD is subdivided into partitions and partitions are distributed across multiple slaves
3 ways to create RDDs
a) Read Data from file using SparkContext (sc)
b) When you perform transformation against existing RDD
c) When you parallelize local objects
Two types of operations
Transformations and Actions
Transformations:
a) element wise
operation over each element of the collection
map, flatMap
b) grouping aggregations
reduceByKey, groupByKey
c) Filters
filter, filterByRange
Actions:
The RDD data flow is executed only when an action is performed.
During the flow execution, the RDDs are loaded into RAM.
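A tiny sketch tying these notes together (the emp path is the one used in the join example above): the three ways an RDD comes into existence, plus one action that triggers execution.
val fromFile  = sc.textFile("/user/cloudera/Sparks/emp")   // a) read data from a file via sc
val fromTrans = fromFile.map(_.split(","))                 // b) transformation of an existing RDD
val fromLocal = sc.parallelize(List(10, 20, 30, 40))       // c) parallelize a local collection
fromFile.count()                                           // an action: only now does the flow actually execute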
scala> val x = List(10,20,30,40,30,23,45,36)
x: List[Int] = List(10, 20, 30, 40, 30, 23, 45, 36)
scala> val y = sc.parallelize(x)
y: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[38] at parallelize at <console>:29
scala> val a = x.map (x => x + 100)
a: List[Int] = List(110, 120, 130, 140, 130, 123, 145, 136)
scala> val b = y.map (x => x + 100)
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[39] at map at <console>:31
scala> b.collect.foreach(println)
110
120
130
140
130
123
145
136
Here x and a are local objects; y and b are RDDs.
Local objects:
An object that resides in the client machine is called a local object.
RDDs are declared at the client; during flow execution they are loaded into the slaves of the Spark cluster.
x is declared as a local object.
y is the parallelized version of x, so y is created as an RDD.
a is a local object, because a is a transformation of x (which is local).
scala> x
res29: List[Int] = List(10, 20, 30, 40, 30, 23, 45, 36)
scala> val c = x.filter (x => x >= 40)
c: List[Int] = List(40, 45)
scala> val d = y.filter(x => x >= 40)
d: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[40] at filter at <console>:31
scala> val res = d.collect
res: Array[Int] = Array(40, 45)
collect is an action.
a) y is loaded into RAM and waits for data
b) d is loaded into RAM and performs the computation; once d is ready, y can be removed from RAM
The collect action is then executed on d.
The collect action gathers the results of all partitions of the RDD into the client machine (local).
val res = d.collect
res is a local object.
What does parallelize() do?
It converts local objects into RDDs.
val a = sc.parallelize(List(10,20,30,40))
'a' is an RDD with 1 partition, so no parallel processing is possible during execution.
val b = sc.parallelize(List(10,20,30,40),2)
Now 'b' is an RDD with 2 partitions;
during execution these 2 partitions are loaded into the RAM of 2 separate slaves, so
parallel processing is possible.
scala> val x = List(10,20,30,40,1,2,3,4,90,12)
x: List[Int] = List(10, 20, 30, 40, 1, 2, 3, 4, 90, 12)
scala> x.size
res30: Int = 10
scala> x.length
res31: Int = 10
scala> val r1 = sc.parallelize(x) // partition is 1
r1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[41] at parallelize at <console>:29
scala> r1.partitions.size
res32: Int = 1
scala> val r2 = sc.parallelize(x,3) // partition is 3 so parallel achieved
r2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[42] at parallelize at <console>:29
scala> r2.partitions.size
res33: Int = 3
//we are going to perform wordcount analysis in Spark
//create a text file in local Linux named comment, with repeated words as its content
[cloudera@quickstart ~]$ cat > comment
I love Spark
I love Hadoop
I love Spark and Hadoop
Hadoop and Spark are great systems
[cloudera@quickstart ~]$ ls comment
comment
[cloudera@quickstart ~]$ pwd
/home/cloudera
//copy comment into Sparks folder of HDFS
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal comment Sparks
[cloudera@quickstart ~]$ hdfs dfs -ls Sparks
Found 4 items
-rw-r--r-- 1 cloudera cloudera 86 2018-10-10 01:21 Sparks/comment
-rw-r--r-- 1 cloudera cloudera 71 2018-10-09 02:06 Sparks/dept
-rw-r--r-- 1 cloudera cloudera 232 2018-10-09 02:06 Sparks/emp
drwxr-xr-x - cloudera cloudera 0 2018-10-09 02:21 Sparks/res1
//view it
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/comment
I love Spark
I love Hadoop
I love Spark and Hadoop
Hadoop and Spark are great systems
[cloudera@quickstart ~]$
scala> val data = sc.textFile("/user/cloudera/Sparks/comment")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/comment MapPartitionsRDD[44] at textFile at <console>:27
scala> data.count
res35: Long = 4
scala> data.collect.foreach(println)
I love Spark
I love Hadoop
I love Spark and Hadoop
Hadoop and Spark are great systems
scala> data.collect
res37: Array[String] = Array(I love Spark, I love Hadoop, I love Spark and Hadoop, Hadoop and Spark are great systems)
read text file contents and put them into RDDs
scala> val lines = sc.textFile("hdfs://quickstart.cloudera/user/cloudera/Sparks/comment")
lines: org.apache.spark.rdd.RDD[String] = hdfs://quickstart.cloudera/user/cloudera/Sparks/comment MapPartitionsRDD[46] at textFile at <console>:27
scala> lines.count
res38: Long = 4
scala> lines.collect
res39: Array[String] = Array(I love Spark, I love Hadoop, I love Spark and Hadoop, Hadoop and Spark are great systems)
scala> lines.collect.foreach(println)
I love Spark
I love Hadoop
I love Spark and Hadoop
Hadoop and Spark are great systems
scala> lines.foreach(println)
I love Spark
I love Hadoop
I love Spark and Hadoop
Hadoop and Spark are great systems
scala> val words = lines.flatMap(x => x.split(" "))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[47] at flatMap at <console>:31
scala> val wordss = lines.flatMap(_.split(" ")) // short hand format
wordss: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[48] at flatMap at <console>:29
scala> words.foreach(println)
I
love
Spark
I
love
Hadoop
I
love
Spark
and
Hadoop
Hadoop
and
Spark
are
great
systems
scala> wordss.foreach(println)
I
love
Spark
I
love
Hadoop
I
love
Spark
and
Hadoop
Hadoop
and
Spark
are
great
systems
scala> words.collect
res44: Array[String] = Array(I, love, Spark, I, love, Hadoop, I, love, Spark, and, Hadoop, Hadoop, and, Spark, are, great, systems)
scala> val pair = words.map (x => (x,1))
pair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[49] at map at <console>:33
scala> pair.collect
res45: Array[(String, Int)] = Array((I,1), (love,1), (Spark,1), (I,1), (love,1), (Hadoop,1), (I,1), (love,1), (Spark,1), (and,1), (Hadoop,1), (Hadoop,1), (and,1), (Spark,1), (are,1), (great,1), (systems,1))
scala> pair.collect.foreach(println)
(I,1)
(love,1)
(Spark,1)
(I,1)
(love,1)
(Hadoop,1)
(I,1)
(love,1)
(Spark,1)
(and,1)
(Hadoop,1)
(Hadoop,1)
(and,1)
(Spark,1)
(are,1)
(great,1)
(systems,1)
scala> val wc = pair.reduceByKey((a,b) => a+b) // full format
scala> val wc = pair.reduceByKey (_+_)
wc: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[50] at reduceByKey at <console>:44
scala> wc.collect
res47: Array[(String, Int)] = Array((are,1), (Spark,3), (love,3), (I,3), (great,1), (and,2), (systems,1), (Hadoop,3))
scala> wc.collect.foreach(println)
(are,1)
(Spark,3)
(love,3)
(I,3)
(great,1)
(and,2)
(systems,1)
(Hadoop,3)
Complete shorthand version below:
scala> val words = lines.flatMap (_.split(" "))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[51] at flatMap at <console>:29
scala> val pair = words.map ( (_,1))
pair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[52] at map at <console>:31
scala> val res = pair.reduceByKey(_+_)
res: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[53] at reduceByKey at <console>:33
scala> res.collect.foreach(println)
(are,1)
(Spark,3)
(love,3)
(I,3)
(great,1)
(and,2)
(systems,1)
(Hadoop,3)
scala> res.count
res50: Long = 8
//single line implementation
scala> val wc = lines.flatMap(_.split(" ")).map ((_,1)).reduceByKey(_+_)
wc: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[56] at reduceByKey at <console>:29
scala> wc.collect.foreach(println)
(are,1)
(Spark,3)
(love,3)
(I,3)
(great,1)
(and,2)
(systems,1)
(Hadoop,3)
val wc = lines.map { x =>
val w = x.split(" ")
val p = w.map((_,1))
p
}.flatMap(x => x).reduceByKey(_+_)
scala> wc.collect.foreach(println)
(are,1)
(Spark,3)
(love,3)
(I,3)
(great,1)
(and,2)
(systems,1)
(Hadoop,3)
Unnecessary Space Example:
[cloudera@quickstart ~]$ cat >unnecessaryspace.txt
I loVE INdiA I loVE PaLlaThuR I LoVE BanGALorE
Hadoop VS SPArK faCEBooK SCAla ArunACHAlam VenkaTAChaLAm "
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal unnecessaryspace.txt Sparks
scala function to remove unnecessary space
scala> def removeSpace(line:String) = {
| // i LoVE spARk
| val w = line.trim().split(" ")
| val words = w.filter(x => x != "")
| words.mkString(" ")
| }
removeSpace: (line: String)String
execute the function:
scala> removeSpace("I LovE Spark ")
res56: String = I LovE Spark
scala> val data = sc.textFile("/user/cloudera/Sparks/unnecessaryspace.txt")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/unnecessaryspace.txt MapPartitionsRDD[62] at textFile at <console>:27
scala> data.count
res57: Long = 2
scala> data.collect.foreach(println)
I loVE INdiA I loVE PaLlaThuR I LoVE BanGALorE
Hadoop VS SPArK faCEBooK SCAla ArunACHAlam VenkaTAChaLAm "
scala> removeSpace("I LovE Spark ")
res59: String = I LovE Spark
scala> val words = data.flatMap { x =>
| val x2 = removeSpace(x).toLowerCase.split(" ")
| x2
| }
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[63] at flatMap at <console>:37
scala> words.collect
res60: Array[String] = Array(i, love, india, i, love, pallathur, i, love, bangalore, hadoop, vs, spark, facebook, scala, arunachalam, venkatachalam, ")
scala> words.collect.foreach(println)
i
love
india
i
love
pallathur
i
love
bangalore
hadoop
vs
spark
facebook
scala
arunachalam
venkatachalam
"
scala> val pair = words.map (x => (x,1))
pair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[64] at map at <console>:40
scala> val wc = pair.reduceByKey(_+_)
wc: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[65] at reduceByKey at <console>:42
scala> wc.collect.foreach(println)
(bangalore,1)
(india,1)
(",1)
(scala,1)
(spark,1)
(hadoop,1)
(love,3) // 3 times
(facebook,1)
(i,3) // 3 times
(venkatachalam,1)
(arunachalam,1)
(pallathur,1)
(vs,1)
cat > emp
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
104,dd,40000,m,12
105,cccc,70000,f,13
106,de,80000,f,13
107,io,90000,m,14
108,yu,100000,f,14
109,poi,30000,m,11
110,aaa,60000,f,14
123,djdj,900000,m,15
122,asasd,10000,m,15
hdfs dfs -copyFromLocal emp Sparks
[cloudera@quickstart ~]$ hdfs dfs -ls Sparks
Found 5 items
-rw-r--r-- 1 cloudera cloudera 86 2018-10-10 01:21 Sparks/comment
-rw-r--r-- 1 cloudera cloudera 71 2018-10-09 02:06 Sparks/dept
-rw-r--r-- 1 cloudera cloudera 232 2018-10-09 02:06 Sparks/emp -- this is the file
drwxr-xr-x - cloudera cloudera 0 2018-10-09 02:21 Sparks/res1
-rw-r--r-- 1 cloudera cloudera 137 2018-10-10 01:58 Sparks/unnecessaryspace.txt
[cloudera@quickstart ~]$
scala> val emp = sc.textFile("/user/cloudera/Sparks/emp")
emp: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/emp MapPartitionsRDD[67] at textFile at <console>:27
scala> val eArr = emp.map (x => x.split(","))
eArr: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[68] at map at <console>:31
scala> //sex based aggregations on sal
scala> val pairSexSal = eArr.map ( x => (x(3),x(2).toInt))
pairSexSal: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[71] at map at <console>:33
scala> val res1 = pairSexSal.reduceByKey(_+_)
res1: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[72] at reduceByKey at <console>:35
scala> //select sex,max(sal) from emp group by sex;
scala> val res2 = pairSexSal.reduceByKey(Math.max(_,_))
res2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[73] at reduceByKey at <console>:35
scala> //select sex,min(sal) from emp group by sex;
scala> val res3 = pairSexSal.reduceByKey(Math.min(_,_))
res3: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[74] at reduceByKey at <console>:35
scala> val data = sc.textFile("/user/cloudera/Sparks/emp")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/emp MapPartitionsRDD[76] at textFile at <console>:27
scala> data.collect.foreach(println)
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
104,dd,40000,m,12
105,cccc,70000,f,13
106,de,80000,f,13
107,io,90000,m,14
108,yu,100000,f,14
109,poi,30000,m,11
110,aaa,60000,f,14
123,djdj,900000,m,15
122,asasd,10000,m,15
// select dno,sex,sum(sal) from emp group by dno,sex;
scala> val pair = data.map { x =>
| val w = x.split(",")
| val dno = w(4)
| val sex = w(3)
| val sal = w(2).toInt
| val myKey = (dno,sex)
| (myKey,sal)
| }
pair: org.apache.spark.rdd.RDD[((String, String), Int)] = MapPartitionsRDD[77] at map at <console>:31
scala> pair.collect.foreach(println)
((12,m),70000)
((12,f),90000)
((11,m),10000)
((12,m),40000)
((13,f),70000)
((13,f),80000)
((14,m),90000)
((14,f),100000)
((11,m),30000)
((14,f),60000)
((15,m),900000)
((15,m),10000)
In the (key,value) pair, the key itself is a tuple, e.g. (15,m).
Whenever grouping by multiple columns is required, make the key a tuple.
scala> var res = pair.reduceByKey(_+_)
res: org.apache.spark.rdd.RDD[((String, String), Int)] = ShuffledRDD[78] at reduceByKey at <console>:33
scala> val res = pair.reduceByKey((x,y) => x+y)
res: org.apache.spark.rdd.RDD[((String, String), Int)] = ShuffledRDD[79] at reduceByKey at <console>:38
scala> res.collect
res68: Array[((String, String), Int)] = Array(((14,m),90000), ((12,f),90000), ((15,m),910000), ((14,f),160000), ((13,f),150000), ((12,m),110000), ((11,m),40000))
scala> res.collect.foreach(println)
((14,m),90000)
((12,f),90000)
((15,m),910000)
((14,f),160000)
((13,f),150000)
((12,m),110000)
((11,m),40000)
We can group by a single column or a set of columns, but the aggregation above is only one at a time.
What if we want multiple aggregations (sum, max, min, etc.)?
reduceByKey performs only one aggregation per pass.
scala> data.collect
res72: Array[String] = Array(101,aaaa,70000,m,12, 102,bbbbb,90000,f,12, 103,cc,10000,m,11, 104,dd,40000,m,12, 105,cccc,70000,f,13, 106,de,80000,f,13, 107,io,90000,m,14, 108,yu,100000,f,14, 109,poi,30000,m,11, 110,aaa,60000,f,14, 123,djdj,900000,m,15, 122,asasd,10000,m,15)
scala> data.collect.foreach(println)
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
104,dd,40000,m,12
105,cccc,70000,f,13
106,de,80000,f,13
107,io,90000,m,14
108,yu,100000,f,14
109,poi,30000,m,11
110,aaa,60000,f,14
123,djdj,900000,m,15
122,asasd,10000,m,15
scala> data.take(3).foreach(println)
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
data.skip(3) is not available
val data = sc.textFile("....")
val arr = data.map (x => x.split(","))
val pairSexSal = arr.map (x => ( x(3),x(2).toInt))
val res1 = pairSexSal.reduceByKey (_+_)
val res2 = pairSexSal.reduceByKey(Math.max(_,_))
val res3 = pairSexSal.reduceByKey(Math.min(_,_))
data
array
pair
(pair.persist / pair.cache)
res1,res2,res3
3 different flows :
data -> array -> pair -> res1
data -> array -> pair -> res2
data -> array -> pair -> res3
An RDD is not persisted at the moment it is declared.
It is persisted the first time it is loaded and computed (e.g. when res1.collect runs).
After res1.collect, pairSexSal is persisted,
so its result is available in RAM.
When res2.collect runs, the steps are not re-executed from the beginning;
the flow continues from pairSexSal onwards,
because the pairSexSal result is already persisted and there is no need to recompute it.
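A small hedged sketch of explicit persistence control, reusing the names from the flow above (the StorageLevel shown is just one illustrative choice):
import org.apache.spark.storage.StorageLevel
val pairSexSal = data.map(_.split(",")).map(x => (x(3), x(2).toInt))
pairSexSal.persist(StorageLevel.MEMORY_ONLY)       // equivalent to .cache()
val res1 = pairSexSal.reduceByKey(_ + _)           // sum
val res2 = pairSexSal.reduceByKey(Math.max(_, _))  // max
res1.collect()           // first action: pairSexSal is computed and cached here
res2.collect()           // reuses the cached pairSexSal partitions
pairSexSal.unpersist()   // release the cached partitions when no longer needed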
collect (on an RDD) is a Spark action, not a Scala collection operation.
take is available in Scala collections as well as in Spark.
.collect in this sense is exclusive to Spark RDDs.
scala> data.collect.foreach(println)
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
104,dd,40000,m,12
105,cccc,70000,f,13
106,de,80000,f,13
107,io,90000,m,14
108,yu,100000,f,14
109,poi,30000,m,11
110,aaa,60000,f,14
123,djdj,900000,m,15
122,asasd,10000,m,15
scala> val arr = data.map (x => x.split(","))
arr: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[80] at map at <console>:31
scala> val arr = data.map(x => x.split(","))
arr: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[84] at map at <console>:31
scala> val pair = arr.map (x => (x(3),x(2).toInt))
pair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[85] at map at <console>:33
scala> pair.collect.foreach(println)
(m,70000)
(f,90000)
(m,10000)
(m,40000)
(f,70000)
(f,80000)
(m,90000)
(f,100000)
(m,30000)
(f,60000)
(m,900000)
(m,10000)
scala> pair.persist
res87: pair.type = MapPartitionsRDD[85] at map at <console>:33
scala> val res1 = pair.reduceByKey(_+_)
res1: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[86] at reduceByKey at <console>:35
scala> val res2 = pair.reduceByKey(Math.max(_,_))
res2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[87] at reduceByKey at <console>:35
scala> val res3 = pair.reduceByKey(Math.min(_,_))
res3: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[88] at reduceByKey at <console>:35
faster execution:
min
scala> res3.collect
res88: Array[(String, Int)] = Array((f,60000), (m,10000))
max
scala> res2.collect
res89: Array[(String, Int)] = Array((f,100000), (m,900000))
sum
scala> res1.collect
res90: Array[(String, Int)] = Array((f,400000), (m,1150000))
When we come out of the session, the persistence is released.
Here we store the results into HDFS as files, but
each record is written as the tuple itself, brackets included.
scala> res1.saveAsTextFile("/user/cloudera/Sparks/RES1")
scala> res2.saveAsTextFile("/user/cloudera/Sparks/RES2")
scala> res3.saveAsTextFile("/user/cloudera/Sparks/RES3")
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/RES1/part-00000
(f,400000)
(m,1150000)
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/RES2/part-00000
(f,100000)
(m,900000)
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/RES3/part-00000
(f,60000)
(m,10000)
scala> res1.collect
res101: Array[(String, Int)] = Array((f,400000), (m,1150000))
// here we make just a string instead of tuple to write string output into a file
scala> val tempResult = res1.map { x =>
| val res = x._1 + "\t" + x._2 // make a tab-delimited string instead of a tuple with brackets
| res
| }
tempResult: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[94] at map at <console>:47
scala> tempResult
res102: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[94] at map at <console>:47
scala> tempResult.collect
res103: Array[String] = Array(f 400000, m 1150000)
// here we write string output into a file
scala> tempResult.saveAsTextFile("/user/cloudera/Sparks/R1")
check the file content (HDFS)
hdfs dfs -cat Sparks/R1/part-00000
f 400000
m 1150000
Note:
Don't write output as tuple in HDFS
Transform the RDD as string and then write into HDFS
scala> val pair2 = arr.map( x => ((x(4),x(3)),x(2).toInt))
pair2: org.apache.spark.rdd.RDD[((String, String), Int)] = MapPartitionsRDD[96] at map at <console>:33
scala> val res4 = pair2.reduceByKey(_+_)
res4: org.apache.spark.rdd.RDD[((String, String), Int)] = ShuffledRDD[97] at reduceByKey at <console>:35
scala> res4.collect.foreach(println)
((14,m),90000)
((12,f),90000)
((15,m),910000)
((14,f),160000)
((13,f),150000)
((12,m),110000)
((11,m),40000)
scala> res4.saveAsTextFile("/user/cloudera/Sparks/Re1")
[cloudera@quickstart ~]$ hdfs dfs -ls Sparks/Re1
Found 2 items
-rw-r--r-- 1 cloudera cloudera 0 2018-10-10 10:24 Sparks/Re1/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 109 2018-10-10 10:24 Sparks/Re1/part-00000
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/Re1/part-00000
((14,m),90000)
((12,f),90000)
((15,m),910000)
((14,f),160000)
((13,f),150000)
((12,m),110000)
((11,m),40000)
Here ((dno,sex),sal) ---> tuple inside a tuple as key and salary as value written into HDFS
But our required output is: 14 m 90000
We need to transform the RDD as follows to make a tab-delimited string.
//using multiline code
scala> val r4 = res4.map { x =>
| val dno = x._1._1
| val sex = x._1._2
| val sal = x._2
| dno + "\t" + sex + "\t" + sal
| }
r4: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[100] at map at <console>:37
scala> r4.collect.foreach(println)
14 m 90000
12 f 90000
15 m 910000
14 f 160000
13 f 150000
12 m 110000
11 m 40000
//using single line code
scala> val r4 = res4.map (x => x._1._1 + "\t" + x._1._2 + "\t" + x._2)
r4: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[101] at map at <console>:37
scala> r4.collect.foreach(println)
14 m 90000
12 f 90000
15 m 910000
14 f 160000
13 f 150000
12 m 110000
11 m 40000
scala> r4.saveAsTextFile("/user/cloudera/Sparks/Re4")
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/Re4/part-00000
14 m 90000
12 f 90000
15 m 910000
14 f 160000
13 f 150000
12 m 110000
11 m 40000
Make a Scala function for reuse: input x is a tuple, delim is the delimiter
scala> def pairToString(x:(String,Int),delim:String) = {
| val a = x._1
| val b = x._2
| a + delim + b
| }
pairToString: (x: (String, Int), delim: String)String
scala> res1.collect.foreach(println)
(f,400000)
(m,1150000)
// comma delimited string
scala> val Re5 = res1.map ( x => pairToString(x,","))
Re5: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[103] at map at <console>:48
scala> Re5.collect.foreach(println)
f,400000
m,1150000
//Tab Delimited string
scala> val Re6 = res1.map (x => pairToString(x,"\t"))
Re6: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[104] at map at <console>:48
scala> Re6.collect.foreach(println)
f 400000
m 1150000
saveAsTextFile --> how many files will be created in HDFS?
Number of files ==> number of partitions
scala> val myList = List(10,20,30,40,50,50,30,40,10,23)
myList: List[Int] = List(10, 20, 30, 40, 50, 50, 30, 40, 10, 23)
scala> myList.size
res119: Int = 10
scala> val rdd1 = sc.parallelize(myList)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[105] at parallelize at <console>:29
scala> rdd1.partitions.size
res121: Int = 1
scala> rdd1.saveAsTextFile("/user/cloudera/Sparks/rdd1Result")
scala> val rdd2 = sc.parallelize(myList,3)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[107] at parallelize at <console>:29
scala> rdd2.partitions.size
res123: Int = 3
scala> rdd2.saveAsTextFile("/user/cloudera/Sparks/rdd2Result")
hdfs dfs -ls Sparks
Found 14 items
drwxr-xr-x - cloudera cloudera 0 2018-10-10 10:48 Sparks/rdd1Result (single partition)
drwxr-xr-x - cloudera cloudera 0 2018-10-10 10:49 Sparks/rdd2Result (3 partitions)
rdd1 has single partition so : part-00000 (single file will be present in that folder)
[cloudera@quickstart ~]$ hdfs dfs -ls Sparks/rdd1Result
Found 2 items
-rw-r--r-- 1 cloudera cloudera 0 2018-10-10 10:48 Sparks/rdd1Result/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 30 2018-10-10 10:48 Sparks/rdd1Result/part-00000
rdd2 has 3 partitions, so part-00000, part-00001, part-00002: a total of 3 files are present there
[cloudera@quickstart ~]$ hdfs dfs -ls Sparks/rdd2Result
Found 4 items
-rw-r--r-- 1 cloudera cloudera 0 2018-10-10 10:49 Sparks/rdd2Result/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 9 2018-10-10 10:49 Sparks/rdd2Result/part-00000
-rw-r--r-- 1 cloudera cloudera 9 2018-10-10 10:49 Sparks/rdd2Result/part-00001
-rw-r--r-- 1 cloudera cloudera 12 2018-10-10 10:49 Sparks/rdd2Result/part-00002
How many files will be created when we use saveAsTextFile?
That depends on the number of partitions of the given RDD.
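A brief hedged sketch (reusing the wc word-count RDD from earlier; the output paths are just examples) showing how the file count can be controlled by changing the partition count before saving:
val wcOne = wc.coalesce(1)       // shrink to a single partition --> a single part-00000 file
wcOne.saveAsTextFile("/user/cloudera/Sparks/wcSingle")
val wcThree = wc.repartition(3)  // reshuffle into 3 partitions --> part-00000, part-00001, part-00002
wcThree.saveAsTextFile("/user/cloudera/Sparks/wcThree")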
For a single aggregation, use reduceByKey.
For multiple aggregations, use groupByKey;
it produces a CompactBuffer of the grouped values (comparable to an iterator in Java).
scala> val data = sc.textFile("/user/cloudera/Sparks/emp")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/emp MapPartitionsRDD[112] at textFile at <console>:27
scala> val arr = data.map (x => x.split(","))
arr: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[113] at map at <console>:31
scala> val pair1 = arr.map (x => (x(3),x(2).toInt))
pair1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[114] at map at <console>:33
scala> pair1.collect.foreach(println)
(m,70000)
(f,90000)
(m,10000)
(m,40000)
(f,70000)
(f,80000)
(m,90000)
(f,100000)
(m,30000)
(f,60000)
(m,900000)
(m,10000)
Going to apply multiple aggregations
scala> val grp = pair1.groupByKey() // always first one is key
grp: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[115] at groupByKey at <console>:35
// for male one tuple, female another tuple
scala> grp.collect.foreach(println)
(f,CompactBuffer(90000, 70000, 80000, 100000, 60000))
(m,CompactBuffer(70000, 10000, 40000, 90000, 30000, 900000, 10000))
// single grouping column but multiple aggregations
scala> val res = grp.map{ x =>
| val sex = x._1
| val cb = x._2
| val tot = cb.sum
| val cnt = cb.size
| val avg = tot / cnt
| val max = cb.max
| val min = cb.min
| val result = sex + "," + tot + "," + cnt + "," + avg + "," + max + "," + min
| result
| }
res: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[116] at map at <console>:37
scala> res
res127: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[116] at map at <console>:37
scala> res.collect.foreach(println)
f,400000,5,80000,100000,60000
m,1150000,7,164285,900000,10000
scala> res.saveAsTextFile("/user/cloudera/Sparks/res100")
[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/Sparks/res100/part-00000
f,400000,5,80000,100000,60000
m,1150000,7,164285,900000,10000
scala> data.collect
res130: Array[String] = Array(101,aaaa,70000,m,12, 102,bbbbb,90000,f,12, 103,cc,10000,m,11, 104,dd,40000,m,12, 105,cccc,70000,f,13, 106,de,80000,f,13, 107,io,90000,m,14, 108,yu,100000,f,14, 109,poi,30000,m,11, 110,aaa,60000,f,14, 123,djdj,900000,m,15, 122,asasd,10000,m,15)
scala> data.collect.foreach(println)
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
104,dd,40000,m,12
105,cccc,70000,f,13
106,de,80000,f,13
107,io,90000,m,14
108,yu,100000,f,14
109,poi,30000,m,11
110,aaa,60000,f,14
123,djdj,900000,m,15
122,asasd,10000,m,15
scala> arr.collect
res132: Array[Array[String]] = Array(Array(101, aaaa, 70000, m, 12), Array(102, bbbbb, 90000, f, 12), Array(103, cc, 10000, m, 11), Array(104, dd, 40000, m, 12), Array(105, cccc, 70000, f, 13), Array(106, de, 80000, f, 13), Array(107, io, 90000, m, 14), Array(108, yu, 100000, f, 14), Array(109, poi, 30000, m, 11), Array(110, aaa, 60000, f, 14), Array(123, djdj, 900000, m, 15), Array(122, asasd, 10000, m, 15))
scala> val pair2 = arr.map(x => ( (x(4),x(3)),x(2).toInt))
pair2: org.apache.spark.rdd.RDD[((String, String), Int)] = MapPartitionsRDD[119] at map at <console>:33
scala> pair2.collect.foreach(println)
((12,m),70000)
((12,f),90000)
((11,m),10000)
((12,m),40000)
((13,f),70000)
((13,f),80000)
((14,m),90000)
((14,f),100000)
((11,m),30000)
((14,f),60000)
((15,m),900000)
((15,m),10000)
scala> val grp2 = pair2.groupByKey()
grp2: org.apache.spark.rdd.RDD[((String, String), Iterable[Int])] = ShuffledRDD[120] at groupByKey at <console>:35
// multiple compact buffers for each key grouped together
scala> grp2.collect.foreach(println)
((14,m),CompactBuffer(90000))
((12,f),CompactBuffer(90000))
((15,m),CompactBuffer(900000, 10000))
((14,f),CompactBuffer(100000, 60000))
((13,f),CompactBuffer(70000, 80000))
((12,m),CompactBuffer(70000, 40000))
((11,m),CompactBuffer(10000, 30000))
val res2 = grp2.map { x =>
val k = x._1
val dno = k._1
val sex = k._2
val cb = x._2
val tot = cb.sum
val cnt = cb.size
val avg = tot / cnt
val max = cb.max
val min = cb.min
(dno,sex,tot,cnt,avg,max,min)
}
scala> res2.collect.foreach(println)
(14,m,90000,1,90000,90000,90000)
(12,f,90000,1,90000,90000,90000)
(15,m,910000,2,455000,900000,10000)
(14,f,160000,2,80000,100000,60000)
(13,f,150000,2,75000,80000,70000)
(12,m,110000,2,55000,70000,40000)
(11,m,40000,2,20000,30000,10000)
reduceByKey provides better performance than groupByKey
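A hedged alternative sketch (not from the original notes): the same multi-aggregation can also be done with reduceByKey by carrying (sum, count, max, min) as the value, which avoids building a CompactBuffer per key. pair1 is the (sex, sal) RDD built above.
val seeded = pair1.map { case (sex, sal) => (sex, (sal, 1, sal, sal)) }   // (sum, count, max, min) seeds
val stats = seeded.reduceByKey { (a, b) =>
  (a._1 + b._1, a._2 + b._2, Math.max(a._3, b._3), Math.min(a._4, b._4))
}
val statLines = stats.map { case (sex, (tot, cnt, mx, mn)) =>
  sex + "," + tot + "," + cnt + "," + (tot / cnt) + "," + mx + "," + mn
}
statLines.collect.foreach(println)   // same fields as the groupByKey result above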
// select sum(sal) from emp
// select sum(sal),avg(sal),count(*),max(sal),min(sal) from emp;
// we need the aggregations without grouping for all employees
scala> data.collect
res137: Array[String] = Array(101,aaaa,70000,m,12, 102,bbbbb,90000,f,12, 103,cc,10000,m,11, 104,dd,40000,m,12, 105,cccc,70000,f,13, 106,de,80000,f,13, 107,io,90000,m,14, 108,yu,100000,f,14, 109,poi,30000,m,11, 110,aaa,60000,f,14, 123,djdj,900000,m,15, 122,asasd,10000,m,15)
scala> arr.collect
res138: Array[Array[String]] = Array(Array(101, aaaa, 70000, m, 12), Array(102, bbbbb, 90000, f, 12), Array(103, cc, 10000, m, 11), Array(104, dd, 40000, m, 12), Array(105, cccc, 70000, f, 13), Array(106, de, 80000, f, 13), Array(107, io, 90000, m, 14), Array(108, yu, 100000, f, 14), Array(109, poi, 30000, m, 11), Array(110, aaa, 60000, f, 14), Array(123, djdj, 900000, m, 15), Array(122, asasd, 10000, m, 15))
scala>
scala> val sals = arr.map (x => x(2).toInt)
sals: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[122] at map at <console>:33
scala> sals.collect.foreach(println)
70000
90000
10000
40000
70000
80000
90000
100000
30000
60000
900000
10000
scala> val tot = sals.sum
tot: Double = 1550000.0
scala> val tot = sals.reduce(_+_)
tot: Int = 1550000
scala> val cnt = sals.count
cnt: Long = 12
scala> val avg = tot / cnt
avg: Long = 129166
scala> val max = sals.max
max: Int = 900000
scala> val min = sals.min
min: Int = 10000
scala> val tot = sals.reduce(_+_)
tot: Int = 1550000
scala> val cnt = sals.count
cnt: Long = 12
scala> val avg = tot / cnt
avg: Long = 129166
scala> val max = sals.reduce(Math.max)
max: Int = 900000
scala> val max = sals.reduce(Math.max(_,_))
max: Int = 900000
scala> val min = sals.reduce(Math.min)
min: Int = 10000
scala> val min = sals.reduce(Math.min(_,_))
min: Int = 10000
reduce works on each partition, in the RAM of every machine of the cluster where the data is already parallelized
-- better performance
-- parallelism achieved
sum works in a non-parallel manner: it collects the data from the RDD onto a single machine, which creates a burden
-- worse performance
-- no parallelism
val lst = sc.parallelize(List(10,20,30,40,50,60,70,80,90,100),2)
rdd --> lst
has 2 partitions
partition 1 -> List(10,20,30,40,50)
partition 2 -> List(60,70,80,90,100)
lst.sum --> all data from all partitions is collected locally, then the sum is computed locally (non-parallel)
lst.reduce(_+_)
the operation executes on the cluster, in every partition, wherever the data resides
partition 1 result = 150
partition 2 result = 400
finally the independent results of the partitions are combined,
List(150,400) ==> 550
and this final result 550 is sent back to the client machine that asked for it
grouping aggregation --> reduceByKey
entire collection's sum or aggregation --> reduce
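A small illustrative sketch of the partition layout described above (glom() just gathers each partition into an array so we can look at it):
val lst = sc.parallelize(List(10,20,30,40,50,60,70,80,90,100), 2)
lst.glom().collect()   // e.g. Array(Array(10,20,30,40,50), Array(60,70,80,90,100))
lst.reduce(_ + _)      // per-partition results 150 and 400 are combined into 550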
scala> val res = (tot,cnt,avg,max,min)
res: (Int, Long, Long, Int, Int) = (1550000,12,129166,900000,10000)
scala> val r = List(tot,cnt,avg,max,min).mkString("\t")
r: String = 1550000 12 129166 900000 10000
scala> r
res142: String = 1550000 12 129166 900000 10000
scala> data.collect
res143: Array[String] = Array(101,aaaa,70000,m,12, 102,bbbbb,90000,f,12, 103,cc,10000,m,11, 104,dd,40000,m,12, 105,cccc,70000,f,13, 106,de,80000,f,13, 107,io,90000,m,14, 108,yu,100000,f,14, 109,poi,30000,m,11, 110,aaa,60000,f,14, 123,djdj,900000,m,15, 122,asasd,10000,m,15)
scala> data.take(3).foreach(println)
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
scala> val res = data.map { x =>
val w = x.trim().split(",")
val id = w(0)
val name = w(1).toLowerCase
val fc = name.slice(0,1).toUpperCase
val rc = name.slice(1,name.size).toLowerCase
val sal = w(2).toInt
val grade = if (sal >= 70000) "A" else
if (sal >= 50000) "B" else
if (sal >= 30000) "C" else "D"
val dno = w(4).toInt
val dname = dno match {
case 11 => "Marketing"
case 12 => "HR"
case 13 => "Finance"
case others => "Others"
}
var sex = w(3).toLowerCase
sex = if (sex =="f") "Female" else "Male"
val Name = fc + rc
List(id,Name,w(2),grade,sex,dname).mkString("\t")
}
res: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at map at <console>:29
scala> res.collect
res2: Array[String] = Array(101 Aaaa 70000 A Male HR, 102 Bbbbb 90000 A Female HR, 103 Cc 10000 D Male Marketing, 104 Dd 40000 C Male HR, 105 Cccc 70000 A Female Finance, 106 De 80000 A Female Finance, 107 Io 90000 A Male Others, 108 Yu 100000 A Female Others, 109 Poi 30000 C Male Marketing, 110 Aaa 60000 B Female Others, 123 Djdj 900000 A Male Others, 122 Asasd 10000 D Male Others)
scala> res.collect.foreach(println)
101 Aaaa 70000 A Male HR
102 Bbbbb 90000 A Female HR
103 Cc 10000 D Male Marketing
104 Dd 40000 C Male HR
105 Cccc 70000 A Female Finance
106 De 80000 A Female Finance
107 Io 90000 A Male Others
108 Yu 100000 A Female Others
109 Poi 30000 C Male Marketing
110 Aaa 60000 B Female Others
123 Djdj 900000 A Male Others
122 Asasd 10000 D Male Others
// select sum(sal) from emp where sex ="m"
// writing a function to find gender
scala> def isMale(x:String) = {
| val w = x.split(",")
| val sex = w(3).toLowerCase
| sex =="m"
| }
isMale: (x: String)Boolean
scala> var males = data.filter(x => isMale(x))
males: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at filter at <console>:31
scala> males.collect.foreach(println)
101,aaaa,70000,m,12
103,cc,10000,m,11
104,dd,40000,m,12
107,io,90000,m,14
109,poi,30000,m,11
123,djdj,900000,m,15
122,asasd,10000,m,15
filtered data: it contains just the male employees' records
scala> val sals = males.map ( x => x.split(",")(2).toInt)
sals: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at map at <console>:33
scala> sals.collect
res7: Array[Int] = Array(70000, 10000, 40000, 90000, 30000, 900000, 10000)
scala> sals.reduce(_+_)
res8: Int = 1150000
// Maximum salary of female employee collection
scala> val fems = data.filter(x => !isMale(x))
fems: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at filter at <console>:31
scala> fems.collect.foreach(println)
102,bbbbb,90000,f,12
105,cccc,70000,f,13
106,de,80000,f,13
108,yu,100000,f,14
110,aaa,60000,f,14
scala> val maxOfFemale = fems.map (x => x.split(",")(2).toInt).reduce(Math.max(_,_))
maxOfFemale: Int = 100000
Merging RDDs (Union)
-------------------
scala> val l1 = List(10,20,30,50,80)
l1: List[Int] = List(10, 20, 30, 50, 80)
scala> val l2 = List(20,30,10,90,200)
l2: List[Int] = List(20, 30, 10, 90, 200)
scala> val r1 = sc.parallelize(l1)
r1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:29
scala> val r2 = sc.parallelize(l2)
r2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:29
scala> r1.collect
res11: Array[Int] = Array(10, 20, 30, 50, 80)
scala> r2.collect
res12: Array[Int] = Array(20, 30, 10, 90, 200)
scala> val r = r1.union(r2)
r: org.apache.spark.rdd.RDD[Int] = UnionRDD[11] at union at <console>:35
scala> r.count
res13: Long = 10
scala> r.collect // which merges 2 RDDs with duplicate values (UNION ALL)
res14: Array[Int] = Array(10, 20, 30, 50, 80, 20, 30, 10, 90, 200)
Merge 2 RDDs with duplicates then later eliminate duplicates
scala> val r3 = sc.parallelize(List(1,2,3,4,5,10,80,20))
r3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:27
//Merging more than 2 RDDs
scala> val result = r1.union(r2).union(r3)
result: org.apache.spark.rdd.RDD[Int] = UnionRDD[17] at union at <console>:37
scala> result.collect
res16: Array[Int] = Array(10, 20, 30, 50, 80, 20, 30, 10, 90, 200, 1, 2, 3, 4, 5, 10, 80, 20)
scala> result.count
res17: Long = 18
scala> val re1 = r1 ++ r2 // Merging, similar to r1.union(r2)
re1: org.apache.spark.rdd.RDD[Int] = UnionRDD[18] at $plus$plus at <console>:35
scala> re1.collect
res18: Array[Int] = Array(10, 20, 30, 50, 80, 20, 30, 10, 90, 200)
scala> val re2 = r1 ++ r2 ++ r3 // similar to : r1.union(r2).union(r3)
re2: org.apache.spark.rdd.RDD[Int] = UnionRDD[20] at $plus$plus at <console>:37
scala> re2.collect
res19: Array[Int] = Array(10, 20, 30, 50, 80, 20, 30, 10, 90, 200, 1, 2, 3, 4, 5, 10, 80, 20)
scala> val data = sc.parallelize(List(10,20,10,20,30,20,10,10))
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[21] at parallelize at <console>:27
scala> data.collect
res20: Array[Int] = Array(10, 20, 10, 20, 30, 20, 10, 10)
scala> val data2 = data.distinct // avoid or eliminate duplicate
data2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[24] at distinct at <console>:29
scala> data2.collect
res21: Array[Int] = Array(30, 20, 10)
//Duplicates eliminated
scala> re1.distinct.collect
res26: Array[Int] = Array(200, 80, 30, 50, 90, 20, 10)
//Duplicates eliminated
scala> re2.distinct.collect
res27: Array[Int] = Array(30, 90, 3, 4, 1, 10, 200, 80, 50, 20, 5, 2)
//Duplicates included
scala> re1.collect
res28: Array[Int] = Array(10, 20, 30, 50, 80, 20, 30, 10, 90, 200)
//Duplicates included
scala> re2.collect
res29: Array[Int] = Array(10, 20, 30, 50, 80, 20, 30, 10, 90, 200, 1, 2, 3, 4, 5, 10, 80, 20)
scala> val x = sc.parallelize(List("A","B","c","D"))
x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[31] at parallelize at <console>:27
scala> val y = sc.parallelize(List("A","c","M","N"))
y: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[32] at parallelize at <console>:27
scala> val z = x ++ y
z: org.apache.spark.rdd.RDD[String] = UnionRDD[33] at $plus$plus at <console>:31
//with duplicates
scala> z.collect
res30: Array[String] = Array(A, B, c, D, A, c, M, N)
//without duplicates
scala> z.distinct.collect
res31: Array[String] = Array(B, N, D, M, A, c)
Cross Join - Cartesian Join
Each element of the left-side RDD will join with each element of the right-side RDD
key,value pair
scala> val pair1 = sc.parallelize(Array(("p1",10000),("p2",1000),("p2",20000),("p2",50000),("p3",60000)))
pair1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[39] at parallelize at <console>:27
scala> val pair2 = sc.parallelize(Array(("p1",20000),("p2",50000),("p1",10000)))
pair2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[40] at parallelize at <console>:27
//cross join goes here
scala> val cr = pair1.cartesian(pair2)
cr: org.apache.spark.rdd.RDD[((String, Int), (String, Int))] = CartesianRDD[41] at cartesian at <console>:31
pair1 count: 5, pair2 count: 3
the cross join results in 5 x 3 = 15 elements
scala> cr.collect.foreach(println)
((p1,10000),(p1,20000))
((p1,10000),(p2,50000))
((p1,10000),(p1,10000))
((p2,1000),(p1,20000))
((p2,1000),(p2,50000))
((p2,1000),(p1,10000))
((p2,20000),(p1,20000))
((p2,20000),(p2,50000))
((p2,20000),(p1,10000))
((p2,50000),(p1,20000))
((p2,50000),(p2,50000))
((p2,50000),(p1,10000))
((p3,60000),(p1,20000))
((p3,60000),(p2,50000))
((p3,60000),(p1,10000))
cartesian against 2 lists
//count is : 4
scala> val rdd1 = sc.parallelize(List(10,20,30,40))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[42] at parallelize at <console>:27
//count is : 2
scala> val rdd2 = sc.parallelize(List(10,200))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[43] at parallelize at <console>:27
scala> val result = rdd1.cartesian(rdd2)
result: org.apache.spark.rdd.RDD[(Int, Int)] = CartesianRDD[44] at cartesian at <console>:31
//count : 4 x 2 = 8
scala> result.collect.foreach(println)
(10,10)
(10,200)
(20,10)
(20,200)
(30,10)
(30,200)
(40,10)
(40,200)
// Taking data from emp in hdfs
scala> val data = sc.textFile("/user/cloudera/Sparks/emp")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/emp MapPartitionsRDD[46] at textFile at <console>:27
scala> val dpair = data.map { x =>
| val w = x.split(",")
| val dno = w(4)
| val sal = w(2).toInt
| (dno,sal)
| }
dpair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[47] at map at <console>:31
scala> dpair.collect.foreach(println)
(12,70000)
(12,90000)
(11,10000)
(12,40000)
(13,70000)
(13,80000)
(14,90000)
(14,100000)
(11,30000)
(14,60000)
(15,900000)
(15,10000)
scala> val dres = dpair.reduceByKey(_+_)
dres: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[48] at reduceByKey at <console>:33
// grouped aggregations
scala> dres.collect.foreach(println)
(14,250000)
(15,910000)
(12,200000)
(13,150000)
(11,40000)
// another reference to dres (so we can cross-join it with itself)
scala> val dres2 = dres
dres2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[48] at reduceByKey at <console>:33
// performing cartesian join here
scala> val cr = dres.cartesian(dres2)
cr: org.apache.spark.rdd.RDD[((String, Int), (String, Int))] = CartesianRDD[49] at cartesian at <console>:37
//cross join
scala> val cr = dres.cartesian(dres2)
cr: org.apache.spark.rdd.RDD[((String, Int), (String, Int))] = CartesianRDD[49] at cartesian at <console>:37
scala>
scala> cr.collect.foreach(println)
((14,250000),(14,250000))
((14,250000),(15,910000))
((14,250000),(12,200000))
((14,250000),(13,150000))
((14,250000),(11,40000))
((15,910000),(14,250000))
((15,910000),(15,910000))
((15,910000),(12,200000))
((15,910000),(13,150000))
((15,910000),(11,40000))
((12,200000),(14,250000))
((12,200000),(15,910000))
((12,200000),(12,200000))
((12,200000),(13,150000))
((12,200000),(11,40000))
((13,150000),(14,250000))
((13,150000),(15,910000))
((13,150000),(12,200000))
((13,150000),(13,150000))
((13,150000),(11,40000))
((11,40000),(14,250000))
((11,40000),(15,910000))
((11,40000),(12,200000))
((11,40000),(13,150000))
((11,40000),(11,40000))
val cr2 = cr.map { x =>
val t1 = x._1
val t2 = x._2
val dno1 = t1._1
val tot1 = t1._2
val dno2 = t2._1
val tot2 = t2._2
(dno1,dno2,tot1,tot2)
}
scala> cr2.collect.foreach(println)
(14,14,250000,250000) // reject this
(14,15,250000,910000)
(14,12,250000,200000)
(14,13,250000,150000)
(14,11,250000,40000)
(15,14,910000,250000)
(15,15,910000,910000) // reject this
(15,12,910000,200000)
(15,13,910000,150000)
(15,11,910000,40000)
(12,14,200000,250000)
(12,15,200000,910000)
(12,12,200000,200000) // reject this
(12,13,200000,150000)
(12,11,200000,40000)
(13,14,150000,250000)
(13,15,150000,910000)
(13,12,150000,200000)
(13,13,150000,150000) // reject this
(13,11,150000,40000)
(11,14,40000,250000)
(11,15,40000,910000)
(11,12,40000,200000)
(11,13,40000,150000)
(11,11,40000,40000) // reject this
if dno1 == dno2 then reject that record;
we want to eliminate comparisons of a dept with itself
// keep the record only if dno1 != dno2
scala> val cr3 = cr2.filter( x => x._1 != x._2)
cr3: org.apache.spark.rdd.RDD[(String, String, Int, Int)] = MapPartitionsRDD[51] at filter at <console>:41
// which dept's salary is greater than to which other dept's salary
scala> cr3.collect.foreach(println)
(14,15,250000,910000)
(14,12,250000,200000)
(14,13,250000,150000)
(14,11,250000,40000)
(15,14,910000,250000)
(15,12,910000,200000)
(15,13,910000,150000)
(15,11,910000,40000)
(12,14,200000,250000)
(12,15,200000,910000)
(12,13,200000,150000)
(12,11,200000,40000)
(13,14,150000,250000)
(13,15,150000,910000)
(13,12,150000,200000)
(13,11,150000,40000)
(11,14,40000,250000)
(11,15,40000,910000)
(11,12,40000,200000)
(11,13,40000,150000)
// we want only the rows where tot1 >= tot2
scala> val cr4 = cr3.filter(x => x._3 >= x._4)
cr4: org.apache.spark.rdd.RDD[(String, String, Int, Int)] = MapPartitionsRDD[52] at filter at <console>:43
scala> cr4.collect.foreach(println)
(14,12,250000,200000)
(14,13,250000,150000)
(14,11,250000,40000)
(15,14,910000,250000)
(15,12,910000,200000)
(15,13,910000,150000)
(15,11,910000,40000)
(12,13,200000,150000)
(12,11,200000,40000)
(13,11,150000,40000)
scala> val cr5 = cr4.map (x => (x._1,1))
cr5: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[53] at map at <console>:45
scala> cr5.collect.foreach(println)
(14,1)
(14,1)
(14,1)
(15,1)
(15,1)
(15,1)
(15,1)
(12,1)
(12,1)
(13,1)
scala> val finalres = cr5.reduceByKey(_+_)
finalres: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[54] at reduceByKey at <console>:47
scala> finalres.collect.foreach(println)
(14,3)
(15,4)
(12,2)
(13,1)
scala> data.collect.foreach(println)
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
104,dd,40000,m,12
105,cccc,70000,f,13
106,de,80000,f,13
107,io,90000,m,14
108,yu,100000,f,14
109,poi,30000,m,11
110,aaa,60000,f,14
123,djdj,900000,m,15
122,asasd,10000,m,15
We are able to compare one key with all the other keys;
rows with the same key on both sides are filtered out.
Question: each dept's total salary is greater than how many other depts' total salaries?
create a sales data file in local
[cloudera@quickstart ~]$ gedit sales
[cloudera@quickstart ~]$ cat sales
01/01/2016,30000
01/05/2016,80000
01/30/2016,90000
02/01/2016,20000
02/25/2016,48000
03/01/2016,22000
03/05/2016,89000
03/30/2016,91000
04/01/2016,100000
04/25/2016,71000
05/01/2016,31500
06/05/2016,86600
07/30/2016,92000
08/01/2016,32000
09/25/2016,43000
09/01/2016,32300
10/05/2016,85000
10/30/2016,80000
11/01/2016,70300
11/25/2016,50000
12/01/2016,30000
12/05/2016,20200
//copy the file into hdfs
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal sales Sparks
check the data
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/sales
01/01/2016,30000
01/05/2016,80000
01/30/2016,90000
02/01/2016,20000
02/25/2016,48000
03/01/2016,22000
03/05/2016,89000
03/30/2016,91000
04/01/2016,100000
04/25/2016,71000
05/01/2016,31500
06/05/2016,86600
07/30/2016,92000
08/01/2016,32000
09/25/2016,43000
09/01/2016,32300
10/05/2016,85000
10/30/2016,80000
11/01/2016,70300
11/25/2016,50000
12/01/2016,30000
12/05/2016,20200
//make RDD - read file and put the data in RDD
scala> val sales = sc.textFile("/user/cloudera/Sparks/sales")
sales: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/sales MapPartitionsRDD[56] at textFile at <console>:27
scala> sales.collect.foreach(println)
01/01/2016,30000
01/05/2016,80000
01/30/2016,90000
02/01/2016,20000
02/25/2016,48000
03/01/2016,22000
03/05/2016,89000
03/30/2016,91000
04/01/2016,100000
04/25/2016,71000
05/01/2016,31500
06/05/2016,86600
07/30/2016,92000
08/01/2016,32000
09/25/2016,43000
09/01/2016,32300
10/05/2016,85000
10/30/2016,80000
11/01/2016,70300
11/25/2016,50000
12/01/2016,30000
12/05/2016,20200
val pair = sales.map { x =>
val w = x.split(",")
val dt = w(0)
val pr = w(1).toInt
val m = dt.slice(0,2).toInt
(m,pr)
}
scala> pair.collect
res6: Array[(Int, Int)] = Array((1,30000), (1,80000), (1,90000), (2,20000), (2,48000), (3,22000), (3,89000), (3,91000), (4,10000), (5,31500), (6,86600), (7,92000), (8,32000), (9,43000), (9,32300), (10,85000), (10,80000), (11,70300), (11,50000), (12,30000), (12,20200))
scala> pair.collect.foreach(println)
(1,30000)
(1,80000)
(1,90000)
(2,20000)
(2,48000)
(3,22000)
(3,89000)
(3,91000)
(4,10000)
(5,31500)
(6,86600)
(7,92000)
(8,32000)
(9,43000)
(9,32300)
(10,85000)
(10,80000)
(11,70300)
(11,50000)
(12,30000)
(12,20200)
scala> val rep = pair.reduceByKey(_+_)
rep: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[12] at reduceByKey at <console>:31
scala> rep.collect.foreach(println)
(4,10000)
(11,120300)
(1,200000)
(6,86600)
(3,202000)
(7,92000)
(9,75300)
(8,32000)
(12,50200)
(10,165000)
(5,31500)
(2,68000)
// make ascending order
scala> val res = rep.sortByKey()
res: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[14] at sortByKey at <console>:33
scala> res.collect.foreach(println)
(1,200000)
(2,68000)
(3,202000)
(4,10000)
(5,31500)
(6,86600)
(7,92000)
(8,32000)
(9,75300)
(10,165000)
(11,120300)
(12,50200)
Compare each month's sales with the previous month's sales:
did sales increase or decrease, and by what percentage?
Every month's sales has to be compared with its previous month's sales,
so a cartesian join is needed.
scala> val res2 = res
res2: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[14] at sortByKey at <console>:33
scala> val cr = res.cartesian(res2)
cr: org.apache.spark.rdd.RDD[((Int, Int), (Int, Int))] = CartesianRDD[15] at cartesian at <console>:37
scala> cr.take(5).foreach(println)
((1,200000),(1,200000))
((1,200000),(2,68000))
((1,200000),(3,202000))
((1,200000),(4,10000))
((1,200000),(5,31500))
scala> val cr2 = cr.map{ x =>
| val t1 = x._1
| val t2 = x._2
| val m1 = t1._1
| val tot1 = t1._2
| val m2 = t2._1
| val tot2 = t2._2
| (m1,m2,tot1,tot2)
| }
cr2: org.apache.spark.rdd.RDD[(Int, Int, Int, Int)] = MapPartitionsRDD[16] at map at <console>:39
scala> cr2.collect.foreach(println)
(1,1,200000,200000)
(1,2,200000,68000)
(1,3,200000,202000)
(1,4,200000,10000)
(1,5,200000,31500)
(1,6,200000,86600)
(1,7,200000,92000)
(1,8,200000,32000)
(1,9,200000,75300)
(1,10,200000,165000)
(1,11,200000,120300)
(1,12,200000,50200)
(2,1,68000,200000)
(2,2,68000,68000)
(2,3,68000,202000)
(2,4,68000,10000)
(2,5,68000,31500)
(2,6,68000,86600)
(2,7,68000,92000)
(2,8,68000,32000)
(2,9,68000,75300)
(2,10,68000,165000)
(2,11,68000,120300)
(2,12,68000,50200)
(3,1,202000,200000)
(3,2,202000,68000)
(3,3,202000,202000)
(3,4,202000,10000)
(3,5,202000,31500)
(3,6,202000,86600)
(3,7,202000,92000)
(3,8,202000,32000)
(3,9,202000,75300)
(3,10,202000,165000)
(3,11,202000,120300)
(3,12,202000,50200)
(4,1,10000,200000)
(4,2,10000,68000)
(4,3,10000,202000)
(4,4,10000,10000)
(4,5,10000,31500)
(4,6,10000,86600)
(4,7,10000,92000)
(4,8,10000,32000)
(4,9,10000,75300)
(4,10,10000,165000)
(4,11,10000,120300)
(4,12,10000,50200)
(5,1,31500,200000)
(5,2,31500,68000)
(5,3,31500,202000)
(5,4,31500,10000)
(5,5,31500,31500)
(5,6,31500,86600)
(5,7,31500,92000)
(5,8,31500,32000)
(5,9,31500,75300)
(5,10,31500,165000)
(5,11,31500,120300)
(5,12,31500,50200)
(6,1,86600,200000)
(6,2,86600,68000)
(6,3,86600,202000)
(6,4,86600,10000)
(6,5,86600,31500)
(6,6,86600,86600)
(6,7,86600,92000)
(6,8,86600,32000)
(6,9,86600,75300)
(6,10,86600,165000)
(6,11,86600,120300)
(6,12,86600,50200)
(7,1,92000,200000)
(7,2,92000,68000)
(7,3,92000,202000)
(7,4,92000,10000)
(7,5,92000,31500)
(7,6,92000,86600)
(7,7,92000,92000)
(7,8,92000,32000)
(7,9,92000,75300)
(7,10,92000,165000)
(7,11,92000,120300)
(7,12,92000,50200)
(8,1,32000,200000)
(8,2,32000,68000)
(8,3,32000,202000)
(8,4,32000,10000)
(8,5,32000,31500)
(8,6,32000,86600)
(8,7,32000,92000)
(8,8,32000,32000)
(8,9,32000,75300)
(8,10,32000,165000)
(8,11,32000,120300)
(8,12,32000,50200)
(9,1,75300,200000)
(9,2,75300,68000)
(9,3,75300,202000)
(9,4,75300,10000)
(9,5,75300,31500)
(9,6,75300,86600)
(9,7,75300,92000)
(9,8,75300,32000)
(9,9,75300,75300)
(9,10,75300,165000)
(9,11,75300,120300)
(9,12,75300,50200)
(10,1,165000,200000)
(10,2,165000,68000)
(10,3,165000,202000)
(10,4,165000,10000)
(10,5,165000,31500)
(10,6,165000,86600)
(10,7,165000,92000)
(10,8,165000,32000)
(10,9,165000,75300)
(10,10,165000,165000)
(10,11,165000,120300)
(10,12,165000,50200)
(11,1,120300,200000)
(11,2,120300,68000)
(11,3,120300,202000)
(11,4,120300,10000)
(11,5,120300,31500)
(11,6,120300,86600)
(11,7,120300,92000)
(11,8,120300,32000)
(11,9,120300,75300)
(11,10,120300,165000)
(11,11,120300,120300)
(11,12,120300,50200)
(12,1,50200,200000)
(12,2,50200,68000)
(12,3,50200,202000)
(12,4,50200,10000)
(12,5,50200,31500)
(12,6,50200,86600)
(12,7,50200,92000)
(12,8,50200,32000)
(12,9,50200,75300)
(12,10,50200,165000)
(12,11,50200,120300)
(12,12,50200,50200)
Here the cartesian join pairs Jan with all other 11 months,
Feb with all other 11 months,
... and Dec with all other 11 months,
but our goal is to compare each month only with its previous month.
We need to compare the current month (e.g. Oct) with its previous month (Sep) only,
so the filter condition should be
currentMonth - previousMonth == 1
which filters out everything except pairs like (Feb,Jan) ... (Dec,Nov).
scala> val cr3 = cr2.filter (x => x._1 - x._2 == 1)
cr3: org.apache.spark.rdd.RDD[(Int, Int, Int, Int)] = MapPartitionsRDD[18] at filter at <console>:41
scala> cr3.count
res16: Long = 11
scala> cr3.collect.foreach(println)
(2,1,68000,200000)
(3,2,202000,68000)
(4,3,10000,202000)
(5,4,31500,10000)
(6,5,86600,31500)
(7,6,92000,86600)
(8,7,32000,92000)
(9,8,75300,32000)
(10,9,165000,75300)
(11,10,120300,165000)
(12,11,50200,120300)
How much percentage growth for each month?
scala> val finalres = cr3.map { x =>
| val m1 = x._1
| val m2 = x._2
| val tot1 = x._3
| val tot2 = x._4
| val pgrowth = ( (tot1 - tot2) * 100) / tot2
| (m1,m2,tot1,tot2,pgrowth)
| }
finalres: org.apache.spark.rdd.RDD[(Int, Int, Int, Int, Int)] = MapPartitionsRDD[19] at map at <console>:43
// percentage by which sales increased or decreased compared with the previous month's sales
scala> finalres.collect.foreach(println)
(2,1,68000,200000,-66)
(3,2,202000,68000,197)
(4,3,10000,202000,-95)
(5,4,31500,10000,215)
(6,5,86600,31500,174)
(7,6,92000,86600,6)
(8,7,32000,92000,-65)
(9,8,75300,32000,135)
(10,9,165000,75300,119)
(11,10,120300,165000,-27)
(12,11,50200,120300,-58)
Dec,Nov,DecSales,NovSales,SalesGrowth (+/-)
Exercise: quarterly sales report comparison instead of monthly
Q1,Q2,Q3,Q4 (a sketch follows)
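A hedged sketch for this exercise, reusing the (month, amount) pair RDD built from the sales file above; the quarter formula (month-1)/3 + 1 is the usual mapping.
val qpair = pair.map { case (m, amt) => (((m - 1) / 3) + 1, amt) }   // month 1..12 --> quarter 1..4
val qtot  = qpair.reduceByKey(_ + _).sortByKey()
qtot.collect.foreach(println)   // (1,Q1 total) (2,Q2 total) (3,Q3 total) (4,Q4 total)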
Spark SQL:
It is a library for processing Spark data objects using SQL statements (similar to a MySQL-style select).
Spark SQL follows a MySQL-like SQL syntax.
Spark SQL provides
SQLContext, HiveContext (data warehouse)
DataFrame, Dataset, temp tables
If the data is in a Hive table, we have to use HiveContext.
SparkContext (sc)
SparkStreamingContext
SQLContext
HiveContext
import org.apache.spark.sql.SQLContext
val sqlCon = new SQLContext(sc)
A SQLContext instance (sqlContext) is available by default in the Spark shell;
when writing Spark programs in an IDE we need to create the SQLContext instance ourselves.
Using SQLContext,
we can process Spark objects using select statements
Using HiveContext,
we can integrate Hive with Spark.
Hive is a datawarehouse environment in hadoop framework
Data is stored and managed at Hive but processed in Spark
All valid Hive Queries are available in HiveContext
Using HiveContext we can access entire Hive environment (hive tables) from Spark
HQL statement vs Hive
---------------------
If HQL is executed within the Hive environment, the statements are converted into a MapReduce job and
that MapReduce job is executed (performance issues: disk I/O, no in-memory computing).
If the same Hive is integrated with Spark and the HQL is submitted from Spark,
it uses the DAG and in-memory computing model, with persistence.
Benefits: persistence, in-memory computing, customized parallel processing.
Spark SQL limitations:
It is applicable only to structured data.
If the data is unstructured,
it needs to be processed with Spark Core's RDD API and Spark MLlib (NLP algorithms; libraries such as NLTK can complement Spark MLlib).
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
val sqc = new SQLContext(sc)
Steps to work with SQLContext:
#1: val data = sc.textFile("/user/cloudera/Sparks/file1")
Load data into RDD
sample - file1
100,200,300
300,400,400
...
#2: Provide schema to the RDD (Create case class)
case class Rec(a:Int, b:Int, c:Int)
#3: Create a function to convert a raw line into a case object
(Function to provide schema)
def makeRec(line:String) = {
val w = line.split(",")
val a = w(0).toInt
val b = w(1).toInt
val c = w(2).toInt
val r = Rec(a,b,c)
r
}
(In order to work with SQL, we definitely need schema)
#4 : Transform each record into case object
val recs = data.map (x => makeRec(x))
#5: convert RDD into Data Frame
val df = recs.toDF (To Data Frame)
#6: Create table instance for the data frame
df.registerTempTable("samp")
Before Spark 1.3 there was no DataFrame.
From Spark 1.6 onwards there is the Dataset as well.
RDD --> DataFrame --> Dataset
#7. Play SQL statements
run select statements against 'samp' (temp table)
#8. Apply Select Statement of SQL on temp table
val r1 = sqc.sql("select a+b+c as tot from samp")
(returned object is not a temp table, returned object is data frame)
(r1 is dataframe)
r1
----
tot
----
600
900
when an SQL statement is applied on a temp table, the returned object is a DataFrame
to apply an SQL statement on the result set again, we need to register it as a temp table
r1.registerTempTable("samp1")
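For example, a small sketch of querying the re-registered result (the filter value is arbitrary):
val r2 = sqc.sql("select tot from samp1 where tot > 700")
r2.show()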
emp:
id,name,sal,sex,dno
101,aaa,40000,m,1
......
import org.apache.spark.sql.SQLContext
val sqc = new SQLContext(sc)
val data = sc.textFile("/user/cloudera/Sparks/emp")
case class Emp(id:Int, name:String, sal:Int, sex:String, dno:Int)
def toEmp(x:String) = {
val w = x.trim().split(",")
val id = w(0).toInt
val name = w(1)
val sal = w(2).toInt
val sex = w(3)
val dno = w(4).toInt
val e = Emp(id,name,sal,sex,dno)
e
}
val emps = data.map ( x => toEmp(x));
val df = emps.toDF
df.registerTempTable("emp")
val r1 = sqc.sql("select sex,sum(sal) as tot from emp group by sex");
val res2 = sqc.sql("select dno,sex, sum(sal) as tot, avg(sal) as avg,
max(sal) as max, min(sal) as min, count(*) as cnt from emp group by dno,sex");
dept:
------
11,marketing,hyd
...
emp (file #1), dept (file #2)
val data2 = sc.textFile("/user/cloudera/Sparks/dept")
case class Dept(dno:Int, dname:String, city:String)
val dept = data2.map { x =>
val w = x.split(",")
val id = w(0).toInt
val name = w(1)
val city = w(2)
Dept(id,name,city)
}
val df2 = dept.toDF
df2.registerTempTable("departs")
val res = sqc.sql("select city,sum(sal) as tot from emp l join departs r on l.dno = r.dno group by city")
(the object type of res is DataFrame); res.persist keeps it in memory
Tables are available in Hive; we are going to access and run queries against Hive tables within the Spark environment
------------------------------------------------------------------------------------------------------------
One time investment:
copy hive-site.xml into the /usr/lib/spark/conf folder
if this file is not copied, Spark cannot find the Hive metastore location
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
hc.sql("create database mydb")
hc.sql("use mydb")
hc.sql("create table result1 (dno int, tot int)")
hc.sql("insert into table result1 select dno,sum(sal) from default.emp group by dno")
select * from abc:
Hive contacts the metastore, which lives in an RDBMS (Derby by default - lightweight)
it can be reconfigured to Oracle or any other RDBMS
hql along with sql
------------------
val r1 = sqc.sql("...")
val r2 = hc.sql("...")
r1.registerTempTable("res1")
r2.registerTempTable("res2")
one dataset is in a file (we handle it using SQLContext)
another is in a Hive database (we handle it using HiveContext)
finally we do unions, joins etc. against both of them (see the sketch below)
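A minimal sketch of that idea, assuming the emps RDD built above and the Hive table mydb.result1 created just above (names are only illustrative); since HiveContext extends SQLContext, registering the file data in hc lets a single query use both:
import hc.implicits._                  // so that toDF goes through the HiveContext
val fileDF = emps.toDF                 // the data that came from a file
fileDF.registerTempTable("res1")       // temp table visible to hc
val combined = hc.sql("select l.dno, l.sal, r.tot from res1 l join mydb.result1 r on l.dno = r.dno")
combined.show()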
Working with json using sqlContext:
-------------------------------------
In Hive, a JSON SerDe (Serializer/Deserializer) or functions such as
get_json_object / json_tuple are needed;
working with json using sqlContext is simpler
json1
------
{"name":"Ravi,"age":20,"sex":"M"}
{"name":"Vani","city":"hyd","sex":"F"}
Web service responses are usually in json
even log files are often json
sc.textFile("....txt") // for a regular text file
for json we are going to use sqlContext
val jdf = sqc.read.json("/user/cloudera/Sparks/json1.json")
jdf is automatically a DataFrame
name age city sex
---------------------------
Ravi 20 null M
Vani null Hyd F
jdf - data frame
How to work with XML?
----------------------
i) using a 3rd party library (e.g. Databricks' spark-xml)
ii) integrate Spark with Hive using HiveContext and apply XML parser functions such as
xpath(), xpath_string(), xpath_int(), etc.
xml1
-------
<rec><name>Ravi</name><age>20</age></rec>
<rec><name>Rani</name><sex>f</sex></rec>
hc.sql("use mydb")
hc.sql("create table raw (line string)")
hc.sql("load data local inpath 'xml1' into table raw")
hc.sql("create table info (name string, age int, sex string)
row format delimited fields terminated by ','")
hc.sql("insert overwrite table info
select
xpath_string(line,'rec/name')
xpath_int(line,'rec/age')
xpath_string(line,'rec/sex' from raw")
Spark SQL:
----------
#1 import org.apache.spark.sql.SQLContext
-- val sqlContext = new SQLContext(sc)
//to convert RDDs into DFs implicitly
import sqlContext.implicits._
#2 load data from file
#3 create schema (case class)
#4 transform each element into case class
#5 convert into DF
#6 register as temp table
#7 Play with SQL
sqlContext.sql("SELECT ...") // only select statements are allowed
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> case class Samp(a:Int, b:Int, c:Int)
defined class Samp
scala> val s1 = Samp(10,20,30)
s1: Samp = Samp(10,20,30)
scala> val s2 = Samp(1,2,3)
s2: Samp = Samp(1,2,3)
scala> val s3 = Samp(100,200,300)
s3: Samp = Samp(100,200,300)
scala> val s4 = Samp(1000,2000,3000)
s4: Samp = Samp(1000,2000,3000)
scala> val data = List(s1,s2,s3,s4)
data: List[Samp] = List(Samp(10,20,30), Samp(1,2,3), Samp(100,200,300), Samp(1000,2000,3000))
scala> val data = sc.parallelize(List(s1,s2,s3,s4))
data: org.apache.spark.rdd.RDD[Samp] = ParallelCollectionRDD[20] at parallelize at <console>:40
scala> data.collect.foreach(println)
Samp(10,20,30)
Samp(1,2,3)
Samp(100,200,300)
Samp(1000,2000,3000)
scala> val x = data.map (v => v.a)
x: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[21] at map at <console>:42
scala> x.collect.foreach(println)
10
1
100
1000
scala> val x = data.map(v => v.a + v.b + v.c)
x: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[22] at map at <console>:42
scala> x.collect.foreach(println)
60
6
600
6000
//once your RDD has a schema you can convert it into a DataFrame
an RDD with a schema can be converted into a DataFrame
DataFrames have specialized APIs and are optimized by the Catalyst optimizer
we can turn a DataFrame into a temp table and play with SQL statements
scala> val df = data.toDF
df: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int]
scala> df.collect.foreach(println)
[10,20,30]
[1,2,3]
[100,200,300]
[1000,2000,3000]
scala> df.printSchema
root
|-- a: integer (nullable = false)
|-- b: integer (nullable = false)
|-- c: integer (nullable = false)
scala>
scala> df.show()
+----+----+----+
| a| b| c|
+----+----+----+
| 10| 20| 30|
| 1| 2| 3|
| 100| 200| 300|
|1000|2000|3000|
+----+----+----+
scala> df.take(10)
res25: Array[org.apache.spark.sql.Row] = Array([10,20,30], [1,2,3], [100,200,300], [1000,2000,3000])
scala> df.show(3)
+---+---+---+
| a| b| c|
+---+---+---+
| 10| 20| 30|
| 1| 2| 3|
|100|200|300|
+---+---+---+
only showing top 3 rows
//register table to play SQL
scala> df.registerTempTable("df")
scala> sqlContext.sql("select * from df")
res31: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int]
scala> val df2 = sqlContext.sql("select * from df")
df2: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int]
scala> df2.show()
+----+----+----+
| a| b| c|
+----+----+----+
| 10| 20| 30|
| 1| 2| 3|
| 100| 200| 300|
|1000|2000|3000|
+----+----+----+
scala> val df2 = sqlContext.sql("select a,b from df")
df2: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> df2.show()
+----+----+
| a| b|
+----+----+
| 10| 20|
| 1| 2|
| 100| 200|
|1000|2000|
+----+----+
scala> val df3 = sqlContext.sql("select a,b,c,a+b+c as tot from df")
df3: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int, tot: int]
scala> df3.show()
+----+----+----+----+
| a| b| c| tot|
+----+----+----+----+
| 10| 20| 30| 60|
| 1| 2| 3| 6|
| 100| 200| 300| 600|
|1000|2000|3000|6000|
+----+----+----+----+
scala> df3.printSchema
root
|-- a: integer (nullable = false)
|-- b: integer (nullable = false)
|-- c: integer (nullable = false)
|-- tot: integer (nullable = false)
Transformation is very easy in Spark SQL
SQL needs data with a proper schema,
which is usually the case for enterprise (structured) data
The RDD API with functional programming already simplifies a lot,
but many people still find it difficult,
so Spark SQL came into the picture:
load your file into an RDD
create a schema (case class)
transform each element into that schema
convert into a DataFrame
register the DataFrame as a temp table
custom functionality still needs the RDD API (see the sketch below)
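When something can't be expressed in SQL, we can drop back to the RDD of Rows behind the DataFrame. A minimal sketch using the df (a, b, c) DataFrame from above; the per-row logic is only illustrative:
val rowRdd = df.rdd                            // RDD[org.apache.spark.sql.Row]
val custom = rowRdd.map { row =>
  val a = row.getInt(0); val b = row.getInt(1); val c = row.getInt(2)
  if (a > 50) (a * 2) + b + c else a + b + c   // arbitrary custom logic, just as an example
}
custom.collect.foreach(println)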
Create emp file in local then copy it into hdfs:
[cloudera@quickstart ~]$ cat > emp
101,aaa,40000,m,11
102,bbbb,50000,f,12
103,ccc,90000,m,12
104,ddddd,100000,f,13
105,eee,20000,m,11
106,iiii,30000,f,12
107,jjjj,60000,m,13
108,kkkkk,90000,f,14
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal emp Sparks/emp
scala>
scala> val raw = sc.textFile("/user/cloudera/Sparks/emp")
raw: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/emp MapPartitionsRDD[38] at textFile at <console>:30
scala> raw.count
res36: Long = 8
scala> raw.collect.foreach(println)
101,aaa,40000,m,11
102,bbbb,50000,f,12
103,ccc,90000,m,12
104,ddddd,100000,f,13
105,eee,20000,m,11
106,iiii,30000,f,12
107,jjjj,60000,m,13
108,kkkkk,90000,f,14
scala> raw.take(1) // without schema
res38: Array[String] = Array(101,aaa,40000,m,11)
//we create a case class to apply schema to existing RDD
scala> case class Info(id:Int, name:String,sal:Int,sex:String,dno:Int)
// create a function to apply Info case class for each element
def toInfo (x:String) = {
val w = x.split(",")
val id = w(0).toInt
val name = w(1)
val sal = w(2).toInt
val sex = w(3)
val dno = w(4).toInt
val info = Info(id,name,sal,sex,dno)
info
}
scala> val rec = "401,Amar,7000,m,12"
rec: String = 401,Amar,7000,m,12
scala> val re = toInfo(rec)
re: Info = Info(401,Amar,7000,m,12)
scala> re
res39: Info = Info(401,Amar,7000,m,12)
scala> re.name
res41: String = Amar
scala> re.sex
res42: String = m
scala> re.sal
res43: Int = 7000
scala> re.dno
res44: Int = 12
scala> re.id
res45: Int = 401
scala> val infos = raw.map(x => toInfo(x))
infos: org.apache.spark.rdd.RDD[Info] = MapPartitionsRDD[40] at map at <console>:50
scala> infos.collect.foreach(println)
Info(101,aaa,40000,m,11)
Info(102,bbbb,50000,f,12)
Info(103,ccc,90000,m,12)
Info(104,ddddd,100000,f,13)
Info(105,eee,20000,m,11)
Info(106,iiii,30000,f,12)
Info(107,jjjj,60000,m,13)
Info(108,kkkkk,90000,f,14)
scala> infos.map( x => x.sal).sum
res49: Double = 480000.0
//now infos has Schema so its eligible to convert into Data Frame
scala> val dfinfo = infos.toDF
dfinfo: org.apache.spark.sql.DataFrame = [id: int, name: string, sal: int, sex: string, dno: int]
scala> dfinfo.show()
+---+-----+------+---+---+
| id| name| sal|sex|dno|
+---+-----+------+---+---+
|101| aaa| 40000| m| 11|
|102| bbbb| 50000| f| 12|
|103| ccc| 90000| m| 12|
|104|ddddd|100000| f| 13|
|105| eee| 20000| m| 11|
|106| iiii| 30000| f| 12|
|107| jjjj| 60000| m| 13|
|108|kkkkk| 90000| f| 14|
+---+-----+------+---+---+
scala> dfinfo.printSchema
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- sal: integer (nullable = false)
|-- sex: string (nullable = true)
|-- dno: integer (nullable = false)
scala> sqlContext.sql("select * from dfinfo where sex='m'").show()
+---+----+-----+---+---+
| id|name| sal|sex|dno|
+---+----+-----+---+---+
|101| aaa|40000| m| 11|
|103| ccc|90000| m| 12|
|105| eee|20000| m| 11|
|107|jjjj|60000| m| 13|
+---+----+-----+---+---+
scala> sqlContext.sql("select * from dfinfo where sex='f'").show()
+---+-----+------+---+---+
| id| name| sal|sex|dno|
+---+-----+------+---+---+
|102| bbbb| 50000| f| 12|
|104|ddddd|100000| f| 13|
|106| iiii| 30000| f| 12|
|108|kkkkk| 90000| f| 14|
+---+-----+------+---+---+
RDD way to find sum of sal for male and female employees
scala> infos
res60: org.apache.spark.rdd.RDD[Info] = MapPartitionsRDD[40] at map at <console>:50
scala> val pair = infos.map ( x => (x.sex,x.sal))
pair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[61] at map at <console>:52
scala> val res = pair.reduceByKey(_+_)
res: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[62] at reduceByKey at <console>:54
scala> res.collect.foreach(println)
(f,270000)
(m,210000)
scala> var r = sqlContext.sql("select sex,sum(sal) as tot from dfinfo group by sex")
r: org.apache.spark.sql.DataFrame = [sex: string, tot: bigint]
scala> r.show()
+---+------+
|sex| tot|
+---+------+
| f|270000|
| m|210000|
+---+------+
// if it is RDD -> saveAsTextFile
// if it is DataFrame -> avro,orc,parquet,json etc
code is vastly reduced
optimized memory management
much faster execution
// we need to register the DataFrame as a temp table; only then can we run SQL queries against it
RDD way to filter records:
-------------------------
scala> infos.filter( x => x.sex.toLowerCase == "m").collect.foreach(println)
Info(101,aaa,40000,m,11)
Info(103,ccc,90000,m,12)
Info(105,eee,20000,m,11)
Info(107,jjjj,60000,m,13)
scala> infos.filter( x => x.sex.toLowerCase == "f").collect.foreach(println)
Info(102,bbbb,50000,f,12)
Info(104,ddddd,100000,f,13)
Info(106,iiii,30000,f,12)
Info(108,kkkkk,90000,f,14)
scala> dfinfo.registerTempTable("dfinfo")
code vastly reduced
optimized data transformation and memory management
play with multiple databases
RDD style of multiple aggregations:
in RDD style, for each sex group I want all 5 aggregations (sum, count, avg, max, min)
later we will do the same thing in SQL
scala> infos.collect.foreach(println)
Info(101,aaa,40000,m,11)
Info(102,bbbb,50000,f,12)
Info(103,ccc,90000,m,12)
Info(104,ddddd,100000,f,13)
Info(105,eee,20000,m,11)
Info(106,iiii,30000,f,12)
Info(107,jjjj,60000,m,13)
Info(108,kkkkk,90000,f,14)
scala> pair.collect.foreach(println)
(m,40000)
(f,50000)
(m,90000)
(f,100000)
(m,20000)
(f,30000)
(m,60000)
(f,90000)
scala> val pair = infos.map (x => (x.sex,x.sal))
pair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[70] at map at <console>:52
scala> pair.collect.foreach(println)
(m,40000)
(f,50000)
(m,90000)
(f,100000)
(m,20000)
(f,30000)
(m,60000)
(f,90000)
scala> val grp = pair.groupByKey()
grp: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[71] at groupByKey at <console>:54
scala> grp.collect.foreach(println)
(f,CompactBuffer(50000, 100000, 30000, 90000))
(m,CompactBuffer(40000, 90000, 20000, 60000))
scala> val res = grp.map { x =>
| val sex = x._1
| val cb = x._2
| val tot = cb.sum
| val cnt = cb.size
| val avg = tot / cnt
| val max = cb.max
| val min = cb.min
| (sex,tot,cnt,avg,max,min)
| }
res: org.apache.spark.rdd.RDD[(String, Int, Int, Int, Int, Int)] = MapPartitionsRDD[72] at map at <console>:56
scala> res.collect.foreach(println)
(f,270000,4,67500,100000,30000)
(m,210000,4,52500,90000,20000)
select sex,sum(sal) as tot, count(*) as cnt, avg(sal) as avg,
max(sal) as max, min(sal) as min
from dfinfo group by sex
scala> sqlContext.sql("select sex,sum(sal) as tot, count(*) as cnt, avg(sal) as avg, max(sal) as max, min(sal) as min from dfinfo group by sex").show()
+---+------+---+-------+------+-----+
|sex| tot|cnt| avg| max| min|
+---+------+---+-------+------+-----+
| f|270000| 4|67500.0|100000|30000|
| m|210000| 4|52500.0| 90000|20000|
+---+------+---+-------+------+-----+
select dno,sex,sum(sal) as tot, count(*) as cnt, avg(sal) as avg,
max(sal) as max, min(sal) as min
from dfinfo group by dno,sex
scala> sqlContext.sql("select dno,sex,sum(sal) as tot, count(*) as cnt, avg(sal) as avg, max(sal) as max, min(sal) as min from dfinfo group by dno,sex").show()
+---+---+------+---+--------+------+------+
|dno|sex| tot|cnt| avg| max| min|
+---+---+------+---+--------+------+------+
| 11| m| 60000| 2| 30000.0| 40000| 20000|
| 12| f| 80000| 2| 40000.0| 50000| 30000|
| 12| m| 90000| 1| 90000.0| 90000| 90000|
| 13| f|100000| 1|100000.0|100000|100000|
| 13| m| 60000| 1| 60000.0| 60000| 60000|
| 14| f| 90000| 1| 90000.0| 90000| 90000|
+---+---+------+---+--------+------+------+
// multi grouping, multiple aggregations done
[cloudera@quickstart ~]$ cat > emp
101,aaa,40000,m,11
102,bbbb,50000,f,12
103,ccc,90000,m,12
104,ddddd,100000,f,13
105,eee,20000,m,11
106,iiii,30000,f,12
107,jjjj,60000,m,13
108,kkkk,90000,f,14
[cloudera@quickstart ~]$ cat > emp2
201,kiran,14,m,90000
202,mani,12,f,10000
203,giri,12,m,20000
204,girija,11,f,40000
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal emp Sparks/emp
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal emp2 Sparks/emp2
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/emp
101,aaa,40000,m,11
102,bbbb,50000,f,12
103,ccc,90000,m,12
104,ddddd,100000,f,13
105,eee,20000,m,11
106,iiii,30000,f,12
107,jjjj,60000,m,13
108,kkkk,90000,f,14
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/emp2
201,kiran,14,m,90000
202,mani,12,f,10000
203,giri,12,m,20000
204,girija,11,f,40000
we have 2 different files, emp and emp2, and their schemas (column orders) differ
scala> raw
res71: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/emp MapPartitionsRDD[38] at textFile at <console>:30
scala> infos
res72: org.apache.spark.rdd.RDD[Info] = MapPartitionsRDD[40] at map at <console>:50
(2 table joining using Spark SQL)
scala> val raw2 = sc.textFile("/user/cloudera/Sparks/emp2")
raw2: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/emp2 MapPartitionsRDD[94] at textFile at <console>:30
scala> raw2.collect.foreach(println)
201,kiran,14,m,90000
202,mani,12,f,10000
203,giri,12,m,20000
204,girija,11,f,40000
scala> val infos2 = raw2.map { x =>
| val w = x.split(",")
| val id = w(0).toInt
| val name = w(1)
| val dno = w(2).toInt
| val sex = w(3)
| val sal = w(4).toInt
| Info(id,name,sal,sex,dno)
| }
infos2: org.apache.spark.rdd.RDD[Info] = MapPartitionsRDD[95] at map at <console>:48
scala> infos2.collect.foreach(println)
Info(201,kiran,90000,m,14)
Info(202,mani,10000,f,12)
Info(203,giri,20000,m,12)
Info(204,girija,40000,f,11)
scala> infos.collect.foreach(println)
Info(101,aaa,40000,m,11)
Info(102,bbbb,50000,f,12)
Info(103,ccc,90000,m,12)
Info(104,ddddd,100000,f,13)
Info(105,eee,20000,m,11)
Info(106,iiii,30000,f,12)
Info(107,jjjj,60000,m,13)
Info(108,kkkk,90000,f,14)
scala> dfinfo.show(2)
+---+----+-----+---+---+
| id|name| sal|sex|dno|
+---+----+-----+---+---+
|101| aaa|40000| m| 11|
|102|bbbb|50000| f| 12|
+---+----+-----+---+---+
only showing top 2 rows
scala> val dfinfo2 = infos2.toDF
dfinfo2: org.apache.spark.sql.DataFrame = [id: int, name: string, sal: int, sex: string, dno: int]
scala> dfinfo2.show(2)
+---+-----+-----+---+---+
| id| name| sal|sex|dno|
+---+-----+-----+---+---+
|201|kiran|90000| m| 14|
|202| mani|10000| f| 12|
+---+-----+-----+---+---+
only showing top 2 rows
scala> dfinfo2.registerTempTable("dfinfo2")   // register the second DataFrame before querying it
scala> val df = sqlContext.sql("select * from dfinfo union all select * from dfinfo2")
scala> df.show()
+---+------+------+---+---+
| id| name| sal|sex|dno|
+---+------+------+---+---+
|101| aaa| 40000| m| 11|
|102| bbbb| 50000| f| 12|
|103| ccc| 90000| m| 12|
|104| ddddd|100000| f| 13|
|105| eee| 20000| m| 11|
|106| iiii| 30000| f| 12|
|107| jjjj| 60000| m| 13|
|108| kkkk| 90000| f| 14|
|201| kiran| 90000| m| 14|
|202| mani| 10000| f| 12|
|203| giri| 20000| m| 12|
|204|girija| 40000| f| 11|
+---+------+------+---+---+
scala> df.registerTempTable("df")
// combined aggregation of both tables (emp,emp2)
scala> sqlContext.sql("select sex,sum(sal) as tot from df group by sex").show()
+---+------+
|sex| tot|
+---+------+
| f|320000|
| m|320000|
+---+------+
Multi table joining (schema is different for left and right tables)
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/dept
11,marketing,hyd
12,hr,del
13,finance,hyd
14,admin,del
15,accounts,hyd
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/emp
101,aaa,40000,m,11
102,bbbb,50000,f,12
103,ccc,90000,m,12
104,ddddd,100000,f,13
105,eee,20000,m,11
106,iiii,30000,f,12
107,jjjj,60000,m,13
108,kkkk,90000,f,14
val raw3 = sc.textFile("/user/cloudera/Sparks/dept")
raw3: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/dept MapPartitionsRDD[114] at textFile at <console>:30
scala> raw3.collect.foreach(println)
11,marketing,hyd
12,hr,del
13,finance,hyd
14,admin,del
15,accounts,hyd
scala> case class Dept(dno:Int, dname:String,loc:String)
defined class Dept
scala> val dept = raw3.map { x =>
| val w = x.split(",")
| val dno = w(0).toInt
| val dname = w(1)
| val loc = w(2)
| Dept(dno,dname,loc)
| }
dept: org.apache.spark.rdd.RDD[Dept] = MapPartitionsRDD[115] at map at <console>:48
what is the salary budget of each city?
scala> dept.collect.foreach(println)
Dept(11,marketing,hyd)
Dept(12,hr,del)
Dept(13,finance,hyd)
Dept(14,admin,del)
Dept(15,accounts,hyd)
scala> infos.collect.foreach(println)
Info(101,aaa,40000,m,11)
Info(102,bbbb,50000,f,12)
Info(103,ccc,90000,m,12)
Info(104,ddddd,100000,f,13)
Info(105,eee,20000,m,11)
Info(106,iiii,30000,f,12)
Info(107,jjjj,60000,m,13)
Info(108,kkkk,90000,f,14)
scala> val deptdf = dept.toDF
deptdf: org.apache.spark.sql.DataFrame = [dno: int, dname: string, loc: string]
scala> deptdf.registerTempTable("dept")
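To answer the question above (the salary budget of each city), a small sketch joining the two temp tables registered above ("dfinfo" for emp and "dept" for dept):
val citybudget = sqlContext.sql("select d.loc, sum(e.sal) as tot from dfinfo e join dept d on e.dno = d.dno group by d.loc")
citybudget.show()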
Accessing Hive Tables using Spark:
----------------------------------
sqlContext
hiveContext
search for 'hive-site.xml' file in linux
su
Password: cloudera
[root@quickstart cloudera]# find / -name hive-site.xml
/home/cloudera/Desktop/hive-site.xml
/etc/hive/conf.dist/hive-site.xml
/etc/impala/conf.dist/hive-site.xml
copy hive-site.xml into /usr/lib/spark/conf
--------------------------------------------
[root@quickstart ~]# cp /etc/hive/conf.dist/hive-site.xml /usr/lib/spark/conf
//import package for hive in spark shell
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext
// create an instance of HiveContext by passing sc (the SparkContext)
scala> val hc = new HiveContext(sc)
hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@645a3f6d
// create hive database in spark
scala> hc.sql("create database myspark")
res109: org.apache.spark.sql.DataFrame = [result: string]
start hive and look for myspark database:
--------------------------------------
[cloudera@quickstart ~]$ hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> show databases;
OK
batch_may
default
myspark
sakthi
Time taken: 0.028 seconds, Fetched: 4 row(s)
hive>
scala> hc.sql("use myspark")
res111: org.apache.spark.sql.DataFrame = [result: string]
scala> hc.sql("create table samp(id int, name string, sal int, sex string, dno int) row format delimited fields terminated by ','")
res112: org.apache.spark.sql.DataFrame = [result: string]
hive> use myspark;
OK
Time taken: 0.015 seconds
hive> show tables;
OK
samp
Time taken: 0.018 seconds, Fetched: 1 row(s)
hive> describe samp;
OK
id int
name string
sal int
sex string
dno int
Time taken: 0.112 seconds, Fetched: 5 row(s)
load data into samp :
---------------------
scala> hc.sql("load data local inpath 'emp' into table samp")
res113: org.apache.spark.sql.DataFrame = [result: string]
scala> hc.sql("select * from samp").show()
+---+-----+------+---+---+
| id| name| sal|sex|dno|
+---+-----+------+---+---+
|101| aaa| 40000| m| 11|
|102| bbbb| 50000| f| 12|
|103| ccc| 90000| m| 12|
|104|ddddd|100000| f| 13|
|105| eee| 20000| m| 11|
|106| iiii| 30000| f| 12|
|107| jjjj| 60000| m| 13|
|108| kkkk| 90000| f| 14|
+---+-----+------+---+---+
scala> val res1 = hc.sql("select dno,sum(sal) as tot from samp group by dno")
res1: org.apache.spark.sql.DataFrame = [dno: int, tot: bigint]
scala> res1.take(5)
res116: Array[org.apache.spark.sql.Row] = Array([11,60000], [12,170000], [13,160000], [14,90000])
scala> res1.show()
+---+------+
|dno| tot|
+---+------+
| 11| 60000|
| 12|170000|
| 13|160000|
| 14| 90000|
+---+------+
// in Hive, the same HQL query is converted into MapReduce java code and run as a jar (map... reduce...)
hive> select dno,sum(sal) from samp group by dno;
11 60000
12 170000
13 160000
14 90000
Hive provides partitioned tables to avoid full table scans and give faster query results (see the sketch below)
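A small sketch of a partitioned table (the table name and partition value are only illustrative), loading one static partition from the samp table above:
hc.sql("create table samp_part (id int, name string, sal int, sex string) partitioned by (dno int)")
hc.sql("insert into table samp_part partition (dno=11) select id, name, sal, sex from samp where dno=11")
hc.sql("select * from samp_part where dno=11").show()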
Create a json file in local linux :
-------------------------------------
cat > mydata.json
{"name":"Ravi","age":25}
{"name":"Rani","city":"Hyd"}
{"name":"Mani","age":24,"city":"Del"}
copy mydata.json into hdfs Sparks folder:
--------------------------------------------
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal mydata.json Sparks
display the content of mydata.json:
-----------------------------------
hdfs dfs -cat /user/cloudera/Sparks/mydata.json
{"name":"Ravi","age":25}
{"name":"Rani","city":"Hyd"}
{"name":"Mani","age":24,"city":"Del"}
import json into Hive table :
-------------------------------
hive> use myspark;
OK
Time taken: 0.046 seconds
hive> use myspark;
OK
Time taken: 0.016 seconds
hive> create table raw(line string);
OK
Time taken: 0.734 seconds
hive> load data local inpath 'mydata.json' into table raw;
Loading data to table myspark.raw
Table myspark.raw stats: [numFiles=1, totalSize=92]
OK
Time taken: 0.636 seconds
hive> select * from raw;
OK
{"name":"Ravi","age":25}
{"name":"Rani","city":"Hyd"}
{"name":"Mani","age":24,"city":"Del"}
Time taken: 0.079 seconds, Fetched: 3 row(s)
hive> create table info (name string, age int, city string);
//create destination / target table
select get_json_object(line,'$.name') from raw;
OK
Ravi
Rani
Mani
Time taken: 0.059 seconds, Fetched: 3 row(s)
//fetch json elements from raw table using get_json_object:
hive> select get_json_object(line,'$.name'),get_json_object(line,'$.age') from raw;
OK
Ravi 25
Rani NULL
Mani 24
Time taken: 0.065 seconds, Fetched: 3 row(s)
hive> select get_json_object(line,'$.name'),get_json_object(line,'$.age'),get_json_object(line,'$.city') from raw;
OK
Ravi 25 NULL
Rani NULL Hyd
Mani 24 Del
Time taken: 0.058 seconds, Fetched: 3 row(s)
//fetch json elements from raw table using json_tuple
hive> select x.* from raw lateral view json_tuple(line,'name','age','city') x as name,age,city;
OK
Ravi 25 NULL
Rani NULL Hyd
Mani 24 Del
Time taken: 0.063 seconds, Fetched: 3 row(s)
// fetch raw table json elements and put them into info table:
hive> insert into table info select x.* from raw lateral view json_tuple(line,'name','age','city') x as name,age,city;
hive> select * from info;
OK
Ravi 25 NULL
Rani NULL Hyd
Mani 24 Del
Time taken: 0.076 seconds, Fetched: 3 row(s)
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/mydata.json
{"name":"Ravi","age":25}
{"name":"Rani","city":"Hyd"}
{"name":"Mani","age":24,"city":"Del"}
the Spark way to work with json files:
------------------------------------------
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/mydata.json
scala> val jdf = sqlContext.read.json("/user/cloudera/Sparks/mydata.json")
jdf: org.apache.spark.sql.DataFrame = [age: bigint, city: string, name: string]
scala> jdf.show()
+----+----+----+
| age|city|name|
+----+----+----+
| 25|null|Ravi|
|null| Hyd|Rani|
| 24| Del|Mani|
+----+----+----+
scala> jdf.count
res119: Long = 3
scala> jdf.take(3)
res120: Array[org.apache.spark.sql.Row] = Array([25,null,Ravi], [null,Hyd,Rani], [24,Del,Mani])
scala> jdf.printSchema
root
|-- age: long (nullable = true)
|-- city: string (nullable = true)
|-- name: string (nullable = true)
read.json ==> Serialization, Deserialization happens automatically
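The json DataFrame can also be registered as a temp table and queried with SQL (a small sketch; the temp table name is arbitrary):
jdf.registerTempTable("people")
sqlContext.sql("select name, age from people where age is not null").show()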
// example to do with nested json
// Hive approach first
cat > mydata1.json
{"name":"Ravi","age":25,"wife":{"name":"Rani","age":24,"city":"hyd"},"city":"del"}
{"name":"Kiran","age":30,"wife":{"name":"Veni","qual":"BTech","city":"hyd"},"city":"hyd"}
[cloudera@quickstart ~]$ cat mydata1.json
{"name":"Ravi","age":25,"wife":{"name":"Rani","age":24,"city":"hyd"},"city":"del"}
{"name":"Kiran","age":30,"wife":{"name":"Veni","qual":"BTech","city":"hyd"},"city":"hyd"}
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal mydata1.json Sparks
[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/Sparks/mydata1.json
{"name":"Ravi","age":25,"wife":{"name":"Rani","age":24,"city":"hyd"},"city":"del"}
{"name":"Kiran","age":30,"wife":{"name":"Veni","qual":"BTech","city":"hyd"},"city":"hyd"}
hive> create table jraw (line string);
OK
Time taken: 0.325 seconds
hive> load data local inpath 'mydata1.json' into table jraw;
Loading data to table myspark.jraw
Table myspark.jraw stats: [numFiles=1, totalSize=173]
OK
Time taken: 0.29 seconds
hive> select * from jraw;
OK
{"name":"Ravi","age":25,"wife":{"name":"Rani","age":24,"city":"hyd"},"city":"del"}
{"name":"Kiran","age":30,"wife":{"name":"Veni","qual":"BTech","city":"hyd"},"city":"hyd"}
Time taken: 0.054 seconds, Fetched: 2 row(s)
hive> create table raw2(name string, age int, wife string, city string);
OK
Time taken: 0.235 seconds
select x.* from jraw lateral view json_tuple(line,'name','age','wife','city') x as n,a,w,c;
OK
Ravi 25 {"name":"Rani","age":24,"city":"hyd"} del
Kiran 30 {"name":"Veni","qual":"BTech","city":"hyd"} hyd
Time taken: 0.071 seconds, Fetched: 2 row(s)
hive> insert into table raw2 select x.* from jraw lateral view json_tuple(line,'name','age','wife','city') x as n,a,w,c;
select * from raw2;
OK
Ravi 25 {"name":"Rani","age":24,"city":"hyd"} del
Kiran 30 {"name":"Veni","qual":"BTech","city":"hyd"} hyd
Time taken: 0.053 seconds, Fetched: 2 row(s)
hive> select name,get_json_object(wife,'$.name'), age,get_json_object(wife,'$.age'),get_json_object(wife,'$.qual'),city,get_json_object(wife,'$.city') from raw2;
OK
Ravi Rani 25 24 NULL del hyd
Kiran Veni 30 NULL BTech hyd hyd
Time taken: 0.063 seconds, Fetched: 2 row(s)
Spark approach to handle nested json:
--------------------------------------
scala> val couples = sqlContext.read.json("/user/cloudera/Sparks/mydata1.json")
couples: org.apache.spark.sql.DataFrame = [age: bigint, city: string, name: string, wife: struct<age:bigint,city:string,name:string,qual:string>]
scala> couples.show();
+---+----+-----+--------------------+
|age|city| name| wife|
+---+----+-----+--------------------+
| 25| del| Ravi| [24,hyd,Rani,null]|
| 30| hyd|Kiran|[null,hyd,Veni,BT...|
+---+----+-----+--------------------+
scala> couples.collect
res126: Array[org.apache.spark.sql.Row] = Array([25,del,Ravi,[24,hyd,Rani,null]], [30,hyd,Kiran,[null,hyd,Veni,BTech]])
scala> couples.collect.map (x => x(3))
res129: Array[Any] = Array([24,hyd,Rani,null], [null,hyd,Veni,BTech])
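The nested fields of the wife struct can also be selected directly with dot notation instead of pulling the struct out of the Row (a small sketch):
couples.select("name", "age", "wife.name", "wife.city").show()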
Hive with XML and Spark with XML
-------------------------------
The older semi-structured format is XML
The newer semi-structured format is JSON
In the IT industry, most legacy data is in XML
Hive has powerful XML parser functions
Databricks provides a 3rd party library (spark-xml)
We can get help from HQL for XML parsing
Create an XML file:
-------------------
[cloudera@quickstart ~]$ cat > my1st.xml
<rec><name>Ravi</name><age>25</age></rec>
<rec><name>Rani</name><sex>F</sex></rec>
<rec><name>Giri</name><age>35</age><sex>M</sex></rec>
Copy it into hdfs:
-------------------
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal my1st.xml Sparks
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
scala> hc
res131: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@645a3f6d
scala> hc.sql("use myspark")
res132: org.apache.spark.sql.DataFrame = [result: string]
scala> hc.sql("create table xraw(line string)")
res133: org.apache.spark.sql.DataFrame = [result: string]
scala> hc.sql("create table xinfo(name string, age int, city string) row format delimited fields terminated by ','")
res134: org.apache.spark.sql.DataFrame = [result: string]
hdfs location :
/user/hive/warehouse/myspark.db/xraw
scala> hc.sql("load data local inpath 'my1st.xml' into table xraw")
res135: org.apache.spark.sql.DataFrame = [result: string]
scala> hc.sql("select * from xraw").show()
+--------------------+
| line|
+--------------------+
|<rec><name>Ravi</...|
|<rec><name>Rani</...|
|<rec><name>Giri</...|
+--------------------+
scala> hc.sql("select xpath_string(line,'rec/name') from xraw").show()
+----+
| _c0|
+----+
|Ravi|
|Rani|
|Giri|
+----+
scala> hc.sql("select xpath_string(line,'rec/age') from xraw").show()
+---+
|_c0|
+---+
| 25|
| |
| 35|
+---+
scala> hc.sql("select xpath_string(line,'rec/sex') from xraw").show()
+---+
|_c0|
+---+
| |
| F|
| M|
+---+
scala> val re = hc.sql("select xpath_string(line,'rec/name'), xpath_string(line,'rec/age'),xpath_string(line,'rec/sex') from xraw")
re: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string, _c2: string]
scala> re.show();
+----+---+---+
| _c0|_c1|_c2|
+----+---+---+
|Ravi| 25| |
|Rani| | F|
|Giri| 35| M|
+----+---+---+
scala> val re1 = hc.sql("select xpath_string(line,'rec/name'), xpath_int(line,'rec/age'),xpath_string(line,'rec/sex') from xraw")
re1: org.apache.spark.sql.DataFrame = [_c0: string, _c1: int, _c2: string]
scala> re1.show()
+----+---+---+
| _c0|_c1|_c2|
+----+---+---+
|Ravi| 25| |
|Rani| 0| F|
|Giri| 35| M|
+----+---+---+
// put all the results taken from xraw into xresults:
-------------------------------------------------------
scala> hc.sql("insert into table xresults select xpath_string(line,'rec/name'), xpath_int(line,'rec/age'),xpath_string(line,'rec/sex') from xraw")
res144: org.apache.spark.sql.DataFrame = []
scala> hc.sql("select * from xresults").show()
+----+---+---+
|name|age|sex|
+----+---+---+
|Ravi| 25| |
|Rani| 0| F|
|Giri| 35| M|
+----+---+---+
Hive Integration Advantage:
Speed because of inmemory computing
DataFrame:
Spark Data Objects -> RDD
Data Frame -> Temporary Table
-> SQL queries
SparkSQL provides 2 types of data objects:
1) Data Frame
2) Data Set
RDD Vs Data Frame Vs Data Set
             RDD    Data Frame            Data Set
RDD APIs     Yes    No                    Yes
DF APIs      No     Yes                   No
DS APIs      No     No                    Yes
optimizer    -      Catalyst Optimizer    Catalyst Optimizer + Tungsten Optimizer
Data Frames are faster than RDDs because of the Catalyst optimizer
DataSets are faster than RDDs and Data Frames (Catalyst Optimizer + Tungsten Optimizer)
Both RDDs and Data Frames use in-memory computing
In-memory computing is much faster than traditional disk I/O computing
CPU cache - frequently used data is cached to get more performance
Tungsten uses CPU caches (L1, L2, L3, L4) along with in-memory computing
Computing out of CPU caches is much faster than plain in-memory computing
CPU cache is faster than main memory, which in turn is faster than disk
MapReduce (disk computing)
RDDs (in-memory computing)
DataSets (in-memory computing plus CPU caching) -- speed
DataSets are claimed to be more than 50% faster than traditional RDDs
Spark's in-memory cache is already fast, but DataSets combine the RDDs' in-memory computing
with L1/L2/L3/L4 CPU caching
Data Set Example:
----------------
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> case class Sample(a:Int, b:Int)
defined class Sample
scala> case class Sample(a:Int, b:Int)
defined class Sample
scala> val rdd = sc.parallelize(List(Sample(10,20),Sample(1,2),Sample(5,6),Sample(100,200),Sample(1000,2000)))
rdd: org.apache.spark.rdd.RDD[Sample] = ParallelCollectionRDD[261] at parallelize at <console>:36
scala> rdd.collect.foreach(println)
Sample(10,20)
Sample(1,2)
Sample(5,6)
Sample(100,200)
Sample(1000,2000)
scala> val df = rdd.toDF
df: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> df.printSchema()
root
|-- a: integer (nullable = false)
|-- b: integer (nullable = false)
scala> df.show()
+----+----+
| a| b|
+----+----+
| 10| 20|
| 1| 2|
| 5| 6|
| 100| 200|
|1000|2000|
+----+----+
scala> df.take(3)
res164: Array[org.apache.spark.sql.Row] = Array([10,20], [1,2], [5,6])
scala> df.take(3).foreach(println)
[10,20]
[1,2]
[5,6]
scala> df.select("a","b").show()
+----+----+
| a| b|
+----+----+
| 10| 20|
| 1| 2|
| 5| 6|
| 100| 200|
|1000|2000|
+----+----+
scala> df.select(df("a"),df("a")+1,df("b"),df("b")+1).show()
+----+-------+----+-------+
| a|(a + 1)| b|(b + 1)|
+----+-------+----+-------+
| 10| 11| 20| 21|
| 1| 2| 2| 3|
| 5| 6| 6| 7|
| 100| 101| 200| 201|
|1000| 1001|2000| 2001|
+----+-------+----+-------+
scala> df.filter(df("a") >= 100).show();
+----+----+
| a| b|
+----+----+
| 100| 200|
|1000|2000|
+----+----+
scala> df.filter(df("a")>100).show()
+----+----+
| a| b|
+----+----+
|1000|2000|
+----+----+
scala> val df = sqlContext.read.json("/user/cloudera/Sparks/mydata.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, city: string, name: string]
scala> df.show()
+----+----+----+
| age|city|name|
+----+----+----+
| 25|null|Ravi|
|null| Hyd|Rani|
| 24| Del|Mani|
+----+----+----+
scala> df.printSchema()
root
|-- age: long (nullable = true)
|-- city: string (nullable = true)
|-- name: string (nullable = true)
scala> df.select("name").show()
+----+
|name|
+----+
|Ravi|
|Rani|
|Mani|
+----+
scala> df.select("name","city","age").show()
+----+----+----+
|name|city| age|
+----+----+----+
|Ravi|null| 25|
|Rani| Hyd|null|
|Mani| Del| 24|
+----+----+----+
scala> df.select("name","city").show()
+----+----+
|name|city|
+----+----+
|Ravi|null|
|Rani| Hyd|
|Mani| Del|
+----+----+
scala> df.select(df("name"),df("age")+1).show();
+----+---------+
|name|(age + 1)|
+----+---------+
|Ravi| 26|
|Rani| null|
|Mani| 25|
+----+---------+
scala> df.select(df("age"),df("age")+100).show()
+----+-----------+
| age|(age + 100)|
+----+-----------+
| 25| 125|
|null| null|
| 24| 124|
+----+-----------+
df.filter(df("age")>21).show();
df.groupBy("age").count().show()
scala> val data = sc.textFile("/user/cloudera/Sparks/emp")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/emp MapPartitionsRDD[325] at textFile at <console>:37
scala> data.collect.foreach(println)
101,aaa,40000,m,11
102,bbbb,50000,f,12
103,ccc,90000,m,12
104,ddddd,100000,f,13
105,eee,20000,m,11
106,iiii,30000,f,12
107,jjjj,60000,m,13
108,kkkk,90000,f,14
scala> val emp = data.map { x =>
| val w = x.split(",")
| val id = w(0).toInt
| val name = w(1)
| val sal = w(2).toInt
| val sex = w(3)
| val dno = w(4).toInt
| Emp(id,name,sal,sex,dno)
| }
emp: org.apache.spark.rdd.RDD[Emp] = MapPartitionsRDD[326] at map at <console>:55
scala> emp.collect.foreach(println)
Emp(101,aaa,40000,m,11)
Emp(102,bbbb,50000,f,12)
Emp(103,ccc,90000,m,12)
Emp(104,ddddd,100000,f,13)
Emp(105,eee,20000,m,11)
Emp(106,iiii,30000,f,12)
Emp(107,jjjj,60000,m,13)
Emp(108,kkkk,90000,f,14)
scala> val empdf = emp.toDF   // missing step: convert the Emp RDD into a DataFrame
scala> empdf.select("id","name","sal","sex","dno").show();
+---+-----+------+---+---+
| id| name| sal|sex|dno|
+---+-----+------+---+---+
|101| aaa| 40000| m| 11|
|102| bbbb| 50000| f| 12|
|103| ccc| 90000| m| 12|
|104|ddddd|100000| f| 13|
|105| eee| 20000| m| 11|
|106| iiii| 30000| f| 12|
|107| jjjj| 60000| m| 13|
|108| kkkk| 90000| f| 14|
+---+-----+------+---+---+
scala> empdf.select(empdf("sal"),empdf("sal")*10/100).show();
+------+------------------+
| sal|((sal * 10) / 100)|
+------+------------------+
| 40000| 4000.0|
| 50000| 5000.0|
| 90000| 9000.0|
|100000| 10000.0|
| 20000| 2000.0|
| 30000| 3000.0|
| 60000| 6000.0|
| 90000| 9000.0|
+------+------------------+
scala> empdf.groupBy(empdf("sex")).count.show()
+---+-----+
|sex|count|
+---+-----+
| f| 4|
| m| 4|
+---+-----+
// select sex,count(*) from emp group by sex;
scala> empdf.groupBy(empdf("sex")).agg(sum("sal")).show();
+---+--------+
|sex|sum(sal)|
+---+--------+
| f| 270000|
| m| 210000|
+---+--------+
// for each sex group how much is the total salary
// here we dealt with single group and single aggregation
scala> empdf.groupBy(empdf("sex")).agg(sum("sal"),max("sal")).show();
+---+--------+--------+
|sex|sum(sal)|max(sal)|
+---+--------+--------+
| f| 270000| 100000|
| m| 210000| 90000|
+---+--------+--------+
// single grouping but multiple aggregations
scala> empdf.groupBy(empdf("sex")).agg(sum("sal"),max("sal"),min("sal")).show();
+---+--------+--------+--------+
|sex|sum(sal)|max(sal)|min(sal)|
+---+--------+--------+--------+
| f| 270000| 100000| 30000|
| m| 210000| 90000| 20000|
+---+--------+--------+--------+
//group by multiple columns and multiple aggregations
scala> empdf.groupBy(empdf("dno"),empdf("sex")).agg(sum("sal"),max("sal"),min("sal")).show();
+---+---+--------+--------+--------+
|dno|sex|sum(sal)|max(sal)|min(sal)|
+---+---+--------+--------+--------+
| 11| m| 60000| 40000| 20000|
| 12| f| 80000| 50000| 30000|
| 12| m| 90000| 90000| 90000|
| 13| f| 100000| 100000| 100000|
| 13| m| 60000| 60000| 60000|
| 14| f| 90000| 90000| 90000|
+---+---+--------+--------+--------+
convert df into temp table then play with sql queries
dataSets --> Catalyst optimizer + Tungsten optimizer (CPU cache L1, L2, L3, L4)
frequently used data is cached for faster access
SAP HANA is also in-memory computing
HANA is not distributed
but Spark is massively distributed
in-memory + CPU cache + GPU speed ==> super rocket speed
Quad core example:
if you enable GPU computing, even while using a single-core processor, we can run 64 parallel processes
the single core is divided into 64 sub-cores
if we use 4 cores, we can run 64*4 = 256 parallel processes
with a 4-core CPU it may act like a 256-node cluster because of GPU enabling
multiple layers of GPU
Machine learning algorithms
- Decision Tree
- Random Forest kind of algorithms on 200 GB of data
if we run them in plain R or Python --> it may need 2 days to run
Spark execution is well suited for future machine learning workloads
Spark with Scala
Spark with Python
lots of rich libraries are available
Machine Learning (MLlib) and GraphX algorithms are available
Data Set:
--------
a) DataSet without schema:
-----------------------------
scala> val ds = Seq(1,2,3).toDS()
ds: org.apache.spark.sql.Dataset[Int] = [value: int]
scala> ds.show();
+-----+
|value|
+-----+
| 1|
| 2|
| 3|
+-----+
scala> ds.printSchema
root
|-- value: integer (nullable = false)
scala> ds.map(x => x+10).show();
+-----+
|value|
+-----+
| 11|
| 12|
| 13|
+-----+
// faster operation, with the help of Tungsten
Data Set with Schema:
---------------------
scala> case class Person(name:String, age:Long)
defined class Person
scala> val ds = Seq(Person("Andy",32)).toDS();
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
scala> ds.printSchema();
root
|-- name: string (nullable = true)
|-- age: long (nullable = false)
scala> ds.show();
+----+---+
|name|age|
+----+---+
|Andy| 32|
+----+---+
scala> val ds = Seq(Person("Andy",32),Person("Raja",28),Person("Ravi",29)).toDS();
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
scala> ds.show();
+----+---+
|name|age|
+----+---+
|Andy| 32|
|Raja| 28|
|Ravi| 29|
+----+---+
Play with Json and DataSet:
-----------------------------
create a json file in local linux:
[cloudera@quickstart ~]$ cat >sample.json
{"name":"Hari","age":30}
{"name":"Latha","age":25}
{"name":"Mani","age":23}
copy it into hdfs:
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal sample.json Sparks
display the content using cat:
-----------------------------
[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/Sparks/sample.json
{"name":"Hari","age":30}
{"name":"Latha","age":25}
{"name":"Mani","age":23}
Example: reading a json file using Spark (read.json returns a DataFrame)
scala> val rdd = sqlContext.read.json("/user/cloudera/Sparks/sample.json")
rdd: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> rdd.foreach(println)
[30,Hari]
[25,Latha]
[23,Mani]
scala> rdd.printSchema
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
DataSet with json
-----------------
scala> case class Person(name:String, age:Long)
defined class Person
scala> Person
res206: Person.type = Person
scala> val ds = sqlContext.read.json("/user/cloudera/Sparks/sample.json").as[Person] // as[Person] creates dataSet
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
scala> ds.printSchema
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
scala> ds.show();
+-----+---+
| name|age|
+-----+---+
| Hari| 30|
|Latha| 25|
| Mani| 23|
+-----+---+
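Since ds is a typed Dataset[Person], ordinary Scala lambdas work on it (a small sketch):
ds.filter(p => p.age > 24).show()
ds.map(p => p.name.toUpperCase).show()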
Word count example:
------------------
RDD approach:
val lines = sc.textFile("......file.txt")
val words = lines.flatMap(_.split(",")).filter(_ != "")
val counts = words.groupBy(_.toLowerCase).map(w => (w._1, w._2.size))
dataSet approach:
-------------------
val lines = sqlContext.read.text("......file.txt").as[String]
val words = lines.flatMap(_.split(",")).filter(_ != "")
val counts = words.groupBy(_.toLowerCase).count()
Spark Streaming:
----------------
flume / storm / kafka / spark
Types of Big Data Applications:
-------------------------------
Online, Live Streaming, Micro Batching, Batch
Online
-----------
User Interactive
End user's transactions are
recorded here
ATM cash withdrawal
All User Interactive applications
Batch
--------
Non interactive
automated
Bank statement generations
periodically scheduled
Generate reports
All Non interactive applications
Online --> RDBMS
1 lakh transactions / minute hitting it --> an RDBMS can handle that
Java, C#, Python - any app --> RDBMS
crores/trillions of transactions / minute hitting it -> an RDBMS can't handle that
NoSQL came into the picture to handle trillions of transactions / minute
OLTP - OnLine Transaction Process (Online)
OLAP - Online Analytical Process (Batch)
RDBMS can't bear too much of work load
NoSQL supports heavy load and online processing
MongoDB
Cassandra - for apps that don't want to involve Hadoop
Neo4j
Hadoop allows only a sequential reading style
from the beginning of a block to the end of the block
record by record reading - sequential reading
Random reading in Hadoop is not possible
HBase -> row level access --> random reading
If the app needs Hadoop processing, go with HBase
Go with Cassandra if you don't want to use Hadoop
A single RDBMS can solve all our problems,
but a single NoSQL DB can't solve all our problems:
for fixing different kinds of problems, different kinds of NoSQL DBs are needed
Graph DB -> Friend of Friend of Friend --> 3rd level friend in Facebook --> Neo4j
100% schemaless -> MongoDB
Faster Random access with K,V pair -> PNUT
Faster Random access along with columns -> Hbase
Big Table (Google Big Table white paper)
a table can have 40 lakh columns (4 million columns)
we can't keep 40 lakh columns in a single RDBMS table
the RDBMS max column count is about 1024
in a row store, select * from table can even be faster than select <column123>,<column235> from table,
because the engine has to work out the beginning and ending position of each requested column inside every row
Columnar stores
Row Key (associated with column family)
Column Family (associated with column name)
Column Name (associated with Column value)
Column Family (Table)
RDBMS and JOINs
----------------
We will be having multiple tables in an RDBMS:
Employee, Department, Manager, Attendance tables -
we will be doing JOINs to combine multi-table info to get the desired results
Big Data with JOINs
-------------------
In Big Data the left side table (Table A) may have 1 crore records
and the right side table (Table B) may have 10 lakh records
It will be very difficult to do joins with this huge number of rows.
(10 lakh x 1 crore comparison - very difficult)
Big big tables with OLTP joins are very bad
Cassandra-kind of systems eliminate this kind of join:
all these independent tables (Employee, Department, Manager, Attendance) are kept as column families
a single Cassandra or HBase table will have all four of the above tables as column families
This process is called denormalisation
Generally, if we use an RDBMS as our backend, denormalization is very bad for online systems
because a single RDBMS table can have at most about 1024 columns,
so if we need more than 10K columns we definitely need 100s of tables.
100 tables x 100 columns, or (1024 columns x 10 tables at most)
Employee table (250 columns)
Department (120)
Attendance (150)
Manager (150)
and some more tables with 200s of columns
if we denormalize all tables and put them into a single table, that is impossible:
if the total number of columns of all our tables exceeds 1024, we can't denormalize them in an RDBMS
If I have a denormalized table, I don't need to do any joins
In OLTP (RDBMS) systems denormalization is not possible because of the maximum allowed column count of a single table
schemaless (structureless , flexible schema, dynamic schema)
Row by row, the number of columns differs
Fresher
Experienced
Married
Different kinds of rows have different kinds of columns
Products in Amazon:
Book -> Author, Publisher, Date, pages, ISBN
Toy -> Mfd Date, Material, Age
Pickle -> Expiry date, Mfd Date, Veg/NonVeg, etc.
Each and every product has its own attributes
it's very difficult to put all kinds of products in a single (RDBMS) table
NoSQL transactional DBs are good at handling a different number of columns for each row (each product)
Streaming:
To capture a large number of transactions at a high rate of speed
without the user's knowledge, a lot of data is captured, collected, stored and analyzed
(User event logs)
Security Camera:
the CAM keeps capturing and recording
today morning 10 MB
tomorrow 1 GB
next month 1 TB .. streaming means it keeps on capturing and writing
Logs
Web log
DB log
OS Logging
App log
Log files keep growing; without user interaction, log info keeps being captured and written
Logs are used to analyze server problems
But nowadays logs are also used to understand user behaviour
which pages the user visited
(example) last night my bike broke down and I went home by bus
this morning, Google sent me a message: this morning's first bus is at 7:15 AM
Bharath Matrimony:
seeking for matches
10:30 AM
Match 1: Rani
Looking Rani's profile
10:40 AM
Match 2: Veni
Looking Veni's profile
11:00 AM
Match 3: Kajol
Looking Kajol's profile
11:28 AM
Match 1: Rani (he is revisiting it)
how long he viewed a profile: that profile attracted our guy
how many times he revisited the same profile:
Recommendation engine using machine learning
Giri's nearest neighbour is Vani,
so show Vani's info as a recommendation for our guy
taste and preference based recommendation
all recommendation engine algorithms take the user logs, analyze them, and find the recommendations
Amazon, Bharath Matrimony, FlipKart, NetFlix etc.,
all of a user's log info is helpful for making recommendations
Static, non-streaming:
the downloaded file size is always the same (e.g. a movie)
static (not streaming)
Streaming:
Flume
Storm
Kafka
Spark Streaming
each and every tool was developed to solve a different purpose
Flume for Log analysis
Kafka for Transaction processing with small analysis
Storm : Transaction processing with Heavy analysis
Flume:
faster stream capture applications
not part of Hadoop
separate cluster
independent system
a faster stream capture system
a 100% delivery guarantee is not there
no guarantee that each and every event will reach the destination (target)
Streaming
Source (app)
------>
Flume
---->
JMS, Hadoop (Destination)
1000 Events generated
Flume
800 events delivered at destination
Flume Agent (Source, Channel, Sinks)
Source (App) - each event is immediately captured and buffered in the Channel
event-based / volume-based threshold
3000 events --> transferred to ---> Hadoop
for fault tolerance, Flume recommends channel1 and channel2
C1 (3000 events) C2 (3000 events) --->
it takes some time to write into C1 and C2
a delayed process - it can fail here
100% guarantee of successful delivery - a big no
Flume is not recommended for sensitive, commercial apps (online banking, credit card transactions)
Good for doing Bahubali 2 sentiment analysis (data taken from twitter)
twitter, youtube, facebook user sentiment analysis (flume is good at this)
Banking transactions, credit card processing - Flume is very bad at this
Storm is not an alternative to Flume
but Kafka is an alternative to Flume
Kafka can stream things + it can act as a messaging system
Kafka is a streaming and message (brokerage) system
Kafka has the solution for Flume's issue:
Kafka gives a 100% delivery guarantee
Kafka handles high loads of transactions
if 10 different sources are hitting Flume, it will be slow
if 100s or 1000s of sources are hitting Kafka, it can handle them
Kafka is very powerful
Kafka promises 100% one-time delivery
Kafka is well capable of handling 'n' number of sources
LinkedIn can handle 3 trillion events (transactions) per 10 seconds using high-end systems (their secret)
Messaging - Brokerage systems
Publishers
Kafka
Subscriber
Buyer -> Kafka -> Seller
Buyer informs to Mediator
Mediator broadcast the msg to Seller
Before Kafka
Websphere MQ
RabbitMQ
Tibco
WebMethods
JMS (Java Messaging System)
Web services
App1 to App2 communications
App1 (C++) to App2 (C++) communications
App1 (Java) to App2 (Java) communications
Web services:
App to App communications (java to .net) (.net to python)
CRM
Seibel
Finance
Oracle
Sales
SAP
Web services help to communicate between different kind of apps written in different platforms, different languages
Apps:
Sales, Inventory, Finance, HR
Sales: products sold (products outflow), cash in (cash inflow)
Finance: the cash total is updated here
If Sales App talks with Finance that's good for updating cash matters
Sales App (C#) talks with Finance App (Java) via web services
Whatever money is received via the Sales App will be updated in the Finance App.
Everything runs smoothly.
But what if the Finance App is down?
The Sales cash updates won't be reflected in the Finance App because the second app is down.
So queue systems came in
(with priorities)
App1 tries to communicate with App2 (but App2 is down at that time)
but the queue system keeps App1's update
Once App2 is up, the queue system immediately delivers App1's updates to App2
JMS
Message Queuing, Message Brokerages
WebSphereMQ, RabbitMQ, JMS
All these queuing / message brokering systems can't handle heavy loads of transactions and events
How many nodes we can connect in Hadoop?
Unlimited
Kafka is a largely distributed system
RabbitMQ is also Distributed
Kafka has topics, RabbitMQ has queues
in Kafka a single topic can be spread across multiple nodes
but in RabbitMQ a single queue is kept on a single node
4 different queues can be distributed across 4 different nodes in RabbitMQ
but 1 queue can't be distributed across 4 nodes
RDDs are logical but partitions are physical
partition can be replicated in multiple machines
HDFS files are logical but blocks are physical
One year ago:
App1 is the source
App2,3,4, are destinations
meaning client requirement is App1 wants to communicate with App2,App3,App4 only
As of Now:
App1 is the source
App2.....App100 are destinations
meaning App1 wants to communicate with App2... App100
After 3 years:
App1 is the source:
App2... App10000 are destinations
meaning App1 wants to communicate with App2... App10000
in the traditional approach, lots and lots of code changes are needed for App1 to talk to all the other apps.
But Kafka simplifies everything: without writing lots and lots of code,
App1 can talk with any number of destinations
After sometime
A1 ... A2...A10000
X1
Y1
Z1 ... A2.. A10000
Lots and lots of Sources and lots and lots of Destinations are supported in Kafka
so, Kafka doesn't allow direct communication between source and destinations
In old style,
One app will talk with other app.
But in Kafka, App never directly talk with other app (source to dest)
But Source will talk with Broker and it doesn't know who is going to consume
Destination will talk with Broker and it doesn't know who delivered the message
App1 -- > Broker --> App2
Scaling
Old : Vertical scaling
increasing RAM, disk, or processor on the same system
existing 8GB, now add one more 8GB
existing 1 TB hdd, now add 2 TB more
We can't afford this indefinitely
(for a laptop, we can't keep putting in more TBs)
-- availability, compatibility issues
New : Horizontal scaling
increasing nodes
I want 100 TB
added 100 nodes with 1 TB each
Application scalability
Jio Number of apps are more
Beginning
1 source 3 targets
after some time
100 sources 1000 targets
because of new apps, I don't need to change existing apps.
Large scaling:
1000s of apps as sources and 1000s of apps as destinations
WebSphere can't handle a lakh transactions / events per second
high loads of transactions and events are supported (by Kafka)
Kafka ==> Streaming + Messaging --> Kafka is Superb
Kafka is bad at live analytics, but Storm is good at this
Storm is very good at Live analytics
Kafka is the replacement for Flume
But Kafka is not the replacement for Storm
Storm can do Live analytics (big complex algorithm)
Kafka is the message broker:
the buyer doesn't know who the best seller is
the seller doesn't know who is going to buy
Flume
can do only streaming, can't act as brokerage
a 100% delivery guarantee is not there
kafka
streaming + messaging
very large broker apps
1000s of sources and 1000s of target apps
high transaction load support
it can't perform live analytics on the captured events
it can capture an object, but it can't say whether the entered person is male or female
my app should detect: is the entered object male or female?
Kafka can capture any event - but it can't analyze the input itself
Credit card transactions :
Kafka can capture the transaction but it can't decide whether it is a genuine or a fraudulent transaction
Storm:
an ML algorithm / any given algorithm can be run on the captured events
but Storm can't capture streams at a very high rate
if the incoming rate is slow, Storm can cope
if the capture rate is very fast, Storm can't handle it
Storm is good at doing Live Analytics
Kafka can't do Live analytics but It can capture streaming very fast
Storm can do live analytics but it can't capture streaming very fast
Both integration is required in the industry
Micro batching -> this is Spark Streaming's purpose
Spark Streaming is not a replacement for Kafka
Batch:
user non-interactive applications
at a particular interval, a batch program triggers and fetches data from an external source
it generates a report or does its own updates :: that is batch
monthly once, twice a month, weekly once, daily once, hourly, down to seconds
micro batching ranges from roughly hourly down to every few seconds
e.g. for every 5 seconds, for every 10 seconds : how much money was deposited?
Hadoop is good at batch processing, but it needs the data to be available in HDFS (a file system)
for every 15-20 minutes --> Hadoop can do batch processing very well
for every 5-10 seconds --> Spark Streaming is the right fit
capturing live events -> Kafka
live analytics on captured events -> Storm
micro batching (seconds to minutes) --> Spark Streaming
batch analytics -> Hadoop + Spark
live transaction capturing -> NoSQL
no single NoSQL database is enough to fix all your problems
Cassandra - no need of Hadoop
if an organization wants to migrate its existing RDBMS to NoSQL for its OLTP workloads,
it has to use different kinds of NoSQL DBs
Document storage:
MongoDB, CouchDB
Columnar:
Cassandra
Graph storage:
Neo4j
Key-value:
PNUTS
each NoSQL database is an expert in its own particular area
Live analytics -> Storm
Storm + Trident -> micro batching was done this way earlier
but with Storm + Trident, application development is complex and needs more time to develop
if the number of sources grows into the hundreds, Spark Streaming alone is slower at capturing
Kafka + Spark Streaming (benefits):
faster capturing of events, messaging
micro batching
Credit card point of view:
all the transactions will be captured by Kafka and stored in the brokers
every 10 seconds : how much money was deposited?
Facebook feeds, Twitter tweets - sentiment analysis
word frequency - Spark Streaming can do this
good word / bad word live analytics - Storm is needed
Twitter tweets,
Facebook feeds --> Kafka --> Storm --> Kafka --> Spark Streaming --> Kafka
fast capture
live analytics
micro batch
Kafka, Storm, Spark Streaming, NoSQL
Hadoop -> provides a solution for batch processing (all offline data)
Hadoop Architect vs Big Data Architect
end-to-end journey of big data applications:
NoSQL stores all the transactions
Kafka captures them and puts them into the broker
Storm does the live analytics
Spark Streaming does the micro batches
the final results are put into NoSQL (HBase)
1) a credit card transaction is done
Kafka captures it
2) it is passed to Storm, and Storm analyzes whether it is a fraudulent or genuine transaction
3) the Storm API writes Storm's results back into Kafka
because other applications interact only with Kafka, not with Storm, the result needs to be passed into Kafka
so other applications access the Storm results from the Kafka cluster
4) Spark Streaming streams the Storm results from Kafka and performs micro batching:
for every 10-15 seconds, how many fraudulent and how many genuine transactions happened?
5) Spark Streaming rewrites its results into Kafka again
Hadoop and other applications can then access the micro-batch results and the Storm results
so Hadoop can perform batch processing when needed
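As a rough sketch of that hand-off in Scala: any downstream app (a Hadoop job, a dashboard) just consumes the results topic from Kafka and never talks to Storm or Spark directly. The topic name "fraud-results" and the group id below are illustrative placeholders.

// Minimal Kafka consumer sketch (Scala, standard Kafka Java client)
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object ResultsReader {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "reporting-app")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("fraud-results"))
    while (true) {
      // poll the broker; the consumer never talks to Storm or Spark Streaming directly
      val it = consumer.poll(Duration.ofSeconds(1)).iterator()
      while (it.hasNext) {
        val r = it.next()
        println(s"${r.key} -> ${r.value}")
      }
    }
  }
}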
the industry is building automation, but the industry itself is not yet automated
Attendance system:
separate attendance counts for males and females
with swiping, a friend can swipe on your behalf, so detection has to happen at the door
one object enters
Kafka captures the object (with the help of IoT integration with Kafka)
the IoT APIs send the object to Kafka
the object is available in a Kafka topic
Storm fetches from the topic and finds whether the object is human or not
and whether the human is male or female (Storm performs this)
(Kafka can't do the following)
1st object is a dog - ignore it - label it as Other
2nd object is a human -
verify male or female
Storm performs this classification - the label is Male / Female - and rewrites it into the Kafka topic
Spark Streaming fetches the Storm results from the Kafka topic,
and for every 1 hour or every 30 minutes:
how many persons are male?
how many persons are female?
Spark Streaming does that micro batching to find the count of males and females who entered
CM meeting:
if many people are flowing out, he can tell the meeting is boring
Demonetization:
RBI is interested, for every 15 seconds, in how much money has been deposited
that needs micro batching
an individual bank doesn't need micro batching, but RBI does
male inflow / outflow, female inflow / outflow
Hadoop, at the end of the day, is interested in the total number of males and females per day
Hadoop processes the batch, and the batch results are sent back to Kafka
Hadoop is one of the solutions in Big Data
Online transactions
Online streaming
Live analytics
Micro batching
Batch analytics
if your environment supports all of the above 5 areas, you have a full-fledged big data environment
Hadoop is older than SAP HANA
SAP HANA has live projects
many Big Data "live projects" are really just POCs
mostly it is migration from existing RDBMS to Big Data
Teradata, Oracle --> Sqooped into Big Data
Hadoop:
-------
transactions must first be recorded in a database or a file
later : sqoop import
known and unknown batch processes
results are kept in Hive
these results are not accessed by any other system
Big Data:
everything:
transactions
streaming
live analytics
micro batching
batch
dashboard viewing
Spark Streaming:
----------------
is used to stream data from sources. a source can be a file, a network port, or a remote host
ex. for remote hosts : Twitter tweets, Facebook feeds
or any other application, or other messaging systems like JMS, Kafka
Purpose of Spark Streaming:
micro batching
it streams data from sources, batches it (micro batching) and performs micro-batch analytics
[Sources] --> [Spark Streaming] -> prepare batches -> [Batches] (buffering) -> Spark Core
the analytics is performed by Spark Core
results can be written to a given target
a target can be : HDFS, a Kafka topic, other systems (NoSQL: MongoDB, Cassandra; AWS)
the Spark Streaming data objects are called DStreams - Discretized Streams
a DStream is a continuous series of RDDs
the micro-batching operation is applied on each RDD of the DStream
for every given interval, one RDD is built under the DStream
ex : the micro-batching period is 10 seconds
for every 10 seconds, one RDD is produced under the DStream
as the streaming job runs, these RDDs keep getting generated
these independent RDDs are processed by Spark Core
DStream1
[ 10s | 10s | 10s | 10s | 10s | 10s ]
for every 10s one RDD will be created
RDD6 | RDD5 | RDD4 | RDD3 | RDD2 | RDD1
micro-batch period is 10 seconds
we apply our business logic on the DStream
the logic / transformation is applied to each RDD independently
for every 10 seconds one RDD is created, and that same RDD is processed with the given transformation / business logic independently
01-10 seconds -- RDD7
11-20 seconds -- RDD6
21-30 seconds -- RDD5
31-40 seconds -- RDD4
41-50 seconds -- RDD3
51-60 seconds -- RDD2
61-70 seconds -- RDD1
before the 70-second mark, the streaming process started; for every 10 seconds one new RDD is generated
once a streaming job starts, it never stops unless it is manually stopped or killed
a streaming job never stops on its own
each RDD can be called a batch
who processes the RDD?
Spark Core
Spark Core processes the individual RDDs (meaning each and every micro batch of the 10-second interval)
Batch, Streaming, RDDs
the basic context object is : sc -> SparkContext
later : SQLContext, HiveContext
but here, for streaming -> StreamingContext
ssc - the StreamingContext is used to create DStreams
val ssc = new StreamingContext(sc, Seconds(10))
Seconds(10) --> micro-batch period
every 10 seconds' worth of streamed data is buffered at some worker node of the Spark cluster and prepared as a batch (an RDD of the DStream)
DStream -> continuous series of RDDs
once a batch is prepared, it is submitted to Spark Core
while Spark Core is processing that batch, Spark Streaming keeps collecting data from the sources and prepares the next batches
one batch is prepared and sent to Spark Core
then one more batch is prepared and sent to Spark Core for processing
it is a very bad idea to set a batch period of 1 hour
Security system:
for live analytics Storm is best
every 5-10 seconds : from where is the hacker attacking?
Storm will catch it immediately
the data science team will use CNN / ANN kinds of algorithms
such algorithms can be executed live by Storm
Storm has a special architecture; even a complex algorithm can be executed within a fraction of a second
whenever one object enters a classroom:
is the entered object human or non-human (a dog)?
if it is human : male or female?
that algorithm is not as simple as select sex, sum(sal) from emp
it will be a complex, complicated algorithm
it will be executed live within a fraction of a second
only Storm has that capability
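A rough sketch of what such a per-event classification bolt could look like in Scala (the classifyGender function is a placeholder for whatever model the data science team supplies, and the spout wiring is omitted; this is illustrative, not code from the course):

// Rough shape of a Storm bolt doing per-event classification
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer}
import org.apache.storm.tuple.{Fields, Tuple, Values}

class GenderBolt extends BaseBasicBolt {
  // placeholder: plug the real CNN / ANN model in here
  private def classifyGender(event: String): String = "Unknown"

  override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
    val event = input.getString(0)
    // label each live event as it arrives and emit it downstream
    collector.emit(new Values(event, classifyGender(event)))
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("event", "label"))
}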
for every 5-10 seconds - which locations are being attacked by hackers?
that's called micro batching
Spark Streaming won't perform the micro-batching analytics itself
it only prepares the micro batches
in the form of a DStream (a continuous series of RDDs)
of course, we apply the algorithm on the DStream, and automatically it is applied to each and every RDD of the DStream
while Spark Core is processing some of the RDDs, Spark Streaming won't be idle; it keeps collecting and preparing the queue of next batches
Batch period :
val ssc = new StreamingContext(sc, Seconds(10))
val ds1 = ssc.textFileStream("...")
ssc.socketTextStream("localhost", 9999) -- netcat
listening on and capturing from that port
it captures the stream but it won't pass it to Spark Core immediately
it keeps collecting the stream for 10 seconds, then passes the micro batch to Spark Core only after those 10 seconds
within 10 seconds, say some 100 events arrive
in the meantime, Spark Streaming buffers those 100 events (10 seconds' worth) on some node of the Spark cluster
later it passes them to the core engine
which then processes them and applies the transformation (business logic) on those RDDs
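A minimal way to try this locally (assuming a spark-shell where sc already exists, plus the netcat utility; port 9999 is just an example):

// in another terminal: nc -lk 9999   (then type lines into it)
// everything typed within one 10-second interval becomes one RDD of the DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
lines.print()            // output operation
ssc.start()
ssc.awaitTermination()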
1st second - event 1 --> "I love you"
5th second - event 2 --> "you love me"
9th second - event 3 --> "He loves you"
all of the above are batched together into a single RDD after 10 seconds
ds1 -> the DStream holding these RDDs
our code doesn't handle a particular RDD directly
it is written against the DStream
Flatten -> split by space
val ds2 = ds1.flatMap(x => x.split(" "))
flatMap turns an array of arrays into a single array
it flattens a multi-level (nested) collection into a single collection
a regular expression can be applied there so that continuous white spaces are handled as well
the result, ds2, is a continuous series of RDDs
if we take one particular RDD, the data will be an array
like : Array(I, love, you, you, love, me, He, loves, you)
all together that is RDD1
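As a quick local illustration of why flatMap (and not map) is used here, with plain Scala collections rather than course code:

val lines = Seq("I love you", "you love me", "He loves you")

// map keeps the nesting: Seq(Array("I","love","you"), Array("you","love","me"), ...)
val nested = lines.map(_.split(" "))

// flatMap flattens into one collection of words:
// Seq("I","love","you","you","love","me","He","loves","you")
val words = lines.flatMap(_.split(" "))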
Spark Core is splitting and performing the word count; in the meantime Spark Streaming
won't be idle, it keeps collecting the events happening live
collecting the next 10 seconds of data and buffering it
whenever we perform some transformation / filter on an RDD, what do we get?
we get one more RDD
but here we performed the transformation on a DStream, so what do we get?
we get one more DStream
the object type of the result is DStream
val ds3 = ds2.map ( x => (x,1))
(I,1 )
(love,1 )
(you,1 )
(you,1 )
(love,1 )
(me,1 )
(he,1 )
(loves,1)
(you,1 )
from an array into (key, value) tuples
ds3's RDDs contain tuples
we have created a pair RDD (per batch)
ds3 is one more DStream
val ds4 = ds3.reduceByKey(_+_);
( I,1 )
( love,2 )
( you,3 )
( me,1 )
( he,1 )
( loves,1)
Spark Core generated the result for the 1st batch of the DStream (the 1st RDD)
Spark Streaming keeps generating micro batches and sending them to Spark Core
like above, Spark Core processes each batch and sends back the result of the word count (transformation)
for each batch, Spark Core does the processing and the results aggregation
Who performed processing?
Spark Core
Spark Streaming :
independently collects the events, prepares them as a batch for every given interval, and passes them to Spark Core
Spark Core does the processing
ds4.print()
Steps involved
#1. Context object creation - in the context we specify the micro-batch period
#2. DStream preparation -- the root must connect with a source
#3. Output operation
#4. Start the streaming
how to start the streaming?
ssc.start()
static file - textFileStream
port - socketTextStream
Kafka - Kafka stream - needs KafkaUtils - the related libraries must be embedded (see the sketch below)
JMS - Java Message Service - the related libraries must be embedded
a stream that does not take its input from another stream is called the root
of ds1, ds2, ds3, ds4 the root is ds1
the root should connect with a source
a source can be a file, a port, another application, or another streaming system like Kafka, JMS or Flume
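For the Kafka case, a hedged sketch using the spark-streaming-kafka-0-10 integration (the broker address, group id and topic name "app1-events" are placeholders; ssc is the StreamingContext from above):

// reading a Kafka topic as a DStream (spark-streaming-kafka-0-10 integration)
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "spark-streaming-demo"
)

val kafkaStream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("app1-events"), kafkaParams))

// the root DStream: each record's value is the raw event payload
val events = kafkaStream.map(record => record.value)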
once the DStream is prepared, nothing happens by itself; we need to specify an output operation
output into a file, HDFS:
ds4.saveAsTextFiles("...hdfs location prefix")
(on a DStream the method is saveAsTextFiles; there is no direct saveAsParquetFile - Parquet output would go through DataFrames / foreachRDD)
just want to see the output on the console? that's also an output:
ds4.print()
to write into a Kafka topic,
we need to embed the producer-related code (see the sketch below)
the target can be anything
the result can be rewritten into Kafka
advantage : other 3rd-party applications can fetch the results generated by Spark Core
Kafka is usually reachable from all the other applications of an organization
they will apply the consumer APIs on the target topic
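A common pattern for writing a DStream's results back to Kafka (sketch only; the topic name "wordcounts" is a placeholder) is to open a producer per partition inside foreachRDD, so the producer is created on the executors rather than the driver:

// writing each micro-batch result of ds4 (word, count) back to a Kafka topic
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

ds4.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    partition.foreach { case (word, count) =>
      producer.send(new ProducerRecord[String, String]("wordcounts", word, count.toString))
    }
    producer.close()
  }
}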
val ssc = new StreamingContext(sc, Seconds(10))
// Seconds(10) is the micro-batch period (Minutes(10) etc. would also work)
val ds1 = ssc.socketTextStream("localhost", 9999)
// ds1 - the root DStream
val ds2 = ds1.flatMap(x => x.split("\\s+"))
// \\s+ : one or more (continuous) white spaces
val dspair = ds2.map(x => (x, 1))
val dsres = dspair.reduceByKey(_ + _)
dsres.print()
ssc.start() // to start the StreamingContext
for every 10 seconds, we will get different results
if the source is not generating any events, the batch (input) will be empty
if the batch is empty, the output will also be empty
but the job will keep on running
one batch is being collected while another batch is processed and printed
we can apply a machine learning algorithm, business logic, any transformation
K-Means, supervised / unsupervised learning, linear regression
the algorithm is in your hands - apply whatever you need (see the transform sketch below)
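One way to plug arbitrary per-batch logic (a filter, a scoring function, an MLlib model) into the pipeline is DStream.transform, sketched here with a hypothetical score function standing in for the real algorithm (ds2 is the words DStream from above):

// applying custom per-batch logic with transform; score() is a stand-in
def score(word: String): Double = if (word.length > 4) 1.0 else 0.0

val scored = ds2.transform { rdd =>
  rdd.map(word => (word, score(word)))   // runs on each micro-batch RDD
}
scored.print()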
Windowing and Sliding:
----------------------
the faculty is interested in daily attendance (1 day) - micro batch
the principal visits once in 2 days - sliding
when he visits, he wants to see 4 days of attendance details - windowing
micro batch (10 seconds), sliding (20 seconds), windowing (40 seconds)
Sliding #1
Microbatch #1 : 01 to 10
Microbatch #2 : 11 to 20
Sliding #2:
Microbatch #3 : 21 to 30
Microbatch #4 : 31 to 40
Sliding #3:
Microbatch #5 : 41 to 50
Microbatch #6 : 51 to 60
Sliding #4:
Microbatch #7 : 61 to 70
Microbatch #8 : 71 to 80
Sliding #5:
Microbatch #9 : 81 to 90
Microbatch #10 : 91 to 100
Windowing:
Windowing #1
Sliding #1
Microbatch #1 : 01 to 10
Microbatch #2 : 11 to 20
Sliding #2:
Microbatch #3 : 21 to 30
Microbatch #4 : 31 to 40
Windowing #2:
Sliding #2:
Microbatch #3 : 21 to 30
Microbatch #4 : 31 to 40
Sliding #3:
Microbatch #5 : 41 to 50
Microbatch #6 : 51 to 60
Windowing #3:
Sliding #3:
Microbatch #5 : 41 to 50
Microbatch #6 : 51 to 60
Sliding #4:
Microbatch #7 : 61 to 70
Microbatch #8 : 71 to 80
Windowing #4:
Sliding #4:
Microbatch #7 : 61 to 70
Microbatch #8 : 71 to 80
Sliding #5:
Microbatch #9 : 81 to 90
Microbatch #10 : 91 to 100
--------------
--------------------
------------------------
--------------------------
-----------------------
---------------------
these ---- dashed lines represent windowing (overlapping windows)
the faculty is responsible for taking attendance once a day (daily attendance) - micro batch
the principal is responsible for monitoring the faculty's attendance once in 2 days -- sliding
the principal is interested in looking at 4 days of attendance - windowing
whenever Spark Core has processed a particular RDD, that RDD is immediately discarded
we need to persist 4 RDDs to do the windowing
windowing is a set of micro batches
sliding is the interval at which the window batch is computed
dspair.reduceByKeyAndWindow(_ + _, Seconds(40), Seconds(20))
Seconds(40) -> window, Seconds(20) -> sliding (in the API the window length comes before the slide interval)
reduceByKeyAndWindow is valid only on the DStream API
but we need to keep dspair.persist()
without persist we can't do windowing
while persisting, keep only the window's worth of data
don't keep all RDDs persisted
once a window has been computed, the data it no longer needs will be discarded
val ssc = ...
val ds1 = ssc. ...
val ds2 = ds1. ...
val dspair = ds2.map(...)
dspair.persist()
val rep1 = dspair.reduceByKey(_ + _)
val rep2 = dspair.reduceByKeyAndWindow(_ + _, Seconds(40), Seconds(20))
rep1.print()
rep2.print()
ssc.start()
rep1 : output for every 10-second micro batch
rep2 : output every 20 seconds (the slide), each covering the last 40 seconds (the window)
when the window period and the sliding period are the same, the windows do not overlap (tumbling windows):
__________
___________
___________
___________
__________
here it won't keep the entire DStream persisted; it always keeps only the most recent window's worth (40 seconds) persisted
the remaining unnecessary data is removed
reduceByKeyAndWindow is only available on DStreams; it takes
a. the cumulative (reduce) operation
b. the window interval
c. the sliding interval
the intervals are given as durations, e.g. Seconds(...)
the sliding period must be >= the micro-batching period
the windowing period must be >= the sliding period
demo : typing into the socket (netcat)
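Putting the windowed version together as a hedged, self-contained sketch (spark-shell assumed, so sc already exists; port 9999 and the checkpoint path are placeholders; note that in the Scala API the window length comes before the slide interval):

// windowed word count: 40-second window, evaluated every 20 seconds, on 10-second micro batches
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("/tmp/streaming-checkpoint")   // required if you later use the inverse-function variant

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split("\\s+"))
val pairs = words.map(w => (w, 1))
pairs.persist()

val perBatch  = pairs.reduceByKey(_ + _)                                               // every 10 s
val perWindow = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(40), Seconds(20))

perBatch.print()
perWindow.print()
ssc.start()
ssc.awaitTermination()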
--------------------------------------------------------------------
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Basics").getOrCreate()
df = spark.read.json("people.json")
df.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
df.describe().show()
+-------+------------------+-------+
|summary| age| name|
+-------+------------------+-------+
| count| 2| 3|
| mean| 24.5| null|
| stddev|7.7781745930520225| null|
| min| 19| Andy|
| max| 30|Michael|
+-------+------------------+-------+
df.columns
['age', 'name']
df.columns[0]
'age'
df.columns[-1]
'name'
df.describe()
DataFrame[summary: string, age: string, name: string]
from pyspark.sql.types import (StructField,StringType,
IntegerType,StructType)
data_schema = [StructField("age",IntegerType(),True),
StructField("name",StringType(),True)]
final_struct = StructType(fields=data_schema)
df = spark.read.json("people.json",schema=final_struct)
df.printSchema()
root
|-- age: integer (nullable = true)
|-- name: string (nullable = true)
df.describe().show()
+-------+------------------+-------+
|summary| age| name|
+-------+------------------+-------+
| count| 2| 3|
| mean| 24.5| null|
| stddev|7.7781745930520225| null|
| min| 19| Andy|
| max| 30|Michael|
+-------+------------------+-------+
df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
df["age"]
Column<b'age'>
type(df["age"])
pyspark.sql.column.Column
df.select("age")
DataFrame[age: int]
df.select("age").show()
+----+
| age|
+----+
|null|
| 30|
| 19|
+----+
df.head(2)
[Row(age=None, name='Michael'), Row(age=30, name='Andy')]
df.head(2)[0]
Row(age=None, name='Michael')
df.head(2)[1]
Row(age=30, name='Andy')
df.select("age").show()
+----+
| age|
+----+
|null|
| 30|
| 19|
+----+
df.select(["age","name"])
DataFrame[age: int, name: string]
df.select(["age","name"]).show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
df.withColumn("nameAge",df["age"]).show()
+----+-------+-------+
| age| name|nameAge|
+----+-------+-------+
|null|Michael| null|
| 30| Andy| 30|
| 19| Justin| 19|
+----+-------+-------+
df.withColumn("DoubleAge",df["age"]*2).show()
+----+-------+---------+
| age| name|DoubleAge|
+----+-------+---------+
|null|Michael| null|
| 30| Andy| 60|
| 19| Justin| 38|
+----+-------+---------+
df.withColumnRenamed("age","my_new_age").show()
+----------+-------+
|my_new_age| name|
+----------+-------+
| null|Michael|
| 30| Andy|
| 19| Justin|
+----------+-------+
df.createOrReplaceTempView("people")
results = spark.sql("SELECT * FROM people")
results.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
results = spark.sql("SELECT * FROM people WHERE age = 30")
results.show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
spark.sql("SELECT * FROM people WHERE age = 30").show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ops').getOrCreate()
df = spark.read.csv("AAPL.csv",inferSchema=True,header=True)
df.head(2)[0]
Row(Date=datetime.datetime(2018, 2, 13, 0, 0), Open=161.949997, High=164.75, Low=161.649994, Close=164.339996, Adj Close=164.339996, Volume=32549200)
df.printSchema()
root
|-- Date: timestamp (nullable = true)
|-- Open: double (nullable = true)
|-- High: double (nullable = true)
|-- Low: double (nullable = true)
|-- Close: double (nullable = true)
|-- Adj Close: double (nullable = true)
|-- Volume: integer (nullable = true)
df.show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-13 00:00:00|161.949997| 164.75|161.649994|164.339996|164.339996|32549200|
|2018-02-14 00:00:00|163.039993|167.539993|162.880005|167.369995|167.369995|40644900|
|2018-02-15 00:00:00|169.789993|173.089996| 169.0|172.990005|172.990005|51147200|
|2018-02-16 00:00:00|172.360001|174.820007|171.770004|172.429993|172.429993|40176100|
|2018-02-20 00:00:00|172.050003|174.259995|171.419998|171.850006|171.850006|33930500|
|2018-02-21 00:00:00|172.830002|174.119995|171.009995|171.070007|171.070007|37471600|
|2018-02-22 00:00:00|171.800003|173.949997|171.710007| 172.5| 172.5|30991900|
|2018-02-23 00:00:00|173.669998|175.649994|173.539993| 175.5| 175.5|33812400|
|2018-02-26 00:00:00|176.350006|179.389999|176.210007|178.970001|178.970001|38162200|
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
|2018-02-28 00:00:00|179.259995|180.619995|178.050003|178.119995|178.119995|37782100|
|2018-03-01 00:00:00|178.539993|179.779999|172.660004| 175.0| 175.0|48802000|
|2018-03-02 00:00:00|172.800003|176.300003|172.449997|176.210007|176.210007|38454000|
|2018-03-05 00:00:00|175.210007|177.740005|174.520004|176.820007|176.820007|28401400|
|2018-03-06 00:00:00|177.910004| 178.25|176.130005|176.669998|176.669998|23788500|
|2018-03-07 00:00:00|174.940002|175.850006|174.270004|175.029999|175.029999|31703500|
|2018-03-08 00:00:00|175.479996|177.119995|175.070007|176.940002|176.940002|23774100|
|2018-03-09 00:00:00|177.960007| 180.0|177.389999|179.979996|179.979996|32185200|
|2018-03-12 00:00:00|180.289993|182.389999|180.210007|181.720001|181.720001|32162900|
+-------------------+----------+----------+----------+----------+----------+--------+
df.head(3)[2]
Row(Date=datetime.datetime(2018, 2, 15, 0, 0), Open=169.789993, High=173.089996, Low=169.0, Close=172.990005, Adj Close=172.990005, Volume=51147200)
df.filter("Close = 172.5").show()
+-------------------+----------+----------+----------+-----+---------+--------+
| Date| Open| High| Low|Close|Adj Close| Volume|
+-------------------+----------+----------+----------+-----+---------+--------+
|2018-02-22 00:00:00|171.800003|173.949997|171.710007|172.5| 172.5|30991900|
+-------------------+----------+----------+----------+-----+---------+--------+
df.filter("Close > 175").show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-23 00:00:00|173.669998|175.649994|173.539993| 175.5| 175.5|33812400|
|2018-02-26 00:00:00|176.350006|179.389999|176.210007|178.970001|178.970001|38162200|
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
|2018-02-28 00:00:00|179.259995|180.619995|178.050003|178.119995|178.119995|37782100|
|2018-03-02 00:00:00|172.800003|176.300003|172.449997|176.210007|176.210007|38454000|
|2018-03-05 00:00:00|175.210007|177.740005|174.520004|176.820007|176.820007|28401400|
|2018-03-06 00:00:00|177.910004| 178.25|176.130005|176.669998|176.669998|23788500|
|2018-03-07 00:00:00|174.940002|175.850006|174.270004|175.029999|175.029999|31703500|
|2018-03-08 00:00:00|175.479996|177.119995|175.070007|176.940002|176.940002|23774100|
|2018-03-09 00:00:00|177.960007| 180.0|177.389999|179.979996|179.979996|32185200|
|2018-03-12 00:00:00|180.289993|182.389999|180.210007|181.720001|181.720001|32162900|
+-------------------+----------+----------+----------+----------+----------+--------+
df.filter("Close > 175").select("High").show()
+----------+
| High|
+----------+
|175.649994|
|179.389999|
|180.479996|
|180.619995|
|176.300003|
|177.740005|
| 178.25|
|175.850006|
|177.119995|
| 180.0|
|182.389999|
+----------+
df.filter("Close > 175").select(["High","Low","Volume"]).show()
+----------+----------+--------+
| High| Low| Volume|
+----------+----------+--------+
|175.649994|173.539993|33812400|
|179.389999|176.210007|38162200|
|180.479996|178.160004|38928100|
|180.619995|178.050003|37782100|
|176.300003|172.449997|38454000|
|177.740005|174.520004|28401400|
| 178.25|176.130005|23788500|
|175.850006|174.270004|31703500|
|177.119995|175.070007|23774100|
| 180.0|177.389999|32185200|
|182.389999|180.210007|32162900|
+----------+----------+--------+
df.filter(df["Close"] > 175).show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-23 00:00:00|173.669998|175.649994|173.539993| 175.5| 175.5|33812400|
|2018-02-26 00:00:00|176.350006|179.389999|176.210007|178.970001|178.970001|38162200|
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
|2018-02-28 00:00:00|179.259995|180.619995|178.050003|178.119995|178.119995|37782100|
|2018-03-02 00:00:00|172.800003|176.300003|172.449997|176.210007|176.210007|38454000|
|2018-03-05 00:00:00|175.210007|177.740005|174.520004|176.820007|176.820007|28401400|
|2018-03-06 00:00:00|177.910004| 178.25|176.130005|176.669998|176.669998|23788500|
|2018-03-07 00:00:00|174.940002|175.850006|174.270004|175.029999|175.029999|31703500|
|2018-03-08 00:00:00|175.479996|177.119995|175.070007|176.940002|176.940002|23774100|
|2018-03-09 00:00:00|177.960007| 180.0|177.389999|179.979996|179.979996|32185200|
|2018-03-12 00:00:00|180.289993|182.389999|180.210007|181.720001|181.720001|32162900|
+-------------------+----------+----------+----------+----------+----------+--------+
df.filter(df["Close"] > 175).select(["Volume"]).show()
+--------+
| Volume|
+--------+
|33812400|
|38162200|
|38928100|
|37782100|
|38454000|
|28401400|
|23788500|
|31703500|
|23774100|
|32185200|
|32162900|
+--------+
df.filter((df["Close"] > 175) & (df["Volume"] >= 38454000)).show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
|2018-03-02 00:00:00|172.800003|176.300003|172.449997|176.210007|176.210007|38454000|
+-------------------+----------+----------+----------+----------+----------+--------+
df.filter(df["Low"] == 178.160004).show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
+-------------------+----------+----------+----------+----------+----------+--------+
result = df.filter(df["Low"] == 178.160004).collect()
result[0]
Row(Date=datetime.datetime(2018, 2, 27, 0, 0), Open=179.100006, High=180.479996, Low=178.160004, Close=178.389999, Adj Close=178.389999, Volume=38928100)
result[0].asDict()
{'Adj Close': 178.389999,
'Close': 178.389999,
'Date': datetime.datetime(2018, 2, 27, 0, 0),
'High': 180.479996,
'Low': 178.160004,
'Open': 179.100006,
'Volume': 38928100}
result[0].asDict().keys()
dict_keys(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'])
result[0].asDict()["Volume"]
38928100
--------------------------------------------------------------------