>>> ut.count()
>>> ut.first()
'dispatching_base_number,date,active_vehicles,trips'
>>> rows = ut.map(lambda line: line.split(","))
>>> rows.map(lambda row: row[0]).distinct().count()
>>> base02617 = rows.filter(lambda row: "B02617" in row)
>>> base02617.collect()
>>> base02617.filter(lambda row: int(row[3]) > 15000).map(lambda day: day[1]).distinct().count()
Spark Core Intro:
-----------------
Want to learn Spark
to know fundamentals of Spark
Evaluate Spark
Engine for efficient large-scale data processing. Faster than Hadoop MapReduce
Spark can complement existing Hadoop investments such as HDFS and Hive
Rich ecosystem including support for SQL, ML, and multiple language APIs: Java, Scala, Python
RDD - Resilient Distributed Datasets
Transformation
Actions
Spark Driver Programs and SparkContext
RDDs :
Primary abstraction for data interaction (lazy, in memory)
Immutable, distributed collection of elements separated into partitions
Multiple Types of RDDs
RDDs can be created from external data sets such as Hadoop InputFormats or text files on a variety of file systems,
or from existing RDDs via Spark Transformations
Transformations are RDD functions which return pointers to new RDDs (lazy)
map, flatMap, filter
A transformation creates a new RDD
Actions are RDD functions which return values to the driver
reduce, collect, count etc.
Transformations ==> RDDs ==> Actions ==> output results
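A minimal PySpark sketch of this flow (assuming an existing SparkContext named sc and a hypothetical numbers.txt file with one integer per line): map and filter only describe new RDDs; nothing executes until the count action is called.
nums = sc.textFile("numbers.txt")           # transformation: defines an RDD, nothing is read yet
ints = nums.map(lambda line: int(line))     # transformation: lazily parse each line as an int
evens = ints.filter(lambda n: n % 2 == 0)   # transformation: lazily keep even numbers
print(evens.count())                        # action: runs the whole pipeline, returns a value to the driver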
Spark Driver, Workers
Spark Driver ==> program that declares transformations and actions on RDDs of data
Driver submits the serialized RDD graph to the master where the master creates tasks.
These tasks are delegated to the workers for execution
Workers are where the tasks are actually executed.
Driver Program (Spark Context)
Cluster Manager
Worker Node [ Executor, Cache, Tasks ]
The Driver Program, using a SparkContext, interacts with the Cluster Manager and distributes the load across worker nodes
Parallel processing
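As a standalone sketch of the driver-side view (the app name and master URL below are placeholders): the driver builds a SparkContext, which talks to the cluster manager, and each action distributes tasks to executors on the worker nodes.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("driver-demo").setMaster("local[2]")   # placeholder master URL
sc = SparkContext(conf=conf)                # the driver program owns this SparkContext
rdd = sc.parallelize(range(1, 1001), 4)     # data is split into 4 partitions for the executors
total = rdd.sum()                           # the action ships tasks to workers; the result comes back to the driver
print(total)                                # 500500
sc.stop()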
RDDs support 2 types of operations:
Transformations
Creates new dataset from an existing one
Lazy; only computed when a result is required
Transformed RDDs are recomputed each time an action is run against it.
persist / cache to avoid recomputing
Actions
Returns a value to the driver program
ubercsv = sc.textFile("uber.csv") ==> New RDD Created
rows = ubercsv.map(lambda l: len(l)) ==> Another new RDD
totalRows = rows.reduce(lambda a,b:a+b) ==> RDD is now computed across different machines
rows.cache() ==> reuse without recomputing
Actions aggregate all the RDD elements using some function such as the previously seen Reduce
Returns the final result back to the driver program.
baby_names = sc.textFile("baby_names.csv")
rows = baby_names.map(lambda line: line.split(","))
for row in rows.take(rows.count()) : print(row[1])
First Name
DOMINIC
ADDISION.....
rows.filter(lambda line:"MICHAEL" in line).collect()
[u'2012',u'MICHAEL',u'KINGS',u'M',u'172']....
Algorithm difficulties
Optimization for existing Hadoop
100 times faster
Processing power, time, code: all shrink
Tinier code, increased readability
Expressiveness
Fast
Computation against disk - MapReduce
Computation against cached data in memory - Spark
Directly interact with data using the local machine
Scale up / scale out
Fault tolerant
Unify Big Data - batch, stream (real time), MLlib
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Basics").getOrCreate()
df = spark.read.json("people.json")
df.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
df.describe().show()
+-------+------------------+-------+
|summary| age| name|
+-------+------------------+-------+
| count| 2| 3|
| mean| 24.5| null|
| stddev|7.7781745930520225| null|
| min| 19| Andy|
| max| 30|Michael|
+-------+------------------+-------+
df.columns
['age', 'name']
df.columns[0]
'age'
df.columns[-1]
'name'
df.describe()
DataFrame[summary: string, age: string, name: string]
from pyspark.sql.types import (StructField,StringType,
IntegerType,StructType)
data_schema = [StructField("age",IntegerType(),True),
StructField("name",StringType(),True)]
final_struct = StructType(fields=data_schema)
df = spark.read.json("people.json",schema=final_struct)
df.printSchema()
root
|-- age: integer (nullable = true)
|-- name: string (nullable = true)
df.describe().show()
+-------+------------------+-------+
|summary| age| name|
+-------+------------------+-------+
| count| 2| 3|
| mean| 24.5| null|
| stddev|7.7781745930520225| null|
| min| 19| Andy|
| max| 30|Michael|
+-------+------------------+-------+
df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
df["age"]
Column<b'age'>
type(df["age"])
pyspark.sql.column.Column
df.select("age")
DataFrame[age: int]
df.select("age").show()
+----+
| age|
+----+
|null|
| 30|
| 19|
+----+
df.head(2)
[Row(age=None, name='Michael'), Row(age=30, name='Andy')]
df.head(2)[0]
Row(age=None, name='Michael')
df.head(2)[1]
Row(age=30, name='Andy')
df.select("age").show()
+----+
| age|
+----+
|null|
| 30|
| 19|
+----+
df.select(["age","name"])
DataFrame[age: int, name: string]
df.select(["age","name"]).show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
df.withColumn("nameAge",df["age"]).show()
+----+-------+-------+
| age| name|nameAge|
+----+-------+-------+
|null|Michael| null|
| 30| Andy| 30|
| 19| Justin| 19|
+----+-------+-------+
df.withColumn("DoubleAge",df["age"]*2).show()
+----+-------+---------+
| age| name|DoubleAge|
+----+-------+---------+
|null|Michael| null|
| 30| Andy| 60|
| 19| Justin| 38|
+----+-------+---------+
df.withColumnRenamed("age","my_new_age").show()
+----------+-------+
|my_new_age| name|
+----------+-------+
| null|Michael|
| 30| Andy|
| 19| Justin|
+----------+-------+
df.createOrReplaceTempView("people")
results = spark.sql("SELECT * FROM people")
results.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
results = spark.sql("SELECT * FROM people WHERE age = 30")
results.show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
spark.sql("SELECT * FROM people WHERE age = 30").show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ops').getOrCreate()
df = spark.read.csv("AAPL.csv",inferSchema=True,header=True)
df.head(2)[0]
Row(Date=datetime.datetime(2018, 2, 13, 0, 0), Open=161.949997, High=164.75, Low=161.649994, Close=164.339996, Adj Close=164.339996, Volume=32549200)
df.printSchema()
root
|-- Date: timestamp (nullable = true)
|-- Open: double (nullable = true)
|-- High: double (nullable = true)
|-- Low: double (nullable = true)
|-- Close: double (nullable = true)
|-- Adj Close: double (nullable = true)
|-- Volume: integer (nullable = true)
df.show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-13 00:00:00|161.949997| 164.75|161.649994|164.339996|164.339996|32549200|
|2018-02-14 00:00:00|163.039993|167.539993|162.880005|167.369995|167.369995|40644900|
|2018-02-15 00:00:00|169.789993|173.089996| 169.0|172.990005|172.990005|51147200|
|2018-02-16 00:00:00|172.360001|174.820007|171.770004|172.429993|172.429993|40176100|
|2018-02-20 00:00:00|172.050003|174.259995|171.419998|171.850006|171.850006|33930500|
|2018-02-21 00:00:00|172.830002|174.119995|171.009995|171.070007|171.070007|37471600|
|2018-02-22 00:00:00|171.800003|173.949997|171.710007| 172.5| 172.5|30991900|
|2018-02-23 00:00:00|173.669998|175.649994|173.539993| 175.5| 175.5|33812400|
|2018-02-26 00:00:00|176.350006|179.389999|176.210007|178.970001|178.970001|38162200|
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
|2018-02-28 00:00:00|179.259995|180.619995|178.050003|178.119995|178.119995|37782100|
|2018-03-01 00:00:00|178.539993|179.779999|172.660004| 175.0| 175.0|48802000|
|2018-03-02 00:00:00|172.800003|176.300003|172.449997|176.210007|176.210007|38454000|
|2018-03-05 00:00:00|175.210007|177.740005|174.520004|176.820007|176.820007|28401400|
|2018-03-06 00:00:00|177.910004| 178.25|176.130005|176.669998|176.669998|23788500|
|2018-03-07 00:00:00|174.940002|175.850006|174.270004|175.029999|175.029999|31703500|
|2018-03-08 00:00:00|175.479996|177.119995|175.070007|176.940002|176.940002|23774100|
|2018-03-09 00:00:00|177.960007| 180.0|177.389999|179.979996|179.979996|32185200|
|2018-03-12 00:00:00|180.289993|182.389999|180.210007|181.720001|181.720001|32162900|
+-------------------+----------+----------+----------+----------+----------+--------+
df.head(3)[2]
Row(Date=datetime.datetime(2018, 2, 15, 0, 0), Open=169.789993, High=173.089996, Low=169.0, Close=172.990005, Adj Close=172.990005, Volume=51147200)
df.filter("Close = 172.5").show()
+-------------------+----------+----------+----------+-----+---------+--------+
| Date| Open| High| Low|Close|Adj Close| Volume|
+-------------------+----------+----------+----------+-----+---------+--------+
|2018-02-22 00:00:00|171.800003|173.949997|171.710007|172.5| 172.5|30991900|
+-------------------+----------+----------+----------+-----+---------+--------+
df.filter("Close > 175").show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-23 00:00:00|173.669998|175.649994|173.539993| 175.5| 175.5|33812400|
|2018-02-26 00:00:00|176.350006|179.389999|176.210007|178.970001|178.970001|38162200|
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
|2018-02-28 00:00:00|179.259995|180.619995|178.050003|178.119995|178.119995|37782100|
|2018-03-02 00:00:00|172.800003|176.300003|172.449997|176.210007|176.210007|38454000|
|2018-03-05 00:00:00|175.210007|177.740005|174.520004|176.820007|176.820007|28401400|
|2018-03-06 00:00:00|177.910004| 178.25|176.130005|176.669998|176.669998|23788500|
|2018-03-07 00:00:00|174.940002|175.850006|174.270004|175.029999|175.029999|31703500|
|2018-03-08 00:00:00|175.479996|177.119995|175.070007|176.940002|176.940002|23774100|
|2018-03-09 00:00:00|177.960007| 180.0|177.389999|179.979996|179.979996|32185200|
|2018-03-12 00:00:00|180.289993|182.389999|180.210007|181.720001|181.720001|32162900|
+-------------------+----------+----------+----------+----------+----------+--------+
df.filter("Close > 175").select("High").show()
+----------+
| High|
+----------+
|175.649994|
|179.389999|
|180.479996|
|180.619995|
|176.300003|
|177.740005|
| 178.25|
|175.850006|
|177.119995|
| 180.0|
|182.389999|
+----------+
df.filter("Close > 175").select(["High","Low","Volume"]).show()
+----------+----------+--------+
| High| Low| Volume|
+----------+----------+--------+
|175.649994|173.539993|33812400|
|179.389999|176.210007|38162200|
|180.479996|178.160004|38928100|
|180.619995|178.050003|37782100|
|176.300003|172.449997|38454000|
|177.740005|174.520004|28401400|
| 178.25|176.130005|23788500|
|175.850006|174.270004|31703500|
|177.119995|175.070007|23774100|
| 180.0|177.389999|32185200|
|182.389999|180.210007|32162900|
+----------+----------+--------+
df.filter(df["Close"] > 175).show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-23 00:00:00|173.669998|175.649994|173.539993| 175.5| 175.5|33812400|
|2018-02-26 00:00:00|176.350006|179.389999|176.210007|178.970001|178.970001|38162200|
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
|2018-02-28 00:00:00|179.259995|180.619995|178.050003|178.119995|178.119995|37782100|
|2018-03-02 00:00:00|172.800003|176.300003|172.449997|176.210007|176.210007|38454000|
|2018-03-05 00:00:00|175.210007|177.740005|174.520004|176.820007|176.820007|28401400|
|2018-03-06 00:00:00|177.910004| 178.25|176.130005|176.669998|176.669998|23788500|
|2018-03-07 00:00:00|174.940002|175.850006|174.270004|175.029999|175.029999|31703500|
|2018-03-08 00:00:00|175.479996|177.119995|175.070007|176.940002|176.940002|23774100|
|2018-03-09 00:00:00|177.960007| 180.0|177.389999|179.979996|179.979996|32185200|
|2018-03-12 00:00:00|180.289993|182.389999|180.210007|181.720001|181.720001|32162900|
+-------------------+----------+----------+----------+----------+----------+--------+
df.filter(df["Close"] > 175).select(["Volume"]).show()
+--------+
| Volume|
+--------+
|33812400|
|38162200|
|38928100|
|37782100|
|38454000|
|28401400|
|23788500|
|31703500|
|23774100|
|32185200|
|32162900|
+--------+
df.filter((df["Close"] > 175) & (df["Volume"] >= 38454000)).show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
|2018-03-02 00:00:00|172.800003|176.300003|172.449997|176.210007|176.210007|38454000|
+-------------------+----------+----------+----------+----------+----------+--------+
df.filter(df["Low"] == 178.160004).show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
+-------------------+----------+----------+----------+----------+----------+--------+
result = df.filter(df["Low"] == 178.160004).collect()
result[0]
Row(Date=datetime.datetime(2018, 2, 27, 0, 0), Open=179.100006, High=180.479996, Low=178.160004, Close=178.389999, Adj Close=178.389999, Volume=38928100)
result[0].asDict()
{'Adj Close': 178.389999,
'Close': 178.389999,
'Date': datetime.datetime(2018, 2, 27, 0, 0),
'High': 180.479996,
'Low': 178.160004,
'Open': 179.100006,
'Volume': 38928100}
result[0].asDict().keys()
dict_keys(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'])
result[0].asDict()["Volume"]
38928100
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("aggs").getOrCreate()
df = spark.read.csv("sales_info.csv",inferSchema=True,header=True)
df.show()
+-------+-------+-----+
|Company| Person|Sales|
+-------+-------+-----+
| GOOG| Sam|200.0|
| GOOG|Charlie|120.0|
| GOOG| Frank|340.0|
| MSFT| Tina|600.0|
| MSFT| Amy|124.0|
| MSFT|Vanessa|243.0|
| FB| Carl|870.0|
| FB| Sarah|350.0|
| APPL| John|250.0|
| APPL| Linda|130.0|
| APPL| Mike|750.0|
| APPL| Chris|350.0|
+-------+-------+-----+
df.printSchema()
root
|-- Company: string (nullable = true)
|-- Person: string (nullable = true)
|-- Sales: double (nullable = true)
df.groupBy("Company")
<pyspark.sql.group.GroupedData at 0x7f59125082b0>
df.groupBy("Company").mean()
DataFrame[Company: string, avg(Sales): double]
df.groupBy("Company").mean().show()
+-------+-----------------+
|Company| avg(Sales)|
+-------+-----------------+
| APPL| 370.0|
| GOOG| 220.0|
| FB| 610.0|
| MSFT|322.3333333333333|
+-------+-----------------+
df.groupBy("Company").sum().show()
+-------+----------+
|Company|sum(Sales)|
+-------+----------+
| APPL| 1480.0|
| GOOG| 660.0|
| FB| 1220.0|
| MSFT| 967.0|
+-------+----------+
df.groupBy("Company").max().show()
+-------+----------+
|Company|max(Sales)|
+-------+----------+
| APPL| 750.0|
| GOOG| 340.0|
| FB| 870.0|
| MSFT| 600.0|
+-------+----------+
df.groupBy("Company").min().show()
+-------+----------+
|Company|min(Sales)|
+-------+----------+
| APPL| 130.0|
| GOOG| 120.0|
| FB| 350.0|
| MSFT| 124.0|
+-------+----------+
df.groupBy("Company").count().show()
+-------+-----+
|Company|count|
+-------+-----+
| APPL| 4|
| GOOG| 3|
| FB| 2|
| MSFT| 3|
+-------+-----+
df.agg({"Sales":"sum"}).show()
+----------+
|sum(Sales)|
+----------+
| 4327.0|
+----------+
df.agg({"Sales":"max"}).show()
+----------+
|max(Sales)|
+----------+
| 870.0|
+----------+
group_data = df.groupBy("Company")
group_data.agg({"Sales":"max"}).show()
+-------+----------+
|Company|max(Sales)|
+-------+----------+
| APPL| 750.0|
| GOOG| 340.0|
| FB| 870.0|
| MSFT| 600.0|
+-------+----------+
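For reference, several aggregations can also be requested in one pass with agg() and the functions module; a minimal sketch against the same sales DataFrame (the aliases below are just illustrative names):
from pyspark.sql import functions as F

df.groupBy("Company").agg(
    F.sum("Sales").alias("Total_Sales"),   # total per company
    F.max("Sales").alias("Max_Sale"),      # largest single sale per company
    F.count("Sales").alias("Num_Sales")    # number of sales rows per company
).show()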
from pyspark.sql.functions import (countDistinct,avg,stddev)
df.select(countDistinct("Sales")).show()
+---------------------+
|count(DISTINCT Sales)|
+---------------------+
| 11|
+---------------------+
df.select(avg("Sales")).show()
+-----------------+
| avg(Sales)|
+-----------------+
|360.5833333333333|
+-----------------+
df.select(avg("Sales").alias("Average Sales")).show()
+-----------------+
| Average Sales|
+-----------------+
|360.5833333333333|
+-----------------+
df.select(stddev("Sales")).show()
+------------------+
|stddev_samp(Sales)|
+------------------+
|250.08742410799007|
+------------------+
from pyspark.sql.functions import format_number
sales_std = df.select(stddev("Sales").alias("Std"))
sales_std.show()
+------------------+
| Std|
+------------------+
|250.08742410799007|
+------------------+
sales_std.select(format_number('std',2).alias("Standard Deviation")).show()
+------------------+
|Standard Deviation|
+------------------+
| 250.09|
+------------------+
df.orderBy("Sales").show()
+-------+-------+-----+
|Company| Person|Sales|
+-------+-------+-----+
| GOOG|Charlie|120.0|
| MSFT| Amy|124.0|
| APPL| Linda|130.0|
| GOOG| Sam|200.0|
| MSFT|Vanessa|243.0|
| APPL| John|250.0|
| GOOG| Frank|340.0|
| FB| Sarah|350.0|
| APPL| Chris|350.0|
| MSFT| Tina|600.0|
| APPL| Mike|750.0|
| FB| Carl|870.0|
+-------+-------+-----+
df.orderBy(df["Sales"].desc()).show()
+-------+-------+-----+
|Company| Person|Sales|
+-------+-------+-----+
| FB| Carl|870.0|
| APPL| Mike|750.0|
| MSFT| Tina|600.0|
| FB| Sarah|350.0|
| APPL| Chris|350.0|
| GOOG| Frank|340.0|
| APPL| John|250.0|
| MSFT|Vanessa|243.0|
| GOOG| Sam|200.0|
| APPL| Linda|130.0|
| MSFT| Amy|124.0|
| GOOG|Charlie|120.0|
+-------+-------+-----+
Missing Data:
Keep the missing data points as NULLs
Drop the missing points including entire row
Fill it with some other value
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Missing").getOrCreate()
df = spark.read.csv("ContainsNull.csv",inferSchema=True,header=True)
df.show()
+----+-----+-----+
| Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp2| null| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
df.na.drop().show()  # drops any row that contains at least one null value
+----+-----+-----+
| Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+
df.na.drop(thresh=2).show()  # thresh=2 keeps only rows that have at least 2 non-null values
+----+-----+-----+
| Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
df.na.drop(how="any").show()  # drops a row if any column is null (the default)
+----+-----+-----+
| Id| Name|Sales|
+----+-----+-----+
|emp4|Cindy|456.0|
+----+-----+-----+
df.na.drop(how="all").show()
+----+-----+-----+
| Id| Name|Sales|
+----+-----+-----+
|emp1| John| null|
|emp2| null| null|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
df.na.drop(subset=["Sales"]).show()  # drops rows where the Sales column is null
+----+-----+-----+
| Id| Name|Sales|
+----+-----+-----+
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
df.printSchema()
root
|-- Id: string (nullable = true)
|-- Name: string (nullable = true)
|-- Sales: double (nullable = true)
df.na.fill("FILLER").show()  # fills nulls in string columns with "FILLER"
+----+------+-----+
| Id| Name|Sales|
+----+------+-----+
|emp1| John| null|
|emp2|FILLER| null|
|emp3|FILLER|345.0|
|emp4| Cindy|456.0|
+----+------+-----+
df.na.fill(0).show()  # fills nulls in numeric columns with 0
+----+-----+-----+
| Id| Name|Sales|
+----+-----+-----+
|emp1| John| 0.0|
|emp2| null| 0.0|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
df.na.fill("No Name",subset=["Name"]).show()  # fills nulls only in the Name column with 'No Name'
+----+-------+-----+
| Id| Name|Sales|
+----+-------+-----+
|emp1| John| null|
|emp2|No Name| null|
|emp3|No Name|345.0|
|emp4| Cindy|456.0|
+----+-------+-----+
from pyspark.sql.functions import mean
mean_val = df.select(mean(df["Sales"])).collect()
mean_val
[Row(avg(Sales)=400.5)]
mean_val[0][0]
400.5
mean_sales = mean_val[0][0]
df.na.fill(mean_sales,["Sales"]).show()
+----+-----+-----+
| Id| Name|Sales|
+----+-----+-----+
|emp1| John|400.5|
|emp2| null|400.5|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
Whatever the computed mean is will be filled in place of the NULLs; the same thing as a one-liner:
df.na.fill(df.select(mean(df["Sales"])).collect()[0][0],["Sales"]).show()
+----+-----+-----+
| Id| Name|Sales|
+----+-----+-----+
|emp1| John|400.5|
|emp2| null|400.5|
|emp3| null|345.0|
|emp4|Cindy|456.0|
+----+-----+-----+
---------------------------------------------------------------------
Spark Sql and Hql
[cloudera@quickstart ~]$ sudo find / -name 'hive-site.xml'
[cloudera@quickstart ~]$ sudo chmod -R 777 /usr/lib/spark/conf
[cloudera@quickstart ~]$ cp /etc/hive/conf.dist/hive-site.xml /usr/lib/spark/conf
_____________________________________
from hive-site.xml --> hive.metastore.warehouse.dir
from Spark 2.0.0 onwards the above option is deprecated;
use the following option instead:
------> spark.sql.warehouse.dir
_____________________________________________
[ tested in Cloudera 5.8, Spark version 1.6.0 ]
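A minimal PySpark sketch of setting that option when building a session (applies to Spark 2.x only; the warehouse path below is a placeholder):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("warehouse-demo")
         .config("spark.sql.warehouse.dir", "/user/hive/warehouse")  # placeholder path
         .enableHiveSupport()          # lets Spark SQL talk to the Hive metastore
         .getOrCreate())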
[cloudera@quickstart ~]$ ls /usr/lib/hue/apps/beeswax/data/sample_07.csv
[cloudera@quickstart ~]$ head -n 2 /usr/lib/hue/apps/beeswax/data/sample_07.csv
_____________________
val hq = new org.apache.spark.sql.hive.HiveContext(sc)
hq.sql("create database sparkdb")
hq.sql("CREATE TABLE sample_07 (code string, description string, total_emp int, salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TextFile")
[cloudera@quickstart ~]$ hadoop fs -mkdir sparks
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal /usr/lib/hue/apps/beeswax/data/sample_07.csv sparks
[cloudera@quickstart ~]$ hadoop fs -ls sparks
hq.sql("LOAD DATA INPATH '/user/cloudera/sparks/sample_07.csv' OVERWRITE INTO TABLE sample_07")
val df = hq.sql("SELECT * from sample_07")
__________________________________________
scala> df.filter(df("salary") > 150000).show()
+-------+--------------------+---------+------+
|   code|         description|total_emp|salary|
+-------+--------------------+---------+------+
|11-1011|    Chief executives|   299160|151370|
|29-1022|Oral and maxillof...|     5040|178440|
|29-1023|       Orthodontists|     5350|185340|
|29-1024|     Prosthodontists|      380|169360|
|29-1061|   Anesthesiologists|    31030|192780|
|29-1062|Family and genera...|   113250|153640|
|29-1063| Internists, general|    46260|167270|
|29-1064|Obstetricians and...|    21340|183600|
|29-1067|            Surgeons|    50260|191410|
|29-1069|Physicians and su...|   237400|155150|
+-------+--------------------+---------+------+
____________________________________________
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
[cloudera@quickstart ~]$ gedit json1
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal json1 sparks
[cloudera@quickstart ~]$ hadoop fs -cat sparks/json1
{"name":"Ravi","age":23,"sex":"M"}
{"name":"Rani","age":24,"sex":"F"}
{"name":"Mani","sex":"M"}
{"name":"Vani","age":34}
{"name":"Veni","age":29,"sex":"F"}
[cloudera@quickstart ~]$
scala> val df = sqlContext.read.json("/user/cloudera/sparks/json1")
scala> df.show()
+----+----+----+
| age|name| sex|
+----+----+----+
| 23|Ravi| M|
| 24|Rani| F|
|null|Mani| M|
| 34|Vani|null|
| 29|Veni| F|
+----+----+----+
scala> df.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
|-- sex: string (nullable = true)
scala> df.select("name").show()
+----+
|name|
+----+
|Ravi|
|Rani|
|Mani|
|Vani|
|Veni|
+----+
scala> df.select("age").show()
+----+
| age|
+----+
| 23|
| 24|
|null|
| 34|
| 29|
+----+
scala>
scala> df.select("name","age").show()
+----+----+
|name| age|
+----+----+
|Ravi| 23|
|Rani| 24|
|Mani|null|
|Vani| 34|
|Veni| 29|
+----+----+
scala> df.select("name","sex").show()
+----+----+
|name| sex|
+----+----+
|Ravi| M|
|Rani| F|
|Mani| M|
|Vani|null|
|Veni| F|
+----+----+
scala>
scala> df.select(df("name"), df("age")+10).show()
+----+----------+
|name|(age + 10)|
+----+----------+
|Ravi| 33|
|Rani| 34|
|Mani| null|
|Vani| 44|
|Veni| 39|
+----+----------+
scala> df.filter(df("age")<34).show()
+---+----+---+
|age|name|sex|
+---+----+---+
| 23|Ravi| M|
| 24|Rani| F|
| 29|Veni| F|
+---+----+---+
scala> df.filter(df("age")>=5 && df("age")<30).show()
+---+----+---+
|age|name|sex|
+---+----+---+
| 23|Ravi| M|
| 24|Rani| F|
| 29|Veni| F|
+---+----+---+
scala> df.groupBy("age").count().show()
+----+-----+
| age|count|
+----+-----+
| 34| 1|
|null| 1|
| 23| 1|
| 24| 1|
| 29| 1|
+----+-----+
scala> df.groupBy("sex").count().show()
+----+-----+
| sex|count|
+----+-----+
| F| 2|
| M| 2|
|null| 1|
+----+-----+
scala>
scala> df.registerTempTable("df")
scala> sqlContext.sql("select * from df").collect.foreach(println)
[23,Ravi,M]
[24,Rani,F]
[null,Mani,M]
[34,Vani,null]
[29,Veni,F]
scala> val mm = sqlContext.sql("select * from df")
mm: org.apache.spark.sql.DataFrame = [age: bigint, name: string, sex: string]
scala> mm.registerTempTable("mm")
scala> sqlContext.sql("select * from mm").collect.foreach(println)
[23,Ravi,M]
[24,Rani,F]
[null,Mani,M]
[34,Vani,null]
[29,Veni,F]
scala> mm.show()
+----+----+----+
| age|name| sex|
+----+----+----+
| 23|Ravi| M|
| 24|Rani| F|
|null|Mani| M|
| 34|Vani|null|
| 29|Veni| F|
+----+----+----+
scala> val x = mm
x: org.apache.spark.sql.DataFrame = [age:
bigint, name: string, sex: string]
scala>
scala> val aggr1 = df.groupBy("sex").agg( max("age"), min("age"))
aggr1: org.apache.spark.sql.DataFrame = [sex: string, max(age): bigint, min(age): bigint]
scala> aggr1.collect.foreach(println)
[F,29,24]
[M,23,23]
[null,34,34]
scala> aggr1.show()
+----+--------+--------+
| sex|max(age)|min(age)|
+----+--------+--------+
| F| 29| 24|
| M| 23| 23|
|null| 34| 34|
+----+--------+--------+
scala>
____________________
ex:
[cloudera@quickstart ~]$ cat > emp1
101,aaa,30000,m,11
102,bbbb,40000,f,12
103,cc,60000,m,12
104,dd,80000,f,11
105,cccc,90000,m,12
[cloudera@quickstart ~]$ cat > emp2
201,dk,90000,m,11
202,mm,100000,f,12
203,mmmx,80000,m,12
204,vbvb,70000,f,11
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal emp1 sparklab
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal emp2 sparklab
[cloudera@quickstart ~]$
scala> val emp1 = sc.textFile("/user/cloudera/sparklab/emp1")
scala> val emp2 = sc.textFile("/user/cloudera/sparklab/emp2")
scala> case class Emp(id:Int, name:String,
| sal:Int, sex:String, dno:Int)
scala> def toEmp(x:String) = {
| val w = x.split(",")
| Emp(w(0).toInt,
| w(1), w(2).toInt,
| w(3), w(4).toInt)
| }
toEmp: (x: String)Emp
scala> val e1 = emp1.map(x => toEmp(x))
e1: org.apache.spark.rdd.RDD[Emp] =
MapPartitionsRDD[43] at map at <console>:37
scala> val e2 = emp2.map(x => toEmp(x))
e2: org.apache.spark.rdd.RDD[Emp] =
MapPartitionsRDD[44] at map at <console>:37
scala>
scala> val df1 = e1.toDF
df1: org.apache.spark.sql.DataFrame = [id:
int, name: string, sal: int, sex: string,
dno: int]
scala> val df2 = e2.toDF
df2: org.apache.spark.sql.DataFrame = [id:
int, name: string, sal: int, sex: string,
dno: int]
scala>
scala> df1.registerTempTable("df1")
scala> df2.registerTempTable("df2")
scala> val df = sqlContext.sql("select * from df1 union all select * from df2")
scala> df.registerTempTable("df")
scala> val res = sqlContext.sql("select sex, sum(sal) as tot, count(*) as cnt from df group by sex")
scala>
scala> val wrres = res.map(x => x(0)+","+x(1)+","+x(2))
scala> wrres.saveAsTextFile("/user/cloudera/mytemp")
scala> hq.sql("create database park")
scala> hq.sql("use park")
scala> hq.sql("create table urres(sex string, tot int, cnt int) row format delimited fields terminated by ',' ")
scala> hq.sql("load data inpath '/user/cloudera/mytemp/part-00000' into table urres")
scala> val hiveres = hq.sql("select * from urres")
scala> hiveres.show()
_____________________________________
spark lab1 : Spark Aggregations : map, flatMap, sc.textFile(), reduceByKey(), groupByKey()
spark Lab1:
___________
[cloudera@quickstart ~]$ cat > comment
i love hadoop
i love spark
i love hadoop and spark
[cloudera@quickstart ~]$ hadoop fs -mkdir spark
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal comment spark
Word Count using spark:
scala> val r1 = sc.textFile("hdfs://quickstart.cloudera/user/cloudera/spark/comment")
scala> r1.collect.foreach(println)
scala> val r2 = r1.map(x => x.split(" "))
scala> val r3 = r2.flatMap(x => x)
Instead of writing r2 and r3 separately, flatMap can be applied directly:
scala> val words = r1.flatMap(x =>
| x.split(" ") )
scala> val wpair = words.map( x =>
| (x,1) )
scala> val wc = wpair.reduceByKey((x,y) => x+y)
scala> wc.collect
scala> val wcres = wc.map( x =>
| x._1+","+x._2 )
scala> wcres.saveAsTextFile("hdfs://quickstart.cloudera/user/cloudera/spark/results2")
[cloudera@quickstart ~]$ cat emp
101,aa,20000,m,11
102,bb,30000,f,12
103,cc,40000,m,11
104,ddd,50000,f,12
105,ee,60000,m,12
106,dd,90000,f,11
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal emp spark
[cloudera@quickstart ~]$
scala> val e1 = sc.textFile("/user/cloudera/spark/emp")
scala> val e2 = e1.map(_.split(","))
scala> val epair = e2.map( x=>
| (x(3), x(2).toInt ) )
scala> val res = epair.reduceByKey(_+_)
res: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[18] at reduceByKey at <console>:24
scala> res.collect.foreach(println)
(f,170000)
(m,120000)
scala> val resmax = epair.reduceByKey(
| (x,y) => Math.max(x,y))
scala> val resmin = epair.reduceByKey(Math.min(_,_))
scala> resmax.collect.foreach(println)
(f,90000)
(m,60000)
scala> resmin.collect.foreach(println)
(f,30000)
(m,20000)
scala> val grpd = epair.groupByKey()
scala> val resall = grpd.map(x =>
| (x._1, x._2.sum,x._2.size,x._2.max,x._2.min,x._2.sum/x._2.size) )
scala> resall.collect.foreach(println)
------------------------------------------------------
Spark Lab2
scala> val emp = sc.textFile("hdfs://quickstart.cloudera/user/cloudera/spark/emp")
scala> val earray = emp.map(x=> x.split(","))
earray: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[2] at map at <console>:14
scala> earray.collect
Array[Array[String]] = Array(Array(101, aa, 20000, m, 11), Array(102, bb, 30000, f, 12), Array(103, cc, 40000, m, 11), Array(104, ddd, 50000, f, 12), Array(105, ee, 60000, m, 12), Array(106, dd, 90000, f, 11))
scala> val epair = earray.map( x =>
| (x(4), x(2).toInt))
scala> val ressum = epair.reduceByKey(_+_)
scala> val resmax = epair.reduceByKey(Math.max(_,_))
scala> val resmin = epair.reduceByKey(Math.min(_,_))
scala> ressum.collect.foreach(println)
(12,140000)
(11,150000)
scala> val grpByDno = epair.groupByKey()
scala> grpByDno.collect
Array[(String, Iterable[Int])] = Array((12,CompactBuffer(30000, 50000, 60000)), (11,CompactBuffer(20000, 40000, 90000)))
scala> val resall = grpByDno.map(x =>
     |   x._1+"\t"+
     |   x._2.sum+"\t"+
     |   x._2.size+"\t"+
     |   x._2.sum/x._2.size+"\t"+
     |   x._2.max+"\t"+
     |   x._2.min )
scala> resall.collect.foreach(println)
12 140000 3 46666 60000 30000
11 150000 3 50000 90000 20000
scala> resall.saveAsTextFile("/user/cloudera/spark/today1")
[cloudera@quickstart ~]$ hadoop fs -cat spark/today1/part-00000
12 140000 3 46666 60000 30000
11 150000 3 50000 90000 20000
[cloudera@quickstart ~]$
____________________________________
aggregations by multiple grouping.
ex: equivalent sql/hql query:
select dno, sex , sum(sal) from emp
group by dno, sex;
---
scala> val DnoSexSalPair = earray.map(
| x => ((x(4),x(3)),x(2).toInt) )
scala> DnoSexSalPair.collect.foreach(println)
((11,m),20000)
((12,f),30000)
((11,m),40000)
((12,f),50000)
((12,m),60000)
((11,f),90000)
scala> val rsum = DnoSexSalPair.reduceByKey(_+_)
scala> rsum.collect.foreach(println)
((11,f),90000)
((12,f),80000)
((12,m),60000)
((11,m),60000)
scala> val rs = rsum.map( x =>
x._1._1+"\t"+x._1._2+"\t"+
x._2 )
scala> rs.collect.foreach(println)
11 f 90000
12 f 80000
12 m 60000
11 m 60000
_______________________________________
grouping by multiple columns, and multiple aggregations.
Assignment:
select dno, sex, sum(sal), max(sal) ,
min(sal), count(*), avg(sal)
from emp group by dno, sex;
val grpDnoSex = DnoSexSalPair.groupByKey()
val r = grpDnoSex.map( x =>
x._1._1+"\t"+
x._1._2+"\t"+
x._2.sum+"\t"+
x._2.max+"\t"+
x._2.min+"\t"+
x._2.size+"\t"+
x._2.sum/x._2.size )
r.collect.foreach(println)
11 f 90000 90000 90000 1 90000
12 f 80000 50000 30000 2 40000
12 m 60000 60000 60000 1 60000
11 m 60000 40000 20000 2 30000
______________________________________
spark sql with json and xml processing
-----------
Spark Sql
---------------
Spark SQL is a library to process Spark data objects using SQL select statements.
Spark SQL follows MySQL-style SQL syntax.
==============================================
Spark SQL provides two types of contexts:
i) SQLContext
ii) HiveContext
import org.apache.spark.sql.SQLContext
val sqlCon = new SQLContext(sc)
Using SQLContext, we can process Spark objects using select statements.
Using HiveContext, we can integrate Hive with Spark.
Hive is a data warehouse environment in the Hadoop framework,
so data is stored and managed in Hive tables.
Using HiveContext we can access the entire Hive environment (Hive tables) from Spark.
Difference between an HQL statement run from Hive and the same HQL run from Spark:
--> if HQL is executed from the Hive environment,
    the statement is converted into a MapReduce job.
--> if the same Hive is integrated with Spark
    and the HQL is submitted from Spark,
    it uses the DAG and in-memory computing models,
    which are much faster than MapReduce.
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
-----------------------------
Example of SQLContext:
val sqc = new SQLContext(sc)
file name --> file1
sample ---> 100,200,300
300,400,400
:
:
step1)
create case class for the data.
case class Rec(a:Int, b:Int, c:Int)
step2) create a function ,
to convert raw line into case object.
[function to provide schema ]
def makeRec(line:String)={
val w = line.split(",")
val a = w(0).toInt
val b = w(1).toInt
val c = w(2).toInt
val r = Rec(a, b,c)
r
}
--------
step3) load data.
val data = sc.textFile("/user/cloudera/sparklab/file1")
100,200,300
2000,340,456
:
:
step4) transform each record into case Object
val recs = data.map(x => makeRec(x))
step5) convert the RDD into a data frame.
val df = recs.toDF
step6) create table instance for the dataframe.
df.registerTempTable("samp")
step7) apply select statement of sql on temp table.
val r1 = sqc.sql("select a+b+c as tot from samp")
r1
------
tot
----
600
900
r1.registerTempTable("samp1")
val r2 = sqc.sql("select sum(tot) as gtot from samp1")
Once a "select" statement is applied on a temp table, the returned object will be a dataframe.
To apply SQL on the processed results,
we again need to register the dataframe as a temp table.
r1.registerTempTable("samp2")
val r2 = sqc.sql("select * from samp2 where tot>=200")
-----------------------------------
sales
--------------------
:
12/27/2016,10000,3,10
:
:
-------------------------
Steps involved in Spark SQL [sqlContext]
----------------------------
monthly sales report...
schema ---> date, price, qnt, discount
step1)
case class Sales(mon : Int, price:Int,
qnt :Int, disc: Int)
step2)
def toSales(line: String) = {
val w = line.split(",")
val mon = w(0).split("/")(0).toInt
val p = w(1).toInt
val q = w(2).toInt
val d = w(3).toInt
val srec = Sales(mon,p,q,d)
srec
}
step3)
val data = sc.textFile("/user/cloudera/mydata/sales.txt")
step4)
val strans = data.map(x => toSales(x))
step5)
val sdf = strans.toDF
sdf.show
step6)
sdf.registerTempTable("SalesTrans")
step7) // play with select
---> mon, price, qnt, disc
val res1 = sqlContext.sql("select mon ,
sum(
(price - price*disc/100)*qnt
) as tsales from SalesTrans
group by mon")
res1.show
res1.printSchema
-----------------------------------------
val res2 = res1
res1.registerTempTable("tab1")
res2.registerTempTable("tab2")
val res3 = sqlContext.sql("select l.mon as m1,
r.mon as m2, l.tsales as t1,
r.tsales as t2
from tab1 l join tab2 r
where (l.mon-r.mon)==1")
// 11 rows.
res3.registerTempTable("tab3")
------------------------------
val res4 = sqlContext.sql("select
m1, m2, t1, t2, ((t2-t1)*100)/t1 as sgrowth
from tab3")
res4.show()
------------------------------------------
json1.json
--------------------------
{"name":"Ravi","age":25,"city":"Hyd"}
{"name":"Rani","sex":"F","city":"Del"}
:
:
---------------------------------------
val df = sqlContext.read.json("/user/cloudera/mydata/json1.json")
df.show
------------------------
name age sex city
----------------------------------------
ravi 25 null hyd
rani null F del
:
:
--------------------------------------
json2.json
------------------
{"name":"Ravi","age":25,
"wife":{"name":"Rani","age":23},"city":"Hyd"}}
:
:
val df2 = sqlContext.read.json("/../json2.json")
df2
-----------------------
name age wife city
Ravi 25 {"name":"rani","age":23} HYd
:
---------------------
df2.registerTempTable("Info")
val df3 = sqlContext.sql("select name,
wife.name as wname,
age, wife.age as wage,
abs(age-wife.age) as diff,
city from Info")
----------------------------------------
xml data processing with spark sql.
--- Spark SQL does not have direct libraries for xml processing.
Two ways:
i) 3rd-party API [ ex: databricks ]
ii) using Hive integration.
The 2nd is best.
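For completeness, a sketch of option (i) using the third-party Databricks spark-xml package, shown in PySpark (assumes the package is on the classpath, e.g. started with --packages, that each record uses a <rec> row tag as in the samples below, and that the HDFS path is hypothetical):
df = (sqlContext.read
        .format("com.databricks.spark.xml")   # third-party data source, not part of core Spark
        .option("rowTag", "rec")              # each <rec>...</rec> element becomes one row
        .load("/user/cloudera/mydata/xml1.xml"))
df.show()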
How to integrate Hive with spark .
---Using HiveContext.
step1)
copy hive-site.xml file into,
/usr/lib/spark/conf directory.
What if hive-site.xml is not copied into the conf directory of Spark?
--- Spark cannot find Hive's metastore location [derby/mysql/oracle ...];
this info is available in hive-site.xml.
step2)
create hive Context object
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
step3) access Hive Environment from spark
hc.sql("create database mydb")
hc.sql("use mydb")
hc.sql("create table samp(line string)")
hc.sql("load data local inpath 'file1'
into table samp")
val df = hc.sql("select * from samp")
-------------------------------------
xml1.xml
----------------------------------
<rec><name>Ravi</name><age>25</age></rec>
<rec><name>Rani</name><sex>F</sex></rec>
:
:
------------------------------------------
hc.sql("create table raw(line string)")
hc.sql("load data local inpath 'xml1.xml'
into table raw")
hc.sql("create table info(name string,
age int, sex string)")
hc.sql("insert overwrite table info
select xpath_string(line,'rec/name'),
xpath_int(line, 'rec/age'),
xpath_string(line, 'rec/sex')
from raw")
----------------------------------------
xml2.xml
------------
<rec><name><fname>Ravi</fname><lname>kumar</lname><age>24</age><contact><email><personal>ravi@gmail.com</personal><official>ravi@ventech.com</official></email><phone><mobile>12345</mobile><office>123900</office><residence>127845</residence></phone></contact><city>Hyd</city></rec>
hc.sql("create table xraw(line string)")
hc.sql("load data local inpath 'xml2.xml'
into table xraw")
hc.sql("create table xinfo(fname string ,
lname string, age int,
personal_email string,
official_email string,
mobile String,
office_phone string ,
residence_phone string,
city string)")
hc.sql("insert overwrite table xinfo
select
xpath_string(line,'rec/name/fname'),
xpath_string(line,'rec/name/lname'),
xpath_int(line,'rec/age'),
xpath_string(line,'rec/contact/email/personal'),
xpath_string(line,'rec/contact/email/official'),
xpath_string(line,'rec/contact/phone/mobile'),
xpath_string(line,'rec/contact/phone/office'),
xpath_string(line,'rec/contact/phone/residence'),
xpath_string(line,'rec/city')
from xraw")
-------------------------
xml3.xml
----------------
<tr><cid>101</cid><pr>200</pr><pr>300</pr><pr>300</pr></tr>
<tr><cid>102</cid><pr>400</pr><pr>800</pr></tr>
<tr><cid>101</cid><pr>1000</pr></tr>
--------------------------------
hc.sql("create table sraw")
hc.sql("load data local inpath 'xml3.xml'
into table sraw")
hc.sql("create table raw2(cid int, pr array<String>)")
hc.sql("insert overwrite table raw2
select xpath_int(line, 'tr/cid'),
xpath(line,'tr/pr/text()')
from sraw")
hc.sql("select * from raw2").show
-------------------------------
cid   pr
101   [200,300,300]
102   [400,800]
101   [1000]
hc.sql("select explode(pr) as price from raw2").show
200
300
300
400
800
1000
hc.sql("select cid, explode(pr) as price from raw2").show
----> the above is invalid: explode() cannot be mixed with other columns in a plain select; use a lateral view instead.
hc.sql("create table raw3(cid int, pr int)")
hc.sql("Insert overwrite table raw3
select name, mypr from raw2
lateral view explode(pr) p as mypr")
hc.sql("select * from raw3").show
cid pr
101 200
101 300
101 300
102 400
102 800
101 1000
hc.sql("create table summary(cid int, totbill long)")
hc.sql("insert overwrite table summary
select cid , sum(pr) from raw3
group by cid")
--------------------
Spark Grouping Aggregations
demo grouping aggregations on structured data.
----------------------------------------------
[cloudera@quickstart ~]$ ls emp
emp
[cloudera@quickstart ~]$ cat emp
101,aaaa,40000,m,11
102,bbbbbb,50000,f,12
103,cccc,50000,m,12
104,dd,90000,f,13
105,ee,10000,m,12
106,dkd,40000,m,12
107,sdkfj,80000,f,13
108,iiii,50000,m,11
[cloudera@quickstart ~]$ hadoop fs -ls spLab
ls: `spLab': No such file or directory
[cloudera@quickstart ~]$ hadoop fs -mkdir spLab
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal emp spLab
scala> val data = sc.textFile("/user/cloudera/spLab/emp")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/spLab/emp MapPartitionsRDD[1] at textFile at <console>:27
scala> data.collect.foreach(println)
101,aaaa,40000,m,11
102,bbbbbb,50000,f,12
103,cccc,50000,m,12
104,dd,90000,f,13
105,ee,10000,m,12
106,dkd,40000,m,12
107,sdkfj,80000,f,13
108,iiii,50000,m,11
scala>
scala> val arr = data.map(_.split(","))
arr: org.apache.spark.rdd.RDD[Array[String]] =
MapPartitionsRDD[2] at map at <console>:29
scala> arr.collect
res1: Array[Array[String]] = Array(Array(101,
aaaa, 40000, m, 11), Array(102, bbbbbb, 50000,
f, 12), Array(103, cccc, 50000, m, 12), Array
(104, dd, 90000, f, 13), Array(105, ee, 10000,
m, 12), Array(106, dkd, 40000, m, 12), Array
(107, sdkfj, 80000, f, 13), Array(108, iiii,
50000, m, 11))
scala>
scala> val pair1 = arr.map(x => (x(3), x(2).toInt) )
pair1: org.apache.spark.rdd.RDD[(String, Int)] =
MapPartitionsRDD[3] at map at <console>:31
scala> // or
scala> val pair1 = arr.map{ x =>
| val sex = x(3)
| val sal = x(2).toInt
| (sex, sal)
| }
pair1: org.apache.spark.rdd.RDD[(String, Int)] =
MapPartitionsRDD[4] at map at <console>:31
scala>
scala> pair1.collect.foreach(println)
(m,40000)
(f,50000)
(m,50000)
(f,90000)
(m,10000)
(m,40000)
(f,80000)
(m,50000)
scala>
scala> // select sex, sum(sal) from emp group by sex
scala> val rsum = pair1.reduceByKey((a,b) => a+b)
rsum: org.apache.spark.rdd.RDD[(String, Int)] =
ShuffledRDD[5] at reduceByKey at <console>:33
scala> // or
scala> val rsum = pair1.reduceByKey(_+_)
rsum: org.apache.spark.rdd.RDD[(String, Int)] =
ShuffledRDD[6] at reduceByKey at <console>:33
scala> rsum.collect
res3: Array[(String, Int)] = Array((f,220000),
(m,190000))
scala>
// select sex, max(sal) from emp group by sex;
scala> val rmax = pair1.reduceByKey(Math.max(_,_))
rmax: org.apache.spark.rdd.RDD[(String, Int)] =
ShuffledRDD[7] at reduceByKey at <console>:33
scala> rmax.collect
res4: Array[(String, Int)] = Array((f,90000),
(m,50000))
scala>
// select sex, min(sal) from emp group by sex;
scala> val rmin = pair1.reduceByKey(Math.min(_,_))
rmin: org.apache.spark.rdd.RDD[(String, Int)] =
ShuffledRDD[8] at reduceByKey at <console>:33
scala> rmin.collect
res5: Array[(String, Int)] = Array((f,50000),
(m,10000))
scala>
// select sex, count(*) from emp group by sex
scala> pair1.collect
res6: Array[(String, Int)] = Array((m,40000),
(f,50000), (m,50000), (f,90000), (m,10000),
(m,40000), (f,80000), (m,50000))
scala> pair1.countByKey
res7: scala.collection.Map[String,Long] = Map(f
-> 3, m -> 5)
scala> val pair2 = pair1.map(x => (x._1 , 1) )
pair2: org.apache.spark.rdd.RDD[(String, Int)] =
MapPartitionsRDD[11] at map at <console>:33
scala> pair2.collect
res8: Array[(String, Int)] = Array((m,1), (f,1),
(m,1), (f,1), (m,1), (m,1), (f,1), (m,1))
scala> val rcnt = pair2.reduceByKey(_+_)
rcnt: org.apache.spark.rdd.RDD[(String, Int)] =
ShuffledRDD[12] at reduceByKey at <console>:35
scala> rcnt.collect
res9: Array[(String, Int)] = Array((f,3), (m,5))
scala>
// select sex, avg(sal) from emp group by sex;
scala> rsum.collect.foreach(println)
(f,220000)
(m,190000)
scala> rcnt.collect.foreach(println)
(f,3)
(m,5)
scala> val j = rsum.join(rcnt)
j: org.apache.spark.rdd.RDD[(String, (Int,
Int))] = MapPartitionsRDD[15] at join at
<console>:39
scala> j.collect
res12: Array[(String, (Int, Int))] = Array((f,
(220000,3)), (m,(190000,5)))
scala>
scala> j.collect
res13: Array[(String, (Int, Int))] = Array((f,
(220000,3)), (m,(190000,5)))
scala> val ravg = j.map{ x =>
| val sex = x._1
| val v = x._2
| val tot = v._1
| val cnt = v._2
| val avg = tot/cnt
| (sex, avg.toInt)
| }
ravg: org.apache.spark.rdd.RDD[(String, Int)] =
MapPartitionsRDD[17] at map at <console>:41
scala> ravg.collect
res15: Array[(String, Int)] = Array((f,73333),
(m,38000))
scala>
// select dno, range(sal) from emp group by dno;
--> range is the difference between max and min.
scala> val pair3 = arr.map(x => ( x(4), x(2).toInt ) )
pair3: org.apache.spark.rdd.RDD[(String, Int)] =
MapPartitionsRDD[18] at map at <console>:31
scala> pair3.collect.foreach(println)
(11,40000)
(12,50000)
(12,50000)
(13,90000)
(12,10000)
(12,40000)
(13,80000)
(11,50000)
scala>
scala> val dmax = pair3.reduceByKey(Math.max(_,_))
dmax: org.apache.spark.rdd.RDD[(String, Int)] =
ShuffledRDD[19] at reduceByKey at <console>:33
scala> val dmin = pair3.reduceByKey(Math.min(_,_))
dmin: org.apache.spark.rdd.RDD[(String, Int)] =
ShuffledRDD[20] at reduceByKey at <console>:33
scala> val dj = dmax.join(dmin)
dj: org.apache.spark.rdd.RDD[(String, (Int,
Int))] = MapPartitionsRDD[23] at join at
<console>:37
scala> val drange = dj.map{ x =>
| val dno = x._1
| val max = x._2._1
| val min = x._2._2
| val r = max-min
| (dno, r)
| }
drange: org.apache.spark.rdd.RDD[(String, Int)]
= MapPartitionsRDD[25] at map at <console>:39
scala> drange.collect.foreach(println)
(12,40000)
(13,10000)
(11,10000)
scala>
-------------------------------------
scala> // multiple aggregations.
scala> pair1.collect
res18: Array[(String, Int)] = Array((m,40000),
(f,50000), (m,50000), (f,90000), (m,10000),
(m,40000), (f,80000), (m,50000))
scala> val grp = pair1.groupByKey()
grp: org.apache.spark.rdd.RDD[(String, Iterable
[Int])] = ShuffledRDD[26] at groupByKey at
<console>:33
scala> grp.collect
res19: Array[(String, Iterable[Int])] = Array
((f,CompactBuffer(50000, 90000, 80000)),
(m,CompactBuffer(40000, 50000, 10000, 40000,
50000)))
scala> val r1 = grp.map(x => (x._1 , x._2.sum )
)
r1: org.apache.spark.rdd.RDD[(String, Int)] =
MapPartitionsRDD[27] at map at <console>:35
scala> r1.collect.foreach(println)
(f,220000)
(m,190000)
scala>
// select sex, sum(sal), count(*), avg(sal), max(sal), min(sal),
//   max(sal)-min(sal) as range from emp group by sex;
scala> val rall = grp.map{ x =>
| val sex = x._1
| val cb = x._2
| val tot = cb.sum
| val cnt = cb.size
| val avg = (tot/cnt).toInt
| val max = cb.max
| val min = cb.min
| val r = max-min
| (sex,tot,cnt,avg,max,min,r)
| }
rall: org.apache.spark.rdd.RDD[(String, Int,
Int, Int, Int, Int, Int)] = MapPartitionsRDD[28]
at map at <console>:35
scala> rall.collect.foreach(println)
(f,220000,3,73333,90000,50000,40000)
(m,190000,5,38000,50000,10000,40000)
--------------------------------------------------
Spark : Performing grouping Aggregations based on Multiple Keys and saving results
// performing aggregations grouping by multiple columns
sql:
select dno, sex, sum(sal) from emp group by dno, sex;
scala> val data = sc.textFile("/user/cloudera/spLab/emp")
data: org.apache.spark.rdd.RDD[String] =
/user/cloudera/spLab/emp MapPartitionsRDD[1] at
textFile at <console>:27
scala> data.collect.foreach(println)
101,aaaa,40000,m,11
102,bbbbbb,50000,f,12
103,cccc,50000,m,12
104,dd,90000,f,13
105,ee,10000,m,12
106,dkd,40000,m,12
107,sdkfj,80000,f,13
108,iiii,50000,m,11
scala> val arr = data.map(_.split(","))
arr: org.apache.spark.rdd.RDD[Array[String]] =
MapPartitionsRDD[2] at map at <console>:29
scala> arr.collect
res1: Array[Array[String]] = Array(Array(101,
aaaa, 40000, m, 11), Array(102, bbbbbb, 50000,
f, 12), Array(103, cccc, 50000, m, 12), Array
(104, dd, 90000, f, 13), Array(105, ee, 10000,
m, 12), Array(106, dkd, 40000, m, 12), Array
(107, sdkfj, 80000, f, 13), Array(108, iiii,
50000, m, 11))
scala>
scala> val pair = arr.map(x => ( (x(4),x(3)) , x(2).toInt) )
pair: org.apache.spark.rdd.RDD[((String,
String), Int)] = MapPartitionsRDD[3] at map at
<console>:31
scala> pair.collect.foreach(println)
((11,m),40000)
((12,f),50000)
((12,m),50000)
((13,f),90000)
((12,m),10000)
((12,m),40000)
((13,f),80000)
((11,m),50000)
scala>
//or
val pair = data.map{ x =>
val w = x.split(",")
val dno = w(4)
val sex = w(3)
val sal = w(2).toInt
val mykey = (dno,sex)
val p = (mykey , sal)
p
}
scala> val res = pair.reduceByKey(_+_)
scala> res.collect.foreach(println)
((12,f),50000)
((13,f),170000)
((12,m),100000)
((11,m),90000)
scala> val r = res.map(x => (x._1._1,x._1._2,x._2) )
scala> r.collect.foreach(println)
(12,f,50000)
(13,f,170000)
(12,m,100000)
(11,m,90000)
-------------------------------------
Spark reduceByKey() allows only a single key for grouping.
When you want grouping by multiple columns, make the multiple
columns into a tuple and keep that tuple as the key of the pair,
as in the PySpark sketch after this note.
---------------------------------------
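The same idea in PySpark, as a minimal sketch against the emp file used above: the composite key is just a tuple.
emp = sc.textFile("/user/cloudera/spLab/emp")
pair = emp.map(lambda line: line.split(",")) \
          .map(lambda w: ((w[4], w[3]), int(w[2])))    # key = (dno, sex), value = sal
totals = pair.reduceByKey(lambda a, b: a + b)          # one sum per (dno, sex) group
print(totals.collect())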
sql:--> multi grouping and multi aggregations.
select dno, sex, sum(sal), count(*),
avg(sal) , max(sal), min(sal) from emp
group by dno, sex;
scala> val grp = pair.groupByKey()
grp: org.apache.spark.rdd.RDD[((String, String),
Iterable[Int])] = ShuffledRDD[7] at groupByKey
at <console>:31
scala> grp.collect.foreach(println)
((12,f),CompactBuffer(50000))
((13,f),CompactBuffer(90000, 80000))
((12,m),CompactBuffer(50000, 10000, 40000))
((11,m),CompactBuffer(40000, 50000))
scala> val agr = grp.map{ x =>
val dno = x._1._1
val sex = x._1._2
val cb = x._2
val tot = cb.sum
val cnt = cb.size
val avg = (tot/cnt).toInt
val max = cb.max
val min = cb.min
val r = (dno,sex,tot,cnt,avg,max,min)
r
}
agr: org.apache.spark.rdd.RDD[(String, String,
Int, Int, Int, Int, Int)] = MapPartitionsRDD[8]
at map at <console>:37
scala>
scala> agr.collect.foreach(println)
(12,f,50000,1,50000,50000,50000)
(13,f,170000,2,85000,90000,80000)
(12,m,100000,3,33333,50000,10000)
(11,m,90000,2,45000,50000,40000)
scala> // to save results into file.
agr.saveAsTextFile("/user/cloudera/spLab/res1")
[cloudera@quickstart ~]$ hadoop fs -ls spLab
Found 2 items
-rw-r--r-- 1 cloudera cloudera 158
2017-05-01 20:17 spLab/emp
drwxr-xr-x - cloudera cloudera 0
2017-05-02 20:29 spLab/res1
[cloudera@quickstart ~]$ hadoop fs -ls spLab/res1
Found 2 items
-rw-r--r-- 1 cloudera cloudera 0
2017-05-02 20:29 spLab/res1/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 134
2017-05-02 20:29 spLab/res1/part-00000
[cloudera@quickstart ~]$ hadoop fs -cat spLab/res1/part-00000
(12,f,50000,1,50000,50000,50000)
(13,f,170000,2,85000,90000,80000)
(12,m,100000,3,33333,50000,10000)
(11,m,90000,2,45000,50000,40000)
[cloudera@quickstart ~]$
// here, the output is written in tuple shape,
// which is not a valid format for hive, rdbms, or other systems.
// before saving results, the following transformation should be done.
val r1 = agr.map{ x=>
x._1+","+x._2+","+x._3+","+
x._4+","+x._5+","+x._6+","+x._7
}
scala> val r1 = agr.map{ x=>
| x._1+","+x._2+","+x._3+","+
| x._4+","+x._5+","+x._6+","+x._7
| }
r1: org.apache.spark.rdd.RDD[String] =
MapPartitionsRDD[5] at map at <console>:35
scala> r1.collect.foreach(println)
12,f,50000,1,50000,50000,50000
13,f,170000,2,85000,90000,80000
12,m,100000,3,33333,50000,10000
11,m,90000,2,45000,50000,40000
scala>
// or
scala> val r2 = agr.map{ x =>
| val dno = x._1
| val sex = x._2
| val tot = x._3
| val cnt = x._4
| val avg = x._5
| val max = x._6
| val min = x._7
| Array(dno,sex,tot.toString,cnt.toString,
| avg.toString, max.toString,
min.toString).mkString("\t")
| }
r2: org.apache.spark.rdd.RDD[String] =
MapPartitionsRDD[6] at map at <console>:35
scala> r2.collect.foreach(println)
12    f    50000     1    50000    50000    50000
13    f    170000    2    85000    90000    80000
12    m    100000    3    33333    50000    10000
11    m    90000     2    45000    50000    40000
scala> r2.saveAsTextFile("/user/cloudera/spLab/res2")
[cloudera@quickstart ~]$ hadoop fs -ls spLab/res2
Found 2 items
-rw-r--r--   1 cloudera cloudera     0 2017-05-02 20:44 spLab/res2/_SUCCESS
-rw-r--r--   1 cloudera cloudera   126 2017-05-02 20:44 spLab/res2/part-00000
[cloudera@quickstart ~]$ hadoop fs -cat spLab/res2/part-00000
12    f    50000     1    50000    50000    50000
13    f    170000    2    85000    90000    80000
12    m    100000    3    33333    50000    10000
11    m    90000     2    45000    50000    40000
[cloudera@quickstart ~]$
-- these results can be directly exported into an RDBMS.
[cloudera@quickstart ~]$ mysql -u root -pcloudera
mysql> create database spres;
Query OK, 1 row affected (0.03 sec)
mysql> use spres;
Database changed
mysql> create table summary(dno int, sex char(1),
    -> tot int, cnt int, avg int, max int, min int);
Query OK, 0 rows affected (0.10 sec)
mysql> select * from summary;
Empty set (0.00 sec)
mysql>
[cloudera@quickstart ~]$ sqoop export --connect jdbc:mysql://localhost/spres --username root --password cloudera --table summary --export-dir '/user/cloudera/spLab/res2/part-00000' --input-fields-terminated-by '\t'
To use the Spark-written results from Hive:
hive> create table info(dno int, sex string,
tot int, cnt int, avg int, max int,
min int)
row format delimited
fields terminated by '\t';
hive> load data inpath '/user/cloudera/spLab/res2/part-00000' into table info;
mysql> select * from summary;
+------+------+--------+------+-------+-------+-------+
| dno  | sex  | tot    | cnt  | avg   | max   | min   |
+------+------+--------+------+-------+-------+-------+
|   12 | m    | 100000 |    3 | 33333 | 50000 | 10000 |
|   11 | m    |  90000 |    2 | 45000 | 50000 | 40000 |
|   12 | f    |  50000 |    1 | 50000 | 50000 | 50000 |
|   13 | f    | 170000 |    2 | 85000 | 90000 | 80000 |
+------+------+--------+------+-------+-------+-------+
4 rows in set (0.03 sec)
Spark : Entire Column Aggregations
Entire Column Aggregations:
sql:
select sum(sal) from emp;
scala> val emp = sc.textFile("/user/cloudera/spLab/emp")
emp: org.apache.spark.rdd.RDD[String] = /user/cloudera/spLab/emp MapPartitionsRDD[1] at textFile at <console>:27
scala> val sals = emp.map(x => x.split(",")(2).toInt)
sals: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at map at <console>:29
scala> sals.collect
res0: Array[Int] = Array(40000, 50000, 50000, 90000, 10000, 40000, 80000, 50000)
scala> sals.sum
res1: Double = 410000.0
scala> sals.reduce((a,b) => a + b)
res2: Int = 410000
scala>
---> reduce is computed across the cluster and returns an Int here.
---> sum is also an action; it combines per-partition results and returns a Double to the driver.
sql:
select sum(sal), count(*), avg(sal),
max(sal) , min(sal) from emp;
scala> val tot = sals.sum
tot: Double = 410000.0
scala> val cnt = sals.count
cnt: Long = 8
scala> val avg = sals.mean
avg: Double = 51250.0
scala> val max = sals.max
max: Int = 90000
scala> val min = sals.min
min: Int = 10000
scala> val m = sals.reduce(Math.max(_,_))
m: Int = 90000
scala>
scala> val res = (tot,cnt,avg,max,min)
res: (Double, Long, Double, Int, Int) = (410000.0,8,51250.0,90000,10000)
scala> tot
res3: Double = 410000.0
scala>
----------------------------------------
Spark : Handling CSV files .. Removing Headers
scala> val l = List(10,20,30,40,50,56,67)
scala> val r = sc.parallelize(l)
scala> val r2 = r.collect.reverse.take(3)
r2: Array[Int] = Array(67, 56, 50)
scala> val r2 = sc.parallelize(r.collect.reverse.take(3))
r2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:31
-------------------------------
handling CSV files [ the first line is a header ]
[cloudera@quickstart ~]$ gedit prods
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal prods spLab
scala> val raw = sc.textFile("/user/cloudera/spLab/prods")
raw: org.apache.spark.rdd.RDD[String] = /user/cloudera/spLab/prods MapPartitionsRDD[11] at textFile at <console>:27
scala> raw.collect.foreach(println)
"pid","name","price"
p1,Tv,50000
p2,Lap,70000
p3,Ipod,8000
p4,Mobile,9000
scala> raw.count
res18: Long = 5
To eliminate the first element, slice is used.
scala> l
res19: List[Int] = List(10, 20, 30, 40, 50, 50, 56, 67)
scala> l.slice(2,5)
res20: List[Int] = List(30, 40, 50)
scala> l.slice(1,l.size)
res21: List[Int] = List(20, 30, 40, 50, 50, 56, 67)
way1:
scala> raw.collect
res29: Array[String] = Array("pid","name","price", p1,Tv,50000, p2,Lap,70000, p3,Ipod,8000, p4,Mobile,9000)
scala> val data = sc.parallelize(raw.collect.slice(1,raw.collect.size))
data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[12] at parallelize at <console>:29
scala> data.collect.foreach(println)
p1,Tv,50000
p2,Lap,70000
p3,Ipod,8000
p4,Mobile,9000
scala>
Here, slice is not available on an RDD,
so the data has to be collected into the local client first and then slice applied.
If the RDD volume is large, the client cannot collect it and the flow will fail.
Way2:
------
val data = raw.filter(x =>
      !x.contains("pid"))
data.persist
--advantage : no need to collect the data into the client [local].
--disadvantage : to eliminate 1 row, all rows are scanned.
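Not in the original notes: a third sketch that drops the header without collecting to the client and without checking every row's content, using mapPartitionsWithIndex to skip only the first record of the first partition.
val data2 = raw.mapPartitionsWithIndex { (idx, it) =>
  if (idx == 0) it.drop(1) else it      // drop the header record in partition 0 only
}
data2.collect.foreach(println)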
-----------------------------------------
Spark : Conditional Transformations
Conditional Transformations:
val trans = emp.map{ x =>
val w = x.split(",");
val sal = w(2).toInt
val grade = if(sal>=70000) "A" else
if(sal>=50000) "B" else
if(sal>=30000) "C" else "D"
val tax = sal*10/100
val dno = w(4).toInt
val dname = dno match{
case 11 => "Marketing"
case 12 => "Hr"
case 13 => "Finance"
case _ => "Other"
}
var sex = w(3).toLowerCase
sex = if(sex=="m") "Male" else "Female"
val res = Array(w(0), w(1),
w(2),tax.toString, grade, sex, dname).mkString(",")
res
}
trans.saveAsTextFile("/user/cloudera/spLab/results4")
-----------------------------------------
Spark : Union and Distinct
Unions in spark.
val l1 = List(10,20,30,40,50)
val l2 = List(100,200,300,400,500)
val r1 = sc.parallelize(l1)
val r2 = sc.parallelize(l2)
val r = r1.union(r2)
scala> r.collect.foreach(println)
10
20
30
40
50
100
200
300
400
500
scala> r.count
res1: Long = 10
Spark union allows duplicates.
Merging can also be done using the ++ operator.
scala> val r3 = r1 ++ r2
r3: org.apache.spark.rdd.RDD[Int] = UnionRDD[3] at $plus$plus at <console>:35
scala> r3.collect
res4: Array[Int] = Array(10, 20, 30, 40, 50, 100, 200, 300, 400, 500)
scala>
Merging more than two sets:
scala> val rx = sc.parallelize(List(15,25,35))
scala> val rr = r1.union(r2).union(rx)
rr: org.apache.spark.rdd.RDD[Int] = UnionRDD[6] at union at <console>:37
scala> rr.count
res5: Long = 13
scala> rr.collect
res6: Array[Int] = Array(10, 20, 30, 40, 50, 100, 200, 300, 400, 500, 15, 25, 35)
scala>// or
scala> val rr = r1 ++ r2 ++ rx
rr: org.apache.spark.rdd.RDD[Int] = UnionRDD[8] at $plus$plus at <console>:37
scala> rr.collect
res7: Array[Int] = Array(10, 20, 30, 40, 50, 100, 200, 300, 400, 500, 15, 25, 35)
scala>
--- eliminating duplicates:
scala> val x = List(10,20,30,40,10,10,20)
x: List[Int] = List(10, 20, 30, 40, 10, 10, 20)
scala> x.distinct
res8: List[Int] = List(10, 20, 30, 40)
scala> val y = sc.parallelize(x)
y: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:29
scala> r1.collect
res14: Array[Int] = Array(10, 20, 30, 40, 50)
scala> y.collect
res15: Array[Int] = Array(10, 20, 30, 40, 10, 10, 20)
scala> val nodupes = (r1 ++ y).distinct
nodupes: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[13] at distinct at <console>:35
scala> nodupes.collect
res16: Array[Int] = Array(30, 50, 40, 20, 10)
scala>
---------------------------------------
[cloudera@quickstart ~]$ hadoop fs -cat spLab/emp
101,aaaa,40000,m,11
102,bbbbbb,50000,f,12
103,cccc,50000,m,12
104,dd,90000,f,13
105,ee,10000,m,12
106,dkd,40000,m,12
107,sdkfj,80000,f,13
108,iiii,50000,m,11
[cloudera@quickstart ~]$ hadoop fs -cat spLab/emp2
201,Ravi,80000,m,12
202,Varun,90000,m,11
203,Varuna,100000,f,13
204,Vanila,50000,f,12
205,Mani,30000,m,14
206,Manisha,30000,f,14
[cloudera@quickstart ~]$
scala> val branch1 = sc.textFile("/user/cloudera/spLab/emp")
branch1: org.apache.spark.rdd.RDD[String] = /user/cloudera/spLab/emp MapPartitionsRDD[15] at textFile at <console>:27
scala> val branch2 = sc.textFile("/user/cloudera/spLab/emp2")
branch2: org.apache.spark.rdd.RDD[String] = /user/cloudera/spLab/emp2 MapPartitionsRDD[17] at textFile at <console>:27
scala> val emp = branch1.union(branch2)
emp: org.apache.spark.rdd.RDD[String] = UnionRDD[18] at union at <console>:31
scala> emp.collect.foreach(println)
101,aaaa,40000,m,11
102,bbbbbb,50000,f,12
103,cccc,50000,m,12
104,dd,90000,f,13
105,ee,10000,m,12
106,dkd,40000,m,12
107,sdkfj,80000,f,13
108,iiii,50000,m,11
201,Ravi,80000,m,12
202,Varun,90000,m,11
203,Varuna,100000,f,13
204,Vanila,50000,f,12
205,Mani,30000,m,14
206,Manisha,30000,f,14
--------------------------------
distinct:
eliminates duplicates
based on an entire-row match.
Limitation: it cannot eliminate duplicates based on only some column(s) matching.
The solution for this:
by iterating the CompactBuffer
[ we will see this later; a quick alternative sketch follows below ]
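Not from the original notes: a minimal sketch of column-based de-duplication, assuming we key on column 0 (empid) of the merged emp RDD and keep the first row seen per key.
val byKey = emp.map(line => (line.split(",")(0), line))   // (empid, whole row)
val dedup = byKey.reduceByKey((a, b) => a).values         // keep one row per empid
dedup.collect.foreach(println)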
Grouping aggregation on the merged set:
scala> val pair = emp.map{ x =>
| val w = x.split(",")
| val dno = w(4).toInt
| val sal = w(2).toInt
| (dno, sal)
| }
pair: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[19] at map at <console>:35
scala> val eres = pair.reduceByKey(_+_)
eres: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[20] at reduceByKey at <console>:37
scala> eres.collect.foreach(println)
(14,60000)
(12,280000)
(13,270000)
(11,180000)
scala>
-- in this output we don't have separate totals for branch1 and branch2.
Spark : CoGroup And Handling Empty Compact Buffers
Co Grouping using Spark:-
-------------------------
scala> branch1.collect.foreach(println)
101,aaaa,40000,m,11
102,bbbbbb,50000,f,12
103,cccc,50000,m,12
104,dd,90000,f,13
105,ee,10000,m,12
106,dkd,40000,m,12
107,sdkfj,80000,f,13
108,iiii,50000,m,11
scala> branch2.collect.foreach(println)
201,Ravi,80000,m,12
202,Varun,90000,m,11
203,Varuna,100000,f,13
204,Vanila,50000,f,12
205,Mani,30000,m,14
206,Manisha,30000,f,14
scala> def toDnoSalPair(line:String) = {
val w = line.split(",")
val dno = w(4).toInt
val dname = dno match{
case 11 => "Marketing"
case 12 => "Hr"
case 13 => "Finance"
case _ => "Other"
}
val sal = w(2).toInt
(dname, sal)
}
toDnoSalPair: (line: String)(String, Int)
scala> toDnoSalPair("101,aaaaa,60000,m,12")
res22: (String, Int) = (Hr,60000)
scala>
scala> val pair1 = branch1.map(x => toDnoSalPair(x))
pair1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[21] at map at <console>:33
scala> val pair2 = branch2.map(x => toDnoSalPair(x))
pair2: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[22] at map at <console>:33
scala> pair1.collect.foreach(println)
(Marketing,40000)
(Hr,50000)
(Hr,50000)
(Finance,90000)
(Hr,10000)
(Hr,40000)
(Finance,80000)
(Marketing,50000)
scala> pair2.collect.foreach(println)
(Hr,80000)
(Marketing,90000)
(Finance,100000)
(Hr,50000)
(Other,30000)
(Other,30000)
scala>
scala> val cg = pair1.cogroup(pair2)
cg: org.apache.spark.rdd.RDD[(String, (Iterable[Int], Iterable[Int]))] = MapPartitionsRDD[24] at cogroup at <console>:39
scala> cg.collect.foreach(println)
(Hr,(CompactBuffer(50000, 50000, 10000, 40000),CompactBuffer(80000, 50000)))
(Other,(CompactBuffer(),CompactBuffer(30000, 30000)))
(Marketing,(CompactBuffer(40000, 50000),CompactBuffer(90000)))
(Finance,(CompactBuffer(90000, 80000),CompactBuffer(100000)))
scala>
scala> val res = cg.map{ x =>
val dname = x._1
val cb1 = x._2._1
val cb2 = x._2._2
val tot1 = cb1.sum
val tot2 = cb2.sum
val tot = tot1+tot2
(dname,tot1,tot2,tot)
}
scala> res.collect.foreach(println)
(Hr,150000,130000,280000)
(Other,0,60000,60000)
(Marketing,90000,90000,180000)
(Finance,170000,100000,270000)
From the above: the sum of an empty CompactBuffer and
the size of an empty CompactBuffer are both zero.
But we do get problems with
sum/size (for avg) and with max, min.
val res = cg.map{ x =>
val dname = x._1
val cb1 = x._2._1
val cb2 = x._2._2
val max1 = cb1.max
val max2 = cb2.max
(dname,max1,max2)
}
-- res.collect cannot execute:
max on an empty CompactBuffer throws an exception.
-- we get the same for min.
val res = cg.map{ x =>
val dname = x._1
val cb1 = x._2._1
val cb2 = x._2._2
val tot1 = cb1.sum
val tot2 = cb2.sum
val cnt1 = cb1.size
val cnt2 = cb2.size
(dname, (tot1,cnt1), (tot2,cnt2))
}
-- no problem with sum and size on empty compact buffer.
val res = cg.map{ x =>
val dname = x._1
val cb1 = x._2._1
val cb2 = x._2._2
val tot1 = cb1.sum
val tot2 = cb2.sum
val cnt1 = cb1.size
val cnt2 = cb2.size
val avg1 = (tot1/cnt1).toInt
val avg2 = (tot2/cnt2).toInt
(dname, avg1, avg2)
}
res.collect will fail,
because for avg the denominator can be zero (division by zero).
Solution:
----------
val res = cg.map{ x =>
val dname = x._1
val cb1 = x._2._1
val cb2 = x._2._2
val tot1 = cb1.sum
val tot2 = cb2.sum
val cnt1 = cb1.size
val cnt2 = cb2.size
var max1 = 0
var min1 = 0
var avg1 = 0
if (cnt1!=0){
avg1 = tot1/cnt1
max1 = cb1.max
min1 = cb1.min
}
var max2 = 0
var min2 = 0
var avg2 = 0
if (cnt2!=0){
avg2 = tot2/cnt2
max2 = cb2.max
min2 = cb2.min
}
(dname,(tot1,cnt1,avg1,max1,min1),
(tot2,cnt2,avg2,max2,min2)) }
scala> res.collect.foreach(println)
(Hr,(150000,4,37500,50000,10000),(130000,2,65000,80000,50000))
(Other,(0,0,0,0,0),(60000,2,30000,30000,30000))
(Marketing,(90000,2,45000,50000,40000),(90000,1,90000,90000,90000))
(Finance,(170000,2,85000,90000,80000),(100000,1,100000,100000,100000))
-----------------------------
Cogroup on more than two
scala> val p1 = sc.parallelize(List(("m",10000),("f",30000),("m",50000)))
p1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[30] at parallelize at <console>:27
scala> val p2 = sc.parallelize(List(("m",10000),("f",30000)))
p2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[31] at parallelize at <console>:27
scala> val p3 = sc.parallelize(List(("m",10000),("m",30000)))
p3: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[32] at parallelize at <console>:27
scala> val cg = p1.cogroup(p2,p3)
cg: org.apache.spark.rdd.RDD[(String, (Iterable[Int], Iterable[Int], Iterable[Int]))] = MapPartitionsRDD[34] at cogroup at <console>:33
scala> cg.collect.foreach(println)
(f,(CompactBuffer(30000),CompactBuffer(30000),CompactBuffer()))
(m,(CompactBuffer(10000, 50000),CompactBuffer(10000),CompactBuffer(10000, 30000)))
scala> val r = cg.map{x =>
| val sex = x._1
| val tot1 = x._2._1.sum
| val tot2 = x._2._2.sum
| val tot3 = x._2._3.sum
| (sex, tot1, tot2, tot3)
| }
r: org.apache.spark.rdd.RDD[(String, Int, Int, Int)] = MapPartitionsRDD[35] at map at <console>:37
scala> r.collect.foreach(println)
(f,30000,30000,0)
(m,60000,10000,40000)
scala>
Spark : Joins
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal emp spLab/e
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal dept spLab/d
[cloudera@quickstart ~]$ hadoop fs -cat spLab/e
101,aaaa,40000,m,11
102,bbbbbb,50000,f,12
103,cccc,50000,m,12
104,dd,90000,f,13
105,ee,10000,m,12
106,dkd,40000,m,12
107,sdkfj,80000,f,13
108,iiii,50000,m,11
109,jj,10000,m,14
110,kkk,20000,f,15
111,dddd,30000,m,15
[cloudera@quickstart ~]$ hadoop fs -cat spLab/d
11,marketing,hyd
12,hr,del
13,fin,del
21,admin,hyd
22,production,del
[cloudera@quickstart ~]$
val emp = sc.textFile("/user/cloudera/spLab/e")
val dept = sc.textFile("/user/cloudera/spLab/d")
val epair = emp.map{x =>
val w = x.split(",")
val dno = w(4).toInt
val sal = w(2).toInt
(dno, sal)
}
epair.collect.foreach(println)
(11,40000)
(12,50000)
(12,50000)
(13,90000)
(12,10000)
(12,40000)
(13,80000)
(11,50000)
(14,10000)
(15,20000)
(15,30000)
val dpair = dept.map{ x =>
val w = x.split(",")
val dno = w(0).toInt
val loc = w(2)
(dno, loc)
}
scala> dpair.collect.foreach(println)
(11,hyd)
(12,del)
(13,del)
(21,hyd)
(22,del)
-- inner join
val ij = epair.join(dpair)
ij.collect.foreach(println)
(13,(90000,del))
(13,(80000,del))
(11,(40000,hyd))
(11,(50000,hyd))
(12,(50000,del))
(12,(50000,del))
(12,(10000,del))
(12,(40000,del))
-- left outer join
val lj = epair.leftOuterJoin(dpair)
lj.collect.foreach(println)
(13,(90000,Some(del)))
(13,(80000,Some(del)))
(15,(20000,None))
(15,(30000,None))
(11,(40000,Some(hyd)))
(11,(50000,Some(hyd)))
(14,(10000,None))
(12,(50000,Some(del)))
(12,(50000,Some(del)))
(12,(10000,Some(del)))
(12,(40000,Some(del)))
-- right outer join
val rj = epair.rightOuterJoin(dpair)
rj.collect.foreach(println)
(13,(Some(90000),del))
(13,(Some(80000),del))
(21,(None,hyd))
(22,(None,del))
(11,(Some(40000),hyd))
(11,(Some(50000),hyd))
(12,(Some(50000),del))
(12,(Some(50000),del))
(12,(Some(10000),del))
(12,(Some(40000),del))
-- full outer join
val fj = epair.fullOuterJoin(dpair)
fj.collect.foreach(println)
(13,(Some(90000),Some(del)))
(13,(Some(80000),Some(del)))
(15,(Some(20000),None))
(15,(Some(30000),None))
(21,(None,Some(hyd)))
(22,(None,Some(del)))
(11,(Some(40000),Some(hyd)))
(11,(Some(50000),Some(hyd)))
(14,(Some(10000),None))
(12,(Some(50000),Some(del)))
(12,(Some(50000),Some(del)))
(12,(Some(10000),Some(del)))
(12,(Some(40000),Some(del)))
location based aggregations:
val locSal = fj.map{ x =>
val sal = x._2._1
val loc = x._2._2
val s = if(sal==None) 0 else sal.get
val l = if(loc==None) "NoCity" else loc.get
(l, s)
}
locSal.collect.foreach(println)
(del,90000)
(del,80000)
(NoCity,20000)
(NoCity,30000)
(hyd,0)
(del,0)
(hyd,40000)
(hyd,50000)
(NoCity,10000)
(del,50000)
(del,50000)
(del,10000)
(del,40000)
val locSummary = locSal.reduceByKey(_+_)
locSummary.collect.foreach(println)
(hyd,90000)
(del,320000)
(NoCity,60000)
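Not from the original notes: a small sketch that reuses the (sum, count) idea on the locSal pairs above to get the average salary per location, guarding against a zero count.
val avgByLoc = locSal.mapValues(s => (s, 1))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))               // (totalSal, rowCount) per location
  .mapValues { case (tot, cnt) => if (cnt != 0) tot / cnt else 0 }
avgByLoc.collect.foreach(println)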
-----------------
val stats = fj.map{ x =>
val sal = x._2._1
val loc = x._2._2
val stat = if(sal!=None && loc!=None) "Working" else
if(sal==None) "BenchProj" else "BenchTeam"
val s = if(sal==None) 0 else sal.get
(stat, s)
}
stats.collect.foreach(println)
(Working,90000)
(Working,80000)
(BenchTeam,20000)
(BenchTeam,30000)
(BenchProj,0)
(BenchProj,0)
(Working,40000)
(Working,50000)
(BenchTeam,10000)
(Working,50000)
(Working,50000)
(Working,10000)
(Working,40000)
val res = stats.reduceByKey(_+_)
res.collect.foreach(println)
(BenchTeam,60000)
(Working,410000)
(BenchProj,0)
Spark : Joins 2
Denormalizing datasets using Joins
[cloudera@quickstart ~]$ cat > children
c101,p101,Ravi,34
c102,p101,Rani,24
c103,p102,Mani,20
c104,p103,Giri,22
c105,p102,Vani,22
[cloudera@quickstart ~]$ cat > parents
p101,madhu,madhavi,hyd
p102,Sathya,Veni,Del
p103,Varma,Varuna,hyd
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal children spLab
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal parents spLab
[cloudera@quickstart ~]$
val children = sc.textFile("/user/cloudera/spLab/children")
val parents = sc.textFile("/user/cloudera/spLab/parents")
val chPair = children.map{ x =>
val w = x.split(",")
val pid = w(1)
val chInfo =Array(w(0), w(2), w(3)).
mkString(",")
(pid, chInfo)
}
chPair.collect.foreach(println)
(p101,c101,Ravi,34)
(p101,c102,Rani,24)
(p102,c103,Mani,20)
(p103,c104,Giri,22)
(p102,c105,Vani,22)
val PPair = parents.map{ x =>
val w = x.split(",")
val pid = w(0)
val pInfo = Array(w(1),w(2),w(3)).mkString(",")
(pid, pInfo)
}
PPair.collect.foreach(println)
(p101,madhu,madhavi,hyd)
(p102,Sathya,Veni,Del)
(p103,Varma,Varuna,hyd)
val family = chPair.join(PPair)
family.collect.foreach(println)
(p101,(c101,Ravi,34,madhu,madhavi,hyd))
(p101,(c102,Rani,24,madhu,madhavi,hyd))
(p102,(c103,Mani,20,Sathya,Veni,Del))
(p102,(c105,Vani,22,Sathya,Veni,Del))
(p103,(c104,Giri,22,Varma,Varuna,hyd))
val profiles = family.map{ x =>
val cinfo = x._2._1
val pinfo = x._2._2
val info = cinfo +","+ pinfo
info
}
profiles.collect.foreach(println)
c101,Ravi,34,madhu,madhavi,hyd
c102,Rani,24,madhu,madhavi,hyd
c103,Mani,20,Sathya,Veni,Del
c105,Vani,22,Sathya,Veni,Del
c104,Giri,22,Varma,Varuna,hyd
profiles.saveAsTextFile("/user/cloudera/spLab/profiles")
[cloudera@quickstart ~]$ hadoop fs -ls spLab/profiles
Found 2 items
-rw-r--r-- 1 cloudera cloudera 0 2017-05-08 21:02 spLab/profiles/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 150 2017-05-08 21:02 spLab/profiles/part-00000
[cloudera@quickstart ~]$ hadoop fs -cat spLab/profiles/part-00000
c101,Ravi,34,madhu,madhavi,hyd
c102,Rani,24,madhu,madhavi,hyd
c103,Mani,20,Sathya,Veni,Del
c105,Vani,22,Sathya,Veni,Del
c104,Giri,22,Varma,Varuna,hyd
[cloudera@quickstart ~]$
Spark : Spark streaming and Kafka Integration
steps:
1) start the zookeeper server
2) start Kafka brokers [ one or more ]
3) create a topic
4) start a console producer [ to write messages into the topic ]
5) start a console consumer [ to test whether messages are streamed ]
6) create a spark streaming context,
which streams from the kafka topic
7) perform transformations or aggregations
8) output operation : direct the results into another kafka topic
------------------------------------------
The following code was tested with
Spark 1.6.0 and Kafka 0.10.2.0.
kafka and spark streaming
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic spark-topic
bin/kafka-topics.sh --list --zookeeper localhost:2181
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic spark-topic
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic spark-topic --from-beginning
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
val ssc = new StreamingContext(sc, Seconds(5))
import org.apache.spark.streaming.kafka.KafkaUtils
//1.
val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181","spark-streaming-consumer-group", Map("spark-topic" -> 5))
val lines = kafkaStream.map(x => x._2.toUpperCase)
val words = lines.flatMap(x => x.split(" "))
val pair = words.map(x => (x,1))
val wc = pair.reduceByKey(_+_)
wc.print()
// use below code to write results into kafka topic
ssc.start
------------------------------
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic results1
// writing into kafka topic.
import org.apache.kafka.clients.producer.ProducerConfig
import java.util.HashMap
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerRecord
wc.foreachRDD(rdd =>
  rdd.foreachPartition(partition =>
    partition.foreach{ case (w, cnt) =>
      val x = w + "\t" + cnt
      val props = new HashMap[String, Object]()
      props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
        "localhost:9092")
      props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer")
      props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer")
      println(x)
      // note: a producer is created per record here for simplicity; reuse one per partition in real code
      val producer = new KafkaProducer[String, String](props)
      val message = new ProducerRecord[String, String]("results1", null, x)
      producer.send(message)
      producer.close()
    }))
-- execute above code before ssc.start.
--------------------------------------------
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic results1 --from-beginning
-------------------
val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181","spark-streaming-consumer-group", Map("spark-topic" -> 5))
1. KafkaUtils.createStream()
needs 4 arguments:
1st ---> streaming context
2nd ---> zookeeper details
3rd ---> consumer group id
4th ---> topics
Spark streaming can read from multiple topics.
Topics are given as key/value pairs of a Map object:
key ---> topic name
value ---> number of consumer threads.
To read from multiple topics,
the 4th argument should be as follows:
Map("t1"->2,"t2"->4,"t3"->1)
-------------------------
Each given number of consumer threads is applied to each partition of the kafka topic.
ex: the topic has 3 partitions,
consumer threads are 5,
so the total number of threads = 15.
But these 15 threads are not executed in parallel:
at a time, the 5 threads for one partition will be consuming data in parallel.
To make all (15) of them parallel, create multiple receiver streams:
val numParts = 3
val kstreams = (1 to numParts).map { _ =>
  KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer-group", Map("spark-topic" -> 5))
}
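A minimal follow-up sketch (not in the original notes): the receiver streams in kstreams are then merged into a single DStream before applying the transformations.
val unified = ssc.union(kstreams)                    // one DStream backed by several receivers
val upperLines = unified.map(x => x._2.toUpperCase)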
---------------------------------------------------------------------
scala> val x = sc.parallelize(List(1,2,3,4,5,6));
x: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> val times2 = x.map(x=>x*2);
times2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:29
scala> times2.foreach(println);
2
4
6
8
10
12
scala> times2.collect();
res5: Array[Int] = Array(2, 4, 6, 8, 10, 12)
code:
val x = sc.parallelize(List(1,2,3,4,5));
val times2 = x.map(_*2);
times2.collect();
O/P
Array[Int] = Array(2, 4, 6, 8, 10)
val rdd = sc.parallelize(List("Hi All","Welcome to India"));
val fm =rdd.flatMap(_.split(" ")); // word by word split
fm.collect();
output:
Array[String] = Array(Hi, All, Welcome, to, India)
scala> fm.foreach(println);
Hi
All
Welcome
to
India
val rdd = sc.parallelize(List("APPLE","BALL","CAT","DEER","CAN"));
val filtered = rdd.filter(_.contains("C"));
scala> filtered.collect();
res9: Array[String] = Array(CAT, CAN)
scala> filtered.foreach(println);
CAT
CAN
//distinct example
val r1 = sc.makeRDD(List(1,2,3,4,5,3,1));
println("\n distinct output");
r1.distinct().foreach(x => print(x + " "));
4 1 3 5 2
//union example - combined together
val r2 = sc.makeRDD(List(1,4,3,6,7));
r1.union(r2).foreach(x=>print(x+" "));
1 2 3 4 5 3 1 1 4 3 6 7
//common among them
r1.intersection(r2).foreach(x=>print(x + " "));
4 1 3
scala> r1.collect();
res19: Array[Int] = Array(1, 2, 3, 4, 5, 3, 1)
scala> r2.collect();
res20: Array[Int] = Array(1, 4, 3, 6, 7)
scala> r1.subtract(r2).foreach(x=>print(x+" "));
2 5 // elements of r1 that are not in r2
Cross join:
-----------
r1.cartesian(r2).foreach(x=>print(x+" "));
(1,1) (1,4) (1,3) (1,6) (1,7) (2,1) (2,4) (2,3) (2,6) (2,7) (3,1) (3,4) (3,3) (3,6) (3,7) (4,1) (4,4) (4,3) (4,6) (4,7) (5,1) (5,4) (5,3) (5,6) (5,7) (3,1) (3,4) (3,3) (3,6) (3,7) (1,1) (1,4) (1,3) (1,6) (1,7)
scala>
count:
--------
val rdd = sc.parallelize(List('A','B','C','D'));
rdd.count();
res24: Long = 4
Sum:
------
scala> val rdd = sc.parallelize(List(1,2,3,4));
scala> rdd.reduce(_+_)
res27: Int = 10
scala> val rdd = sc.parallelize(List("arun-1","kalai-2","siva-3","nalan-4","aruvi-5"));
rdd.first();
res33: String = arun-1
scala> rdd.take(3);
res34: Array[String] = Array(arun-1, kalai-2, siva-3)
scala> rdd.foreach(println);
arun-1
kalai-2
siva-3
nalan-4
aruvi-5
Dataset: Nasa_Webserver_log.tsv (from the Nasa_data_logs assignment folder)
val r1 = sc.textFile("file:///home/cloudera/Desktop/Nasa_Webserver_log.tsv");
val visitsCount = r1.filter (x => x.contains("countdown.html"));
visitsCount.count();
res0: Long = 8586
val IPAddress = r1.map (line => line.split("\t")).map(parts => parts.take(1))
val logTime = r1.map(l => l.split("\t")).map(p=>p.take(2));
IPAddress.collect()
logTime.collect()
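Not part of the original exercise: a small sketch that, assuming column 0 of the TSV is the client host/IP, counts requests per host and prints the 5 busiest ones.
val hitsPerIp = r1.map(line => (line.split("\t")(0), 1)).reduceByKey(_ + _)
hitsPerIp.map(_.swap).top(5).foreach(println)   // (count, host), highest counts first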
val v1 = sqlContext.read.format("json").load("file:///home/cloudera/Desktop/Files/customer_data.json")
scala> v1.printSchema
root
|-- email: string (nullable = true)
|-- first_name: string (nullable = true)
|-- gender: string (nullable = true)
|-- id: long (nullable = true)
|-- ip_address: string (nullable = true)
|-- last_name: string (nullable = true)
scala> v1.count
res1: Long = 1000
scala> v1.show(2)
+--------------------+----------+------+---+--------------+---------+
| email|first_name|gender| id| ip_address|last_name|
+--------------------+----------+------+---+--------------+---------+
| knattrass0@loc.gov| Kaspar| Male| 1| 244.159.51.76| Nattrass|
|rnulty1@multiply.com| Rosamund|Female| 2|237.123.21.130| Nulty|
+--------------------+----------+------+---+--------------+---------+
only showing top 2 rows
scala> v1.rdd.partitions.size
res3: Int = 1
scala> v1.columns
res4: Array[String] = Array(email, first_name, gender, id, ip_address, last_name)
scala> v1.columns.foreach(println)
email
first_name
gender
id
ip_address
last_name
scala> v1.show()
+--------------------+----------+------+---+---------------+---------+
| email|first_name|gender| id| ip_address|last_name|
+--------------------+----------+------+---+---------------+---------+
| knattrass0@loc.gov| Kaspar| Male| 1| 244.159.51.76| Nattrass|
|rnulty1@multiply.com| Rosamund|Female| 2| 237.123.21.130| Nulty|
|pglasbey2@deviant...| Pia|Female| 3| 80.11.243.170| Glasbey|
|dgowthorpe3@buzzf...| Dante| Male| 4| 197.253.81.98|Gowthorpe|
|wsprowson4@accuwe...| Willamina|Female| 5| 64.125.155.144| Sprowson|
| tbraunds5@ning.com| Trish|Female| 6| 38.111.102.64| Braunds|
|tmasey6@businessw...| Tybie|Female| 7| 44.87.135.133| Masey|
|lpapaccio7@howstu...| Leona|Female| 8| 64.233.173.104| Papaccio|
|hsaltrese8@cbsloc...| Hendrick| Male| 9| 179.21.162.161| Saltrese|
|mkingsnod9@archiv...| Marna|Female| 10| 66.254.243.50| Kingsnod|
| akybirda@mysql.com| Abram| Male| 11|166.179.168.234| Kybird|
| ktuiteb@ucoz.com| Kenneth| Male| 12| 132.182.90.153| Tuite|
|gberingerc@creati...| Gerhardt| Male| 13| 222.102.76.16| Beringer|
| adreweryd@hibu.com| Avictor| Male| 14| 196.191.41.114| Drewery|
| dupexe@myspace.com| Diahann|Female| 15| 226.50.117.72| Upex|
|dcoldbathef@wikip...| Daryl|Female| 16| 7.99.204.200|Coldbathe|
| gkestong@tamu.edu| Galven| Male| 17| 35.16.66.151| Keston|
|dilchenkoh@istock...| Daffi|Female| 18| 192.45.226.104| Ilchenko|
|lwychardi@sfgate.com| Ladonna|Female| 19| 94.194.233.152| Wychard|
| lsapirj@unblog.fr| Latrena|Female| 20|107.141.139.191| Sapir|
+--------------------+----------+------+---+---------------+---------+
only showing top 20 rows
scala> v1.show(false);
+------------------------------+----------+------+---+---------------+---------+
|email |first_name|gender|id |ip_address |last_name|
+------------------------------+----------+------+---+---------------+---------+
|knattrass0@loc.gov |Kaspar |Male |1 |244.159.51.76 |Nattrass |
|rnulty1@multiply.com |Rosamund |Female|2 |237.123.21.130 |Nulty |
|pglasbey2@deviantart.com |Pia |Female|3 |80.11.243.170 |Glasbey |
|dgowthorpe3@buzzfeed.com |Dante |Male |4 |197.253.81.98 |Gowthorpe|
|wsprowson4@accuweather.com |Willamina |Female|5 |64.125.155.144 |Sprowson |
|tbraunds5@ning.com |Trish |Female|6 |38.111.102.64 |Braunds |
|tmasey6@businessweek.com |Tybie |Female|7 |44.87.135.133 |Masey |
|lpapaccio7@howstuffworks.com |Leona |Female|8 |64.233.173.104 |Papaccio |
|hsaltrese8@cbslocal.com |Hendrick |Male |9 |179.21.162.161 |Saltrese |
|mkingsnod9@archive.org |Marna |Female|10 |66.254.243.50 |Kingsnod |
|akybirda@mysql.com |Abram |Male |11 |166.179.168.234|Kybird |
|ktuiteb@ucoz.com |Kenneth |Male |12 |132.182.90.153 |Tuite |
|gberingerc@creativecommons.org|Gerhardt |Male |13 |222.102.76.16 |Beringer |
|adreweryd@hibu.com |Avictor |Male |14 |196.191.41.114 |Drewery |
|dupexe@myspace.com |Diahann |Female|15 |226.50.117.72 |Upex |
|dcoldbathef@wikipedia.org |Daryl |Female|16 |7.99.204.200 |Coldbathe|
|gkestong@tamu.edu |Galven |Male |17 |35.16.66.151 |Keston |
|dilchenkoh@istockphoto.com |Daffi |Female|18 |192.45.226.104 |Ilchenko |
|lwychardi@sfgate.com |Ladonna |Female|19 |94.194.233.152 |Wychard |
|lsapirj@unblog.fr |Latrena |Female|20 |107.141.139.191|Sapir |
+------------------------------+----------+------+---+---------------+---------+
only showing top 20 rows
scala> v1.select("email","first_name").show
+--------------------+----------+
| email|first_name|
+--------------------+----------+
| knattrass0@loc.gov| Kaspar|
|rnulty1@multiply.com| Rosamund|
|pglasbey2@deviant...| Pia|
|dgowthorpe3@buzzf...| Dante|
|wsprowson4@accuwe...| Willamina|
| tbraunds5@ning.com| Trish|
|tmasey6@businessw...| Tybie|
|lpapaccio7@howstu...| Leona|
|hsaltrese8@cbsloc...| Hendrick|
|mkingsnod9@archiv...| Marna|
| akybirda@mysql.com| Abram|
| ktuiteb@ucoz.com| Kenneth|
|gberingerc@creati...| Gerhardt|
| adreweryd@hibu.com| Avictor|
| dupexe@myspace.com| Diahann|
|dcoldbathef@wikip...| Daryl|
| gkestong@tamu.edu| Galven|
|dilchenkoh@istock...| Daffi|
|lwychardi@sfgate.com| Ladonna|
| lsapirj@unblog.fr| Latrena|
+--------------------+----------+
only showing top 20 rows
scala> v1.select("email","first_name").show(2)
+--------------------+----------+
| email|first_name|
+--------------------+----------+
| knattrass0@loc.gov| Kaspar|
|rnulty1@multiply.com| Rosamund|
+--------------------+----------+
scala> v1.write.parquet("file:///home/cloudera/Desktop/Files/myParquet")
scala> val v2 = sqlContext.read.format("parquet").load("file:///home/cloudera/Desktop/Files/myParquet");
v2: org.apache.spark.sql.DataFrame = [email: string, first_name: string, gender: string, id: bigint, ip_address: string, last_name: string]
v2.printSchema
root
|-- email: string (nullable = true)
|-- first_name: string (nullable = true)
|-- gender: string (nullable = true)
|-- id: long (nullable = true)
|-- ip_address: string (nullable = true)
|-- last_name: string (nullable = true)
scala> v2.write.orc("file:///home/cloudera/Desktop/Files/myOrc");
scala> import com.databricks.spark.avro_
<console>:25: error: object avro_ is not a member of package com.databricks.spark
import com.databricks.spark.avro_
(the correct import is com.databricks.spark.avro._ and it requires the spark-avro package on the classpath)
Login as Admin
su
password : cloudera
cp /etc/hive/conf/hive-site.xml /etc/spark/conf/hive-site.xml
This gives Spark SQL direct access to Hive.
Here we create a database, a table, and a record in Hive; soon we will access the same from Spark SQL.
[cloudera@quickstart ~]$ hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> show databases;
OK
default
Time taken: 1.384 seconds, Fetched: 1 row(s)
hive> create database sara;
OK
Time taken: 13.242 seconds
hive> use sara;
OK
Time taken: 0.205 seconds
hive> create table mytable (id int, name string);
OK
Time taken: 0.712 seconds
hive> insert into mytable (id,name) values(101,'Raja');
hive> select * from mytable;
OK
101 Raja
hive> describe mytable;
OK
id int
name string
Here we use Spark SQL to access Hive (hive-site.xml is already copied into spark/conf, so nothing extra related to Hive needs to be configured). By default Hive is accessible when we use sqlContext.
scala> sqlContext.sql("show databases").show;
+-------+
| result|
+-------+
|default|
| sara|
+-------+
scala> sqlContext.sql("use sara");
res2: org.apache.spark.sql.DataFrame = [result: string]
scala> sqlContext.sql("show tables").show
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
| mytable| false|
+---------+-----------+
Here we import content from an external .orc file into Spark SQL and export it into Hive.
scala> val v3 = sqlContext.read.format("orc").load("file:///home/cloudera/Desktop/Files/myOrc")
v3: org.apache.spark.sql.DataFrame = [email: string, first_name: string, gender: string, id: bigint, ip_address: string, last_name: string]
scala> v3.printSchema
root
|-- email: string (nullable = true)
|-- first_name: string (nullable = true)
|-- gender: string (nullable = true)
|-- id: long (nullable = true)
|-- ip_address: string (nullable = true)
|-- last_name: string (nullable = true)
scala> v3.show(2);
+--------------------+----------+------+---+--------------+---------+
| email|first_name|gender| id| ip_address|last_name|
+--------------------+----------+------+---+--------------+---------+
| knattrass0@loc.gov| Kaspar| Male| 1| 244.159.51.76| Nattrass|
|rnulty1@multiply.com| Rosamund|Female| 2|237.123.21.130| Nulty|
+--------------------+----------+------+---+--------------+---------+
only showing top 2 rows
We register the imported ORC content as a temp table, which resides in Spark memory and not in Hive disk storage.
scala> v3.registerTempTable("customer_temp");
scala> sqlContext.sql("show tables").show
+-------------+-----------+
| tableName|isTemporary|
+-------------+-----------+
|customer_temp| true| // see here: customer_temp is flagged as isTemporary. It won't be available in Hive
| mytable| false|
+-------------+-----------+
hive> show tables;
OK
mytable
// Here Hive shows only mytable and does not display customer_temp, because it resides in memory
Log out from Hive:
hive> exit;
Log out from Spark:
scala> exit
Log on to Spark again:
$ spark-shell
scala> sqlContext.sql("use sara");
scala> sqlContext.sql("show tables").show
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
| mytable| false|
+---------+-----------+
customer_temp is not there because once we log out and log back in, the session ends and temp tables are destroyed.
val v3 = sqlContext.read.format("orc").load("file:///home/cloudera/Desktop/Files/myOrc");
To export the dataframe with its data into a Hive table (permanent table):
v3.saveAsTable("customer_per");
scala> sqlContext.sql("show tables").show
+------------+-----------+
| tableName|isTemporary|
+------------+-----------+
|customer_per| false|
| mytable| false|
+------------+-----------+
scala> sqlContext.sql("select * from customer_per").show /// long data will be shrinked and use ....
18/08/16 01:29:58 WARN parquet.CorruptStatistics: Ignoring statistics because created_by is null or empty! See PARQUET-251 and PARQUET-297
+--------------------+----------+------+---+---------------+---------+
| email|first_name|gender| id| ip_address|last_name|
+--------------------+----------+------+---+---------------+---------+
| knattrass0@loc.gov| Kaspar| Male| 1| 244.159.51.76| Nattrass|
|rnulty1@multiply.com| Rosamund|Female| 2| 237.123.21.130| Nulty|
|pglasbey2@deviant...| Pia|Female| 3| 80.11.243.170| Glasbey|
|dgowthorpe3@buzzf...| Dante| Male| 4| 197.253.81.98|Gowthorpe|
|wsprowson4@accuwe...| Willamina|Female| 5| 64.125.155.144| Sprowson|
| tbraunds5@ning.com| Trish|Female| 6| 38.111.102.64| Braunds|
|tmasey6@businessw...| Tybie|Female| 7| 44.87.135.133| Masey|
|lpapaccio7@howstu...| Leona|Female| 8| 64.233.173.104| Papaccio|
|hsaltrese8@cbsloc...| Hendrick| Male| 9| 179.21.162.161| Saltrese|
|mkingsnod9@archiv...| Marna|Female| 10| 66.254.243.50| Kingsnod|
| akybirda@mysql.com| Abram| Male| 11|166.179.168.234| Kybird|
| ktuiteb@ucoz.com| Kenneth| Male| 12| 132.182.90.153| Tuite|
|gberingerc@creati...| Gerhardt| Male| 13| 222.102.76.16| Beringer|
| adreweryd@hibu.com| Avictor| Male| 14| 196.191.41.114| Drewery|
| dupexe@myspace.com| Diahann|Female| 15| 226.50.117.72| Upex|
|dcoldbathef@wikip...| Daryl|Female| 16| 7.99.204.200|Coldbathe|
| gkestong@tamu.edu| Galven| Male| 17| 35.16.66.151| Keston|
|dilchenkoh@istock...| Daffi|Female| 18| 192.45.226.104| Ilchenko|
|lwychardi@sfgate.com| Ladonna|Female| 19| 94.194.233.152| Wychard|
| lsapirj@unblog.fr| Latrena|Female| 20|107.141.139.191| Sapir|
+--------------------+----------+------+---+---------------+---------+
only showing top 20 rows
scala> sqlContext.sql("select * from customer_per").show(false); // output will be displayed full long data (see there : no .... here)
+------------------------------+----------+------+---+---------------+----------
|email |first_name|gender|id |ip_address |last_name|
+------------------------------+----------+------+---+---------------+---------+
|knattrass0@loc.gov |Kaspar |Male |1 |244.159.51.76 |Nattrass |
|rnulty1@multiply.com |Rosamund |Female|2 |237.123.21.130 |Nulty |
|pglasbey2@deviantart.com |Pia |Female|3 |80.11.243.170 |Glasbey |
|dgowthorpe3@buzzfeed.com |Dante |Male |4 |197.253.81.98 |Gowthorpe|
|wsprowson4@accuweather.com |Willamina |Female|5 |64.125.155.144 |Sprowson |
|tbraunds5@ning.com |Trish |Female|6 |38.111.102.64 |Braunds |
|tmasey6@businessweek.com |Tybie |Female|7 |44.87.135.133 |Masey |
|lpapaccio7@howstuffworks.com |Leona |Female|8 |64.233.173.104 |Papaccio |
|hsaltrese8@cbslocal.com |Hendrick |Male |9 |179.21.162.161 |Saltrese |
|mkingsnod9@archive.org |Marna |Female|10 |66.254.243.50 |Kingsnod |
|akybirda@mysql.com |Abram |Male |11 |166.179.168.234|Kybird |
|ktuiteb@ucoz.com |Kenneth |Male |12 |132.182.90.153 |Tuite |
|gberingerc@creativecommons.org|Gerhardt |Male |13 |222.102.76.16 |Beringer |
|adreweryd@hibu.com |Avictor |Male |14 |196.191.41.114 |Drewery |
|dupexe@myspace.com |Diahann |Female|15 |226.50.117.72 |Upex |
|dcoldbathef@wikipedia.org |Daryl |Female|16 |7.99.204.200 |Coldbathe|
|gkestong@tamu.edu |Galven |Male |17 |35.16.66.151 |Keston |
|dilchenkoh@istockphoto.com |Daffi |Female|18 |192.45.226.104 |Ilchenko |
|lwychardi@sfgate.com |Ladonna |Female|19 |94.194.233.152 |Wychard |
|lsapirj@unblog.fr |Latrena |Female|20 |107.141.139.191|Sapir |
+------------------------------+----------+------+---+---------------+---------+
only showing top 20 rows
scala> v3.show
+--------------------+----------+------+---+---------------+---------+
| email|first_name|gender| id| ip_address|last_name|
+--------------------+----------+------+---+---------------+---------+
| knattrass0@loc.gov| Kaspar| Male| 1| 244.159.51.76| Nattrass|
|rnulty1@multiply.com| Rosamund|Female| 2| 237.123.21.130| Nulty|
|pglasbey2@deviant...| Pia|Female| 3| 80.11.243.170| Glasbey|
|dgowthorpe3@buzzf...| Dante| Male| 4| 197.253.81.98|Gowthorpe|
|wsprowson4@accuwe...| Willamina|Female| 5| 64.125.155.144| Sprowson|
| tbraunds5@ning.com| Trish|Female| 6| 38.111.102.64| Braunds|
|tmasey6@businessw...| Tybie|Female| 7| 44.87.135.133| Masey|
|lpapaccio7@howstu...| Leona|Female| 8| 64.233.173.104| Papaccio|
|hsaltrese8@cbsloc...| Hendrick| Male| 9| 179.21.162.161| Saltrese|
|mkingsnod9@archiv...| Marna|Female| 10| 66.254.243.50| Kingsnod|
| akybirda@mysql.com| Abram| Male| 11|166.179.168.234| Kybird|
| ktuiteb@ucoz.com| Kenneth| Male| 12| 132.182.90.153| Tuite|
|gberingerc@creati...| Gerhardt| Male| 13| 222.102.76.16| Beringer|
| adreweryd@hibu.com| Avictor| Male| 14| 196.191.41.114| Drewery|
| dupexe@myspace.com| Diahann|Female| 15| 226.50.117.72| Upex|
|dcoldbathef@wikip...| Daryl|Female| 16| 7.99.204.200|Coldbathe|
| gkestong@tamu.edu| Galven| Male| 17| 35.16.66.151| Keston|
|dilchenkoh@istock...| Daffi|Female| 18| 192.45.226.104| Ilchenko|
|lwychardi@sfgate.com| Ladonna|Female| 19| 94.194.233.152| Wychard|
| lsapirj@unblog.fr| Latrena|Female| 20|107.141.139.191| Sapir|
+--------------------+----------+------+---+---------------+---------+
only showing top 20 rows
scala> v3.show(2); // data frame syntax to view data
+--------------------+----------+------+---+--------------+---------+
| email|first_name|gender| id| ip_address|last_name|
+--------------------+----------+------+---+--------------+---------+
| knattrass0@loc.gov| Kaspar| Male| 1| 244.159.51.76| Nattrass|
|rnulty1@multiply.com| Rosamund|Female| 2|237.123.21.130| Nulty|
+--------------------+----------+------+---+--------------+---------+
only showing top 2 rows
scala> sqlContext.sql("select * from customer_per").show(2);
+--------------------+----------+------+---+--------------+---------+
| email|first_name|gender| id| ip_address|last_name|
+--------------------+----------+------+---+--------------+---------+
| knattrass0@loc.gov| Kaspar| Male| 1| 244.159.51.76| Nattrass|
|rnulty1@multiply.com| Rosamund|Female| 2|237.123.21.130| Nulty|
+--------------------+----------+------+---+--------------+---------+
only showing top 2 rows
To get the number of records:
scala> sqlContext.sql("select * from customer_per").count
res18: Long = 1000
scala> sqlContext.sql("select count(*) from customer_per").show
18/08/16 01:36:42 WARN hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+----+
| _c0|
+----+
|1000|
+----+
// DataFrame syntax to get the number of records
scala> v3.count
res21: Long = 1000
Resilient Distributed Dataset (RDD) into Data Frame (df):
to do this, we need to import
sqlContext.implicits._
scala> import sqlContext.implicits._
import sqlContext.implicits._
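Not from the original notes: a minimal sketch showing the same conversion with a case class, so the DataFrame gets named, typed columns without passing them to toDF (WordCount and the sample pairs are made up for illustration).
case class WordCount(word: String, cnt: Int)
val wcDF = sc.parallelize(Seq(("spark", 3), ("hive", 1))).map { case (w, c) => WordCount(w, c) }.toDF()
wcDF.printSchema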
create a wordcount.txt file in Files folder of Desktop
Launch terminal
gedit wordcount.txt
I love India
You know that or not
I love my India
You know that or not
I love Singapore
You know that or not
I love Bangalore
Why I am writing these here?
Kaipulla Thoongudha?
save this file in /home/cloudera/Desktop/Files/wordcount.txt
Now load this file into RDD:
scala> val r1 = sc.textFile("file:///home/cloudera/Desktop/Files/wordcount.txt");
scala> r1.collect.foreach(println);
I love India
You know that or not
I love my India
You know that or not
I love Singapore
You know that or not
I love Bangalore
Why I am writing these here?
Kaipulla Thoongudha?
scala> r1.partitions.size
res25: Int = 1
scala> r1.foreach(println);
I love India
You know that or not
I love my India
You know that or not
I love Singapore
You know that or not
I love Bangalore
Why I am writing these here?
Kaipulla Thoongudha?
scala> val r2 = r1.flatMap(l => l.split(" "))
r2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[69] at flatMap at <console>:32
scala> r2.foreach(println);
I
love
India
You
know
that
or
not
I
love
my
India
You
know
that
or
not
I
love
Singapore
You
know
that
or
not
I
love
Bangalore
Why
I
am
writing
these
here?
Kaipulla
Thoongudha?
scala> val r3 = r2.map (x => (x,1));
r3: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[70] at map at <console>:34
scala> r3.foreach(println);
(I,1)
(love,1)
(India,1)
(You,1)
(know,1)
(that,1)
(or,1)
(not,1)
(I,1)
(love,1)
(my,1)
(India,1)
(You,1)
(know,1)
(that,1)
(or,1)
(not,1)
(I,1)
(love,1)
(Singapore,1)
(You,1)
(know,1)
(that,1)
(or,1)
(not,1)
(I,1)
(love,1)
(Bangalore,1)
(Why,1)
(I,1)
(am,1)
(writing,1)
(these,1)
(here?,1)
(Kaipulla,1)
(Thoongudha?,1)
scala> val r4 = r3.reduceByKey((x,y) => x+y)
scala> r4.foreach(println);
(not,3)
(here?,1)
(writing,1)
(that,3)
(am,1)
(or,3)
(You,3)
(Thoongudha?,1)
(love,4)
(Bangalore,1)
(Singapore,1)
(I,5)
(know,3)
(Why,1)
(my,1)
(Kaipulla,1)
(these,1)
(India,2)
// r1, r2, r3, r4 are RDDs, but r5 is a DataFrame
toDF : is used to convert an RDD into a DataFrame
-----------------------------------------------
scala> val r5 = r3.reduceByKey((x,y) => x+y).toDF("word","word_count"); // word and word_count are column names specified here
r5: org.apache.spark.sql.DataFrame = [word: string, word_count: int]
scala> r5.show
+-----------+----------+
| word|word_count| // column names are here
+-----------+----------+
| not| 3|
| here?| 1|
| writing| 1|
| that| 3|
| am| 1|
| or| 3|
| You| 3|
|Thoongudha?| 1|
| love| 4|
| Bangalore| 1|
| Singapore| 1|
| I| 5|
| know| 3|
| Why| 1|
| my| 1|
| Kaipulla| 1|
| these| 1|
| India| 2|
+-----------+----------+
scala> val r6 = r3.reduceByKey((x,y) => x+y).toDF; // column names are missing
r6: org.apache.spark.sql.DataFrame = [_1: string, _2: int] // _1 and _2 are column names internally given by spark
scala> r6.show
+-----------+---+
| _1| _2| // _1 and _2 are column names internally given by spark
+-----------+---+
| not| 3|
| here?| 1|
| writing| 1|
| that| 3|
| am| 1|
| or| 3|
| You| 3|
|Thoongudha?| 1|
| love| 4|
| Bangalore| 1|
| Singapore| 1|
| I| 5|
| know| 3|
| Why| 1|
| my| 1|
| Kaipulla| 1|
| these| 1|
| India| 2|
+-----------+---+
scala> r5.columns
res3: Array[String] = Array(word, word_count)
scala> r5.columns.foreach(println);
word
word_count
Load the wordcount.txt file and do a word count using Spark with Scala:
val r1 = sc.textFile("file:///home/cloudera/Desktop/Files/wordcount.txt");
val r2 = r1.flatMap(l => l.split(" "))
val r3 = r2.map (x => (x,1));
val r4 = r3.reduceByKey((x,y) => x+y)
val r5 = r3.reduceByKey((x,y) => x+y).toDF("word","word_count");
scala> sqlContext.sql("use sara");
scala> r5.registerTempTable("wordcount_temp"); // create a temporary table in spark in memory
scala> r5.saveAsTable("wordcount_per"); // create a permanent table in hive disk storage
scala> sqlContext.sql("show tables").show
+--------------+-----------+
| tableName|isTemporary|
+--------------+-----------+
|wordcount_temp| true| // temporary
| customer_per| false| // permanent
| mytable| false|
| wordcount_per| false|
+--------------+-----------+
$ hive
hive> use sara;
OK
Time taken: 0.684 seconds
hive> show tables;
OK
customer_per
mytable
wordcount_per
Time taken: 0.425 seconds, Fetched: 3 row(s) // Here wordcount_temp is not there
Back to Spark...
scala> sqlContext.sql("select * from wordcount_temp").show
+-----------+----------+
| word|word_count|
+-----------+----------+
| not| 3|
| here?| 1|
| writing| 1|
| that| 3|
| am| 1|
| or| 3|
| You| 3|
|Thoongudha?| 1|
| love| 4|
| Bangalore| 1|
| Singapore| 1|
| I| 5|
| know| 3|
| Why| 1|
| my| 1|
| Kaipulla| 1|
| these| 1|
| India| 2|
+-----------+----------+
scala> sqlContext.sql("select * from wordcount_per").count
Long = 18
Load json directly into Data frame:
scala> val df = sqlContext.read.format("json").load("file:///home/cloudera/Desktop/Files/city_json");
df: org.apache.spark.sql.DataFrame = [abbr: string, district: string, id: bigint, name: string, population: bigint]
scala> df.printSchema();
root
|-- abbr: string (nullable = true)
|-- district: string (nullable = true)
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- population: long (nullable = true)
scala> df.select("name","population").show(2);
+--------+----------+
| name|population|
+--------+----------+
| Kabul| 1780000|
|Qandahar| 237500|
+--------+----------+
only showing top 2 rows
scala> df.select ("id","name","district","population","abbr").show(5);
+---+--------------+-------------+----------+----+
| id| name| district|population|abbr|
+---+--------------+-------------+----------+----+
| 1| Kabul| Kabol| 1780000| AFG|
| 2| Qandahar| Qandahar| 237500| AFG|
| 3| Herat| Herat| 186800| AFG|
| 4|Mazar-e-Sharif| Balkh| 127800| AFG|
| 5| Amsterdam|Noord-Holland| 731200| NLD|
+---+--------------+-------------+----------+----+
only showing top 5 rows
scala> df.select($"name".as("country"),($"population" * 0.10).alias("population")).show;
+----------------+------------------+
| country| population|
+----------------+------------------+
| Kabul| 178000.0|
| Qandahar| 23750.0|
| Herat| 18680.0|
| Mazar-e-Sharif| 12780.0|
| Amsterdam| 73120.0|
| Rotterdam|59332.100000000006|
| Haag| 44090.0|
| Utrecht|23432.300000000003|
| Eindhoven|20184.300000000003|
| Tilburg| 19323.8|
| Groningen|17270.100000000002|
| Breda|16039.800000000001|
| Apeldoorn| 15349.1|
| Nijmegen|15246.300000000001|
| Enschede|14954.400000000001|
| Haarlem| 14877.2|
| Almere| 14246.5|
| Arnhem| 13802.0|
| Zaanstad| 13562.1|
|´s-Hertogenbosch| 12917.0|
+----------------+------------------+
scala> df.filter($"population" > 100000).sort($"population".desc).show();
+----+--------------------+---+-----------------+----------+
|abbr| district| id| name|population|
+----+--------------------+---+-----------------+----------+
| BRA| São Paulo|206| São Paulo| 9968485|
| IDN| Jakarta Raya|939| Jakarta| 9604900|
| GBR| England|456| London| 7285000|
| EGY| Kairo|608| Cairo| 6789479|
| BRA| Rio de Janeiro|207| Rio de Janeiro| 5598953|
| CHL| Santiago|554|Santiago de Chile| 4703954|
| BGD| Dhaka|150| Dhaka| 3612850|
| EGY| Aleksandria|609| Alexandria| 3328196|
| AUS| New South Wales|130| Sydney| 3276207|
| ARG| Distrito Federal| 69| Buenos Aires| 2982146|
| ESP| Madrid|653| Madrid| 2879052|
| AUS| Victoria|131| Melbourne| 2865329|
| IDN| East Java|940| Surabaya| 2663820|
| ETH| Addis Abeba|756| Addis Abeba| 2495000|
| IDN| West Java|941| Bandung| 2429000|
| ZAF| Western Cape|712| Cape Town| 2352121|
| BRA| Bahia|208| Salvador| 2302832|
| EGY| Giza|610| Giza| 2221868|
| PHL|National Capital Reg|765| Quezon| 2173831|
| DZA| Alger| 35| Alger| 2168000|
+----+--------------------+---+-----------------+----------+
scala> df.groupBy("abbr").agg(sum("population").as("population")).show()
+----+----------+
|abbr|population|
+----+----------+
| BEL| 1609322|
| GRD| 4621|
| BEN| 968503|
| BRN| 21484|
| GEO| 1880900|
| GRL| 13445|
| BFA| 1229000|
| GLP| 75380|
| HTI| 1517338|
| ECU| 5744142|
| BLZ| 62915|
| DOM| 2438276|
| ARE| 1728336|
| ARG| 19996563|
| HND| 1287000|
| ARM| 1633100|
| GMB| 144926|
| ALB| 270000|
| FRO| 14542|
| BGD| 8569906|
+----+----------+
only showing top 20 rows
scala> df.select("abbr").distinct().sort("abbr").show
+----+
|abbr|
+----+
| ABW|
| AFG|
| AGO|
| AIA|
| ALB|
| AND|
| ANT|
| ARE|
| ARG|
| ARM|
| ASM|
| ATG|
| AUS|
| AZE|
| BDI|
| BEL|
| BEN|
| BFA|
| BGD|
| BGR|
+----+
only showing top 20 rows
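Not from the original notes: the same per-country aggregation expressed as SQL, assuming the df loaded from city_json above is registered as a temp table first.
df.registerTempTable("city")
sqlContext.sql("select abbr, sum(population) as population from city group by abbr").show(5)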
---------------------------------------------------------------------
scala> val r1 = List((11,10000),(11,20000),(12,30000),(12,40000),(13,50000))
r1: List[(Int, Int)] = List((11,10000), (11,20000), (12,30000), (12,40000), (13,50000))
scala> val r2 = List((11,"Hyd"),(12,"Del"),(13,"Hyd"))
r2: List[(Int, String)] = List((11,Hyd), (12,Del), (13,Hyd))
scala> val rdd1 = sc.parallelize(r1)
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:29
scala> rdd1.collect.foreach(println) // salary info
(11,10000)
(11,20000)
(12,30000)
(12,40000)
(13,50000)
scala> val rdd2 = sc.parallelize(r2)
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[1] at parallelize at <console>:29
scala> rdd2.collect.foreach(println) // location info
(11,Hyd)
(12,Del)
(13,Hyd)
scala> val j = rdd1.join(rdd2)
j: org.apache.spark.rdd.RDD[(Int, (Int, String))] = MapPartitionsRDD[4] at join at <console>:35
scala> j.collect.foreach(println)
(13,(50000,Hyd))
(11,(10000,Hyd))
(11,(20000,Hyd))
(12,(30000,Del))
(12,(40000,Del))
scala> var citySalPair = j.map { x =>
| val city = x._2._2
| val sal = x._2._1
| (city,sal)
| }
citySalPair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[5] at map at <console>:43
scala> citySalPair.collect.foreach(println)
(Hyd,50000)
(Hyd,10000)
(Hyd,20000)
(Del,30000)
(Del,40000)
scala> val res = citySalPair.reduceByKey(_+_)
res: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[6] at reduceByKey at <console>:45
scala> res.collect.foreach(println) // salary grouped by city aggregation
(Del,70000)
(Hyd,80000)
In the above example, we joined two RDDs whose tuples have only 2 fields.
If a tuple has more than 2 fields, how do we join?
scala> val e = List((11,30000,10000),(11,40000,20000),(12,50000,30000),(13,60000,20000),(12,80000,30000))
e: List[(Int, Int, Int)] = List((11,30000,10000), (11,40000,20000), (12,50000,30000), (13,60000,20000), (12,80000,30000))
scala> val ee = sc.parallelize(e)
ee: org.apache.spark.rdd.RDD[(Int, Int, Int)] = ParallelCollectionRDD[7] at parallelize at <console>:29
scala> ee.foreach(println)
(11,30000,10000)
(11,40000,20000)
(12,50000,30000)
(13,60000,20000)
(12,80000,30000)
scala> rdd2.collect.foreach(println)
(11,Hyd)
(12,Del)
(13,Hyd)
scala> val j2 = ee.join(rdd2)
<console>:35: error: value join is not a member of org.apache.spark.rdd.RDD[(Int, Int, Int)]
val j2 = ee.join(rdd2)
^
// while joining both should be key, value pairs
scala> ee.collect.foreach(println) // Here the below is not a key,value pair
(11,30000,10000)
(11,40000,20000)
(12,50000,30000)
(13,60000,20000)
(12,80000,30000)
we need to do one more transformation
scala> val e3 = ee.map { x =>
| val dno = x._1
| val sal = x._2
| val bonus = x._3
| (dno, (sal,bonus))
| }
e3: org.apache.spark.rdd.RDD[(Int, (Int, Int))] = MapPartitionsRDD[8] at map at <console>:37
scala> e3.collect.foreach(println) // Here we formed key,value pairs
(11,(30000,10000))
(11,(40000,20000))
(12,(50000,30000))
(13,(60000,20000))
(12,(80000,30000))
scala> val j4 = e3.join(rdd2)
j4: org.apache.spark.rdd.RDD[(Int, ((Int, Int), String))] = MapPartitionsRDD[14] at join at <console>:43
scala> j4.collect.foreach(println)
(13,((60000,20000),Hyd))
(11,((30000,10000),Hyd))
(11,((40000,20000),Hyd))
(12,((50000,30000),Del))
(12,((80000,30000),Del))
scala> val j3 = e3.join(rdd2)
j3: org.apache.spark.rdd.RDD[(Int, ((Int, Int), String))] = MapPartitionsRDD[12] at join at <console>:37
scala> j3.collect.foreach(println)
(13,((60000,20000),Hyd))
(11,((30000,10000),Hyd))
(11,((40000,20000),Hyd))
(12,((50000,30000),Del))
(12,((80000,30000),Del))
scala> val pair = j3.map { x =>
val sal = x._2._1._1
val bonus = x._2._1._2
val tot = sal+bonus
val city = x._2._2
(city,tot)
}
scala> pair.collect.foreach(println)
(Hyd,80000)
(Hyd,40000)
(Hyd,60000)
(Del,80000)
(Del,110000)
scala> val resultOfCityAgg = pair.reduceByKey(_+_)
resultOfCityAgg: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[14] at reduceByKey at <console>:41
scala> resultOfCityAgg.foreach(println)
(Del,190000)
(Hyd,180000)
Create the following files in local Linux, then copy them into HDFS:
-----------------------------------------------------------------
[cloudera@quickstart ~]$ cat > emp
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
104,dd,40000,m,12
105,cccc,70000,f,13
106,de,80000,f,13
107,io,90000,m,14
108,yu,100000,f,14
109,poi,30000,m,11
110,aaa,60000,f,14
123,djdj,900000,m,15
122,asasd,10000,m,15
[cloudera@quickstart ~]$ cat > dept
11,marketing,hyd
12,hr,del
13,finance,hyd
14,admin,del
15,accounts,hyd
Copy the files into Sparks (HDFS):
----------------------------------
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal emp Sparks
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal dept Sparks
[cloudera@quickstart ~]$
[cloudera@quickstart ~]$ hdfs dfs -ls Sparks
Found 2 items
-rw-r--r-- 1 cloudera cloudera 71 2018-10-09 02:06 Sparks/dept
-rw-r--r-- 1 cloudera cloudera 232 2018-10-09 02:06 Sparks/emp
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/dept
11,marketing,hyd
12,hr,del
13,finance,hyd
14,admin,del
15,accounts,hyd
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/emp
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
104,dd,40000,m,12
105,cccc,70000,f,13
106,de,80000,f,13
107,io,90000,m,14
108,yu,100000,f,14
109,poi,30000,m,11
110,aaa,60000,f,14
123,djdj,900000,m,15
122,asasd,10000,m,15
scala> val emp = sc.textFile("/user/cloudera/Sparks/emp")
scala> val dept = sc.textFile("/user/cloudera/Sparks/dept")
scala> emp.collect.foreach(println)
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
104,dd,40000,m,12
105,cccc,70000,f,13
106,de,80000,f,13
107,io,90000,m,14
108,yu,100000,f,14
109,poi,30000,m,11
110,aaa,60000,f,14
123,djdj,900000,m,15
122,asasd,10000,m,15
scala> dept.collect.foreach(println)
11,marketing,hyd
12,hr,del
13,finance,hyd
14,admin,del
15,accounts,hyd
To perform joins, we need to make key,value pairs
scala> val e = emp.map { x =>
| val w = x.split(",")
| val dno = w(4).toInt
| val id = w(0)
| val name = w(1)
| val sal = w(2).toInt
| val sex = w(3)
| val info = id +","+name+","+sal+","+sex
| (dno,info)
| }
e: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[20] at map at <console>:29
scala> e.collect.foreach(println)
(12,101,aaaa,70000,m) // internally (12, "101,aaaa,70000,m")
(12,102,bbbbb,90000,f)
(11,103,cc,10000,m)
(12,104,dd,40000,m)
(13,105,cccc,70000,f)
(13,106,de,80000,f)
(14,107,io,90000,m)
(14,108,yu,100000,f)
(11,109,poi,30000,m)
(14,110,aaa,60000,f)
(15,123,djdj,900000,m)
(15,122,asasd,10000,m)
scala> val d = dept.map { x =>
| val w = x.split(",")
| val dno = w(0).toInt
| val info = w(1)+","+w(2)
| (dno,info)
| }
d: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[21] at map at <console>:29
scala> d.collect.foreach(println)
(11,marketing,hyd) /// (11,"marketing,hyd") internally
(12,hr,del)
(13,finance,hyd)
(14,admin,del)
(15,accounts,hyd)
scala> val ed = e.join(d)
ed: org.apache.spark.rdd.RDD[(Int, (String, String))] = MapPartitionsRDD[24] at join at <console>:35
scala> ed.collect.foreach(println)
(13,(105,cccc,70000,f,finance,hyd)) // internally (13, ("105,cccc,70000,f", "finance,hyd"))
(13,(106,de,80000,f,finance,hyd))
(15,(123,djdj,900000,m,accounts,hyd))
(15,(122,asasd,10000,m,accounts,hyd))
(11,(103,cc,10000,m,marketing,hyd))
(11,(109,poi,30000,m,marketing,hyd))
(14,(107,io,90000,m,admin,del))
(14,(108,yu,100000,f,admin,del))
(14,(110,aaa,60000,f,admin,del))
(12,(101,aaaa,70000,m,hr,del))
(12,(102,bbbbb,90000,f,hr,del))
(12,(104,dd,40000,m,hr,del))
scala> val ed2 = ed.map { x =>
| val einfo = x._2._1
| val dinfo = x._2._2
| val info = einfo +","+dinfo
| info
| }
ed2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[25] at map at <console>:37
scala> ed2.collect.foreach(println)
105,cccc,70000,f,finance,hyd
106,de,80000,f,finance,hyd
123,djdj,900000,m,accounts,hyd
122,asasd,10000,m,accounts,hyd
103,cc,10000,m,marketing,hyd
109,poi,30000,m,marketing,hyd
107,io,90000,m,admin,del
108,yu,100000,f,admin,del
110,aaa,60000,f,admin,del
101,aaaa,70000,m,hr,del
102,bbbbb,90000,f,hr,del
104,dd,40000,m,hr,del
Write the RDD into HDFS as a file:
scala> ed2.saveAsTextFile("/user/cloudera/Sparks/res1")
[cloudera@quickstart ~]$ hdfs dfs -ls Sparks
Found 3 items
-rw-r--r-- 1 cloudera cloudera 71 2018-10-09 02:06 Sparks/dept
-rw-r--r-- 1 cloudera cloudera 232 2018-10-09 02:06 Sparks/emp
drwxr-xr-x - cloudera cloudera 0 2018-10-09 02:21 Sparks/res1
[cloudera@quickstart ~]$ hdfs dfs -ls Sparks/res1
Found 2 items
-rw-r--r-- 1 cloudera cloudera 0 2018-10-09 02:21 Sparks/res1/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 325 2018-10-09 02:21 Sparks/res1/part-00000
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/res1/part-00000
105,cccc,70000,f,finance,hyd
106,de,80000,f,finance,hyd
123,djdj,900000,m,accounts,hyd
122,asasd,10000,m,accounts,hyd
103,cc,10000,m,marketing,hyd
109,poi,30000,m,marketing,hyd
107,io,90000,m,admin,del
108,yu,100000,f,admin,del
110,aaa,60000,f,admin,del
101,aaaa,70000,m,hr,del
102,bbbbb,90000,f,hr,del
104,dd,40000,m,hr,del
scala> val emp = sc.textFile("/user/cloudera/Sparks/emp")
emp: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/emp MapPartitionsRDD[28] at textFile at <console>:27
scala> val dept = sc.textFile("/user/cloudera/Sparks/dept")
dept: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/dept MapPartitionsRDD[30] at textFile at <console>:27
scala> val ednosal = emp.map { x =>
| val w = x.split(",")
| val dno = w(4)
| val sal = w(2).toInt
| (dno,sal)
| }
ednosal: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[31] at map at <console>:29
scala> ednosal.collect.foreach(println)
(12,70000)
(12,90000)
(11,10000)
(12,40000)
(13,70000)
(13,80000)
(14,90000)
(14,100000)
(11,30000)
(14,60000)
(15,900000)
(15,10000)
scala> val dnoCity = dept.map { x =>
| val w = x.split(",")
| val dno = w(0)
| val city = w(2)
| (dno,city)
| }
dnoCity: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[32] at map at <console>:29
scala> dnoCity.collect.foreach(println)
(11,hyd)
(12,del)
(13,hyd)
(14,del)
(15,hyd)
Now ednosal and dnoCity both are key,value pairs
scala> val edjoin = ednosal.join(dnoCity)
edjoin: org.apache.spark.rdd.RDD[(String, (Int, String))] = MapPartitionsRDD[35] at join at <console>:35
scala> edjoin.collect.foreach(println)
(14,(90000,del))
(14,(100000,del))
(14,(60000,del))
(15,(900000,hyd))
(15,(10000,hyd))
(12,(70000,del))
(12,(90000,del))
(12,(40000,del))
(13,(70000,hyd))
(13,(80000,hyd))
(11,(10000,hyd))
(11,(30000,hyd))
scala> val citysal = edjoin.map { x =>
| val city = x._2._2
| val sal = x._2._1
| (city,sal)
| }
citysal: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[36] at map at <console>:37
scala> citysal.collect.foreach(println)
(del,90000)
(del,100000)
(del,60000)
(hyd,900000)
(hyd,10000)
(del,70000)
(del,90000)
(del,40000)
(hyd,70000)
(hyd,80000)
(hyd,10000)
(hyd,30000)
scala>
// performing city based aggregation
scala> val res = citysal.reduceByKey(_+_)
res: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[37] at reduceByKey at <console>:39
scala> res.collect.foreach(println)
(hyd,1100000)
(del,450000)
Resilient Distributed DataSets
RDD is subdivided into partitions and partitions are distributed across multiple slaves
3 ways to create RDDs
a) Read Data from file using SparkContext (sc)
b) When you perform transformation against existing RDD
c) When you parallelize local objects
Two types of operations
Transformations and Actions
Transformations:
a) element wise
operation over each element of the collection
map, flatMap
b) grouping aggregations
reduceByKey, groupByKey
c) Filters
filter, filterByRange
Actions:
The RDD data flow is executed only when an action is performed.
During the flow execution, the RDDs are loaded into RAM.
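A tiny sketch tying these notes together (the emp path is the one used in the join example above): the three ways an RDD comes into existence, plus one action that triggers execution.
val fromFile  = sc.textFile("/user/cloudera/Sparks/emp")   // a) read data from a file via sc
val fromTrans = fromFile.map(_.split(","))                 // b) transformation of an existing RDD
val fromLocal = sc.parallelize(List(10, 20, 30, 40))       // c) parallelize a local collection
fromFile.count()                                           // an action: only now does the flow actually execute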
scala> val x = List(10,20,30,40,30,23,45,36)
x: List[Int] = List(10, 20, 30, 40, 30, 23, 45, 36)
scala> val y = sc.parallelize(x)
y: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[38] at parallelize at <console>:29
scala> val a = x.map (x => x + 100)
a: List[Int] = List(110, 120, 130, 140, 130, 123, 145, 136)
scala> val b = y.map (x => x + 100)
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[39] at map at <console>:31
scala> b.collect.foreach(println)
110
120
130
140
130
123
145
136
Here x and a are local objects; y and b are RDDs.
Local objects:
An object that resides in the client machine is called a local object.
RDDs are declared at the client; during flow execution they are loaded into the slaves of the Spark cluster.
x is declared as a local object.
y is the parallelized version of x, so y is created as an RDD.
a is a local object, because a is a transformation of x (which is local).
scala> x
res29: List[Int] = List(10, 20, 30, 40, 30, 23, 45, 36)
scala> val c = x.filter (x => x >= 40)
c: List[Int] = List(40, 45)
scala> val d = y.filter(x => x >= 40)
d: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[40] at filter at <console>:31
scala> val res = d.collect
res: Array[Int] = Array(40, 45)
collect is an action.
a) y is loaded into RAM and waits for data
b) d is loaded into RAM and performs the computation; once d is ready, y can be removed from RAM
The collect action is then executed on d.
The collect action gathers the results of all partitions of the RDD into the client machine (local).
val res = d.collect
res is a local object.
What does parallelize() do?
It converts local objects into RDDs.
val a = sc.parallelize(List(10,20,30,40))
'a' is an RDD with 1 partition, so no parallel processing is possible during execution.
val b = sc.parallelize(List(10,20,30,40),2)
Now 'b' is an RDD with 2 partitions;
during execution these 2 partitions are loaded into the RAM of 2 separate slaves, so
parallel processing is possible.
scala> val x = List(10,20,30,40,1,2,3,4,90,12)
x: List[Int] = List(10, 20, 30, 40, 1, 2, 3, 4, 90, 12)
scala> x.size
res30: Int = 10
scala> x.length
res31: Int = 10
scala> val r1 = sc.parallelize(x) // partition is 1
r1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[41] at parallelize at <console>:29
scala> r1.partitions.size
res32: Int = 1
scala> val r2 = sc.parallelize(x,3) // partition is 3 so parallel achieved
r2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[42] at parallelize at <console>:29
scala> r2.partitions.size
res33: Int = 3
//we are going to perform wordcount analysis in Spark
//create a text file in local Linux named comment, with repeated words as its content
[cloudera@quickstart ~]$ cat > comment
I love Spark
I love Hadoop
I love Spark and Hadoop
Hadoop and Spark are great systems
[cloudera@quickstart ~]$ ls comment
comment
[cloudera@quickstart ~]$ pwd
/home/cloudera
//copy comment into Sparks folder of HDFS
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal comment Sparks
[cloudera@quickstart ~]$ hdfs dfs -ls Sparks
Found 4 items
-rw-r--r-- 1 cloudera cloudera 86 2018-10-10 01:21 Sparks/comment
-rw-r--r-- 1 cloudera cloudera 71 2018-10-09 02:06 Sparks/dept
-rw-r--r-- 1 cloudera cloudera 232 2018-10-09 02:06 Sparks/emp
drwxr-xr-x - cloudera cloudera 0 2018-10-09 02:21 Sparks/res1
//view it
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/comment
I love Spark
I love Hadoop
I love Spark and Hadoop
Hadoop and Spark are great systems
[cloudera@quickstart ~]$
scala> val data = sc.textFile("/user/cloudera/Sparks/comment")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/comment MapPartitionsRDD[44] at textFile at <console>:27
scala> data.count
res35: Long = 4
scala> data.collect.foreach(println)
I love Spark
I love Hadoop
I love Spark and Hadoop
Hadoop and Spark are great systems
scala> data.collect
res37: Array[String] = Array(I love Spark, I love Hadoop, I love Spark and Hadoop, Hadoop and Spark are great systems)
read text file contents and put them into RDDs
scala> val lines = sc.textFile("hdfs://quickstart.cloudera/user/cloudera/Sparks/comment")
lines: org.apache.spark.rdd.RDD[String] = hdfs://quickstart.cloudera/user/cloudera/Sparks/comment MapPartitionsRDD[46] at textFile at <console>:27
scala> lines.count
res38: Long = 4
scala> lines.collect
res39: Array[String] = Array(I love Spark, I love Hadoop, I love Spark and Hadoop, Hadoop and Spark are great systems)
scala> lines.collect.foreach(println)
I love Spark
I love Hadoop
I love Spark and Hadoop
Hadoop and Spark are great systems
scala> lines.foreach(println)
I love Spark
I love Hadoop
I love Spark and Hadoop
Hadoop and Spark are great systems
scala> val words = lines.flatMap(x => x.split(" "))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[47] at flatMap at <console>:31
scala> val wordss = lines.flatMap(_.split(" ")) // short hand format
wordss: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[48] at flatMap at <console>:29
scala> words.foreach(println)
I
love
Spark
I
love
Hadoop
I
love
Spark
and
Hadoop
Hadoop
and
Spark
are
great
systems
scala> wordss.foreach(println)
I
love
Spark
I
love
Hadoop
I
love
Spark
and
Hadoop
Hadoop
and
Spark
are
great
systems
scala> words.collect
res44: Array[String] = Array(I, love, Spark, I, love, Hadoop, I, love, Spark, and, Hadoop, Hadoop, and, Spark, are, great, systems)
scala> val pair = words.map (x => (x,1))
pair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[49] at map at <console>:33
scala> pair.collect
res45: Array[(String, Int)] = Array((I,1), (love,1), (Spark,1), (I,1), (love,1), (Hadoop,1), (I,1), (love,1), (Spark,1), (and,1), (Hadoop,1), (Hadoop,1), (and,1), (Spark,1), (are,1), (great,1), (systems,1))
scala> pair.collect.foreach(println)
(I,1)
(love,1)
(Spark,1)
(I,1)
(love,1)
(Hadoop,1)
(I,1)
(love,1)
(Spark,1)
(and,1)
(Hadoop,1)
(Hadoop,1)
(and,1)
(Spark,1)
(are,1)
(great,1)
(systems,1)
scala> val wc = pair.reduceByKey((a,b) => a+b) // full format
scala> val wc = pair.reduceByKey (_+_)
wc: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[50] at reduceByKey at <console>:44
scala> wc.collect
res47: Array[(String, Int)] = Array((are,1), (Spark,3), (love,3), (I,3), (great,1), (and,2), (systems,1), (Hadoop,3))
scala> wc.collect.foreach(println)
(are,1)
(Spark,3)
(love,3)
(I,3)
(great,1)
(and,2)
(systems,1)
(Hadoop,3)
Complete shorthand version below:
scala> val words = lines.flatMap (_.split(" "))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[51] at flatMap at <console>:29
scala> val pair = words.map ( (_,1))
pair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[52] at map at <console>:31
scala> val res = pair.reduceByKey(_+_)
res: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[53] at reduceByKey at <console>:33
scala> res.collect.foreach(println)
(are,1)
(Spark,3)
(love,3)
(I,3)
(great,1)
(and,2)
(systems,1)
(Hadoop,3)
scala> res.count
res50: Long = 8
//single line implementation
scala> val wc = lines.flatMap(_.split(" ")).map ((_,1)).reduceByKey(_+_)
wc: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[56] at reduceByKey at <console>:29
scala> wc.collect.foreach(println)
(are,1)
(Spark,3)
(love,3)
(I,3)
(great,1)
(and,2)
(systems,1)
(Hadoop,3)
val wc = lines.map { x =>
val w = x.split(" ")
val p = w.map((_,1))
p
}.flatMap(x => x).reduceByKey(_+_)
scala> wc.collect.foreach(println)
(are,1)
(Spark,3)
(love,3)
(I,3)
(great,1)
(and,2)
(systems,1)
(Hadoop,3)
Unnecessary Space Example:
[cloudera@quickstart ~]$ cat >unnecessaryspace.txt
I loVE INdiA I loVE PaLlaThuR I LoVE BanGALorE
Hadoop VS SPArK faCEBooK SCAla ArunACHAlam VenkaTAChaLAm "
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal unnecessaryspace.txt Sparks
scala function to remove unnecessary space
scala> def removeSpace(line:String) = {
| // i LoVE spARk
| val w = line.trim().split(" ")
| val words = w.filter(x => x != "")
| words.mkString(" ")
| }
removeSpace: (line: String)String
execute the function:
scala> removeSpace("I LovE Spark ")
res56: String = I LovE Spark
scala> val data = sc.textFile("/user/cloudera/Sparks/unnecessaryspace.txt")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/unnecessaryspace.txt MapPartitionsRDD[62] at textFile at <console>:27
scala> data.count
res57: Long = 2
scala> data.collect.foreach(println)
I loVE INdiA I loVE PaLlaThuR I LoVE BanGALorE
Hadoop VS SPArK faCEBooK SCAla ArunACHAlam VenkaTAChaLAm "
scala> removeSpace("I LovE Spark ")
res59: String = I LovE Spark
scala> val words = data.flatMap { x =>
| val x2 = removeSpace(x).toLowerCase.split(" ")
| x2
| }
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[63] at flatMap at <console>:37
scala> words.collect
res60: Array[String] = Array(i, love, india, i, love, pallathur, i, love, bangalore, hadoop, vs, spark, facebook, scala, arunachalam, venkatachalam, ")
scala> words.collect.foreach(println)
i
love
india
i
love
pallathur
i
love
bangalore
hadoop
vs
spark
facebook
scala
arunachalam
venkatachalam
"
scala> val pair = words.map (x => (x,1))
pair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[64] at map at <console>:40
scala> val wc = pair.reduceByKey(_+_)
wc: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[65] at reduceByKey at <console>:42
scala> wc.collect.foreach(println)
(bangalore,1)
(india,1)
(",1)
(scala,1)
(spark,1)
(hadoop,1)
(love,3) // 3 times
(facebook,1)
(i,3) // 3 times
(venkatachalam,1)
(arunachalam,1)
(pallathur,1)
(vs,1)
cat > emp
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
104,dd,40000,m,12
105,cccc,70000,f,13
106,de,80000,f,13
107,io,90000,m,14
108,yu,100000,f,14
109,poi,30000,m,11
110,aaa,60000,f,14
123,djdj,900000,m,15
122,asasd,10000,m,15
hdfs dfs -copyFromLocal emp Sparks
[cloudera@quickstart ~]$ hdfs dfs -ls Sparks
Found 5 items
-rw-r--r-- 1 cloudera cloudera 86 2018-10-10 01:21 Sparks/comment
-rw-r--r-- 1 cloudera cloudera 71 2018-10-09 02:06 Sparks/dept
-rw-r--r-- 1 cloudera cloudera 232 2018-10-09 02:06 Sparks/emp -- this is the file
drwxr-xr-x - cloudera cloudera 0 2018-10-09 02:21 Sparks/res1
-rw-r--r-- 1 cloudera cloudera 137 2018-10-10 01:58 Sparks/unnecessaryspace.txt
[cloudera@quickstart ~]$
scala> val emp = sc.textFile("/user/cloudera/Sparks/emp")
emp: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/emp MapPartitionsRDD[67] at textFile at <console>:27
scala> val eArr = emp.map (x => x.split(","))
eArr: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[68] at map at <console>:31
scala> //sex based aggregations on sal
scala> val pairSexSal = eArr.map ( x => (x(3),x(2).toInt))
pairSexSal: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[71] at map at <console>:33
scala> val res1 = pairSexSal.reduceByKey(_+_)
res1: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[72] at reduceByKey at <console>:35
scala> //select sex,max(sal) from emp group by sex;
scala> val res2 = pairSexSal.reduceByKey(Math.max(_,_))
res2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[73] at reduceByKey at <console>:35
scala> //select sex,min(sal) from emp group by sex;
scala> val res3 = pairSexSal.reduceByKey(Math.min(_,_))
res3: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[74] at reduceByKey at <console>:35
scala> val data = sc.textFile("/user/cloudera/Sparks/emp")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/emp MapPartitionsRDD[76] at textFile at <console>:27
scala> data.collect.foreach(println)
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
104,dd,40000,m,12
105,cccc,70000,f,13
106,de,80000,f,13
107,io,90000,m,14
108,yu,100000,f,14
109,poi,30000,m,11
110,aaa,60000,f,14
123,djdj,900000,m,15
122,asasd,10000,m,15
// select dno,sex,sum(sal) from emp group by dno,sex;
scala> val pair = data.map { x =>
| val w = x.split(",")
| val dno = w(4)
| val sex = w(3)
| val sal = w(2).toInt
| val myKey = (dno,sex)
| (myKey,sal)
| }
pair: org.apache.spark.rdd.RDD[((String, String), Int)] = MapPartitionsRDD[77] at map at <console>:31
scala> pair.collect.foreach(println)
((12,m),70000)
((12,f),90000)
((11,m),10000)
((12,m),40000)
((13,f),70000)
((13,f),80000)
((14,m),90000)
((14,f),100000)
((11,m),30000)
((14,f),60000)
((15,m),900000)
((15,m),10000)
In the (key,value) pair, the key itself is a tuple, e.g. (15,m).
Whenever grouping by multiple columns is required, make the key a tuple.
scala> var res = pair.reduceByKey(_+_)
res: org.apache.spark.rdd.RDD[((String, String), Int)] = ShuffledRDD[78] at reduceByKey at <console>:33
scala> val res = pair.reduceByKey((x,y) => x+y)
res: org.apache.spark.rdd.RDD[((String, String), Int)] = ShuffledRDD[79] at reduceByKey at <console>:38
scala> res.collect
res68: Array[((String, String), Int)] = Array(((14,m),90000), ((12,f),90000), ((15,m),910000), ((14,f),160000), ((13,f),150000), ((12,m),110000), ((11,m),40000))
scala> res.collect.foreach(println)
((14,m),90000)
((12,f),90000)
((15,m),910000)
((14,f),160000)
((13,f),150000)
((12,m),110000)
((11,m),40000)
We can group by a single column or a set of columns, but the aggregation above is only one at a time.
What if we want multiple aggregations (sum, max, min, etc.)?
reduceByKey performs only one aggregation per pass.
scala> data.collect
res72: Array[String] = Array(101,aaaa,70000,m,12, 102,bbbbb,90000,f,12, 103,cc,10000,m,11, 104,dd,40000,m,12, 105,cccc,70000,f,13, 106,de,80000,f,13, 107,io,90000,m,14, 108,yu,100000,f,14, 109,poi,30000,m,11, 110,aaa,60000,f,14, 123,djdj,900000,m,15, 122,asasd,10000,m,15)
scala> data.collect.foreach(println)
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
104,dd,40000,m,12
105,cccc,70000,f,13
106,de,80000,f,13
107,io,90000,m,14
108,yu,100000,f,14
109,poi,30000,m,11
110,aaa,60000,f,14
123,djdj,900000,m,15
122,asasd,10000,m,15
scala> data.take(3).foreach(println)
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
data.skip(3) is not available
val data = sc.textFile("....")
val arr = data.map (x => x.split(","))
val pairSexSal = arr.map (x => ( x(3),x(2).toInt))
val res1 = pairSexSal.reduceByKey (_+_)
val res2 = pairSexSal.reduceByKey(Math.max(_,_))
val res3 = pairSexSal.reduceByKey(Math.min(_,_))
data
array
pair
(pair.persist / pair.cache)
res1,res2,res3
3 different flows :
data -> array -> pair -> res1
data -> array -> pair -> res2
data -> array -> pair -> res3
An RDD is not persisted at the moment it is declared.
It is persisted the first time it is loaded and computed (e.g. when res1.collect runs).
After res1.collect, pairSexSal is persisted,
so its result is available in RAM.
When res2.collect runs, the steps are not re-executed from the beginning;
the flow continues from pairSexSal onwards,
because the pairSexSal result is already persisted and there is no need to recompute it.
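A small hedged sketch of explicit persistence control, reusing the names from the flow above (the StorageLevel shown is just one illustrative choice):
import org.apache.spark.storage.StorageLevel
val pairSexSal = data.map(_.split(",")).map(x => (x(3), x(2).toInt))
pairSexSal.persist(StorageLevel.MEMORY_ONLY)       // equivalent to .cache()
val res1 = pairSexSal.reduceByKey(_ + _)           // sum
val res2 = pairSexSal.reduceByKey(Math.max(_, _))  // max
res1.collect()           // first action: pairSexSal is computed and cached here
res2.collect()           // reuses the cached pairSexSal partitions
pairSexSal.unpersist()   // release the cached partitions when no longer needed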
collect (on an RDD) is a Spark action, not a Scala collection operation.
take is available in Scala collections as well as in Spark.
.collect in this sense is exclusive to Spark RDDs.
scala> data.collect.foreach(println)
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
104,dd,40000,m,12
105,cccc,70000,f,13
106,de,80000,f,13
107,io,90000,m,14
108,yu,100000,f,14
109,poi,30000,m,11
110,aaa,60000,f,14
123,djdj,900000,m,15
122,asasd,10000,m,15
scala> val arr = data.map (x => x.split(","))
arr: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[80] at map at <console>:31
scala> val arr = data.map(x => x.split(","))
arr: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[84] at map at <console>:31
scala> val pair = arr.map (x => (x(3),x(2).toInt))
pair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[85] at map at <console>:33
scala> pair.collect.foreach(println)
(m,70000)
(f,90000)
(m,10000)
(m,40000)
(f,70000)
(f,80000)
(m,90000)
(f,100000)
(m,30000)
(f,60000)
(m,900000)
(m,10000)
scala> pair.persist
res87: pair.type = MapPartitionsRDD[85] at map at <console>:33
scala> val res1 = pair.reduceByKey(_+_)
res1: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[86] at reduceByKey at <console>:35
scala> val res2 = pair.reduceByKey(Math.max(_,_))
res2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[87] at reduceByKey at <console>:35
scala> val res3 = pair.reduceByKey(Math.min(_,_))
res3: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[88] at reduceByKey at <console>:35
faster execution:
min
scala> res3.collect
res88: Array[(String, Int)] = Array((f,60000), (m,10000))
max
scala> res2.collect
res89: Array[(String, Int)] = Array((f,100000), (m,900000))
sum
scala> res1.collect
res90: Array[(String, Int)] = Array((f,400000), (m,1150000))
When we come out of the session, the persistence is released.
Here we store the results into HDFS as files, but
each record is written as the tuple itself, brackets included.
scala> res1.saveAsTextFile("/user/cloudera/Sparks/RES1")
scala> res2.saveAsTextFile("/user/cloudera/Sparks/RES2")
scala> res3.saveAsTextFile("/user/cloudera/Sparks/RES3")
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/RES1/part-00000
(f,400000)
(m,1150000)
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/RES2/part-00000
(f,100000)
(m,900000)
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/RES3/part-00000
(f,60000)
(m,10000)
scala> res1.collect
res101: Array[(String, Int)] = Array((f,400000), (m,1150000))
// here we make just a string instead of tuple to write string output into a file
scala> val tempResult = res1.map { x =>
| val res = x._1 + "\t" + x._2 // make a tab-delimited string instead of a tuple with brackets
| res
| }
tempResult: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[94] at map at <console>:47
scala> tempResult
res102: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[94] at map at <console>:47
scala> tempResult.collect
res103: Array[String] = Array(f 400000, m 1150000)
// here we write string output into a file
scala> tempResult.saveAsTextFile("/user/cloudera/Sparks/R1")
check the file content (HDFS)
hdfs dfs -cat Sparks/R1/part-00000
f 400000
m 1150000
Note:
Don't write output as tuple in HDFS
Transform the RDD as string and then write into HDFS
scala> val pair2 = arr.map( x => ((x(4),x(3)),x(2).toInt))
pair2: org.apache.spark.rdd.RDD[((String, String), Int)] = MapPartitionsRDD[96] at map at <console>:33
scala> val res4 = pair2.reduceByKey(_+_)
res4: org.apache.spark.rdd.RDD[((String, String), Int)] = ShuffledRDD[97] at reduceByKey at <console>:35
scala> res4.collect.foreach(println)
((14,m),90000)
((12,f),90000)
((15,m),910000)
((14,f),160000)
((13,f),150000)
((12,m),110000)
((11,m),40000)
scala> res4.saveAsTextFile("/user/cloudera/Sparks/Re1")
[cloudera@quickstart ~]$ hdfs dfs -ls Sparks/Re1
Found 2 items
-rw-r--r-- 1 cloudera cloudera 0 2018-10-10 10:24 Sparks/Re1/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 109 2018-10-10 10:24 Sparks/Re1/part-00000
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/Re1/part-00000
((14,m),90000)
((12,f),90000)
((15,m),910000)
((14,f),160000)
((13,f),150000)
((12,m),110000)
((11,m),40000)
Here ((dno,sex),sal) ---> tuple inside a tuple as key and salary as value written into HDFS
But our required output is: 14 m 90000
We need to transform the RDD as follows to make a tab-delimited string.
//using multiline code
scala> val r4 = res4.map { x =>
| val dno = x._1._1
| val sex = x._1._2
| val sal = x._2
| dno + "\t" + sex + "\t" + sal
| }
r4: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[100] at map at <console>:37
scala> r4.collect.foreach(println)
14 m 90000
12 f 90000
15 m 910000
14 f 160000
13 f 150000
12 m 110000
11 m 40000
//using single line code
scala> val r4 = res4.map (x => x._1._1 + "\t" + x._1._2 + "\t" + x._2)
r4: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[101] at map at <console>:37
scala> r4.collect.foreach(println)
14 m 90000
12 f 90000
15 m 910000
14 f 160000
13 f 150000
12 m 110000
11 m 40000
scala> r4.saveAsTextFile("/user/cloudera/Sparks/Re4")
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/Re4/part-00000
14 m 90000
12 f 90000
15 m 910000
14 f 160000
13 f 150000
12 m 110000
11 m 40000
Make a Scala function for reuse: input x is a tuple, delim is the delimiter
scala> def pairToString(x:(String,Int),delim:String) = {
| val a = x._1
| val b = x._2
| a + delim + b
| }
pairToString: (x: (String, Int), delim: String)String
scala> res1.collect.foreach(println)
(f,400000)
(m,1150000)
// comma delimited string
scala> val Re5 = res1.map ( x => pairToString(x,","))
Re5: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[103] at map at <console>:48
scala> Re5.collect.foreach(println)
f,400000
m,1150000
//Tab Delimited string
scala> val Re6 = res1.map (x => pairToString(x,"\t"))
Re6: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[104] at map at <console>:48
scala> Re6.collect.foreach(println)
f 400000
m 1150000
saveAsTextFile --> how many files will be created in HDFS?
Number of files ==> number of partitions
scala> val myList = List(10,20,30,40,50,50,30,40,10,23)
myList: List[Int] = List(10, 20, 30, 40, 50, 50, 30, 40, 10, 23)
scala> myList.size
res119: Int = 10
scala> val rdd1 = sc.parallelize(myList)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[105] at parallelize at <console>:29
scala> rdd1.partitions.size
res121: Int = 1
scala> rdd1.saveAsTextFile("/user/cloudera/Sparks/rdd1Result")
scala> val rdd2 = sc.parallelize(myList,3)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[107] at parallelize at <console>:29
scala> rdd2.partitions.size
res123: Int = 3
scala> rdd2.saveAsTextFile("/user/cloudera/Sparks/rdd2Result")
hdfs dfs -ls Sparks
Found 14 items
drwxr-xr-x - cloudera cloudera 0 2018-10-10 10:48 Sparks/rdd1Result (single partition)
drwxr-xr-x - cloudera cloudera 0 2018-10-10 10:49 Sparks/rdd2Result (3 partitions)
rdd1 has single partition so : part-00000 (single file will be present in that folder)
[cloudera@quickstart ~]$ hdfs dfs -ls Sparks/rdd1Result
Found 2 items
-rw-r--r-- 1 cloudera cloudera 0 2018-10-10 10:48 Sparks/rdd1Result/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 30 2018-10-10 10:48 Sparks/rdd1Result/part-00000
rdd2 has 3 partitions, so part-00000, part-00001, part-00002: a total of 3 files are present there
[cloudera@quickstart ~]$ hdfs dfs -ls Sparks/rdd2Result
Found 4 items
-rw-r--r-- 1 cloudera cloudera 0 2018-10-10 10:49 Sparks/rdd2Result/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 9 2018-10-10 10:49 Sparks/rdd2Result/part-00000
-rw-r--r-- 1 cloudera cloudera 9 2018-10-10 10:49 Sparks/rdd2Result/part-00001
-rw-r--r-- 1 cloudera cloudera 12 2018-10-10 10:49 Sparks/rdd2Result/part-00002
How many files will be created when we use saveAsTextFile?
That depends on the number of partitions of the given RDD.
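A brief hedged sketch (reusing the wc word-count RDD from earlier; the output paths are just examples) showing how the file count can be controlled by changing the partition count before saving:
val wcOne = wc.coalesce(1)       // shrink to a single partition --> a single part-00000 file
wcOne.saveAsTextFile("/user/cloudera/Sparks/wcSingle")
val wcThree = wc.repartition(3)  // reshuffle into 3 partitions --> part-00000, part-00001, part-00002
wcThree.saveAsTextFile("/user/cloudera/Sparks/wcThree")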
For a single aggregation, use reduceByKey.
For multiple aggregations, use groupByKey;
it produces a CompactBuffer of the grouped values (comparable to an iterator in Java).
scala> val data = sc.textFile("/user/cloudera/Sparks/emp")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/emp MapPartitionsRDD[112] at textFile at <console>:27
scala> val arr = data.map (x => x.split(","))
arr: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[113] at map at <console>:31
scala> val pair1 = arr.map (x => (x(3),x(2).toInt))
pair1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[114] at map at <console>:33
scala> pair1.collect.foreach(println)
(m,70000)
(f,90000)
(m,10000)
(m,40000)
(f,70000)
(f,80000)
(m,90000)
(f,100000)
(m,30000)
(f,60000)
(m,900000)
(m,10000)
Going to apply multiple aggregations
scala> val grp = pair1.groupByKey() // always first one is key
grp: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[115] at groupByKey at <console>:35
// for male one tuple, female another tuple
scala> grp.collect.foreach(println)
(f,CompactBuffer(90000, 70000, 80000, 100000, 60000))
(m,CompactBuffer(70000, 10000, 40000, 90000, 30000, 900000, 10000))
// single grouping column but multiple aggregations
scala> val res = grp.map{ x =>
| val sex = x._1
| val cb = x._2
| val tot = cb.sum
| val cnt = cb.size
| val avg = tot / cnt
| val max = cb.max
| val min = cb.min
| val result = sex + "," + tot + "," + cnt + "," + avg + "," + max + "," + min
| result
| }
res: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[116] at map at <console>:37
scala> res
res127: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[116] at map at <console>:37
scala> res.collect.foreach(println)
f,400000,5,80000,100000,60000
m,1150000,7,164285,900000,10000
scala> res.saveAsTextFile("/user/cloudera/Sparks/res100")
[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/Sparks/res100/part-00000
f,400000,5,80000,100000,60000
m,1150000,7,164285,900000,10000
scala> data.collect
res130: Array[String] = Array(101,aaaa,70000,m,12, 102,bbbbb,90000,f,12, 103,cc,10000,m,11, 104,dd,40000,m,12, 105,cccc,70000,f,13, 106,de,80000,f,13, 107,io,90000,m,14, 108,yu,100000,f,14, 109,poi,30000,m,11, 110,aaa,60000,f,14, 123,djdj,900000,m,15, 122,asasd,10000,m,15)
scala> data.collect.foreach(println)
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
104,dd,40000,m,12
105,cccc,70000,f,13
106,de,80000,f,13
107,io,90000,m,14
108,yu,100000,f,14
109,poi,30000,m,11
110,aaa,60000,f,14
123,djdj,900000,m,15
122,asasd,10000,m,15
scala> arr.collect
res132: Array[Array[String]] = Array(Array(101, aaaa, 70000, m, 12), Array(102, bbbbb, 90000, f, 12), Array(103, cc, 10000, m, 11), Array(104, dd, 40000, m, 12), Array(105, cccc, 70000, f, 13), Array(106, de, 80000, f, 13), Array(107, io, 90000, m, 14), Array(108, yu, 100000, f, 14), Array(109, poi, 30000, m, 11), Array(110, aaa, 60000, f, 14), Array(123, djdj, 900000, m, 15), Array(122, asasd, 10000, m, 15))
scala> val pair2 = arr.map(x => ( (x(4),x(3)),x(2).toInt))
pair2: org.apache.spark.rdd.RDD[((String, String), Int)] = MapPartitionsRDD[119] at map at <console>:33
scala> pair2.collect.foreach(println)
((12,m),70000)
((12,f),90000)
((11,m),10000)
((12,m),40000)
((13,f),70000)
((13,f),80000)
((14,m),90000)
((14,f),100000)
((11,m),30000)
((14,f),60000)
((15,m),900000)
((15,m),10000)
scala> val grp2 = pair2.groupByKey()
grp2: org.apache.spark.rdd.RDD[((String, String), Iterable[Int])] = ShuffledRDD[120] at groupByKey at <console>:35
// multiple compact buffers for each key grouped together
scala> grp2.collect.foreach(println)
((14,m),CompactBuffer(90000))
((12,f),CompactBuffer(90000))
((15,m),CompactBuffer(900000, 10000))
((14,f),CompactBuffer(100000, 60000))
((13,f),CompactBuffer(70000, 80000))
((12,m),CompactBuffer(70000, 40000))
((11,m),CompactBuffer(10000, 30000))
val res2 = grp2.map { x =>
val k = x._1
val dno = k._1
val sex = k._2
val cb = x._2
val tot = cb.sum
val cnt = cb.size
val avg = tot / cnt
val max = cb.max
val min = cb.min
(dno,sex,tot,cnt,avg,max,min)
}
scala> res2.collect.foreach(println)
(14,m,90000,1,90000,90000,90000)
(12,f,90000,1,90000,90000,90000)
(15,m,910000,2,455000,900000,10000)
(14,f,160000,2,80000,100000,60000)
(13,f,150000,2,75000,80000,70000)
(12,m,110000,2,55000,70000,40000)
(11,m,40000,2,20000,30000,10000)
reduceByKey provides better performance than groupByKey
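A hedged alternative sketch (not from the original notes): the same multi-aggregation can also be done with reduceByKey by carrying (sum, count, max, min) as the value, which avoids building a CompactBuffer per key. pair1 is the (sex, sal) RDD built above.
val seeded = pair1.map { case (sex, sal) => (sex, (sal, 1, sal, sal)) }   // (sum, count, max, min) seeds
val stats = seeded.reduceByKey { (a, b) =>
  (a._1 + b._1, a._2 + b._2, Math.max(a._3, b._3), Math.min(a._4, b._4))
}
val statLines = stats.map { case (sex, (tot, cnt, mx, mn)) =>
  sex + "," + tot + "," + cnt + "," + (tot / cnt) + "," + mx + "," + mn
}
statLines.collect.foreach(println)   // same fields as the groupByKey result above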
// select sum(sal) from emp
// select sum(sal),avg(sal),count(*),max(sal),min(sal) from emp;
// we need the aggregations without grouping for all employees
scala> data.collect
res137: Array[String] = Array(101,aaaa,70000,m,12, 102,bbbbb,90000,f,12, 103,cc,10000,m,11, 104,dd,40000,m,12, 105,cccc,70000,f,13, 106,de,80000,f,13, 107,io,90000,m,14, 108,yu,100000,f,14, 109,poi,30000,m,11, 110,aaa,60000,f,14, 123,djdj,900000,m,15, 122,asasd,10000,m,15)
scala> arr.collect
res138: Array[Array[String]] = Array(Array(101, aaaa, 70000, m, 12), Array(102, bbbbb, 90000, f, 12), Array(103, cc, 10000, m, 11), Array(104, dd, 40000, m, 12), Array(105, cccc, 70000, f, 13), Array(106, de, 80000, f, 13), Array(107, io, 90000, m, 14), Array(108, yu, 100000, f, 14), Array(109, poi, 30000, m, 11), Array(110, aaa, 60000, f, 14), Array(123, djdj, 900000, m, 15), Array(122, asasd, 10000, m, 15))
scala>
scala> val sals = arr.map (x => x(2).toInt)
sals: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[122] at map at <console>:33
scala> sals.collect.foreach(println)
70000
90000
10000
40000
70000
80000
90000
100000
30000
60000
900000
10000
scala> val tot = sals.sum
tot: Double = 1550000.0
scala> val tot = sals.reduce(_+_)
tot: Int = 1550000
scala> val cnt = sals.count
cnt: Long = 12
scala> val avg = tot / cnt
avg: Long = 129166
scala> val max = sals.max
max: Int = 900000
scala> val min = sals.min
min: Int = 10000
scala> val tot = sals.reduce(_+_)
tot: Int = 1550000
scala> val cnt = sals.count
cnt: Long = 12
scala> val avg = tot / cnt
avg: Long = 129166
scala> val max = sals.reduce(Math.max)
max: Int = 900000
scala> val max = sals.reduce(Math.max(_,_))
max: Int = 900000
scala> val min = sals.reduce(Math.min)
min: Int = 10000
scala> val min = sals.reduce(Math.min(_,_))
min: Int = 10000
reduce works on each partition, in the RAM of every machine of the cluster where the data is already parallelized
-- better performance
-- parallelism achieved
sum works in a non-parallel manner: it collects the data from the RDD onto a single machine, which creates a burden
-- worse performance
-- no parallelism
val lst = sc.parallelize(List(10,20,30,40,50,60,70,80,90,100),2)
rdd --> lst
has 2 partitions
partition 1 -> List(10,20,30,40,50)
partition 2 -> List(60,70,80,90,100)
lst.sum --> all data from all partitions is collected locally, then the sum is computed locally (non-parallel)
lst.reduce(_+_)
the operation executes on the cluster, in every partition, wherever the data resides
partition 1 result = 150
partition 2 result = 400
finally the independent results of the partitions are combined,
List(150,400) ==> 550
and this final result 550 is sent back to the client machine that asked for it
grouping aggregation --> reduceByKey
entire collection's sum or aggregation --> reduce
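A small illustrative sketch of the partition layout described above (glom() just gathers each partition into an array so we can look at it):
val lst = sc.parallelize(List(10,20,30,40,50,60,70,80,90,100), 2)
lst.glom().collect()   // e.g. Array(Array(10,20,30,40,50), Array(60,70,80,90,100))
lst.reduce(_ + _)      // per-partition results 150 and 400 are combined into 550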
scala> val res = (tot,cnt,avg,max,min)
res: (Int, Long, Long, Int, Int) = (1550000,12,129166,900000,10000)
scala> val r = List(tot,cnt,avg,max,min).mkString("\t")
r: String = 1550000 12 129166 900000 10000
scala> r
res142: String = 1550000 12 129166 900000 10000
scala> data.collect
res143: Array[String] = Array(101,aaaa,70000,m,12, 102,bbbbb,90000,f,12, 103,cc,10000,m,11, 104,dd,40000,m,12, 105,cccc,70000,f,13, 106,de,80000,f,13, 107,io,90000,m,14, 108,yu,100000,f,14, 109,poi,30000,m,11, 110,aaa,60000,f,14, 123,djdj,900000,m,15, 122,asasd,10000,m,15)
scala> data.take(3).foreach(println)
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
scala> val res = data.map { x =>
val w = x.trim().split(",")
val id = w(0)
val name = w(1).toLowerCase
val fc = name.slice(0,1).toUpperCase
val rc = name.slice(1,name.size).toLowerCase
val sal = w(2).toInt
val grade = if (sal >= 70000) "A" else
if (sal >= 50000) "B" else
if (sal >= 30000) "C" else "D"
val dno = w(4).toInt
val dname = dno match {
case 11 => "Marketing"
case 12 => "HR"
case 13 => "Finance"
case others => "Others"
}
var sex = w(3).toLowerCase
sex = if (sex =="f") "Female" else "Male"
val Name = fc + rc
List(id,Name,w(2),grade,sex,dname).mkString("\t")
}
res: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at map at <console>:29
scala> res.collect
res2: Array[String] = Array(101 Aaaa 70000 A Male HR, 102 Bbbbb 90000 A Female HR, 103 Cc 10000 D Male Marketing, 104 Dd 40000 C Male HR, 105 Cccc 70000 A Female Finance, 106 De 80000 A Female Finance, 107 Io 90000 A Male Others, 108 Yu 100000 A Female Others, 109 Poi 30000 C Male Marketing, 110 Aaa 60000 B Female Others, 123 Djdj 900000 A Male Others, 122 Asasd 10000 D Male Others)
scala> res.collect.foreach(println)
101 Aaaa 70000 A Male HR
102 Bbbbb 90000 A Female HR
103 Cc 10000 D Male Marketing
104 Dd 40000 C Male HR
105 Cccc 70000 A Female Finance
106 De 80000 A Female Finance
107 Io 90000 A Male Others
108 Yu 100000 A Female Others
109 Poi 30000 C Male Marketing
110 Aaa 60000 B Female Others
123 Djdj 900000 A Male Others
122 Asasd 10000 D Male Others
// select sum(sal) from emp where sex ="m"
// writing a function to find gender
scala> def isMale(x:String) = {
| val w = x.split(",")
| val sex = w(3).toLowerCase
| sex =="m"
| }
isMale: (x: String)Boolean
scala> var males = data.filter(x => isMale(x))
males: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at filter at <console>:31
scala> males.collect.foreach(println)
101,aaaa,70000,m,12
103,cc,10000,m,11
104,dd,40000,m,12
107,io,90000,m,14
109,poi,30000,m,11
123,djdj,900000,m,15
122,asasd,10000,m,15
filtered data: it contains just the male employees' records
scala> val sals = males.map ( x => x.split(",")(2).toInt)
sals: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at map at <console>:33
scala> sals.collect
res7: Array[Int] = Array(70000, 10000, 40000, 90000, 30000, 900000, 10000)
scala> sals.reduce(_+_)
res8: Int = 1150000
// Maximum salary of female employee collection
scala> val fems = data.filter(x => !isMale(x))
fems: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at filter at <console>:31
scala> fems.collect.foreach(println)
102,bbbbb,90000,f,12
105,cccc,70000,f,13
106,de,80000,f,13
108,yu,100000,f,14
110,aaa,60000,f,14
scala> val maxOfFemale = fems.map (x => x.split(",")(2).toInt).reduce(Math.max(_,_))
maxOfFemale: Int = 100000
Merging RDDs (Union)
-------------------
scala> val l1 = List(10,20,30,50,80)
l1: List[Int] = List(10, 20, 30, 50, 80)
scala> val l2 = List(20,30,10,90,200)
l2: List[Int] = List(20, 30, 10, 90, 200)
scala> val r1 = sc.parallelize(l1)
r1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:29
scala> val r2 = sc.parallelize(l2)
r2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:29
scala> r1.collect
res11: Array[Int] = Array(10, 20, 30, 50, 80)
scala> r2.collect
res12: Array[Int] = Array(20, 30, 10, 90, 200)
scala> val r = r1.union(r2)
r: org.apache.spark.rdd.RDD[Int] = UnionRDD[11] at union at <console>:35
scala> r.count
res13: Long = 10
scala> r.collect // which merges 2 RDDs with duplicate values (UNION ALL)
res14: Array[Int] = Array(10, 20, 30, 50, 80, 20, 30, 10, 90, 200)
Merge 2 RDDs with duplicates then later eliminate duplicates
scala> val r3 = sc.parallelize(List(1,2,3,4,5,10,80,20))
r3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:27
//Merging more than 2 RDDs
scala> val result = r1.union(r2).union(r3)
result: org.apache.spark.rdd.RDD[Int] = UnionRDD[17] at union at <console>:37
scala> result.collect
res16: Array[Int] = Array(10, 20, 30, 50, 80, 20, 30, 10, 90, 200, 1, 2, 3, 4, 5, 10, 80, 20)
scala> result.count
res17: Long = 18
scala> val re1 = r1 ++ r2 // Merging, similar to r1.union(r2)
re1: org.apache.spark.rdd.RDD[Int] = UnionRDD[18] at $plus$plus at <console>:35
scala> re1.collect
res18: Array[Int] = Array(10, 20, 30, 50, 80, 20, 30, 10, 90, 200)
scala> val re2 = r1 ++ r2 ++ r3 // similar to : r1.union(r2).union(r3)
re2: org.apache.spark.rdd.RDD[Int] = UnionRDD[20] at $plus$plus at <console>:37
scala> re2.collect
res19: Array[Int] = Array(10, 20, 30, 50, 80, 20, 30, 10, 90, 200, 1, 2, 3, 4, 5, 10, 80, 20)
scala> val data = sc.parallelize(List(10,20,10,20,30,20,10,10))
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[21] at parallelize at <console>:27
scala> data.collect
res20: Array[Int] = Array(10, 20, 10, 20, 30, 20, 10, 10)
scala> val data2 = data.distinct // avoid or eliminate duplicate
data2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[24] at distinct at <console>:29
scala> data2.collect
res21: Array[Int] = Array(30, 20, 10)
//Duplicates eliminated
scala> re1.distinct.collect
res26: Array[Int] = Array(200, 80, 30, 50, 90, 20, 10)
//Duplicates eliminated
scala> re2.distinct.collect
res27: Array[Int] = Array(30, 90, 3, 4, 1, 10, 200, 80, 50, 20, 5, 2)
//Duplicates included
scala> re1.collect
res28: Array[Int] = Array(10, 20, 30, 50, 80, 20, 30, 10, 90, 200)
//Duplicates included
scala> re2.collect
res29: Array[Int] = Array(10, 20, 30, 50, 80, 20, 30, 10, 90, 200, 1, 2, 3, 4, 5, 10, 80, 20)
scala> val x = sc.parallelize(List("A","B","c","D"))
x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[31] at parallelize at <console>:27
scala> val y = sc.parallelize(List("A","c","M","N"))
y: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[32] at parallelize at <console>:27
scala> val z = x ++ y
z: org.apache.spark.rdd.RDD[String] = UnionRDD[33] at $plus$plus at <console>:31
//with duplicates
scala> z.collect
res30: Array[String] = Array(A, B, c, D, A, c, M, N)
//without duplicates
scala> z.distinct.collect
res31: Array[String] = Array(B, N, D, M, A, c)
Cross Join - Cartesian Join
Each element of the left-side RDD will join with each element of the right-side RDD
key,value pair
scala> val pair1 = sc.parallelize(Array(("p1",10000),("p2",1000),("p2",20000),("p2",50000),("p3",60000)))
pair1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[39] at parallelize at <console>:27
scala> val pair2 = sc.parallelize(Array(("p1",20000),("p2",50000),("p1",10000)))
pair2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[40] at parallelize at <console>:27
//cross join goes here
scala> val cr = pair1.cartesian(pair2)
cr: org.apache.spark.rdd.RDD[((String, Int), (String, Int))] = CartesianRDD[41] at cartesian at <console>:31
pair1 count: 5, pair2 count: 3
the cross join results in 5 x 3 = 15 elements
scala> cr.collect.foreach(println)
((p1,10000),(p1,20000))
((p1,10000),(p2,50000))
((p1,10000),(p1,10000))
((p2,1000),(p1,20000))
((p2,1000),(p2,50000))
((p2,1000),(p1,10000))
((p2,20000),(p1,20000))
((p2,20000),(p2,50000))
((p2,20000),(p1,10000))
((p2,50000),(p1,20000))
((p2,50000),(p2,50000))
((p2,50000),(p1,10000))
((p3,60000),(p1,20000))
((p3,60000),(p2,50000))
((p3,60000),(p1,10000))
cartesian against 2 lists
//count is : 4
scala> val rdd1 = sc.parallelize(List(10,20,30,40))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[42] at parallelize at <console>:27
//count is : 2
scala> val rdd2 = sc.parallelize(List(10,200))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[43] at parallelize at <console>:27
scala> val result = rdd1.cartesian(rdd2)
result: org.apache.spark.rdd.RDD[(Int, Int)] = CartesianRDD[44] at cartesian at <console>:31
//count : 4 x 2 = 8
scala> result.collect.foreach(println)
(10,10)
(10,200)
(20,10)
(20,200)
(30,10)
(30,200)
(40,10)
(40,200)
// Taking data from emp in hdfs
scala> val data = sc.textFile("/user/cloudera/Sparks/emp")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/emp MapPartitionsRDD[46] at textFile at <console>:27
scala> val dpair = data.map { x =>
| val w = x.split(",")
| val dno = w(4)
| val sal = w(2).toInt
| (dno,sal)
| }
dpair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[47] at map at <console>:31
scala> dpair.collect.foreach(println)
(12,70000)
(12,90000)
(11,10000)
(12,40000)
(13,70000)
(13,80000)
(14,90000)
(14,100000)
(11,30000)
(14,60000)
(15,900000)
(15,10000)
scala> val dres = dpair.reduceByKey(_+_)
dres: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[48] at reduceByKey at <console>:33
// grouped aggregations
scala> dres.collect.foreach(println)
(14,250000)
(15,910000)
(12,200000)
(13,150000)
(11,40000)
// another reference to dres (so we can cross-join it with itself)
scala> val dres2 = dres
dres2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[48] at reduceByKey at <console>:33
// performing cartesian join here
scala> val cr = dres.cartesian(dres2)
cr: org.apache.spark.rdd.RDD[((String, Int), (String, Int))] = CartesianRDD[49] at cartesian at <console>:37
//cross join
scala> val cr = dres.cartesian(dres2)
cr: org.apache.spark.rdd.RDD[((String, Int), (String, Int))] = CartesianRDD[49] at cartesian at <console>:37
scala>
scala> cr.collect.foreach(println)
((14,250000),(14,250000))
((14,250000),(15,910000))
((14,250000),(12,200000))
((14,250000),(13,150000))
((14,250000),(11,40000))
((15,910000),(14,250000))
((15,910000),(15,910000))
((15,910000),(12,200000))
((15,910000),(13,150000))
((15,910000),(11,40000))
((12,200000),(14,250000))
((12,200000),(15,910000))
((12,200000),(12,200000))
((12,200000),(13,150000))
((12,200000),(11,40000))
((13,150000),(14,250000))
((13,150000),(15,910000))
((13,150000),(12,200000))
((13,150000),(13,150000))
((13,150000),(11,40000))
((11,40000),(14,250000))
((11,40000),(15,910000))
((11,40000),(12,200000))
((11,40000),(13,150000))
((11,40000),(11,40000))
val cr2 = cr.map { x =>
val t1 = x._1
val t2 = x._2
val dno1 = t1._1
val tot1 = t1._2
val dno2 = t2._1
val tot2 = t2._2
(dno1,dno2,tot1,tot2)
}
scala> cr2.collect.foreach(println)
(14,14,250000,250000) // reject this
(14,15,250000,910000)
(14,12,250000,200000)
(14,13,250000,150000)
(14,11,250000,40000)
(15,14,910000,250000)
(15,15,910000,910000) // reject this
(15,12,910000,200000)
(15,13,910000,150000)
(15,11,910000,40000)
(12,14,200000,250000)
(12,15,200000,910000)
(12,12,200000,200000) // reject this
(12,13,200000,150000)
(12,11,200000,40000)
(13,14,150000,250000)
(13,15,150000,910000)
(13,12,150000,200000)
(13,13,150000,150000) // reject this
(13,11,150000,40000)
(11,14,40000,250000)
(11,15,40000,910000)
(11,12,40000,200000)
(11,13,40000,150000)
(11,11,40000,40000) // reject this
if dno1 == dno2 then reject that record;
we want to eliminate comparisons of a dept with itself
// keep the record only if dno1 != dno2
scala> val cr3 = cr2.filter( x => x._1 != x._2)
cr3: org.apache.spark.rdd.RDD[(String, String, Int, Int)] = MapPartitionsRDD[51] at filter at <console>:41
// which dept's salary is greater than to which other dept's salary
scala> cr3.collect.foreach(println)
(14,15,250000,910000)
(14,12,250000,200000)
(14,13,250000,150000)
(14,11,250000,40000)
(15,14,910000,250000)
(15,12,910000,200000)
(15,13,910000,150000)
(15,11,910000,40000)
(12,14,200000,250000)
(12,15,200000,910000)
(12,13,200000,150000)
(12,11,200000,40000)
(13,14,150000,250000)
(13,15,150000,910000)
(13,12,150000,200000)
(13,11,150000,40000)
(11,14,40000,250000)
(11,15,40000,910000)
(11,12,40000,200000)
(11,13,40000,150000)
// we want only the rows where tot1 >= tot2
scala> val cr4 = cr3.filter(x => x._3 >= x._4)
cr4: org.apache.spark.rdd.RDD[(String, String, Int, Int)] = MapPartitionsRDD[52] at filter at <console>:43
scala> cr4.collect.foreach(println)
(14,12,250000,200000)
(14,13,250000,150000)
(14,11,250000,40000)
(15,14,910000,250000)
(15,12,910000,200000)
(15,13,910000,150000)
(15,11,910000,40000)
(12,13,200000,150000)
(12,11,200000,40000)
(13,11,150000,40000)
scala> val cr5 = cr4.map (x => (x._1,1))
cr5: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[53] at map at <console>:45
scala> cr5.collect.foreach(println)
(14,1)
(14,1)
(14,1)
(15,1)
(15,1)
(15,1)
(15,1)
(12,1)
(12,1)
(13,1)
scala> val finalres = cr5.reduceByKey(_+_)
finalres: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[54] at reduceByKey at <console>:47
scala> finalres.collect.foreach(println)
(14,3)
(15,4)
(12,2)
(13,1)
scala> data.collect.foreach(println)
101,aaaa,70000,m,12
102,bbbbb,90000,f,12
103,cc,10000,m,11
104,dd,40000,m,12
105,cccc,70000,f,13
106,de,80000,f,13
107,io,90000,m,14
108,yu,100000,f,14
109,poi,30000,m,11
110,aaa,60000,f,14
123,djdj,900000,m,15
122,asasd,10000,m,15
We are able to compare one key with all the other keys;
rows with the same key on both sides are filtered out.
Question: each dept's total salary is greater than how many other depts' total salaries?
create a sales data file in local
[cloudera@quickstart ~]$ gedit sales
[cloudera@quickstart ~]$ cat sales
01/01/2016,30000
01/05/2016,80000
01/30/2016,90000
02/01/2016,20000
02/25/2016,48000
03/01/2016,22000
03/05/2016,89000
03/30/2016,91000
04/01/2016,100000
04/25/2016,71000
05/01/2016,31500
06/05/2016,86600
07/30/2016,92000
08/01/2016,32000
09/25/2016,43000
09/01/2016,32300
10/05/2016,85000
10/30/2016,80000
11/01/2016,70300
11/25/2016,50000
12/01/2016,30000
12/05/2016,20200
//copy the file into hdfs
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal sales Sparks
check the data
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/sales
01/01/2016,30000
01/05/2016,80000
01/30/2016,90000
02/01/2016,20000
02/25/2016,48000
03/01/2016,22000
03/05/2016,89000
03/30/2016,91000
04/01/2016,100000
04/25/2016,71000
05/01/2016,31500
06/05/2016,86600
07/30/2016,92000
08/01/2016,32000
09/25/2016,43000
09/01/2016,32300
10/05/2016,85000
10/30/2016,80000
11/01/2016,70300
11/25/2016,50000
12/01/2016,30000
12/05/2016,20200
//make RDD - read file and put the data in RDD
scala> val sales = sc.textFile("/user/cloudera/Sparks/sales")
sales: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/sales MapPartitionsRDD[56] at textFile at <console>:27
scala> sales.collect.foreach(println)
01/01/2016,30000
01/05/2016,80000
01/30/2016,90000
02/01/2016,20000
02/25/2016,48000
03/01/2016,22000
03/05/2016,89000
03/30/2016,91000
04/01/2016,100000
04/25/2016,71000
05/01/2016,31500
06/05/2016,86600
07/30/2016,92000
08/01/2016,32000
09/25/2016,43000
09/01/2016,32300
10/05/2016,85000
10/30/2016,80000
11/01/2016,70300
11/25/2016,50000
12/01/2016,30000
12/05/2016,20200
val pair = sales.map { x =>
val w = x.split(",")
val dt = w(0)
val pr = w(1).toInt
val m = dt.slice(0,2).toInt
(m,pr)
}
scala> pair.collect
res6: Array[(Int, Int)] = Array((1,30000), (1,80000), (1,90000), (2,20000), (2,48000), (3,22000), (3,89000), (3,91000), (4,10000), (5,31500), (6,86600), (7,92000), (8,32000), (9,43000), (9,32300), (10,85000), (10,80000), (11,70300), (11,50000), (12,30000), (12,20200))
scala> pair.collect.foreach(println)
(1,30000)
(1,80000)
(1,90000)
(2,20000)
(2,48000)
(3,22000)
(3,89000)
(3,91000)
(4,10000)
(5,31500)
(6,86600)
(7,92000)
(8,32000)
(9,43000)
(9,32300)
(10,85000)
(10,80000)
(11,70300)
(11,50000)
(12,30000)
(12,20200)
scala> val rep = pair.reduceByKey(_+_)
rep: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[12] at reduceByKey at <console>:31
scala> rep.collect.foreach(println)
(4,10000)
(11,120300)
(1,200000)
(6,86600)
(3,202000)
(7,92000)
(9,75300)
(8,32000)
(12,50200)
(10,165000)
(5,31500)
(2,68000)
// make ascending order
scala> val res = rep.sortByKey()
res: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[14] at sortByKey at <console>:33
scala> res.collect.foreach(println)
(1,200000)
(2,68000)
(3,202000)
(4,10000)
(5,31500)
(6,86600)
(7,92000)
(8,32000)
(9,75300)
(10,165000)
(11,120300)
(12,50200)
Compare each month's sales with the previous month's sales:
did sales increase or decrease, and by what percentage?
Every month's sales has to be compared with its previous month's sales,
so a cartesian join is needed.
scala> val res2 = res
res2: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[14] at sortByKey at <console>:33
scala> val cr = res.cartesian(res2)
cr: org.apache.spark.rdd.RDD[((Int, Int), (Int, Int))] = CartesianRDD[15] at cartesian at <console>:37
scala> cr.take(5).foreach(println)
((1,200000),(1,200000))
((1,200000),(2,68000))
((1,200000),(3,202000))
((1,200000),(4,10000))
((1,200000),(5,31500))
scala> val cr2 = cr.map{ x =>
| val t1 = x._1
| val t2 = x._2
| val m1 = t1._1
| val tot1 = t1._2
| val m2 = t2._1
| val tot2 = t2._2
| (m1,m2,tot1,tot2)
| }
cr2: org.apache.spark.rdd.RDD[(Int, Int, Int, Int)] = MapPartitionsRDD[16] at map at <console>:39
scala> cr2.collect.foreach(println)
(1,1,200000,200000)
(1,2,200000,68000)
(1,3,200000,202000)
(1,4,200000,10000)
(1,5,200000,31500)
(1,6,200000,86600)
(1,7,200000,92000)
(1,8,200000,32000)
(1,9,200000,75300)
(1,10,200000,165000)
(1,11,200000,120300)
(1,12,200000,50200)
(2,1,68000,200000)
(2,2,68000,68000)
(2,3,68000,202000)
(2,4,68000,10000)
(2,5,68000,31500)
(2,6,68000,86600)
(2,7,68000,92000)
(2,8,68000,32000)
(2,9,68000,75300)
(2,10,68000,165000)
(2,11,68000,120300)
(2,12,68000,50200)
(3,1,202000,200000)
(3,2,202000,68000)
(3,3,202000,202000)
(3,4,202000,10000)
(3,5,202000,31500)
(3,6,202000,86600)
(3,7,202000,92000)
(3,8,202000,32000)
(3,9,202000,75300)
(3,10,202000,165000)
(3,11,202000,120300)
(3,12,202000,50200)
(4,1,10000,200000)
(4,2,10000,68000)
(4,3,10000,202000)
(4,4,10000,10000)
(4,5,10000,31500)
(4,6,10000,86600)
(4,7,10000,92000)
(4,8,10000,32000)
(4,9,10000,75300)
(4,10,10000,165000)
(4,11,10000,120300)
(4,12,10000,50200)
(5,1,31500,200000)
(5,2,31500,68000)
(5,3,31500,202000)
(5,4,31500,10000)
(5,5,31500,31500)
(5,6,31500,86600)
(5,7,31500,92000)
(5,8,31500,32000)
(5,9,31500,75300)
(5,10,31500,165000)
(5,11,31500,120300)
(5,12,31500,50200)
(6,1,86600,200000)
(6,2,86600,68000)
(6,3,86600,202000)
(6,4,86600,10000)
(6,5,86600,31500)
(6,6,86600,86600)
(6,7,86600,92000)
(6,8,86600,32000)
(6,9,86600,75300)
(6,10,86600,165000)
(6,11,86600,120300)
(6,12,86600,50200)
(7,1,92000,200000)
(7,2,92000,68000)
(7,3,92000,202000)
(7,4,92000,10000)
(7,5,92000,31500)
(7,6,92000,86600)
(7,7,92000,92000)
(7,8,92000,32000)
(7,9,92000,75300)
(7,10,92000,165000)
(7,11,92000,120300)
(7,12,92000,50200)
(8,1,32000,200000)
(8,2,32000,68000)
(8,3,32000,202000)
(8,4,32000,10000)
(8,5,32000,31500)
(8,6,32000,86600)
(8,7,32000,92000)
(8,8,32000,32000)
(8,9,32000,75300)
(8,10,32000,165000)
(8,11,32000,120300)
(8,12,32000,50200)
(9,1,75300,200000)
(9,2,75300,68000)
(9,3,75300,202000)
(9,4,75300,10000)
(9,5,75300,31500)
(9,6,75300,86600)
(9,7,75300,92000)
(9,8,75300,32000)
(9,9,75300,75300)
(9,10,75300,165000)
(9,11,75300,120300)
(9,12,75300,50200)
(10,1,165000,200000)
(10,2,165000,68000)
(10,3,165000,202000)
(10,4,165000,10000)
(10,5,165000,31500)
(10,6,165000,86600)
(10,7,165000,92000)
(10,8,165000,32000)
(10,9,165000,75300)
(10,10,165000,165000)
(10,11,165000,120300)
(10,12,165000,50200)
(11,1,120300,200000)
(11,2,120300,68000)
(11,3,120300,202000)
(11,4,120300,10000)
(11,5,120300,31500)
(11,6,120300,86600)
(11,7,120300,92000)
(11,8,120300,32000)
(11,9,120300,75300)
(11,10,120300,165000)
(11,11,120300,120300)
(11,12,120300,50200)
(12,1,50200,200000)
(12,2,50200,68000)
(12,3,50200,202000)
(12,4,50200,10000)
(12,5,50200,31500)
(12,6,50200,86600)
(12,7,50200,92000)
(12,8,50200,32000)
(12,9,50200,75300)
(12,10,50200,165000)
(12,11,50200,120300)
(12,12,50200,50200)
Here the cartesian join pairs Jan with all other 11 months,
Feb with all other 11 months,
... and Dec with all other 11 months,
but our goal is to compare each month only with its previous month.
We need to compare the current month (e.g. Oct) with its previous month (Sep) only,
so the filter condition should be
currentMonth - previousMonth == 1
which filters out everything except pairs like (Feb,Jan) ... (Dec,Nov).
scala> val cr3 = cr2.filter (x => x._1 - x._2 == 1)
cr3: org.apache.spark.rdd.RDD[(Int, Int, Int, Int)] = MapPartitionsRDD[18] at filter at <console>:41
scala> cr3.count
res16: Long = 11
scala> cr3.collect.foreach(println)
(2,1,68000,200000)
(3,2,202000,68000)
(4,3,10000,202000)
(5,4,31500,10000)
(6,5,86600,31500)
(7,6,92000,86600)
(8,7,32000,92000)
(9,8,75300,32000)
(10,9,165000,75300)
(11,10,120300,165000)
(12,11,50200,120300)
How much percentage growth for each month?
scala> val finalres = cr3.map { x =>
| val m1 = x._1
| val m2 = x._2
| val tot1 = x._3
| val tot2 = x._4
| val pgrowth = ( (tot1 - tot2) * 100) / tot2
| (m1,m2,tot1,tot2,pgrowth)
| }
finalres: org.apache.spark.rdd.RDD[(Int, Int, Int, Int, Int)] = MapPartitionsRDD[19] at map at <console>:43
// percentage by which sales increased or decreased compared with the previous month's sales
scala> finalres.collect.foreach(println)
(2,1,68000,200000,-66)
(3,2,202000,68000,197)
(4,3,10000,202000,-95)
(5,4,31500,10000,215)
(6,5,86600,31500,174)
(7,6,92000,86600,6)
(8,7,32000,92000,-65)
(9,8,75300,32000,135)
(10,9,165000,75300,119)
(11,10,120300,165000,-27)
(12,11,50200,120300,-58)
Dec,Nov,DecSales,NovSales,SalesGrowth (+/-)
Exercise: quarterly sales report comparison instead of monthly
Q1,Q2,Q3,Q4 (a sketch follows)
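A hedged sketch for this exercise, reusing the (month, amount) pair RDD built from the sales file above; the quarter formula (month-1)/3 + 1 is the usual mapping.
val qpair = pair.map { case (m, amt) => (((m - 1) / 3) + 1, amt) }   // month 1..12 --> quarter 1..4
val qtot  = qpair.reduceByKey(_ + _).sortByKey()
qtot.collect.foreach(println)   // (1,Q1 total) (2,Q2 total) (3,Q3 total) (4,Q4 total)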
Spark SQL:
It is a library for processing Spark data objects using SQL statements (similar to a MySQL-style select).
Spark SQL follows a MySQL-like SQL syntax.
Spark SQL provides
SQLContext, HiveContext (data warehouse)
DataFrame, Dataset, temp tables
If the data is in a Hive table, we have to use HiveContext.
SparkContext (sc)
SparkStreamingContext
SQLContext
HiveContext
import org.apache.spark.sql.SQLContext
val sqlCon = new SQLContext(sc)
A SQLContext instance (sqlContext) is available by default in the Spark shell;
when writing Spark programs in an IDE we need to create the SQLContext instance ourselves.
Using SQLContext,
we can process Spark objects using select statements
Using HiveContext,
we can integrate Hive with Spark.
Hive is a datawarehouse environment in hadoop framework
Data is stored and managed at Hive but processed in Spark
All valid Hive Queries are available in HiveContext
Using HiveContext we can access entire Hive environment (hive tables) from Spark
HQL statement vs Hive
---------------------
If HQL is executed within the Hive environment, the statements are converted into a MapReduce job and
that MapReduce job is executed (performance issues: disk I/O, no in-memory computing).
If the same Hive is integrated with Spark and the HQL is submitted from Spark,
it uses the DAG and in-memory computing model, with persistence.
Benefits: persistence, in-memory computing, customized parallel processing.
Spark SQL limitations:
It is applicable only to structured data.
If the data is unstructured,
it needs to be processed with Spark Core's RDD API and Spark MLlib (NLP algorithms; libraries such as NLTK can complement Spark MLlib).
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
val sqc = new SQLContext(sc)
Steps to work with SQLContext:
#1: val data = sc.textFile("/user/cloudera/Sparks/file1")
Load data into RDD
sample - file1
100,200,300
300,400,400
...
#2: Provide schema to the RDD (Create case class)
case class Rec(a:Int, b:Int, c:Int)
#3: Create a function to convert a raw line into a case object
(Function to provide schema)
def makeRec(line:String) = {
val w = line.split(",")
val a = w(0).toInt
val b = w(1).toInt
val c = w(2).toInt
val r = Rec(a,b,c)
r
}
(In order to work with SQL, we definitely need schema)
#4 : Transform each record into case object
val recs = data.map (x => makeRec(x))
#5: convert RDD into Data Frame
val df = recs.toDF (To Data Frame)
#6: Create table instance for the data frame
df.registerTempTable("samp")
Before Spark 1.3 there was no DataFrame.
From Spark 1.6 onwards there is the Dataset as well.
RDD --> DataFrame --> Dataset
#7. Play SQL statements
run select statements against 'samp' (temp table)
#8. Apply Select Statement of SQL on temp table
val r1 = sqc.sql("select a+b+c as tot from samp")
(returned object is not a temp table, returned object is data frame)
(r1 is dataframe)
r1
----
tot
----
600
900
when an SQL statement is applied on a temp table, the returned object is a DataFrame
to apply an SQL statement on the result set again, we need to register it as a temp table
r1.registerTempTable("samp1")
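For example, a small sketch of querying the re-registered result (the filter value is arbitrary):
val r2 = sqc.sql("select tot from samp1 where tot > 700")
r2.show()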
emp:
id,name,sal,sex,dno
101,aaa,40000,m,1
......
import org.apache.spark.sql.SQLContext
val sqc = new SQLContext(sc)
val data = sc.textFile("/user/cloudera/Sparks/emp")
case class Emp(id:Int, name:String, sal:Int, sex:String, dno:Int)
def toEmp(x:String) = {
val w = x.trim().split(",")
val id = w(0).toInt
val name = w(1)
val sal = w(2).toInt
val sex = w(3)
val dno = w(4).toInt
val e = Emp(id,name,sal,sex,dno)
e
}
val emps = data.map ( x => toEmp(x));
val df = emps.toDF
df.registerTempTable("emp")
val r1 = sqc.sql("select sex,sum(sal) as tot from emp group by sex");
val res2 = sqc.sql("select dno,sex, sum(sal) as tot, avg(sal) as avg,
max(sal) as max, min(sal) as min, count(*) as cnt from emp group by dno,sex");
dept:
------
11,marketing,hyd
...
emp (file #1), dept (file #2)
val data2 = sc.textFile("/user/cloudera/Sparks/dept")
case class Dept(dno:Int, dname:String, city:String)
val dept = data2.map { x =>
val w = x.split(",")
val id = w(0).toInt
val name = w(1)
val city = w(2)
Dept(id,name,city)
}
val df2 = dept.toDF
df2.registerTempTable("departs")
val res = sqc.sql("select city,sum(sal) as tot from emp l join departs r on l.dno = r.dno group by city")
(the object type of res is DataFrame); res.persist keeps it in memory
Tables are available in Hive; we are going to access and run queries against Hive tables within the Spark environment
------------------------------------------------------------------------------------------------------------
One time investment:
copy hive-site.xml into the /usr/lib/spark/conf folder
if this file is not copied, Spark cannot find the Hive metastore location
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
hc.sql("create database mydb")
hc.sql("use mydb")
hc.sql("create table result1 (dno int, tot int)")
hc.sql("insert into table result1 select dno,sum(sal) from default.emp group by dno")
select * from abc:
Hive contacts the metastore, which lives in an RDBMS (Derby by default - lightweight)
it can be reconfigured to Oracle or any other RDBMS
hql along with sql
------------------
val r1 = sqc.sql("...")
val r2 = hc.sql("...")
r1.registerTempTable("res1")
r2.registerTempTable("res2")
one dataset is in a file (we handle it using SQLContext)
another is in a Hive database (we handle it using HiveContext)
finally we do unions, joins etc. against both of them (see the sketch below)
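A minimal sketch of that idea, assuming the emps RDD built above and the Hive table mydb.result1 created just above (names are only illustrative); since HiveContext extends SQLContext, registering the file data in hc lets a single query use both:
import hc.implicits._                  // so that toDF goes through the HiveContext
val fileDF = emps.toDF                 // the data that came from a file
fileDF.registerTempTable("res1")       // temp table visible to hc
val combined = hc.sql("select l.dno, l.sal, r.tot from res1 l join mydb.result1 r on l.dno = r.dno")
combined.show()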
Working with json using sqlContext:
-------------------------------------
In Hive, a JSON SerDe (Serializer/Deserializer) or functions such as
get_json_object / json_tuple are needed;
working with json using sqlContext is simpler
json1
------
{"name":"Ravi,"age":20,"sex":"M"}
{"name":"Vani","city":"hyd","sex":"F"}
Web service responses are usually in json
even log files are often json
sc.textFile("....txt") // for a regular text file
for json we are going to use sqlContext
val jdf = sqc.read.json("/user/cloudera/Sparks/json1.json")
jdf is automatically a DataFrame
name age city sex
---------------------------
Ravi 20 null M
Vani null Hyd F
jdf - data frame
How to work with XML?
----------------------
i) using a 3rd party library (e.g. Databricks' spark-xml)
ii) integrate Spark with Hive using HiveContext and apply XML parser functions such as
xpath(), xpath_string(), xpath_int(), etc.
xml1
-------
<rec><name>Ravi</name><age>20</age></rec>
<rec><name>Rani</name><sex>f</sex></rec>
hc.sql("use mydb")
hc.sql("create table raw (line string)")
hc.sql("load data local inpath 'xml1' into table raw")
hc.sql("create table info (name string, age int, sex string)
row format delimited fields terminated by ','")
hc.sql("insert overwrite table info
select
xpath_string(line,'rec/name')
xpath_int(line,'rec/age')
xpath_string(line,'rec/sex' from raw")
Spark SQL:
----------
#1 import org.apache.spark.sql.SQLContext
-- val sqlContext = new SQLContext(sc)
//to convert RDDs into DFs implicitly
import sqlContext.implicits._
#2 load data from file
#3 create schema (case class)
#4 transform each element into case class
#5 convert into DF
#6 register as temp table
#7 Play with SQL
sqlContext.sql("SELECT ...") // only select statements are allowed
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> case class Samp(a:Int, b:Int, c:Int)
defined class Samp
scala> val s1 = Samp(10,20,30)
s1: Samp = Samp(10,20,30)
scala> val s2 = Samp(1,2,3)
s2: Samp = Samp(1,2,3)
scala> val s3 = Samp(100,200,300)
s3: Samp = Samp(100,200,300)
scala> val s4 = Samp(1000,2000,3000)
s4: Samp = Samp(1000,2000,3000)
scala> val data = List(s1,s2,s3,s4)
data: List[Samp] = List(Samp(10,20,30), Samp(1,2,3), Samp(100,200,300), Samp(1000,2000,3000))
scala> val data = sc.parallelize(List(s1,s2,s3,s4))
data: org.apache.spark.rdd.RDD[Samp] = ParallelCollectionRDD[20] at parallelize at <console>:40
scala> data.collect.foreach(println)
Samp(10,20,30)
Samp(1,2,3)
Samp(100,200,300)
Samp(1000,2000,3000)
scala> val x = data.map (v => v.a)
x: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[21] at map at <console>:42
scala> x.collect.foreach(println)
10
1
100
1000
scala> val x = data.map(v => v.a + v.b + v.c)
x: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[22] at map at <console>:42
scala> x.collect.foreach(println)
60
6
600
6000
//once your RDD has a schema you can convert it into a DataFrame
an RDD with a schema can be converted into a DataFrame
DataFrames have specialized APIs and are optimized by the Catalyst optimizer
we can turn a DataFrame into a temp table and play with SQL statements
scala> val df = data.toDF
df: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int]
scala> df.collect.foreach(println)
[10,20,30]
[1,2,3]
[100,200,300]
[1000,2000,3000]
scala> df.printSchema
root
|-- a: integer (nullable = false)
|-- b: integer (nullable = false)
|-- c: integer (nullable = false)
scala>
scala> df.show()
+----+----+----+
| a| b| c|
+----+----+----+
| 10| 20| 30|
| 1| 2| 3|
| 100| 200| 300|
|1000|2000|3000|
+----+----+----+
scala> df.take(10)
res25: Array[org.apache.spark.sql.Row] = Array([10,20,30], [1,2,3], [100,200,300], [1000,2000,3000])
scala> df.show(3)
+---+---+---+
| a| b| c|
+---+---+---+
| 10| 20| 30|
| 1| 2| 3|
|100|200|300|
+---+---+---+
only showing top 3 rows
//register table to play SQL
scala> df.registerTempTable("df")
scala> sqlContext.sql("select * from df")
res31: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int]
scala> val df2 = sqlContext.sql("select * from df")
df2: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int]
scala> df2.show()
+----+----+----+
| a| b| c|
+----+----+----+
| 10| 20| 30|
| 1| 2| 3|
| 100| 200| 300|
|1000|2000|3000|
+----+----+----+
scala> val df2 = sqlContext.sql("select a,b from df")
df2: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> df2.show()
+----+----+
| a| b|
+----+----+
| 10| 20|
| 1| 2|
| 100| 200|
|1000|2000|
+----+----+
scala> val df3 = sqlContext.sql("select a,b,c,a+b+c as tot from df")
df3: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int, tot: int]
scala> df3.show()
+----+----+----+----+
| a| b| c| tot|
+----+----+----+----+
| 10| 20| 30| 60|
| 1| 2| 3| 6|
| 100| 200| 300| 600|
|1000|2000|3000|6000|
+----+----+----+----+
scala> df3.printSchema
root
|-- a: integer (nullable = false)
|-- b: integer (nullable = false)
|-- c: integer (nullable = false)
|-- tot: integer (nullable = false)
Transformation is very easy in Spark SQL
SQL needs data with a proper schema,
which is usually the case for enterprise (structured) data
The RDD API with functional programming already simplifies a lot,
but many people still find it difficult,
so Spark SQL came into the picture:
load your file into an RDD
create a schema (case class)
transform each element into that schema
convert into a DataFrame
register the DataFrame as a temp table
custom functionality still needs the RDD API (see the sketch below)
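When something can't be expressed in SQL, we can drop back to the RDD of Rows behind the DataFrame. A minimal sketch using the df (a, b, c) DataFrame from above; the per-row logic is only illustrative:
val rowRdd = df.rdd                            // RDD[org.apache.spark.sql.Row]
val custom = rowRdd.map { row =>
  val a = row.getInt(0); val b = row.getInt(1); val c = row.getInt(2)
  if (a > 50) (a * 2) + b + c else a + b + c   // arbitrary custom logic, just as an example
}
custom.collect.foreach(println)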
Create emp file in local then copy it into hdfs:
[cloudera@quickstart ~]$ cat > emp
101,aaa,40000,m,11
102,bbbb,50000,f,12
103,ccc,90000,m,12
104,ddddd,100000,f,13
105,eee,20000,m,11
106,iiii,30000,f,12
107,jjjj,60000,m,13
108,kkkkk,90000,f,14
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal emp Sparks/emp
scala>
scala> val raw = sc.textFile("/user/cloudera/Sparks/emp")
raw: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/emp MapPartitionsRDD[38] at textFile at <console>:30
scala> raw.count
res36: Long = 8
scala> raw.collect.foreach(println)
101,aaa,40000,m,11
102,bbbb,50000,f,12
103,ccc,90000,m,12
104,ddddd,100000,f,13
105,eee,20000,m,11
106,iiii,30000,f,12
107,jjjj,60000,m,13
108,kkkkk,90000,f,14
scala> raw.take(1) // without schema
res38: Array[String] = Array(101,aaa,40000,m,11)
//we create a case class to apply schema to existing RDD
scala> case class Info(id:Int, name:String,sal:Int,sex:String,dno:Int)
// create a function to apply Info case class for each element
def toInfo (x:String) = {
val w = x.split(",")
val id = w(0).toInt
val name = w(1)
val sal = w(2).toInt
val sex = w(3)
val dno = w(4).toInt
val info = Info(id,name,sal,sex,dno)
info
}
scala> val rec = "401,Amar,7000,m,12"
rec: String = 401,Amar,7000,m,12
scala> val re = toInfo(rec)
re: Info = Info(401,Amar,7000,m,12)
scala> re
res39: Info = Info(401,Amar,7000,m,12)
scala> re.name
res41: String = Amar
scala> re.sex
res42: String = m
scala> re.sal
res43: Int = 7000
scala> re.dno
res44: Int = 12
scala> re.id
res45: Int = 401
scala> val infos = raw.map(x => toInfo(x))
infos: org.apache.spark.rdd.RDD[Info] = MapPartitionsRDD[40] at map at <console>:50
scala> infos.collect.foreach(println)
Info(101,aaa,40000,m,11)
Info(102,bbbb,50000,f,12)
Info(103,ccc,90000,m,12)
Info(104,ddddd,100000,f,13)
Info(105,eee,20000,m,11)
Info(106,iiii,30000,f,12)
Info(107,jjjj,60000,m,13)
Info(108,kkkkk,90000,f,14)
scala> infos.map( x => x.sal).sum
res49: Double = 480000.0
//now infos has Schema so its eligible to convert into Data Frame
scala> val dfinfo = infos.toDF
dfinfo: org.apache.spark.sql.DataFrame = [id: int, name: string, sal: int, sex: string, dno: int]
scala> dfinfo.show()
+---+-----+------+---+---+
| id| name| sal|sex|dno|
+---+-----+------+---+---+
|101| aaa| 40000| m| 11|
|102| bbbb| 50000| f| 12|
|103| ccc| 90000| m| 12|
|104|ddddd|100000| f| 13|
|105| eee| 20000| m| 11|
|106| iiii| 30000| f| 12|
|107| jjjj| 60000| m| 13|
|108|kkkkk| 90000| f| 14|
+---+-----+------+---+---+
scala> dfinfo.printSchema
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- sal: integer (nullable = false)
|-- sex: string (nullable = true)
|-- dno: integer (nullable = false)
scala> sqlContext.sql("select * from dfinfo where sex='m'").show()
+---+----+-----+---+---+
| id|name| sal|sex|dno|
+---+----+-----+---+---+
|101| aaa|40000| m| 11|
|103| ccc|90000| m| 12|
|105| eee|20000| m| 11|
|107|jjjj|60000| m| 13|
+---+----+-----+---+---+
scala> sqlContext.sql("select * from dfinfo where sex='f'").show()
+---+-----+------+---+---+
| id| name| sal|sex|dno|
+---+-----+------+---+---+
|102| bbbb| 50000| f| 12|
|104|ddddd|100000| f| 13|
|106| iiii| 30000| f| 12|
|108|kkkkk| 90000| f| 14|
+---+-----+------+---+---+
RDD way to find sum of sal for male and female employees
scala> infos
res60: org.apache.spark.rdd.RDD[Info] = MapPartitionsRDD[40] at map at <console>:50
scala> val pair = infos.map ( x => (x.sex,x.sal))
pair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[61] at map at <console>:52
scala> val res = pair.reduceByKey(_+_)
res: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[62] at reduceByKey at <console>:54
scala> res.collect.foreach(println)
(f,270000)
(m,210000)
scala> var r = sqlContext.sql("select sex,sum(sal) as tot from dfinfo group by sex")
r: org.apache.spark.sql.DataFrame = [sex: string, tot: bigint]
scala> r.show()
+---+------+
|sex| tot|
+---+------+
| f|270000|
| m|210000|
+---+------+
// if it is RDD -> saveAsTextFile
// if it is DataFrame -> avro,orc,parquet,json etc
code is vastly reduced
optimized memory management
much faster execution
// we need to register the DataFrame as a temp table; only then can we run SQL queries against it
RDD way to filter records:
-------------------------
scala> infos.filter( x => x.sex.toLowerCase == "m").collect.foreach(println)
Info(101,aaa,40000,m,11)
Info(103,ccc,90000,m,12)
Info(105,eee,20000,m,11)
Info(107,jjjj,60000,m,13)
scala> infos.filter( x => x.sex.toLowerCase == "f").collect.foreach(println)
Info(102,bbbb,50000,f,12)
Info(104,ddddd,100000,f,13)
Info(106,iiii,30000,f,12)
Info(108,kkkkk,90000,f,14)
scala> dfinfo.registerTempTable("dfinfo")
code vastly reduced
optimized data transformation and memory management
play with multiple databases
RDD style of multiple aggregations:
in RDD style, for each sex group I want all 5 aggregations (sum, count, avg, max, min)
later we will do the same thing in SQL
scala> infos.collect.foreach(println)
Info(101,aaa,40000,m,11)
Info(102,bbbb,50000,f,12)
Info(103,ccc,90000,m,12)
Info(104,ddddd,100000,f,13)
Info(105,eee,20000,m,11)
Info(106,iiii,30000,f,12)
Info(107,jjjj,60000,m,13)
Info(108,kkkkk,90000,f,14)
scala> pair.collect.foreach(println)
(m,40000)
(f,50000)
(m,90000)
(f,100000)
(m,20000)
(f,30000)
(m,60000)
(f,90000)
scala> val pair = infos.map (x => (x.sex,x.sal))
pair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[70] at map at <console>:52
scala> pair.collect.foreach(println)
(m,40000)
(f,50000)
(m,90000)
(f,100000)
(m,20000)
(f,30000)
(m,60000)
(f,90000)
scala> val grp = pair.groupByKey()
grp: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[71] at groupByKey at <console>:54
scala> grp.collect.foreach(println)
(f,CompactBuffer(50000, 100000, 30000, 90000))
(m,CompactBuffer(40000, 90000, 20000, 60000))
scala> val res = grp.map { x =>
| val sex = x._1
| val cb = x._2
| val tot = cb.sum
| val cnt = cb.size
| val avg = tot / cnt
| val max = cb.max
| val min = cb.min
| (sex,tot,cnt,avg,max,min)
| }
res: org.apache.spark.rdd.RDD[(String, Int, Int, Int, Int, Int)] = MapPartitionsRDD[72] at map at <console>:56
scala> res.collect.foreach(println)
(f,270000,4,67500,100000,30000)
(m,210000,4,52500,90000,20000)
select sex,sum(sal) as tot, count(*) as cnt, avg(sal) as avg,
max(sal) as max, min(sal) as min
from dfinfo group by sex
scala> sqlContext.sql("select sex,sum(sal) as tot, count(*) as cnt, avg(sal) as avg, max(sal) as max, min(sal) as min from dfinfo group by sex").show()
+---+------+---+-------+------+-----+
|sex| tot|cnt| avg| max| min|
+---+------+---+-------+------+-----+
| f|270000| 4|67500.0|100000|30000|
| m|210000| 4|52500.0| 90000|20000|
+---+------+---+-------+------+-----+
select dno,sex,sum(sal) as tot, count(*) as cnt, avg(sal) as avg,
max(sal) as max, min(sal) as min
from dfinfo group by dno,sex
scala> sqlContext.sql("select dno,sex,sum(sal) as tot, count(*) as cnt, avg(sal) as avg, max(sal) as max, min(sal) as min from dfinfo group by dno,sex").show()
+---+---+------+---+--------+------+------+
|dno|sex| tot|cnt| avg| max| min|
+---+---+------+---+--------+------+------+
| 11| m| 60000| 2| 30000.0| 40000| 20000|
| 12| f| 80000| 2| 40000.0| 50000| 30000|
| 12| m| 90000| 1| 90000.0| 90000| 90000|
| 13| f|100000| 1|100000.0|100000|100000|
| 13| m| 60000| 1| 60000.0| 60000| 60000|
| 14| f| 90000| 1| 90000.0| 90000| 90000|
+---+---+------+---+--------+------+------+
// multi grouping, multiple aggregations done
[cloudera@quickstart ~]$ cat > emp
101,aaa,40000,m,11
102,bbbb,50000,f,12
103,ccc,90000,m,12
104,ddddd,100000,f,13
105,eee,20000,m,11
106,iiii,30000,f,12
107,jjjj,60000,m,13
108,kkkk,90000,f,14
[cloudera@quickstart ~]$ cat > emp2
201,kiran,14,m,90000
202,mani,12,f,10000
203,giri,12,m,20000
204,girija,11,f,40000
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal emp Sparks/emp
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal emp2 Sparks/emp2
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/emp
101,aaa,40000,m,11
102,bbbb,50000,f,12
103,ccc,90000,m,12
104,ddddd,100000,f,13
105,eee,20000,m,11
106,iiii,30000,f,12
107,jjjj,60000,m,13
108,kkkk,90000,f,14
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/emp2
201,kiran,14,m,90000
202,mani,12,f,10000
203,giri,12,m,20000
204,girija,11,f,40000
we have 2 different files, emp and emp2, and their schemas (column orders) differ
scala> raw
res71: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/emp MapPartitionsRDD[38] at textFile at <console>:30
scala> infos
res72: org.apache.spark.rdd.RDD[Info] = MapPartitionsRDD[40] at map at <console>:50
(2 table joining using Spark SQL)
scala> val raw2 = sc.textFile("/user/cloudera/Sparks/emp2")
raw2: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/emp2 MapPartitionsRDD[94] at textFile at <console>:30
scala> raw2.collect.foreach(println)
201,kiran,14,m,90000
202,mani,12,f,10000
203,giri,12,m,20000
204,girija,11,f,40000
scala> val infos2 = raw2.map { x =>
| val w = x.split(",")
| val id = w(0).toInt
| val name = w(1)
| val dno = w(2).toInt
| val sex = w(3)
| val sal = w(4).toInt
| Info(id,name,sal,sex,dno)
| }
infos2: org.apache.spark.rdd.RDD[Info] = MapPartitionsRDD[95] at map at <console>:48
scala> infos2.collect.foreach(println)
Info(201,kiran,90000,m,14)
Info(202,mani,10000,f,12)
Info(203,giri,20000,m,12)
Info(204,girija,40000,f,11)
scala> infos.collect.foreach(println)
Info(101,aaa,40000,m,11)
Info(102,bbbb,50000,f,12)
Info(103,ccc,90000,m,12)
Info(104,ddddd,100000,f,13)
Info(105,eee,20000,m,11)
Info(106,iiii,30000,f,12)
Info(107,jjjj,60000,m,13)
Info(108,kkkk,90000,f,14)
scala> dfinfo.show(2)
+---+----+-----+---+---+
| id|name| sal|sex|dno|
+---+----+-----+---+---+
|101| aaa|40000| m| 11|
|102|bbbb|50000| f| 12|
+---+----+-----+---+---+
only showing top 2 rows
scala> val dfinfo2 = infos2.toDF
dfinfo2: org.apache.spark.sql.DataFrame = [id: int, name: string, sal: int, sex: string, dno: int]
scala> dfinfo2.show(2)
+---+-----+-----+---+---+
| id| name| sal|sex|dno|
+---+-----+-----+---+---+
|201|kiran|90000| m| 14|
|202| mani|10000| f| 12|
+---+-----+-----+---+---+
only showing top 2 rows
scala> dfinfo2.registerTempTable("dfinfo2")   // register the second DataFrame before querying it
scala> val df = sqlContext.sql("select * from dfinfo union all select * from dfinfo2")
scala> df.show()
+---+------+------+---+---+
| id| name| sal|sex|dno|
+---+------+------+---+---+
|101| aaa| 40000| m| 11|
|102| bbbb| 50000| f| 12|
|103| ccc| 90000| m| 12|
|104| ddddd|100000| f| 13|
|105| eee| 20000| m| 11|
|106| iiii| 30000| f| 12|
|107| jjjj| 60000| m| 13|
|108| kkkk| 90000| f| 14|
|201| kiran| 90000| m| 14|
|202| mani| 10000| f| 12|
|203| giri| 20000| m| 12|
|204|girija| 40000| f| 11|
+---+------+------+---+---+
scala> df.registerTempTable("df")
// combined aggregation of both tables (emp,emp2)
scala> sqlContext.sql("select sex,sum(sal) as tot from df group by sex").show()
+---+------+
|sex| tot|
+---+------+
| f|320000|
| m|320000|
+---+------+
Multi table joining (schema is different for left and right tables)
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/dept
11,marketing,hyd
12,hr,del
13,finance,hyd
14,admin,del
15,accounts,hyd
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/emp
101,aaa,40000,m,11
102,bbbb,50000,f,12
103,ccc,90000,m,12
104,ddddd,100000,f,13
105,eee,20000,m,11
106,iiii,30000,f,12
107,jjjj,60000,m,13
108,kkkk,90000,f,14
val raw3 = sc.textFile("/user/cloudera/Sparks/dept")
raw3: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/dept MapPartitionsRDD[114] at textFile at <console>:30
scala> raw3.collect.foreach(println)
11,marketing,hyd
12,hr,del
13,finance,hyd
14,admin,del
15,accounts,hyd
scala> case class Dept(dno:Int, dname:String,loc:String)
defined class Dept
scala> val dept = raw3.map { x =>
| val w = x.split(",")
| val dno = w(0).toInt
| val dname = w(1)
| val loc = w(2)
| Dept(dno,dname,loc)
| }
dept: org.apache.spark.rdd.RDD[Dept] = MapPartitionsRDD[115] at map at <console>:48
what is the salary budget of each city?
scala> dept.collect.foreach(println)
Dept(11,marketing,hyd)
Dept(12,hr,del)
Dept(13,finance,hyd)
Dept(14,admin,del)
Dept(15,accounts,hyd)
scala> infos.collect.foreach(println)
Info(101,aaa,40000,m,11)
Info(102,bbbb,50000,f,12)
Info(103,ccc,90000,m,12)
Info(104,ddddd,100000,f,13)
Info(105,eee,20000,m,11)
Info(106,iiii,30000,f,12)
Info(107,jjjj,60000,m,13)
Info(108,kkkk,90000,f,14)
scala> val deptdf = dept.toDF
deptdf: org.apache.spark.sql.DataFrame = [dno: int, dname: string, loc: string]
scala> deptdf.registerTempTable("dept")
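To answer the question above (the salary budget of each city), a small sketch joining the two temp tables registered above ("dfinfo" for emp and "dept" for dept):
val citybudget = sqlContext.sql("select d.loc, sum(e.sal) as tot from dfinfo e join dept d on e.dno = d.dno group by d.loc")
citybudget.show()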
Accessing Hive Tables using Spark:
----------------------------------
sqlContext
hiveContext
search for 'hive-site.xml' file in linux
su
Password: cloudera
[root@quickstart cloudera]# find / -name hive-site.xml
/home/cloudera/Desktop/hive-site.xml
/etc/hive/conf.dist/hive-site.xml
/etc/impala/conf.dist/hive-site.xml
copy hive-site.xml into /usr/lib/spark/conf
--------------------------------------------
[root@quickstart ~]# cp /etc/hive/conf.dist/hive-site.xml /usr/lib/spark/conf
//import package for hive in spark shell
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext
// create an instance of HiveContext by passing sc (the SparkContext)
scala> val hc = new HiveContext(sc)
hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@645a3f6d
// create hive database in spark
scala> hc.sql("create database myspark")
res109: org.apache.spark.sql.DataFrame = [result: string]
start hive and look for myspark database:
--------------------------------------
[cloudera@quickstart ~]$ hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> show databases;
OK
batch_may
default
myspark
sakthi
Time taken: 0.028 seconds, Fetched: 4 row(s)
hive>
scala> hc.sql("use myspark")
res111: org.apache.spark.sql.DataFrame = [result: string]
scala> hc.sql("create table samp(id int, name string, sal int, sex string, dno int) row format delimited fields terminated by ','")
res112: org.apache.spark.sql.DataFrame = [result: string]
hive> use myspark;
OK
Time taken: 0.015 seconds
hive> show tables;
OK
samp
Time taken: 0.018 seconds, Fetched: 1 row(s)
hive> describe samp;
OK
id int
name string
sal int
sex string
dno int
Time taken: 0.112 seconds, Fetched: 5 row(s)
load data into samp :
---------------------
scala> hc.sql("load data local inpath 'emp' into table samp")
res113: org.apache.spark.sql.DataFrame = [result: string]
scala> hc.sql("select * from samp").show()
+---+-----+------+---+---+
| id| name| sal|sex|dno|
+---+-----+------+---+---+
|101| aaa| 40000| m| 11|
|102| bbbb| 50000| f| 12|
|103| ccc| 90000| m| 12|
|104|ddddd|100000| f| 13|
|105| eee| 20000| m| 11|
|106| iiii| 30000| f| 12|
|107| jjjj| 60000| m| 13|
|108| kkkk| 90000| f| 14|
+---+-----+------+---+---+
scala> val res1 = hc.sql("select dno,sum(sal) as tot from samp group by dno")
res1: org.apache.spark.sql.DataFrame = [dno: int, tot: bigint]
scala> res1.take(5)
res116: Array[org.apache.spark.sql.Row] = Array([11,60000], [12,170000], [13,160000], [14,90000])
scala> res1.show()
+---+------+
|dno| tot|
+---+------+
| 11| 60000|
| 12|170000|
| 13|160000|
| 14| 90000|
+---+------+
// in Hive, the same HQL query is converted into MapReduce java code and run as a jar (map... reduce...)
hive> select dno,sum(sal) from samp group by dno;
11 60000
12 170000
13 160000
14 90000
Hive provides partitioned tables to avoid full table scans and give faster query results (see the sketch below)
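A small sketch of a partitioned table (the table name and partition value are only illustrative), loading one static partition from the samp table above:
hc.sql("create table samp_part (id int, name string, sal int, sex string) partitioned by (dno int)")
hc.sql("insert into table samp_part partition (dno=11) select id, name, sal, sex from samp where dno=11")
hc.sql("select * from samp_part where dno=11").show()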
Create a json file in local linux :
-------------------------------------
cat > mydata.json
{"name":"Ravi","age":25}
{"name":"Rani","city":"Hyd"}
{"name":"Mani","age":24,"city":"Del"}
copy mydata.json into hdfs Sparks folder:
--------------------------------------------
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal mydata.json Sparks
display the content of mydata.json:
-----------------------------------
hdfs dfs -cat /user/cloudera/Sparks/mydata.json
{"name":"Ravi","age":25}
{"name":"Rani","city":"Hyd"}
{"name":"Mani","age":24,"city":"Del"}
import json into Hive table :
-------------------------------
hive> use myspark;
OK
Time taken: 0.046 seconds
hive> use myspark;
OK
Time taken: 0.016 seconds
hive> create table raw(line string);
OK
Time taken: 0.734 seconds
hive> load data local inpath 'mydata.json' into table raw;
Loading data to table myspark.raw
Table myspark.raw stats: [numFiles=1, totalSize=92]
OK
Time taken: 0.636 seconds
hive> select * from raw;
OK
{"name":"Ravi","age":25}
{"name":"Rani","city":"Hyd"}
{"name":"Mani","age":24,"city":"Del"}
Time taken: 0.079 seconds, Fetched: 3 row(s)
hive> create table info (name string, age int, city string);
//create destination / target table
select get_json_object(line,'$.name') from raw;
OK
Ravi
Rani
Mani
Time taken: 0.059 seconds, Fetched: 3 row(s)
//fetch json elements from raw table using get_json_object:
hive> select get_json_object(line,'$.name'),get_json_object(line,'$.age') from raw;
OK
Ravi 25
Rani NULL
Mani 24
Time taken: 0.065 seconds, Fetched: 3 row(s)
hive> select get_json_object(line,'$.name'),get_json_object(line,'$.age'),get_json_object(line,'$.city') from raw;
OK
Ravi 25 NULL
Rani NULL Hyd
Mani 24 Del
Time taken: 0.058 seconds, Fetched: 3 row(s)
//fetch json elements from raw table using json_tuple
hive> select x.* from raw lateral view json_tuple(line,'name','age','city') x as name,age,city;
OK
Ravi 25 NULL
Rani NULL Hyd
Mani 24 Del
Time taken: 0.063 seconds, Fetched: 3 row(s)
// fetch raw table json elements and put them into info table:
hive> insert into table info select x.* from raw lateral view json_tuple(line,'name','age','city') x as name,age,city;
hive> select * from info;
OK
Ravi 25 NULL
Rani NULL Hyd
Mani 24 Del
Time taken: 0.076 seconds, Fetched: 3 row(s)
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/mydata.json
{"name":"Ravi","age":25}
{"name":"Rani","city":"Hyd"}
{"name":"Mani","age":24,"city":"Del"}
the Spark way to work with json files:
------------------------------------------
[cloudera@quickstart ~]$ hdfs dfs -cat Sparks/mydata.json
scala> val jdf = sqlContext.read.json("/user/cloudera/Sparks/mydata.json")
jdf: org.apache.spark.sql.DataFrame = [age: bigint, city: string, name: string]
scala> jdf.show()
+----+----+----+
| age|city|name|
+----+----+----+
| 25|null|Ravi|
|null| Hyd|Rani|
| 24| Del|Mani|
+----+----+----+
scala> jdf.count
res119: Long = 3
scala> jdf.take(3)
res120: Array[org.apache.spark.sql.Row] = Array([25,null,Ravi], [null,Hyd,Rani], [24,Del,Mani])
scala> jdf.printSchema
root
|-- age: long (nullable = true)
|-- city: string (nullable = true)
|-- name: string (nullable = true)
read.json ==> Serialization, Deserialization happens automatically
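The json DataFrame can also be registered as a temp table and queried with SQL (a small sketch; the temp table name is arbitrary):
jdf.registerTempTable("people")
sqlContext.sql("select name, age from people where age is not null").show()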
// example to do with nested json
// Hive approach first
cat > mydata1.json
{"name":"Ravi","age":25,"wife":{"name":"Rani","age":24,"city":"hyd"},"city":"del"}
{"name":"Kiran","age":30,"wife":{"name":"Veni","qual":"BTech","city":"hyd"},"city":"hyd"}
[cloudera@quickstart ~]$ cat mydata1.json
{"name":"Ravi","age":25,"wife":{"name":"Rani","age":24,"city":"hyd"},"city":"del"}
{"name":"Kiran","age":30,"wife":{"name":"Veni","qual":"BTech","city":"hyd"},"city":"hyd"}
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal mydata1.json Sparks
[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/Sparks/mydata1.json
{"name":"Ravi","age":25,"wife":{"name":"Rani","age":24,"city":"hyd"},"city":"del"}
{"name":"Kiran","age":30,"wife":{"name":"Veni","qual":"BTech","city":"hyd"},"city":"hyd"}
hive> create table jraw (line string);
OK
Time taken: 0.325 seconds
hive> load data local inpath 'mydata1.json' into table jraw;
Loading data to table myspark.jraw
Table myspark.jraw stats: [numFiles=1, totalSize=173]
OK
Time taken: 0.29 seconds
hive> select * from jraw;
OK
{"name":"Ravi","age":25,"wife":{"name":"Rani","age":24,"city":"hyd"},"city":"del"}
{"name":"Kiran","age":30,"wife":{"name":"Veni","qual":"BTech","city":"hyd"},"city":"hyd"}
Time taken: 0.054 seconds, Fetched: 2 row(s)
hive> create table raw2(name string, age int, wife string, city string);
OK
Time taken: 0.235 seconds
select x.* from jraw lateral view json_tuple(line,'name','age','wife','city') x as n,a,w,c;
OK
Ravi 25 {"name":"Rani","age":24,"city":"hyd"} del
Kiran 30 {"name":"Veni","qual":"BTech","city":"hyd"} hyd
Time taken: 0.071 seconds, Fetched: 2 row(s)
hive> insert into table raw2 select x.* from jraw lateral view json_tuple(line,'name','age','wife','city') x as n,a,w,c;
select * from raw2;
OK
Ravi 25 {"name":"Rani","age":24,"city":"hyd"} del
Kiran 30 {"name":"Veni","qual":"BTech","city":"hyd"} hyd
Time taken: 0.053 seconds, Fetched: 2 row(s)
hive> select name,get_json_object(wife,'$.name'), age,get_json_object(wife,'$.age'),get_json_object(wife,'$.qual'),city,get_json_object(wife,'$.city') from raw2;
OK
Ravi Rani 25 24 NULL del hyd
Kiran Veni 30 NULL BTech hyd hyd
Time taken: 0.063 seconds, Fetched: 2 row(s)
Spark approach to handle nested json:
--------------------------------------
scala> val couples = sqlContext.read.json("/user/cloudera/Sparks/mydata1.json")
couples: org.apache.spark.sql.DataFrame = [age: bigint, city: string, name: string, wife: struct<age:bigint,city:string,name:string,qual:string>]
scala> couples.show();
+---+----+-----+--------------------+
|age|city| name| wife|
+---+----+-----+--------------------+
| 25| del| Ravi| [24,hyd,Rani,null]|
| 30| hyd|Kiran|[null,hyd,Veni,BT...|
+---+----+-----+--------------------+
scala> couples.collect
res126: Array[org.apache.spark.sql.Row] = Array([25,del,Ravi,[24,hyd,Rani,null]], [30,hyd,Kiran,[null,hyd,Veni,BTech]])
scala> couples.collect.map (x => x(3))
res129: Array[Any] = Array([24,hyd,Rani,null], [null,hyd,Veni,BTech])
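The nested fields of the wife struct can also be selected directly with dot notation instead of pulling the struct out of the Row (a small sketch):
couples.select("name", "age", "wife.name", "wife.city").show()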
Hive with XML and Spark with XML
-------------------------------
The older semi-structured format is XML
The newer semi-structured format is JSON
In the IT industry, most legacy data is in XML
Hive has powerful XML parser functions
Databricks provides a 3rd party library (spark-xml)
We can get help from HQL for XML parsing
Create an XML file:
-------------------
[cloudera@quickstart ~]$ cat > my1st.xml
<rec><name>Ravi</name><age>25</age></rec>
<rec><name>Rani</name><sex>F</sex></rec>
<rec><name>Giri</name><age>35</age><sex>M</sex></rec>
Copy it into hdfs:
-------------------
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal my1st.xml Sparks
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
scala> hc
res131: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@645a3f6d
scala> hc.sql("use myspark")
res132: org.apache.spark.sql.DataFrame = [result: string]
scala> hc.sql("create table xraw(line string)")
res133: org.apache.spark.sql.DataFrame = [result: string]
scala> hc.sql("create table xinfo(name string, age int, city string) row format delimited fields terminated by ','")
res134: org.apache.spark.sql.DataFrame = [result: string]
hdfs location :
/user/hive/warehouse/myspark.db/xraw
scala> hc.sql("load data local inpath 'my1st.xml' into table xraw")
res135: org.apache.spark.sql.DataFrame = [result: string]
scala> hc.sql("select * from xraw").show()
+--------------------+
| line|
+--------------------+
|<rec><name>Ravi</...|
|<rec><name>Rani</...|
|<rec><name>Giri</...|
+--------------------+
scala> hc.sql("select xpath_string(line,'rec/name') from xraw").show()
+----+
| _c0|
+----+
|Ravi|
|Rani|
|Giri|
+----+
scala> hc.sql("select xpath_string(line,'rec/age') from xraw").show()
+---+
|_c0|
+---+
| 25|
| |
| 35|
+---+
scala> hc.sql("select xpath_string(line,'rec/sex') from xraw").show()
+---+
|_c0|
+---+
| |
| F|
| M|
+---+
scala> val re = hc.sql("select xpath_string(line,'rec/name'), xpath_string(line,'rec/age'),xpath_string(line,'rec/sex') from xraw")
re: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string, _c2: string]
scala> re.show();
+----+---+---+
| _c0|_c1|_c2|
+----+---+---+
|Ravi| 25| |
|Rani| | F|
|Giri| 35| M|
+----+---+---+
scala> val re1 = hc.sql("select xpath_string(line,'rec/name'), xpath_int(line,'rec/age'),xpath_string(line,'rec/sex') from xraw")
re1: org.apache.spark.sql.DataFrame = [_c0: string, _c1: int, _c2: string]
scala> re1.show()
+----+---+---+
| _c0|_c1|_c2|
+----+---+---+
|Ravi| 25| |
|Rani| 0| F|
|Giri| 35| M|
+----+---+---+
// put all the results taken from xraw into xresults:
-------------------------------------------------------
scala> hc.sql("insert into table xresults select xpath_string(line,'rec/name'), xpath_int(line,'rec/age'),xpath_string(line,'rec/sex') from xraw")
res144: org.apache.spark.sql.DataFrame = []
scala> hc.sql("select * from xresults").show()
+----+---+---+
|name|age|sex|
+----+---+---+
|Ravi| 25| |
|Rani| 0| F|
|Giri| 35| M|
+----+---+---+
Hive Integration Advantage:
Speed because of inmemory computing
DataFrame:
Spark Data Objects -> RDD
Data Frame -> Temporary Table
-> SQL queries
SparkSQL provides 2 types of data objects:
1) Data Frame
2) Data Set
RDD Vs Data Frame Vs Data Set
             RDD    Data Frame            Data Set
RDD APIs     Yes    No                    Yes
DF APIs      No     Yes                   No
DS APIs      No     No                    Yes
optimizer    -      Catalyst Optimizer    Catalyst Optimizer + Tungsten Optimizer
Data Frames are faster than RDDs because of the Catalyst optimizer
DataSets are faster than RDDs and Data Frames (Catalyst Optimizer + Tungsten Optimizer)
Both RDDs and Data Frames use in-memory computing
In-memory computing is much faster than traditional disk I/O computing
CPU cache - frequently used data is cached to get more performance
Tungsten uses CPU caches (L1, L2, L3, L4) along with in-memory computing
Computing out of CPU caches is much faster than plain in-memory computing
CPU cache is faster than main memory, which in turn is faster than disk
MapReduce (disk computing)
RDDs (in-memory computing)
DataSets (in-memory computing plus CPU caching) -- speed
DataSets are claimed to be more than 50% faster than traditional RDDs
Spark's in-memory cache is already fast, but DataSets combine the RDDs' in-memory computing
with L1/L2/L3/L4 CPU caching
Data Set Example:
----------------
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> case class Sample(a:Int, b:Int)
defined class Sample
scala> case class Sample(a:Int, b:Int)
defined class Sample
scala> val rdd = sc.parallelize(List(Sample(10,20),Sample(1,2),Sample(5,6),Sample(100,200),Sample(1000,2000)))
rdd: org.apache.spark.rdd.RDD[Sample] = ParallelCollectionRDD[261] at parallelize at <console>:36
scala> rdd.collect.foreach(println)
Sample(10,20)
Sample(1,2)
Sample(5,6)
Sample(100,200)
Sample(1000,2000)
scala> val df = rdd.toDF
df: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> df.printSchema()
root
|-- a: integer (nullable = false)
|-- b: integer (nullable = false)
scala> df.show()
+----+----+
| a| b|
+----+----+
| 10| 20|
| 1| 2|
| 5| 6|
| 100| 200|
|1000|2000|
+----+----+
scala> df.take(3)
res164: Array[org.apache.spark.sql.Row] = Array([10,20], [1,2], [5,6])
scala> df.take(3).foreach(println)
[10,20]
[1,2]
[5,6]
scala> df.select("a","b").show()
+----+----+
| a| b|
+----+----+
| 10| 20|
| 1| 2|
| 5| 6|
| 100| 200|
|1000|2000|
+----+----+
scala> df.select(df("a"),df("a")+1,df("b"),df("b")+1).show()
+----+-------+----+-------+
| a|(a + 1)| b|(b + 1)|
+----+-------+----+-------+
| 10| 11| 20| 21|
| 1| 2| 2| 3|
| 5| 6| 6| 7|
| 100| 101| 200| 201|
|1000| 1001|2000| 2001|
+----+-------+----+-------+
scala> df.filter(df("a") >= 100).show();
+----+----+
| a| b|
+----+----+
| 100| 200|
|1000|2000|
+----+----+
scala> df.filter(df("a")>100).show()
+----+----+
| a| b|
+----+----+
|1000|2000|
+----+----+
scala> val df = sqlContext.read.json("/user/cloudera/Sparks/mydata.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, city: string, name: string]
scala> df.show()
+----+----+----+
| age|city|name|
+----+----+----+
| 25|null|Ravi|
|null| Hyd|Rani|
| 24| Del|Mani|
+----+----+----+
scala> df.printSchema()
root
|-- age: long (nullable = true)
|-- city: string (nullable = true)
|-- name: string (nullable = true)
scala> df.select("name").show()
+----+
|name|
+----+
|Ravi|
|Rani|
|Mani|
+----+
scala> df.select("name","city","age").show()
+----+----+----+
|name|city| age|
+----+----+----+
|Ravi|null| 25|
|Rani| Hyd|null|
|Mani| Del| 24|
+----+----+----+
scala> df.select("name","city").show()
+----+----+
|name|city|
+----+----+
|Ravi|null|
|Rani| Hyd|
|Mani| Del|
+----+----+
scala> df.select(df("name"),df("age")+1).show();
+----+---------+
|name|(age + 1)|
+----+---------+
|Ravi| 26|
|Rani| null|
|Mani| 25|
+----+---------+
scala> df.select(df("age"),df("age")+100).show()
+----+-----------+
| age|(age + 100)|
+----+-----------+
| 25| 125|
|null| null|
| 24| 124|
+----+-----------+
df.filter(df("age")>21).show();
df.groupBy("age").count().show()
scala> val data = sc.textFile("/user/cloudera/Sparks/emp")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/Sparks/emp MapPartitionsRDD[325] at textFile at <console>:37
scala> data.collect.foreach(println)
101,aaa,40000,m,11
102,bbbb,50000,f,12
103,ccc,90000,m,12
104,ddddd,100000,f,13
105,eee,20000,m,11
106,iiii,30000,f,12
107,jjjj,60000,m,13
108,kkkk,90000,f,14
scala> val emp = data.map { x =>
| val w = x.split(",")
| val id = w(0).toInt
| val name = w(1)
| val sal = w(2).toInt
| val sex = w(3)
| val dno = w(4).toInt
| Emp(id,name,sal,sex,dno)
| }
emp: org.apache.spark.rdd.RDD[Emp] = MapPartitionsRDD[326] at map at <console>:55
scala> emp.collect.foreach(println)
Emp(101,aaa,40000,m,11)
Emp(102,bbbb,50000,f,12)
Emp(103,ccc,90000,m,12)
Emp(104,ddddd,100000,f,13)
Emp(105,eee,20000,m,11)
Emp(106,iiii,30000,f,12)
Emp(107,jjjj,60000,m,13)
Emp(108,kkkk,90000,f,14)
scala> val empdf = emp.toDF   // missing step: convert the Emp RDD into a DataFrame
scala> empdf.select("id","name","sal","sex","dno").show();
+---+-----+------+---+---+
| id| name| sal|sex|dno|
+---+-----+------+---+---+
|101| aaa| 40000| m| 11|
|102| bbbb| 50000| f| 12|
|103| ccc| 90000| m| 12|
|104|ddddd|100000| f| 13|
|105| eee| 20000| m| 11|
|106| iiii| 30000| f| 12|
|107| jjjj| 60000| m| 13|
|108| kkkk| 90000| f| 14|
+---+-----+------+---+---+
scala> empdf.select(empdf("sal"),empdf("sal")*10/100).show();
+------+------------------+
| sal|((sal * 10) / 100)|
+------+------------------+
| 40000| 4000.0|
| 50000| 5000.0|
| 90000| 9000.0|
|100000| 10000.0|
| 20000| 2000.0|
| 30000| 3000.0|
| 60000| 6000.0|
| 90000| 9000.0|
+------+------------------+
scala> empdf.groupBy(empdf("sex")).count.show()
+---+-----+
|sex|count|
+---+-----+
| f| 4|
| m| 4|
+---+-----+
// select sex,count(*) from emp group by sex;
scala> empdf.groupBy(empdf("sex")).agg(sum("sal")).show();
+---+--------+
|sex|sum(sal)|
+---+--------+
| f| 270000|
| m| 210000|
+---+--------+
// for each sex group how much is the total salary
// here we dealt with single group and single aggregation
scala> empdf.groupBy(empdf("sex")).agg(sum("sal"),max("sal")).show();
+---+--------+--------+
|sex|sum(sal)|max(sal)|
+---+--------+--------+
| f| 270000| 100000|
| m| 210000| 90000|
+---+--------+--------+
// single grouping but multiple aggregations
scala> empdf.groupBy(empdf("sex")).agg(sum("sal"),max("sal"),min("sal")).show();
+---+--------+--------+--------+
|sex|sum(sal)|max(sal)|min(sal)|
+---+--------+--------+--------+
| f| 270000| 100000| 30000|
| m| 210000| 90000| 20000|
+---+--------+--------+--------+
//group by multiple columns and multiple aggregations
scala> empdf.groupBy(empdf("dno"),empdf("sex")).agg(sum("sal"),max("sal"),min("sal")).show();
+---+---+--------+--------+--------+
|dno|sex|sum(sal)|max(sal)|min(sal)|
+---+---+--------+--------+--------+
| 11| m| 60000| 40000| 20000|
| 12| f| 80000| 50000| 30000|
| 12| m| 90000| 90000| 90000|
| 13| f| 100000| 100000| 100000|
| 13| m| 60000| 60000| 60000|
| 14| f| 90000| 90000| 90000|
+---+---+--------+--------+--------+
convert df into temp table then play with sql queries
dataSets --> Catalyst optimizer + Tungsten optimizer (CPU cache L1, L2, L3, L4)
frequently used data is cached for faster access
SAP HANA is also in-memory computing
HANA is not distributed
but Spark is massively distributed
in-memory + CPU cache + GPU speed ==> super rocket speed
Quad core example:
if you enable GPU computing, even while using a single-core processor, we can run 64 parallel processes
the single core is divided into 64 sub-cores
if we use 4 cores, we can run 64*4 = 256 parallel processes
with a 4-core CPU it may act like a 256-node cluster because of GPU enabling
multiple layers of GPU
Machine learning algorithms
- Decision Tree
- Random Forest kind of algorithms on 200 GB of data
if we run them in plain R or Python --> it may need 2 days to run
Spark execution is well suited for future machine learning workloads
Spark with Scala
Spark with Python
lots of rich libraries are available
Machine Learning (MLlib) and GraphX algorithms are available
Data Set:
--------
a) DataSet without schema:
-----------------------------
scala> val ds = Seq(1,2,3).toDS()
ds: org.apache.spark.sql.Dataset[Int] = [value: int]
scala> ds.show();
+-----+
|value|
+-----+
| 1|
| 2|
| 3|
+-----+
scala> ds.printSchema
root
|-- value: integer (nullable = false)
scala> ds.map(x => x+10).show();
+-----+
|value|
+-----+
| 11|
| 12|
| 13|
+-----+
// faster operation, with the help of Tungsten
Data Set with Schema:
---------------------
scala> case class Person(name:String, age:Long)
defined class Person
scala> val ds = Seq(Person("Andy",32)).toDS();
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
scala> ds.printSchema();
root
|-- name: string (nullable = true)
|-- age: long (nullable = false)
scala> ds.show();
+----+---+
|name|age|
+----+---+
|Andy| 32|
+----+---+
scala> val ds = Seq(Person("Andy",32),Person("Raja",28),Person("Ravi",29)).toDS();
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
scala> ds.show();
+----+---+
|name|age|
+----+---+
|Andy| 32|
|Raja| 28|
|Ravi| 29|
+----+---+
Play with Json and DataSet:
-----------------------------
create a json file in local linux:
[cloudera@quickstart ~]$ cat >sample.json
{"name":"Hari","age":30}
{"name":"Latha","age":25}
{"name":"Mani","age":23}
copy it into hdfs:
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal sample.json Sparks
display the content using cat:
-----------------------------
[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/Sparks/sample.json
{"name":"Hari","age":30}
{"name":"Latha","age":25}
{"name":"Mani","age":23}
Example: reading a json file using Spark (read.json returns a DataFrame)
scala> val rdd = sqlContext.read.json("/user/cloudera/Sparks/sample.json")
rdd: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> rdd.foreach(println)
[30,Hari]
[25,Latha]
[23,Mani]
scala> rdd.printSchema
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
DataSet with json
-----------------
scala> case class Person(name:String, age:Long)
defined class Person
scala> Person
res206: Person.type = Person
scala> val ds = sqlContext.read.json("/user/cloudera/Sparks/sample.json").as[Person] // as[Person] creates dataSet
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
scala> ds.printSchema
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
scala> ds.show();
+-----+---+
| name|age|
+-----+---+
| Hari| 30|
|Latha| 25|
| Mani| 23|
+-----+---+
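Since ds is a typed Dataset[Person], ordinary Scala lambdas work on it (a small sketch):
ds.filter(p => p.age > 24).show()
ds.map(p => p.name.toUpperCase).show()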
Word count example:
------------------
RDD approach:
val lines = sc.textFile("......file.txt")
val words = lines.flatMap(_.split(",")).filter(_ != "")
val counts = words.groupBy(_.toLowerCase).map(w => (w._1, w._2.size))
dataSet approach:
-------------------
val lines = sqlContext.read.text("......file.txt").as[String]
val words = lines.flatMap(_.split(",")).filter(_ != "")
val counts = words.groupBy(_.toLowerCase).count()
Spark Streaming:
----------------
flume / storm / kafka / spark
Types of Big Data Applications:
-------------------------------
Online, Live Streaming, Micro Batching, Batch
Online
-----------
User Interactive
End user's transactions are
recorded here
ATM cash withdrawal
All User Interactive applications
Batch
--------
Non interactive
automated
Bank statement generations
periodically scheduled
Generate reports
All Non interactive applications
Online --> RDBMS
1 lakh transactions / minute hitting it --> an RDBMS can handle that
Java, C#, Python - any app --> RDBMS
crores/trillions of transactions / minute hitting it -> an RDBMS can't handle that
NoSQL came into the picture to handle trillions of transactions / minute
OLTP - OnLine Transaction Process (Online)
OLAP - Online Analytical Process (Batch)
RDBMS can't bear too much of work load
NoSQL supports heavy load and online processing
MongoDB
Cassandra - for apps that don't want to involve Hadoop
Neo4j
Hadoop allows only a sequential reading style
from the beginning of a block to the end of the block
record by record reading - sequential reading
Random reading in Hadoop is not possible
HBase -> row level access --> random reading
If the app needs Hadoop processing, go with HBase
Go with Cassandra if you don't want to use Hadoop
A single RDBMS can solve all our problems,
but a single NoSQL DB can't solve all our problems:
for fixing different kinds of problems, different kinds of NoSQL DBs are needed
Graph DB -> Friend of Friend of Friend --> 3rd level friend in Facebook --> Neo4j
100% schemaless -> MongoDB
Faster Random access with K,V pair -> PNUT
Faster Random access along with columns -> Hbase
Big Table (Google Big Table white paper)
a table can have 40 lakh columns (4 million columns)
we can't keep 40 lakh columns in a single RDBMS table
the RDBMS max column count is about 1024
in a row store, select * from table can even be faster than select <column123>,<column235> from table,
because the engine has to work out the beginning and ending position of each requested column inside every row
Columnar stores
Row Key (associated with column family)
Column Family (associated with column name)
Column Name (associated with Column value)
Column Family (Table)
RDBMS and JOINs
----------------
We will be having multiple tables in an RDBMS:
Employee, Department, Manager, Attendance tables -
we will be doing JOINs to combine multi-table info to get the desired results
Big Data with JOINs
-------------------
In Big Data the left side table (Table A) may have 1 crore records
and the right side table (Table B) may have 10 lakh records
It will be very difficult to do joins with this huge number of rows.
(10 lakh x 1 crore comparison - very difficult)
Big big tables with OLTP joins are very bad
Cassandra-kind of systems eliminate this kind of join:
all these independent tables (Employee, Department, Manager, Attendance) are kept as column families
a single Cassandra or HBase table will have all four of the above tables as column families
This process is called denormalisation
Generally, if we use an RDBMS as our backend, denormalization is very bad for online systems
because a single RDBMS table can have at most about 1024 columns,
so if we need more than 10K columns we definitely need 100s of tables.
100 tables x 100 columns, or (1024 columns x 10 tables at most)
Employee table (250 columns)
Department (120)
Attendance (150)
Manager (150)
and some more tables with 200s of columns
if we denormalize all tables and put them into a single table, that is impossible:
if the total number of columns of all our tables exceeds 1024, we can't denormalize them in an RDBMS
If I have a denormalized table, I don't need to do any joins
In OLTP (RDBMS) systems denormalization is not possible because of the maximum allowed column count of a single table
schemaless (structureless , flexible schema, dynamic schema)
Row by row, the number of columns differs
Fresher
Experienced
Married
Different kinds of rows have different kinds of columns
Products in Amazon:
Book -> Author, Publisher, Date, pages, ISBN
Toy -> Mfd Date, Material, Age
Pickle -> Expiry date, Mfd Date, Veg/NonVeg, etc.
Each and every product has its own attributes
it's very difficult to put all kinds of products in a single (RDBMS) table
NoSQL transactional DBs are good at handling a different number of columns for each row (each product)
Streaming:
To capture a large number of transactions at a high rate of speed
without the user's knowledge, a lot of data is captured, collected, stored and analyzed
(User event logs)
Security Camera:
the CAM keeps capturing and recording
today morning 10 MB
tomorrow 1 GB
next month 1 TB .. streaming means it keeps on capturing and writing
Logs
Web log
DB log
OS Logging
App log
Log files keep growing; without user interaction, log info keeps being captured and written
Logs are used to analyze server problems
But nowadays logs are also used to understand user behaviour
which pages the user visited
(example) last night my bike broke down and I went home by bus
this morning, Google sent me a message: this morning's first bus is at 7:15 AM
Bharath Matrimony:
seeking for matches
10:30 AM
Match 1: Rani
Looking Rani's profile
10:40 AM
Match 2: Veni
Looking Veni's profile
11:00 AM
Match 3: Kajol
Looking Kajol's profile
11:28 AM
Match 1: Rani (he is revisiting it)
how long he viewed a profile: that profile attracted our guy
how many times he revisited the same profile:
Recommendation engine using machine learning
Giri's nearest neighbour is Vani,
so show Vani's info as a recommendation for our guy
taste and preference based recommendation
all recommendation engine algorithms take the user logs, analyze them, and find the recommendations
Amazon, Bharath Matrimony, FlipKart, NetFlix etc.,
all of a user's log info is helpful for making recommendations
Static, non-streaming:
the downloaded file size is always the same (e.g. a movie)
static (not streaming)
Streaming:
Flume
Storm
Kafka
Spark Streaming
each and every tool was developed to solve a different purpose
Flume for Log analysis
Kafka for Transaction processing with small analysis
Storm : Transaction processing with Heavy analysis
Flume:
faster stream capture applications
not part of Hadoop
separate cluster
independent system
a faster stream capture system
a 100% delivery guarantee is not there
no guarantee that each and every event will reach the destination (target)
Streaming
Source (app)
------>
Flume
---->
JMS, Hadoop (Destination)
1000 Events generated
Flume
800 events delivered at destination
Flume Agent (Source, Channel, Sinks)
Source (App) - each event is immediately captured and buffered in the Channel
event-based / volume-based threshold
3000 events --> transferred to ---> Hadoop
for fault tolerance, Flume recommends channel1 and channel2
C1 (3000 events) C2 (3000 events) --->
it takes some time to write into C1 and C2
a delayed process - it can fail here
100% guarantee of successful delivery - a big no
Flume is not recommended for sensitive, commercial apps (online banking, credit card transactions)
Good for doing Bahubali 2 sentiment analysis (data taken from twitter)
twitter, youtube, facebook user sentiment analysis (flume is good at this)
Banking transactions, credit card processing - Flume is very bad at this
Storm is not an alternative to Flume
but Kafka is an alternative to Flume
Kafka can stream things + it can act as a messaging system
Kafka is a streaming and message (brokerage) system
Kafka has the solution for Flume's issue:
Kafka gives a 100% delivery guarantee
Kafka handles high loads of transactions
if 10 different sources are hitting Flume, it will be slow
if 100s or 1000s of sources are hitting Kafka, it can handle them
Kafka is very powerful
Kafka promises 100% one-time delivery
Kafka is well capable of handling 'n' number of sources
LinkedIn can handle 3 trillion events (transactions) per 10 seconds using high-end systems (their secret)
Messaging - Brokerage systems
Publishers
Kafka
Subscriber
Buyer -> Kafka -> Seller
Buyer informs to Mediator
Mediator broadcast the msg to Seller
Before Kafka
Websphere MQ
RabbitMQ
Tibco
WebMethods
JMS (Java Messaging System)
Web services
App1 to App2 communications
App1 (C++) to App2 (C++) communications
App1 (Java) to App2 (Java) communications
Web services:
App to App communications (java to .net) (.net to python)
CRM
Seibel
Finance
Oracle
Sales
SAP
Web services help to communicate between different kind of apps written in different platforms, different languages
Apps:
Sales, Inventory, Finance, HR
Sales: products sold (products outflow), cash in (cash inflow)
Finance: the cash total is updated here
If Sales App talks with Finance that's good for updating cash matters
Sales App (C#) talks with Finance App (Java) via web services
Whatever money is received via the Sales App will be updated in the Finance App.
Everything runs smoothly.
But what if the Finance App is down?
The Sales cash updates won't be reflected in the Finance App because the second app is down.
So queue systems came in
(with priorities)
App1 tries to communicate with App2 (but App2 is down at that time)
but the queue system keeps App1's update
Once App2 is up, the queue system immediately delivers App1's updates to App2
JMS
Message Queuing, Message Brokerages
WebSphereMQ, RabbitMQ, JMS
All these queuing / message brokering systems can't handle heavy loads of transactions and events
How many nodes we can connect in Hadoop?
Unlimited
Kafka is a largely distributed system
RabbitMQ is also Distributed
Kafka has topics, RabbitMQ has queues
in Kafka a single topic can be spread across multiple nodes
but in RabbitMQ a single queue is kept on a single node
4 different queues can be distributed across 4 different nodes in RabbitMQ
but 1 queue can't be distributed across 4 nodes
RDDs are logical but partitions are physical
partition can be replicated in multiple machines
HDFS files are logical but blocks are physical
One year ago:
App1 is the source
App2,3,4, are destinations
meaning client requirement is App1 wants to communicate with App2,App3,App4 only
As of Now:
App1 is the source
App2.....App100 are destinations
meaning App1 wants to communicate with App2... App100
After 3 years:
App1 is the source:
App2... App10000 are destinations
meaning App1 wants to communicate with App2... App10000
in the traditional approach, lots and lots of code changes are needed for App1 to talk to all the other apps.
But Kafka simplifies everything: without writing lots and lots of code,
App1 can talk with any number of destinations
After sometime
A1 ... A2...A10000
X1
Y1
Z1 ... A2.. A10000
Lots and lots of Sources and lots and lots of Destinations are supported in Kafka
so, Kafka doesn't allow direct communication between source and destinations
In old style,
One app will talk with other app.
But in Kafka, App never directly talk with other app (source to dest)
But Source will talk with Broker and it doesn't know who is going to consume
Destination will talk with Broker and it doesn't know who delivered the message
App1 -- > Broker --> App2
Scaling
Old : Vertical scaling
increasing RAM, disk, or processor on the same system
existing 8GB, now add one more 8GB
existing 1 TB hdd, now add 2 TB more
We can't afford this indefinitely
(for a laptop, we can't keep putting in more TBs)
-- availability, compatibility issues
New : Horizontal scaling
increasing nodes
I want 100 TB
added 100 nodes with 1 TB each
Application scalability
Jio Number of apps are more
Beginning
1 source 3 targets
after some time
100 sources 1000 targets
because of new apps, I don't need to change existing apps.
Large scaling:
1000s of apps as sources and 1000s of apps as destinations
WebSphere can't handle a lakh transactions / events per second
high loads of transactions and events are supported (by Kafka)
Kafka ==> Streaming + Messaging --> Kafka is Superb
Kafka is bad at live analytics, but Storm is good at this
Storm is very good at Live analytics
Kafka is the replacement for Flume
But Kafka is not the replacement for Storm
Storm can do Live analytics (big complex algorithm)
Kafka is the message broker:
the buyer doesn't know who the best seller is
the seller doesn't know who is going to buy
Flume
can do only streaming, can't act as brokerage
a 100% delivery guarantee is not there
kafka
streaming + messaging
very large broker apps
1000s of sources and 1000s of target apps
high transaction load support
it can't perform live analytics on the captured events
it can capture an object, but it can't say whether the entered person is male or female
my app should detect: is the entered object male or female?
Kafka can capture any event - but it can't analyze the input itself
Credit card transactions :
Kafka can capture the transaction but it can't decide whether it is a genuine or a fraudulent transaction
Storm:
an ML algorithm / any given algorithm can be run on the captured events
but Storm can't capture streams at a very high rate
if the incoming rate is slow, Storm can cope
if the capture rate is very fast, Storm can't handle it
Storm is good at doing Live Analytics
Kafka can't do Live analytics but It can capture streaming very fast
Storm can do live analytics but it can't capture streaming very fast
Both integration is required in the industry
Micro batching -> this is Spark Streaming's purpose
Spark Streaming is not a replacement for Kafka
Batch:
user non-interactive applications
at a particular interval, a batch program triggers and fetches data from an external source
it generates a report or does its own updates :: that is batch
monthly once, twice a month, weekly once, daily once, hourly, down to seconds
micro batching ranges from roughly hourly down to every few seconds
e.g. for every 5 seconds, for every 10 seconds : how much money was deposited?
Hadoop is good at batch processing, but it needs the data to be available in HDFS (a file system)
for every 15-20 minutes --> Hadoop can do batch processing very well
for every 5-10 seconds --> Spark Streaming is the right fit
capturing live events -> Kafka
live analytics on captured events -> Storm
micro batching (seconds to minutes) --> Spark Streaming
batch analytics -> Hadoop + Spark
live transaction capturing -> NoSQL
no single NoSQL database is enough to fix all your problems
Cassandra - no need of Hadoop
if an organization wants to migrate its existing RDBMS to NoSQL for its OLTP workloads,
it has to use different kinds of NoSQL DBs
Document storage:
MongoDB, CouchDB
Columnar:
Cassandra
Graph storage:
Neo4j
Key-value:
PNUTS
each NoSQL database is an expert in its own particular area
Live analytics -> Storm
Storm + Trident -> micro batching was done this way earlier
but with Storm + Trident, application development is complex and needs more time to develop
if the number of sources grows into the hundreds, Spark Streaming alone is slower at capturing
Kafka + Spark Streaming (benefits):
faster capturing of events, messaging
micro batching
Credit card point of view:
all the transactions will be captured by Kafka and stored in the brokers
every 10 seconds : how much money was deposited?
Facebook feeds, Twitter tweets - sentiment analysis
word frequency - Spark Streaming can do this
good word / bad word live analytics - Storm is needed
Twitter tweets,
Facebook feeds --> Kafka --> Storm --> Kafka --> Spark Streaming --> Kafka
fast capture
live analytics
micro batch
Kafka, Storm, Spark Streaming, NoSQL
Hadoop -> provides a solution for batch processing (all offline data)
Hadoop Architect vs Big Data Architect
end-to-end journey of big data applications:
NoSQL stores all the transactions
Kafka captures them and puts them into the broker
Storm does the live analytics
Spark Streaming does the micro batches
the final results are put into NoSQL (HBase)
1) a credit card transaction is done
Kafka captures it
2) it is passed to Storm, and Storm analyzes whether it is a fraudulent or genuine transaction
3) the Storm API writes Storm's results back into Kafka
because other applications interact only with Kafka, not with Storm, the result needs to be passed into Kafka
so other applications access the Storm results from the Kafka cluster
4) Spark Streaming streams the Storm results from Kafka and performs micro batching:
for every 10-15 seconds, how many fraudulent and how many genuine transactions happened?
5) Spark Streaming rewrites its results into Kafka again
Hadoop and other applications can then access the micro-batch results and the Storm results
so Hadoop can perform batch processing when needed
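As a rough sketch of that hand-off in Scala: any downstream app (a Hadoop job, a dashboard) just consumes the results topic from Kafka and never talks to Storm or Spark directly. The topic name "fraud-results" and the group id below are illustrative placeholders.

// Minimal Kafka consumer sketch (Scala, standard Kafka Java client)
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object ResultsReader {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "reporting-app")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("fraud-results"))
    while (true) {
      // poll the broker; the consumer never talks to Storm or Spark Streaming directly
      val it = consumer.poll(Duration.ofSeconds(1)).iterator()
      while (it.hasNext) {
        val r = it.next()
        println(s"${r.key} -> ${r.value}")
      }
    }
  }
}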
the industry is building automation, but the industry itself is not yet automated
Attendance system:
separate attendance counts for males and females
with swiping, a friend can swipe on your behalf, so detection has to happen at the door
one object enters
Kafka captures the object (with the help of IoT integration with Kafka)
the IoT APIs send the object to Kafka
the object is available in a Kafka topic
Storm fetches from the topic and finds whether the object is human or not
and whether the human is male or female (Storm performs this)
(Kafka can't do the following)
1st object is a dog - ignore it - label it as Other
2nd object is a human -
verify male or female
Storm performs this classification - the label is Male / Female - and rewrites it into the Kafka topic
Spark Streaming fetches the Storm results from the Kafka topic,
and for every 1 hour or every 30 minutes:
how many persons are male?
how many persons are female?
Spark Streaming does that micro batching to find the count of males and females who entered
CM meeting:
if many people are flowing out, he can tell the meeting is boring
Demonetization:
RBI is interested, for every 15 seconds, in how much money has been deposited
that needs micro batching
an individual bank doesn't need micro batching, but RBI does
male inflow / outflow, female inflow / outflow
Hadoop, at the end of the day, is interested in the total number of males and females per day
Hadoop processes the batch, and the batch results are sent back to Kafka
Hadoop is one of the solutions in Big Data
Online transactions
Online streaming
Live analytics
Micro batching
Batch analytics
if your environment supports all of the above 5 areas, you have a full-fledged big data environment
Hadoop is older than SAP HANA
SAP HANA has live projects
many Big Data "live projects" are really just POCs
mostly it is migration from existing RDBMS to Big Data
Teradata, Oracle --> Sqooped into Big Data
Hadoop:
-------
transactions must first be recorded in a database or a file
later : sqoop import
known and unknown batch processes
results are kept in Hive
these results are not accessed by any other system
Big Data:
everything:
transactions
streaming
live analytics
micro batching
batch
dashboard viewing
Spark Streaming:
----------------
is used to stream data from sources. a source can be a file, a network port, or a remote host
ex. for remote hosts : Twitter tweets, Facebook feeds
or any other application, or other messaging systems like JMS, Kafka
Purpose of Spark Streaming:
micro batching
it streams data from sources, batches it (micro batching) and performs micro-batch analytics
[Sources] --> [Spark Streaming] -> prepare batches -> [Batches] (buffering) -> Spark Core
the analytics is performed by Spark Core
results can be written to a given target
a target can be : HDFS, a Kafka topic, other systems (NoSQL: MongoDB, Cassandra; AWS)
the Spark Streaming data objects are called DStreams - Discretized Streams
a DStream is a continuous series of RDDs
the micro-batching operation is applied on each RDD of the DStream
for every given interval, one RDD is built under the DStream
ex : the micro-batching period is 10 seconds
for every 10 seconds, one RDD is produced under the DStream
as the streaming job runs, these RDDs keep getting generated
these independent RDDs are processed by Spark Core
DStream1
[ 10s | 10s | 10s | 10s | 10s | 10s ]
for every 10s one RDD will be created
RDD6 | RDD5 | RDD4 | RDD3 | RDD2 | RDD1
micro-batch period is 10 seconds
we apply our business logic on the DStream
the logic / transformation is applied to each RDD independently
for every 10 seconds one RDD is created, and that same RDD is processed with the given transformation / business logic independently
01-10 seconds -- RDD7
11-20 seconds -- RDD6
21-30 seconds -- RDD5
31-40 seconds -- RDD4
41-50 seconds -- RDD3
51-60 seconds -- RDD2
61-70 seconds -- RDD1
before the 70-second mark, the streaming process started; for every 10 seconds one new RDD is generated
once a streaming job starts, it never stops unless it is manually stopped or killed
a streaming job never stops on its own
each RDD can be called a batch
who processes the RDD?
Spark Core
Spark Core processes the individual RDDs (meaning each and every micro batch of the 10-second interval)
Batch, Streaming, RDDs
the basic context object is : sc -> SparkContext
later : SQLContext, HiveContext
but here, for streaming -> StreamingContext
ssc - the StreamingContext is used to create DStreams
val ssc = new StreamingContext(sc, Seconds(10))
Seconds(10) --> micro-batch period
every 10 seconds' worth of streamed data is buffered at some worker node of the Spark cluster and prepared as a batch (an RDD of the DStream)
DStream -> continuous series of RDDs
once a batch is prepared, it is submitted to Spark Core
while Spark Core is processing that batch, Spark Streaming keeps collecting data from the sources and prepares the next batches
one batch is prepared and sent to Spark Core
then one more batch is prepared and sent to Spark Core for processing
it is a very bad idea to set a batch period of 1 hour
Security system:
for live analytics Storm is best
every 5-10 seconds : from where is the hacker attacking?
Storm will catch it immediately
the data science team will use CNN / ANN kinds of algorithms
such algorithms can be executed live by Storm
Storm has a special architecture; even a complex algorithm can be executed within a fraction of a second
whenever one object enters a classroom:
is the entered object human or non-human (a dog)?
if it is human : male or female?
that algorithm is not as simple as select sex, sum(sal) from emp
it will be a complex, complicated algorithm
it will be executed live within a fraction of a second
only Storm has that capability
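A rough sketch of what such a per-event classification bolt could look like in Scala (the classifyGender function is a placeholder for whatever model the data science team supplies, and the spout wiring is omitted; this is illustrative, not code from the course):

// Rough shape of a Storm bolt doing per-event classification
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer}
import org.apache.storm.tuple.{Fields, Tuple, Values}

class GenderBolt extends BaseBasicBolt {
  // placeholder: plug the real CNN / ANN model in here
  private def classifyGender(event: String): String = "Unknown"

  override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
    val event = input.getString(0)
    // label each live event as it arrives and emit it downstream
    collector.emit(new Values(event, classifyGender(event)))
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("event", "label"))
}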
for every 5-10 seconds - which locations are being attacked by hackers?
that's called micro batching
Spark Streaming won't perform the micro-batching analytics itself
it only prepares the micro batches
in the form of a DStream (a continuous series of RDDs)
of course, we apply the algorithm on the DStream, and automatically it is applied to each and every RDD of the DStream
while Spark Core is processing some of the RDDs, Spark Streaming won't be idle; it keeps collecting and preparing the queue of next batches
Batch period :
val ssc = new StreamingContext(sc, Seconds(10))
val ds1 = ssc.textFileStream("...")
ssc.socketTextStream("localhost", 9999) -- netcat
listening on and capturing from that port
it captures the stream but it won't pass it to Spark Core immediately
it keeps collecting the stream for 10 seconds, then passes the micro batch to Spark Core only after those 10 seconds
within 10 seconds, say some 100 events arrive
in the meantime, Spark Streaming buffers those 100 events (10 seconds' worth) on some node of the Spark cluster
later it passes them to the core engine
which then processes them and applies the transformation (business logic) on those RDDs
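A minimal way to try this locally (assuming a spark-shell where sc already exists, plus the netcat utility; port 9999 is just an example):

// in another terminal: nc -lk 9999   (then type lines into it)
// everything typed within one 10-second interval becomes one RDD of the DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
lines.print()            // output operation
ssc.start()
ssc.awaitTermination()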
1st second - event 1 --> "I love you"
5th second - event 2 --> "you love me"
9th second - event 3 --> "He loves you"
all of the above are batched together into a single RDD after 10 seconds
ds1 -> the DStream holding these RDDs
our code doesn't handle a particular RDD directly
it is written against the DStream
Flatten -> split by space
val ds2 = ds1.flatMap(x => x.split(" "))
flatMap turns an array of arrays into a single array
it flattens a multi-level (nested) collection into a single collection
a regular expression can be applied there so that continuous white spaces are handled as well
the result, ds2, is a continuous series of RDDs
if we take one particular RDD, the data will be an array
like : Array(I, love, you, you, love, me, He, loves, you)
all together that is RDD1
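As a quick local illustration of why flatMap (and not map) is used here, with plain Scala collections rather than course code:

val lines = Seq("I love you", "you love me", "He loves you")

// map keeps the nesting: Seq(Array("I","love","you"), Array("you","love","me"), ...)
val nested = lines.map(_.split(" "))

// flatMap flattens into one collection of words:
// Seq("I","love","you","you","love","me","He","loves","you")
val words = lines.flatMap(_.split(" "))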
Spark Core is splitting and performing the word count; in the meantime Spark Streaming
won't be idle, it keeps collecting the events happening live
collecting the next 10 seconds of data and buffering it
whenever we perform some transformation / filter on an RDD, what do we get?
we get one more RDD
but here we performed the transformation on a DStream, so what do we get?
we get one more DStream
the object type of the result is DStream
val ds3 = ds2.map ( x => (x,1))
(I,1 )
(love,1 )
(you,1 )
(you,1 )
(love,1 )
(me,1 )
(he,1 )
(loves,1)
(you,1 )
from an array into (key, value) tuples
ds3's RDDs contain tuples
we have created a pair RDD (per batch)
ds3 is one more DStream
val ds4 = ds3.reduceByKey(_+_);
( I,1 )
( love,2 )
( you,3 )
( me,1 )
( he,1 )
( loves,1)
Spark Core generated the result for the 1st batch of the DStream (the 1st RDD)
Spark Streaming keeps generating micro batches and sending them to Spark Core
like above, Spark Core processes each batch and sends back the result of the word count (transformation)
for each batch, Spark Core does the processing and the results aggregation
Who performed processing?
Spark Core
Spark Streaming :
independently collects the events, prepares them as a batch for every given interval, and passes them to Spark Core
Spark Core does the processing
ds4.print()
Steps involved
#1. Context object creation - in the context we specify the micro-batch period
#2. DStream preparation -- the root must connect with a source
#3. Output operation
#4. Start the streaming
how to start the streaming?
ssc.start()
static file - textFileStream
port - socketTextStream
Kafka - Kafka stream - needs KafkaUtils - the related libraries must be embedded (see the sketch below)
JMS - Java Message Service - the related libraries must be embedded
a stream that does not take its input from another stream is called the root
of ds1, ds2, ds3, ds4 the root is ds1
the root should connect with a source
a source can be a file, a port, another application, or another streaming system like Kafka, JMS or Flume
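For the Kafka case, a hedged sketch using the spark-streaming-kafka-0-10 integration (the broker address, group id and topic name "app1-events" are placeholders; ssc is the StreamingContext from above):

// reading a Kafka topic as a DStream (spark-streaming-kafka-0-10 integration)
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "spark-streaming-demo"
)

val kafkaStream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("app1-events"), kafkaParams))

// the root DStream: each record's value is the raw event payload
val events = kafkaStream.map(record => record.value)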
once the DStream is prepared, nothing happens by itself; we need to specify an output operation
output into a file, HDFS:
ds4.saveAsTextFiles("...hdfs location prefix")
(on a DStream the method is saveAsTextFiles; there is no direct saveAsParquetFile - Parquet output would go through DataFrames / foreachRDD)
just want to see the output on the console? that's also an output:
ds4.print()
to write into a Kafka topic,
we need to embed the producer-related code (see the sketch below)
the target can be anything
the result can be rewritten into Kafka
advantage : other 3rd-party applications can fetch the results generated by Spark Core
Kafka is usually reachable from all the other applications of an organization
they will apply the consumer APIs on the target topic
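A common pattern for writing a DStream's results back to Kafka (sketch only; the topic name "wordcounts" is a placeholder) is to open a producer per partition inside foreachRDD, so the producer is created on the executors rather than the driver:

// writing each micro-batch result of ds4 (word, count) back to a Kafka topic
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

ds4.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    partition.foreach { case (word, count) =>
      producer.send(new ProducerRecord[String, String]("wordcounts", word, count.toString))
    }
    producer.close()
  }
}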
val ssc = new StreamingContext(sc, Seconds(10))
// Seconds(10) is the micro-batch period (Minutes(10) etc. would also work)
val ds1 = ssc.socketTextStream("localhost", 9999)
// ds1 - the root DStream
val ds2 = ds1.flatMap(x => x.split("\\s+"))
// \\s+ : one or more (continuous) white spaces
val dspair = ds2.map(x => (x, 1))
val dsres = dspair.reduceByKey(_ + _)
dsres.print()
ssc.start() // to start the StreamingContext
for every 10 seconds, we will get different results
if the source is not generating any events, the batch (input) will be empty
if the batch is empty, the output will also be empty
but the job will keep on running
one batch is being collected while another batch is processed and printed
we can apply a machine learning algorithm, business logic, any transformation
K-Means, supervised / unsupervised learning, linear regression
the algorithm is in your hands - apply whatever you need (see the transform sketch below)
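One way to plug arbitrary per-batch logic (a filter, a scoring function, an MLlib model) into the pipeline is DStream.transform, sketched here with a hypothetical score function standing in for the real algorithm (ds2 is the words DStream from above):

// applying custom per-batch logic with transform; score() is a stand-in
def score(word: String): Double = if (word.length > 4) 1.0 else 0.0

val scored = ds2.transform { rdd =>
  rdd.map(word => (word, score(word)))   // runs on each micro-batch RDD
}
scored.print()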
Windowing and Sliding:
----------------------
the faculty is interested in daily attendance (1 day) - micro batch
the principal visits once in 2 days - sliding
when he visits, he wants to see 4 days of attendance details - windowing
micro batch (10 seconds), sliding (20 seconds), windowing (40 seconds)
Sliding #1
Microbatch #1 : 01 to 10
Microbatch #2 : 11 to 20
Sliding #2:
Microbatch #3 : 21 to 30
Microbatch #4 : 31 to 40
Sliding #3:
Microbatch #5 : 41 to 50
Microbatch #6 : 51 to 60
Sliding #4:
Microbatch #7 : 61 to 70
Microbatch #8 : 71 to 80
Sliding #5:
Microbatch #9 : 81 to 90
Microbatch #10 : 91 to 100
Windowing:
Windowing #1
Sliding #1
Microbatch #1 : 01 to 10
Microbatch #2 : 11 to 20
Sliding #2:
Microbatch #3 : 21 to 30
Microbatch #4 : 31 to 40
Windowing #2:
Sliding #2:
Microbatch #3 : 21 to 30
Microbatch #4 : 31 to 40
Sliding #3:
Microbatch #5 : 41 to 50
Microbatch #6 : 51 to 60
Windowing #3:
Sliding #3:
Microbatch #5 : 41 to 50
Microbatch #6 : 51 to 60
Sliding #4:
Microbatch #7 : 61 to 70
Microbatch #8 : 71 to 80
Windowing #4:
Sliding #4:
Microbatch #7 : 61 to 70
Microbatch #8 : 71 to 80
Sliding #5:
Microbatch #9 : 81 to 90
Microbatch #10 : 91 to 100
--------------
--------------------
------------------------
--------------------------
-----------------------
---------------------
these ---- dashed lines represent windowing (overlapping windows)
the faculty is responsible for taking attendance once a day (daily attendance) - micro batch
the principal is responsible for monitoring the faculty's attendance once in 2 days -- sliding
the principal is interested in looking at 4 days of attendance - windowing
whenever Spark Core has processed a particular RDD, that RDD is immediately discarded
we need to persist 4 RDDs to do the windowing
windowing is a set of micro batches
sliding is the interval at which the window batch is computed
dspair.reduceByKeyAndWindow(_ + _, Seconds(40), Seconds(20))
Seconds(40) -> window, Seconds(20) -> sliding (in the API the window length comes before the slide interval)
reduceByKeyAndWindow is valid only on the DStream API
but we need to keep dspair.persist()
without persist we can't do windowing
while persisting, keep only the window's worth of data
don't keep all RDDs persisted
once a window has been computed, the data it no longer needs will be discarded
val ssc = ...
val ds1 = ssc. ...
val ds2 = ds1. ...
val dspair = ds2.map(...)
dspair.persist()
val rep1 = dspair.reduceByKey(_ + _)
val rep2 = dspair.reduceByKeyAndWindow(_ + _, Seconds(40), Seconds(20))
rep1.print()
rep2.print()
ssc.start()
rep1 : output for every 10-second micro batch
rep2 : output every 20 seconds (the slide), each covering the last 40 seconds (the window)
when the window period and the sliding period are the same, the windows do not overlap (tumbling windows):
__________
___________
___________
___________
__________
here it won't keep the entire DStream persisted; it always keeps only the most recent window's worth (40 seconds) persisted
the remaining unnecessary data is removed
reduceByKeyAndWindow is only available on DStreams; it takes
a. the cumulative (reduce) operation
b. the window interval
c. the sliding interval
the intervals are given as durations, e.g. Seconds(...)
the sliding period must be >= the micro-batching period
the windowing period must be >= the sliding period
demo : typing into the socket (netcat)
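Putting the windowed version together as a hedged, self-contained sketch (spark-shell assumed, so sc already exists; port 9999 and the checkpoint path are placeholders; note that in the Scala API the window length comes before the slide interval):

// windowed word count: 40-second window, evaluated every 20 seconds, on 10-second micro batches
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("/tmp/streaming-checkpoint")   // required if you later use the inverse-function variant

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split("\\s+"))
val pairs = words.map(w => (w, 1))
pairs.persist()

val perBatch  = pairs.reduceByKey(_ + _)                                               // every 10 s
val perWindow = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(40), Seconds(20))

perBatch.print()
perWindow.print()
ssc.start()
ssc.awaitTermination()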
--------------------------------------------------------------------
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Basics").getOrCreate()
df = spark.read.json("people.json")
df.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
df.describe().show()
+-------+------------------+-------+
|summary| age| name|
+-------+------------------+-------+
| count| 2| 3|
| mean| 24.5| null|
| stddev|7.7781745930520225| null|
| min| 19| Andy|
| max| 30|Michael|
+-------+------------------+-------+
df.columns
['age', 'name']
df.columns[0]
'age'
df.columns[-1]
'name'
df.describe()
DataFrame[summary: string, age: string, name: string]
from pyspark.sql.types import (StructField,StringType,
IntegerType,StructType)
data_schema = [StructField("age",IntegerType(),True),
StructField("name",StringType(),True)]
final_struct = StructType(fields=data_schema)
df = spark.read.json("people.json",schema=final_struct)
df.printSchema()
root
|-- age: integer (nullable = true)
|-- name: string (nullable = true)
df.describe().show()
+-------+------------------+-------+
|summary| age| name|
+-------+------------------+-------+
| count| 2| 3|
| mean| 24.5| null|
| stddev|7.7781745930520225| null|
| min| 19| Andy|
| max| 30|Michael|
+-------+------------------+-------+
df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
df["age"]
Column<b'age'>
type(df["age"])
pyspark.sql.column.Column
df.select("age")
DataFrame[age: int]
df.select("age").show()
+----+
| age|
+----+
|null|
| 30|
| 19|
+----+
df.head(2)
[Row(age=None, name='Michael'), Row(age=30, name='Andy')]
df.head(2)[0]
Row(age=None, name='Michael')
df.head(2)[1]
Row(age=30, name='Andy')
df.select("age").show()
+----+
| age|
+----+
|null|
| 30|
| 19|
+----+
df.select(["age","name"])
DataFrame[age: int, name: string]
df.select(["age","name"]).show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
df.withColumn("nameAge",df["age"]).show()
+----+-------+-------+
| age| name|nameAge|
+----+-------+-------+
|null|Michael| null|
| 30| Andy| 30|
| 19| Justin| 19|
+----+-------+-------+
df.withColumn("DoubleAge",df["age"]*2).show()
+----+-------+---------+
| age| name|DoubleAge|
+----+-------+---------+
|null|Michael| null|
| 30| Andy| 60|
| 19| Justin| 38|
+----+-------+---------+
df.withColumnRenamed("age","my_new_age").show()
+----------+-------+
|my_new_age| name|
+----------+-------+
| null|Michael|
| 30| Andy|
| 19| Justin|
+----------+-------+
df.createOrReplaceTempView("people")
results = spark.sql("SELECT * FROM people")
results.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
results = spark.sql("SELECT * FROM people WHERE age = 30")
results.show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
spark.sql("SELECT * FROM people WHERE age = 30").show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ops').getOrCreate()
df = spark.read.csv("AAPL.csv",inferSchema=True,header=True)
df.head(2)[0]
Row(Date=datetime.datetime(2018, 2, 13, 0, 0), Open=161.949997, High=164.75, Low=161.649994, Close=164.339996, Adj Close=164.339996, Volume=32549200)
df.printSchema()
root
|-- Date: timestamp (nullable = true)
|-- Open: double (nullable = true)
|-- High: double (nullable = true)
|-- Low: double (nullable = true)
|-- Close: double (nullable = true)
|-- Adj Close: double (nullable = true)
|-- Volume: integer (nullable = true)
df.show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-13 00:00:00|161.949997| 164.75|161.649994|164.339996|164.339996|32549200|
|2018-02-14 00:00:00|163.039993|167.539993|162.880005|167.369995|167.369995|40644900|
|2018-02-15 00:00:00|169.789993|173.089996| 169.0|172.990005|172.990005|51147200|
|2018-02-16 00:00:00|172.360001|174.820007|171.770004|172.429993|172.429993|40176100|
|2018-02-20 00:00:00|172.050003|174.259995|171.419998|171.850006|171.850006|33930500|
|2018-02-21 00:00:00|172.830002|174.119995|171.009995|171.070007|171.070007|37471600|
|2018-02-22 00:00:00|171.800003|173.949997|171.710007| 172.5| 172.5|30991900|
|2018-02-23 00:00:00|173.669998|175.649994|173.539993| 175.5| 175.5|33812400|
|2018-02-26 00:00:00|176.350006|179.389999|176.210007|178.970001|178.970001|38162200|
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
|2018-02-28 00:00:00|179.259995|180.619995|178.050003|178.119995|178.119995|37782100|
|2018-03-01 00:00:00|178.539993|179.779999|172.660004| 175.0| 175.0|48802000|
|2018-03-02 00:00:00|172.800003|176.300003|172.449997|176.210007|176.210007|38454000|
|2018-03-05 00:00:00|175.210007|177.740005|174.520004|176.820007|176.820007|28401400|
|2018-03-06 00:00:00|177.910004| 178.25|176.130005|176.669998|176.669998|23788500|
|2018-03-07 00:00:00|174.940002|175.850006|174.270004|175.029999|175.029999|31703500|
|2018-03-08 00:00:00|175.479996|177.119995|175.070007|176.940002|176.940002|23774100|
|2018-03-09 00:00:00|177.960007| 180.0|177.389999|179.979996|179.979996|32185200|
|2018-03-12 00:00:00|180.289993|182.389999|180.210007|181.720001|181.720001|32162900|
+-------------------+----------+----------+----------+----------+----------+--------+
df.head(3)[2]
Row(Date=datetime.datetime(2018, 2, 15, 0, 0), Open=169.789993, High=173.089996, Low=169.0, Close=172.990005, Adj Close=172.990005, Volume=51147200)
df.filter("Close = 172.5").show()
+-------------------+----------+----------+----------+-----+---------+--------+
| Date| Open| High| Low|Close|Adj Close| Volume|
+-------------------+----------+----------+----------+-----+---------+--------+
|2018-02-22 00:00:00|171.800003|173.949997|171.710007|172.5| 172.5|30991900|
+-------------------+----------+----------+----------+-----+---------+--------+
df.filter("Close > 175").show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-23 00:00:00|173.669998|175.649994|173.539993| 175.5| 175.5|33812400|
|2018-02-26 00:00:00|176.350006|179.389999|176.210007|178.970001|178.970001|38162200|
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
|2018-02-28 00:00:00|179.259995|180.619995|178.050003|178.119995|178.119995|37782100|
|2018-03-02 00:00:00|172.800003|176.300003|172.449997|176.210007|176.210007|38454000|
|2018-03-05 00:00:00|175.210007|177.740005|174.520004|176.820007|176.820007|28401400|
|2018-03-06 00:00:00|177.910004| 178.25|176.130005|176.669998|176.669998|23788500|
|2018-03-07 00:00:00|174.940002|175.850006|174.270004|175.029999|175.029999|31703500|
|2018-03-08 00:00:00|175.479996|177.119995|175.070007|176.940002|176.940002|23774100|
|2018-03-09 00:00:00|177.960007| 180.0|177.389999|179.979996|179.979996|32185200|
|2018-03-12 00:00:00|180.289993|182.389999|180.210007|181.720001|181.720001|32162900|
+-------------------+----------+----------+----------+----------+----------+--------+
df.filter("Close > 175").select("High").show()
+----------+
| High|
+----------+
|175.649994|
|179.389999|
|180.479996|
|180.619995|
|176.300003|
|177.740005|
| 178.25|
|175.850006|
|177.119995|
| 180.0|
|182.389999|
+----------+
df.filter("Close > 175").select(["High","Low","Volume"]).show()
+----------+----------+--------+
| High| Low| Volume|
+----------+----------+--------+
|175.649994|173.539993|33812400|
|179.389999|176.210007|38162200|
|180.479996|178.160004|38928100|
|180.619995|178.050003|37782100|
|176.300003|172.449997|38454000|
|177.740005|174.520004|28401400|
| 178.25|176.130005|23788500|
|175.850006|174.270004|31703500|
|177.119995|175.070007|23774100|
| 180.0|177.389999|32185200|
|182.389999|180.210007|32162900|
+----------+----------+--------+
df.filter(df["Close"] > 175).show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-23 00:00:00|173.669998|175.649994|173.539993| 175.5| 175.5|33812400|
|2018-02-26 00:00:00|176.350006|179.389999|176.210007|178.970001|178.970001|38162200|
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
|2018-02-28 00:00:00|179.259995|180.619995|178.050003|178.119995|178.119995|37782100|
|2018-03-02 00:00:00|172.800003|176.300003|172.449997|176.210007|176.210007|38454000|
|2018-03-05 00:00:00|175.210007|177.740005|174.520004|176.820007|176.820007|28401400|
|2018-03-06 00:00:00|177.910004| 178.25|176.130005|176.669998|176.669998|23788500|
|2018-03-07 00:00:00|174.940002|175.850006|174.270004|175.029999|175.029999|31703500|
|2018-03-08 00:00:00|175.479996|177.119995|175.070007|176.940002|176.940002|23774100|
|2018-03-09 00:00:00|177.960007| 180.0|177.389999|179.979996|179.979996|32185200|
|2018-03-12 00:00:00|180.289993|182.389999|180.210007|181.720001|181.720001|32162900|
+-------------------+----------+----------+----------+----------+----------+--------+
df.filter(df["Close"] > 175).select(["Volume"]).show()
+--------+
| Volume|
+--------+
|33812400|
|38162200|
|38928100|
|37782100|
|38454000|
|28401400|
|23788500|
|31703500|
|23774100|
|32185200|
|32162900|
+--------+
df.filter((df["Close"] > 175) & (df["Volume"] >= 38454000)).show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
|2018-03-02 00:00:00|172.800003|176.300003|172.449997|176.210007|176.210007|38454000|
+-------------------+----------+----------+----------+----------+----------+--------+
df.filter(df["Low"] == 178.160004).show()
+-------------------+----------+----------+----------+----------+----------+--------+
| Date| Open| High| Low| Close| Adj Close| Volume|
+-------------------+----------+----------+----------+----------+----------+--------+
|2018-02-27 00:00:00|179.100006|180.479996|178.160004|178.389999|178.389999|38928100|
+-------------------+----------+----------+----------+----------+----------+--------+
result = df.filter(df["Low"] == 178.160004).collect()
result[0]
Row(Date=datetime.datetime(2018, 2, 27, 0, 0), Open=179.100006, High=180.479996, Low=178.160004, Close=178.389999, Adj Close=178.389999, Volume=38928100)
result[0].asDict()
{'Adj Close': 178.389999,
'Close': 178.389999,
'Date': datetime.datetime(2018, 2, 27, 0, 0),
'High': 180.479996,
'Low': 178.160004,
'Open': 179.100006,
'Volume': 38928100}
result[0].asDict().keys()
dict_keys(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'])
result[0].asDict()["Volume"]
38928100
--------------------------------------------------------------------