Monday, 25 May 2020

Import only what you need in PySpark - Example (Best Practices)

# don't use import * -- it pulls every function into the namespace and shadows Python built-ins such as sum, min, max, and round
from pyspark.sql.functions import *
df1.select('doj', year('doj'), hour('doj').alias('hour'), month(df1.doj).alias('Month'), minute('doj').alias('Minute')).show()

+-------------------+---------+----+-----+------+
|                doj|year(doj)|hour|Month|Minute|
+-------------------+---------+----+-----+------+
|2014-12-23 23:34:45|     2014|  23|   12|    34|
|               null|     null|null| null|  null|
|2010-01-01 12:34:22|     2010|  12|    1|    34|
+-------------------+---------+----+-----+------+
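
For reference, the df1 used here can be recreated from a small sample (a sketch, not from the original post -- the column name doj and the rows are assumed so the output above can be reproduced):

# minimal setup sketch: build df1 with a single timestamp column 'doj'
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

spark = SparkSession.builder.appName('importDemo').getOrCreate()
df1 = spark.createDataFrame(
    [('2014-12-23 23:34:45',), (None,), ('2010-01-01 12:34:22',)],
    ['doj']
).withColumn('doj', to_timestamp('doj'))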

# best practice - import only the functions you actually use
from pyspark.sql.functions import year, hour, month, minute
df1.select('doj', year('doj'), hour('doj').alias('hour'), month(df1.doj).alias('Month'), minute('doj').alias('Minute')).show()
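
A common alternative (not shown in the original example, but widely used) is to import the functions module under an alias, which keeps every call explicit without listing each function:

# alternative: alias the module instead of importing individual functions
from pyspark.sql import functions as F
df1.select('doj', F.year('doj'), F.hour('doj').alias('hour'), F.month(df1.doj).alias('Month'), F.minute('doj').alias('Minute')).show()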
