Monday, 25 May 2020

builtins vs import * - avoid wildcard imports in PySpark

Builtins example:

from pyspark.sql.functions import col  # no wildcard import, so the builtin max is not shadowed
from builtins import max

myList = [1,2,5,3,22]
print(max(myList))

22

After restarting the kernel (no imports at all):
myList = [1,2,5,3,22]
print(max(myList))

22

Error here:

from pyspark.sql.functions import *  # the wildcard import shadows Python builtins such as max

myList = [1,2,5,3,22]
print(max(myList))  # fails: max now refers to pyspark.sql.functions.max, which expects a column, not a list
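
A minimal sketch of the same shadowing problem and one way around it. It runs without Spark, so the `max` defined here is a hypothetical stand-in for `pyspark.sql.functions.max` (an assumption for illustration, not the real function). It only mimics the fact that the SQL function does not accept a plain Python list. The builtin stays reachable through the `builtins` module even after the name has been shadowed.

```python
import builtins


def max(col):
    # Hypothetical stand-in for pyspark.sql.functions.max:
    # it shadows the builtin and rejects a plain Python list.
    raise TypeError("expected a column, got %s" % type(col).__name__)


my_list = [1, 2, 5, 3, 22]

# The bare name max now points at the stand-in, so this call fails.
try:
    max(my_list)
    shadowed_call_failed = False
except TypeError:
    shadowed_call_failed = True

# The original builtin is still available via the builtins module.
largest = builtins.max(my_list)

print(shadowed_call_failed)
print(largest)
```

In real PySpark code, a common way to avoid the clash altogether is `from pyspark.sql import functions as F` and then calling `F.max(...)` on columns, which leaves the builtin `max` untouched.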

Flume - Simple Demo

// Create a folder in HDFS:
$ hdfs dfs -mkdir /user/flumeExa
// Create a shell script which generates: Hadoop in real world <n>...