Saturday, 23 May 2020

PyCharm with PySpark - sample program

SPARK_HOME = /home/hadoop/spark-3.0.0-preview2-bin-hadoop3.2
PYTHONPATH = $SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.8.1-src.zip

Environment variables (set in the PyCharm Run/Debug Configuration for the script):

PYTHONUNBUFFERED=1;SPARK_HOME=/home/hadoop/spark-3.0.0-preview2-bin-hadoop3.2;PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.8.1-src.zip
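Alternatively, the same paths can be set from inside the script before pyspark is imported, which avoids editing the run configuration. A minimal sketch, assuming the Spark install path above and the findspark package (pip install findspark):

import os

# Same Spark installation path as used above.
os.environ["SPARK_HOME"] = "/home/hadoop/spark-3.0.0-preview2-bin-hadoop3.2"

import findspark
findspark.init()  # adds $SPARK_HOME/python and the py4j zip to sys.path

from pyspark.sql import SparkSession  # now importable without a PYTHONPATH entry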


File - New - Project
File - New - Python Package
Go inside the package:
New - Python file (demo.py, programming.py)
demo.py:

from pyspark.sql import SparkSession


def createsparkdriver():
    # Build (or reuse) a SparkSession running in local mode, named "demoApp".
    spark = SparkSession.builder.master("local").appName("demoApp").getOrCreate()
    return spark


programming.py:

from demo import createsparkdriver

if __name__ == "__main__":
    spark = createsparkdriver()
    # Read a multi-line JSON file from HDFS into a DataFrame and display it.
    df = spark.read.format("json").option("multiline", True).load("hdfs://localhost:9000/SparkFiles/orgs.json")
    df.show()
    spark.stop()
Right click - Run
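
The multiline option is needed here because each JSON record in the file spans several lines; without it, Spark expects exactly one JSON object per line. If HDFS is not running, the same read works against the local filesystem. A minimal sketch, using a hypothetical local copy of the file:

# The "multiline" option tells Spark that a single JSON record may span
# several lines (e.g. a pretty-printed array of objects).
df = (spark.read.format("json")
      .option("multiline", True)
      .load("file:///home/hadoop/SparkFiles/orgs.json"))  # hypothetical local path
df.printSchema()           # inspect the inferred schema
df.show(truncate=False)    # show full column values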



Interactive version:

from demo import createsparkdriver

if __name__ == "__main__":
    spark = createsparkdriver()

    # Prompt for the format and path at run time instead of hard-coding them.
    file_format = input("Enter the file format\t : ")
    file_path = input("Enter the input file path\t : ")

    df = spark.read.format(file_format).option("multiline", True).load(file_path)
    df.show()
    spark.stop()

Run it:

Enter the file format : json
Enter the input file path : hdfs://localhost:9000/SparkFiles/orgs.json
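
The same interactive script handles other formats as well; for CSV you would usually also pass header and schema-inference options. A minimal sketch of that variation (standard Spark CSV options, hypothetical file path):

# Hypothetical CSV read using the same SparkSession helper.
# "header" takes column names from the first line; "inferSchema" makes Spark
# infer column types instead of treating every column as a string.
df = (spark.read.format("csv")
      .option("header", True)
      .option("inferSchema", True)
      .load("hdfs://localhost:9000/SparkFiles/orgs.csv"))
df.show()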

