
Introduction to Spark
Spark is a unified engine for distributed data processing. It supports both on-premises and cloud installations. Applications written in Spark can automatically scale out to all the available nodes of a cluster to parallelize heavy data processing tasks, be it for data engineering, data science, machine learning or deep learning.
Spark consists of the following four main libraries (as shown in the figure below):
- Spark SQL (Dataframes and Datasets)
- Spark Streaming (Structured Streaming)
- Machine Learning (MLlib)
- Graph Processing (GraphX)
Each of these four components is separate from Spark’s core fault-tolerant engine. You use Spark’s APIs (in Java, Python, Scala or SQL) to write a Spark application, which Spark converts into a DAG (Directed Acyclic Graph) that is executed by the core engine.
Spark code is decomposed into highly compact bytecode that is executed by the JVMs running on each worker node in the Spark cluster.
Spark provides in-memory storage on worker nodes in the cluster for intermediate computations, making it much faster than Hadoop MapReduce, which writes intermediate results to disk instead of keeping them in worker node memory.
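To make this concrete, here is a minimal sketch (the dataset and names are my own, purely illustrative) showing how cache() keeps an intermediate result in executor memory so later actions reuse it instead of recomputing it from the source:

# Minimal caching sketch (illustrative only)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheSketch").getOrCreate()

df = spark.range(0, 1_000_000)                    # a simple generated dataset
squares = df.selectExpr("id * id AS sq").cache()  # mark the result for in-memory reuse

print(squares.count())                        # first action: computes and caches
print(squares.agg({"sq": "max"}).collect())   # second action: reuses the cached data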
Spark provides support for the following programming languages:
- Scala
- SQL
- Python
- Java
- R
There are four main characteristics of the design philosophy of Spark:
- Speed
- Ease of Use
- Modularity
- Extensibility
Spark is so widely acclaimed that its original creators even received an ACM award for their paper describing Spark as a unified engine for big data processing.
Who should use it?
Data Scientists
Spark makes it easy to perform tasks such as clustering, building machine learning models with the MLlib library, and even deep learning.
Apache Spark 2.4 introduced a new gang scheduler, as part of Project Hydrogen, to accommodate the fault-tolerance requirements of training and scheduling deep learning models in a distributed manner, and Spark 3.0 even added support for worker nodes with GPUs.
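To give a flavour of what that looks like in code, below is a minimal MLlib clustering sketch with a tiny made-up dataset (the column names and values are purely illustrative):

# Minimal MLlib clustering sketch (illustrative only)
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("ClusteringSketch").getOrCreate()

data = spark.createDataFrame(
    [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (8.5, 9.0)], ["x", "y"])

# Assemble the feature columns into the single vector column MLlib expects
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(data)

model = KMeans(k=2, seed=42).fit(features)
print(model.clusterCenters())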
Data Engineers
Data engineers often need to process large amounts of raw data so that it can be made usable for further analysis by Data Scientists. Spark provides a simple way to utilize the power of parallel computations to process large amounts of data and hides all the complexity of distribution and fault tolerance.
Spark 2.x introduced an evolved streaming model called continuous applications with Structured Streaming, which enables data engineers to build complex data pipelines that perform ETL on both real-time and static data sources.
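As a taste of the API, here is a minimal Structured Streaming sketch (not a real pipeline) that reads from Spark's built-in rate test source and maintains a running aggregation on the console:

# Minimal Structured Streaming sketch (illustrative only)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamingSketch").getOrCreate()

stream_df = (spark.readStream
             .format("rate")              # built-in test source: emits timestamp, value
             .option("rowsPerSecond", 5)
             .load())

query = (stream_df
         .selectExpr("value % 3 AS bucket")
         .groupBy("bucket")
         .count()
         .writeStream
         .outputMode("complete")          # required for streaming aggregations
         .format("console")
         .start())

# query.awaitTermination()   # uncomment to block until the stream is stopped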
The Catalyst optimizer for SQL and Tungsten for compact code generation have enabled much better performance in Spark 2.x and Spark 3.0.
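You can peek at what Catalyst and Tungsten produce for a query by calling explain() on a DataFrame. A small self-contained sketch (the DataFrame is made up for illustration):

# Inspecting the query plans Catalyst generates (illustrative only)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("ExplainSketch").getOrCreate()
df = spark.createDataFrame([("Diesel", 1), ("Petrol", 2)], ["fuel", "id"])

# explain(True) prints the parsed, analyzed and Catalyst-optimized logical plans
# plus the physical plan; WholeStageCodegen nodes come from Tungsten's code generation.
df.where(col("fuel") == "Diesel").groupBy("fuel").count().explain(True)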
PySpark Installation – Local
Although the real benefit of Spark comes from using it on a cluster of computers, for learning purposes it is very straightforward to install it locally on Windows, macOS or Linux. For up-to-date instructions please follow the steps in the Spark documentation. Below I describe how I installed PySpark on my Windows computer. You can also choose to work with Spark using Scala, R, SQL or Java. And if you want an easy-to-use cluster-based installation, Databricks is the best option to go with.
Getting up and running on Windows
Prerequisite: Anaconda is already installed on your computer.
Step 1: Open a command prompt on your Windows machine.
Step 2: Create a new conda virtual environment named spark (you can use any name you want).
conda create -n spark python=3
Step 3: Activate the newly created virtual environment.
conda activate spark
Step 4: Install PySpark using pip.
pip install pyspark
Step 5: Once it is successfully installed, you can check by typing pyspark at your command prompt; you should get the output below.

And that is it! It’s that simple.
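If you prefer to verify the installation from plain Python rather than the pyspark shell, a quick sanity check could look something like this (the app name is arbitrary):

# Quick installation sanity check (illustrative only)
import pyspark
from pyspark.sql import SparkSession

print(pyspark.__version__)
spark = SparkSession.builder.master("local[*]").appName("InstallCheck").getOrCreate()
print(spark.range(5).count())   # should print 5
spark.stop()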
Now, for experimentation, I prefer to use Jupyter notebooks. You can use VS Code or any other editor of your choice. If you also want to use Jupyter notebooks, install JupyterLab in your spark virtual environment if it is not already there.
pip install jupyterlab
Now you can start your Jupyter notebook with the command below, and it will automatically open a new browser tab. Please follow other tutorials if you are new to Jupyter notebooks.
jupyter lab
I use the following code to test my installation (this code might not be up to date for Spark 3.0, but it works and tests the Spark installation by estimating the value of pi). You can run it in a notebook cell.
%%time
import findspark
findspark.init()

import pyspark
import random

sc = pyspark.SparkContext(appName="myAppName")

NUM_SAMPLES = 1000

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, NUM_SAMPLES)).filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
Understanding Spark High Level Architecture

Spark Application
It is the topmost abstraction layer of a Spark program. The Spark application creates the Spark driver, which manages the parallel operations on the Spark cluster via the Spark session. The driver uses the Spark session to access the cluster manager and the Spark executors.
Spark Driver
The Spark driver is the main orchestrator of the overall program execution. It creates the Spark session and performs the following duties:
- Coordinates with the cluster manager to request hardware resources (CPU, memory, etc.) for Spark’s executors (JVMs)
- Converts all Spark operations into DAGs (Directed Acyclic Graphs)
- Schedules the DAGs
- Distributes the individual tasks in the DAGs across Spark executors
- After resource allocation is done, it communicates directly with the Spark Executors
Spark Session
The Spark session is the single unified entry point to all of Spark’s functionality. Through it you can set JVM runtime parameters, define DataFrames and Datasets, read from data sources, access catalog metadata and issue Spark SQL queries.
Since Spark 2.0, it has subsumed all the previous entry points to Spark, such as SparkContext, SQLContext and StreamingContext, and become the single entry point for all of Spark’s functionality. However, for backward compatibility, you can still access these contexts directly.
A Spark session is automatically created when you start the Spark shell and can be accessed through the global variable spark (the underlying SparkContext is available as sc). For standalone applications you create the Spark session yourself using the high-level APIs, as sketched below.
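For example, a standalone PySpark application might create its Spark session like this (a minimal sketch; the master URL and app name are placeholders):

# Creating a SparkSession in a standalone application (illustrative only)
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .master("local[*]")        # or a cluster URL / resource manager such as yarn
         .appName("MyStandaloneApp")
         .getOrCreate())

sc = spark.sparkContext            # the legacy SparkContext is still reachable from it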
Cluster Manager
Spark supports the following four cluster managers, which are responsible for managing and allocating cluster resources to a Spark application:
- Built-in Standalone cluster manager
- Apache Hadoop YARN
- Apache Mesos
- Kubernetes
Spark may add support for new cluster managers in the future as they become popular.
Spark Executor
This is the workhorse of the Spark application. It runs on each worker node and is responsible for executing individual tasks. Primarily it performs the following duties:
- Communicates with the driver program
- Executes tasks on the worker node
In local mode, the executor gets as many executor cores as the machine’s logical processor count (in cluster deployments the number of cores per executor is configurable). Each core executes one task at a time, so an executor can run that many tasks in parallel on a single worker node. When you run locally, the number of executor cores you see will be the number of logical processors on your machine.
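You can check this yourself with a couple of lines (assuming a SparkSession named spark has already been created, as shown earlier):

# Inspecting the parallelism available locally (illustrative only)
print(spark.sparkContext.master)               # e.g. local[*]
print(spark.sparkContext.defaultParallelism)   # typically equals the logical processor count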
Understanding Data Partitions
In order to understand Spark’s execution strategy, we need to understand how Spark treats the data it processes. In a cluster setup and in big data applications, data is stored in multiple partitions distributed across the physical cluster.

Spark treats each partition as a high-level logical data abstraction, i.e. as a DataFrame in memory. It then uses this logical data model to assign each Spark executor the partitions closest to it in the network, preserving data locality.
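A quick way to see partitions in action (again assuming an existing SparkSession named spark; the numbers are arbitrary):

# Inspecting and changing the number of partitions (illustrative only)
df = spark.range(0, 1_000_000)

print(df.rdd.getNumPartitions())   # partitions Spark chose by default

df8 = df.repartition(8)            # redistribute into 8 partitions (causes a shuffle)
print(df8.rdd.getNumPartitions())  # 8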
Understanding Jobs, Stages and Tasks
The diagram below shows the hierarchy of execution steps a Spark application takes to execute the high-level commands coded in your application.

Jobs
First of all, the Spark driver converts the Spark application into one or more jobs, typically one for each action the application invokes. It then transforms each job into a DAG, which is essentially the execution plan generated by Spark.
Stages
Then, within the DAG, different stages are created based on which operations can be performed serially or in parallel. Most of the time a stage boundary is governed by a computation boundary, i.e. a point where data must be exchanged (shuffled) among Spark executors before the next operation can continue.
Tasks
Each stage consists of Spark tasks (a single unit of execution), which are assigned to individual Spark executor cores, each of which works on a single partition of data. For example, an executor with 12 cores can run 12 tasks on 12 partitions in parallel, which is what gives a Spark application its massive processing power.
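The small sketch below (assuming an existing SparkSession named spark) triggers exactly this behaviour: the reduceByKey shuffle splits the job into two stages, and the 8 partitions become 8 tasks per stage. You can confirm this in the Spark UI described later.

# A job with a shuffle-induced stage boundary (illustrative only)
rdd = spark.sparkContext.parallelize(range(1000), 8)    # 8 partitions -> 8 tasks per stage

pairs = rdd.map(lambda x: (x % 10, 1))                  # narrow transformation: same stage
counts = pairs.reduceByKey(lambda a, b: a + b)          # wide transformation: shuffle -> new stage

print(counts.count())   # the action triggers one job with two stages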
Running your first parallel computation
Now that we have covered basic Spark concepts and completed the local installation, we are ready to write our first Spark application. Let’s fire up our Jupyter notebook and get cracking!
In this example we will use a Kaggle dataset of car sales provided by Car Dekho. You can download the CSV file from Kaggle at the link below. You might need to create a Kaggle account, which is free, if you don’t already have one.
After downloading the data file, I renamed it to replace spaces with “_”.
The purpose of this example is to calculate the count of cars on sale for each fuel type, pretty simple. Below is the code to do that, with comments.
# Create a Spark session and import required functions
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = (SparkSession
         .builder
         .appName('CarAnalysis')
         .getOrCreate())

# Read the cars dataset
cars_df = (spark.read.format("csv")
           .option("header", "true")
           .option("inferSchema", "true")
           .load('CAR_DETAILS_FROM_CAR_DEKHO.csv'))

# Take a peek at the dataset (first 10 rows)
cars_df.show(10, truncate=False)

# Aggregate to the count of cars for sale for each fuel type
cars_count_by_fuel_type = (cars_df
                           .select("fuel")
                           .where(col("fuel").isNotNull())
                           .groupBy("fuel")
                           .count()
                           .orderBy(desc("count")))

cars_count_by_fuel_type.show()

As expected, there are slightly more diesel cars for sale, because diesel is cheaper in India, where Car Dekho is based.
Now, this seemed pretty straightforward, and you could easily do the same in Pandas. But the power of Spark is felt when you have millions of records in your dataset: Spark will parallelize the aggregation in this example across multiple worker nodes.
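As an aside, the same aggregation can equivalently be expressed through Spark SQL, which Catalyst optimizes to essentially the same plan. A sketch that reuses the cars_df DataFrame loaded above:

# Equivalent aggregation expressed in Spark SQL (illustrative only)
cars_df.createOrReplaceTempView("cars")

spark.sql("""
    SELECT fuel, COUNT(*) AS count
    FROM cars
    WHERE fuel IS NOT NULL
    GROUP BY fuel
    ORDER BY count DESC
""").show()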
Spark UI
After running the above application, open the Spark UI in your browser (when running locally it is available at http://localhost:4040 by default).
You will see the details of how Spark processed your application: how many jobs, stages and tasks it was divided into, what the DAG nodes looked like, and how long each stage took to execute. I will cover this UI in detail in upcoming posts, but for now you can play around with it and see the options available. Below is a screenshot of the DAG for our example application.

Conclusion
In this article I covered how to get started with Spark, PySpark in particular. We covered local installation, basic architecture and execution concepts, along with one example application. In upcoming posts I will dive deeper into Spark’s execution model and look at important high-level Structured APIs like DataFrames and Datasets. Until then, Sparkle!!
Let me know your feedback in the comments below.