Keyvan Soleimani

PySpark & Jupyter Notebooks Deployed On Kubernetes

PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment.

Jupyter Notebook is used to create interactive notebook documents that can contain live code, equations, visualizations, media and other computational outputs. Jupyter Notebook is often used by programmers, data scientists and students to document and demonstrate coding workflows or simply experiment with code.

Kubernetes is an open-source container orchestration system for automating software deployment, scaling, and management, originally designed by Google.


The Kubernetes control-plane and worker node addresses are:

192.168.56.115 192.168.56.116 192.168.56.117 

You can install Helm by following the Helm installation guide.


Install Spark on Kubernetes via the Bitnami Helm chart

The steps:

You can install the Spark Helm chart from the Bitnami chart repository.

Important: the Spark version shipped by the Helm chart must be the same as the PySpark version installed in the Jupyter image.
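
A quick way to check the notebook side is to print the installed PySpark version and compare it with the chart's Spark appVersion (for example via helm show chart bitnami/spark --version 8.7.2). A minimal sketch:

# Print the PySpark version bundled with the Jupyter all-spark-notebook image;
# it must match the Spark version deployed by the Bitnami chart.
import pyspark
print(pyspark.__version__)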

Install Spark via the Bitnami Helm chart:


$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm search repo bitnami
$ helm install kayvan-release bitnami/spark --version 8.7.2

Deploy the Jupyter workloads:

jupyter.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupiter-spark
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark
  template:
    metadata:
      labels:
        app: spark
    spec:
      containers:
        - name: jupiter-spark-container
          image: docker.arvancloud.ir/jupyter/all-spark-notebook
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8888
          env:
            - name: JUPYTER_ENABLE_LAB
              value: "yes"
---
apiVersion: v1
kind: Service
metadata:
  name: jupiter-spark-svc
  namespace: default
spec:
  type: NodePort
  selector:
    app: spark
  ports:
    - port: 8888
      targetPort: 8888
      nodePort: 30001
---
apiVersion: v1
kind: Service
metadata:
  name: jupiter-spark-driver-headless
spec:
  clusterIP: None
  selector:
    app: spark
kubectl apply -f jupyter.yaml 

You can now list the installed pods and services; note the headless services, which are used for the stateful pods.
Note: the Spark master URL is:

spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 
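
Judging from that address, the Bitnami chart names the master pod <release>-spark-master-0 and its headless service <release>-spark-headless, so the URL can be rebuilt from the release name and namespace. A small sketch (the names below simply mirror the URL above):

# Rebuild the master URL from the Helm release name and namespace
# (pattern inferred from the address printed above).
release = "kayvan-release"
namespace = "default"
master_url = (
    f"spark://{release}-spark-master-0."
    f"{release}-spark-headless.{namespace}.svc.cluster.local:7077"
)
print(master_url)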

Open a Jupyter notebook, write some PySpark code, and press Shift + Enter on each cell to execute it:

# import os
# os.environ['PYSPARK_SUBMIT_ARGS'] = 'pyspark-shell'
# os.environ['PYSPARK_PYTHON'] = '/opt/bitnami/python/bin/python'
# os.environ['PYSPARK_DRIVER_PYTHON'] = '/opt/bitnami/python/bin/python'

import socket  # needed for socket.gethostbyname below
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077") \
    .appName("Mahla") \
    .config('spark.driver.host', socket.gethostbyname(socket.gethostname())) \
    .getOrCreate()
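
Once the session is up, a quick sanity check confirms that jobs are actually being scheduled on the cluster; this snippet is illustrative and not part of the original notebook:

# Run a trivial distributed job to confirm that the executors respond.
print(spark.version)
print(spark.range(1000).count())  # should print 1000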

Note: socket.gethostbyname(socket.gethostname()) returns the Jupyter pod's IP address; it is passed as spark.driver.host so that the Spark executors can connect back to the driver running inside the pod.
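
Alternatively, instead of the raw pod IP you could point spark.driver.host at the jupiter-spark-driver-headless service defined in jupyter.yaml, which gives the driver a stable DNS name that the executors can resolve. A hedged sketch, assuming the service and namespace names from the manifest above:

from pyspark.sql import SparkSession

# Use the headless service's DNS name (it resolves to the Jupyter pod's IP)
# as the driver host instead of the pod IP itself.
spark = SparkSession.builder \
    .master("spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077") \
    .appName("Mahla") \
    .config("spark.driver.host",
            "jupiter-spark-driver-headless.default.svc.cluster.local") \
    .getOrCreate()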

Enjoy sending Python code to the Spark cluster on Kubernetes via Jupyter.

Note: of course, you can also work with single-node PySpark installed in Jupyter without Kubernetes, and once you are sure the code is correct, submit it to the Spark cluster on Kubernetes via spark-submit or a SparkSession like the one above.
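
For that local, single-node workflow the session simply runs in local mode; a minimal, illustrative sketch (the app name here is arbitrary):

from pyspark.sql import SparkSession

# Local mode: the driver and executors run inside the Jupyter container,
# no cluster required.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Mahla-local") \
    .getOrCreate()

print(spark.range(10).count())
spark.stop()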


Deploy on Docker Desktop:

docker-compose.yml:

version: '3.6'

services:
  spark-master:
    container_name: spark
    image: docker.arvancloud.ir/bitnami/spark:3.5.0
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=root
      - PYSPARK_PYTHON=/opt/bitnami/python/bin/python3
    ports:
      - 127.0.0.1:8081:8080
      - 127.0.0.1:7077:7077
    networks:
      - spark-network
  spark-worker:
    image: docker.arvancloud.ir/bitnami/spark:3.5.0
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=2
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=root
      - PYSPARK_PYTHON=/opt/bitnami/python/bin/python3
    networks:
      - spark-network
  jupyter:
    image: docker.arvancloud.ir/jupyter/all-spark-notebook:latest
    container_name: jupyter
    ports:
      - "8888:8888"
    environment:
      - JUPYTER_ENABLE_LAB=yes
    networks:
      - spark-network
    depends_on:
      - spark-master

networks:
  spark-network:

Run in a terminal:

docker-compose up --scale spark-worker=2 

Copy the CSV file into the Spark worker containers:

docker cp file.csv spark-worker-1:/opt/file
docker cp file.csv spark-worker-2:/opt/file

Open a Jupyter notebook, write some PySpark code, and press Shift + Enter on each cell to execute it:

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("YourAppName") \
    .master("spark://8fa1bd982ade:7077") \
    .getOrCreate()
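
Here 8fa1bd982ade is simply the master container's hostname (its container ID). Since the compose file sets container_name: spark for the master and all services share spark-network, the name spark should resolve as well, which avoids looking up the ID; a hedged alternative:

from pyspark.sql import SparkSession

# Use the master's fixed container name instead of its container-ID hostname.
spark = SparkSession.builder.appName("YourAppName") \
    .master("spark://spark:7077") \
    .getOrCreate()
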
data = spark.read.csv("/opt/file/file.csv", header=True)
data.limit(3).show()
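
Before stopping the session you can inspect the DataFrame a bit further; a small illustrative snippet that does not assume anything about the columns of file.csv:

# Basic inspection that works for any CSV.
data.printSchema()
print(data.count())        # total number of rows
print(len(data.columns))   # number of columns
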
spark.stop() 

Note again: you can work with single-node PySpark installed in Jupyter without a Spark cluster, and once you are sure the code is correct, submit it to the Spark cluster on Docker Desktop via spark-submit or a SparkSession like the one above.


Execute on the PySpark installed in Jupyter, not on the Spark cluster

Copy the CSV file into the Jupyter container:

docker cp file.csv jupyter:/opt/file 
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("YourAppName").getOrCreate()

data = spark.read.csv("/opt/file/file.csv", header=True)
data.limit(3).show()

spark.stop()

You can also practice with single-node PySpark in Jupyter, for example with a sketch like the one below:
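
A small illustrative exercise (hypothetical data, local single-node session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("practice").getOrCreate()

# Build a tiny DataFrame from plain Python data and aggregate it.
rows = [("alice", 3), ("bob", 5), ("alice", 7)]
df = spark.createDataFrame(rows, ["name", "value"])
df.groupBy("name").sum("value").show()

spark.stop()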


Congratulations 🍹
