PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment.
Jupyter Notebook is used to create interactive notebook documents that can contain live code, equations, visualizations, media and other computational outputs. Jupyter Notebook is often used by programmers, data scientists and students to document and demonstrate coding workflows or simply experiment with code.
Kubernetes is an open-source container orchestration system for automating software deployment, scaling, and management, originally designed by Google.
The Kubernetes control-plane and worker node addresses are:
192.168.56.115 192.168.56.116 192.168.56.117
You can install Helm by following the official Helm installation guide.
Install Spark on Kubernetes via the Bitnami Helm chart
The steps:
The chart used here is the Bitnami Spark Helm chart (added from the Bitnami repository in the commands below):
Important: the Spark version deployed by the Helm chart must match the PySpark version installed in the Jupyter image.
Install Spark via the Bitnami Helm chart:
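For example, you can confirm the PySpark side from a notebook cell with a minimal check (the chart's corresponding Spark version is listed as the chart's app version in the Bitnami repository):

import pyspark

# The PySpark version in this notebook should match the Spark version
# deployed by the Bitnami chart; otherwise session creation may fail.
print(pyspark.__version__)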
$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm search repo bitnami
$ helm install kayvan-release bitnami/spark --version 8.7.2
Deploy the Jupyter workload:
jupyter.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupiter-spark
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark
  template:
    metadata:
      labels:
        app: spark
    spec:
      containers:
        - name: jupiter-spark-container
          image: docker.arvancloud.ir/jupyter/all-spark-notebook
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8888
          env:
            - name: JUPYTER_ENABLE_LAB
              value: "yes"
---
apiVersion: v1
kind: Service
metadata:
  name: jupiter-spark-svc
  namespace: default
spec:
  type: NodePort
  selector:
    app: spark
  ports:
    - port: 8888
      targetPort: 8888
      nodePort: 30001
---
apiVersion: v1
kind: Service
metadata:
  name: jupiter-spark-driver-headless
spec:
  clusterIP: None
  selector:
    app: spark
kubectl apply -f jupyter.yaml
The installed pods:
and the Services (a headless Service for the stateful Spark pods):
Note: the Spark master URL is:
spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077
Open a Jupyter notebook, write some Python code using PySpark, and press Shift + Enter on each cell to execute it:
# import os
# os.environ['PYSPARK_SUBMIT_ARGS'] = 'pyspark-shell'
# os.environ['PYSPARK_PYTHON'] = '/opt/bitnami/python/bin/python'
# os.environ['PYSPARK_DRIVER_PYTHON'] = '/opt/bitnami/python/bin/python'

import socket  # needed to resolve the Jupyter pod's IP for spark.driver.host

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077") \
    .appName("Mahla") \
    .config("spark.driver.host", socket.gethostbyname(socket.gethostname())) \
    .getOrCreate()
Note:
socket.gethostbyname(socket.gethostname()) returns the Jupyter pod's IP address; it is passed as spark.driver.host so the Spark executors can reach the driver running inside the notebook pod.
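You can inspect this value yourself in a notebook cell; a minimal sketch (the actual address depends on your cluster's pod network):

import socket

# The Jupyter pod's IP address; executors connect back to the driver on
# this address, which is why it is used as spark.driver.host above.
print(socket.gethostbyname(socket.gethostname()))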
Enjoy sending Python code to the Spark cluster on Kubernetes via Jupyter.
Note: of course, you can also work with single-node PySpark installed on Jupyter, without Kubernetes, and once you are sure the code is correct, send it via spark-submit or, as in the code above, to the Spark cluster on Kubernetes (see the sketch below).
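A minimal sketch of such a local, single-node session (assuming only that PySpark is available in the Jupyter image; the app name is just an example):

from pyspark.sql import SparkSession

# Run Spark locally inside the Jupyter container, using all available cores.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("LocalTest") \
    .getOrCreate()

# ...develop and test your code here, then point .master() at the cluster...

spark.stop()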
Deploy on Docker Desktop:
docker-compose.yml:
version: '3.6'
services:
  spark-master:
    container_name: spark
    image: docker.arvancloud.ir/bitnami/spark:3.5.0
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=root
      - PYSPARK_PYTHON=/opt/bitnami/python/bin/python3
    ports:
      - "127.0.0.1:8081:8080"
      - "127.0.0.1:7077:7077"
    networks:
      - spark-network
  spark-worker:
    image: docker.arvancloud.ir/bitnami/spark:3.5.0
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=2
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=root
      - PYSPARK_PYTHON=/opt/bitnami/python/bin/python3
    networks:
      - spark-network
  jupyter:
    image: docker.arvancloud.ir/jupyter/all-spark-notebook:latest
    container_name: jupyter
    ports:
      - "8888:8888"
    environment:
      - JUPYTER_ENABLE_LAB=yes
    networks:
      - spark-network
    depends_on:
      - spark-master
networks:
  spark-network:
Run in cmd:
docker-compose up --scale spark-worker=2
Copy the CSV file into each Spark worker container:
docker cp file.csv spark-worker-1:/opt/file
docker cp file.csv spark-worker-2:/opt/file
Open a Jupyter notebook, write some Python code using PySpark, and press Shift + Enter on each cell to execute it:
from pyspark.sql import SparkSession

# Create a Spark session.
# "8fa1bd982ade" is the master container's hostname (its container ID);
# since the compose file sets container_name: spark, spark://spark:7077 also works.
spark = SparkSession.builder.appName("YourAppName") \
    .master("spark://8fa1bd982ade:7077") \
    .getOrCreate()
data = spark.read.csv("/opt/file/file.csv", header=True)
data.limit(3).show()
spark.stop()
Note again: you can work with single-node PySpark installed on Jupyter, without the Spark cluster, and once you are sure the code is correct, send it via spark-submit or, as in the code above, to the Spark cluster on Docker Desktop.
Execute on the PySpark installed on Jupyter, not on the Spark cluster
Copy the CSV file into the Jupyter container:
docker cp file.csv jupyter:/opt/file
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("YourAppName").getOrCreate()

data = spark.read.csv("/opt/file/file.csv", header=True)
data.limit(3).show()

spark.stop()
You can also practice with single-node PySpark in Jupyter:
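For instance, a minimal local sketch that builds a small DataFrame in memory (the sample rows and column names here are made up purely for illustration):

from pyspark.sql import SparkSession

# Local, single-node session running inside the Jupyter container.
spark = SparkSession.builder.appName("Practice").getOrCreate()

# Hypothetical sample data, just for experimenting with the DataFrame API.
rows = [("alice", 34), ("bob", 45), ("carol", 29)]
df = spark.createDataFrame(rows, ["name", "age"])

df.filter(df.age > 30).show()

spark.stop()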
Congratulations 🍹