When we work with Spark, we usually want to prototype first and see that everything works as expected before we spin up big machines.
I spent an afternoon googling and starting and stopping the Docker container before I finally had the few lines of configuration right. So I want to share my basic local setup here; maybe it will save someone some time.
When looking for a Docker image with Spark and Jupyter, we quickly find the pyspark-notebook image. In my case I need to access AWS, so I need some additional libraries in the Docker image.
To add them, I created a new Dockerfile based on the pyspark-notebook image. The additional libraries needed are boto3 for AWS and python-dotenv to access environment variables.
I decided to install boto3 with apt-get, as this installs it at the operating-system level. Make sure to add the -y flag, so that whenever apt-get asks something during the install process, it is automatically answered with yes.
python-dotenv is added via a requirements.txt, so it will be installed with pip, the Python package manager.
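The requirements.txt itself only needs python-dotenv, since boto3 is already handled by apt-get. As a minimal sketch (pin a version if you want reproducible builds):

```text
# requirements.txt – installed via pip inside the image
python-dotenv
```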
Normally the notebook requires a token, but when we develop locally we want to open the Jupyter notebook quickly and stay on the same page, without having to look up a new token every time we change something. So we need a custom configuration for that:
{ "NotebookApp": { "allow_root": true, "token": "" } }
In the Dockerfile we copy everything we need into the /home/jovyan/ directory. After some more googling I found out that the user jovyan stands for a Jupyter-like environment, just in case you were also wondering.
The final Dockerfile looks like this:
```dockerfile
FROM jupyter/pyspark-notebook

USER root

# add needed packages
RUN apt-get update && apt-get install python3-boto3 -y

# Install Python requirements
COPY requirements.txt /home/jovyan/
RUN pip install -r /home/jovyan/requirements.txt

COPY jupyter_lab_config.json /home/jovyan/
```
In the docker-compose.yaml we

- need to map the ports,
- map the volumes to save the notebooks locally, otherwise everything would be lost once we shut down the container,
- tell Docker where the .env file is located (a sample is sketched after the compose file), and
- tell Docker to build the Dockerfile in the same folder, instead of using a prebuilt image.
The final docker-compose.yaml looks like this:
version: "3.7" services: # jupyterlab with pyspark pyspark: #image: jupyter/pyspark-notebook build: . env_file: - .env environment: JUPYTER_ENABLE_LAB: "yes" ports: - "8888:8888" volumes: - ./data:/home/jovyan/work # docker run --rm -p 10000:8888 -e JUPYTER_ENABLE_LAB=yes -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook
To start the container, use docker-compose up. If you changed something in the configuration, use docker-compose up --force-recreate --build to make sure the changes are actually built.
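Once the container is running, JupyterLab is reachable at http://localhost:8888 without a token. As a quick smoke test, a first notebook cell could look roughly like the sketch below; the Spark session name and the boto3 call are just examples, nothing here is specific to my project:

```python
import boto3
from dotenv import load_dotenv
from pyspark.sql import SparkSession

# Load variables from a .env file into the environment.
# Inside the container they are already set via env_file,
# so this mainly matters when running the notebook elsewhere.
load_dotenv()

# boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment.
s3 = boto3.client("s3")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])

# Start a local Spark session to check that PySpark itself works.
spark = SparkSession.builder.appName("local-prototype").getOrCreate()
spark.range(5).show()
```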
Have fun.
You can also find the code here.