Need Help
I have been working with Docker, where I have to run a Spark application.
I tried using the Spark images from the Docker repository but ran into issues, so I tried building my own.
It works, but every build downloads Spark again, and I am losing the logs of previously run jobs.
My requirements
Is it possible to have a separate Spark image and supply app.jar to it?
Instead of writing logs inside the Docker container, can I direct them to the host file system?
Dockerfile
FROM alpine
ENV SPARK_VERSION=2.2.0
ENV HADOOP_VERSION=2.7
RUN apk add tar
RUN apk add aria2
RUN mkdir spark
RUN cd spark
WORKDIR /spark
# copy app.properties to docker
COPY app.properties .
# copy /home/exa9/SparkSubmit/App/target/App-0.0.1-SNAPSHOT.jar
ADD target/App-0.0.1-SNAPSHOT.jar app.jar
# Downloading Apache Spark and extracting
RUN aria2c -x16 http://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
RUN apk add --no-cache curl bash openjdk8-jre \
 && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
WORKDIR /spark/spark-2.2.0-bin-hadoop2.7/bin
CMD ./spark-submit --class com.Spark.Test.SparkApp.App --master local[*] /spark/app.jar /spark/app.properties
Top comments (1)
You can mount a host directory as a volume in your container and store the logs there. That way the logs persist outside the container's lifecycle. As for the Spark re-download issue, you have to find another way to include the Spark binary. Since you're writing a Java application, using Maven or Gradle would have made that a lot easier and would have been just a build script away!
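For the log question, a bind mount is the usual answer: pick a directory inside the container that the job writes its logs to, and map a host directory onto it at run time. A minimal sketch, assuming the logs end up in /spark/logs inside the container and the image is tagged my-spark-app (both names are placeholders):

# Everything written to /spark/logs inside the container lands in
# /home/exa9/spark-logs on the host and survives container removal.
docker run --rm -v /home/exa9/spark-logs:/spark/logs my-spark-app

The same flag can also supply the jar itself, e.g. -v $(pwd)/target/App-0.0.1-SNAPSHOT.jar:/spark/app.jar, which is one way to keep a Spark-only image and hand it a freshly built app.jar, as asked in the first requirement.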
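On the re-download: Docker caches each Dockerfile instruction as a layer and rebuilds only from the first instruction whose inputs changed. In the posted Dockerfile the jar is ADDed before Spark is downloaded, so every new jar invalidates the cache and triggers a fresh download. Below is a sketch of one possible reordering that keeps the original steps but places the Spark download before the application files, so its layers stay cached between builds (the archive is also removed after extraction to keep the image smaller):

FROM alpine
ENV SPARK_VERSION=2.2.0
ENV HADOOP_VERSION=2.7
# Spark-related layers come first; they are reused from the cache on every rebuild.
RUN apk add --no-cache tar aria2 curl bash openjdk8-jre
WORKDIR /spark
RUN aria2c -x16 http://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
 && tar -xzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
 && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
# The application files change most often, so they are copied last;
# a new jar only rebuilds from here, Spark is not downloaded again.
COPY app.properties /spark/
COPY target/App-0.0.1-SNAPSHOT.jar /spark/app.jar
WORKDIR /spark/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}/bin
CMD ./spark-submit --class com.Spark.Test.SparkApp.App --master local[*] /spark/app.jar /spark/app.properties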