Examples
The following examples have these spec fields in common:

- version: the current version is "1.0"
- sparkImage: the Docker image that will be used by the job, driver and executor pods. This can be provided by the user.
- mode: only cluster is currently supported
- mainApplicationFile: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
- args: the arguments passed directly to the application. In the examples below this is e.g. the input path for part of the public New York taxi dataset.
- sparkConf: Spark configuration settings that are passed directly to spark-submit and which are best defined explicitly by the user. Since the SparkApplication "knows" that there is an external dependency (the S3 bucket where the data and/or the application is located) and how that dependency should be treated (i.e. what type of credential checks are required, if any), it is better to have these things declared together.
- volumes: any volumes needed by the SparkApplication, in this case an underlying PersistentVolumeClaim.
- driver: driver-specific settings, including any volume mounts.
- executor: executor-specific settings, including any volume mounts.
Job-specific settings are annotated below.
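Putting the common fields together gives a skeleton roughly like the one below. This is a minimal sketch for orientation only: the resource name, bucket, application file and argument are placeholders, and the job-specific settings shown in the examples that follow (dependencies, resources, volumes and volume mounts) are omitted.

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp        # placeholder name
  namespace: default
spec:
  sparkImage:
    productVersion: 3.5.0       # the image used by job, driver and executor pods
  mode: cluster                 # only cluster mode is currently supported
  mainApplicationFile: s3a://my-bucket/my-job.py    # placeholder artifact (Java, Scala or Python)
  args:
    - "--input 's3a://my-bucket/input.csv'"         # placeholder application argument
  sparkConf:                    # settings passed directly to spark-submit
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
  driver:
    config: {}                  # driver-specific settings, e.g. volume mounts
  executor:
    replicas: 3
    config: {}                  # executor-specific settings, e.g. volume mounts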
PySpark: externally located artifact and dataset
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-external-dependencies
  namespace: default
spec:
  sparkImage:
    productVersion: 3.5.0
  mode: cluster
  mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny_tlc_report.py  # (1)
  args:
    - "--input 's3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'"  # (2)
  deps:
    requirements:
      - tabulate==0.8.9  # (3)
  sparkConf:  # (4)
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
    "spark.driver.extraClassPath": "/dependencies/jars/*"
    "spark.executor.extraClassPath": "/dependencies/jars/*"
  volumes:
    - name: job-deps  # (5)
      persistentVolumeClaim:
        claimName: pvc-ksv
  driver:
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies  # (6)
  executor:
    replicas: 3
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies  # (6)
(1) Job Python artifact (external)
(2) Job argument (external)
(3) List of Python job requirements: these will be installed in the pods via pip
(4) Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in S3)
(5) The name of the volume, backed by a PersistentVolumeClaim that must already exist (a sketch of such a claim follows below)
(6) The mount path of the volume: this is referenced in the sparkConf section, where the extra class path is defined for the driver and executors
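The claim named pvc-ksv is not part of the SparkApplication and must exist before the job is submitted. A minimal sketch of such a claim is shown here; the access mode and size are assumptions and should be chosen to fit the dependencies you stage on the volume.

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-ksv
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce        # assumption: single-node access is sufficient
  resources:
    requests:
      storage: 1Gi         # assumption: enough space for the staged jars/dependencies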
PySpark: externally located dataset, artifact available via PVC/volume mount
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-image
  namespace: default
spec:
  image: docker.stackable.tech/stackable/ny-tlc-report:0.1.0  # (1)
  sparkImage:
    productVersion: 3.5.0
  mode: cluster
  mainApplicationFile: local:///stackable/spark/jobs/ny_tlc_report.py  # (2)
  args:
    - "--input 's3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'"  # (3)
  deps:
    requirements:
      - tabulate==0.8.9  # (4)
  sparkConf:  # (5)
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
  job:
    config:
      resources:
        cpu:
          min: "1"
          max: "1"
        memory:
          limit: "1Gi"
  driver:
    config:
      resources:
        cpu:
          min: "1"
          max: "1500m"
        memory:
          limit: "1Gi"
  executor:
    replicas: 3
    config:
      resources:
        cpu:
          min: "1"
          max: "4"
        memory:
          limit: "2Gi"
(1) Job image: this contains the job artifact that will be retrieved from the volume mount backed by the PVC
(2) Job Python artifact (local)
(3) Job argument (external)
(4) List of Python job requirements: these will be installed in the pods via pip
(5) Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store)
JVM (Scala): externally located artifact and dataset
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-pvc
  namespace: default
spec:
  sparkImage:
    productVersion: 3.5.0
  mode: cluster
  mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny-tlc-report-1.0-SNAPSHOT.jar  # (1)
  mainClass: org.example.App  # (2)
  args:
    - "'s3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'"
  sparkConf:  # (3)
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
    "spark.driver.extraClassPath": "/dependencies/jars/*"
    "spark.executor.extraClassPath": "/dependencies/jars/*"
  volumes:
    - name: job-deps  # (4)
      persistentVolumeClaim:
        claimName: pvc-ksv
  driver:
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies  # (5)
  executor:
    replicas: 3
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies  # (5)
(1) Job artifact located on S3
(2) Job main class
(3) Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store, accessed without credentials)
(4) The name of the volume, backed by a PersistentVolumeClaim that must already exist
(5) The mount path of the volume: this is referenced in the sparkConf section, where the extra class path is defined for the driver and executors
JVM (Scala): externally located artifact accessed with credentials
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-s3-private
spec:
  sparkImage:
    productVersion: 3.5.0
  mode: cluster
  mainApplicationFile: s3a://my-bucket/spark-examples.jar  # (1)
  mainClass: org.apache.spark.examples.SparkPi  # (2)
  s3connection:  # (3)
    inline:
      host: test-minio
      port: 9000
      accessStyle: Path
      credentials:  # (4)
        secretClass: s3-credentials-class
  sparkConf:  # (5)
    spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"  # (6)
    spark.driver.extraClassPath: "/dependencies/jars/hadoop-aws-3.2.0.jar:/dependencies/jars/aws-java-sdk-bundle-1.11.375.jar"
    spark.executor.extraClassPath: "/dependencies/jars/hadoop-aws-3.2.0.jar:/dependencies/jars/aws-java-sdk-bundle-1.11.375.jar"
  executor:
    replicas: 3
(1) Job artifact (located in an S3 store)
(2) Artifact class
(3) S3 connection section, specifying the S3 endpoint (in this case, MinIO) and the credentials to use
(4) Credentials referencing a secretClass (not shown in this example; a sketch is given after this list)
(5) Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources…
(6) …in this case, in an S3 store, accessed with the credentials defined in the secret
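The secretClass referenced in (4) is not defined in this example. As a rough sketch (the resource names, search backend and key values are assumptions; check the Stackable secret-operator documentation for the authoritative layout), the SecretClass and a matching Secret holding the S3 access key pair could look like this:

---
apiVersion: secrets.stackable.tech/v1alpha1
kind: SecretClass
metadata:
  name: s3-credentials-class
spec:
  backend:
    k8sSearch:
      searchNamespace:
        pod: {}            # look up the Secret in the namespace of the requesting pod
---
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials     # assumed name
  labels:
    secrets.stackable.tech/class: s3-credentials-class   # ties this Secret to the SecretClass above
stringData:
  accessKey: minio-access-key    # placeholder value
  secretKey: minio-secret-key    # placeholder value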
JVM (Scala): externally located artifact accessed with job arguments provided via configuration map
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-job-arguments  # (1)
data:
  job-args.txt: |  # (2)
    s3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: ny-tlc-report-configmap
  namespace: default
spec:
  sparkImage:
    productVersion: 3.5.0
  mode: cluster
  mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny-tlc-report-1.1.0.jar  # (3)
  mainClass: tech.stackable.demo.spark.NYTLCReport
  volumes:
    - name: cm-job-arguments
      configMap:
        name: cm-job-arguments  # (4)
  args:
    - "--input /arguments/job-args.txt"  # (5)
  sparkConf:
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
  driver:
    config:
      volumeMounts:
        - name: cm-job-arguments  # (6)
          mountPath: /arguments  # (7)
  executor:
    replicas: 3
    config:
      volumeMounts:
        - name: cm-job-arguments  # (6)
          mountPath: /arguments  # (7)
(1) Name of the configuration map
(2) Argument required by the job
(3) Job Scala artifact that requires an input argument
(4) The volume backed by the configuration map
(5) The expected job argument, accessed via the mounted configuration map file
(6) The name of the volume backed by the configuration map that will be mounted to the driver/executor
(7) The mount location of the volume (this will contain a file /arguments/job-args.txt)