Examples

The examples below share the following spec fields:

  • version: the current version is "1.0"

  • sparkImage: the Docker image used by the job, driver and executor pods. This can be provided by the user.

  • mode: only cluster is currently supported

  • mainApplicationFile: the artifact (Java, Scala or Python) that forms the basis of the Spark job.

  • args: the arguments passed directly to the application. In the examples below this is e.g. the input path to part of the public New York City taxi dataset.

  • sparkConf: Spark configuration settings that are passed directly to spark-submit and are best defined explicitly by the user. Since the SparkApplication "knows" that there is an external dependency (the S3 bucket where the data and/or the application is located) and how that dependency should be treated (i.e. what type of credential checks are required, if any), it is better to declare these things together.

  • volumes: refers to any volumes needed by the SparkApplication, in this case an underlying PersistentVolumeClaim.

  • driver: driver-specific settings, including any volume mounts.

  • executor: executor-specific settings, including any volume mounts.

Job-specific settings are annotated below.
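Taken together, the common fields above give a SparkApplication of roughly the following shape. This is only a sketch of the shared structure; all values are placeholders, and the full, working manifests follow in each example.

```yaml
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: my-spark-app            # placeholder name
spec:
  version: "1.0"                # current spec version, see above
  sparkImage:
    productVersion: 3.5.0
  mode: cluster                 # the only supported mode
  mainApplicationFile: s3a://my-bucket/my-job.py   # placeholder artifact
  args: []                      # job-specific arguments
  sparkConf: {}                 # settings passed directly to spark-submit
  volumes: []                   # volumes needed by the job, if any
  driver:
    config: {}                  # driver-specific settings, e.g. volumeMounts
  executor:
    replicas: 3
    config: {}                  # executor-specific settings, e.g. volumeMounts
```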

PySpark: externally located artifact and dataset

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-external-dependencies
  namespace: default
spec:
  sparkImage:
    productVersion: 3.5.0
  mode: cluster
  mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny_tlc_report.py (1)
  args:
    - "--input 's3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'" (2)
  deps:
    requirements:
      - tabulate==0.8.9 (3)
  sparkConf: (4)
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
    "spark.driver.extraClassPath": "/dependencies/jars/*"
    "spark.executor.extraClassPath": "/dependencies/jars/*"
  volumes:
    - name: job-deps (5)
      persistentVolumeClaim:
        claimName: pvc-ksv
  driver:
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (6)
  executor:
    replicas: 3
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (6)
1 Job Python artifact (external)
2 Job argument (external)
3 List of Python job requirements: these are installed in the pods via pip
4 Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in S3)
5 The name of the volume mount, backed by a PersistentVolumeClaim that must already exist
6 The path of the volume mount: this is referenced in the sparkConf section where the extra class path is defined for the driver and executors

PySpark: externally located dataset, artifact available via PVC/volume mount

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-image
  namespace: default
spec:
  image: docker.stackable.tech/stackable/ny-tlc-report:0.1.0 (1)
  sparkImage:
    productVersion: 3.5.0
  mode: cluster
  mainApplicationFile: local:///stackable/spark/jobs/ny_tlc_report.py (2)
  args:
    - "--input 's3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'" (3)
  deps:
    requirements:
      - tabulate==0.8.9 (4)
  sparkConf: (5)
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
  job:
    config:
      resources:
        cpu:
          min: "1"
          max: "1"
        memory:
          limit: "1Gi"
  driver:
    config:
      resources:
        cpu:
          min: "1"
          max: "1500m"
        memory:
          limit: "1Gi"
  executor:
    replicas: 3
    config:
      resources:
        cpu:
          min: "1"
          max: "4"
        memory:
          limit: "2Gi"
1 Job image: this custom image contains the job artifact, which the driver and executors then access via a local path
2 Job Python artifact (local)
3 Job argument (external)
4 List of Python job requirements: these are installed in the pods via pip
5 Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store)

JVM (Scala): externally located artifact and dataset

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-pvc
  namespace: default
spec:
  sparkImage:
    productVersion: 3.5.0
  mode: cluster
  mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny-tlc-report-1.0-SNAPSHOT.jar (1)
  mainClass: org.example.App (2)
  args:
    - "'s3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'"
  sparkConf: (3)
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
    "spark.driver.extraClassPath": "/dependencies/jars/*"
    "spark.executor.extraClassPath": "/dependencies/jars/*"
  volumes:
    - name: job-deps (4)
      persistentVolumeClaim:
        claimName: pvc-ksv
  driver:
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (5)
  executor:
    replicas: 3
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (5)
1 Job artifact located on S3.
2 Job main class
3 Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store, accessed without credentials)
4 The name of the volume mount, backed by a PersistentVolumeClaim that must already exist
5 The path of the volume mount: this is referenced in the sparkConf section where the extra class path is defined for the driver and executors

JVM (Scala): externally located artifact accessed with credentials

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-s3-private
spec:
  sparkImage:
    productVersion: 3.5.0
  mode: cluster
  mainApplicationFile: s3a://my-bucket/spark-examples.jar (1)
  mainClass: org.apache.spark.examples.SparkPi (2)
  s3connection: (3)
    inline:
      host: test-minio
      port: 9000
      accessStyle: Path
      credentials: (4)
        secretClass: s3-credentials-class
  sparkConf: (5)
    spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider" (6)
    spark.driver.extraClassPath: "/dependencies/jars/hadoop-aws-3.2.0.jar:/dependencies/jars/aws-java-sdk-bundle-1.11.375.jar"
    spark.executor.extraClassPath: "/dependencies/jars/hadoop-aws-3.2.0.jar:/dependencies/jars/aws-java-sdk-bundle-1.11.375.jar"
  executor:
    replicas: 3
1 Job artifact (located in an S3 store)
2 Job main class
3 S3 connection section, specifying the S3 endpoint (in this case, MinIO) and how it is accessed
4 Credentials referencing a secretClass (not shown in this example)
5 Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources…​
6 …​in this case, in an S3 store, accessed with the credentials defined in the secret
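The secretClass referenced above is not shown in the example. Purely as an illustration, a SecretClass and a backing Secret for this setup might look like the sketch below; the Secret name is an assumption, and the accessKey/secretKey values are placeholders to be replaced with real credentials.

```yaml
---
apiVersion: secrets.stackable.tech/v1alpha1
kind: SecretClass
metadata:
  name: s3-credentials-class
spec:
  backend:
    k8sSearch:
      searchNamespace:
        pod: {}            # look up the Secret in the pod's namespace
---
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials     # assumed name
  labels:
    secrets.stackable.tech/class: s3-credentials-class
stringData:
  accessKey: YOUR_ACCESS_KEY   # placeholder
  secretKey: YOUR_SECRET_KEY   # placeholder
```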

JVM (Scala): externally located artifact accessed with job arguments provided via configuration map

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-job-arguments (1)
data:
  job-args.txt: |
    s3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv (2)
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: ny-tlc-report-configmap
  namespace: default
spec:
  sparkImage:
    productVersion: 3.5.0
  mode: cluster
  mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny-tlc-report-1.1.0.jar (3)
  mainClass: tech.stackable.demo.spark.NYTLCReport
  volumes:
    - name: cm-job-arguments
      configMap:
        name: cm-job-arguments (4)
  args:
    - "--input /arguments/job-args.txt" (5)
  sparkConf:
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
  driver:
    config:
      volumeMounts:
        - name: cm-job-arguments (6)
          mountPath: /arguments (7)
  executor:
    replicas: 3
    config:
      volumeMounts:
        - name: cm-job-arguments (6)
          mountPath: /arguments (7)
1 Name of the configuration map
2 Argument required by the job
3 Job Scala artifact that requires an input argument
4 The volume backed by the configuration map
5 The expected job argument, accessed via the mounted configuration map file
6 The name of the volume backed by the configuration map that will be mounted to the driver/executor
7 The mount location of the volume (this will contain a file /arguments/job-args.txt)