Examples
The following examples have these spec fields in common:

- version: the current version is "1.0".
- sparkImage: the Docker image that will be used by the job, driver and executor pods. This can be provided by the user.
- mode: only cluster is currently supported.
- mainApplicationFile: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
- args: the arguments passed directly to the application. In the examples below this is e.g. the input path for part of the public New York taxi dataset.
- sparkConf: Spark configuration settings that are passed directly to spark-submit and which are best defined explicitly by the user. Since the SparkApplication "knows" that there is an external dependency (the S3 bucket where the data and/or the application is located) and how that dependency should be treated (i.e. what type of credential checks are required, if any), it is better to have these things declared together.
- volumes: any volumes needed by the SparkApplication, in this case an underlying PersistentVolumeClaim.
- driver: driver-specific settings, including any volume mounts.
- executor: executor-specific settings, including any volume mounts.
A minimal skeleton combining these common fields is sketched below; job-specific settings are annotated in each example.
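The following minimal sketch shows how these common fields fit together in a single manifest. The image, artifact, argument and volume values are copied from the first example below; the resource name example-sparkapp-skeleton is a hypothetical placeholder.

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-skeleton   # hypothetical placeholder name
  namespace: default
spec:
  version: "1.0"                    # spec version
  sparkImage: docker.stackable.tech/stackable/pyspark-k8s:3.3.0-stackable0.0.0-dev
  mode: cluster                     # only cluster mode is supported
  mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny_tlc_report.py
  args:
    - "--input 's3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'"
  sparkConf:
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
  volumes:
    - name: job-deps                # backed by a pre-existing PersistentVolumeClaim
      persistentVolumeClaim:
        claimName: pvc-ksv
  driver:
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies
  executor:
    instances: 3                    # executor-specific settings
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies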
PySpark: externally located artifact and dataset
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-external-dependencies
  namespace: default
spec:
  version: "1.0"
  sparkImage: docker.stackable.tech/stackable/pyspark-k8s:3.3.0-stackable0.0.0-dev
  mode: cluster
  mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny_tlc_report.py (1)
  args:
    - "--input 's3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'" (2)
  deps:
    requirements:
      - tabulate==0.8.9 (3)
  sparkConf: (4)
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
    "spark.driver.extraClassPath": "/dependencies/jars/*"
    "spark.executor.extraClassPath": "/dependencies/jars/*"
  volumes:
    - name: job-deps (5)
      persistentVolumeClaim:
        claimName: pvc-ksv
  driver:
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies (6)
  executor:
    instances: 3
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies (6)
(1) Job Python artifact (external)
(2) Job argument (external)
(3) List of Python job requirements: these will be installed in the pods via pip
(4) Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store)
(5) The name of the volume mount backed by a PersistentVolumeClaim that must be pre-existing
(6) The path on the volume mount: this is referenced in the sparkConf section where the extra class path is defined for the driver and executors
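The volume-mount examples assume that a PersistentVolumeClaim named pvc-ksv already exists in the same namespace. A minimal claim could look roughly like the sketch below; the access mode, storage size and (omitted) storage class are assumptions that should be adapted to the target cluster.

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-ksv         # referenced by claimName in the SparkApplication
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce     # assumption: single-node access is sufficient
  resources:
    requests:
      storage: 1Gi      # assumption: size to the dependencies being stored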
PySpark: externally located dataset, artifact available via PVC/volume mount
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-image
  namespace: default
spec:
  version: "1.0"
  image: docker.stackable.tech/stackable/ny-tlc-report:0.1.0 (1)
  sparkImage: docker.stackable.tech/stackable/pyspark-k8s:3.3.0-stackable0.0.0-dev
  mode: cluster
  mainApplicationFile: local:///stackable/spark/jobs/ny_tlc_report.py (2)
  args:
    - "--input 's3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'" (3)
  deps:
    requirements:
      - tabulate==0.8.9 (4)
  sparkConf: (5)
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
  job:
    resources:
      cpu:
        min: "1"
        max: "1"
      memory:
        limit: "1Gi"
  driver:
    resources:
      cpu:
        min: "1"
        max: "1500m"
      memory:
        limit: "1Gi"
  executor:
    instances: 3
    resources:
      cpu:
        min: "1"
        max: "4"
      memory:
        limit: "2Gi"
(1) Job image: this contains the job artifact that will be retrieved from the volume mount backed by the PVC
(2) Job Python artifact (local)
(3) Job argument (external)
(4) List of Python job requirements: these will be installed in the pods via pip
(5) Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store)
JVM (Scala): externally located artifact and dataset
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-pvc
  namespace: default
spec:
  version: "1.0"
  sparkImage: docker.stackable.tech/stackable/spark-k8s:3.3.0-stackable0.0.0-dev
  mode: cluster
  mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny-tlc-report-1.0-SNAPSHOT.jar (1)
  mainClass: org.example.App (2)
  args:
    - "'s3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'"
  sparkConf: (3)
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
    "spark.driver.extraClassPath": "/dependencies/jars/*"
    "spark.executor.extraClassPath": "/dependencies/jars/*"
  volumes:
    - name: job-deps (4)
      persistentVolumeClaim:
        claimName: pvc-ksv
  driver:
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies (5)
  executor:
    instances: 3
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies (5)
(1) Job artifact located on S3
(2) Job main class
(3) Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store, accessed without credentials)
(4) The name of the volume mount backed by a PersistentVolumeClaim that must be pre-existing
(5) The path on the volume mount: this is referenced in the sparkConf section where the extra class path is defined for the driver and executors
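The class path entries under /dependencies/jars/ assume that the PVC has already been populated with the required S3 connector jars. One possible way to do this is a one-off Kubernetes Job that downloads the jars from Maven Central into the claim, as sketched below; the Job name and the curl image are hypothetical, and the jar versions follow the next example.

---
apiVersion: batch/v1
kind: Job
metadata:
  name: populate-job-deps           # hypothetical name
  namespace: default
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: fetch-jars
          image: curlimages/curl:8.4.0   # assumption: any image with curl will do
          command:
            - sh
            - -c
            - |
              mkdir -p /dependencies/jars && \
              curl -L -o /dependencies/jars/hadoop-aws-3.2.0.jar \
                https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar && \
              curl -L -o /dependencies/jars/aws-java-sdk-bundle-1.11.375.jar \
                https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar
          volumeMounts:
            - name: job-deps
              mountPath: /dependencies
      volumes:
        - name: job-deps
          persistentVolumeClaim:
            claimName: pvc-ksv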
JVM (Scala): externally located artifact accessed with credentials
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-s3-private
spec:
  version: "1.0"
  sparkImage: docker.stackable.tech/stackable/spark-k8s:3.3.0-stackable0.0.0-dev
  mode: cluster
  mainApplicationFile: s3a://my-bucket/spark-examples.jar (1)
  mainClass: org.apache.spark.examples.SparkPi (2)
  s3connection: (3)
    inline:
      host: test-minio
      port: 9000
      accessStyle: Path
      credentials: (4)
        secretClass: s3-credentials-class
  sparkConf: (5)
    spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider" (6)
    spark.driver.extraClassPath: "/dependencies/jars/hadoop-aws-3.2.0.jar:/dependencies/jars/aws-java-sdk-bundle-1.11.375.jar"
    spark.executor.extraClassPath: "/dependencies/jars/hadoop-aws-3.2.0.jar:/dependencies/jars/aws-java-sdk-bundle-1.11.375.jar"
  executor:
    instances: 3
(1) Job artifact (located in an S3 store)
(2) Artifact main class
(3) S3 connection section, specifying the existing secret and the S3 endpoint (in this case, MinIO)
(4) Credentials referencing a secretClass (not shown in this example)
(5) Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources…
(6) …in this case, in an S3 store, accessed with the credentials defined in the secret
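The secretClass referenced above is not shown in this example. With the Stackable secret-operator it could be backed by a plain Kubernetes Secret, roughly as sketched below; the Secret name, the k8sSearch backend and the accessKey/secretKey values are assumptions and must match what the S3 endpoint expects.

---
apiVersion: secrets.stackable.tech/v1alpha1
kind: SecretClass
metadata:
  name: s3-credentials-class       # referenced by the SparkApplication above
spec:
  backend:
    k8sSearch:
      searchNamespace:
        pod: {}                    # look for matching Secrets in the pod's namespace
---
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials             # hypothetical name
  labels:
    secrets.stackable.tech/class: s3-credentials-class
stringData:
  accessKey: minioAccessKey        # assumption: replace with real credentials
  secretKey: minioSecretKey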
JVM (Scala): externally located artifact accessed with job arguments provided via a configuration map
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-job-arguments (1)
data:
  job-args.txt: |
    s3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv (2)
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: ny-tlc-report-configmap
  namespace: default
spec:
  version: "1.0"
  sparkImage: docker.stackable.tech/stackable/spark-k8s:3.3.0-stackable0.0.0-dev
  mode: cluster
  mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny-tlc-report-1.1.0.jar (3)
  mainClass: tech.stackable.demo.spark.NYTLCReport
  volumes:
    - name: cm-job-arguments
      configMap:
        name: cm-job-arguments (4)
  args:
    - "--input /arguments/job-args.txt" (5)
  sparkConf:
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
  driver:
    volumeMounts:
      - name: cm-job-arguments (6)
        mountPath: /arguments (7)
  executor:
    instances: 3
    volumeMounts:
      - name: cm-job-arguments (6)
        mountPath: /arguments (7)
(1) Name of the ConfigMap
(2) Argument required by the job
(3) Job Scala artifact that requires an input argument
(4) The volume backed by the ConfigMap
(5) The expected job argument, accessed via the mounted ConfigMap file
(6) The name of the volume backed by the ConfigMap that will be mounted to the driver/executor
(7) The mount location of the volume (this will contain the file /arguments/job-args.txt)