Stackable Operator for Apache Spark
This operator manages Apache Spark on Kubernetes clusters. Apache Spark is a powerful open-source big data processing framework that allows for efficient and flexible distributed computing. Its in-memory processing and fault-tolerant architecture make it well suited to a variety of use cases, including batch processing, real-time streaming, machine learning, and graph processing.
Getting Started
Follow the Getting started guide to begin using Apache Spark with the Stackable Operator. The guide leads you through installing the Operator and running your first Spark application on Kubernetes.
How the Operator works
The Stackable Operator for Apache Spark reads a SparkApplication custom resource, which you use to define your Spark job or application. The Operator then creates the Kubernetes resources the job needs to run.
Custom resources
The Operator manages two custom resource kinds: the SparkApplication and the SparkHistoryServer.
The SparkApplication resource is the main point of interaction with the Operator. Unlike other Stackable Operator custom resources, the SparkApplication does not have roles. An exhaustive list of options is given on the CRD reference page.
The SparkHistoryServer, in contrast, does have a single role, named node. It is used to deploy a Spark history server, which reads data from an S3 bucket that you configure. Your applications need to write their logs to the same bucket.
Kubernetes resources
For every SparkApplication deployed to the cluster, the Operator creates a Job, a ServiceAccount and a few ConfigMaps.
The Job runs spark-submit in a Pod, which then creates a Spark driver Pod. The driver creates its own executors based on the configuration in the SparkApplication. The Job, driver and executors all use the same image, which is configured in the SparkApplication resource.
The two main ConfigMaps are the <name>-driver-pod-template and <name>-executor-pod-template, which define how the driver and executor Pods should be created.
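As a minimal sketch of this naming scheme, the following Python helper derives the two pod-template ConfigMap names from an application name. The function name and the example application name are hypothetical; only the two suffixes are taken from the text above.

```python
def pod_template_configmap_names(app_name: str) -> dict[str, str]:
    """Return the pod-template ConfigMap names for a SparkApplication.

    Hypothetical helper: only the <name>-driver-pod-template and
    <name>-executor-pod-template patterns come from the documentation.
    """
    return {
        "driver": f"{app_name}-driver-pod-template",
        "executor": f"{app_name}-executor-pod-template",
    }


if __name__ == "__main__":
    # "pyspark-pi" is an example application name, not a required value.
    for role, name in pod_template_configmap_names("pyspark-pi").items():
        print(f"{role}: {name}")
```

Knowing the names in advance can help when inspecting a running application, for example to look up the ConfigMap that shaped a particular driver or executor Pod.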
The Spark history server deploys like other Stackable-supported applications: a StatefulSet is created for every role group. A role group can have multiple replicas (Pods). A ConfigMap supplies the necessary configuration, and there is a Service to connect to.
RBAC
The Spark-Kubernetes RBAC documentation describes what is needed for spark-submit jobs to run successfully: minimally a role/cluster-role to allow the driver pod to create and manage executor pods.
For added security, each spark-submit job launched by the spark-k8s operator is assigned its own ServiceAccount.
When the spark-k8s operator is installed via Helm, a cluster role named spark-k8s-clusterrole is created with pre-defined permissions.
When a new Spark application is submitted, the operator creates a new service account with the same name as the application and binds this account to the cluster role spark-k8s-clusterrole created by Helm.
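The binding described above can be sketched as plain Kubernetes manifests built in Python. This is an illustration under stated assumptions, not the operator's actual output: the function name, the use of a namespaced RoleBinding (rather than a ClusterRoleBinding), the binding's name, and the example namespace are all assumptions. Only the service account name (same as the application) and the cluster role name spark-k8s-clusterrole come from the text.

```python
import json

# Cluster role name as stated in the documentation (created by the Helm chart).
CLUSTER_ROLE = "spark-k8s-clusterrole"


def rbac_manifests(app_name: str, namespace: str) -> list[dict]:
    """Build a ServiceAccount and a binding to the operator's cluster role.

    Hypothetical helper for illustration only.
    """
    service_account = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        # The service account has the same name as the application.
        "metadata": {"name": app_name, "namespace": namespace},
    }
    role_binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        # Assumption: a namespaced RoleBinding; the binding kind is not
        # specified in the text.
        "kind": "RoleBinding",
        # Assumption: the binding reuses the application name.
        "metadata": {"name": app_name, "namespace": namespace},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": CLUSTER_ROLE,
        },
        "subjects": [
            {"kind": "ServiceAccount", "name": app_name, "namespace": namespace}
        ],
    }
    return [service_account, role_binding]


if __name__ == "__main__":
    # "pyspark-pi" and "default" are example values.
    print(json.dumps(rbac_manifests("pyspark-pi", "default"), indent=2))
```

Giving each application its own service account means permissions can be inspected or revoked per application, while the shared cluster role keeps the permission set defined in one place.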
Integrations
You can read and write data from S3 buckets and load custom job dependencies. Spark also integrates with Apache Kafka, which is likewise supported on the Stackable Data Platform. Have a look at the demos below to see it in action.
Demos
The data-lakehouse-iceberg-trino-spark demo connects multiple components and datasets into a data lakehouse. A Spark application with structured streaming is used to stream data from Apache Kafka into the lakehouse.
In the spark-k8s-anomaly-detection-taxi-data demo, Spark is used to read training data from S3 and train an anomaly detection model on the data. The model is then stored in a Trino table.