# ADR022: Spark history server
- Status: accepted
- Deciders:
  - Andrew Kenworthy
  - Sebastian Bernauer
  - Razvan Mihai
- Date: 2022-08-23
## Context and Problem Statement
Monitoring Spark applications usually involves two types of components: a history server and a metrics collection server. In this document we will concern ourselves with the history server. Using metrics for application monitoring requires a completely different set of protocols and technologies and is outside the scope of this document.
The Spark history server is part of the Apache Spark distribution and it provides a web interface where application developers can visualize events that took place during the lifetime of a job as well as the outcome of various job phases.
In this document, we propose a way to add the Spark history server to the Kubernetes resources that are managed by the Spark-on-Kubernetes operator. We also describe how Spark application definitions can leverage the history server and how the resource definitions make use of standard Stackable descriptors for S3 storage.
Even though using a history server is optional when running Spark applications on the Stackable platform, it is highly recommended to make use of it in production environments.
## Decision Drivers
- Ease of use / intuitive / checked
- Support multiple history servers per operator installation
- Integrate with the Stackable platform resource handling for S3 storage
## History Server Definition
A history server is a separate entity, completely decoupled from any `SparkApplication`s. The Spark-on-k8s operator makes use of it only if one is available and the Spark application developer has requested it. See the section below on how `SparkApplication`s can request that their events be logged to an existing history server.
The most important part of the history server definition is the location of the event files. This is specified in the `logFileDirectory` entry of the resource specification, as shown below.
```yaml
...
logFileDirectory:
  s3:
    prefix: eventlogs/
    bucket: # S3BucketDef
      bucketName: spark-eventlogs
      connection:
        reference: eventlogs-s3-connection
...
```
Here the event files are stored in the S3 bucket `spark-eventlogs` with the `eventlogs/` prefix. The complete definition of the specified S3 connection is resolved by the operator from the `eventlogs-s3-connection` resource (not shown here).
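For illustration only, such a referenced connection could be defined as a separate resource along the following lines (a sketch assuming the `S3Connection` resource kind provided by the Stackable commons CRDs; host, port and secret class are made up for this example):

```yaml
---
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Connection
metadata:
  name: eventlogs-s3-connection
  namespace: default
spec:
  host: minio          # assumed in-cluster MinIO service
  port: 9000
  accessStyle: Path
  credentials:
    secretClass: eventlogs-s3-credentials
```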
All Spark history properties can be defined in the `sparkConf` section. The operator will make them available to the server at runtime, but makes no effort to validate them, so extreme care must be taken when using this section. In particular, properties concerning the event file storage can conflict with the definition provided in the `logFileDirectory` section, and it is not recommended to set them here at all.
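For example, upstream history server settings such as the update interval and cleaner retention could be passed through verbatim (a sketch; the keys are standard Spark history server properties, the values here are arbitrary):

```yaml
sparkConf:
  # How often to check the log directory for new or updated event files.
  spark.history.fs.update.interval: "10s"
  # Maximum age of job history files before the cleaner removes them.
  spark.history.fs.cleaner.maxAge: "7d"
  # How many non-compacted rolling event log files to retain per application.
  spark.history.fs.eventLog.rolling.maxFilesToRetain: "2"
```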
Here is a complete example of a history server definition:
```yaml
# Namespaced
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkHistoryServer
metadata:
  name: history
  namespace: default
spec:
  version: 3.3.0-stackable0.1.0
  sparkConf:
    # compaction
    # retention/ttl
    # update interval
    ...
  cleaner: true # optional bool; default=false: sets spark.history.fs.cleaner.enabled=true
  # complex enum with variant s3 (hdfs later on)
  logFileDirectory:
    s3:
      prefix: eventlogs/
      bucket: # S3BucketDef
        inline:
          bucketName: spark-eventlogs
          connection:
            inline:
              host: minio
              port: 9000
              accessStyle: Path
              credentials:
                secretClass: history-server-s3-credentials
```
## History Server Usage
Spark applications don't have to use a history server, but it is highly recommended to do so in production environments.
The `logFileDirectory` section of a `SparkApplication` instructs the operator to configure the application such that it can be monitored using the provided (and existing) history server.
The example below shows the relevant configuration snippet of an application definition:
```yaml
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: spark
  namespace: default
spec:
  version: "1.0"
  sparkImage: docker.stackable.tech/stackable/spark-k8s:3.3.0-stackable0.1.0
  logFileDirectory:
    s3:
      prefix: eventlogs/
      bucket: # S3BucketDef
        inline:
          bucketName: spark-eventlogs
          connection:
            inline:
              host: minio
              port: 9000
              accessStyle: Path
              credentials:
                secretClass: eventlogs-s3-credentials
  s3Connection: # Optional. Used for normal data operations.
    inline:
      host: minio
      port: 9000
      accessStyle: Path
      credentials:
        secretClass: application-s3-credentials
...
```
In the example above, the application will log its events to the bucket specified in `logFileDirectory`. In addition, the application processes data from an S3 bucket configured within the `s3Connection` section of the specification.

The operator will read the `s3Connection` section and set up `fs.s3a.aws.credentials.provider` and related settings (endpoint, access key, secret key, TLS, path-style access: essentially all attributes of `S3Connection`). Afterwards, if the `logFileDirectory` attribute is set, it will set `fs.s3a.bucket.<logFileDirectory-bucket-name (here spark-eventlogs)>.aws.credentials.provider` and related settings to override the endpoint and credentials for the logging bucket, set the `spark.eventLog.enabled` property to `true`, and construct `spark.eventLog.dir` as `s3a://<logFileDirectory-bucket-name (here spark-eventlogs)>/<logFileDirectory-prefix (here eventlogs/)>`.
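To make this concrete, the derived settings for the example above might look roughly as follows (a sketch only; the per-bucket override keys follow the standard Hadoop S3A convention, while the exact property set and the credentials provider class chosen by the operator are assumptions here):

```yaml
sparkConf:
  spark.eventLog.enabled: "true"
  spark.eventLog.dir: "s3a://spark-eventlogs/eventlogs/"
  # Per-bucket S3A overrides for the event log bucket; in Hadoop,
  # fs.s3a.bucket.<bucket>.<option> takes precedence over fs.s3a.<option>.
  spark.hadoop.fs.s3a.bucket.spark-eventlogs.endpoint: "http://minio:9000"
  spark.hadoop.fs.s3a.bucket.spark-eventlogs.path.style.access: "true"
  spark.hadoop.fs.s3a.bucket.spark-eventlogs.aws.credentials.provider: "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
```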
Note that the credentials used by the history server do not have to be shared with `SparkApplication`s.
## Advantages
- Fully flexible solution, which allows the log directory to be on a different S3 endpoint than the data.
- If they are on the same endpoint, a single `S3BucketDef` can be shared between the history server and the `SparkApplication` for ease of use.
- HDFS and/or other distributed filesystems can be added later on without breaking changes.