Stackable Operator for Apache Superset
The Stackable Operator for Apache Superset is an operator that can deploy and manage Apache Superset clusters on Kubernetes. Superset is a data exploration and visualization tool that connects to data sources via SQL. Store your data in Apache Druid or Trino, and manage your Druid and Trino instances with the Stackable Operators for Apache Druid or Trino. This operator helps you manage your Superset instances on Kubernetes efficiently.
Getting started
Get started using Superset with the Stackable Operator by following the Getting started guide. It walks you through installing the Operator alongside a PostgreSQL database, connecting to your Superset instance, and analyzing some preloaded example data.
Resources
The Operator manages three custom resources: SupersetCluster, SupersetDB and DruidConnection. Based on these custom resources, it creates a number of Kubernetes resources.
Custom resources
The SupersetCluster is the main resource for the configuration of the Superset instance. The resource defines only one role, the node. The various configuration options are explained in the Usage guide. It helps you tune your cluster to your needs by configuring resource usage, security, logging and more.
When a SupersetCluster is first deployed, a SupersetDB resource is created. The SupersetDB resource is a wrapper resource for the SQL database that is used by Superset for its metadata. The resource contains some configuration but also keeps track of whether the database has been initialized or not. It is not deleted automatically if a SupersetCluster is deleted, and so can be reused.
A DruidConnection resource links a Superset instance to a Druid instance. It lets you define this connection in the familiar way of deploying a resource, instead of configuring the connection via the Superset UI or API. The operator then configures the connection between Druid and the Superset instance.
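As an illustration of the declarative approach, here is a minimal Python sketch that builds a DruidConnection manifest as a plain dictionary and prints it as JSON. The field layout (`spec.superset` and `spec.druid`, each with `name` and `namespace`) and the resource names used here are assumptions made for illustration; check the CRD installed by your operator version before relying on them.

```python
import json


def druid_connection_manifest(name, superset_name, druid_name, namespace="default"):
    """Build a DruidConnection manifest as a plain dict.

    Assumption: the spec references the Superset and Druid clusters by
    name and namespace. Verify this against the CRD of your operator version.
    """
    return {
        "apiVersion": "superset.stackable.tech/v1alpha1",
        "kind": "DruidConnection",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Hypothetical cluster names; replace with your own resources.
            "superset": {"name": superset_name, "namespace": namespace},
            "druid": {"name": druid_name, "namespace": namespace},
        },
    }


manifest = druid_connection_manifest(
    "superset-druid-connection", superset_name="superset", druid_name="druid"
)
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON manifests, the printed output could be saved to a file and applied like any other resource.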
Kubernetes resources
Based on the custom resources you define, the Operator creates ConfigMaps, StatefulSets and Services.
The diagram above depicts all the Kubernetes resources created by the operator, and how they relate to each other. The Jobs created for the SupersetDB and DruidConnection resources are not shown.
For every role group you define, the Operator creates a StatefulSet with the number of replicas defined in the RoleGroup. Every Pod in the StatefulSet has two containers: the main container running Superset and a sidecar container gathering metrics for Monitoring. The Operator creates a Service for the node role as well as a single Service per role group.
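The relationship described above can be sketched in a few lines of Python. This is only an illustration of the structure, not the operator's implementation: the `<cluster>-<role>-<rolegroup>` naming pattern, the container names and the example role groups are all assumptions.

```python
def expected_resources(cluster, role, role_groups):
    """List the StatefulSets and Services expected for a set of role groups.

    role_groups maps a role group name to its replica count.
    Assumption: objects are named <cluster>-<role>[-<rolegroup>].
    """
    # One Service for the role as a whole.
    resources = [{"kind": "Service", "name": f"{cluster}-{role}"}]
    for group, replicas in role_groups.items():
        name = f"{cluster}-{role}-{group}"
        # One StatefulSet per role group; each Pod runs two containers.
        resources.append(
            {
                "kind": "StatefulSet",
                "name": name,
                "replicas": replicas,
                # Hypothetical container names for the main and sidecar containers.
                "containers": ["superset", "metrics"],
            }
        )
        # One Service per role group.
        resources.append({"kind": "Service", "name": name})
    return resources


resources = expected_resources("superset", "node", {"default": 2, "reporting": 1})
for resource in resources:
    print(resource["kind"], resource["name"])
```

With two role groups, the sketch yields one role-level Service plus one StatefulSet and one Service per role group, five objects in total.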
The Operator creates one ConfigMap per RoleGroup and one for the SupersetDB. Each ConfigMap contains two files, log_config.py and superset_config.py, which contain the logging configuration and the general Superset configuration respectively.
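To give a sense of what a superset_config.py file looks like, here is a minimal hand-written sketch. It is not the file the operator generates. `SECRET_KEY`, `SQLALCHEMY_DATABASE_URI` and `ROW_LIMIT` are standard Superset settings, while the environment variable names, the database host and the credentials are hypothetical placeholders.

```python
import os

# Secret used by Superset to sign session cookies.
# Assumption: it is injected through an environment variable of the same name.
SECRET_KEY = os.environ.get("SECRET_KEY", "change-me")

# SQLAlchemy connection string for the metadata database.
# The host, user, password and database name below are placeholders.
SQLALCHEMY_DATABASE_URI = os.environ.get(
    "SQLALCHEMY_DATABASE_URI",
    "postgresql://superset:superset@postgresql:5432/superset",
)

# Upper bound on the number of rows returned by a chart query.
ROW_LIMIT = int(os.environ.get("ROW_LIMIT", "10000"))
```

Superset reads such a file as a regular Python module, so settings are plain module-level variables and can be computed from the environment.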
Dependencies
Superset requires an SQL database in which to store its metadata, dashboards and users. The Stackable platform does not have its own Operator for an SQL database, but the Getting started guide walks you through installing an example database alongside a Superset instance.
Connecting to data sources
Superset does not store its own data. Instead, it connects to other products where the data is stored. On the Stackable Platform, the two commonly used choices are Apache Druid and Trino. For Druid, you can connect a Druid instance declaratively with a custom resource. For Trino, this is on the roadmap. Have a look at the demos linked below for examples of using Superset with Druid or Trino.
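Superset addresses its data sources through SQLAlchemy connection URIs. The following Python sketch shows the general shape of such URIs for Druid and Trino, which is what you would enter when adding a database in the Superset UI. The host names, ports, catalog and user in the usage lines are hypothetical; substitute the values of your own Druid and Trino services.

```python
from urllib.parse import quote


def druid_uri(host, port):
    """SQLAlchemy URI for Druid's SQL endpoint."""
    return f"druid://{host}:{port}/druid/v2/sql"


def trino_uri(host, port, catalog, user):
    """SQLAlchemy URI for a Trino catalog, connecting as the given user."""
    # Percent-encode the user name so special characters do not break the URI.
    return f"trino://{quote(user, safe='')}@{host}:{port}/{catalog}"


# Hypothetical service names and ports, for illustration only.
print(druid_uri("druid-broker", 8082))
print(trino_uri("trino-coordinator", 8443, "hive", "admin"))
```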
Demos
Many of the Stackable demos use Superset in the stack for data visualization and exploration. The demos come in two main variants.
With Druid
The nifi-kafka-druid-earthquake-data and nifi-kafka-druid-water-level-data demos show Superset connected to Druid, exploring earthquake and water level data respectively.
With Trino
The spark-k8s-anomaly-detection-taxi-data, trino-taxi-data, trino-iceberg and data-lakehouse-iceberg-trino-spark demos all use a Trino instance on top of S3 storage that holds the data to analyze. Superset is connected to Trino to analyze a variety of datasets.