Stackable Operator for Apache Druid
The Stackable Operator for Apache Druid is an operator that can deploy and manage Apache Druid clusters on Kubernetes. Apache Druid is an open-source, distributed data store designed to ingest, store, and query large amounts of data in real time, which makes it well suited to high-volume data processing and analysis. This operator provides several resources and features to manage Druid clusters efficiently.
Getting Started
To get started with the Stackable Operator for Apache Druid, follow the Getting Started guide. It guides you through the installation of the Operator and its dependencies (ZooKeeper, HDFS, an SQL database) and the steps to query your first sample data.
Resources
The Operator is installed along with the DruidCluster CustomResourceDefinition, which supports five roles: Router, Coordinator, Broker, MiddleManager and Historical. These roles correspond to Druid processes.
The Operator watches DruidCluster objects and creates multiple Kubernetes resources for each DruidCluster based on its configuration.
For every RoleGroup a StatefulSet is created. Each StatefulSet can contain multiple replicas (Pods). Each Pod has at least two containers: the main Druid container and a preparation container which just runs once at startup. If Log aggregation is enabled, there is a sidecar container for logging too. For every Role and RoleGroup the Operator creates a Service.
A ConfigMap is created for each RoleGroup containing three files: the jvm.config and runtime.properties files, generated from the DruidCluster configuration (see the Usage guide for more information), plus a log4j2.properties file used for Log aggregation. For the whole DruidCluster a discovery ConfigMap is created which contains information on how to connect to the Druid cluster.
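The resource layout described above can be sketched in a few lines of Python. This is an illustration only, not the Operator's actual logic: the function name expected_resources, the Resource type, and the naming pattern <cluster>-<role>-<rolegroup> are assumptions made for this example. It requires Python 3.9 or newer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    kind: str
    name: str


def expected_resources(cluster: str, role_groups: dict[str, list[str]]) -> list[Resource]:
    """List the Kubernetes resources described above for one DruidCluster.

    `role_groups` maps a role name to the names of its RoleGroups.
    The <cluster>-<role>-<rolegroup> naming is an assumption for illustration.
    """
    # One discovery ConfigMap for the whole cluster.
    resources = [Resource("ConfigMap", cluster)]
    for role, groups in role_groups.items():
        # One Service per Role.
        resources.append(Resource("Service", f"{cluster}-{role}"))
        for group in groups:
            name = f"{cluster}-{role}-{group}"
            # Per RoleGroup: a StatefulSet, a Service and a ConfigMap
            # (jvm.config, runtime.properties, log4j2.properties).
            resources.append(Resource("StatefulSet", name))
            resources.append(Resource("Service", name))
            resources.append(Resource("ConfigMap", name))
    return resources


if __name__ == "__main__":
    # Hypothetical cluster with one broker RoleGroup and two historical RoleGroups.
    layout = {"broker": ["default"], "historical": ["default", "large"]}
    for resource in expected_resources("druid", layout):
        print(resource.kind, resource.name)
```

Each additional RoleGroup adds three objects (a StatefulSet, a Service and a ConfigMap), which is useful to keep in mind when estimating how many objects a larger DruidCluster will create.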
Dependencies and other Operators to connect to
The Druid Operator has the following dependencies:
- A deep storage backend is required to persist data. Use either HDFS with the Stackable Operator for Apache HDFS or S3.
- An SQL database to store metadata.
- Apache ZooKeeper via the Stackable Operator for Apache ZooKeeper. Apache ZooKeeper is used by Druid for internal communication between processes.
- The Stackable Commons Operator provides common CRDs such as S3 resources CRDs.
- The Stackable Secret Operator is required for things like S3 access credentials or LDAP integration.
- The Stackable Listener Operator exposes the pods to the outside network.
Have a look at the getting started guide for an example of a minimal working setup.
The getting started guide sets up a fully working Druid cluster, but the S3 deep storage backend as well as the metadata SQL database are required external components and need to be set up by you as prerequisites for a production setup.
Druid works well with other Stackable supported products, such as Apache Kafka for data ingestion, Trino for data processing or Superset for data visualization. OPA can be connected to create authorization policies. Have a look at the Usage guide for more configuration options and at the demos for complete data pipelines you can install with a single command.
Demos
stackablectl supports installing Demos with a single command. The demos are complete data pipelines which showcase multiple components of the Stackable platform working together and which you can try out interactively. Both demos below include Druid as part of the data pipeline:
Waterlevel Demo
The nifi-kafka-druid-water-level-data demo uses data from PEGELONLINE to visualize water levels in rivers and coastal regions of Germany from historic and real time data.
Earthquake Demo
The nifi-kafka-druid-earthquake-data demo ingests earthquake data into a pipeline similar to the one used in the waterlevel demo.