Usage
Druid requires Apache ZooKeeper and an SQL database to run. If HDFS is used as the backend storage (so-called "deep storage"), the Stackable HDFS operator is required as well.
Setup Prerequisites
SQL Database
Druid requires a MySQL or Postgres database.
For testing purposes, you can spin up a PostgreSQL database with the bitnami PostgreSQL helm chart. Add the bitnami repository:
helm repo add bitnami https://charts.bitnami.com/bitnami
And set up the Postgres database:
helm install druid bitnami/postgresql \
  --version=11 \
  --set auth.username=druid \
  --set auth.password=druid \
  --set auth.database=druid
Creating a Druid Cluster
With the prerequisites fulfilled, the CRD for this operator must be created:
kubectl apply -f /etc/stackable/druid-operator/crd
Then a cluster can be deployed using the example below.
apiVersion: druid.stackable.tech/v1alpha1
kind: DruidCluster
metadata:
  name: simple-druid
spec:
  image:
    productVersion: 24.0.0
    stackableVersion: 0.3.0
  clusterConfig:
    deepStorage:
      hdfs:
        configMapName: simple-hdfs
        directory: /druid
    metadataStorageDatabase:
      dbType: postgresql
      connString: jdbc:postgresql://druid-postgresql/druid
      host: druid-postgresql # this is the name of the Postgres service
      port: 5432
      user: druid
      password: druid
    zookeeperConfigMapName: simple-zk
  brokers:
    roleGroups:
      default:
        replicas: 1
  coordinators:
    roleGroups:
      default:
        replicas: 1
  historicals:
    roleGroups:
      default:
        replicas: 1
  middleManagers:
    roleGroups:
      default:
        replicas: 1
  routers:
    roleGroups:
      default:
        replicas: 1
Please note that you need to specify not only the Druid version you want to roll out, but also a Stackable version, as shown above. The Stackable version is the version of the underlying container image which is used to execute the processes. For a list of available versions please check our image registry. It should generally be safe to simply use the latest image version that is available.
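The zookeeperConfigMapName (simple-zk) and the deep storage configMapName (simple-hdfs) refer to discovery ConfigMaps created by the Stackable ZooKeeper and HDFS operators. As a minimal sketch only (assuming the Stackable ZooKeeper operator is installed and already manages a ZookeeperCluster, here hypothetically named simple-zk-server), a dedicated ZNode producing the simple-zk discovery ConfigMap could look like this:
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperZnode
metadata:
  name: simple-zk # the discovery ConfigMap takes this name
spec:
  clusterRef:
    name: simple-zk-server # name of an existing ZookeeperCluster (assumption)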
The Router hosts the web UI. The operator creates a NodePort service to access the web UI. Connect to the simple-druid-router NodePort service and follow the Druid documentation on how to load and query sample data.
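If you cannot reach the NodePort directly, you can instead forward the router service to your local machine with kubectl. This is only a sketch: 8888 is Druid's default plaintext router port, and the actual service port differs if TLS is enabled (the default for Stackable clusters, see Encryption below), so adjust accordingly.
# Forward the router web UI and API to localhost
kubectl port-forward svc/simple-druid-router 8888
The web UI should then be reachable at http://localhost:8888.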
Using S3
The Stackable Platform uses a common set of resource definitions for S3 across all operators. In general, you can configure an S3 connection or bucket inline, or as a reference to a dedicated object.
In Druid, S3 can be used for two things:
- Ingesting data from a bucket
- Using it as a backend for deep storage
You can specify a connection/bucket for only one of these or for both, but Druid supports only a single S3 connection under the hood, so if two connections are specified, they must be identical. The easiest way to ensure this is to use a dedicated S3 connection resource - not defined inline but as a dedicated object.
TLS for S3 is not yet supported.
S3 for ingestion
To ingest data from S3, you need to specify at least a host to connect to, but further settings are available:
spec:
  clusterConfig:
    ingestion:
      s3connection:
        host: yourhost.com (1)
        port: 80 # optional (2)
        credentials: # optional (3)
          ...
1 | The S3 host, not optional |
2 | Port, optional, defaults to 80 |
3 | Credentials to use. Since these might be bucket-dependent, they can instead be given in the ingestion job. Specifying the credentials here is explained below. |
S3 deep storage
Druid can use S3 as a backend for deep storage:
spec:
  clusterConfig:
    deepStorage:
      s3:
        bucket:
          inline:
            bucketName: my-bucket (1)
            connection:
              inline:
                host: test-minio (2)
                port: 9000 (3)
                credentials: (4)
                  ...
1 | Bucket name. |
2 | Bucket host. |
3 | Optional bucket port. |
4 | Credentials explained below. |
It is also possible to configure the bucket connection details as a separate Kubernetes resource and only refer to that object from the DruidCluster, like this:
spec:
  clusterConfig:
    deepStorage:
      s3:
        bucket:
          reference: my-bucket-resource (1)
1 | Name of the bucket resource with connection details. |
The resource named my-bucket-resource is then defined as shown below:
---
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Bucket
metadata:
  name: my-bucket-resource
spec:
  bucketName: my-bucket-name
  connection:
    inline:
      host: test-minio
      port: 9000
      credentials:
        ... (explained below)
This has the advantage that bucket configuration can be shared across DruidClusters (and other Stackable CRDs) and reduces the cost of updating these details.
S3 Credentials
No matter if a connection is specified inline or as a separate object, the credentials are always specified in the same way. You will need a Secret containing the access key ID and secret access key, a SecretClass, and a reference to this SecretClass wherever you want to specify the credentials.
The Secret:
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials
  labels:
    secrets.stackable.tech/class: s3-credentials-class (1)
stringData:
  accessKey: YOUR_VALID_ACCESS_KEY_ID_HERE
  secretKey: YOUR_SECRET_ACCESS_KEY_THAT_BELONGS_TO_THE_KEY_ID_HERE
1 | This label connects the Secret to the SecretClass . |
The SecretClass:
apiVersion: secrets.stackable.tech/v1alpha1
kind: SecretClass
metadata:
  name: s3-credentials-class
spec:
  backend:
    k8sSearch:
      searchNamespace:
        pod: {}
Referencing it:
...
credentials:
  secretClass: s3-credentials-class
...
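Putting the pieces together, a deep storage definition that combines the inline connection from above with the s3-credentials-class could look like this (host, port and bucket name are the example values used earlier):
spec:
  clusterConfig:
    deepStorage:
      s3:
        bucket:
          inline:
            bucketName: my-bucket
            connection:
              inline:
                host: test-minio
                port: 9000
                credentials:
                  secretClass: s3-credentials-class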
HDFS deep storage
Druid can use HDFS as a backend for deep storage:
spec:
  clusterConfig:
    deepStorage:
      hdfs:
        configMapName: simple-hdfs (1)
        directory: /druid (2)
    ...
1 | Name of the HDFS cluster discovery ConfigMap. Can be supplied manually for a cluster not managed by Stackable. Needs to contain the core-site.xml and hdfs-site.xml. |
2 | The directory in which to store the Druid data. |
Security
The Druid cluster can be secured and protected in multiple ways.
Encryption
TLS encryption is supported for internal cluster communication (e.g. between Broker and Coordinator) as well as for external communication (e.g. between the Browser and the Router Web UI).
spec:
  clusterConfig:
    tls:
      serverAndInternalSecretClass: tls (1)
1 | Name of the SecretClass that is used to encrypt internal and external communication. |
A Stackable Druid cluster is always encrypted by default. To disable this default behavior, set spec.clusterConfig.tls.serverAndInternalSecretClass: null.
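A minimal snippet disabling the default encryption as described above:
spec:
  clusterConfig:
    tls:
      serverAndInternalSecretClass: null # disables TLS for internal and external communication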
Authentication
TLS
Access to the Druid cluster can be limited by configuring client authentication (mutual TLS) for all participants. This means that processes acting as internal clients (e.g. a Broker) or external clients (e.g. a browser) have to authenticate themselves with valid certificates in order to communicate with the Druid cluster.
spec:
  clusterConfig:
    authentication:
      - authenticationClass: druid-tls-auth (1)
1 | Name of the AuthenticationClass that is used to encrypt and authenticate communication. |
The AuthenticationClass may or may not have a SecretClass configured:
---
apiVersion: authentication.stackable.tech/v1alpha1
kind: AuthenticationClass
metadata:
  name: druid-mtls-authentication-class
spec:
  provider:
    # Option 1
    tls:
      clientCertSecretClass: druid-mtls (1)
    # Option 2
    tls: {} (2)
1 | If a client SecretClass is provided in the AuthenticationClass (here druid-mtls ), these certificates will be used for encryption and authentication. |
2 | If no client SecretClass is provided in the AuthenticationClass , the spec.clusterConfig.tls.serverAndInternalSecretClass will be used for encryption and authentication. It cannot be explicitly set to null in this case. |
Authorization with Open Policy Agent (OPA)
Druid can connect to an Open Policy Agent (OPA) instance for authorization policy decisions. You need to run an OPA instance to connect to, for which we refer to the OPA Operator docs. How you can write RegoRules for Druid is explained below.
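For illustration only, and assuming the Stackable OPA operator is installed, a minimal OPA instance named simple-opa might look like the sketch below; the exact resource layout and the versions shown are assumptions, so please refer to the OPA Operator docs:
---
apiVersion: opa.stackable.tech/v1alpha1
kind: OpaCluster
metadata:
  name: simple-opa
spec:
  image:
    productVersion: 0.45.0   # placeholder version
    stackableVersion: 0.3.0  # placeholder version
  servers:
    roleGroups:
      default: {}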
Once you have defined your rules, you need to configure the OPA cluster name and endpoint to use for Druid authorization requests. Add a section to the spec for OPA:
spec:
  clusterConfig:
    authorization:
      opa:
        configMapName: simple-opa (1)
        package: my-druid-rules (2)
1 | The name of your OPA cluster (simple-opa in this case) |
2 | The RegoRule package to use for policy decisions. The package should contain an allow rule. This is optional and will default to the name of the Druid cluster. |
Defining RegoRules
For a general explanation of how rules are written, we refer to the OPA documentation. Inside your rules you have access to input from Druid. Druid provides the following data on which to base your policy decisions:
{
  "user": "someUsername", (1)
  "action": "READ", (2)
  "resource": {
    "type": "DATASOURCE", (3)
    "name": "myTable" (4)
  }
}
1 | The authenticated identity of the user that wants to perform the action |
2 | The action type, can be either READ or WRITE . |
3 | The resource type, one of STATE , CONFIG and DATASOURCE . |
4 | In case of a datasource this is the table name, for STATE this will simply be STATE , the same for CONFIG . |
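As a minimal sketch only (not a production-ready policy), an allow rule that grants a hypothetical admin user full access and every other user read access to datasources could look like this. The package name druid used here is a placeholder and must match the package configured in spec.clusterConfig.authorization.opa.package (or the Druid cluster name if no package is set):
package druid

# Deny everything that is not explicitly allowed.
default allow = false

# The "admin" user (hypothetical) may perform any action on any resource.
allow {
    input.user == "admin"
}

# Every authenticated user may read datasources.
allow {
    input.action == "READ"
    input.resource.type == "DATASOURCE"
}
How the Rego rules are deployed to the OPA cluster is described in the OPA Operator docs.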
For more details consult the Druid Authentication and Authorization Model.
Connecting to Druid from other Services
The operator creates a ConfigMap with the name of the cluster, containing connection information. Following the example above (where the cluster is named simple-druid), a ConfigMap named simple-druid will be created containing three keys:
- DRUID_ROUTER with the format <host>:<port>, which points to the router process's HTTP endpoint. Here you can connect to the web UI, or use REST endpoints such as /druid/v2/sql/ to query data (see the example Pod below). More information in the Druid Docs.
- DRUID_AVATICA_JDBC contains a JDBC connect string which can be used together with the Avatica JDBC Driver to connect to Druid and query data. More information in the Druid Docs.
- DRUID_SQALCHEMY contains a connection string used to connect to Druid with SQLAlchemy, for example in Apache Superset.
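For example, another Pod could consume the router address from this ConfigMap as an environment variable. The Pod name, image and command below are purely illustrative; only the ConfigMap name and key come from the example above:
---
apiVersion: v1
kind: Pod
metadata:
  name: druid-client   # hypothetical client Pod
spec:
  containers:
    - name: client
      image: curlimages/curl   # illustrative image
      env:
        - name: DRUID_ROUTER
          valueFrom:
            configMapKeyRef:
              name: simple-druid   # the discovery ConfigMap created by the operator
              key: DRUID_ROUTER
      # Query the Druid status endpoint via the router.
      command: ["sh", "-c", "curl -s http://$DRUID_ROUTER/status"]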
Monitoring
The managed Druid instances are automatically configured to export Prometheus metrics. See Monitoring for more details.
Configuration & Environment Overrides
The cluster definition also supports overriding configuration properties and environment variables, either per role or per role group, where the more specific override (role group) has precedence over the less specific one (role).
Overriding certain properties which are set by the operator (such as the HTTP port) can interfere with the operator and can lead to problems.
Configuration Properties
For a role or role group, at the same level as config, you can specify configOverrides for the runtime.properties file. For example, if you want to set druid.server.http.numThreads for the router to 100, adapt the routers section of the cluster resource like so:
routers:
  roleGroups:
    default:
      config: {}
      configOverrides:
        runtime.properties:
          druid.server.http.numThreads: "100"
      replicas: 1
Just as for the config, it is possible to specify this at role level as well:
routers:
  configOverrides:
    runtime.properties:
      druid.server.http.numThreads: "100"
  roleGroups:
    default:
      config: {}
      replicas: 1
All override property values must be strings.
For a full list of configuration options we refer to the Druid Configuration Reference.
Environment Variables
In a similar fashion, environment variables can be (over)written. For example per role group:
routers:
  roleGroups:
    default:
      config: {}
      envOverrides:
        MY_ENV_VAR: "MY_VALUE"
      replicas: 1
or per role:
routers:
  envOverrides:
    MY_ENV_VAR: "MY_VALUE"
  roleGroups:
    default:
      config: {}
      replicas: 1
Storage for data volumes
Druid uses S3 or HDFS deep storage, so no extra PersistentVolumeClaims have to be specified.
Resource Requests
Stackable operators handle resource requests in a slightly different manner than Kubernetes. Resource requests are defined on role or role group level. See Roles and role groups for details on these concepts. On a role level this means that e.g. all workers will use the same resource requests and limits. This can be further specified on role group level (which takes priority over the role level) to apply different resources.
This is an example on how to specify CPU and memory resources using the Stackable Custom Resources:
---
apiVersion: example.stackable.tech/v1alpha1
kind: ExampleCluster
metadata:
  name: example
spec:
  workers: # role-level
    config:
      resources:
        cpu:
          min: 300m
          max: 600m
        memory:
          limit: 3Gi
    roleGroups: # role-group-level
      resources-from-role: # role-group 1
        replicas: 1
      resources-from-role-group: # role-group 2
        replicas: 1
        config:
          resources:
            cpu:
              min: 400m
              max: 800m
            memory:
              limit: 4Gi
In this case, the role group resources-from-role will inherit the resources specified on the role level, resulting in a maximum of 3Gi memory and 600m CPU resources.
The role group resources-from-role-group has a maximum of 4Gi memory and 800m CPU resources (which overrides the role CPU resources).
For Java products, the heap memory actually used is lower than the specified memory limit, because other processes running in the Container also require memory. Currently, 80% of the specified memory limit is passed to the JVM.
For memory, only a limit can be specified, which will be set as both memory request and limit in the Container. This is to always guarantee a Container the full amount of memory during Kubernetes scheduling.
If no resources are configured explicitly, the Druid operator uses the following defaults:
brokers:
  roleGroups:
    default:
      config:
        resources:
          cpu:
            min: '200m'
            max: "4"
          memory:
            limit: '2Gi'
The default values are most likely not sufficient to run a proper cluster in production. Please adapt according to your requirements.
For more details regarding Kubernetes CPU limits see: Assign CPU Resources to Containers and Pods.
Historical Resources
In addition to the cpu and memory resources described above, historical Pods also accept a storage resource with the following properties:
- segmentCache - used to set the maximum size allowed for the historical segment cache locations. See the Druid documentation regarding druid.segmentCache.locations. The operator creates an emptyDir and sets the max_size of the volume to the value of the capacity property. In addition, Druid is configured to keep 7% of the volume size free. By default, if no segmentCache is configured, the operator will create an emptyDir with a size of 1G and a freePercentage of 5.
Example historical configuration with storage resources:
historicals:
  roleGroups:
    default:
      config:
        resources:
          storage:
            segmentCache:
              # The amount of free space to subtract from the capacity setting below.
              freePercentage: 7
              emptyDir:
                # The maximum size of the volume used to store the segment cache
                capacity: 2g