ADR019: Trino catalog definitions

Status: accepted
Deciders:
- Felix Hennig
- Malte Sander
- Sebastian Bernauer
- Sönke Liebau
- Teo Klestrup Röijezon
Date: 17.05.2022

Context and Problem Statement

Trino allows user to specify multiple catalogs to connect to a variety of different data-sources. We need to agree on a mechanism to

Specifying Trino catalog definitions (this ADR)
Connect a catalog definition to an Trino cluster (ADR020: Trino catalog usage)

Decision Drivers

Multiple different types of connectors must be supported, e.g. Hive, Iceberg, Oracle and PostgreSQL.
In case of catalogs that use distributed file-systems such as HDFS or S3 the access needs to be configured.

Considered Options

TrinoCatalog CRD with discovery ConfigMaps from same namespace
TrinoCatalog CRD with discovery ConfigMaps from potentially other namespaces

Decision Outcome

Chosen option: "TrinoCatalog CRD with discovery ConfigMaps from same namespace". This option lines up with the way we want to handle discovery ConfigMaps: The operator of the service connecting to (e.g. hdfs) watches for HdfsDirectory objects and provides us with a discovery ConfigMap in the target namespace. We will start only implementing the Hive connector and support more connectors in the future.

Pros and Cons of the Options

A TrinoCatalog has a top-level complex enum to distinguish between the different connector types. This way every connector can define it’s own set of attributes that it supports.

TrinoCatalog CRD with discovery ConfigMaps from same namespace

Here all references to discovery ConfigMaps such as HDFS or Hive only are a string that contains the name of the ConfigMap. The ConfigMap must reside in the same namespace as the TrinoCatalog object

---
# Pseudo code!
TrinoCatalog
metadata:
    name: trino-catalog
    namespace: default
spec:
    hive:
        metastore: # mandatory
            configMap: my-hive-metastore
        s3: # S3ConnectionDef, optional
            inline:
                host: minio
            # OR
            reference: my-minio-connection
        hdfs: # optional
            configMap: my-hdfs # will provide hdfs-site.xml
            impersonation: true # optional, defaults to false
            # there is no kerberos or wireEncryption attribute, as the information about kerberos comes from the discovery configmap
    # OR
    iceberg: {} # Attributes need to be defined later on when we support iceberg
    # OR
    postgresql: {} # Attributes need to be defined later on when we support postgresql
    # OR [...]

Looking at the example of hdfs the hdfs discovery ConfigMap will be created by the hdfs-operator. That can be the case because we are running in the same namespace as hdfs or we place a HdfsDirectory object into the Trino namespace. The hdfs-operator then detects the HdfsDirectory object and places a discovery ConfigMap into the Trino namespace. (This is similar to the way ZooKeeper’s ZNodes currently work)

Good, because it’s simple and we don’t have to worry about cross-namespace access
Bad, because it prohibits the usage of an HDFS in a different namespace than the TrinoCatalog namespace. This can be solved by letting the hdfs-operator put the same discovery configmap into multiple namespaces (including the one with the TrinoCatalog)

TrinoCatalog CRD with discovery ConfigMaps from potentially other namespaces

Here all references to discovery ConfigMaps such as HDFS or Hive are a tuple of the name and the namespace of the ConfigMap. The namespace is optional, if not provided the same namespace from the TrinoCatalog will be used. The ConfigMap can reside in a different namespace as the TrinoCatalog object

---
# Pseudo code!
TrinoCatalog
metadata:
    name: trino-catalog
    namespace: default
spec:
    hive:
        metastore: # mandatory
            configMap:
                name: my-hive-metastore
                namespace: default # optional
        s3: # S3ConnectionDef, optional
            inline:
                host: minio
            # OR
            reference:
                name: my-minio-connection
                namespace: default # optional
        hdfs: # optional
            configMap: # will provide hdfs-site.xml
                name: my-hdfs
                namespace: default # optional
            impersonation: true # optional, defaults to false
            # there is no kerberos or wireEncryption attribute, as the information about kerberos comes from the discovery configmap
    # OR
    iceberg: {} # Attributes need to be defined later on when we support iceberg
    # OR
    postgresql: {} # Attributes need to be defined later on when we support postgresql
    # OR [...]

Good, because it allows easy cross-namespace access
Bad, because it’s more complicated
Bad, because we can’t simply mount the ConfigMap (e.g. with hdfs-site.xml) but instead somehow need to "transfer" it between different namespaces and watch the original ConfigMap.