Stackable Operator for Apache Hive
This is an operator for Kubernetes that can manage Apache Hive metastores. The Apache Hive metastore (HMS) was originally developed as part of Apache Hive. It stores information on the location of tables and partitions in file and blob storage such as Apache HDFS and S3, and is now also used by tools other than Hive to access tables stored in files. This Operator does not support deploying Hive itself; Trino is recommended as an alternative query engine.
Getting started
Follow the Getting started guide, which walks you through installing the Stackable Hive Operator and its dependencies. It then shows how to set up a Hive metastore and connect it to a demo PostgreSQL database and a MinIO instance to store data in.
Afterwards you can consult the Usage guide to learn more about tailoring your Hive metastore configuration to your needs, or have a look at the demos for some example setups with either Trino or Spark.
Operator model
The Operator manages the HiveCluster custom resource. The cluster implements a single metastore role.
For every role group the Operator creates a ConfigMap and StatefulSet which can have multiple replicas (Pods). Every role group is accessible through its own Service, and there is a Service for the whole cluster.
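The relationship between role groups and Kubernetes objects described above can be sketched in a few lines of Python. This is only an illustrative model, not the Operator's implementation: the `RoleGroup` class, the `owned_objects` function and the `<cluster>-<role>-<group>` naming scheme are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class RoleGroup:
    """A role group of the metastore role (hypothetical helper type)."""
    name: str
    replicas: int = 1  # number of Pods in the group's StatefulSet


def owned_objects(cluster, role_groups, role="metastore"):
    """Return (kind, name) pairs for the objects described in the text.

    The naming scheme <cluster>-<role>-<group> is an assumption for
    illustration; check the real object names in your cluster.
    """
    # One Service for the whole cluster.
    objects = [("Service", cluster)]
    for group in role_groups:
        name = f"{cluster}-{role}-{group.name}"
        # Per role group: a ConfigMap, a StatefulSet and its own Service.
        objects.append(("ConfigMap", name))
        objects.append(("StatefulSet", name))
        objects.append(("Service", name))
    return objects


if __name__ == "__main__":
    for kind, name in owned_objects("hive", [RoleGroup("default"), RoleGroup("large", 3)]):
        print(kind, name)
```

With two role groups, the sketch lists one cluster-wide Service plus three objects per role group.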
The Operator creates a service discovery ConfigMap for the Hive metastore instance. The discovery ConfigMap contains information on how to connect to the HMS.
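As a rough illustration of how a client might consume such connection information, the sketch below parses a Thrift URI into host and port. It assumes the discovery ConfigMap exposes the connection string as a `thrift://` URI; the key name `HIVE`, the hostname and the `parse_metastore_uri` helper are hypothetical and not taken from this page.

```python
from urllib.parse import urlparse


def parse_metastore_uri(uri):
    """Split a thrift:// URI into (host, port)."""
    parsed = urlparse(uri)
    if parsed.scheme != "thrift" or parsed.hostname is None:
        raise ValueError(f"not a Thrift metastore URI: {uri!r}")
    # 9083 is the conventional default port of the Hive metastore Thrift service.
    return parsed.hostname, parsed.port or 9083


# Hypothetical ConfigMap data; the key name and hostname are assumptions.
discovery_data = {
    "HIVE": "thrift://simple-hive-metastore.default.svc.cluster.local:9083",
}

host, port = parse_metastore_uri(discovery_data["HIVE"])
```

Clients such as Trino or Spark would normally take the URI as-is in their own configuration; splitting it is mainly useful for connectivity checks.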
Required external component: An SQL database
The Hive metastore requires a database to store metadata. Consult the required external components page for an overview of the supported databases and minimum supported versions.
Demos
Three demos make use of the Hive metastore.
The spark-k8s-anomaly-detection-taxi-data and trino-taxi-data demos use the HMS to store metadata about taxi data. The first demo then analyzes the data using Apache Spark, the second one using Trino.
The data-lakehouse-iceberg-trino-spark demo is the biggest demo available. It uses both Spark and Trino for analysis.
Why is the Hive query engine not supported?
Only the metastore is supported, not Hive itself. There are several reasons why running Hive on Kubernetes may not be an optimal solution. The most obvious one is that Hive requires YARN as an execution framework, and YARN takes on much the same role as Kubernetes, namely assigning resources. For this reason the Stackable Data Platform provides Trino as a query engine instead of Hive. Trino still uses the Hive metastore, hence the inclusion of this operator. Trino should offer all the capabilities Hive offers, plus a lot of additional functionality such as connections to other data sources.
Tables in the HMS can also be accessed from Apache Spark.