Stackable Operator for Apache HDFS

The Stackable Operator for Apache HDFS is used to set up HFDS in high-availability mode. It depends on the Stackable Operator for Apache ZooKeeper to operate a ZooKeeper cluster to coordinate the active and standby NameNodes.

This operator only works with images from the Stackable repository

Roles

Three roles of the HDFS cluster are implemented:

DataNode - responsible for storing the actual data.
JournalNode - responsible for keeping track of HDFS blocks and used to perform failovers in case the active NameNode fails. For details see: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
NameNode - responsible for keeping track of HDFS blocks and providing access to the data.

Kubernetes objects

The operator creates the following K8S objects per role group defined in the custom resource.

Service - ClusterIP used for intra-cluster communication.
ConfigMap - HDFS configuration files like core-site.xml, hdfs-site.xml and log4j.properties are defined here and mounted in the pods.
StatefulSet - where the replica count, volume mounts and more for each role group is defined.

In addition, a NodePort service is created for each pod labeled with hdfs.stackable.tech/pod-service=true that exposes all container ports to the outside world (from the perspective of K8S).

In the custom resource you can specify the number of replicas per role group (NameNode, DataNode or JournalNode). A minimal working configuration requires:

2 NameNodes (HA)
1 JournalNode
1 DataNode (should match at least the dfsReplication factor)

Supported Versions

The Stackable Operator for Apache HDFS currently supports the following versions of HDFS:

3.2.2
3.3.1
3.3.3
3.3.4

Docker image

docker pull docker.stackable.tech/stackable/hadoop:<version>